
How do I extract specific text from a corpus?

Stack Overflow user
Asked 2019-10-31 07:27:12
Answers: 1 · Views: 753 · Followers: 0 · Votes: 1

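I have a corpus of 213 documents of varying length. My aim is to extract from each document the specific passage that discusses "fiscal policy". What complicates this is that the passage I want differs from one document to the next. The only keyword that reliably appears at its start is "fiscal policy" or "fiscal policies", and nothing more.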

Let's take an example:

df <- data.frame(Text = c("Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. This problem is really killing me. As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. Regarding regards fiscal policies, almost all euro area countries have submitted their updated stability programmes. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. Against the background of current good times, it is essential that sound budgetary positions are reached in countries with fiscal imbalances and that a pro-cyclical loosening is avoided in all member countries. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. This problem is really killing me. 
Turning to fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. MORE TEXT"))

cp <- corpus(df, text_field = "Text")

The final aim is to end up with a corpus like this:

df <- data.frame(Text = c("As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future.", "Regarding regards fiscal policies, almost all euro area countries have submitted their updated stability programmes.", "As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. Against the background of current good times, it is essential that sound budgetary positions are reached in countries with fiscal imbalances and that a pro-cyclical loosening is avoided in all member countries.", "Turning to fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes."))

cp <- corpus(df, text_field = "Text")

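Please note that I would be happy even if I only got the part of interest together with the "MORE TEXT" that I do not want; I could simply trim that off afterwards. Still, I cannot get even that far. So far I have tried corpus_segment, without success, and I have had no luck working on the data frame either.

Can anyone help me?

Thanks a lot!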


1 Answer

Stack Overflow user

Accepted answer

Answered 2019-10-31 08:11:16

A base R solution that does not need any corpus functions:

trimws(grep("fiscal polic.*", unlist(strsplit(df$Text, "[.]")), ignore.case = TRUE, value = TRUE), "both")
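
Note that the call above returns only the sentence in which the keyword occurs, whereas the desired output in the question keeps the whole passage that follows it. A minimal sketch of that variant, assuming the passage of interest runs from the first sentence that mentions the keyword to the end of the document (the helper name `extract_from_keyword` and the toy `docs` vector are hypothetical, not part of the original answer):

```r
# Hypothetical helper: keep everything from the first sentence that matches
# `pattern` to the end of the document; returns NA if nothing matches.
extract_from_keyword <- function(text, pattern = "fiscal polic") {
  # Split on whitespace that follows a period, so each sentence keeps its period
  sentences <- trimws(unlist(strsplit(text, "(?<=[.])\\s+", perl = TRUE)))
  hit <- grep(pattern, sentences, ignore.case = TRUE)
  if (length(hit) == 0) return(NA_character_)
  paste(sentences[hit[1]:length(sentences)], collapse = " ")
}

# Toy documents standing in for the real corpus (assumed data)
docs <- c(
  "Intro sentence. As regards fiscal policy, countries submitted programmes. Risks remain. MORE TEXT",
  "Intro sentence. Nothing relevant here."
)

result <- vapply(docs, extract_from_keyword, character(1), USE.NAMES = FALSE)
print(result)
```

Applied to the question's data this would be `vapply(df$Text, extract_from_keyword, character(1), USE.NAMES = FALSE)`. The trailing "MORE TEXT" is kept, which the asker said is acceptable.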

In response to the follow-up question, here is how to find the index and use it to subset the data:

# Return a vector of the sentences that contain the pattern:

trimws(grep("fiscal polic.*", unlist(strsplit(df$Text, "[.]")), ignore.case = TRUE, value = TRUE), "both")

# Store the matched sentences as a vector (same pattern as above):

matched_text <- trimws(grep("fiscal polic.*", unlist(strsplit(df$Text, "[.]")), ignore.case = TRUE, value = TRUE), "both")

# Get the row index (or indices) of the data frame for each matched sentence;
# fixed = TRUE treats the sentence as a literal string rather than a regex:

matched_text_idx <- sapply(matched_text, function(x){which(grepl(x, df$Text, fixed = TRUE))})

# To subset the data frame to only the elements that contain the pattern:

df$Text[grepl("fiscal polic.*", df$Text, ignore.case = TRUE)]
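
An alternative sketch for the same bookkeeping: splitting each document separately keeps track of which row every matched sentence came from, without a second `grepl` pass. The toy `df_demo` data frame and the object names are assumptions made for illustration:

```r
# Toy data standing in for the real data frame (assumed)
df_demo <- data.frame(
  Text = c("Intro. As regards fiscal policy, A. More",
           "No match here. Nothing",
           "Start. Turning to fiscal policies, B. And fiscal policy again"),
  stringsAsFactors = FALSE
)

# One character vector of sentences per document
sentences_by_doc <- strsplit(df_demo$Text, "[.]")

# Matching sentences per document (character(0) where nothing matches)
matches <- lapply(sentences_by_doc, function(s) {
  trimws(grep("fiscal polic", s, ignore.case = TRUE, value = TRUE))
})

# Long-format result: one row per matched sentence, with its source row index
result <- data.frame(
  doc = rep(seq_along(matches), lengths(matches)),
  sentence = unlist(matches),
  stringsAsFactors = FALSE
)
print(result)
```

Documents without a match simply contribute no rows, so `unique(result$doc)` gives the indices of the documents that mention the keyword.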

Data:

    df <- data.frame(Text = c("Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. This problem is really killing me. As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. Regarding regards fiscal policies, almost all euro area countries have submitted their updated stability programmes. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. Against the background of current good times, it is essential that sound budgetary positions are reached in countries with fiscal imbalances and that a pro-cyclical loosening is avoided in all member countries. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. This problem is really killing me. 
Turning to fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. MORE TEXT"), stringsAsFactors = FALSE)
Votes: 4
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/58638592
