
Multiple results for one variable when applying the tm method "stemCompletion"

Stack Overflow user
Asked 2014-10-05 16:23:19
Answers: 1 · Views: 1.4K · Followers: 0 · Votes: 3

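I have a corpus of journal data with 15 observations of three variables (ID, title, abstract). I read the data in from a .csv file (one row per observation). While performing some text mining operations, I ran into trouble with the stemCompletion method. After applying stemCompletion, I noticed that the result for each row of the .csv is returned three times. All other tm methods (e.g. stemDocument) produce only a single result. I would like to know why this happens and how I can fix it.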

I used the following code:

data.corpus <- Corpus(DataframeSource(data))  
data.corpuscopy <- data.corpus
data.corpus <- tm_map(data.corpus, stemDocument)
data.corpus <- tm_map(data.corpus, stemCompletion, dictionary=data.corpuscopy) 

After applying stemDocument, there is a single result:

"> data.corpus[[1]]

physic environ   sourc  innov investig  attribut  innov space
          investig  physic space intersect  innov  innov     relev attribut  physic space   innov        reflect  chang natur  innov  technolog advanc  servic  mean chang  argu   develop  innov space similar embodi  divers set  valu   collabor open  sustain use  literatur review interview  benchmark    examin  relationship  physic environ  innov         literatur review   interview underlin innov   communic  human centr process   result five attribut  innov space  present collabor enabl modifi smart attract   reflect       provid perspect   challeng    support innov creation  develop physic space   add   conceptu develop  innov space  outlin physic space   innov servic"

After applying stemCompletion, the result appears three times:

"$`1`
physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result five attributes innovation space present collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge support innovation creation develop physical space addition conceptual develop innovation space outlines physical space innovation service
physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result five attributes innovation space present collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge support innovation creation develop physical space addition conceptual develop innovation space outlines physical space innovation service
physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result five attributes innovation space present collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge support innovation creation develop physical space addition conceptual develop innovation space outlines physical space innovation service"

Here is a reproducible example.

A .csv file containing three observations of three variables:

ID;Text A;Text B
1;Below is the first title;Innovation and Knowledge Management
2;And now the second Title;Organizational Performance and Learning are very important
3;The third title;Knowledge plays an important rule in organizations

Here is the stemming approach I used:

data = read.csv2("Test.csv")
data[,2]=as.character(data[,2])
data[,3]=as.character(data[,3])

corpus <- Corpus(DataframeSource(data)) 
corpuscopy <- corpus
corpus <- tm_map(corpus, stemDocument)
corpus[[1]]

corpus <- tm_map(corpus, stemCompletion, dictionary=corpuscopy)
inspect(corpus[1:3])

It seems to me that this depends on the number of variables in the .csv, but I don't know why.

1 Answer

Stack Overflow user

Accepted answer

Posted 2014-11-02 05:55:36

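There seems to be something odd about the stemCompletion function. It is not obvious how to use stemCompletion in tm version 0.6. There is a nice workaround here, which I have used for this answer.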

First, create the CSV file you have:

dat <- read.csv2( text = 
                  "ID;Text A;Text B
1;Below is the first title;Innovation and Knowledge Management
2;And now the second Title;Organizational Performance and Learning are very important
3;The third title;Knowledge plays an important rule in organizations")

write.csv2(dat, "Test.csv", row.names = FALSE)

Read it in, convert it to a corpus, and stem the words:

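library(tm)         # Corpus, DataframeSource, tm_map, stemDocument
library(SnowballC)  # stemming backend used by stemDocument

data <- read.csv2("Test.csv")
data[, 2] <- as.character(data[, 2])
data[, 3] <- as.character(data[, 3])

corpus <- Corpus(DataframeSource(data))
corpuscopy <- corpus  # unstemmed copy, used later as the completion dictionary
corpus <- tm_map(corpus, stemDocument)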

Check that it worked:

inspect(corpus)

<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
1
Below is the first titl
Innovat and Knowledg Manag

[[2]]
<<PlainTextDocument (metadata: 7)>>
2
And now the second Titl
Organiz Perform and Learn are veri import

[[3]]
<<PlainTextDocument (metadata: 7)>>
3
The third titl
Knowledg play an import rule in organ

Here is a nice way to get stemCompletion working:

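stemCompletion_mod <- function(x, dict = corpuscopy) {
  # Split the document into single stems, complete each one, then rejoin.
  stems <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, sep = "", collapse = " ")))
}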

Check the output to see whether the stems were completed properly:

lapply(corpus, stemCompletion_mod)

[[1]]
<<PlainTextDocument (metadata: 7)>>
1 Below is the first title Innovation and Knowledge Management

[[2]]
<<PlainTextDocument (metadata: 7)>>
2 And now the second Title Organizational Performance and Learning are NA important

[[3]]
<<PlainTextDocument (metadata: 7)>>
3 The third title Knowledge plays an important rule in organizations

Success!
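
One detail in the output: the stem "veri" comes back as NA rather than "very". A plausible explanation, as far as I understand how the completion works, is that it looks for dictionary words that begin with the stem, and "very" does not begin with "veri". The base-R sketch below illustrates that prefix-matching idea only. It is not tm's actual implementation; `complete_shortest` is a hypothetical helper, and the dictionary is typed in by hand from the second row of the example.

```r
# Hypothetical helper (not part of tm): complete each stem to the shortest
# dictionary word that starts with it, or NA when no word matches.
complete_shortest <- function(stems, dictionary) {
  vapply(stems, function(stem) {
    candidates <- dictionary[startsWith(dictionary, stem)]
    if (length(candidates) == 0) return(NA_character_)
    candidates[which.min(nchar(candidates))]
  }, character(1), USE.NAMES = FALSE)
}

# Words and stems taken from the second row of the example above.
dictionary <- c("Organizational", "Performance", "and", "Learning",
                "are", "very", "important")
stems <- c("Organiz", "Perform", "and", "Learn", "are", "veri", "import")

completed <- complete_shortest(stems, dictionary)
paste(completed, collapse = " ")
```

If the NA is unwanted, one option under the same assumption is to fall back to the original stem wherever the completion is NA, for example `ifelse(is.na(completed), stems, completed)`.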

Votes: 4

Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/26204656
