
Q: Snowball stemmer only stems the last word

Stack Overflow user

Asked 2011-09-01 05:12:29

2 answers · 5.4K views · 0 followers · 7 votes

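I want to use the tm package in R to stem the documents in a corpus of plain text documents. When I apply the SnowballStemmer function to all documents in the corpus, only the last word of each document is stemmed.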

library(tm)
library(Snowball)
library(RWeka)
library(rJava)
path <- c("C:/path/to/directory")
corp <- Corpus(DirSource(path),
               readerControl = list(reader = readPlain, language = "en_US",
                                    load = TRUE))
tm_map(corp,SnowballStemmer) #stemDocument has the same problem

I think this is related to the way the documents are read into the corpus. To illustrate with some simple examples:

> vec<-c("running runner runs","happyness happies")
> stemDocument(vec) 
   [1] "running runner run" "happyness happi" 

> vec2<-c("running","runner","runs","happyness","happies")
> stemDocument(vec2)
   [1] "run"    "runner" "run"    "happy"  "happi"

> corp<-Corpus(VectorSource(vec))
> corp<-tm_map(corp, stemDocument)
> inspect(corp)
   A corpus with 2 text documents

   The metadata consists of 2 tag-value pairs and a data frame
   Available tags are:
     create_date creator 
   Available variables in the data frame are:
     MetaID 

   [[1]]
   run runner run

   [[2]]
   happy happi

> corp2<-Corpus(DirSource(path),readerControl=list(reader=readPlain,language="en_US" ,  load=T))
> corp2<-tm_map(corp2, stemDocument)
> inspect(corp2)
   A corpus with 2 text documents

   The metadata consists of 2 tag-value pairs and a data frame
     Available tags are:
     create_date creator 
   Available variables in the data frame are:
     MetaID 

   $`1.txt`
   running runner runs

   $`2.txt`
   happyness happies

stemming

2 Answers

Stack Overflow user

Answered 2014-08-22 13:42:49

Load the required libraries

library(tm)
library(Snowball)

Create the vector

vec<-c("running runner runs","happyness happies")

Create a corpus from the vector

vec<-Corpus(VectorSource(vec))

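It is very important to check the class of the documents in our corpus and to preserve it, because we need a standard corpus that R functions can understand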

class(vec[[1]])

vec[[1]]
<<PlainTextDocument (metadata: 7)>>
running runner runs

This will probably report a plain text document (PlainTextDocument)

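So now we modify the faulty stemDocument function. First we convert the plain text document to character, then split the text into words, apply stemDocument (which now works fine), and paste the words back together. Most importantly, we convert the output back to a PlainTextDocument as provided by the tm package.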

stemDocumentfix <- function(x)
{
    PlainTextDocument(paste(stemDocument(unlist(strsplit(as.character(x), " "))),collapse=' '))
}

Now we can use the standard tm_map on our corpus

vec1 = tm_map(vec, stemDocumentfix)

The result is

vec1[[1]]
<<PlainTextDocument (metadata: 7)>>
run runner run

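The most important thing to remember is to always preserve the class of the documents in the corpus. I hope this is a simple solution to your problem, using functions from the 2 libraries loaded above.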
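The split, stem, and paste steps can also be checked outside of tm, on a plain character vector. The sketch below is not part of the original answer: it assumes the SnowballC package is installed (used here in place of the Snowball package loaded above), and stem_words_in_string is a hypothetical helper name.

```r
library(SnowballC)  # assumption: SnowballC is installed; it provides wordStem()

# Hypothetical helper: stem every whitespace-separated word of one string
# and return a single string again.
stem_words_in_string <- function(x) {
  words <- unlist(strsplit(x, "[[:space:]]+"))  # one element per word
  words <- words[words != ""]                   # drop empty tokens
  paste(wordStem(words, language = "english"), collapse = " ")
}

vec <- c("running runner runs", "happyness happies")
stemmed <- vapply(vec, stem_words_in_string, character(1), USE.NAMES = FALSE)
print(stemmed)
```

Because the stemmer receives one word per element, every word in each string is stemmed, not only the last one.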

Votes: 4

Stack Overflow user

Answered 2011-09-05 06:19:30

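The problem I see is that wordStem takes a vector of words, but plainTextReader assumes that each word in the documents it reads is on its own line. In other words, this would confuse plainTextReader, because you would end up with 3 "words" in your document: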

From ancient grudge break to new mutiny,
Where civil blood makes civil hands unclean.
From forth the fatal loins of these two foes

Instead, the document should be

From
ancient
grudge
break
to
new
mutiny
where 
civil
...etc...

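Also note that punctuation confuses wordStem too, so you have to strip it out as well.

An alternative that does not modify the actual documents is to define a function that does the splitting and removes non-alphanumeric characters that appear before or after a word. Here is a simple one: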

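wordStem2 <- function(x) {
    # Split the document text into individual words
    mywords <- unlist(strsplit(x, " "))
    # Strip non-word characters from the start and end of each word
    mycleanwords <- gsub("^\\W+|\\W+$", "", mywords, perl=TRUE)
    # Drop tokens that are empty after cleaning
    mycleanwords <- mycleanwords[mycleanwords != ""]
    wordStem(mycleanwords)
}

# mycorpus is the corpus to be stemmed
corpA <- tm_map(mycorpus, wordStem2)
corpB <- Corpus(VectorSource(corpA))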

Now just use corpB as your usual corpus.
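To see what the splitting and cleaning steps hand to the stemmer, they can be run on their own on the first line of the verse quoted earlier. This is only a sketch in base R; the variable names are illustrative and not from the original answer.

```r
line <- "From ancient grudge break to new mutiny,"

# Split on single spaces, giving one element per word.
words <- unlist(strsplit(line, " "))

# Strip non-word characters from the start and end of each token,
# then drop any token that is empty after cleaning.
clean <- gsub("^\\W+|\\W+$", "", words, perl = TRUE)
clean <- clean[clean != ""]

print(clean)
```

Each element is now a single word with the trailing comma removed, which is the form a word-level stemmer expects.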

Votes: 3

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/7263478
