Question: R: Partial matching of dictionary terms using grep and the tm package

Stack Overflow user

Asked 2016-05-06 15:18:20

1 answer, 876 views, 0 followers, 1 vote

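Hi: I have a dictionary of negative terms that was prepared by someone else. I don't know how they stemmed it, but it looks like they did not use the Porter stemmer. The dictionary entries end in a wildcard character (*) that I assume is meant to stand in for stemming. I don't know how to make use of that with grep() or the tm package in R, so I stripped it out, hoping to find a way to get grep-style partial matching instead. The original dictionary looks like this: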

# Load libraries
library(tm)
# Sample dictionary terms for "polarize" and "outlaw"
negative <- c('polariz*', 'outlaw*')
# Strip out the wildcard; fixed = TRUE treats '*' as a literal character rather than a regex quantifier
negative <- gsub('*', '', negative, fixed = TRUE)
# Test corpus
test <- c('polarize', 'polarizing', 'polarized', 'polarizes', 'outlaw', 'outlawed', 'outlaws')
# Here is how R's Porter stemmer stems the text
stemDocument(test)

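So, if I stem my corpus with R's stemmer, terms like "outlaws" will be found in the dictionary, but terms like "polarizing" will not match, because the stemmer reduces them to a different stem than the one stored in the dictionary.

What I would like, then, is some way to have the tm package match only the leading part of each word. Without stemming my documents, I would like it to recognize "outlaw" in "outlawed" and "outlaws", and "polariz" in "polarize", "polarized", and "polarizes". Is this possible?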

# Define corpus
test.corp <- Corpus(VectorSource(test))
# Make the document-term matrix (the tm function is DocumentTermMatrix, with a capital D)
dtm <- DocumentTermMatrix(test.corp, control = list(dictionary = negative))
# Inspect
inspect(dtm)

dictionary

text-mining

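1 Answer

Stack Overflow user

Accepted answer

Answered 2016-05-10 15:45:07

I haven't seen any tm answers, so here is one that uses the quanteda package as an alternative. It lets you use "glob" (https://en.wikipedia.org/wiki/Glob_(programming)) wildcard values in your dictionary entries, which is the default valuetype for quanteda's dictionary functions. (See ?dictionary.) With this approach, you do not need to stem your text.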
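As a side illustration of what a trailing glob wildcard means, the sketch below uses base R's glob2rx() (from the utils package) to turn the question's dictionary entries into anchored regular expressions. This is not how quanteda works internally; it is only meant to make the matching rule concrete. The extra test words "depolarize" and "polar" are made up for this example.

```r
# Illustration only: translate glob patterns into regular expressions with base R.
globs <- c("polariz*", "outlaw*")
rx <- glob2rx(globs)  # a trailing '*' becomes a regex anchored at the start of the word
rx
## [1] "^polariz" "^outlaw"

# The pattern matches words that begin with the prefix, not words that merely contain it
hits <- grepl(rx[1], c("polarize", "polarizing", "depolarize", "polar"))
hits
## [1]  TRUE  TRUE FALSE FALSE
```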

library(quanteda)
packageVersion("quanteda")
## [1] ‘0.9.6.2’

# create a quanteda dictionary, essentially a named list
negative <- dictionary(list(polariz = 'polariz*', outlaw = 'outlaw*'))
negative
## Dictionary object with 2 key entries.
##  - polariz: polariz*
##  - outlaw: outlaw*

test <- c('polarize', 'polarizing', 'polarized', 'polarizes', 'outlaw', 'outlawed', 'outlaws')

dfm(test, dictionary = negative, valuetype = "glob", verbose = FALSE)
## Document-feature matrix of: 7 documents, 2 features.
## 7 x 2 sparse Matrix of class "dfmSparse"
##        features
## docs    polariz outlaw
##   text1       1      0
##   text3       1      0
##   text2       1      0
##   text4       1      0
##   text5       0      1
##   text6       0      1
##   text7       0      1
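If you would rather stay in base R and use grep-style matching, as the question originally asked, the following is a minimal sketch under some assumptions. The helper count_glob_matches() is hypothetical (it is not part of tm or quanteda), the tokenizer is a naive split on non-letter characters that only handles ASCII text, and the sample documents are invented for illustration.

```r
# Hypothetical helper (not from tm or quanteda): for each document, count the
# tokens that match each glob-style dictionary entry.
count_glob_matches <- function(docs, globs) {
  # Naive tokenizer: lower-case, then split on runs of non-letters (ASCII only)
  tokens <- strsplit(tolower(docs), "[^a-z]+")
  counts <- vapply(globs, function(g) {
    rx <- glob2rx(g)  # e.g. "polariz*" becomes "^polariz"
    vapply(tokens, function(tok) sum(grepl(rx, tok)), integer(1))
  }, integer(length(docs)))
  # Force a documents x entries matrix, even when there is only one document
  matrix(counts, nrow = length(docs),
         dimnames = list(paste0("text", seq_along(docs)), globs))
}

# Invented sample documents
docs <- c("The debate was polarizing and polarized voters",
          "Outlaws were outlawed",
          "Nothing relevant here")
res <- count_glob_matches(docs, c("polariz*", "outlaw*"))
res
##       polariz* outlaw*
## text1        2       0
## text2        0       2
## text3        0       0
```

Because the unstemmed tokens are matched directly against the prefix patterns, no stemming step is needed, which is the same idea the quanteda dictionary lookup relies on.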

Votes: 1

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/37075952
