Q: R text mining with quanteda

Stack Overflow user

Asked 2015-06-24 14:37:56

Answers: 1 · Views: 3.5K · Followers: 0 · Votes: 0

I have a data set of Facebook posts (exported via Netvizz) and I am using the quanteda package in R. This is my R code.

# Load the relevant dictionary (relevant for analysis)
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")

# Read the file
# Facebook posts can be exported with FB Netvizz
# https://apps.facebook.com/netvizz
# Load the FB posts as a .csv file extracted from the .zip file
fbpost <- read.csv("D:/FB-com.csv", sep = ";")

# Define the relevant column(s)
# (the data frame read above is named fbpost, so index that object)
fb_test <- as.character(fbpost$comment_message)  # one column with 2700 entries
# Define as corpus
fb_corp <- corpus(fb_test)
class(fb_corp)

# LIWC Application
fb_liwc<-dfm(fb_corp, dictionary=liwcdict)
View(fb_liwc)

Everything works fine until:

> fb_liwc<-dfm(fb_corp, dictionary=liwcdict)
Creating a dfm from a corpus ...
   ... indexing 2,760 documents
   ... tokenizing texts, found 77,923 total tokens
   ... cleaning the tokens, 1584 removed entirely
   ... applying a dictionary consisting of 68 key entries
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(docs = c("text1",  : 
  invalid 'dimnames' given for data frame

How would you interpret the error message? Are there any suggestions for solving this problem?

quanteda

text-mining

text-analysis

1 Answer

Stack Overflow user

Accepted answer

Answered 2015-07-01 07:19:28

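There is a bug in quanteda version 0.7.2 that causes dfm() to fail when a dictionary is used and one of the documents contains no features. Your example fails because, during the cleaning stage, some of the Facebook post "documents" end up with all of their features removed.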
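To make the failure condition concrete, here is a minimal base R sketch of how a short post can lose every token during cleaning. The `clean_tokens()` helper and the sample `posts` are hypothetical, and the cleaning rule (lowercase, split on whitespace, keep letters only) is a simplified stand-in, not quanteda's actual implementation.

```r
# Simplified stand-in for a tokenize-and-clean step (NOT quanteda's actual code):
# lowercase, split on whitespace, strip everything except letters, drop empty tokens.
clean_tokens <- function(text) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  tokens <- gsub("[^a-z]", "", tokens)
  tokens[nzchar(tokens)]
}

# Hypothetical posts; the second and third contain no letters at all.
posts <- c("Great post, thanks!", ":-) !!!", "123 456")

# Number of tokens that survive cleaning, per post.
n_tokens <- vapply(posts, function(p) length(clean_tokens(p)),
                   integer(1), USE.NAMES = FALSE)

# Indices of posts left with no features after cleaning.
which(n_tokens == 0)
```

Posts made up only of emoticons, punctuation, or digits are the kind of "document" that would plausibly end up empty in a Facebook comment export.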

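This bug has not only been fixed in 0.8.0; we have also changed the underlying implementation of dictionaries in dfm(), which makes it considerably faster. (LIWC is still a large and complicated dictionary, and its regular expressions still make it much slower to apply than simply indexing tokens. We will work on optimising this further.)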

devtools::install_github("kbenoit/quanteda")
liwcdict <- dictionary(file = "LIWC2001_English.dic", format = "LIWC")
mydfm <- dfm(inaugTexts, dictionary = liwcdict)
## Creating a dfm from a character vector ...
##    ... indexing 57 documents
##    ... lowercasing
##    ... tokenizing
##    ... shaping tokens into data.table, found 134,024 total tokens
##    ... applying a dictionary consisting of 68 key entries
##    ... summing dictionary-matched features by document
##    ... indexing 68 feature types
##    ... building sparse matrix
##    ... created a 57 x 68 sparse dfm
##    ... complete. Elapsed time: 14.005 seconds.
topfeatures(mydfm, decreasing=FALSE)
## Fillers   Nonfl   Swear      TV  Eating   Sleep   Groom   Death  Sports  Sexual 
##       0       0       0      42      47      49      53      76      81     100

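It also works when a document contains no features after tokenization and cleaning, which is probably what was breaking the older dfm() you were using on your Facebook texts.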

mytexts <- inaugTexts
mytexts[3] <- ""
mydfm <- dfm(mytexts, dictionary = liwcdict, verbose = FALSE)
which(rowSums(mydfm)==0)
## 1797-Adams 
##          3
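If upgrading is not an option, one possible workaround is to drop obviously empty comments before building the corpus. The sketch below uses only base R; the sample `fb_test` vector is a hypothetical stand-in for the one in the question, and the "at least one letter" rule is an assumption that only approximates what the cleaning step removes. It has not been checked against quanteda 0.7.2, so a document that has letters but no dictionary matches might still trigger the error.

```r
# Hypothetical stand-in for fb_test, the character vector of comments in the question.
fb_test <- c("I love this page", "!!!", NA, "", "so sad :(")

# Keep only entries with at least one alphabetic character.
# grepl() returns FALSE for NA, so missing comments are dropped too.
has_letters <- grepl("[[:alpha:]]", fb_test)
fb_test_nonempty <- fb_test[has_letters]

# Number of entries that were dropped.
n_dropped <- sum(!has_letters)
```

The filtered vector can then be passed to corpus() in place of the original one. Keeping `has_letters` around makes it possible to map rows of the resulting dfm back to the original comments.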

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/31029582
