I'm creating a corpus with R's 'tm' package. I've finished most of the preprocessing steps; what remains is removing overly common words (terms that appear in more than 80% of the documents). Can anyone help me with that?
dsc <- Corpus(dd)
dsc <- tm_map(dsc, stripWhitespace)
dsc <- tm_map(dsc, removePunctuation)
dsc <- tm_map(dsc, removeNumbers)
dsc <- tm_map(dsc, removeWords, otherWords1)
dsc <- tm_map(dsc, removeWords, otherWords2)
dsc <- tm_map(dsc, removeWords, otherWords3)
dsc <- tm_map(dsc, removeWords, javaKeywords)
dsc <- tm_map(dsc, removeWords, stopwords("english"))
dsc <- tm_map(dsc, stemDocument)
dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf,
                                              stopwords = FALSE))
dtm <- removeSparseTerms(dtm, 0.99)
# ^- removes overly rare terms (those occurring in roughly 1% of the documents or fewer)

Posted on 2014-09-18 14:10:26
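To see why `removeSparseTerms(dtm, 0.99)` corresponds to a ~1% document-frequency cutoff: a term is kept only when its document frequency strictly exceeds `ndocs * (1 - sparse)`. A quick check of that arithmetic in plain base R (toy corpus size assumed, no tm objects needed):

```r
ndocs  <- 500    # hypothetical number of documents
sparse <- 0.99
# removeSparseTerms keeps a term only if it occurs in strictly more
# than ndocs * (1 - sparse) documents, i.e. in more than 1% of them here
kept <- function(docFreq) docFreq > ndocs * (1 - sparse)
kept(6)   # TRUE:  6 of 500 documents clears the 1% cutoff
kept(5)   # FALSE: exactly 1% does not satisfy the strict inequality
```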
You can create a removeCommonTerms function:
removeCommonTerms <- function(x, pct) {
  stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
            is.numeric(pct), pct > 0, pct < 1)
  # work on a TermDocumentMatrix so that rows correspond to terms
  m <- if (inherits(x, "DocumentTermMatrix")) t(x) else x
  # keep a term only if its document frequency is below pct of all documents
  t <- table(m$i) < m$ncol * pct
  termIndex <- as.numeric(names(t[t]))
  if (inherits(x, "DocumentTermMatrix")) x[, termIndex] else x[termIndex, ]
}

Then, if you want to remove the terms that appear in 80% or more of the documents, you can do:
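The filtering logic of removeCommonTerms boils down to a document-frequency test; the same idea can be sketched on an ordinary matrix (toy data assumed, no tm dependency):

```r
# Toy document-term matrix: 5 documents x 4 terms (term frequencies)
m <- matrix(c(1, 0, 2, 1,
              1, 1, 0, 0,
              2, 0, 0, 1,
              1, 3, 1, 0,
              1, 0, 0, 0),
            nrow = 5, byrow = TRUE,
            dimnames = list(NULL, c("oil", "price", "opec", "crude")))

pct <- 0.8
docFreq <- colSums(m > 0)          # in how many documents each term appears
keep    <- docFreq < nrow(m) * pct # keep terms below the 80% threshold
m[, keep, drop = FALSE]            # "oil" (present in all 5 docs) is dropped
```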
data("crude")
dtm <- DocumentTermMatrix(crude)
dtm
# <<DocumentTermMatrix (documents: 20, terms: 1266)>>
# Non-/sparse entries: 2255/23065
# Sparsity : 91%
# Maximal term length: 17
# Weighting : term frequency (tf)
removeCommonTerms(dtm ,.8)
# <<DocumentTermMatrix (documents: 20, terms: 1259)>>
# Non-/sparse entries: 2129/23051
# Sparsity : 92%
# Maximal term length: 17
# Weighting : term frequency (tf)

Posted on 2015-03-30 07:35:46
If you are going to build a DocumentTermMatrix anyway, another approach is the bounds$global control option. For example:
ndocs <- length(dsc)
# ignore overly sparse terms (appearing in less than 1% of the documents)
minDocFreq <- ndocs * 0.01
# ignore overly common terms (appearing in more than 80% of the documents)
maxDocFreq <- ndocs * 0.8
dtm <- DocumentTermMatrix(dsc,
         control = list(bounds = list(global = c(minDocFreq, maxDocFreq))))

https://stackoverflow.com/questions/25905144
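The two bounds translate percentages into absolute document counts, since bounds$global expects a lower and upper document frequency. For a hypothetical corpus of 200 documents the arithmetic works out like this (plain R, toy numbers):

```r
ndocs <- 200                 # hypothetical number of documents
minDocFreq <- ndocs * 0.01   # term must appear in at least 2 documents
maxDocFreq <- ndocs * 0.8    # ...and in no more than 160 documents
c(minDocFreq, maxDocFreq)    # the pair passed as bounds$global
```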