I have data like this:

dataf <- data.frame(id = c(1,2,3,4), text = c(
  "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s",
  "Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now",
  "There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour",
  "a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum"))

It is possible to preprocess text for analysis using the dfm structure:
myDfm <- myCorpus %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = c(stopwords(source = "smart"), mystopwords)) %>%
  tokens_wordstem() %>%
  dfm(verbose = FALSE) %>%
  dfm_trim(min_docfreq = 3, min_termfreq = 5)

Is there an alternative that works directly on the text column to remove stopwords (source = "smart"), stem the words, and trim with min_docfreq = 3 and min_termfreq = 5, without creating a dfm?
Posted on 2021-04-19 13:08:10
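I'm answering based on the question plus the comments, since it seems you need an object of class dgCMatrix for what you want to do. (That is what textmineR::CreateDtm() returns.) Fortunately, a quanteda dfm is already a special type of dgCMatrix. So it may work as-is, but if needed it is also easy to convert; just use as().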
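As a quick sanity check of the inheritance claim, here is a minimal sketch on a toy two-document input. The texts and the names toy_dfm and toy_mat are made up for illustration and are not part of the original question.

```r
library(quanteda)

# Toy input, made up for illustration only
toy_dfm <- dfm(tokens(c(d1 = "a b c", d2 = "b c d")))

# A dfm is an S4 class that extends dgCMatrix, so is() should report TRUE
methods::is(toy_dfm, "dgCMatrix")

# Explicit coercion to the parent class; the counts are unchanged
toy_mat <- as(toy_dfm, "dgCMatrix")
class(toy_mat)
dim(toy_mat)
```

If a downstream function checks the class strictly rather than through inheritance, passing the explicitly converted object is the safer option.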
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
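data(nih_sample, package = "textmineR")

dfmat <- nih_sample %>%
  corpus(text_field = "ABSTRACT_TEXT", docid_field = "APPLICATION_ID") %>%
  tokens() %>%
  tokens_ngrams(n = 1:2) %>%
  dfm()

dtm2 <- as(dfmat, "dgCMatrix")

Now dtm2 should work just like the dtm from the blog post. (The features/columns are in a different order, but that does not matter for a matrix that will be fed into a topic model.) It is a pretty clean process.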
Feel free to insert additional tokens() options, dfm_trim(), or anything else you need from quanteda here.
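For instance, the preprocessing options from the question could be slotted in roughly as follows. This is only a sketch under some assumptions: it uses data_corpus_inaugural, which ships with quanteda, as a stand-in for the asker's corpus, it leaves out the custom mystopwords vector because that is not shown in the question, and it uses intermediate variables instead of the pipe so that it does not depend on %>% being available.

```r
library(quanteda)

# Stand-in corpus bundled with quanteda (assumption: replaces the asker's data)
toks <- tokens(data_corpus_inaugural,
               remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)

# Drop SMART stopwords, then stem what remains
toks <- tokens_remove(toks, pattern = stopwords(source = "smart"))
toks <- tokens_wordstem(toks)

# Build the dfm and keep only features that occur at least 5 times
# in total and in at least 3 documents
dfmat <- dfm(toks)
dfmat <- dfm_trim(dfmat, min_termfreq = 5, min_docfreq = 3)

# Convert to a plain dgCMatrix for tools that expect one
dtm <- as(dfmat, "dgCMatrix")
dim(dtm)
```

The trimming still goes through a dfm here, since dfm_trim() operates on the document-feature counts; the final object is nevertheless an ordinary sparse matrix.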
https://stackoverflow.com/questions/67160171