I want to find phrases in a text column, so I tried using collocations:
library(quanteda)
dataset1 <- data.frame( anumber = c(1,2,3), text = c("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum", "Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source."))
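cols <- textstat_collocations(dataset1$text, size = 2:3, min_count = 30)
After that, I wanted to compound the collocations and get their frequencies, so I tried the following:
inputforDfm <- tokens_compound(cols)
Error in tokens_compound.default(cols) : tokens_compound() only works on tokens objects.
So it needs tokens? How can I then feed the compounds into a dfm built like this: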
myDfm <- dataset1 %>%
corpus() %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
dfm()
Posted on 2020-08-19 21:41:39
You need to tokenize the text, because tokens_compound() requires a tokens object as its first argument.
library(quanteda)
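## Package version: 2.1.1
Here I changed the call to min_count = 2, because otherwise no collocations are returned in this example: nothing in these texts occurs 30 or more times!
cols <- textstat_collocations(dataset1$text, size = 2:3, min_count = 2)
After compounding, we can now see the compounds among the tokens: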
toks <- tokens(dataset1$text) %>%
tokens_compound(cols)
print(toks)
## Tokens consisting of 3 documents.
## text1 :
## [1] "Lorem_Ipsum_is" "simply" "dummy_text" "of_the"
## [5] "printing" "and" "typesetting" "industry"
## [9] "." "Lorem_Ipsum" "has" "been"
## [ ... and 28 more ]
##
## text2 :
## [1] "It_has" "survived" "not" "only" "five" "centuries"
## [7] "," "but" "also" "the" "leap" "into"
## [ ... and 37 more ]
##
## text3 :
## [1] "Contrary" "to" "popular" "belief"
## [5] "," "Lorem_Ipsum_is" "not" "simply"
## [9] "random" "text" "." "It_has"
## [ ... and 63 more ]
Creating the dfm now proceeds in the usual way, and we can see the compounds by selecting only the features that contain an underscore:
dfm(toks) %>%
dfm_select(pattern = "*_*")
## Document-feature matrix of: 3 documents, 5 features (33.3% sparse).
## features
## docs lorem_ipsum_is dummy_text of_the lorem_ipsum it_has
## text1 1 2 1 1 0
## text2 0 0 0 2 1
## text3 1 0 2 1 1
https://stackoverflow.com/questions/63489070
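To connect this back to the pipeline in the question, here is a minimal sketch that starts from a data frame, tokenizes with the same removal options, compounds the collocations, and only then builds the dfm. The data frame `df` is a small hypothetical stand-in for `dataset1`, and the object names are illustrative. It also assumes that on quanteda 3.0 or later `textstat_collocations()` comes from the separate quanteda.textstats package, which was not needed for version 2.1.1 used above.

```r
library(quanteda)
# Assumption: on quanteda >= 3.0, textstat_collocations() lives in the
# quanteda.textstats package, so load it when it is installed.
if (requireNamespace("quanteda.textstats", quietly = TRUE)) {
  library(quanteda.textstats)
}

# Hypothetical stand-in for dataset1, kept short for illustration
df <- data.frame(
  anumber = c(1, 2),
  text = c("Lorem Ipsum is dummy text. Lorem Ipsum is old.",
           "Lorem Ipsum is not random text."),
  stringsAsFactors = FALSE
)

# 1. Build a corpus from the text column and tokenize it
corp <- corpus(df, text_field = "text")
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE)

# 2. Find collocations on the tokens, then join them into single tokens
cols <- textstat_collocations(toks, size = 2:3, min_count = 2)
toks_comp <- tokens_compound(toks, pattern = cols)

# 3. Build the dfm from the compounded tokens
myDfm <- dfm(toks_comp)

# Keep only the compounded features (those containing an underscore)
dfm_select(myDfm, pattern = "*_*")
```

One thing to watch when punctuation is removed before the collocations are computed: words that were separated by punctuation become adjacent, so a compound can span a former sentence boundary. Passing `padding = TRUE` to `tokens()` is one way to avoid that, if it matters for your data.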