
Collocations and compounding before creating a dfm

Stack Overflow user
Asked 2020-08-19 14:20:40
1 answer, 149 views, 0 followers, 1 vote

I want to find phrases in a text column, so I tried using collocations:

library(quanteda)

dataset1 <- data.frame( anumber = c(1,2,3), text = c("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum", "Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source."))

cols <- textstat_collocations(dataset1$text, size = 2:3, min_count = 30)

After that, to compound these collocations and get their frequencies, I tried the following:

inputforDfm <- tokens_compound(cols)

Error in tokens_compound.default(cols) : tokens_compound() only works on tokens objects.

So does it need a tokens object? And how can I feed the result into the dfm:

myDfm <- dataset1 %>%
  corpus() %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  dfm()

1 Answer

Stack Overflow user

Accepted answer

Posted 2020-08-19 21:41:39

You need to tokenize the text, because tokens_compound() requires a tokens object as its first argument.

library(quanteda)
## Package version: 2.1.1

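Here I changed the threshold to min_count = 2; otherwise no collocations are returned in this example, because no word sequence in the text occurs 30 or more times!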

cols <- textstat_collocations(dataset1$text, size = 2:3, min_count = 2)
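
To see why the threshold matters, it can help to look at the collocation table itself. The following is a minimal sketch and not part of the original answer: the three short strings in `txt` are made-up stand-ins for `dataset1$text`, and the conditional `library(quanteda.textstats)` line is an assumption for readers on quanteda 3.0 or later, where `textstat_collocations()` lives in that separate package.

```r
library(quanteda)
# Assumption: on quanteda >= 3.0, textstat_collocations() is provided by
# quanteda.textstats; on 2.x it is part of quanteda itself.
if (requireNamespace("quanteda.textstats", quietly = TRUE)) {
  library(quanteda.textstats)
}

# Hypothetical miniature corpus standing in for dataset1$text
txt <- c(
  d1 = "Lorem Ipsum is simply dummy text. Lorem Ipsum has been the standard dummy text.",
  d2 = "It has survived five centuries. It has versions of Lorem Ipsum.",
  d3 = "Lorem Ipsum is not simply random text. It has roots in Latin."
)

toks <- tokens(txt, remove_punct = TRUE)
cols <- textstat_collocations(toks, size = 2, min_count = 2)

# The result behaves like a data frame: one row per collocation
cols_df <- as.data.frame(cols)
cols_df[order(-cols_df$count), c("collocation", "count", "lambda", "z")]

# The highest observed count shows which min_count values can return anything
max(cols_df$count)
```

Checking `max(cols_df$count)` on your own data before choosing `min_count` avoids ending up with an empty collocation table.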

After compounding, we can see the compounds among the tokens:

toks <- tokens(dataset1$text) %>%
  tokens_compound(cols)

print(toks)
## Tokens consisting of 3 documents.
## text1 :
##  [1] "Lorem_Ipsum_is" "simply"         "dummy_text"     "of_the"        
##  [5] "printing"       "and"            "typesetting"    "industry"      
##  [9] "."              "Lorem_Ipsum"    "has"            "been"          
## [ ... and 28 more ]
## 
## text2 :
##  [1] "It_has"    "survived"  "not"       "only"      "five"      "centuries"
##  [7] ","         "but"       "also"      "the"       "leap"      "into"     
## [ ... and 37 more ]
## 
## text3 :
##  [1] "Contrary"       "to"             "popular"        "belief"        
##  [5] ","              "Lorem_Ipsum_is" "not"            "simply"        
##  [9] "random"         "text"           "."              "It_has"        
## [ ... and 63 more ]
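
If you already know which phrases you want joined, the same compounding step can be driven by an explicit pattern instead of a collocations object. This is a small sketch under that assumption, not something from the original answer; the sentence and the two phrases are chosen purely for illustration.

```r
library(quanteda)

# Illustrative single-sentence input
toks <- tokens("Lorem Ipsum is simply dummy text.")

# phrase() marks each string as a multi-word sequence to be matched
toks_comp <- tokens_compound(toks, pattern = phrase(c("Lorem Ipsum", "dummy text")))

print(toks_comp)
```

Matched sequences are joined with an underscore by default, which is why a glob pattern containing `_` can later pick them out of a dfm.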

Creating the dfm now proceeds in the usual way, and we can see the compounds by selecting just those features:

dfm(toks) %>%
  dfm_select(pattern = "*_*")
## Document-feature matrix of: 3 documents, 5 features (33.3% sparse).
##        features
## docs    lorem_ipsum_is dummy_text of_the lorem_ipsum it_has
##   text1              1          2      1           1      0
##   text2              0          0      0           2      1
##   text3              1          0      2           1      1
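
Putting the pieces together with the cleaning options from the question, the whole workflow might look like the sketch below. It is one possible arrangement rather than the only correct one: `txt` is a made-up miniature of `dataset1$text`, the object names are illustrative, and the conditional `library(quanteda.textstats)` line is an assumption for quanteda 3.0 or later. Pipes are avoided so the steps read in order on any version.

```r
library(quanteda)
# Assumption: on quanteda >= 3.0, textstat_collocations() is provided by
# quanteda.textstats; on 2.x it is part of quanteda itself.
if (requireNamespace("quanteda.textstats", quietly = TRUE)) {
  library(quanteda.textstats)
}

# Hypothetical miniature corpus standing in for dataset1$text
txt <- c(
  d1 = "Lorem Ipsum is simply dummy text. Lorem Ipsum has been the standard dummy text.",
  d2 = "It has survived five centuries. It has versions of Lorem Ipsum.",
  d3 = "Lorem Ipsum is not simply random text. It has roots in Latin."
)

# 1. Tokenize once, with the cleaning options used in the question
toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE)

# 2. Score collocations on those tokens
cols <- textstat_collocations(toks, size = 2, min_count = 2)

# 3. Join the collocations into single tokens
toks_comp <- tokens_compound(toks, pattern = cols)

# 4. Build the dfm and keep only the compounded features
my_dfm <- dfm(toks_comp)
dfm_select(my_dfm, pattern = "*_*")
```

Tokenizing first and reusing the same tokens object for both the collocation scoring and the compounding keeps the two steps consistent with each other.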
Votes: 2
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/63489070
