
Why does the ngram() function give only distinct bigrams?

Stack Overflow user
Asked on 2015-09-29 17:25:04
Answers: 5 · Views: 496 · Followers: 0 · Votes: 5

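I am writing an R script and using library(ngram).

Suppose I have a string,

good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better

and want to find the bigrams.

The ngram library gives me the following bigrams: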

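"appreci product" "process meat" "food product" "food bought" "qualiti dog" "product look" "look like" "good qualiti" "labrador finicki" "bought sever" "qualiti product" "better labrador" "dog food" "smell better" "vital can" "meat smell" "found good" "sever vital" "stew process" "can dog" "finicki appreci" "product better" "product found" "like stew"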

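Since the sentence contains "dog food" twice, I want that bigram to appear twice. But I only get it once!

Is there an option in the ngram library, or in some other library, that gives me all the bigrams of my sentence in R, duplicates included?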

5 Answers

Stack Overflow user

Accepted answer

Posted on 2015-09-29 17:44:05

You can use the stylo package. It keeps duplicate bigrams:

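library(stylo)

# Stemmed input text from the question
txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
words <- txt.to.words(txt)                     # split the string into word tokens
bigrams <- make.ngrams(words, ngram.size = 2)  # build bigrams; duplicates are kept
print(bigrams)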

Result:

 [1] "good qualiti"     "qualiti dog"      "dog food"         "food bought"      "bought sever"     "sever vital"      "vital can"        "can dog"          "dog food"        
[10] "food product"     "product found"    "found good"       "good qualiti"     "qualiti product"  "product look"     "look like"        "like stew"        "stew process"    
[19] "process meat"     "meat smell"       "smell better"     "better labrador"  "labrador finicki" "finicki appreci"  "appreci product"  "product better"  
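
If installing a package is not an option, the same duplicate-preserving list can be sketched in base R. This is only a minimal illustration, assuming whitespace-separated tokens; `make_bigrams` is a hypothetical helper name, not a function from stylo or ngram.

```r
# Hypothetical helper (not from any package): pair each word with its
# successor, so repeated bigrams such as "dog food" are kept.
make_bigrams <- function(text) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

bigrams <- make_bigrams(txt)
print(bigrams)
sum(bigrams == "dog food")  # number of times "dog food" occurs
```

Because every adjacent word pair is emitted, the result has one entry per occurrence rather than one per distinct bigram.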
Votes: 5

Stack Overflow user

Posted on 2015-09-29 17:56:28

The development version of ngram has a get.phrasetable method:

devtools::install_github("wrathematics/ngram")
library(ngram)

text <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

ng <- ngram(text)
head(get.phrasetable(ng))
#            ngrams freq       prop
# 1    good qualiti    2 0.07692308
# 2        dog food    2 0.07692308
# 3 appreci product    1 0.03846154
# 4    process meat    1 0.03846154
# 5    food product    1 0.03846154
# 6     food bought    1 0.03846154
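
For comparison, a frequency table of the same shape can be approximated in base R. This is a rough sketch, not the ngram package's implementation; the column names simply mirror the output above, and the row order among ties may differ from what get.phrasetable returns.

```r
txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

# Split on single spaces and pair each word with its successor
words <- strsplit(txt, " ", fixed = TRUE)[[1]]
bigrams <- paste(head(words, -1), tail(words, -1))

# Count each distinct bigram, most frequent first
counts <- sort(table(bigrams), decreasing = TRUE)
phrases <- data.frame(
  ngrams = names(counts),
  freq   = as.integer(counts),
  prop   = as.numeric(counts) / length(bigrams),
  stringsAsFactors = FALSE
)
head(phrases)

# Expand the counts back into one entry per occurrence
expanded <- rep(phrases$ngrams, times = phrases$freq)
length(expanded)
```

The last step shows how a table of distinct bigrams plus counts can be turned back into a list that repeats "dog food" and "good qualiti", which is what the question asks for.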

You can also use the print() method and specify output = "full":

print(ng, output = "full")

# NOTE: more output not shown...
better labrador | 1 
finicki {1} | 

dog food | 2 
product {1} | bought {1} 
# NOTE: more output not shown...
Votes: 6

Stack Overflow user

Posted on 2015-09-29 19:47:35

You can use RWeka. In the result you can see that "dog food" and "good qualiti" each appear twice.

txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"


library(RWeka)
RWEKABigramTokenizer <- function(x) {
      NGramTokenizer(x, Weka_control(min = 2, max = 2)) 
}

RWEKABigramTokenizer(txt)

 [1] "good qualiti"     "qualiti dog"      "dog food"         "food bought"      "bought sever"     "sever vital"      "vital can"       
 [8] "can dog"          "dog food"         "food product"     "product found"    "found good"       "good qualiti"     "qualiti product" 
[15] "product look"     "look like"        "like stew"        "stew process"     "process meat"     "meat smell"       "smell better"    
[22] "better labrador"  "labrador finicki" "finicki appreci"  "appreci product"  "product better"  

Or use the tm package together with RWeka.

library(tm)
library(RWeka)
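# VCorpus is used here because, in tm 0.7 and later, Corpus() on a VectorSource
# returns a SimpleCorpus, for which custom tokenizers are ignored
my_corp <- VCorpus(VectorSource(txt))
tdm_RWEKA <- TermDocumentMatrix(my_corp, control = list(tokenize = RWEKABigramTokenizer))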

#show the 2 bigrams
findFreqTerms(tdm_RWEKA, lowfreq = 2)

[1] "dog food"     "good qualiti"

#turn into matrix with frequency counts
tdm_matrix <- as.matrix(tdm_RWEKA)
Votes: 3

Original content provided by Stack Overflow; translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/32850155
