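I am writing an R script and using library(ngram).
Suppose I have a string,
"good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
and want to find its bigrams.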
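The ngram library gives me the following bigrams:
"appreci product", "process meat", "food product", "food bought", "qualiti dog", "product look", "like stew", "good qualiti", "labrador finicki", "qualiti product", "better labrador", "dog food", "smell better", "vital can", "meat smell", "found good", "sever vital", "stew process", "can dog", "finicki appreci", "product better".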
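Because the sentence contains "dog food" twice, I want this bigram twice. But I only get it once!
Is there an option in the ngram library, or in some other library, that gives me all the bigrams of my sentence in R, duplicates included?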
Posted on 2015-09-29 17:44:05
You can use the stylo package. A reproducible example:
library(stylo)
a = "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
b = txt.to.words(a)
c = make.ngrams(b, ngram.size = 2)
print(c)
Result:
[1] "good qualiti" "qualiti dog" "dog food" "food bought" "bought sever" "sever vital" "vital can" "can dog" "dog food"
[10] "food product" "product found" "found good" "good qualiti" "qualiti product" "product look" "look like" "like stew" "stew process"
[19] "process meat" "meat smell" "smell better" "better labrador" "labrador finicki" "finicki appreci" "appreci product" "product better"
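
For readers who would rather avoid an extra dependency, the same sliding-window idea can be sketched in base R. This is only an illustration under the assumption that the text is already lower-cased and whitespace-separated (as the stemmed sample string is); the names txt, words, bigrams and bigram_counts are made up for this sketch and are not part of stylo or ngram.

```r
# Sample sentence from the question (already stemmed and lower-cased).
txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

# Split on runs of whitespace to get the individual words.
words <- strsplit(txt, "\\s+")[[1]]

# Pair every word with its successor:
# head(words, -1) drops the last word, tail(words, -1) drops the first.
bigrams <- paste(head(words, -1), tail(words, -1))

# Duplicates are kept, so "dog food" shows up once per occurrence.
length(bigrams)
sum(bigrams == "dog food")

# Tabulate when per-bigram counts are wanted, most frequent first.
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
head(bigram_counts, 2)
```

Keeping the raw vector and the tabulated counts separate covers both needs: the vector retains every occurrence in sentence order, while the table gives one row per distinct bigram.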
Posted on 2015-09-29 17:56:28
The development version of ngram has a get.phrasetable method:
devtools::install_github("wrathematics/ngram")
library(ngram)
text <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
ng <- ngram(text)
head(get.phrasetable(ng))
# ngrams freq prop
# 1 good qualiti 2 0.07692308
# 2 dog food 2 0.07692308
# 3 appreci product 1 0.03846154
# 4 process meat 1 0.03846154
# 5 food product 1 0.03846154
# 6 food bought 1 0.03846154
Alternatively, you can use the print() method and specify output = "full". That is:
print(ng, output = "full")
# NOTE: more output not shown...
better labrador | 1
finicki {1} |
dog food | 2
product {1} | bought {1}
# NOTE: more output not shown...
Posted on 2015-09-29 19:47:35
You can use RWeka. In the result you can see that "dog food" and "good qualiti" each appear twice.
txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
library(RWeka)
RWEKABigramTokenizer <- function(x) {
NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
RWEKABigramTokenizer(txt)
[1] "good qualiti" "qualiti dog" "dog food" "food bought" "bought sever" "sever vital" "vital can"
[8] "can dog" "dog food" "food product" "product found" "found good" "good qualiti" "qualiti product"
[15] "product look" "look like" "like stew" "stew process" "process meat" "meat smell" "smell better"
[22] "better labrador" "labrador finicki" "finicki appreci" "appreci product" "product better"
Or use the tm package in combination with RWeka.
library(tm)
library(RWeka)
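# VCorpus rather than Corpus: recent tm versions ignore custom tokenizers on the SimpleCorpus that Corpus() creates
my_corp <- VCorpus(VectorSource(txt))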
tdm_RWEKA <- TermDocumentMatrix(my_corp, control=list(tokenize = RWEKABigramTokenizer))
#show the 2 bigrams
findFreqTerms(tdm_RWEKA, lowfreq = 2)
[1] "dog food" "good qualiti"
#turn into matrix with frequency counts
tdm_matrix <- as.matrix(tdm_RWEKA)

https://stackoverflow.com/questions/32850155
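
Once bigram counts are available as a terms-by-documents matrix, they can be ranked or expanded back into one entry per occurrence. The sketch below uses a small hand-built matrix (toy_matrix, a hypothetical stand-in with the same layout as a term-document matrix converted by as.matrix()), so it runs without tm or RWeka; the counts in it are illustrative only.

```r
# Hypothetical stand-in for a term-document matrix:
# one row per bigram, one column per document.
toy_matrix <- matrix(
  c(2, 1, 2),
  ncol = 1,
  dimnames = list(Terms = c("dog food", "food bought", "good qualiti"),
                  Docs = "1")
)

# Total count of each bigram across all documents, most frequent first.
freq <- sort(rowSums(toy_matrix), decreasing = TRUE)

# Expand the counts into a vector in which each bigram
# is repeated as many times as it was counted.
repeated <- rep(names(freq), times = freq)
```

Note that the expanded vector follows frequency order, not the order of the original sentence; the position information is lost once the text has been reduced to counts.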