Using the text2vec package, I created a vocabulary.
vocab = create_vocabulary(it_0, ngram = c(2L, 2L))
vocab looks like this:
> vocab
Number of docs: 120
0 stopwords: ...
ngram_min = 2; ngram_max = 2
Vocabulary:
terms terms_counts doc_counts
1: knight_severely 1 1
2: movie_expect 1 1
3: recommend_watching 1 1
4: nuke_entire 1 1
5: sense_keeping 1 1
---
14467: stand_idly 1 1
14468: officer_loyalty 1 1
14469: willingness_die 1 1
14470: fight_bane 3 3
14471: bane_beginning 1 1
How can I check the range of the terms_counts column? I need this because it will help me with pruning, which is my next step.
pruned_vocab = prune_vocabulary(vocab, term_count_min = <BLANK>)
The code below is reproducible:
library(text2vec)
text <- c(" huge fan superhero movies expectations batman begins viewing christopher
nolan production pleasantly shocked huge expectations dark knight christopher
nolan blew expectations dust happen film dark knight rises simply big expectations
blown production true cinematic experience behold movie exceeded expectations terms
action entertainment",
"christopher nolan outdone morning tired awake set film films genuine emotional
eartbeat felt flaw nolan films vision emotion hollow bought felt hero villain
alike christian bale typically brilliant batman felt bruce wayne heavily embraced
final installment bale added emotional depth character plot point astray dark knight")
it_0 = itoken(text,
              tokenizer = word_tokenizer,
              progressbar = TRUE)
vocab = create_vocabulary(it_0, ngram = c(2L, 2L))
vocab
Posted on 2016-11-26 07:17:00
Try range(vocab$vocab$terms_counts)
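As a rough sketch of how that range can guide the choice of term_count_min: the snippet below uses a small hand-built stand-in for vocab$vocab, so it runs without text2vec. The column names and rows are copied from the printout in the question; the object name vocab_tbl and the threshold of 2 are illustrative assumptions only.

```r
# Stand-in for vocab$vocab, using rows from the printout in the question.
# `vocab_tbl` is a hypothetical name; with the real object use vocab$vocab.
vocab_tbl <- data.frame(
  terms        = c("knight_severely", "movie_expect", "stand_idly",
                   "fight_bane", "bane_beginning"),
  terms_counts = c(1L, 1L, 1L, 3L, 1L),
  doc_counts   = c(1L, 1L, 1L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Smallest and largest term count
count_range <- range(vocab_tbl$terms_counts)
print(count_range)

# How many terms occur once, twice, ... (helps to pick a cut-off)
count_freq <- table(vocab_tbl$terms_counts)
print(count_freq)

# Illustrative threshold (an assumption, not a recommendation)
min_count <- 2L
kept_terms <- vocab_tbl$terms[vocab_tbl$terms_counts >= min_count]
print(kept_terms)

# With text2vec itself the corresponding step would be:
# pruned_vocab = prune_vocabulary(vocab, term_count_min = min_count)
```

Looking at the frequency table as well as the range is usually more informative for pruning, since with bigrams most terms tend to occur only once.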
Posted on 2016-11-26 07:22:24
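vocab is a list that holds some meta-information (number of documents, ngram size, etc.) together with the main data.frame/data.table, which contains the term counts and, for each term, the number of documents it appears in.
As mentioned above, vocab$vocab is what you need (a data.table with the counts).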
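To illustrate that layout, here is a minimal sketch built on a hand-made stand-in list, so it runs without text2vec. The element names and values are taken from the str() output in this answer; the name vocab_like is hypothetical, and a plain data.frame is used in place of the data.table.

```r
# Hand-built stand-in mirroring the layout of the vocabulary object.
# `vocab_like` is a hypothetical name; a data.frame replaces the data.table.
vocab_like <- list(
  vocab = data.frame(
    terms        = c("plot_point", "depth_character",
                     "emotional_depth", "bale_added"),
    terms_counts = c(1L, 1L, 1L, 1L),
    doc_counts   = c(1L, 1L, 1L, 1L),
    stringsAsFactors = FALSE
  ),
  ngram          = c(ngram_min = 2L, ngram_max = 2L),
  document_count = 2L,
  stopwords      = character(0),
  sep_ngram      = "_"
)

# Meta-information lives in the top-level list elements
n_docs <- vocab_like$document_count
ngram_min <- vocab_like$ngram[["ngram_min"]]

# The counts live in the nested table
counts <- vocab_like$vocab$terms_counts
print(summary(counts))

# Share of documents in which each term appears
doc_share <- vocab_like$vocab$doc_counts / n_docs
print(doc_share)
```

The same accessors should work on the real object as long as it has the structure shown here; other text2vec versions may lay the vocabulary out differently, so it is worth checking first.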
You can inspect the internal structure by calling str(vocab):
List of 5
$ vocab :Classes ‘data.table’ and 'data.frame': 82 obs. of 3 variables:
..$ terms : chr [1:82] "plot_point" "depth_character" "emotional_depth" "bale_added" ...
..$ terms_counts: int [1:82] 1 1 1 1 1 1 1 1 1 1 ...
..$ doc_counts : int [1:82] 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, ".internal.selfref")=<externalptr>
$ ngram : Named int [1:2] 2 2
..- attr(*, "names")= chr [1:2] "ngram_min" "ngram_max"
$ document_count: int 2
$ stopwords : chr(0)
$ sep_ngram : chr "_"
- attr(*, "class")= chr "text2vec_vocabulary"
https://stackoverflow.com/questions/40815643