I am trying to count the frequency of a multi-word expression in quanteda. I know the corpus contains several texts with this expression, because when I search for it with Python's re module it finds them. With quanteda, however, it does not seem to work. Can anyone tell me what I am doing wrong?
> mwes <- phrase(c("抗美 援朝"))
> tc <- tokens_compound(toks_NK, mwes, concatenator = "")
> dfm <- dfm(tc, select="抗美援朝")
> dfm
Document-feature matrix of: 2,337 documents, 0 features and 7 docvars.
[ reached max_ndoc ... 2,331 more documents ]
Posted on 2020-07-14 13:13:33
You are on the right track, but quanteda's default tokenizer appears to split your phrase into four single characters:
> tokens("抗美 援朝")
Tokens consisting of 1 document.
text1 :
[1] "抗" "美" "援" "朝"
Because your pattern "抗美 援朝" consists of two tokens while the text is tokenized into four single-character tokens, the phrase can never match.
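A quick workaround that stays with the default tokenizer is to spell the pattern out character by character as well (a minimal sketch; toks_chr is just an illustrative name):
> toks_chr <- tokens("抗美 援朝")
> tokens_compound(toks_chr, pattern = phrase("抗 美 援 朝"), concatenator = "")
This recombines the four characters into the single token "抗美援朝", which can then be selected and counted as usual.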
For a more robust solution, you should consider a different tokenizer. Fortunately, the excellent spaCy library provides one and has Chinese models: with the spacyr package and quanteda, you can create tokens directly from the output of spacyr::spacy_tokenize() after loading the small Chinese model. To count the expressions, use a combination of tokens_select() and textstat_frequency() on the dfm:
library("quanteda")
## Package version: 2.1.0
txt <- "Forty-four Americans 抗美 援朝 have now taken the presidential oath.
The words have been spoken during rising tides of prosperity
and the still waters of peace. Yet, every so often the oath 抗美 援朝
is taken amidst gathering clouds and raging storms. At these moments,
America has carried on not simply because of the skill or vision of those in high office,
but because We the People 抗美 援朝 have remained faithful to the ideals of our forbearers,
and true to our founding documents."
library("spacyr")
# spacy_download_langmodel("zh_core_web_sm") # only needs to be done once
spacy_initialize(model = "zh_core_web_sm")
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.3.2, language model: zh_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")
spacy_tokenize(txt) %>%
as.tokens() %>%
tokens_compound(pattern = phrase("抗美 援朝"), concatenator = " ") %>%
tokens_select("抗美 援朝") %>%
dfm() %>%
textstat_frequency()
## feature frequency rank docfreq group
## 1 抗美 援朝         3    1       1   all
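Note that in quanteda v3 and later, textstat_frequency() has moved to the quanteda.textstats package, so on a newer installation you would also need to attach that package first:
library("quanteda.textstats")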
Posted on 2020-07-14 11:17:22
First of all, apologies for not using a proper full-Chinese text. But here is a presidential speech into which I have taken the liberty of inserting your Mandarin words:
data <- "I stand here today humbled by the task before us 抗美 援朝,
grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors.
I thank President Bush for his service to our nation,
as well as the generosity and cooperation he has shown throughout this transition.
Forty-four Americans 抗美 援朝 have now taken the presidential oath.
The words have been spoken during rising tides of prosperity
and the still waters of peace. Yet, every so often the oath 抗美 援朝
is taken amidst gathering clouds and raging storms. At these moments,
America has carried on not simply because of the skill or vision of those in high office,
but because We the People 抗美 援朝 have remained faithful to the ideals of our forbearers,
and true to our founding documents."
If you want to use quanteda, what you can do is count 4-grams (I assume your expression consists of four characters and will therefore be treated as four words).
Step 1: Split the text into word tokens:
data_tokens <- tokens(data, remove_punct = TRUE, remove_numbers = TRUE)
Step 2: Compute the 4-grams and build a frequency list of them:
fourgrams <- sort(table(unlist(as.character(tokens_ngrams(data_tokens, n = 4, concatenator = " ")))), decreasing = TRUE)
You can inspect the top ten:
fourgrams[1:10]
抗 美 援 朝 美 援 朝 have America has carried on Americans 抗 美 援
4 2 1 1
amidst gathering clouds and ancestors I thank President and cooperation he has and raging storms At
1 1 1 1
and the still waters and true to our
1 1
If you only want to know the frequency of your target compound:
fourgrams["抗 美 援 朝"]
抗 美 援 朝
4
Or, even more simply, especially if you are only interested in a single compound, you can use str_extract_all from stringr, which gives you the frequency count right away:
library(stringr)
length(unlist(str_extract_all(data, "抗美 援朝")))
[1] 4
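If all you need is the count itself, a small variant of the same idea is stringr's str_count() with a fixed pattern, which returns it directly:
str_count(data, fixed("抗美 援朝"))
[1] 4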
Posted on 2020-07-15 04:31:51
Dictionary lookup or compounding is usually the best approach for Chinese or Japanese text, but the dictionary values must be segmented in the same way as the tokens.
require(quanteda)
require(stringi)
txt <- "10月初,聯合國軍逆轉戰情,向北開進,越過38度線,終促使中华人民共和国決定出兵介入,中国称此为抗美援朝。"
lis <- list(mwe1 = "抗美援朝", mwe2 = "向北開進")
## tokenize dictionary values
lis <- lapply(lis, function(x) stri_c_list(as.list(tokens(x)), sep = " "))
dict <- dictionary(lis)
## tokenize texts and count
toks <- tokens(txt)
dfm(tokens_lookup(toks, dict))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## features
## docs mwe1 mwe2
## text1    1    1
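If you would rather keep each expression as a single feature than map it to a dictionary key, the same segmented values also work for compounding (a sketch reusing toks and lis from above; toks_comp is just an illustrative name):
toks_comp <- tokens_compound(toks, pattern = phrase(unlist(lis)), concatenator = "")
dfm(toks_comp) %>% dfm_select(pattern = c("抗美援朝", "向北開進"))
## each compound should now appear as a single feature with a count of 1, matching the lookup above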
Source: https://stackoverflow.com/questions/62892914