
How to count the frequency of a multi-word expression in quanteda?

Stack Overflow user
Asked on 2020-07-14 10:16:49
3 answers · 548 views · 0 following · 0 votes

I am trying to count the frequency of a multi-word expression in quanteda. I know the corpus contains several articles with this expression, because when I search for it with Python's `re` it finds them. However, it doesn't seem to work in quanteda. Can someone tell me what I am doing wrong?

> mwes <- phrase(c("抗美 援朝"))
> tc <- tokens_compound(toks_NK, mwes, concatenator = "")
> dfm <- dfm(tc, select="抗美援朝")
> dfm
Document-feature matrix of: 2,337 documents, 0 features and 7 docvars.
[ reached max_ndoc ... 2,331 more documents ]
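For reference, the raw-string count that the question says works in Python can be sketched with `re`; the sample document below is a hypothetical stand-in for one corpus article:

```python
import re

# Hypothetical stand-in for one article from the corpus;
# the expression appears twice in this sample.
doc = "10月初,中国称此为抗美援朝。抗美援朝战争始于1950年。"

# Count raw occurrences of the multi-word expression, as the
# question describes doing with Python's re module.
count = len(re.findall("抗美援朝", doc))
print(count)  # → 2
```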

3 Answers

Stack Overflow user

Accepted answer

Posted on 2020-07-14 13:13:33

You are on the right track, but quanteda's default tokenizer seems to split the tokens in your phrase into four single-character tokens:

> tokens("抗美 援朝")
Tokens consisting of 1 document.
text1 :
[1] "抗" "美" "援" "朝"

For this reason, you should consider an alternative tokenizer. Luckily, the excellent spaCy library provides one, and it comes with Chinese language models. Using the spacyr package together with quanteda, you can create tokens directly from the output of spacyr::spacy_tokenize() after loading the small Chinese model.

To count these expressions, you can use a combination of tokens_select() and textstat_frequency() on the dfm.

library("quanteda")
## Package version: 2.1.0

txt <- "Forty-four Americans 抗美 援朝 have now taken the presidential oath. 
The words have been spoken during rising tides of prosperity 
and the still waters of peace. Yet, every so often the oath 抗美 援朝
is taken amidst gathering clouds and raging storms. At these moments, 
America has carried on not simply because of the skill or vision of those in high office, 
but because We the People 抗美 援朝 have remained faithful to the ideals of our forbearers, 
and true to our founding documents."

library("spacyr")
# spacy_download_langmodel("zh_core_web_sm") # only needs to be done once
spacy_initialize(model = "zh_core_web_sm")
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.3.2, language model: zh_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")

spacy_tokenize(txt) %>%
  as.tokens() %>%
  tokens_compound(pattern = phrase("抗美 援朝"), concatenator = " ") %>%
  tokens_select("抗美 援朝") %>%
  dfm() %>%
  textstat_frequency()
##     feature frequency rank docfreq group
## 1 抗美 援朝         3    1       1   all
Votes: 0

Stack Overflow user

Posted on 2020-07-14 11:17:22

First of all, apologies for not being able to use a full Chinese text. But here is my presidential speech, into which I have taken the liberty of inserting your Mandarin words:

data <- "I stand here today humbled by the task before us 抗美 援朝, 
grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. 
I thank President Bush for his service to our nation, 
as well as the generosity and cooperation he has shown throughout this transition.

Forty-four Americans 抗美 援朝 have now taken the presidential oath. 
The words have been spoken during rising tides of prosperity 
and the still waters of peace. Yet, every so often the oath 抗美 援朝
is taken amidst gathering clouds and raging storms. At these moments, 
America has carried on not simply because of the skill or vision of those in high office, 
but because We the People 抗美 援朝 have remained faithful to the ideals of our forbearers, 
and true to our founding documents."

If you want to use quanteda, what you can do is count 4-grams (I assume your expression consists of four characters and would therefore be treated as four words).

Step 1: Split the text into word tokens:

data_tokens <- tokens(data, remove_punct = TRUE, remove_numbers = TRUE)

Step 2: Compute the 4-grams and make a frequency list of them.

fourgrams <- sort(table(unlist(as.character(tokens_ngrams(data_tokens, n = 4, concatenator = " ")))), decreasing = T)

You can inspect the top ten:

fourgrams[1:10]

                抗 美 援 朝               美 援 朝 have      America has carried on          Americans 抗 美 援 
                          4                           2                           1                           1 
amidst gathering clouds and ancestors I thank President      and cooperation he has        and raging storms At 
                          1                           1                           1                           1 
       and the still waters             and true to our 
                          1                           1 

If you just want to know the frequency of your target compound:

fourgrams["抗 美 援 朝"]
抗 美 援 朝 
         4 

Or, even simpler, especially if you are only interested in a single compound, you can use str_extract_all() from stringr. This gives you the frequency count right away:

library(stringr)
length(unlist(str_extract_all(data, "抗美 援朝")))
[1] 4
Votes: 2

Stack Overflow user

Posted on 2020-07-15 04:31:51

In general, for Chinese or Japanese it is best to do dictionary lookup or compounding of tokens, but the dictionary values should be segmented in the same way as the tokens.

require(quanteda)
require(stringi)

txt <- "10月初,聯合國軍逆轉戰情,向北開進,越過38度線,終促使中华人民共和国決定出兵介入,中国称此为抗美援朝。"
lis <- list(mwe1 = "抗美援朝", mwe2 = "向北開進")

## tokenize dictionary values
lis <- lapply(lis, function(x) stri_c_list(as.list(tokens(x)), sep = " "))
dict <- dictionary(lis)

## tokenize texts and count
toks <- tokens(txt)
dfm(tokens_lookup(toks, dict))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
##        features
## docs    mwe1 mwe2
##   text1    1    1
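The key idea here — segment the dictionary values with the same tokenizer used for the text, then match token subsequences — can be sketched in plain Python. The character-level tokenizer below is a hypothetical stand-in for quanteda's default CJK segmentation:

```python
import re

# Hypothetical mini-corpus; the expression "抗美援朝" appears twice overall.
docs = [
    "10月初,聯合國軍向北開進,中国称此为抗美援朝。",
    "抗美援朝是一个多字表达。",
]

def tokenize(text):
    # Character-level tokenizer: one token per Han character,
    # mimicking a tokenizer that splits CJK text into single characters.
    return [ch for ch in text if re.match(r"[\u4e00-\u9fff]", ch)]

def count_mwe(documents, mwe):
    # Tokenize the dictionary value with the SAME tokenizer as the text,
    # then count matching token subsequences in each document.
    pattern = tokenize(mwe)
    n = len(pattern)
    total = 0
    for doc in documents:
        toks = tokenize(doc)
        total += sum(1 for i in range(len(toks) - n + 1)
                     if toks[i:i + n] == pattern)
    return total

print(count_mwe(docs, "抗美援朝"))  # → 2
```

If the dictionary value were segmented differently from the text (e.g. as one four-character token while the text is split per character), the subsequence match would never fire — which is exactly the mismatch the answer warns about.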
Votes: 1
Original content provided by Stack Overflow: https://stackoverflow.com/questions/62892914
