Question: Interpreting dfm_weight(scheme = 'prop') with groups (quanteda)

Stack Overflow user

Asked 2019-07-02 16:25:51

Answers: 1 · Views: 390 · Followers: 0 · Votes: 2

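I am using dfm_weight to look at the different weighting options. If I choose scheme = 'prop' and group textstat_frequency by location, what is the proper interpretation of a word's value within each group?

Say that in New York the term career is 0.6 and in Boston the word team is 4.0. How should I interpret these numbers?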

# Build a dfm of unigrams and bigrams, weight it by within-document proportion,
# then lemmatize and drop bigrams that start or end with a stopword.
# Note: despite its name, `corp` ends up holding a dfm, not a corpus.
corp <- corpus(df, text_field = "What are the areas that need the most improvement at our company?") %>%
  dfm(remove_numbers = TRUE, remove_punct = TRUE,
      remove = c(toRemove, stopwords("english")), ngrams = 1:2) %>%
  dfm_weight("prop") %>%
  dfm_replace(pattern = as.character(lemma$first), replacement = as.character(lemma$X1)) %>%
  dfm_remove(pattern = c(paste0("^", stopwords("english"), "_"),
                         paste0("_", stopwords("english"), "$")),
             valuetype = "regex")

# Top 10 features per location, computed on the already-weighted dfm
freq_weight <- textstat_frequency(corp, n = 10, groups = c("location"))

# One horizontal bar chart per location
ggplot(data = freq_weight, aes(x = nrow(freq_weight):1, y = frequency)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ group, scales = "free") +
  coord_flip() +
  scale_x_continuous(breaks = nrow(freq_weight):1,
                     labels = freq_weight$feature) +
  labs(x = NULL, y = "Relative frequency")

quanteda

1 Answer

Stack Overflow user

Accepted answer

Posted 2019-07-02 18:55:25

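The proper interpretation is that each value is the sum of the term's original within-document proportions, added up by group. This is not a very natural quantity to interpret: it is a sum of proportions, and once they have been summed you no longer know how many terms (in absolute frequency) each proportion was based on.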
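To make this concrete, the quantity can be written out as follows. The notation is introduced here only for illustration and is not taken from the quanteda documentation: n_{d,t} is the count of term t in document d, N_d is the total count of all features in document d, and |g| is the number of documents in group g.

```latex
\[
  f_g(t) \;=\; \sum_{d \in g} \frac{n_{d,t}}{N_d},
  \qquad
  0 \;\le\; f_g(t) \;\le\; |g| .
\]
```

Because each document contributes at most 1 to the sum, a value of 4.0 for team in Boston can only arise if at least four Boston documents contain the term. It is not a share of the words used in Boston, which is why it can exceed 1.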

quanteda < 1.4 did not allow this, but after some discussion we enabled it (user beware).

library("quanteda")
#> Package version: 1.4.3
corp <- corpus(c("a b b c c", 
                 "a a b", 
                 "b b c",
                 "c c c d"),
               docvars = data.frame(grp = c(1, 1, 2, 2)))
dfmat <- dfm(corp) %>%
    dfm_weight(scheme = "prop")
dfmat
#> Document-feature matrix of: 4 documents, 4 features (43.8% sparse).
#> 4 x 4 sparse Matrix of class "dfm"
#>        features
#> docs            a         b         c    d
#>   text1 0.2000000 0.4000000 0.4000000 0   
#>   text2 0.6666667 0.3333333 0         0   
#>   text3 0         0.6666667 0.3333333 0   
#>   text4 0         0         0.7500000 0.25

Now we can compare textstat_frequency() with and without groups. (Neither result makes much sense.)

# sum across the corpus
textstat_frequency(dfmat, groups = NULL)
#>   feature frequency rank docfreq group
#> 1       c 1.4833333    1       3   all
#> 2       b 1.4000000    2       3   all
#> 3       a 0.8666667    3       2   all
#> 4       d 0.2500000    4       1   all

# sum across groups
textstat_frequency(dfmat, groups = "grp")
#>   feature frequency rank docfreq group
#> 1       a 0.8666667    1       2     1
#> 2       b 0.7333333    2       2     1
#> 3       c 0.4000000    3       1     1
#> 4       c 1.0833333    1       2     2
#> 5       b 0.6666667    2       1     2
#> 6       d 0.2500000    3       1     2
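The grouped values above can be reproduced by hand, which may help show what is being summed. The following is a minimal base R sketch that does not use quanteda; the names counts, grp, props and sum_of_props are made up for this illustration, and the counts are typed in from the four toy documents.

```r
# Raw counts for the four toy documents (rows) and features a-d (columns)
counts <- matrix(c(1, 2, 2, 0,
                   2, 1, 0, 0,
                   0, 2, 1, 0,
                   0, 0, 3, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("text", 1:4), c("a", "b", "c", "d")))
grp <- c(1, 1, 2, 2)

# Within-document proportions: each row divided by its own total
props <- counts / rowSums(counts)

# Sum the per-document proportions within each group
sum_of_props <- rowsum(props, group = grp)
print(round(sum_of_props, 4))

# Each document contributes exactly 1 in total, so every group's row
# adds up to the number of documents in that group (2 and 2 here)
print(rowSums(sum_of_props))
```

The row totals equal the number of documents per group rather than 1, which is one way to see that these values are not relative frequencies within a group.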

If what you want is relative term frequencies after grouping, you can group the dfm first and then weight it, like this:

dfmat2 <- dfm(corp) %>%
    dfm_group(groups = "grp") %>%
    dfm_weight(scheme = "prop")

textstat_frequency(dfmat2, groups = "grp")
#>   feature frequency rank docfreq group
#> 1       a 0.3750000    1       1     1
#> 2       b 0.3750000    1       1     1
#> 3       c 0.2500000    3       1     1
#> 4       c 0.5714286    1       1     2
#> 5       b 0.2857143    2       1     2
#> 6       d 0.1428571    3       1     2

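Now the term frequencies sum to 1.0 within each group, which makes them more natural to interpret, because they were computed from grouped counts rather than from grouped proportions.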
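The same group-then-weight order can be checked by hand. This is again a base R sketch under the same assumptions as the toy example (hand-typed counts, illustrative variable names, no quanteda calls): the counts are pooled per group first, and only then divided by the group totals.

```r
# Raw counts for the four toy documents (rows) and features a-d (columns)
counts <- matrix(c(1, 2, 2, 0,
                   2, 1, 0, 0,
                   0, 2, 1, 0,
                   0, 0, 3, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("text", 1:4), c("a", "b", "c", "d")))
grp <- c(1, 1, 2, 2)

# Step 1: pool the raw counts within each group
grouped_counts <- rowsum(counts, group = grp)

# Step 2: divide each group's counts by that group's total
prop_of_sums <- grouped_counts / rowSums(grouped_counts)
print(round(prop_of_sums, 4))

# Each group's proportions now add up to 1
print(rowSums(prop_of_sums))
```

For group 1, term a works out to 3/8 = 0.375, compared with 0.2 + 0.667 = 0.867 when the per-document proportions are summed instead.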

Votes: 2

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/56856654
