
Convert a dfm to a data frame

Stack Overflow user
Asked 2020-06-12 23:17:30
Answers: 1 · Views: 99 · Followers: 0 · Votes: 0

I have a dfm result from quanteda:

library(quanteda)
df <- data.frame(id = c(1), text = c("I am loving it"), stringsAsFactors = FALSE)

myDfm <- df$text %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
    tokens_remove(pattern = c(stopwords(source = "smart"))) %>%
    dfm()

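How can I turn myDfm into a data frame that has the same rows and columns as the input, with the text column holding the cleaned text produced by the dfm process?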

Example of the expected output:

data.frame(id = c(1), text = c("loving"))

What I tried:

convert(myDfm, to = "data.frame")

1 Answer

Stack Overflow user

Answered 2020-06-13 00:07:52

It is a bit convoluted, but the code below does the job.

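library(dplyr)
library(tidyr)
library(quanteda)

# Convert the dfm to a wide data frame: one row per document, one column per feature.
# Note: the document-name column is called "document" here; other quanteda
# versions may name it "doc_id", in which case adjust contains() and mutate() below.
out <- convert(myDfm, to = "data.frame")
pivot_longer(out, cols = !contains("document"), names_to = "features", values_to = "count") %>%
  mutate(id = as.integer(gsub("[a-z]", "", document))) %>% # "text1" -> 1
  inner_join(df) %>% # joins on id
  select(id, features) # select only the id and features columns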

Joining, by = "id"
# A tibble: 1 x 2
     id features
  <dbl> <chr>   
1     1 loving

The first two lines of the code (the convert() and pivot_longer() calls) can be replaced with tidytext::tidy.
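
As a rough sketch of that replacement, assuming the tidytext package is installed: tidy() applied to a dfm returns a long tibble with the columns document, term, and count, and it only lists non-zero entries. The two-row df is the one from the example further down, and the name tidy_out is made up for illustration.

```r
library(quanteda)
library(dplyr)
library(tidytext)

# Two-record input, same as the second example in this answer
df <- data.frame(id = c(1, 2),
                 text = c("I am loving it", "I am hating it"),
                 stringsAsFactors = FALSE)

myDfm <- df$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords(source = "smart")) %>%
  dfm()

# tidy() replaces convert() + pivot_longer(): one row per (document, term) pair
tidy_out <- tidy(myDfm) %>%
  mutate(id = as.integer(gsub("[a-z]", "", document))) %>% # "text1" -> 1
  inner_join(df, by = "id") %>%
  select(id, features = term)

print(tidy_out)
```

Because tidy() skips zero counts, no extra filter(count != 0) step should be needed in this variant.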

If a document yields more than one word, you can use summarise() to collapse them into a single row.
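
A minimal sketch of that collapsing step, under a few assumptions: the first input text is invented so that one document keeps more than one word, the name collapsed is hypothetical, and the document-name column is looked up by position because its name may differ between quanteda versions. Since a dfm is a bag of words, the collapsed text follows the feature order of the dfm rather than the original sentence, and repeated words appear only once.

```r
library(quanteda)
library(dplyr)
library(tidyr)

# Invented input: the first document should keep more than one word after cleaning
df <- data.frame(id = c(1, 2),
                 text = c("I am loving and enjoying it", "I am hating it"),
                 stringsAsFactors = FALSE)

myDfm <- df$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords(source = "smart")) %>%
  dfm()

out <- convert(myDfm, to = "data.frame")
doc_col <- names(out)[1] # document-name column ("document" in the answer above)

collapsed <- out %>%
  pivot_longer(cols = -all_of(doc_col), names_to = "features", values_to = "count") %>%
  filter(count != 0) %>% # keep only words that occur in the document
  mutate(id = as.integer(gsub("[a-z]", "", .data[[doc_col]]))) %>% # "text1" -> 1
  group_by(id) %>%
  summarise(text = paste(features, collapse = " ")) # one row per id

print(collapsed)
```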

Example with 2 records that also filters out the unwanted (zero-count) values:

df <- data.frame(id = c(1,2), text = c("I am loving it", "I am hating it"), stringsAsFactors = FALSE)

myDfm <- df$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = c(stopwords(source = "smart"))) %>%
  dfm()

# Wide data frame: every document gets a column for every feature, including zeros
out <- convert(myDfm, to = "data.frame")
pivot_longer(out, cols = !contains("document"), names_to = "features", values_to = "count") %>%
  mutate(id = as.integer(gsub("[a-z]", "", document))) %>% # "text1" -> 1
  filter(count != 0) %>% # drop words that do not occur in the document
  inner_join(df) %>% # joins on id
  select(id, features) # select only the id and features columns

Joining, by = "id"
# A tibble: 2 x 2
     id features
  <dbl> <chr>   
1     1 loving  
2     2 hating  
Votes: 1
Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/62347239
