I have a dfm result from quanteda:
library(quanteda)
df <- data.frame(id = c(1), text = c("I am loving it"), stringsAsFactors = FALSE)
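myDfm <- df$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = c(stopwords(source = "smart"))) %>%
  dfm()
How can I turn myDfm into a data frame that has the same number of rows and columns as the input, with the text column holding the clean text produced by the dfm process?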
Expected output example:
data.frame(id = c(1), text = c("loving"))
What I tried:
convert(myDfm, to = "data.frame")
Posted on 2020-06-13 00:07:52
It is a bit convoluted, but the code below does the job.
library(dplyr)
library(tidyr)
library(quanteda)
out <- convert(myDfm, to = "data.frame")
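# assumes the document-name column is called `document`;
# recent quanteda versions may name it `doc_id` instead
pivot_longer(out, cols = !contains("document"), names_to = "features", values_to = "count") %>%
  mutate(id = as.integer(gsub("[a-z]", "", document))) %>% # "text1" -> 1
  inner_join(df) %>% # joins on id
  select(id, features) # select only the id and features column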
Joining, by = "id"
# A tibble: 1 x 2
     id features
  <dbl> <chr>
1     1 loving
The first two lines of the code (the convert() and pivot_longer() calls) can be replaced with tidytext::tidy.
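As a rough illustration of the tidytext::tidy route, here is a minimal sketch. It assumes the tidytext package is installed and that its tidy() method for dfm objects returns one row per non-zero cell with the columns document, term and count; the name `result` is only illustrative.

```r
library(quanteda)
library(dplyr)     # provides %>%, mutate(), select()
library(tidytext)  # assumed to be installed; provides tidy() for dfm objects

df <- data.frame(id = c(1, 2),
                 text = c("I am loving it", "I am hating it"),
                 stringsAsFactors = FALSE)

myDfm <- df$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords(source = "smart")) %>%
  dfm()

# tidy() already gives a long table, so convert() and pivot_longer() are not needed
result <- tidy(myDfm) %>%
  mutate(id = as.integer(gsub("[a-z]", "", document))) %>% # "text1" -> 1
  select(id, features = term)

result
```

Because only non-zero cells are assumed to be returned, this version should not need a separate filter on count.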
Now, if the result contains more than one word per document, you can use summarise to collapse them into a single row.
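A minimal sketch of that collapsing step, assuming a long table with one row per id and word. The table `long` and the extra word "coding" are made up for illustration and are not part of the original data.

```r
library(dplyr)

# hypothetical long-format result: document 1 kept two words, document 2 kept one
long <- data.frame(id = c(1, 1, 2),
                   features = c("loving", "coding", "hating"),
                   stringsAsFactors = FALSE)

# paste the words of each document into a single space-separated string
collapsed <- long %>%
  group_by(id) %>%
  summarise(text = paste(features, collapse = " "))

collapsed
```

The result has one row per id with a text column, which matches the id/text shape asked for in the question.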
Example with 2 records, removing the unwanted (zero-count) values:
df <- data.frame(id = c(1,2), text = c("I am loving it", "I am hating it"), stringsAsFactors = FALSE)
myDfm <- df$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = c(stopwords(source = "smart"))) %>%
  dfm()
out <- convert(myDfm, to = "data.frame")
pivot_longer(out, cols = !contains("document"), names_to = "features", values_to = "count") %>%
  mutate(id = as.integer(gsub("[a-z]", "", document))) %>%
  filter(count != 0) %>%
  inner_join(df) %>% # joins on id
  select(id, features) # select only the id and features column
Joining, by = "id"
# A tibble: 2 x 2
     id features
  <dbl> <chr>
1     1 loving
2     2 hating
https://stackoverflow.com/questions/62347239