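Q: How to extract keywords from a data frame in R

Stack Overflow user

Asked on 2017-07-28 18:30:13

3 answers, 2.6K views, score 1

I am new to text mining in R. I want to remove stop words from a column of my data frame (that is, extract the keywords) and put those keywords into a new column.

I tried building a corpus, but that did not help.

df$C3 is what I currently have. I want to add the column df$C4, but I cannot get it to work.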

df <- structure(list(C3 = structure(c(3L, 4L, 1L, 7L, 6L, 9L, 5L, 8L, 
       10L, 2L), .Label = c("Are doing good", "For the help", "hello everyone", 
       "hope you all", "I Hope", "I need help", "In life", "It would work", 
       "On Text-Mining", "Thanks"), class = "factor"), C4 = structure(c(2L, 
       4L, 1L, 6L, 3L, 7L, 5L, 9L, 8L, 3L), .Label = c("doing good", 
       "everyone", "help", "hope", "Hope", "life", "Text-Mining", "Thanks", 
       "work"), class = "factor")), .Names = c("C3", "C4"), row.names = c(NA, 
       -10L), class = "data.frame")

head(df)
#               C3          C4
# 1 hello everyone    everyone
# 2   hope you all        hope
# 3 Are doing good  doing good
# 4        In life        life
# 5    I need help        help
# 6 On Text-Mining Text-Mining


3 Answers

Stack Overflow user

Accepted answer

Posted on 2017-07-28 19:28:02

This solution uses the dplyr and tidytext packages.

library(dplyr)
library(tidytext)

# subset of your dataset
dt <- data.frame(C1 = c(108, 20, 999, 52, 400),
                 C2 = c(1, 3, 7, 6, 9),
                 C3 = c("hello everyone", "hope you all", "Are doing good", "in life", "I need help"),
                 stringsAsFactors = FALSE)

# function to combine words (by pasting one next to the other)
f <- function(x) { paste(x, collapse = " ") }

dt %>%
  unnest_tokens(word, C3) %>%      # split phrases into words
  filter(!word %in% stop_words$word) %>%   # keep appropriate words
  group_by(C1, C2) %>%             # for each combination of C1 and C2
  summarise(word = f(word)) %>%    # combine multiple words (if there are multiple)
  ungroup()                        # forget the grouping

# # A tibble: 2 x 3
#        C1    C2  word
#      <dbl> <dbl> <chr>
#   1    20     3  hope
#   2    52     6  life

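The catch is that the stop word list built into the package filters out some words you want to keep. You therefore need an extra manual step that names the words to retain. You can do it like this: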

dt %>%
  unnest_tokens(word, C3) %>%      # split phrases into words
  filter(!word %in% stop_words$word | word %in% c("everyone","doing","good")) %>%   # keep appropriate words
  group_by(C1, C2) %>%             # for each combination of C1 and C2
  summarise(word = f(word)) %>%    # combine multiple words (if there are multiple)
  ungroup()                        # forget the grouping

# # A tibble: 4 x 3
#        C1    C2       word
#      <dbl> <dbl>      <chr>
#   1    20     3       hope
#   2    52     6       life
#   3   108     1   everyone
#   4   999     7 doing good
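The pipeline above returns only the rows that still have at least one word left, so phrases made up entirely of stop words disappear from the result. If one value per input row is needed (for example to assign it straight to a new column), the same "not a stop word, or explicitly kept" rule can be sketched in base R. This is only an illustration: `extract_keywords`, `demo_stop_words` and the `keep` argument are hypothetical names, and the stop word vector is hand-written for the example rather than taken from tidytext's `stop_words`.

```r
# Minimal base R sketch; all names here are hypothetical and the stop word
# vector is a hand-written stand-in, not tidytext::stop_words.
extract_keywords <- function(text, stop_words, keep = character(0)) {
  # lower-case each phrase and split it on whitespace
  tokens <- strsplit(tolower(as.character(text)), "[[:space:]]+")
  vapply(tokens, function(words) {
    words <- words[nzchar(words)]
    # same rule as the filter() step: not a stop word, or explicitly kept
    kept <- words[!(words %in% stop_words) | words %in% keep]
    paste(kept, collapse = " ")
  }, character(1))
}

phrases <- c("hello everyone", "hope you all", "Are doing good", "in life", "I need help")
demo_stop_words <- c("hello", "everyone", "you", "all", "are", "doing", "good",
                     "in", "i", "need", "help")

# one result per input phrase; phrases with no keywords give ""
keywords <- extract_keywords(phrases, demo_stop_words,
                             keep = c("everyone", "doing", "good"))
```

Because the result has the same length as the input, it can be assigned directly, e.g. `dt$C4 <- extract_keywords(dt$C3, ...)`.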

Score: 0

Stack Overflow user

Posted on 2017-07-28 19:03:40

This is the first thing I have done in R, so it may not be the best approach, but something like:

library(stringi)

 df2 <- do.call(rbind, lapply(stop$stop, function(x){
    t <- data.frame(c1= df[,1], c2 = df[,2], words = stri_extract(df[,3], coll=x))
    t<-na.omit(t)}))

Sample data:

 df = data.frame(c1 = c(108,20,99), c2 = c(1,3,7), c3 = c("hello everyone", "hope you all", "are doing well"))

 stop = data.frame(stop = c("you", "all"))

You can then reshape df2 with:

# assumed: u is the df2 built above (the answer does not define u)
u <- df2
df2 = data.frame(c1 = unique(u$c1), c2 = unique(u$c2), words = paste(u$words, collapse= ','))

Then cbind df and df2.
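The reshape step collapses all matches into a single row, which is only correct when every match belongs to the same c1/c2 pair. A hedged sketch of a per-group version using base R `aggregate()` and `merge()` follows; `df_demo`, `matches`, `collapsed` and `result` are hypothetical stand-ins for the data frames in this answer, with made-up match rows.

```r
# Hypothetical stand-ins for df and the long-format matches (df2) above
df_demo <- data.frame(c1 = c(108, 20, 99), c2 = c(1, 3, 7),
                      c3 = c("hello everyone", "hope you all", "are doing well"),
                      stringsAsFactors = FALSE)
matches <- data.frame(c1 = c(20, 20, 108), c2 = c(3, 3, 1),
                      words = c("you", "all", "hello"),
                      stringsAsFactors = FALSE)

# collapse the matched words separately for each c1/c2 combination
collapsed <- aggregate(words ~ c1 + c2, data = matches,
                       FUN = function(w) paste(w, collapse = ","))

# join back to the original rows; rows without any match get NA in `words`
result <- merge(df_demo, collapsed, by = c("c1", "c2"), all.x = TRUE)
```

Using `merge()` with `all.x = TRUE` instead of `cbind()` keeps every original row and aligns the words by key rather than by position.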

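Score: 0

Stack Overflow user

Posted on 2017-07-28 19:33:11

I would use the tm package. It ships with a small dictionary of English stop words. You can use gsub() to replace those stop words with spaces.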

library(tm)
prep      <- tolower(paste(" ", df$C3, " "))
# pad every stop word with a space on both sides so only whole words match
regex_pat <- paste0(" ", stopwords("en"), " ", collapse = "|")
df$C4     <- gsub(regex_pat, " ", prep)
df$C4     <- gsub(regex_pat, " ", df$C4)
#    C3               C4
# 1  hello everyone   hello everyone  
# 2    hope you all             hope  
# 3  Are doing good             good  
# 4         In life             life  
# 5     I need help        need help

You can easily add new words to the list, for example c("hello", "othernewword", stopwords("en")).
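As a possible variation on the space-padding approach, word boundaries (`\\b`) let a single `gsub()` pass remove adjacent stop words without padding the text first. The sketch below is an assumption-laden illustration: `remove_stop_words` and `demo_stop_words` are hypothetical names, the stop word vector is hand-written rather than `stopwords("en")`, and it assumes the stop words are plain words with no apostrophes or regex metacharacters.

```r
# Minimal base R sketch; names are hypothetical and the stop word vector is
# a hand-written stand-in for tm::stopwords("en").
remove_stop_words <- function(text, stop_words) {
  # \\b anchors each alternative to whole words only
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  cleaned <- gsub(pattern, " ", tolower(as.character(text)), perl = TRUE)
  # squeeze repeated whitespace and trim the ends
  trimws(gsub("[[:space:]]+", " ", cleaned))
}

demo_stop_words <- c("you", "all", "in", "i")
cleaned_text <- remove_stop_words(c("hope you all", "In life", "I need help"),
                                  demo_stop_words)
```

With a real list such as `stopwords("en")`, entries like "i'm" would need extra care, because `\\b` also matches before an apostrophe.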

Score: 0

Original page content provided by Stack Overflow; translation supported by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/45371139
