
How to determine the most frequent word per variable in R?

Stack Overflow user
Asked on 2020-01-15 19:50:05
Answers: 1  Views: 315  Followers: 0  Votes: 1

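I have started learning R and am currently getting familiar with a text-mining package.

I have started a project in which I exported a chat log from a group chat and cleaned the data in Excel into 4 columns.

I have been able to collapse all the messages into a single string and create a word cloud of the most frequently used words.

Edit: apologies for the vague submission.

The first step I took was to clean the data so that it follows this simple structure: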

Date            Time        Sender      Message
01/01/2019      09:54:03    Person 1    Hello
01/01/2019      10:55:03    Person 2    Hello
01/01/2019      11:56:03    Person 3    Hello
01/01/2019      12:57:03    Person 4    Hello

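Using the tm and wordcloud packages, I managed to build a word cloud of the most common words across all members of the chat over the past year, as follows: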

library(tm)
library(wordcloud)  # provides wordcloud()
library(viridis)    # provides the viridis() colour palette

Chat <- read.csv("ChatExport_Cleansed.csv", stringsAsFactors = FALSE)

# Collapse every message into a single string, i.e. one document
chat_text <- paste(Chat$Message, collapse = " ")

chat_source <- VectorSource(chat_text)
corpus <- Corpus(chat_source)

# Normalise the text before counting terms
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)

# Term frequencies, most frequent first
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing = TRUE)

words <- names(frequency)
color_pal <- viridis(n = 20)
# words and frequencies must have the same length, so take the top 300 of each
wordcloud(words[1:300], frequency[1:300],
          random.order = TRUE, random.color = TRUE, colors = color_pal)

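What I would like to explore next is the most frequently used word for each sender, and which senders send the most messages during specific hours (9 to 5).

The output would look something like:

Sender 1: most frequent word
Sender 2: most frequent word
etc.

I would also like to see, for each sender, the number of messages sent between 9 am and 5 pm.

I am not sure how to achieve this. Is it possible to use the collapse argument to combine the messages into one large vector per sender?

Thanks in advance for any suggestions!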


1 Answer

Stack Overflow user

Accepted answer

Answered on 2020-01-15 22:15:26

Starting with the following data:

head(df)
# A tibble: 6 x 4
  Date       Time     Sender  Message
  <date>     <chr>    <chr>   <fct>  
1 2020-01-01 00:00:00 Person1 C      
2 2020-01-01 01:00:00 Person1 C      
3 2020-01-01 02:00:00 Person1 B      
4 2020-01-01 03:00:00 Person1 B      
5 2020-01-01 04:00:00 Person1 C      
6 2020-01-01 05:00:00 Person1 E   

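You can first filter for specific hours by creating a Date_Time column with the ymd_hms function from the lubridate package, and then use the filter function from dplyr to keep only the messages sent between 9 am and 5 pm.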

library(lubridate)
library(dplyr)
df %>% mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
  filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17)

# A tibble: 18 x 5
   Date       Time     Sender  Message Date_Time          
   <date>     <chr>    <chr>   <fct>   <dttm>             
 1 2020-01-01 09:00:00 Person1 C       2020-01-01 09:00:00
 2 2020-01-01 10:00:00 Person1 E       2020-01-01 10:00:00
 3 2020-01-01 11:00:00 Person1 C       2020-01-01 11:00:00
 4 2020-01-01 12:00:00 Person1 C       2020-01-01 12:00:00
 5 2020-01-01 13:00:00 Person1 A       2020-01-01 13:00:00
 6 2020-01-01 14:00:00 Person1 D       2020-01-01 14:00:00
 7 2020-01-01 15:00:00 Person1 A       2020-01-01 15:00:00
 8 2020-01-02 16:00:00 Person1 A       2020-01-02 16:00:00
 9 2020-01-02 17:00:00 Person1 E       2020-01-02 17:00:00
10 2020-01-01 09:00:00 Person2 D       2020-01-01 09:00:00
11 2020-01-01 10:00:00 Person2 E       2020-01-01 10:00:00
12 2020-01-01 11:00:00 Person2 E       2020-01-01 11:00:00
13 2020-01-01 12:00:00 Person2 C       2020-01-01 12:00:00
14 2020-01-01 13:00:00 Person2 A       2020-01-01 13:00:00
15 2020-01-01 14:00:00 Person2 B       2020-01-01 14:00:00
16 2020-01-01 15:00:00 Person2 E       2020-01-01 15:00:00
17 2020-01-02 16:00:00 Person2 E       2020-01-02 16:00:00
18 2020-01-02 17:00:00 Person2 D       2020-01-02 17:00:00
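
Note that `hour(Date_Time) <= 17` keeps every message stamped up to 17:59:59, because `hour()` returns only the hour component. If the window should close at exactly 17:00:00, one option is to compare seconds since midnight instead. The following is a minimal base R sketch under that assumption, not part of the original answer; it assumes `Time` is a character column in HH:MM:SS format as above, and `chat` and `secs_since_midnight` are hypothetical names used only for illustration.

```r
# Toy data in the same shape as the example above (hypothetical values).
chat <- data.frame(
  Sender = c("Person1", "Person1", "Person2", "Person2"),
  Time   = c("08:59:59", "09:00:00", "17:00:00", "17:30:00"),
  stringsAsFactors = FALSE
)

# Convert "HH:MM:SS" strings to seconds since midnight.
secs_since_midnight <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE),
         function(p) sum(as.numeric(p) * c(3600, 60, 1)),
         numeric(1))
}

secs <- secs_since_midnight(chat$Time)

# Keep messages from 09:00:00 up to and including 17:00:00.
in_window <- secs >= 9 * 3600 & secs <= 17 * 3600
chat[in_window, ]
```

With this toy data, only the 09:00:00 and 17:00:00 rows are kept; the 17:30:00 row, which the hour-based filter would include, is dropped.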

Then you can group_by sender and message to count the frequency of each message, and filter for the maximum frequency within each sender.

df %>% mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
  filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17) %>%
  group_by(Sender, Message) %>% count() %>% 
  group_by(Sender) %>%
  filter(n == max(n))

# A tibble: 3 x 3
# Groups:   Sender [2]
  Sender  Message     n
  <chr>   <fct>   <int>
1 Person1 A           3
2 Person1 C           3
3 Person2 E           4
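
The pipeline above counts whole messages, which is enough for the single-token messages in the sample data. For real chat text, where a message holds several words, the messages would first need to be split into words. The following is a minimal base R sketch of that idea, not the answer's own method; it assumes words are separated by whitespace or punctuation and does no stop-word removal, and `chat`, `words`, `top_words` and the sample messages are hypothetical.

```r
# Toy chat log with multi-word messages (hypothetical values).
chat <- data.frame(
  Sender  = c("Person1", "Person1", "Person2", "Person2"),
  Message = c("Hello there, hello!", "hello again",
              "Lunch today?", "lunch at noon"),
  stringsAsFactors = FALSE
)

# Lower-case, then split on anything that is not a letter, digit or apostrophe.
tokens <- strsplit(tolower(chat$Message), "[^[:alnum:]']+")

# One row per (sender, word) pair; drop empty strings left by leading separators.
words <- data.frame(
  Sender = rep(chat$Sender, lengths(tokens)),
  Word   = unlist(tokens),
  stringsAsFactors = FALSE
)
words <- words[nzchar(words$Word), ]

# For each sender, tabulate the words and keep every word tied for the top count.
top_words <- lapply(split(words$Word, words$Sender), function(w) {
  counts <- table(w)
  counts[counts == max(counts)]
})
top_words
```

Ties are kept on purpose, mirroring the `filter(n == max(n))` step above, so a sender can have more than one most frequent word.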

If you want to know the number of messages each sender sent during that time window, you can do:

df %>% mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
  filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17) %>%
  group_by(Sender) %>% count()

# A tibble: 2 x 2
# Groups:   Sender [2]
  Sender      n
  <chr>   <int>
1 Person1     9
2 Person2     9
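
To see which sender was most active in the window, which is the other part of the question, the per-sender counts can be sorted in descending order. The following is a minimal base R sketch using `table()` and `sort()` rather than the dplyr calls above; the vector `senders` is hypothetical and stands for the Sender value of each message that passed the time filter.

```r
# Sender of each message inside the time window (hypothetical values).
senders <- c("Person1", "Person2", "Person1", "Person3", "Person1", "Person2")

# Count messages per sender and put the most active sender first.
counts <- sort(table(senders), decreasing = TRUE)
counts

# Name of the most active sender (the first one listed if there is a tie).
names(counts)[1]
```

In the dplyr pipeline, appending `arrange(desc(n))` after `count()` should give the same ordering.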

Does this answer your question?

Data

structure(list(Date = structure(c(18262, 18262, 18262, 18262, 
18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262, 
18262, 18262, 18262, 18263, 18263, 18263, 18263, 18263, 18263, 
18263, 18263, 18263, 18262, 18262, 18262, 18262, 18262, 18262, 
18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262, 
18262, 18263, 18263, 18263, 18263, 18263, 18263, 18263, 18263, 
18263), class = "Date"), Time = c("00:00:00", "01:00:00", "02:00:00", 
"03:00:00", "04:00:00", "05:00:00", "06:00:00", "07:00:00", "08:00:00", 
"09:00:00", "10:00:00", "11:00:00", "12:00:00", "13:00:00", "14:00:00", 
"15:00:00", "16:00:00", "17:00:00", "18:00:00", "19:00:00", "20:00:00", 
"21:00:00", "22:00:00", "23:00:00", "00:00:00", "00:00:00", "01:00:00", 
"02:00:00", "03:00:00", "04:00:00", "05:00:00", "06:00:00", "07:00:00", 
"08:00:00", "09:00:00", "10:00:00", "11:00:00", "12:00:00", "13:00:00", 
"14:00:00", "15:00:00", "16:00:00", "17:00:00", "18:00:00", "19:00:00", 
"20:00:00", "21:00:00", "22:00:00", "23:00:00", "00:00:00"), 
    Sender = c("Person1", "Person1", "Person1", "Person1", "Person1", 
    "Person1", "Person1", "Person1", "Person1", "Person1", "Person1", 
    "Person1", "Person1", "Person1", "Person1", "Person1", "Person1", 
    "Person1", "Person1", "Person1", "Person1", "Person1", "Person1", 
    "Person1", "Person1", "Person2", "Person2", "Person2", "Person2", 
    "Person2", "Person2", "Person2", "Person2", "Person2", "Person2", 
    "Person2", "Person2", "Person2", "Person2", "Person2", "Person2", 
    "Person2", "Person2", "Person2", "Person2", "Person2", "Person2", 
    "Person2", "Person2", "Person2"), Message = structure(c(3L, 
    3L, 2L, 2L, 3L, 5L, 4L, 1L, 2L, 3L, 5L, 3L, 3L, 1L, 4L, 1L, 
    1L, 5L, 3L, 2L, 2L, 1L, 3L, 4L, 1L, 3L, 5L, 4L, 2L, 5L, 1L, 
    1L, 2L, 3L, 4L, 5L, 5L, 3L, 1L, 2L, 5L, 5L, 4L, 5L, 2L, 1L, 
    1L, 3L, 1L, 5L), .Label = c("A", "B", "C", "D", "E"), class = "factor")), row.names = c(NA, 
-50L), class = c("tbl_df", "tbl", "data.frame"))
Votes: 0
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/59758639
