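I have started learning R and am currently getting familiar with a text mining package.
I started a project in which I exported a chat log from a group chat and cleaned the data in Excel down to 4 columns.
I have been able to collapse all the messages into a single text and create a word cloud of the most frequently used words.
Edit: apologies for the vague submission.
The first step I took was to clean the data so that it follows this simple structure: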
Date Time Sender Message
01/01/2019 09:54:03 Person 1 Hello
01/01/2019 10:55:03 Person 2 Hello
01/01/2019 11:56:03 Person 3 Hello
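01/01/2019 12:57:03 Person 4 Hello
Using the tm and wordcloud packages, I managed to build a word cloud of the most common words used across all members of the chat over the past year, as follows: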
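library(tm)
library(wordcloud)  # provides wordcloud()
library(viridis)    # provides viridis()
# Read the cleaned export, keeping text as character rather than factor
Chat <- read.csv("ChatExport_Cleansed.csv", stringsAsFactors = FALSE)
# Collapse every message into one string and wrap it in a corpus
chat_text <- paste(Chat$Message, collapse = " ")
chat_source <- VectorSource(chat_text)
corpus <- Corpus(chat_source)
# Normalise: lower-case, drop punctuation, extra whitespace and English stop words
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Count how often each term occurs, most frequent first
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing = TRUE)
words <- names(frequency)
# Plot up to 300 words; the words and frequency vectors must have the same length
color_pal <- viridis(n = 20)
wordcloud(head(words, 300), head(frequency, 300),
          random.order = TRUE, random.color = TRUE, colors = color_pal)
Next, I would like to explore the most frequently used words for each sender, and which sender sent the most messages during specific hours (9 to 5).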
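The output would look something like:
Sender 1: most frequent word
Sender 2: most frequent word
etc.
I would also like to see, for each sender, the number of messages sent between 9 AM and 5 PM.
I am not sure how to achieve this. Is it possible to use collapse to combine each sender's messages into one large vector?
Thanks in advance for any suggestions!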
Posted on 2020-01-15 22:15:26
Starting with the following data:
head(df)
# A tibble: 6 x 4
Date Time Sender Message
<date> <chr> <chr> <fct>
1 2020-01-01 00:00:00 Person1 C
2 2020-01-01 01:00:00 Person1 C
3 2020-01-01 02:00:00 Person1 B
4 2020-01-01 03:00:00 Person1 B
5 2020-01-01 04:00:00 Person1 C
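6 2020-01-01 05:00:00 Person1 E
You can first filter for specific hours by creating a Date_Time column with the ymd_hms function from the lubridate package, and then use the filter function from dplyr to keep only the messages sent between 9 AM and 5 PM. Note that the condition hour(Date_Time) <= 17 also keeps messages sent from 17:00:00 to 17:59:59.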
library(lubridate)
library(dplyr)
df %>% mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17)
# A tibble: 18 x 5
Date Time Sender Message Date_Time
<date> <chr> <chr> <fct> <dttm>
1 2020-01-01 09:00:00 Person1 C 2020-01-01 09:00:00
2 2020-01-01 10:00:00 Person1 E 2020-01-01 10:00:00
3 2020-01-01 11:00:00 Person1 C 2020-01-01 11:00:00
4 2020-01-01 12:00:00 Person1 C 2020-01-01 12:00:00
5 2020-01-01 13:00:00 Person1 A 2020-01-01 13:00:00
6 2020-01-01 14:00:00 Person1 D 2020-01-01 14:00:00
7 2020-01-01 15:00:00 Person1 A 2020-01-01 15:00:00
8 2020-01-02 16:00:00 Person1 A 2020-01-02 16:00:00
9 2020-01-02 17:00:00 Person1 E 2020-01-02 17:00:00
10 2020-01-01 09:00:00 Person2 D 2020-01-01 09:00:00
11 2020-01-01 10:00:00 Person2 E 2020-01-01 10:00:00
12 2020-01-01 11:00:00 Person2 E 2020-01-01 11:00:00
13 2020-01-01 12:00:00 Person2 C 2020-01-01 12:00:00
14 2020-01-01 13:00:00 Person2 A 2020-01-01 13:00:00
15 2020-01-01 14:00:00 Person2 B 2020-01-01 14:00:00
16 2020-01-01 15:00:00 Person2 E 2020-01-01 15:00:00
17 2020-01-02 16:00:00 Person2 E 2020-01-02 16:00:00
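18 2020-01-02 17:00:00 Person2 D 2020-01-02 17:00:00
Then you can group_by Sender and Message to count the frequency of each message, and filter for the maximum frequency for each sender. Ties are kept, so a sender can appear more than once.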
df %>% mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17) %>%
group_by(Sender, Message) %>% count() %>%
group_by(Sender) %>%
filter(n == max(n))
# A tibble: 3 x 3
# Groups: Sender [2]
Sender Message n
<chr> <fct> <int>
1 Person1 A 3
2 Person1 C 3
3 Person2 E 4
If you want to know the number of messages each sender sent during that time period, you can do:
df %>% mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
filter(hour(Date_Time) >= 9 & hour(Date_Time) <= 17) %>%
group_by(Sender) %>% count()
# A tibble: 2 x 2
# Groups: Sender [2]
Sender n
<chr> <int>
1 Person1 9
2 Person2 9
Does this answer your question?
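The counts above treat each whole Message as one unit, which works for the single-letter example data. For real chat text, where each message holds several words, a minimal base R sketch of the "collapse per sender" idea from the question could look like the following. The toy data frame `chat` and the helper `count_words` are hypothetical names introduced only for illustration, and stop words are not removed here.

```r
# Hypothetical toy data; the column names mirror the Sender/Message layout above.
chat <- data.frame(
  Sender  = c("Person1", "Person1", "Person2", "Person2", "Person2"),
  Message = c("Hello there", "hello again, hello!", "Lunch today?",
              "lunch sounds good", "Late lunch then"),
  stringsAsFactors = FALSE
)

# Split the messages into a named list with one character vector per sender,
# then collapse each sender's messages into a single string.
messages_by_sender <- split(chat$Message, chat$Sender)
text_by_sender <- vapply(messages_by_sender, paste, character(1), collapse = " ")

# Hypothetical helper: lower-case a string, replace punctuation with spaces,
# split on whitespace, and return word counts sorted from most to least frequent.
count_words <- function(txt) {
  cleaned <- gsub("[[:punct:]]+", " ", tolower(txt))
  words <- strsplit(trimws(cleaned), "[[:space:]]+")[[1]]
  sort(table(words), decreasing = TRUE)
}

# One sorted frequency table per sender.
word_freq_by_sender <- lapply(text_by_sender, count_words)

# The single most frequent word for each sender.
top_word <- vapply(word_freq_by_sender, function(f) names(f)[1], character(1))
print(top_word)
```

Each element of `word_freq_by_sender` is a full frequency table, so it could also be passed to wordcloud() to draw one cloud per sender.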
Data
df <- structure(list(Date = structure(c(18262, 18262, 18262, 18262,
18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262,
18262, 18262, 18262, 18263, 18263, 18263, 18263, 18263, 18263,
18263, 18263, 18263, 18262, 18262, 18262, 18262, 18262, 18262,
18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262, 18262,
18262, 18263, 18263, 18263, 18263, 18263, 18263, 18263, 18263,
18263), class = "Date"), Time = c("00:00:00", "01:00:00", "02:00:00",
"03:00:00", "04:00:00", "05:00:00", "06:00:00", "07:00:00", "08:00:00",
"09:00:00", "10:00:00", "11:00:00", "12:00:00", "13:00:00", "14:00:00",
"15:00:00", "16:00:00", "17:00:00", "18:00:00", "19:00:00", "20:00:00",
"21:00:00", "22:00:00", "23:00:00", "00:00:00", "00:00:00", "01:00:00",
"02:00:00", "03:00:00", "04:00:00", "05:00:00", "06:00:00", "07:00:00",
"08:00:00", "09:00:00", "10:00:00", "11:00:00", "12:00:00", "13:00:00",
"14:00:00", "15:00:00", "16:00:00", "17:00:00", "18:00:00", "19:00:00",
"20:00:00", "21:00:00", "22:00:00", "23:00:00", "00:00:00"),
Sender = c("Person1", "Person1", "Person1", "Person1", "Person1",
"Person1", "Person1", "Person1", "Person1", "Person1", "Person1",
"Person1", "Person1", "Person1", "Person1", "Person1", "Person1",
"Person1", "Person1", "Person1", "Person1", "Person1", "Person1",
"Person1", "Person1", "Person2", "Person2", "Person2", "Person2",
"Person2", "Person2", "Person2", "Person2", "Person2", "Person2",
"Person2", "Person2", "Person2", "Person2", "Person2", "Person2",
"Person2", "Person2", "Person2", "Person2", "Person2", "Person2",
"Person2", "Person2", "Person2"), Message = structure(c(3L,
3L, 2L, 2L, 3L, 5L, 4L, 1L, 2L, 3L, 5L, 3L, 3L, 1L, 4L, 1L,
1L, 5L, 3L, 2L, 2L, 1L, 3L, 4L, 1L, 3L, 5L, 4L, 2L, 5L, 1L,
1L, 2L, 3L, 4L, 5L, 5L, 3L, 1L, 2L, 5L, 5L, 4L, 5L, 2L, 1L,
1L, 3L, 1L, 5L), .Label = c("A", "B", "C", "D", "E"), class = "factor")), row.names = c(NA,
-50L), class = c("tbl_df", "tbl", "data.frame"))
https://stackoverflow.com/questions/59758639
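Regarding the 9-to-5 filter used in the answer: if messages sent after 17:00:00 should be excluded, one possible variation is to make the upper bound strict. The sketch below assumes the same Date, Time and Sender columns as the data above; the toy data frame `chat` and the result names are made up for illustration.

```r
library(lubridate)
library(dplyr)

# Hypothetical toy data with timestamps on both sides of the window boundaries.
chat <- data.frame(
  Date   = as.Date("2020-01-01"),
  Time   = c("08:59:59", "09:00:00", "16:59:59", "17:00:00", "17:30:00"),
  Sender = c("Person1", "Person1", "Person2", "Person2", "Person2"),
  stringsAsFactors = FALSE
)

in_hours <- chat %>%
  mutate(Date_Time = ymd_hms(paste(Date, Time))) %>%
  # hour() returns 0-23, so 9 <= hour < 17 covers 09:00:00 through 16:59:59.
  filter(hour(Date_Time) >= 9, hour(Date_Time) < 17)

# Number of messages per sender inside the window.
per_sender <- in_hours %>% count(Sender)
print(per_sender)
```

With this condition a message stamped exactly 17:00:00 is dropped; whether that is the desired behaviour depends on how "9 to 5" is meant to be read.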