I want to find the top 10 hashtags for each of several thousand communities in my dataset. Each user_name in the dataset belongs to one specific community (for example, "a", "b", "c", and "d" belong to community 0). My example dataset contains only 10 communities and looks like this:
df <- data.frame(N = c(1,2,3,4,5,6,7,8,9,10),
                 user_name = c("a","b","c","d","e","f","g","h","i","j"),
                 community_id = c(0,0,0,0,1,1,2,2,2,3),
                 hashtags = c("#illness, #ebola", "#coronavirus, #covid",
                              "#vaccine, #lie", "#flue, #ebola, #usa", "#vaccine",
                              "#flue", "#coronavirus", "#ebola", "#ebola, #vaccine",
                              "#china, #virus"))

To find the top 10 hashtags for one community (community 0 in the example below), I run the following code:
library(dplyr)
library(tidytext)

# select community 0
df_comm_0 <- df %>%
  filter(community_id == 0)

# remove NAs
df_comm_0 <- na.omit(df_comm_0)

# find top 10 hashtags
df_hashtags_0 <- df_comm_0 %>%
  unnest_tokens(hashtag, hashtags, token = "tweets") %>%
  count(hashtag, sort = TRUE) %>%
  top_n(10)

I know a loop would save me from repeating this code roughly 15,000 times (the number of communities in the dataset). I am not familiar with loops and, even after hours of searching, could not write one. The code below is what I wrote, and it gives me the hashtag counts for the whole dataset instead!
x <- (df$community_id)
for (val in x) {
  print(
    df %>%
      unnest_tokens(hashtag, hashtags, token = "tweets") %>%
      count(hashtag, sort = TRUE) %>%
      top_n(10)
  )
}
Is there a way to loop over all communities and output each community's top 10 hashtags to one file (or to separate files)?
Thank you very much for your help.
Posted on 2020-12-21 01:04:25
Per community, you can strsplit the hashtags at the commas and unlist them: the names of the first ten elements of a sorted table give the desired top ten hashtags, and paste returns them to the original comma-separated format.
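The inner steps can be illustrated on a single character vector (a minimal sketch; x stands for the hashtag strings of one community, and the data here is made up for illustration):

```r
# One community's hashtag strings, comma-separated as in the question
x <- c("#illness, #ebola", "#flue, #ebola, #usa", "#vaccine, #lie")

tags   <- unlist(strsplit(x, ", "))             # split each string and flatten
counts <- sort(table(tags), decreasing = TRUE)  # frequency table, most frequent first
top    <- names(counts)[1:2]                    # top 2 here; use [1:10] for the top ten
paste(top, collapse = ", ")                     # back to the original comma-separated format
```

Wrapping exactly this function body in aggregate(), as below, applies it once per community.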
aggregate(hash ~ community, df1, function(x)
  paste(names(sort(table(unlist(strsplit(x, ", "))), decreasing=TRUE)[1:5]),
        collapse=", "))
# community hash
# 1 1 #covid, #fatalities, #china, #ebola, #illness
# 2 2 #ebola, #lie, #usa, #covid, #fatalities
# 3 3 #vaccine, #ebola, #farright, #usa, #virus
# 4 4 #china, #vaccine, #flue, #virus, #conspiracy
# 5 5 #illness, #lie, #conspiracy, #ebola, #fatalities
# 6 6 #farright, #fatalities, #china, #ebola, #illness
# 7 7 #virus, #illness, #covid, #conspiracy, #farright
# 8 8 #lie, #china, #flue, #coronavirus, #covid
# 9 9 #conspiracy, #ebola, #fatalities, #farright, #lie
# 10        10 #china, #fatalities, #vaccine, #conspiracy, #coronavirus

For clarity, the top five hashtags are shown; for the top ten, use [1:10] instead of [1:5] in the function.
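Applied to the question's own df (rebuilt here so the snippet is self-contained), the same pattern works unchanged; head(..., 10) is used instead of [1:10] so that communities with fewer than ten distinct hashtags do not produce NA entries:

```r
# The question's example data (column N omitted; it is not needed here)
df <- data.frame(
  user_name    = c("a","b","c","d","e","f","g","h","i","j"),
  community_id = c(0,0,0,0,1,1,2,2,2,3),
  hashtags     = c("#illness, #ebola", "#coronavirus, #covid", "#vaccine, #lie",
                   "#flue, #ebola, #usa", "#vaccine", "#flue", "#coronavirus",
                   "#ebola", "#ebola, #vaccine", "#china, #virus")
)

# top hashtags per community, most frequent first, at most 10 per row
top_by_comm <- aggregate(hashtags ~ community_id, df, function(x)
  paste(head(names(sort(table(unlist(strsplit(x, ", "))), decreasing = TRUE)), 10),
        collapse = ", "))
top_by_comm
```

In community 0, #ebola appears twice (rows for users "a" and "d"), so it comes first in that row.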
Data:
n <- 100
df1 <- data.frame(user=1:n, community=rep(1:(n/10), each=10))
set.seed(42)
df1$hash <-
replicate(n, paste(sample(c("#illness", "#ebola", "#coronavirus", "#covid",
"#vaccine", "#lie", "#flue", "#usa", "#china",
"#fatalities", "#conspiracy", "#farright",
"#virus"), 3), collapse=", "))发布于 2020-12-21 02:52:00
With the tidyverse, you can do the following:
df %>%
  group_by(community_id) %>%
  tidytext::unnest_tokens(hashtags, hashtags, token = "tweets") %>%
  count(hashtags) %>%
  slice_max(n, n = 5) %>%
  summarise(hashtags = toString(hashtags), .groups = 'drop')

Posted on 2020-12-22 12:27:17
Split-apply-combine:
tt_by_cid <- Map(function(x) {
  head(names(sort(table(unlist(strsplit(x, ", "))), decreasing = TRUE)), 10)
}, with(df, split(sapply(hashtags, as.character), community_id)))

data.frame(do.call(rbind, mapply(cbind, "community_id" = names(tt_by_cid),
                                 hashtags = tt_by_cid, SIMPLIFY = TRUE)),
           stringsAsFactors = FALSE, row.names = NULL)

https://stackoverflow.com/questions/65381806
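For the file-output part of the question, a hedged sketch in base R: assuming the long data.frame produced by any of the answers is stored as top_tags (the name and the sample rows below are illustrative, not from the original answers), split() by community and write one CSV per community:

```r
# Illustrative long-format result: one row per (community, hashtag)
top_tags <- data.frame(
  community_id = c("0", "0", "1"),
  hashtags     = c("#ebola", "#illness", "#vaccine")
)

# One file per community, e.g. "top_hashtags_community_0.csv" (file names are made up)
for (chunk in split(top_tags, top_tags$community_id)) {
  out <- paste0("top_hashtags_community_", chunk$community_id[1], ".csv")
  write.csv(chunk, out, row.names = FALSE)
}
```

Dropping the split() and writing top_tags directly gives a single file for all communities instead.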