I want to find the top 10 hashtags for each of several thousand communities in my dataset. Each user_name in the dataset belongs to one specific community (for example, "a", "b", "c", and "d" belong to community 0). My example dataset contains only 10 communities and looks like this:
df <- data.frame(N = c(1,2,3,4,5,6,7,8,9,10),
                 user_name = c("a","b","c","d","e","f","g","h","i","j"),
                 community_id = c(0,0,0,0,1,1,2,2,2,3),
                 hashtags = c("#illness, #ebola", "#coronavirus, #covid",
                              "#vaccine, #lie", "#flue, #ebola, #usa", "#vaccine",
                              "#flue", "#coronavirus", "#ebola", "#ebola, #vaccine",
                              "#china, #virus"))

To find the top 10 hashtags for one community (community 0 in the example below), I run the following code:
library(dplyr)
library(tidytext)

# select community 0
df_comm_0 <- df %>%
  filter(community_id == 0)

# remove NAs
df_comm_0 <- na.omit(df_comm_0)

# find top 10 hashtags
df_hashtags_0 <- df_comm_0 %>%
  unnest_tokens(hashtag, hashtags, token = "tweets") %>%
  count(hashtag, sort = TRUE) %>%
  top_n(10)

I know a loop would save me from repeating this code roughly 15,000 times (the number of communities in the dataset). I am not familiar with loops and, even after hours of searching, could not write one. The code below is what I wrote, and it gives me the hashtag counts for the whole dataset instead!
x <- (df$community_id)
for (val in x) {
  print(
    df %>%
      unnest_tokens(hashtag, hashtags, token = "tweets") %>%
      count(hashtag, sort = TRUE) %>%
      top_n(10)
  )
}
Is there a way to loop over all communities and output each community's top 10 hashtags to one file (or to separate files)?
Thank you very much for your help.
Posted on 2020-12-21 01:04:25
Per community, you can strsplit the hashtags at the commas and unlist them: the names of the first ten elements of a sorted table give the desired top ten hashtags, and paste returns them to the original comma-separated format.
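The inner steps can be illustrated on a single character vector (a minimal sketch; x stands for the hashtag strings of one community, and the data here is made up for illustration):

```r
# One community's hashtag strings, comma-separated as in the question
x <- c("#illness, #ebola", "#flue, #ebola, #usa", "#vaccine, #lie")

tags   <- unlist(strsplit(x, ", "))             # split each string and flatten
counts <- sort(table(tags), decreasing = TRUE)  # frequency table, most frequent first
top    <- names(counts)[1:2]                    # top 2 here; use [1:10] for the top ten
paste(top, collapse = ", ")                     # back to the original comma-separated format
```

Wrapping exactly this function body in aggregate(), as below, applies it once per community.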
aggregate(hash ~ community, df1, function(x)
  paste(names(sort(table(unlist(strsplit(x, ", "))), decreasing=TRUE)[1:5]),
        collapse=", "))
# community hash
# 1 1 #covid, #fatalities, #china, #ebola, #illness
# 2 2 #ebola, #lie, #usa, #covid, #fatalities
# 3 3 #vaccine, #ebola, #farright, #usa, #virus
# 4 4 #china, #vaccine, #flue, #virus, #conspiracy
# 5 5 #illness, #lie, #conspiracy, #ebola, #fatalities
# 6 6 #farright, #fatalities, #china, #ebola, #illness
# 7 7 #virus, #illness, #covid, #conspiracy, #farright
# 8 8 #lie, #china, #flue, #coronavirus, #covid
# 9 9 #conspiracy, #ebola, #fatalities, #farright, #lie
# 10        10 #china, #fatalities, #vaccine, #conspiracy, #coronavirus

For clarity, the top five hashtags are shown; for the top ten, use [1:10] instead of [1:5] in the function.
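Applied to the question's own df (rebuilt here so the snippet is self-contained), the same pattern works unchanged; head(..., 10) is used instead of [1:10] so that communities with fewer than ten distinct hashtags do not produce NA entries:

```r
# The question's example data (column N omitted; it is not needed here)
df <- data.frame(
  user_name    = c("a","b","c","d","e","f","g","h","i","j"),
  community_id = c(0,0,0,0,1,1,2,2,2,3),
  hashtags     = c("#illness, #ebola", "#coronavirus, #covid", "#vaccine, #lie",
                   "#flue, #ebola, #usa", "#vaccine", "#flue", "#coronavirus",
                   "#ebola", "#ebola, #vaccine", "#china, #virus")
)

# top hashtags per community, most frequent first, at most 10 per row
top_by_comm <- aggregate(hashtags ~ community_id, df, function(x)
  paste(head(names(sort(table(unlist(strsplit(x, ", "))), decreasing = TRUE)), 10),
        collapse = ", "))
top_by_comm
```

In community 0, #ebola appears twice (rows for users "a" and "d"), so it comes first in that row.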
Data:
n <- 100
df1 <- data.frame(user=1:n, community=rep(1:(n/10), each=10))
set.seed(42)
df1$hash <-
replicate(n, paste(sample(c("#illness", "#ebola", "#coronavirus", "#covid",
"#vaccine", "#lie", "#flue", "#usa", "#china",
"#fatalities", "#conspiracy", "#farright",
"#virus"), 3), collapse=", "))发布于 2020-12-21 02:52:00
With the tidyverse, you can do the following:
df %>%
  group_by(community_id) %>%
  tidytext::unnest_tokens(hashtags, hashtags, token = "tweets") %>%
  count(hashtags) %>%
  slice_max(n, n = 5) %>%
  summarise(hashtags = toString(hashtags), .groups = 'drop')

Posted on 2020-12-22 12:27:17
Split-apply-combine:
tt_by_cid <- Map(function(x) {
  head(names(sort(table(unlist(strsplit(x, ", "))), decreasing = TRUE)), 10)
}, with(df, split(sapply(hashtags, as.character), community_id)))

data.frame(do.call(rbind, mapply(cbind, "community_id" = names(tt_by_cid),
                                 hashtags = tt_by_cid, SIMPLIFY = TRUE)),
           stringsAsFactors = FALSE, row.names = NULL)

https://stackoverflow.com/questions/65381806
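For the file-output part of the question, a hedged sketch in base R: assuming the long data.frame produced by any of the answers is stored as top_tags (the name and the sample rows below are illustrative, not from the original answers), split() by community and write one CSV per community:

```r
# Illustrative long-format result: one row per (community, hashtag)
top_tags <- data.frame(
  community_id = c("0", "0", "1"),
  hashtags     = c("#ebola", "#illness", "#vaccine")
)

# One file per community, e.g. "top_hashtags_community_0.csv" (file names are made up)
for (chunk in split(top_tags, top_tags$community_id)) {
  out <- paste0("top_hashtags_community_", chunk$community_id[1], ".csv")
  write.csv(chunk, out, row.names = FALSE)
}
```

Dropping the split() and writing top_tags directly gives a single file for all communities instead.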