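I have user-level data in R that shows the different groups each user has interacted with. What I am trying to work out is the overlap between those groups. Here is a sample of the data: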
Group UserID
A User1
B User1
D User1
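A User2
As you can see, User1 has interacted with three groups, while User2 has interacted only with group A. What I want to know is each group's "market share" of users. For example, group A might have 100,000 users who interacted only with A, 10,000 who interacted with A and B, 5,000 who interacted with A, B and C, and so on. Obviously there are a lot of combinations.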
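Is there a way to compute this with tidyr/dplyr? There are roughly 1 million users and 600 groups. Each user may have interacted with some of the groups, but not all of them.
Thanks!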
Posted on 2020-02-07 06:10:35
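If you only want to know, for each group, how many of its users interacted with a given number of groups in total, this works: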
df <- data.frame(Group = c("A", "B", "D", "A", "C", "D", "D", "B", "C"),
                 UserID = c("User1", "User1", "User1", "User2", "User2", "User3", "User4", "User5", "User5"))
library(tidyverse)
df %>%
  group_by(Group, UserID) %>%   # these two steps together drop duplicate
  summarise() %>%               # Group-UserID rows
  group_by(UserID) %>%
  mutate(NGroups = n()) %>%     # number of groups this user interacted with
  ungroup() %>%
  group_by(Group, NGroups) %>%
  summarise(N = n()) %>%        # number of users per Group-NGroups combination
  ungroup() %>%
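  pivot_wider(names_from = NGroups, values_from = N)   # one column per NGroups value
If you want the counts for every individual combination of groups, this should get you started ;):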
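df %>%
  group_by(Group, UserID) %>%   # these two steps together drop duplicate
  summarise() %>%               # Group-UserID rows
  group_by(UserID) %>%
  mutate(GroupsString = paste0(Group, collapse="")) %>%   # all of this user's groups as one string
  ungroup() %>%
  group_by(Group, GroupsString) %>%
  summarise(N = n()) %>%        # number of users per Group-GroupsString combination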
  ungroup()
https://stackoverflow.com/questions/60103687
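For what it's worth, here is a slightly more explicit sketch of the per-combination count from the second snippet. It is not the answer's original code: the names `user_combos`, `combo_counts`, `group_share`, `Combo` and `Users` are made up for illustration, `"&"` is an arbitrary separator (assuming real group names may be longer than one character, so that pasting them together without a separator could be ambiguous), and `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Sample data from the answer above
df <- data.frame(
  Group  = c("A", "B", "D", "A", "C", "D", "D", "B", "C"),
  UserID = c("User1", "User1", "User1", "User2", "User2",
             "User3", "User4", "User5", "User5")
)

# One row per user, holding the exact set of groups that user touched.
# Sorting first guarantees the same set always yields the same string.
user_combos <- df %>%
  distinct(UserID, Group) %>%        # drop duplicate user-group rows
  arrange(UserID, Group) %>%
  group_by(UserID) %>%
  summarise(Combo = paste(Group, collapse = "&"), .groups = "drop")

# Number of users per exact combination of groups
combo_counts <- count(user_combos, Combo, name = "Users")

# The same counts broken down per group: for each group, how its users
# are spread over the combinations that include it
group_share <- df %>%
  distinct(UserID, Group) %>%
  inner_join(user_combos, by = "UserID") %>%
  count(Group, Combo, name = "Users")

combo_counts
group_share
```

On the sample data, `combo_counts` has four rows: A&B&D, A&C and B&C with one user each, and D with two. In `group_share`, a row where `Combo` equals `Group` counts the users who interacted with that group only. I have not tried this on 1 million users and 600 groups, so treat it as a starting point rather than a tuned solution.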