I have the following dataset, which shows the ingredients contained in each product:
data <- data.frame("PRODUCT" = c("Creme","Creme","Creme","Creme","Medoc","Medoc","Medoc","Medoc","Medoc","Hububu","Hububu","Hububu","Hububu","Troll","Troll","Troll","Troll","Suzuki","Suzuki","Gluglu","Gluglu","Gluglu"),
"INGREDIENT" = c("zeze","zaza","zozo","zuzu","zaza","sasa","haha","zuzu","zemzem","zaza","zuzu","zizi","haha","zozo","zaza","zemzem","zuzu","sasa","zuzu","ozam","zaza","hayda"))

I want to know the most common ingredient combinations across products: which ingredient is associated with which other ingredient? I applied code that I found in another thread:
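library(dplyr)

combinaisons_par_PRODUCT = data %>%
  # self-join: every pair of ingredients that share a product
  full_join(data, by="PRODUCT") %>%
  group_by(INGREDIENT.x, INGREDIENT.y) %>%
  # number of distinct products containing both ingredients
  summarise(n = length(unique(PRODUCT))) %>%
  # drop pairs of an ingredient with itself
  filter(INGREDIENT.x!=INGREDIENT.y) %>%
  mutate(item = paste(INGREDIENT.x, INGREDIENT.y, sep=", "))

It works, but there is one last flaw: I would like the order to be ignored. For example, this code gives me an association of haha with sasa and also an association of sasa with haha. To me these are the same thing. So I would like the code to ignore the order of the ingredients and give me one unique association for haha & sasa.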
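I tried sorting the ingredients before applying the code, but that did not work either. Can anyone help? How can I get these combinations regardless of order?
Thanks a lot!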
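Posted on 2021-06-20 00:30:22
Is this what you want? I keep only the pairs whose two ingredients are in alphabetical order, which avoids counting each combination twice.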
data %>%
  full_join(data, by="PRODUCT") %>%
  # keep each unordered pair once (and drop self-pairs)
  filter(INGREDIENT.x < INGREDIENT.y) %>%
  count(combo = paste(INGREDIENT.x, INGREDIENT.y, sep = ", "))

Posted on 2021-06-20 03:55:04
An igraph option using graph_from_adjacency_matrix
library(igraph)
get.data.frame(
  graph_from_adjacency_matrix(
    crossprod(table(data)),
    mode = "undirected",
    weighted = TRUE
  )
)

gives
from to weight
1 haha haha 2
2 haha sasa 1
3 haha zaza 2
4 haha zemzem 1
5 haha zizi 1
6 haha zuzu 2
7 hayda hayda 1
8 hayda ozam 1
9 hayda zaza 1
10 ozam ozam 1
11 ozam zaza 1
12 sasa sasa 2
13 sasa zaza 1
14 sasa zemzem 1
15 sasa zuzu 2
16 zaza zaza 5
17 zaza zemzem 2
18 zaza zeze 1
19 zaza zizi 1
20 zaza zozo 2
21 zaza zuzu 4
22 zemzem zemzem 2
23 zemzem zozo 1
24 zemzem zuzu 2
25 zeze zeze 1
26 zeze zozo 1
27 zeze zuzu 1
28 zizi zizi 1
29 zizi zuzu 1
30 zozo zozo 2
31 zozo zuzu 2
32 zuzu zuzu 5

Posted on 2021-06-20 01:24:45
We can use base R
m1 <- crossprod(table(data))
subset(as.data.frame.table(m1 * lower.tri(m1, diag = TRUE)), Freq != 0)

Edit: based on a comment from @ThomasIsCoding
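As a rough illustration of why crossprod(table(data)) yields co-occurrence counts: table(data) is a product-by-ingredient incidence matrix, so its cross-product counts, for every ingredient pair, the products that contain both. The sketch below uses a made-up three-product dataset (toy, inc, co and pairs are illustrative names, not from the question) and assumes each ingredient is listed at most once per product. Unlike the answer above, it uses a strict lower triangle, so pairs of an ingredient with itself are dropped too.

```r
# Toy data (hypothetical, for illustration only)
toy <- data.frame(
  PRODUCT    = c("A", "A", "B", "B", "C"),
  INGREDIENT = c("x", "y", "x", "y", "x")
)

# Incidence matrix: rows are products, columns are ingredients
# (entries are 0/1 as long as no ingredient repeats within a product)
inc <- table(toy)

# t(inc) %*% inc: entry [i, j] is the number of products containing both i and j;
# the diagonal holds the number of products containing each ingredient
co <- crossprod(inc)

# Keep each unordered pair once and drop self-pairs (strict lower triangle)
pairs <- subset(as.data.frame.table(co * lower.tri(co)), Freq != 0)
print(pairs)
```

Switching to lower.tri(co, diag = TRUE) should bring the self-pairs back, matching the behavior of the answer's code.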
https://stackoverflow.com/questions/68048654