我一直试图在R中做一次关于销售的数据分析练习。数据文件就像:
Order_ID Product
<dbl> <chr>
1 319631 34in Ultrawide Monitor
2 319631 Lightning Charging Cable
3 319596 iPhone
4 319596 Lightning Charging Cable
5 319584 iPhone
6 319584 Wired Headphones
7 319556 Google Phone
8 319556 Wired Headphones我必须找出哪些产品是最常一起购买的,Order_ID有副本,即它们是由同一个人购买的。
我用Python做了这个节选,但我不能在R上做,我的Python代码是:
pares_compras[['Order ID', 'Product']]
> Order ID Product
2 176560 Google Phone
3 176560 Wired Headphones
17 176574 Google Phone
18 176574 USB-C Charging Cable
29 176585 Bose SoundSport Headphones
pares_compras.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
>
2 Google Phone,Wired Headphones
3 Google Phone,Wired Headphones
17 Google Phone,USB-C Charging Cable
18 Google Phone,USB-C Charging Cable
pares_compras = pares_compras[['Order ID', 'Grouped Products']].drop_duplicates()
pares_compras
> Order ID Grouped Products
2 176560 Google Phone,Wired Headphones
17 176574 Google Phone,USB-C Charging Cable
29 176585 Bose SoundSport Headphones,Bose SoundSport Hea...
31 176586 AAA Batteries (4-pack),Google Phone
118 176672 Lightning Charging Cable,USB-C Charging Cable
count = Counter()
for row in pares_compras['Grouped Products']:
row_list = row.split(',')
count.update(Counter(combinations(row_list, 2)))
count
> Counter({('Google Phone', 'Wired Headphones'): 414,
('Google Phone', 'USB-C Charging Cable'): 987,
('Bose SoundSport Headphones', 'Bose SoundSport Headphones'): 27, ... )}
for key, num in count.most_common(5):
print(key, num)
>
('iPhone', 'Lightning Charging Cable') 1005
('Google Phone', 'USB-C Charging Cable') 987
('iPhone', 'Wired Headphones') 447
('Google Phone', 'Wired Headphones') 414
('Vareebadd Phone', 'USB-C Charging Cable') 361所以,我可以解决这个问题,但是,就像我之前说过的,我不能在R中做同样的事情,我找不到方法,我刚在R里开始,如果有人能帮我,我会很感激的,萨克斯。
发布于 2022-08-02 01:13:44
嗯,我认为共现矩阵实际上是一个很好的解决方案。
另一种方法是考虑产品配置文件有多不同或有多相似。
orders <- read.csv(header = TRUE, text ='
"row", "order", "product"
1, 319631, "34in Ultrawide Monitor"
2, 319631, "Lightning Charging Cable"
3, 319596, "iPhone"
4, 319596, "Lightning Charging Cable"
5, 319584, "iPhone"
6, 319584, "Wired Headphones"
7, 319556, "Google Phone"
8, 319556, "Wired Headphones"') |>
dplyr::mutate(product = trimws(product))
df <- tidyr::pivot_wider(orders,
values_from = product,
names_from = product,
id_cols = order) |>
dplyr::mutate(across( `34in Ultrawide Monitor`:`Google Phone` ,
~!is.na(.x))) |>
select(-order)
cor(df)
dist(t(df))
dist(t(df), method = "binary")发布于 2022-08-02 02:03:25
我就把这个留给你,作为你的另一个选择。
在这里,我在两个data.frames中列出了唯一的组合,并使用嵌套的apply函数检查是否相同,并在cbinding之后用rowsums计算结果。
a <- expand.grid(a = df$Product,b = df$Product) |>
rowwise() |>
mutate(c = list(sort(c(a, b))), a = c[[1]], b = c[[2]]) |>
distinct() |>
filter(a != b)
b <- df |>
group_by(Order_ID) |>
summarise(Product = list(c(Product)))
a$count <- rowSums(do.call(cbind,
lapply(b$Product, \(one) sapply(a$c, \(two) +(all(two %in% one)))))) a b count
<chr> <chr> <dbl>
1 34inUltrawideMonitor LightningChargingCable 1
2 34inUltrawideMonitor iPhone 0
3 34inUltrawideMonitor WiredHeadphones 0
4 34inUltrawideMonitor GooglePhone 0
5 iPhone LightningChargingCable 1
6 LightningChargingCable WiredHeadphones 0
7 GooglePhone LightningChargingCable 0
8 iPhone WiredHeadphones 1
9 GooglePhone iPhone 0
10 GooglePhone WiredHeadphones 1发布于 2022-08-02 14:55:12
使用data.table连接而不是共生矩阵的解决方案。使用更大的数据集(~3M行),它在我的机器上的速度几乎是使用来自this answer的this answer的两倍。
library(data.table)
library(Matrix) # for comparison with a co-occurrence matrix solution
# Example dataset
n <- 1e6L
orderID <- rep.int(sample.int(n), rpois(n, 1) + 2L)
dt <- unique(data.table(orderID, product = stringi::stri_rand_strings(length(orderID), 2, pattern = "[a-z]")))
# solution using a data.table join
f1 <- function(dt) {
dt2 <- dt[
, x := .I
][
dt,
on = .(orderID = orderID, x > x),
nomatch = 0
][
product > i.product, c("product", "i.product") := list(i.product, product)
][
, .(count = .N), .(product, i.product)
]
dt[, x := NULL]
setnames(dt2, c("product1", "product2", "count"))
setorder(dt2, -count, product1, product2)
}
# co-occurrence matrix solution (slightly modified so the output of the two
# functions is the same)
f2 <- function(dt) {
dt$product <- as.factor(dt$product)
dt4 <- setDT(
summary(
crossprod(
xtabs(~ orderID + product, dt, sparse = TRUE)
)
)
)[
i < j
][
, `:=`(
i = as.character(levels(dt$product)[i]),
j = as.character(levels(dt$product)[j]),
x = as.integer(x)
)
]
dt[, product := as.character(product)]
attr(dt4, "header") <- NULL
setnames(dt4, c("product1", "product2", "count"))
setorder(dt4, -count, product1, product2)
}
Benchmarking:
#> Unit: seconds
#> expr min lq mean median uq max neval
#> f1 1.026762 1.128890 1.224291 1.261732 1.278617 1.362014 10
#> f2 1.984068 2.295159 2.355434 2.403337 2.465651 2.589428 10https://stackoverflow.com/questions/73200582
复制相似问题