I have a dataset named my_data with more than 100,000 rows. A sample looks like this:
id source
8166923397733625478 happimobiles
8166923397733625478 Springfit
7301100145962413274 Duroflex
6703062895304712434 happimobiles
6897156268457025524 themrphone
37564799155342281 Sangeetha Mobiles
1159098248970201145 Sangeetha Mobiles
I used the code below on the table (my_data).
library("readxl")
library("data.table")  # provides setDT() and dcast()
my_data <- read_excel("C:\\Users\\ashishpatodia\\Desktop\\R\\Code\\Sample_Data_Overlap.xlsx", sheet = "10000 sample")
setDT(my_data)
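(cohorts <- dcast(unique(my_data)[,cohort:=(source),by=id],cohort~ source, fun.aggregate=length, value.var="cohort"))
I want an output in which every id is counted under its source, and in which overlaps between sources are shown. For example, the id ending in 5478 belongs to both happimobiles and Springfit. happimobiles therefore has two ids, 8166923397733625478 and 6703062895304712434, so its own count is 2, and it has 1 id in common with Springfit.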
Expected output
             happimobiles Springfit Duroflex themrphone Sangeetha
happimobiles            2         1        0          0         0
Springfit               1         1        0          0         0
Duroflex                0         0        1          0         0
themrphone              0         0        0          1         0
Sangeetha               0         0        0          0         1
I also tried
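Pivot<-dcast(my_data,source~source,value.var = "id",function(x) length((x)))
This only gives me the correct count of unique records within each partner, not the overlap between partners.
I also tried
crossprod(table(my_data))
but this did not give the correct answer.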
Link to the full data I want to run the code on:
https://docs.google.com/spreadsheets/d/1HUoRlVVf8EBedj1puXdgtTS6GGeFsXYqjVicUwbc5KE/edit#gid=0
Posted on 2019-11-29 02:28:53
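We can use table with crossprod in base R. table(my_data) builds an id-by-source count matrix, and crossprod multiplies its transpose by itself, so the diagonal holds the number of ids per source and each off-diagonal cell holds the number of ids two sources share.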
crossprod(table(my_data))
#                   source
#source              Duroflex happimobiles Sangeetha Mobiles Springfit themrphone
#  Duroflex                 1            0                 0         0          0
#  happimobiles             0            2                 0         1          0
#  Sangeetha Mobiles        0            0                 2         0          0
#  Springfit                0            1                 0         1          0
#  themrphone               0            0                 0         0          1
Data
my_data <- structure(list(id = c(8166923397733625856, 8166923397733625856,
7301100145962413056, 6703062895304712192, 6897156268457025536,
37564799155342280, 1159098248970201088), source = c("happimobiles",
"Springfit", "Duroflex", "happimobiles", "themrphone", "Sangeetha Mobiles",
"Sangeetha Mobiles")), class = "data.frame", row.names = c(NA,
-7L))
https://stackoverflow.com/questions/59094585
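A variant of the same crossprod idea, as a minimal sketch rather than a definitive implementation. It assumes the ids are held as character strings and that a repeated (id, source) row should count only once; neither assumption is stated in the question. The ids in the data above differ from the question's in their last digits (for example 8166923397733625856 instead of 8166923397733625478), which is what happens when 19-digit values are stored as doubles, so two different ids could in principle collapse into one. The names incidence and overlap are illustrative only.

```r
# Assumption: ids are kept as character strings so 19-digit values are not
# rounded (doubles represent integers exactly only up to 2^53).
my_data <- data.frame(
  id = c("8166923397733625478", "8166923397733625478",
         "7301100145962413274", "6703062895304712434",
         "6897156268457025524", "37564799155342281",
         "1159098248970201145"),
  source = c("happimobiles", "Springfit", "Duroflex", "happimobiles",
             "themrphone", "Sangeetha Mobiles", "Sangeetha Mobiles"),
  stringsAsFactors = FALSE
)

# Assumption: duplicate (id, source) rows count once, so every cell of the
# id-by-source table is 0 or 1.
incidence <- table(unique(my_data))

# Source-by-source matrix of id counts (diagonal) and shared ids (off-diagonal).
overlap <- crossprod(incidence)

overlap["happimobiles", "happimobiles"]  # ids under happimobiles
overlap["happimobiles", "Springfit"]     # ids shared with Springfit
```

On the real spreadsheet this would mean importing the id column as text, which read_excel supports through its col_types argument, provided the ids are stored in the sheet with all their digits intact.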