I have a data.table that looks like this:
A <- c(1,3,5,20,21,21)
B <- c(1, 2, 3, 4, 5, 6)
C <- c("I","I","II","II","III","III")
D <- c(0.7, 0.3, 0.5, 0.9, 4, 7)
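library(data.table)
M <- data.table(A,B,C,D)
My question is similar to "R help: divide values by sum produced through factor", but with some additional considerations. A specifies a date (here I just use integers). B is the individual. C is the category the individual belongs to. D is a value variable.
For each category c of C and each date a of A, divide the value D by the sum of the values of all individuals in c, carrying values backward where needed such that 0 < x - a <= N, where x is another individual's date. (That is, we pick the smallest x - a and use that observation as an approximation of the other individual's value on day a.)
Assuming N = 5, this is my expected output: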
A <- c(1,3,5,20,21,21)
B <- c(1, 2, 3, 4, 5, 6)
C <- c("I","I","II","II","III","III")
D <- c(0.7/(0.7+0.3), 0.3/(0.3), 0.5/(0.5), 0.9/(0.9), 4/(4+7), 7/(4+7))
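M <- data.table(A,B,C,D)
Note that for individual 3, the other value in group II is not carried backward, because the gap is greater than 5 (20 - 5 = 15). Is there a good way to do this in data.table?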
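For each value in D, I want to divide by the sum of all values in the same group (I, II, III) on that day. However, you will notice that for some groups no other observation exists on that day. I will try to walk through the logic for a few observations.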
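For contrast, here is a minimal sketch of the simpler same-day version (no carrying backward), using base R's ave() on a plain data.frame copy of the sample data. The name same_day is made up for illustration. It shows where the plain "divide by group sum" approach falls short: individual 1 gets 1.0 instead of the wanted 0.7, because individual 2 has no observation on day 1.

```r
# Sample data from the question, as a plain data.frame
M <- data.frame(
  A = c(1, 3, 5, 20, 21, 21),                 # day
  B = c(1, 2, 3, 4, 5, 6),                    # individual
  C = c("I", "I", "II", "II", "III", "III"),  # group
  D = c(0.7, 0.3, 0.5, 0.9, 4, 7),            # value
  stringsAsFactors = FALSE
)

# Divide each value by the sum of values sharing BOTH group and day.
# Nothing is carried backward here, so only same-day observations count.
same_day <- ave(M$D, M$C, M$A, FUN = function(v) v / sum(v))
```

Only group III has two observations on the same day, so only those two rows are split; every other row ends up as 1.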
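Edit: let me try to walk through a couple of cases.
For individual 1 (column B) on day 1 (column A), the individual belongs to group I (column C). The other individuals in group I are: 2. For each of the other individuals, we see that for individual 2 the nearest observation is on day 3, and 3 - 1 <= 5, so we use 0.3 in the denominator.
For individual 3 (column B) on day 5 (column A), the individual belongs to group II (column C). The other individuals in group II are: 4. For each of the other individuals, we see that for individual 4 the nearest observation is on day 20, and 20 - 5 > 5, so we cannot use that observation in the denominator.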
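The two cases above can be spelled out literally in base R. This is only a sketch under stated assumptions, not a tuned solution: the helper name denominator is hypothetical, and I am assuming that a same-day observation (gap 0) counts, since the expected output for group III uses both day-21 values. For each row it adds, per other individual in the group, the value of that individual's nearest observation with a gap between 0 and N.

```r
N <- 5  # window length in days

# Sample data from the question, as a plain data.frame
M <- data.frame(
  A = c(1, 3, 5, 20, 21, 21),                 # day
  B = c(1, 2, 3, 4, 5, 6),                    # individual
  C = c("I", "I", "II", "II", "III", "III"),  # group
  D = c(0.7, 0.3, 0.5, 0.9, 4, 7),            # value
  stringsAsFactors = FALSE
)

# Hypothetical helper: denominator for row i of dat with window n
denominator <- function(i, dat, n) {
  a <- dat$A[i]
  # observations of the OTHER individuals in the same group
  others <- dat[dat$C == dat$C[i] & dat$B != dat$B[i], ]
  total <- dat$D[i]  # an individual's own value always counts
  for (b in unique(others$B)) {
    obs <- others[others$B == b, ]
    gap <- obs$A - a
    ok <- gap >= 0 & gap <= n  # assumption: gap 0 (same day) is allowed
    if (any(ok)) {
      obs <- obs[ok, ]
      # nearest eligible observation of individual b
      total <- total + obs$D[which.min(obs$A - a)]
    }
  }
  total
}

M$den  <- vapply(seq_len(nrow(M)), function(i) denominator(i, M, N), numeric(1))
M$calc <- M$D / M$den
```

On the sample data this reproduces the expected column from the question, i.e. 0.7, 1, 1, 1, 4/11 and 7/11. It loops row by row, so it is meant for checking results rather than for large tables.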
Answered 2017-05-25 11:23:20
I think this will give you the answer:
A <- c(1,3,5,20,21,21, 7)
B <- c(1, 2, 3, 4, 5, 6, 7)
C <- c("I","I","II","II","III","III", "I")
V <- c(0.7, 0.3, 0.5, 0.9, 4, 7, 0.1)
N=5
#Put data into a frame
test = data.frame(A,B,C,V)
#order the data by group, then by day (C may be character, so sort on it directly)
test = test[order(test$C, test$A),]
#Get the 'rollback' possibilities for each value: rows whose day lies in [x, x+N]
Roll = sapply(test$A, FUN = function(x){paste(which(test$A <= (x+N) & test$A >= x), collapse=",")})
#Get the groupings
Group = sapply(test$C, FUN = function(x){paste(which(test$C == x), collapse=",")})
#Intersect the values
ToGet = apply(cbind(Roll, Group), MARGIN=1, FUN=function(x){intersect(unlist(strsplit(x[1],",")), unlist(strsplit(x[2],",")))})
#Calculate the denominators
test$D = sapply(ToGet, FUN=function(x){sum(test$V[as.numeric(x)])})
test$Calc = test$V/test$D
Output:
> test
   A B   C   V    D      Calc
1  1 1   I 0.7  1.0 0.7000000
2  3 2   I 0.3  0.4 0.7500000
7  7 7   I 0.1  0.1 1.0000000
3  5 3  II 0.5  0.5 1.0000000
4 20 4  II 0.9  0.9 1.0000000
5 21 5 III 4.0 11.0 0.3636364
6 21 6 III 7.0 11.0 0.6363636
Answered 2017-05-25 17:08:01
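The question is tagged with data.table, so here is a data.table solution. It uses a non-equi join to identify the individuals within each group and treats them as a cohort if their observations fall within the 5-day date window.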
library(data.table) # CRAN version 1.10.4 used
# set length of date window in days
N <- 5L
# give columns more semantic names according to OP's description
setnames(M, c("day", "id", "grp", "val"))
# prepare data for non-equi join: allowable date range
ranged <- M[, .(start = day, end = day + N, co.id = id, grp)]
# non-equi join to determine cohort
joined <- M[ranged, on = c("grp", "day>=start", "day<=end")]
# compute denominator for each cohort
grouped <- joined[, .(den = sum(val)), by = co.id]
# final update on join and order
result <- M[grouped, on = c("id==co.id"), calc := val / den][order(grp, id)]
result
#   day id grp val      calc
#1:   1  1   I 0.7 0.7000000
#2:   3  2   I 0.3 0.7500000
#3:   7  7   I 0.1 1.0000000
#4:   5  3  II 0.5 1.0000000
#5:  20  4  II 0.9 1.0000000
#6:  21  5 III 4.0 0.3636364
#7:  21  6 III 7.0 0.6363636
Data
A <- c(1,3,5,20,21,21, 7)
B <- c(1, 2, 3, 4, 5, 6, 7)
C <- c("I","I","II","II","III","III", "I")
D <- c(0.7, 0.3, 0.5, 0.9, 4, 7, 0.1)
M <- data.table(A,B,C,D)
Compact version
For those who prefer compact code, here is a more convoluted version:
joined <- M[M[, .(start = day, end = day + N, co.id = id, grp)],
            on = c("grp", "day>=start", "day<=end")]
M[joined[, .(den = sum(val)), by = co.id], on = c("id==co.id"),
  calc := val / den][order(grp, id)]
Or, as a "one-liner":
M[M[M[, .(start = day, end = day + N, co.id = id, grp)],
    on = c("grp", "day>=start", "day<=end")
    ][, .(den = sum(val)), co.id],
  on = c("id==co.id"), calc := val / den][order(grp, id)]
https://stackoverflow.com/questions/44171153