我在R中有一个很大的数据表:
library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
ID=sample(1:200000, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:1000, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
dim(DT)我想要透视这个data.table,这样Category就变成了一个列。不幸的是,由于组中类别的数量不是恒定的,所以我不能使用this answer。
你知道我该怎么做吗?
/edit:基于joran的评论和flodel的回答,我们真的在重塑以下data.table
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]这种重塑可以通过多种方式完成(到目前为止,我已经得到了一些不错的答案),但我真正想要的是能够很好地扩展到具有数百万行和成百上千个类别的data.table。
发布于 2014-03-13 22:58:07
data.table实现了melt/dcast data.table特定方法的更快版本(用C编写)。它还添加了用于熔化和转换多个列的附加功能。请参阅Efficient reshaping using data.tables短片。
请注意,我们不需要加载reshape2包。
library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
ID=sample(1:200000, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:800, n, replace=TRUE), ## to get to <= 2 billion limit
Qty=runif(n),
key=c('ID', 'Month')
)
dim(DT)
> system.time(ans <- dcast(DT, ID + Month ~ Category, fun=sum))
# user system elapsed
# 65.924 20.577 86.987
> dim(ans)
# [1] 2399401 802发布于 2013-04-05 06:25:35
像这样吗?
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
timevar = "Category", direction = "wide")发布于 2013-04-05 07:19:40
没有特定于data.table的宽范围整形方法。
这是一种可行的方法,但它是相当昂贵的。
有一个功能请求#2619 Scoping for LHS in :=可以帮助使这一点变得更简单。
下面是一个简单的例子
# a data.table
DD <- data.table(a= letters[4:6], b= rep(letters[1:2],c(4,2)), cc = as.double(1:6))
# with not all categories represented
DDD <- DD[1:5]
# trying to make `a` columns containing `cc`. retaining `b` as a column
# the unique values of `a` (you may want to sort this...)
nn <- unique(DDD[,a])
# create the correct wide data.table
# with NA of the correct class in each created column
rows <- max(DDD[, .N, by = list(a,b)][,N])
DDw <- DDD[, setattr(replicate(length(nn), {
# safe version of correct NA
z <- cc[1]
is.na(z) <-1
# using rows value calculated previously
# to ensure correct size
rep(z,rows)},
simplify = FALSE), 'names', nn),
keyby = list(b)]
# set key for binary search
setkey(DDD, b, a)
# The possible values of the b column
ub <- unique(DDw[,b])
# nested loop doing things by reference, so should be
# quick (the feature request would make this possible to
# speed up using binary search joins.
for(ii in ub){
for(jj in nn){
DDw[list(ii), {jj} := DDD[list(ii,jj)][['cc']]]
}
}
DDw
# b d e f
# 1: a 1 2 3
# 2: a 4 2 3
# 3: b NA 5 NA
# 4: b NA 5 NAhttps://stackoverflow.com/questions/15822623
复制相似问题