I have a data frame of scores (V3) for a series of integer ranges (V1 to V2).
scores <- structure(list(V1 = c(2037651L, 2037659L, 2037677L, 2037685L,
2037703L, 2037715L), V2 = c(2037700L, 2037708L, 2037726L, 2037734L,
2037752L, 2037764L), V3 = c(1.474269, 1.021012, 1.180993, 1.717131,
2.361985, 1.257013)), .Names = c("V1", "V2", "V3"), class = "data.frame",
row.names = c(NA, -6L))
V1 V2 V3
1 2037651 2037700 1.474269
2 2037659 2037708 1.021012
3 2037677 2037726 1.180993
4 2037685 2037734 1.717131
5 2037703 2037752 2.361985
6 2037715 2037764 1.257013
I also have a vector of integers, stored as the single column V1 of a data frame.
coord <- structure(list(V1 = c(2037652, 2037653, 2037654, 2037655, 2037656,
2037657, 2037658, 2037659, 2037660, 2037661, 2037662, 2037663,
2037664, 2037665, 2037666, 2037667, 2037668, 2037669, 2037670,
2037671)), .Names = "V1", row.names = c(NA, -20L), class = "data.frame")
For each integer in coord$V1, I want the mean of the scores (scores$V3) over all ranges (V1 to V2) that contain that integer. To do this, I tried:
for(i in 1:nrow(coord)){
range_scores <- subset(scores,
scores$V1 <= coord$V1[i] & scores$V2 >= coord$V1[i])
coord$V2[i] <- mean(range_scores$V3)
}
This works, but it is very slow.
How can I do the same thing more efficiently?
Answered 2012-07-03 09:57:58
Here is the solution I came up with:
scores = read.table(header=FALSE,
text="2037651 2037700 1.474269
2037659 2037708 1.021012
2037677 2037726 1.180993
2037685 2037734 1.717131
2037703 2037752 2.361985
2037715 2037764 1.257013")
coord = data.frame(V1=c(2037652, 2037653, 2037654, 2037655, 2037656, 2037657,
2037658, 2037659, 2037660, 2037661, 2037662, 2037663,
2037664, 2037665, 2037666, 2037667, 2037668, 2037669,
2037670, 2037671))
coord_vec = coord$V1 # Store as a vector instead of data.frame
scores_mat = as.matrix(scores) # Store as a matrix instead of data.frame
results = numeric(length=nrow(coord)) # Pre-allocate vector to store results.
for (i in 1:nrow(coord)) {
select_rows = ((scores_mat[, 1] <= coord_vec[i]) &
(scores_mat[, 2] >= coord_vec[i]))
scores_subset = scores_mat[select_rows, 3] # Use logical indexing.
results[i] = mean(scores_subset)
}
results
# [1] 1.474269 1.474269 1.474269 1.474269 1.474269 1.474269 1.474269 1.247641
# [9] 1.247641 1.247641 1.247641 1.247641 1.247641 1.247641 1.247641 1.247641
# [17] 1.247641 1.247641 1.247641 1.247641
# Benchmark results using @GSee's code. Needs library(rbenchmark).
# test replications elapsed relative user.self sys.self
# 4 bdemarest 100 0.046 1.000000 0.046 0.001
# 2 gsee 100 0.170 3.695652 0.170 0.001
# 1 orig 100 0.358 7.782609 0.360 0.001
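# 3 sepehr 100 0.163 3.543478 0.164 0.000
It appears to be considerably faster than the other proposals. I am sure the advantage comes from avoiding reads from and writes to a data.frame (a high-overhead operation). In addition, I use logical indexing instead of subset() to reduce overhead further. Could it be made faster still with an *ply strategy?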
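One direction beyond both the loop and *ply is to remove the per-coordinate iteration entirely. The sketch below is not from any of the answers here; it is a minimal illustration that assumes the full comparison matrix (one row per coordinate, one column per range) fits in memory. It uses outer() to build that logical matrix and a matrix product to sum the matching scores. The names inside and results_vec are made up for this example.

```r
# Same data as in the question, rebuilt so the sketch is self-contained.
scores <- data.frame(
  V1 = c(2037651, 2037659, 2037677, 2037685, 2037703, 2037715),
  V2 = c(2037700, 2037708, 2037726, 2037734, 2037752, 2037764),
  V3 = c(1.474269, 1.021012, 1.180993, 1.717131, 2.361985, 1.257013)
)
coord_vec <- 2037652:2037671

# inside[i, j] is TRUE when coord_vec[i] lies within range j (V1[j] to V2[j]).
inside <- outer(coord_vec, scores$V1, ">=") & outer(coord_vec, scores$V2, "<=")

# Sum of the matching scores divided by the number of matching ranges.
# A coordinate covered by no range gives 0 / 0 = NaN, which matches
# what mean() returns for an empty vector.
results_vec <- as.vector(inside %*% scores$V3) / rowSums(inside)
results_vec
```

Whether this beats the matrix loop depends on the sizes involved: the comparison matrix has length(coord_vec) * nrow(scores) cells, so it would need benchmarking on the real data, and for very large inputs it may have to be processed in chunks.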
Answered 2012-07-03 09:03:24
The following is about twice as fast as the original loop:
coord$V2 <- sapply(coord$V1, function(x) mean(scores[scores[, 2] >= x & x >= scores[, 1], 3]))
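A closely related variant, not part of the original answer, is vapply(), which takes the same kind of function as sapply() but also requires the expected result type to be declared. The sketch below assumes the same scores and coord data as in the question and indexes the columns by name rather than by position. It is meant as a type-safe alternative, not as a claim about speed.

```r
# Same data as in the question, rebuilt so the sketch is self-contained.
scores <- data.frame(
  V1 = c(2037651, 2037659, 2037677, 2037685, 2037703, 2037715),
  V2 = c(2037700, 2037708, 2037726, 2037734, 2037752, 2037764),
  V3 = c(1.474269, 1.021012, 1.180993, 1.717131, 2.361985, 1.257013)
)
coord <- data.frame(V1 = 2037652:2037671)

# numeric(1) declares that every call must return exactly one number;
# vapply() stops with an error if the function returns anything else.
coord$V2 <- vapply(
  coord$V1,
  function(x) mean(scores$V3[scores$V1 <= x & x <= scores$V2]),
  numeric(1)
)
coord
```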
First, recreate the data:
scores <- read.table(text=" V1 V2 V3
1 2037651 2037700 1.474269
2 2037659 2037708 1.021012
3 2037677 2037726 1.180993
4 2037685 2037734 1.717131
5 2037703 2037752 2.361985
6 2037715 2037764 1.257013", row.names=1)
coord <-data.frame(V1=c(2037652, 2037653, 2037654, 2037655, 2037656, 2037657, 2037658,
2037659, 2037660, 2037661, 2037662, 2037663, 2037664, 2037665,
2037666, 2037667, 2037668, 2037669, 2037670, 2037671))
Make functions and benchmark:
gsee <- function(coord) {
coord$V2 <- sapply(coord$V1, function(x) mean(scores[scores[, 2] >= x & x >= scores[, 1], 3]))
coord
}
orig <- function(coord) {
for(i in 1:NROW(coord)){
range_scores<-subset(scores, scores$V1 <= coord$V1[i] & scores$V2 >= coord$V1[i]);
coord$V2[i]<-mean(range_scores$V3)
}
coord
}
identical(gsee(coord), orig(coord)) # TRUE
library(rbenchmark) # provides benchmark()
benchmark(orig=orig(coord), gsee=gsee(coord))
test replications elapsed relative user.self sys.self user.child sys.child
2 gsee 100 0.175 1.000000 0.175 0.000 0 0
1 orig 100 0.379 2.165714 0.377 0.002 0 0
Edit: per @Sepehr, lapply is slightly better.
sepehr <- function(coord) {
coord$V2 <- unlist(lapply(coord$V1, function(x) mean(scores[scores[, 2] >= x & x >= scores[, 1], 3])))
coord
}
benchmark(orig=orig(coord), gsee=gsee(coord), sepehr=sepehr(coord))
test replications elapsed relative user.self sys.self user.child sys.child
2 gsee 100 0.171 1.023952 0.171 0.000 0 0
1 orig 100 0.369 2.209581 0.369 0.001 0 0
3 sepehr 100 0.167 1.000000 0.167 0.000 0 0
https://stackoverflow.com/questions/11302883