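The question is how to "rebuild" a data frame that was previously collapsed on segment_id, that is, how to expand the start and end variables into a table with one row for every element inside each interval.
Consider the following example dataset: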
my_df <- structure(list(group_id = c(1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 7, 8, 9),
  start = c(1L, 1L, 13L, 24L, 1L, 16L, 30L, 1L, 14L, 1L, 1L, 6L, 11L, 1L, 9L, 20L, 1L, 1L),
  end = c(22L, 13L, 24L, 27L, 16L, 30L, 51L, 14L, 26L, 8L, 6L, 11L, 17L, 9L, 20L, 26L, 17L, 14L),
  segment_id = c(1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L)),
  row.names = 3377225:3377242, class = "data.frame", .Names = c("group_id", "start", "end", "segment_id"))
It is crucial to apply the following preprocessing:
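my_df[my_df$start > 1, "start"] <- my_df[my_df$start > 1, "start"] + 1
As you can see in the data, segment_id was used to collapse the data.frame, and the first and last element of each segment are stored in the variables start and end, respectively.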
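To make the effect of this preprocessing concrete, here is a minimal sketch restricted to group 2 of the example data. The names `g2` and `n_elements` are introduced only for illustration and do not appear in the original post.

```r
# Group 2 from the example data, before preprocessing (illustrative subset)
g2 <- data.frame(start = c(1L, 13L, 24L),
                 end = c(13L, 24L, 27L),
                 segment_id = 1:3)

# Shift every start except the first one, so that consecutive
# segments no longer share a boundary element
g2$start[g2$start > 1] <- g2$start[g2$start > 1] + 1L

# Number of elements covered by each segment after the shift
g2$n_elements <- g2$end - g2$start + 1L
g2
```

After the shift, the three segments cover elements 1 to 13, 14 to 24 and 25 to 27, which add up to the 27 elements expected for group 2.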
I am struggling to find an efficient solution that scales to millions of records and gives the following result:
group_id <- c(rep(1, 22), rep(2, 27), rep(3, 51), rep(4, 26), rep(5, 8), rep(6, 17), rep(7, 26), rep(8, 17), rep(9, 14))
element_id <- c(seq.int(1, 22), seq.int(1, 27), seq.int(1, 51), seq.int(1, 26), seq.int(1, 8), seq.int(1, 17), seq.int(1, 26), seq.int(1, 17), seq.int(1, 14))
segment_id <- c(rep(1, 22), rep(1, 13), rep(2, (24-13)), rep(3, (27-24)), rep(1, 16), rep(2, (30-16)), rep(3, (51-30)), rep(1, 14), rep(2, (26-14)), rep(1, 8), rep(1, 6), rep(2, (11-6)), rep(3, (17-11)), rep(1, 9), rep(2, (20-9)), rep(3, (26-20)), rep(1, 17), rep(1,14))
solution_df <- data.frame(group_id, element_id, segment_id)
The only solution I have found is to convert the data.frame to a matrix and loop over all segments.
Please do not hesitate to ask for clarification.
Posted on 2017-12-12 10:49:58
my_df <- structure(list(group_id = c(1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 7, 8, 9),
start = c(1L, 1L, 13L, 24L, 1L, 16L, 30L, 1L, 14L, 1L, 1L, 6L, 11L, 1L, 9L, 20L, 1L, 1L),
end = c(22L, 13L, 24L, 27L, 16L, 30L, 51L, 14L, 26L, 8L, 6L, 11L, 17L, 9L, 20L, 26L, 17L, 14L),
segment_id = c(1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L)),
row.names = 3377225:3377242, class = "data.frame", .Names = c("group_id", "start", "end", "segment_id"))
library(tidyverse)
my_df %>%
  mutate(start = ifelse(start > 1, start + 1, start)) %>%           # update start values
  group_by(group_id, segment_id) %>%                                # for each group and segment id combination
  nest() %>%                                                        # create a dataset with the rest of the columns
  mutate(element_id_new = map(data, ~ seq(.$start, .$end, 1))) %>%  # get a sequence of values from start to end
  unnest(element_id_new)                                            # unnest the sequence
# # A tibble: 208 x 3
#    group_id segment_id element_id_new
#       <dbl>      <int>          <dbl>
#  1        1          1              1
#  2        1          1              2
#  3        1          1              3
#  4        1          1              4
#  5        1          1              5
#  6        1          1              6
#  7        1          1              7
#  8        1          1              8
#  9        1          1              9
# 10        1          1             10
# # ... with 198 more rows
Posted on 2017-12-13 22:51:27
Here is an alternative approach using data.table:
library(data.table)
setDT(my_df)[start == 1, start := 0][
  , .(group_id = rep(group_id, end - start), segment_id = rep(segment_id, end - start))][
  , element_id := rowid(group_id)][]
     group_id segment_id element_id
  1:        1          1          1
  2:        1          1          2
  3:        1          1          3
  4:        1          1          4
  5:        1          1          5
 ---
204:        9          1         10
205:        9          1         11
206:        9          1         12
207:        9          1         13
208:        9          1         14
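Explanation
The requested correction is applied only to the few entries where start == 1, and in a different way than the OP suggested: those starts are set to 0 instead of adding 1 to all the others. The code therefore has to run on the original my_df, before the + 1 preprocessing from the question. This reduces the number of updates (the whole object is not copied), and it avoids having to add + 1 when computing the length of each streak, because end - start is then the segment length for every row (22 - 0 = 22 for group 1, 24 - 13 = 11 for the second segment of group 2).
Then group_id and segment_id are each repeated end - start times, as required. Finally, element_id is appended by numbering the rows within each group_id using the rowid() function.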
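The same idea (repeat each row by its segment length, then number the rows within each group) can also be sketched in base R. This is not part of the original answer. It is a minimal illustration that assumes the rows of each group are contiguous and ordered by segment, as in the example data. The names `len` and `expanded` are hypothetical.

```r
# Original example data, without the "+ 1" preprocessing from the question
my_df <- data.frame(
  group_id   = c(1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 7, 8, 9),
  start      = c(1L, 1L, 13L, 24L, 1L, 16L, 30L, 1L, 14L, 1L, 1L, 6L, 11L, 1L, 9L, 20L, 1L, 1L),
  end        = c(22L, 13L, 24L, 27L, 16L, 30L, 51L, 14L, 26L, 8L, 6L, 11L, 17L, 9L, 20L, 26L, 17L, 14L),
  segment_id = c(1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L)
)

# Segment length: a start of 1 is treated as 0, so end - start
# gives the number of elements for every row
len <- my_df$end - ifelse(my_df$start == 1L, 0L, my_df$start)

# Repeat group_id and segment_id once per element of the segment
expanded <- data.frame(
  group_id   = rep(my_df$group_id, len),
  segment_id = rep(my_df$segment_id, len)
)

# Number the rows within each run of identical group_id values
# (assumption: each group occupies one contiguous block of rows)
expanded$element_id <- sequence(rle(expanded$group_id)$lengths)

nrow(expanded)
```

For the example data this yields 208 rows, matching the row count of both answers above. On real data whose groups are not stored contiguously, sorting by group_id and segment_id first would be needed for the numbering step to hold.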
https://stackoverflow.com/questions/47770490