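Suppose I have a fairly large data set, around a million rows.
In this data file I want to keep only the rows from each BSM through the matching ENDBSM and drop everything else (see the output example). How can I do this efficiently?
My first idea was to flag those rows with 1 and then extract them using the loop below, but it takes a very long time.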
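chkSTR <- 0
for (i in seq_len(nrow(rData))) {
  # switch the flag on at BSM and off at ENDBSM
  if (rData$Data[i] == "BSM") {
    chkSTR <- 1
  }
  if (rData$Data[i] == "ENDBSM") {
    chkSTR <- 0
  }
  # note: the ENDBSM row itself is flagged 0 here
  rData$BOOL[i] <- chkSTR
}
Example input data frame: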
rData = data.frame(
Data =
c(1,"BSM","a",3,3,"ENDBSM",1,3,1,"BSM","b",3,3,"ENDBSM",1,2,1,"BSM","c",2,3,"ENDBSM",1,2)
)
Output example
rData = data.frame(
Data =
c("BSM","a",3,3,"ENDBSM","BSM","b",3,3,"ENDBSM","BSM","c",2,3,"ENDBSM")
)
Posted 2019-06-19 06:41:36
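You can use Reduce to build a toggle (flip-flop) between BSM and ENDBSM. The number of BSM and ENDBSM markers does not have to match, and BSM does not have to come first. The toggle simply switches on when BSM appears and switches off again when ENDBSM appears.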
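To make the toggle concrete, here is a minimal sketch on a toy character vector rather than the question's data frame. The names `step`, `toy`, `state`, `before`, `after`, `keep` and `inner` are illustrative assumptions, not part of the answer. The vector deliberately starts with a stray ENDBSM and ends with an unclosed BSM, to exercise the claim that the markers need not be balanced.

```r
# Toy input (an assumption for illustration): stray ENDBSM first, unclosed BSM last.
toy <- c("ENDBSM", "x", "BSM", "a", "ENDBSM", "y", "BSM", "b")

# One toggle step: stay on (or switch on at "BSM") unless the value is "ENDBSM".
step <- function(state, value) (state || value == "BSM") && value != "ENDBSM"

# accumulate = TRUE keeps every intermediate state; init = FALSE adds one
# leading element, so `state` is one element longer than `toy`.
state <- Reduce(step, toy, accumulate = TRUE, init = FALSE)

after  <- state[-1]              # toggle state after reading each element
before <- state[-length(state)]  # toggle state before reading each element

keep  <- after | before  # markers plus everything between them
inner <- after & before  # only the elements strictly between markers

toy[keep]
toy[inner]
```

Here `toy[keep]` should come out as BSM, a, ENDBSM, BSM, b and `toy[inner]` as a, b: the leading ENDBSM is ignored, and the unclosed final block runs to the end of the vector.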
idx <- Reduce(function(y,x) {(y || x=="BSM") && x!= "ENDBSM"}, x=rData$Data, init=FALSE, accumulate=TRUE)
rData[idx[-1] | idx[-length(idx)], , drop = FALSE]
# Data
#2 BSM
#3 a
#4 3
#5 3
#6 ENDBSM
#10 BSM
#11 b
#12 3
#13 3
#14 ENDBSM
#18 BSM
#19 c
#20 2
#21 3
#22 ENDBSM
If you also want to drop the surrounding BSM and ENDBSM rows, use:
rData[idx[-1] & idx[-length(idx)], , drop = FALSE]
# Data
#3 a
#4 3
#5 3
#11 b
#12 3
#13 3
#19 c
#20 2
#21 3
Posted 2019-06-18 12:35:23
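As mentioned in the comments, the number of "BSM" and "ENDBSM" markers is the same and "BSM" always comes first, so we can use mapply to build a sequence between each pair of indices and subset with it.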
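A minimal sketch of the index pairing on a toy vector, assuming (as stated above) that every BSM has a later matching ENDBSM. The names `x`, `starts`, `ends`, `rows` and `mask` are illustrative assumptions. The `stopifnot()` guard and the `cumsum()` cross-check are additions for illustration, not part of the answer.

```r
# Toy input (an assumption for illustration) with blocks of different lengths.
x <- c("1", "BSM", "a", "ENDBSM", "2", "BSM", "b", "c", "ENDBSM")

starts <- which(x == "BSM")
ends   <- which(x == "ENDBSM")

# Guard the assumptions: same number of markers, each BSM before its ENDBSM,
# and each block closed before the next one opens.
stopifnot(
  length(starts) == length(ends),
  all(starts < ends),
  all(ends[-length(ends)] < starts[-1])
)

# One start:end sequence per block. SIMPLIFY = FALSE always returns a list,
# and unlist() flattens it even when the blocks differ in length.
rows <- unlist(mapply(`:`, starts, ends, SIMPLIFY = FALSE))

# Cross-check with a loop-free mask: a row is inside a block when more BSM
# markers have been seen so far than ENDBSM markers strictly before it.
opened        <- cumsum(x == "BSM")
closed_before <- cumsum(x == "ENDBSM") - (x == "ENDBSM")
mask          <- opened > closed_before

x[rows]
identical(rows, which(mask))
```

On this toy vector both routes should select positions 2 to 4 and 6 to 9. The mask relies on the same well-formedness assumption as the index pairing, so it is a sanity check rather than a more robust alternative.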
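# unlist() flattens the per-block sequences even when blocks differ in length
rData[unlist(mapply(`:`, which(rData$Data == "BSM"),
                    which(rData$Data == "ENDBSM"), SIMPLIFY = FALSE)), , drop = FALSE]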
# Data
#2 BSM
#3 a
#4 3
#5 3
#6 ENDBSM
#10 BSM
#11 b
#12 3
#13 3
#14 ENDBSM
#18 BSM
#19 c
#20 2
#21 3
#22 ENDBSM
Posted 2019-06-18 14:02:48
We can use map2 from purrr:
library(purrr)
# pair each BSM index with its ENDBSM index and expand to a row sequence
idx <- map2(which(rData$Data == "BSM"), which(rData$Data == "ENDBSM"), `:`) %>%
  flatten_int()
rData[idx, , drop = FALSE]
https://stackoverflow.com/questions/56649249