Q: How to conditionally check and replace data in an xts object?

Stack Overflow user

Asked 2019-12-26 23:32:22

4 answers, 351 views, 0 followers, 1 vote

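Here is a reproducible dataset. The problem is to find runs of 1 or 2 consecutive non-NA values that sit between NAs and set them to NA. Runs of more than 2 values should be left untouched.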

library(xts)

set.seed(55)
data <- rnorm(10)
dates <- as.POSIXct("2019-03-18 10:30:00", tz = "CET") + 0:9*60

R <- xts(x = data, order.by = dates)
colnames(R) <- "R-factor"
R[c(1, 3, 6, 10)] <- NA
R

Output:

                        R-factor
2019-03-18 10:30:00           NA
2019-03-18 10:31:00 -1.812376850
2019-03-18 10:32:00           NA
2019-03-18 10:33:00 -1.119221005
2019-03-18 10:34:00  0.001908206
2019-03-18 10:35:00           NA
2019-03-18 10:36:00 -0.505343855
2019-03-18 10:37:00 -0.099234393
2019-03-18 10:38:00  0.305353199
2019-03-18 10:39:00           NA

Expected result:

                        R-factor
2019-03-18 10:30:00           NA
2019-03-18 10:31:00           NA
2019-03-18 10:32:00           NA
2019-03-18 10:33:00           NA
2019-03-18 10:34:00           NA
2019-03-18 10:35:00           NA
2019-03-18 10:36:00 -0.505343855
2019-03-18 10:37:00 -0.099234393
2019-03-18 10:38:00  0.305353199
2019-03-18 10:39:00           NA

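I have written a function with a for loop that works fine on a small dataset, but it is very slow. The original data consists of 100,000+ data points, and the function had still not finished after more than 10 minutes.

Can anyone help me avoid the loop and make this faster?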

for-loop

timestamp

time-series

xts

4 Answers

Stack Overflow user

Accepted answer

Posted 2019-12-27 02:12:02

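Create a Fillin function that returns NA if the length of its argument is less than or equal to 3 (or 2 if the first element is not NA, so that the first group is handled even when it does not start with an NA) and otherwise returns its argument unchanged. Use cumsum to group the runs, so that each NA starts a new group, and apply Fillin to each group with ave.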

Fillin <- function(x) if (length(x) <= 3 - !is.na(x[1])) NA else x
Rc <- coredata(R)
R[] <- ave(Rc, cumsum(is.na(Rc)), FUN = Fillin)

giving:

> R
                       R-factor
2019-03-18 10:30:00          NA
2019-03-18 10:31:00          NA
2019-03-18 10:32:00          NA
2019-03-18 10:33:00          NA
2019-03-18 10:34:00          NA
2019-03-18 10:35:00          NA
2019-03-18 10:36:00 -0.50534386
2019-03-18 10:37:00 -0.09923439
2019-03-18 10:38:00  0.30535320
2019-03-18 10:39:00          NA
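
To make the grouping step easier to follow, here is a minimal sketch on a plain numeric vector instead of an xts object. The values in `x` are made up; only the NA pattern mirrors the example above. It shows the group ids produced by `cumsum(is.na(x))` and what `ave` does with them.

```r
# Plain vector with the same NA pattern as the example (values are illustrative)
x <- c(NA, 1.5, NA, 2.5, 3.5, NA, 4.5, 5.5, 6.5, NA)

# Every NA bumps the counter, so each NA opens a new group and the
# non-NA values that follow it share that group's id
grp <- cumsum(is.na(x))
print(grp)

# Same Fillin as in the answer: a group of length <= 3 that starts with NA
# holds at most 2 real values, so the whole group becomes NA
Fillin <- function(x) if (length(x) <= 3 - !is.na(x[1])) NA else x

out <- ave(x, grp, FUN = Fillin)
print(out)
```

Each group consists of one leading NA plus the values after it, which is why the threshold is 3 rather than 2. Only the first group can start with a real value, and that case is covered by the `- !is.na(x[1])` adjustment.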

Performance

This solution runs at roughly the same speed as the solution that uses rle.

library(microbenchmark)

microbenchmark(
  Fill = { Fillin <- function(x) if (length(x) <= 3 - !is.na(x[1])) NA else x
    Rc <- coredata(R)
    R[] <- ave(Rc, cumsum(is.na(Rc)), FUN = Fillin)
  },
  RLrep = { rleR <-  rle(c(is.na(R[,1]))) 
    is.na(R) <- with(rleR,  rep(lengths < 3 , lengths ) )
  }
)

giving:

Unit: microseconds
  expr   min    lq    mean median     uq    max neval cld
  Fill 490.9 509.5 626.550  527.7 596.45 3411.1   100   a
 RLrep 523.5 540.8 604.061  550.8 592.00 1244.4   100   a
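
The `RLrep` entry in the benchmark relies on run-length encoding. As a rough sketch of what it does, here is the same idea applied to a plain vector (the values in `x` are made up; the xts version above works on `R[,1]` instead):

```r
# Plain vector with the same NA pattern as the example (values are illustrative)
x <- c(NA, 1.5, NA, 2.5, 3.5, NA, 4.5, 5.5, 6.5, NA)

# rle() compresses the TRUE/FALSE vector into runs: their lengths and values
runs <- rle(is.na(x))
print(runs$lengths)

# Flag every element that belongs to a run shorter than 3, expanded back
# to the original length. NA runs are flagged too, which is harmless
# because they are already NA.
short <- rep(runs$lengths < 3, runs$lengths)

# `is.na<-` sets the flagged positions to NA
is.na(x) <- short
print(x)
```

Note that the benchmark above appears to time the 10-row example, so it probably says little about how either approach behaves on the 100,000+ points mentioned in the question.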

Votes: 4

Stack Overflow user

Posted 2019-12-27 00:43:20

Perhaps try this, based on "Distance from the closest non NA value in a dataframe":

library(tidyverse)

set.seed(55)
x <- 100000
data <- rnorm(x)
dates <- as.POSIXct("2019-03-18 10:30:00", tz = "CET") + (seq_len(x))*60
time_table1 <- tibble(time = dates,data = data)
time_table <- time_table1 %>% 
  mutate(random = rnorm(x),
         new = if_else(random > data,NA_real_,data)) %>% 
  select(-data,-random) %>% 
  rename(data= new)



lengths_na <- time_table$data %>% is.na %>% rle  %>% pluck('lengths')

the_operation <- . %>% 
  mutate(lengths_na =lengths_na %>% seq_along %>% rep(lengths_na)) %>% 
  group_by(lengths_na) %>%
  add_tally() %>%
  ungroup() %>% 
  mutate(replace_sequence = if_else(condition = n < 3,true = NA_real_,false = data))

microbenchmark::microbenchmark(time_table %>% the_operation)

The timing on 100,000 rows is quite good.

Unit: milliseconds
                         expr      min       lq     mean  median       uq      max neval
 time_table %>% the_operation 141.9009 176.2988 203.3744 190.183 214.1691 412.3161   100

Perhaps this version is easier to read:

library(tidyverse)

set.seed(55)

# Create the data

x <- 100
data <- rnorm(x)
dates <- as.POSIXct("2019-03-18 10:30:00", tz = "CET") + (seq_len(x))*60
time_table1 <- tibble(time = dates,data = data)

# Fake some na's
time_table <- time_table1 %>% 
  mutate(random = rnorm(x),
         new = if_else(random > data,NA_real_,data)) %>%
  select(-data,-random) %>% 
  rename(data= new)


# rle() returns the lengths of runs of identical values in a vector.
# is.na() gives us a TRUE/FALSE vector,
# so we can measure the length of each run, with or without NAs
lengths_na <- time_table$data %>% is.na %>% rle  %>% pluck('lengths')

# This operation here can be done outside of the df
new_col <- lengths_na %>%
  seq_along %>% # One id per run: 1, 2, ..., number of runs
  rep(lengths_na) # Repeat each id by its run length, giving one id per row

result <- time_table %>%
  mutate(new_col = new_col) %>% 
  group_by(new_col) %>% # The following steps are applied per run
  add_tally() %>% # Adds column n: the number of rows in each group
  ungroup() %>% # Not strictly needed, but good manners
  mutate(replace_sequence = if_else(condition = n < 3,true = NA_real_,false = data))
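
For readers who do not use the tidyverse, the run-id and tally steps above can be sketched in base R. This is an illustration on a made-up vector `x`, not a rewrite of the answer's pipeline; the names `run_id`, `n` and `replaced` are chosen here to mirror it.

```r
# Plain vector standing in for time_table$data (values are illustrative)
x <- c(NA, 1.5, NA, 2.5, 3.5, NA, 4.5, 5.5, 6.5, NA)

# Lengths of the alternating NA / non-NA runs
len <- rle(is.na(x))$lengths

# One id per run, repeated so every element carries the id of its run
run_id <- rep(seq_along(len), len)
print(run_id)

# Size of the run each element belongs to: the base R counterpart of
# group_by(run_id) followed by add_tally()
n <- rep(len, len)
print(n)

# Blank out every element that sits in a run shorter than 3
replaced <- ifelse(n < 3, NA_real_, x)
print(replaced)
```

Because `n` comes straight from `rep(len, len)`, the grouping step is not needed for this particular task. It matters only if other per-run summaries are wanted as well.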

Votes: 2

Stack Overflow user

Posted 2019-12-27 00:20:05

I guess there are better solutions, but this cuts the time in half.

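    R_df <- as.data.frame(R)

    R_df$shift_1 <- c(R_df$`R-factor`[-1], NA)           # next value (series shifted up by one)
    R_df$shift_2 <- c(NA, R_df$`R-factor`[-nrow(R_df)])  # previous value (series shifted down by one)

    # Create the new filtered variable: keep a value only if at least one
    # neighbour is non-NA. This drops isolated single values; a run of two
    # values still keeps both.
    R_df$`R-factor_new` <- ifelse(is.na(R_df$`R-factor`), NA,
                                  ifelse((!is.na(R_df$shift_1)) | (!is.na(R_df$shift_2)),
                                         R_df$`R-factor`, NA))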

>                 test replications elapsed relative user.self sys.self user.child sys.child
>     2 ifelseapproach         1000    0.83    1.000      0.65     0.19         NA        NA
>     1       original         1000    1.81    2.181      1.76     0.01         NA        NA
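
The shift idea can be stretched to cover runs of two values as well. The sketch below is one possible extension, not part of the original answer: a value is kept only if it belongs to some window of three consecutive non-NA values. The helpers `lag_k` and `lead_k` are hypothetical names introduced here, and positions beyond either end of the vector are treated as NA.

```r
# Plain vector with the same NA pattern as the example (values are illustrative)
x <- c(NA, 1.5, NA, 2.5, 3.5, NA, 4.5, 5.5, 6.5, NA)

ok <- !is.na(x)

# Hypothetical helpers: shift a logical vector by k positions, padding with
# FALSE so that positions outside the vector count as NA
lag_k  <- function(v, k) c(rep(FALSE, k), head(v, -k))
lead_k <- function(v, k) c(tail(v, -k), rep(FALSE, k))

# Keep a value if it is the last, middle, or first element of a window of
# three consecutive non-NA values
keep <- ok & (
  (lag_k(ok, 1)  & lag_k(ok, 2))  |
  (lag_k(ok, 1)  & lead_k(ok, 1)) |
  (lead_k(ok, 1) & lead_k(ok, 2))
)

x[!keep] <- NA
print(x)
```

This stays fully vectorized, but each extra run length to be removed needs more shifted copies, so the rle or cumsum approaches are likely easier to generalize.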

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/59490428
