Does anyone know whether the na.approx function can interpolate based on unevenly spaced dates?
Suppose we have a df like this (the real df has more than 5000 rows):
Date, Value
2022-05-01, 6
2022-05-02, 5
2022-05-03, NA
2022-05-06, NA
2022-05-07, 14
2022-05-08, 15
2022-05-09, NA
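2022-05-10, 67
I want to interpolate linearly according to the date range. For example, for the first two NAs: 1) 14 - 5 = 9, 2) counting the days from 2022-05-02 to 2022-05-06 = 5 days, so we divide 3) 9/5 = 1.8. The NA value for 2022-05-03 is 6.8 and for 2022-05-06 it is 8.6.
Second example, 2022-05-09: 1) 67 - 15 = 52, 2) 2022-05-08 to 2022-05-10 = 3 days, 3) 52/3 = 17.33333. The NA value for 2022-05-09 is 32.33333 (= 15 + 17.33333).
Can this be done with the na.approx function? If not, how can I approach this problem?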
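Posted on 2022-11-15 11:24:32
Using the data frame DF defined reproducibly in the Note at the end, we see that what is asked for (i.e. linear interpolation with respect to Date) is exactly what na.approx gives when applied to a zoo series (after correcting the calculations in the question):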
library(zoo)
z <- read.zoo(DF)
na.approx(z)
## 2022-05-01 2022-05-02 2022-05-03 2022-05-06 2022-05-07 2022-05-08 2022-05-09
##        6.0        5.0        6.8       12.2       14.0       15.0       41.0
## 2022-05-10
##       67.0
Or, in terms of the original data frame, we can use the x= argument of na.approx to specify that the interpolation is with respect to Date.
DF$Date <- as.Date(DF$Date)
transform(DF, Value = na.approx(Value, Date, na.rm = FALSE))
##         Date Value
## 1 2022-05-01   6.0
## 2 2022-05-02   5.0
## 3 2022-05-03   6.8
## 4 2022-05-06  12.2
## 5 2022-05-07  14.0
## 6 2022-05-08  15.0
## 7 2022-05-09  41.0
## 8 2022-05-10  67.0
I suggest you use R.
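The correction to the question's arithmetic is that each missing value is offset by the number of days elapsed since the last known value, not by its row position. A minimal base-R sketch of that arithmetic for the first gap, cross-checked against approx() from the stats package (the variable names are illustrative and not part of the answer above):

```r
# Known points on either side of the first gap
d0 <- as.Date("2022-05-02"); v0 <- 5
d1 <- as.Date("2022-05-07"); v1 <- 14

# Slope per day: (14 - 5) / 5 days = 1.8
slope <- (v1 - v0) / as.numeric(d1 - d0)

# Dates with missing values inside the gap
gap <- as.Date(c("2022-05-03", "2022-05-06"))

# Offset each date by the days elapsed since d0 (1 day and 4 days)
manual <- v0 + slope * as.numeric(gap - d0)

# Same interpolation via base R's approx(), with dates converted to numeric x
viaApprox <- approx(x = as.numeric(c(d0, d1)), y = c(v0, v1),
                    xout = as.numeric(gap))$y

print(manual)     # 6.8 and 12.2
print(viaApprox)  # same values
```

The second value is 12.2 rather than the 8.6 given in the question because 2022-05-06 lies four days after 2022-05-02, not two.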
Note
Lines <- "Date, Value
2022-05-01, 6
2022-05-02, 5
2022-05-03, NA
2022-05-06, NA
2022-05-07, 14
2022-05-08, 15
2022-05-09, NA
2022-05-10, 67"
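DF <- read.csv(text = Lines, strip.white = TRUE)
Posted on 2022-11-15 10:41:03
I can do this the long way, but I may need some help with the date differences. For the first case, there are 5 days between 02/05 and 07/05. For the second, I get 2 days, not 3, between 08/05 and 10/05. (Am I missing something?)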
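The day counts can be checked directly, since subtracting two Date objects in R gives the elapsed time in days. A small sketch of that check (the variable names are illustrative):

```r
# Elapsed days across the first gap: 2022-05-02 to 2022-05-07
gap1 <- as.integer(as.Date("2022-05-07") - as.Date("2022-05-02"))

# Elapsed days across the second gap: 2022-05-08 to 2022-05-10
gap2 <- as.integer(as.Date("2022-05-10") - as.Date("2022-05-08"))

# Counting both endpoints inclusively gives one more than the elapsed days
inclusive2 <- length(seq(as.Date("2022-05-08"), as.Date("2022-05-10"), by = "day"))

print(c(gap1, gap2, inclusive2))  # 5 2 3
```

The count of 3 in the question matches the inclusive count, which may be where the discrepancy comes from; for interpolation it is the elapsed days (2) that matter.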
The code is as follows:
# get data into required shape, and using data.table package
df <- read.table(text="
Date, Value
2022-05-01, 6
2022-05-02, 5
2022-05-03, NA
2022-05-06, NA
2022-05-07, 14
2022-05-08, 15
2022-05-09, NA
2022-05-10, 67
", header=T)
library(data.table)
library(zoo)
library(lubridate)
dt <- as.data.table(df)
dt[, Date := lubridate::ymd(gsub(",","",`Date.`))]
setorder(dt, Date)
# first step, fill in to get the starting value
dt[, Value2 := zoo::na.locf0(Value)]
# group together the rows, only really interested in the NA ones,
# ensuring they are grouped together. rleid makes a group where it finds new values
dt[, Group := rleid(is.na(Value))]
# find the value after the NA
dt[, ValueNext := shift(Value2, n=1, type="lead")]
# find the dates before and after the NA period
dt[, DatePre := shift(Date, n=1, type="lag")]
dt[, DateNext := shift(Date, n=1, type="lead")]
# find the differences in the values & dates
dt[, ValueDiff := ValueNext[.N]-Value2[1], by=Group]
dt[, DateDiff := as.integer(DateNext[.N]-DatePre[1]), by=Group]
# divide through to get the addition
dt[, ValueAdd := ValueDiff/DateDiff]
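# by group, use cumulative sum to add to the starting value
# note: this adds one increment per row, not per day, so 2022-05-06 comes out
# as 8.6 (as in the question) rather than the date-weighted 12.2
dt[, ValueOut := Value2+cumsum(ValueAdd), by=Group]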
# we only care about NA groups, so revert back to original value for other
# cases
dt[!is.na(Value), ValueOut := Value]
# check the NA rows
# ! only difference is I get 2 as the date diff for 2022-05-09, not 3
dt[is.na(Value),]
# Final output
dt[, .(Date, Value, ValueOut)]
https://stackoverflow.com/questions/74442495