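I have two data frames: df, which holds a daily metric, and benchmark_df, which holds benchmark metrics with their dates. I want to express each daily metric as a percentage of the maximum benchmark metric, but using only benchmark dates that are on or before the date of the daily metric.
A straightforward answer would be to use left_join() and mutate(perc_benchmark = metric_a / benchmark_metric_a), where benchmark_metric_a is the maximum benchmark value dated on or before the metric_a date. However, I'm not sure whether I can do this with left_join(), or whether there is a workaround.
So, in the sample data frames below, player B would have a perc_benchmark of 50% (0.5/1) from 8-1 to 8-6, 8-7 would be 25% (0.5/2), 8-8 would be 75% (1.5/2), and so on.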
library(dplyr)  # provides tibble()

df <- tibble(
  player = rep(LETTERS[1:2], times = 21),
  date = as.Date(rep(seq(as.Date('2022-08-01'), by = 'day', length.out = 21), each = 2)),
  # metric_a values are only patterned for testing purposes, they would not be in the actual dataframe
  metric_a = c(rep(0.5, times = 14), rep(c(0.5, 1.5), times = 7), rep(c(0.5, 3), times = 7))
)
benchmark_df <- tibble(
  player = rep(LETTERS[1:2], times = 3),
  date = as.Date(rep(c('2022-08-01', '2022-08-07', '2022-08-16'), each = 2)),
  benchmark_metric_a = c(1, 1, 0.5, 2, 1, 3)
)

Posted on 2022-08-30 19:56:08
"... using left_join() ... where benchmark_metric_a is the maximum value dated on or before the metric_a date."
In the meantime, here is a solution that uses a non-equi join with the data.table package, whose performance is an advantage at scale.
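For readers unfamiliar with non-equi joins, here is a minimal sketch of the idea on toy data. The tables bench_dt and daily_dt and their columns are hypothetical and are not part of the original example; the point is only that one daily row can match several benchmark rows when the join condition is an inequality.

```r
library(data.table)

# Hypothetical toy tables, used only to illustrate the join condition.
bench_dt <- data.table(
  id          = c("x", "x"),
  date        = as.Date(c("2022-08-01", "2022-08-07")),
  bench_value = c(1, 2)
)
daily_dt <- data.table(
  id    = c("x", "x"),
  date  = as.Date(c("2022-08-03", "2022-08-08")),
  value = c(0.5, 1.5)
)

# For each row of 'daily_dt' (the 'i' table), keep every row of 'bench_dt'
# (the 'x' table) with the same id and a date on or before the daily date.
joined <- bench_dt[daily_dt, .(
  id,
  bench_date = x.date,
  daily_date = i.date,
  bench_value,
  value
), on = .(id, date <= date)]

print(joined)
```

The daily row for 2022-08-08 matches both benchmark rows, which is why a later aggregation step (taking the maximum per player and date) is needed to get back to one row per daily observation.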
Solution
First, prepare your datasets as data.tables.
library(data.table)
# ...
# Code to generate 'df' and 'benchmark_df'.
# ...
# Convert datasets into 'data.table's.
setDT(df)
setDT(benchmark_df)

Then apply the following data.table "chain", which resembles a dplyr workflow:
# Perform a LEFT JOIN of 'df' to 'benchmark_df'...
result <- benchmark_df[df, .(
    # ...using a subset of columns...
    player,
    bench_date = x.date,
    benchmark_metric_a,
    metric_date = i.date,
    metric_a
  ),
  # ...where players match and benchmark dates are earlier (or concurrent).
  on = .(player, date <= date)
# Calculate the max benchmark...
][, .(
    benchmark_metric_a = max(benchmark_metric_a),
    # ...while preserving the metric...
    metric_a = first(metric_a)
  ),
  # ...for each player as of each date.
  by = .(player, metric_date)
# Calculate the percentage.
][, `:=`(
  perc_benchmark = metric_a / benchmark_metric_a
# Select the desired columns.
)][, .(
  player,
  date = metric_date,
  metric_a,
  perc_benchmark
# Restore original ordering from 'df': sort by 'date' and then 'player'.
)][order(
  date,
  player
)]

Result
Given a df and benchmark_df like those in your example:
library(dplyr)
df <- tibble(
  player = rep(LETTERS[1:2], times = 21),
  date = as.Date(rep(seq(as.Date('2022-08-01'), by = 'day', length.out = 21), each = 2)),
  # metric_a values are only patterned for testing purposes, they would not be in the actual dataframe
  metric_a = c(rep(0.5, times = 14), rep(c(0.5, 1.5), times = 7), rep(c(0.5, 3), times = 7))
)
benchmark_df <- tibble(
  player = rep(LETTERS[1:2], times = 3),
  date = as.Date(rep(c('2022-08-01', '2022-08-07', '2022-08-16'), each = 2)),
  benchmark_metric_a = c(1, 1, 0.5, 2, 1, 3)
)

the chain should produce the following result, in which the data from df is now augmented with a perc_benchmark column:
player date metric_a perc_benchmark
1: A 2022-08-01 0.5 0.50
2: B 2022-08-01 0.5 0.50
3: A 2022-08-02 0.5 0.50
4: B 2022-08-02 0.5 0.50
5: A 2022-08-03 0.5 0.50
6: B 2022-08-03 0.5 0.50
7: A 2022-08-04 0.5 0.50
8: B 2022-08-04 0.5 0.50
9: A 2022-08-05 0.5 0.50
10: B 2022-08-05 0.5 0.50
11: A 2022-08-06 0.5 0.50
12: B 2022-08-06 0.5 0.50
13: A 2022-08-07 0.5 0.50
14: B 2022-08-07 0.5 0.25
15: A 2022-08-08 0.5 0.50
16: B 2022-08-08 1.5 0.75
17: A 2022-08-09 0.5 0.50
18: B 2022-08-09 1.5 0.75
19: A 2022-08-10 0.5 0.50
20: B 2022-08-10 1.5 0.75
21: A 2022-08-11 0.5 0.50
22: B 2022-08-11 1.5 0.75
23: A 2022-08-12 0.5 0.50
24: B 2022-08-12 1.5 0.75
25: A 2022-08-13 0.5 0.50
26: B 2022-08-13 1.5 0.75
27: A 2022-08-14 0.5 0.50
28: B 2022-08-14 1.5 0.75
29: A 2022-08-15 0.5 0.50
30: B 2022-08-15 3.0 1.50
31: A 2022-08-16 0.5 0.50
32: B 2022-08-16 3.0 1.00
33: A 2022-08-17 0.5 0.50
34: B 2022-08-17 3.0 1.00
35: A 2022-08-18 0.5 0.50
36: B 2022-08-18 3.0 1.00
37: A 2022-08-19 0.5 0.50
38: B 2022-08-19 3.0 1.00
39: A 2022-08-20 0.5 0.50
40: B 2022-08-20 3.0 1.00
41: A 2022-08-21 0.5 0.50
42: B 2022-08-21 3.0 1.00
 player date metric_a perc_benchmark

Note
If you want to convert it back to a tibble, simply call as_tibble() on result.
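If staying entirely in dplyr is preferred, the same logic can be sketched with an inequality join, assuming dplyr 1.1.0 or later, where join_by() accepts conditions such as >=. This is an alternative sketch rather than the approach used above; the names bench_date and result_tbl are introduced here for illustration only.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, which provides join_by()

# Sample data from the question.
df <- tibble(
  player = rep(LETTERS[1:2], times = 21),
  date = rep(seq(as.Date("2022-08-01"), by = "day", length.out = 21), each = 2),
  metric_a = c(rep(0.5, times = 14), rep(c(0.5, 1.5), times = 7), rep(c(0.5, 3), times = 7))
)
benchmark_df <- tibble(
  player = rep(LETTERS[1:2], times = 3),
  date = as.Date(rep(c("2022-08-01", "2022-08-07", "2022-08-16"), each = 2)),
  benchmark_metric_a = c(1, 1, 0.5, 2, 1, 3)
)

result_tbl <- df %>%
  # Keep every benchmark row for the same player dated on or before the metric date.
  # 'bench_date' is a hypothetical name chosen to avoid a clash with 'date' in 'df'.
  left_join(
    rename(benchmark_df, bench_date = date),
    by = join_by(player, date >= bench_date)
  ) %>%
  # Collapse back to one row per player and date, taking the max benchmark so far.
  group_by(player, date) %>%
  summarise(
    metric_a = first(metric_a),
    benchmark_metric_a = max(benchmark_metric_a),
    .groups = "drop"
  ) %>%
  mutate(perc_benchmark = metric_a / benchmark_metric_a) %>%
  select(player, date, metric_a, perc_benchmark) %>%
  arrange(date, player)
```

Because this pipeline starts from a tibble, result_tbl is already a tibble and needs no conversion. On large data the data.table chain may still be the faster option.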
https://stackoverflow.com/questions/73547406