How to find the max value that matches multiple criteria in R?

Stack Overflow user
Asked 2022-08-30 19:10:17

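I have two data frames: one with daily metrics, `df`, and another with dated benchmark metrics, `benchmark_df`. I want to express each daily metric as a percentage of the maximum benchmark metric, but only counting benchmarks whose date is on or before the date of the daily metric.

So a straightforward answer would be a `left_join()` followed by `mutate(perc_benchmark = metric_a / benchmark_metric_a)`, where `benchmark_metric_a` is the max value among benchmarks dated on or before the `metric_a` date. But I'm not sure whether I can do this with `left_join()`, or whether there is a workaround.

So, in the example data frames below, player B would have a `perc_benchmark` of 50% (0.5/1) from 8-1 through 8-6, 8-7 would be 25% (0.5/2), 8-8 would be 75% (1.5/2), and so on.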

```r
library(dplyr)  # re-exports tibble()

df <- tibble(player =  rep(LETTERS[1:2], times = 21),
             date = as.Date(rep(seq(as.Date('2022-08-01'), by = 'day', length.out= 21), each = 2)),
             # metric_a values are only patterned for testing purposes, they would not be in the actual dataframe
             metric_a = c(rep(0.5, times = 14),rep(c(0.5, 1.5), times = 7),rep(c(0.5, 3), times = 7)))

benchmark_df <- tibble(player = rep(LETTERS[1:2], times = 3),
                       date = as.Date(rep(c('2022-08-01', '2022-08-07', '2022-08-16'), each = 2)),
                       benchmark_metric_a = c(1,1,0.5,2,1,3))
```

1 Answer

Stack Overflow user

Accepted answer

Posted 2022-08-30 19:56:08

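The dplyr package is developing support for the kind of join you describe:

"`left_join()` ... where `benchmark_metric_a` is the max value less than or equal to the `metric_a` date"

In the meantime, here is a solution that uses a non-equi join from the data.table package, whose performance is an advantage at scale.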
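For comparison, dplyr 1.1.0 and later provide `join_by()`, which accepts inequality conditions inside `left_join()`. The following is a minimal sketch of how the join described above might look there, assuming such a dplyr version is installed. The names `bench_date` and `result_dplyr` are illustrative and not part of the original answer.

```r
library(dplyr)

# Example data, condensed from the question.
df <- tibble(
  player   = rep(LETTERS[1:2], times = 21),
  date     = rep(seq(as.Date("2022-08-01"), by = "day", length.out = 21), each = 2),
  metric_a = c(rep(0.5, 14), rep(c(0.5, 1.5), 7), rep(c(0.5, 3), 7))
)
benchmark_df <- tibble(
  player = rep(LETTERS[1:2], times = 3),
  date   = as.Date(rep(c("2022-08-01", "2022-08-07", "2022-08-16"), each = 2)),
  benchmark_metric_a = c(1, 1, 0.5, 2, 1, 3)
)

result_dplyr <- df %>%
  # Rename the benchmark date (hypothetical name) so the two dates stay distinct,
  # then match players and keep benchmarks dated on or before the metric date.
  left_join(
    rename(benchmark_df, bench_date = date),
    by = join_by(player, date >= bench_date)
  ) %>%
  # Collapse to one row per player and date, keeping the largest benchmark.
  summarise(
    metric_a = first(metric_a),
    benchmark_metric_a = max(benchmark_metric_a),
    .by = c(player, date)
  ) %>%
  mutate(perc_benchmark = metric_a / benchmark_metric_a) %>%
  select(player, date, metric_a, perc_benchmark)
```

Renaming the benchmark date before the join avoids relying on the suffixes that dplyr would otherwise attach to the two `date` columns.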

Solution

First, prepare your datasets as `data.table`s:

```r
library(data.table)


# ...
# Code to generate 'df' and 'benchmark_df'.
# ...


# Convert datasets into 'data.table's.
setDT(df)
setDT(benchmark_df)
```
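Note that `setDT()` converts its argument in place (by reference), which is why its result is not assigned to anything. A small sketch of the difference from the copying alternative `as.data.table()`, using throwaway objects whose names (`toy`, `toy_df`, `toy_copy`) are purely illustrative:

```r
library(data.table)

# 'toy' is a hypothetical stand-in for 'df' or 'benchmark_df'.
toy <- data.frame(player = c("A", "B"), metric_a = c(0.5, 1.5))

# Converted in place: no assignment needed.
setDT(toy)
is.data.table(toy)

# as.data.table() returns a converted copy and leaves its input untouched.
toy_df   <- data.frame(player = c("A", "B"), metric_a = c(0.5, 1.5))
toy_copy <- as.data.table(toy_df)
is.data.table(toy_df)
is.data.table(toy_copy)
```

If the original tibbles need to stay as they are, `as.data.table()` may be the safer choice.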

Then apply the following data.table "chain", which resembles a dplyr workflow:

```r
# Perform a LEFT JOIN of 'df' to 'benchmark_df'...
result <- benchmark_df[df, .(
    # ...using a subset of columns...
    player,
    bench_date = x.date,
    benchmark_metric_a,
    metric_date = i.date,
    metric_a
  ),
  # ...where players match and benchmark dates are earlier (or concurrent).
  on = .(player, date <= date)

# Calculate the max benchmark...
][, .(
    benchmark_metric_a = max(benchmark_metric_a),
    # ...while preserving the metric...
    metric_a = first(metric_a)
  ),
  # ...for each player as of each date.
  by = .(player, metric_date)

# Calculate the percentage.
][, `:=`(
  perc_benchmark = metric_a / benchmark_metric_a

# Select the desired columns.
)][, .(
  player,
  date = metric_date,
  metric_a,
  perc_benchmark

# Restore original ordering from 'df': sort by 'date' and then 'player'.
)][order(
  date,
  player
)]
```
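As a point of comparison, the same quantity can probably also be reached with a different technique: precompute a running maximum per player with `cummax()`, then use a rolling join (`roll = TRUE`) so that each daily row picks up the most recent benchmark row on or before its date. This is a sketch of an alternative, not the approach used in the chain above; the names `running_max` and `alt` are illustrative.

```r
library(data.table)

# Example data, condensed from the question and built directly as data.tables.
df <- data.table(
  player   = rep(LETTERS[1:2], times = 21),
  date     = rep(seq(as.Date("2022-08-01"), by = "day", length.out = 21), each = 2),
  metric_a = c(rep(0.5, 14), rep(c(0.5, 1.5), 7), rep(c(0.5, 3), 7))
)
benchmark_df <- data.table(
  player = rep(LETTERS[1:2], times = 3),
  date   = as.Date(rep(c("2022-08-01", "2022-08-07", "2022-08-16"), each = 2)),
  benchmark_metric_a = c(1, 1, 0.5, 2, 1, 3)
)

# Running maximum of the benchmark for each player, in date order.
setorder(benchmark_df, player, date)
benchmark_df[, running_max := cummax(benchmark_metric_a), by = player]

# Rolling join: each 'df' row takes the latest benchmark row on or before its date.
alt <- benchmark_df[df, on = .(player, date), roll = TRUE]
alt[, perc_benchmark := metric_a / running_max]
alt <- alt[, .(player, date, metric_a, perc_benchmark)]
```

Because the rolling join returns one row per daily metric, no grouping step is needed afterwards; the trade-off is the extra `running_max` column added to `benchmark_df`.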

Result

Given a `df` and a `benchmark_df` like those in your example

```r
library(dplyr)

df <- tibble(
  player =  rep(LETTERS[1:2], times = 21),
  date = as.Date(rep(seq(as.Date('2022-08-01'), by = 'day', length.out= 21), each = 2)),
  # metric_a values are only patterned for testing purposes, they would not be in the actual dataframe
  metric_a = c(rep(0.5, times = 14),rep(c(0.5, 1.5), times = 7),rep(c(0.5, 3), times = 7))
)

benchmark_df <- tibble(
  player = rep(LETTERS[1:2], times = 3),
  date = as.Date(rep(c('2022-08-01', '2022-08-07', '2022-08-16'), each = 2)),
  benchmark_metric_a = c(1,1,0.5,2,1,3)
)
```

this chain should yield the following `result`, in which the `df` data is now augmented with a `perc_benchmark` column:

```
    player       date metric_a perc_benchmark
 1:      A 2022-08-01      0.5           0.50
 2:      B 2022-08-01      0.5           0.50
 3:      A 2022-08-02      0.5           0.50
 4:      B 2022-08-02      0.5           0.50
 5:      A 2022-08-03      0.5           0.50
 6:      B 2022-08-03      0.5           0.50
 7:      A 2022-08-04      0.5           0.50
 8:      B 2022-08-04      0.5           0.50
 9:      A 2022-08-05      0.5           0.50
10:      B 2022-08-05      0.5           0.50
11:      A 2022-08-06      0.5           0.50
12:      B 2022-08-06      0.5           0.50
13:      A 2022-08-07      0.5           0.50
14:      B 2022-08-07      0.5           0.25
15:      A 2022-08-08      0.5           0.50
16:      B 2022-08-08      1.5           0.75
17:      A 2022-08-09      0.5           0.50
18:      B 2022-08-09      1.5           0.75
19:      A 2022-08-10      0.5           0.50
20:      B 2022-08-10      1.5           0.75
21:      A 2022-08-11      0.5           0.50
22:      B 2022-08-11      1.5           0.75
23:      A 2022-08-12      0.5           0.50
24:      B 2022-08-12      1.5           0.75
25:      A 2022-08-13      0.5           0.50
26:      B 2022-08-13      1.5           0.75
27:      A 2022-08-14      0.5           0.50
28:      B 2022-08-14      1.5           0.75
29:      A 2022-08-15      0.5           0.50
30:      B 2022-08-15      3.0           1.50
31:      A 2022-08-16      0.5           0.50
32:      B 2022-08-16      3.0           1.00
33:      A 2022-08-17      0.5           0.50
34:      B 2022-08-17      3.0           1.00
35:      A 2022-08-18      0.5           0.50
36:      B 2022-08-18      3.0           1.00
37:      A 2022-08-19      0.5           0.50
38:      B 2022-08-19      3.0           1.00
39:      A 2022-08-20      0.5           0.50
40:      B 2022-08-20      3.0           1.00
41:      A 2022-08-21      0.5           0.50
42:      B 2022-08-21      3.0           1.00
    player       date metric_a perc_benchmark
```
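One way to sanity-check output like this is a brute-force computation in base R that applies the rule literally: for each daily row, take the largest benchmark for the same player dated on or before that row's date. The sketch below rebuilds the example data as plain data frames; the names `max_benchmark` and `check` are illustrative.

```r
# Example data, condensed from the question.
df <- data.frame(
  player   = rep(LETTERS[1:2], times = 21),
  date     = rep(seq(as.Date("2022-08-01"), by = "day", length.out = 21), each = 2),
  metric_a = c(rep(0.5, 14), rep(c(0.5, 1.5), 7), rep(c(0.5, 3), 7))
)
benchmark_df <- data.frame(
  player = rep(LETTERS[1:2], times = 3),
  date   = as.Date(rep(c("2022-08-01", "2022-08-07", "2022-08-16"), each = 2)),
  benchmark_metric_a = c(1, 1, 0.5, 2, 1, 3)
)

# For each daily row, find the max benchmark for that player dated on or before it.
max_benchmark <- vapply(seq_len(nrow(df)), function(i) {
  eligible <- benchmark_df$player == df$player[i] &
    benchmark_df$date <= df$date[i]
  max(benchmark_df$benchmark_metric_a[eligible])
}, numeric(1))

check <- data.frame(df, perc_benchmark = df$metric_a / max_benchmark)

# Spot-check player B around the dates where the benchmark changes.
spot_dates <- as.Date(c("2022-08-06", "2022-08-07", "2022-08-08",
                        "2022-08-15", "2022-08-16"))
check$perc_benchmark[check$player == "B" & check$date %in% spot_dates]
# Matches the table above: 0.50 0.25 0.75 1.50 1.00
```

This loop scales poorly compared with a join, so it is only meant as a cross-check on small data.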

Note

If you want to convert it back to tibble form, simply use `as_tibble()` on `result`.
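A minimal sketch of that conversion, assuming the tibble package is installed. The `result` built here is a small hypothetical stand-in for the chain's output, and `result_tbl` is an illustrative name.

```r
library(data.table)
library(tibble)

# Hypothetical stand-in for the 'result' produced by the chain.
result <- data.table(
  player         = c("A", "B"),
  date           = as.Date(c("2022-08-07", "2022-08-07")),
  metric_a       = c(0.5, 0.5),
  perc_benchmark = c(0.5, 0.25)
)

# Convert the data.table into a tibble; 'result' itself is left unchanged.
result_tbl <- as_tibble(result)
class(result_tbl)
```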

Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/73547406
