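I'd like to use Hmisc::wtd.quantile to build a new df from a data file that has many duplicate dates. I'm grouping by date, aggregating over the dates with summarize(), and trying to apply wtd.quantile() (with weights) to each date. It's also a fairly large dataset. Here is some sample code: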
# sample data
# grouping_var = dt_time
require(Hmisc)
require(plyr)
require(dplyr)
df <- data.frame(type = sample(letters[1:2], 10e6, replace = TRUE),
                 score = sample(500:899, 10e6, replace = TRUE),
                 dt_time = sample(seq(as.Date('2010/01/01'),
                                      as.Date('2018/01/01'),
                                      by = "day"), 10e6, replace = TRUE),
                 weight = sample(1.0:2.0, 10e6, replace = TRUE))
# my attempt:
ptiles <- df %>%
  group_by(dt_time) %>%
  plyr::ddply(~dt_time, dplyr::summarize,
              ptile10 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .1, na.rm = TRUE),
              ptile50 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .5, na.rm = TRUE),
              ptile90 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .9, na.rm = TRUE))
# desired df,
# where each new variable would be created using the
# wtd.quantile function:
desired_ptiles <- data.frame(dt_time = seq(as.Date('2010/01/01'),
                                           as.Date('2010/01/06'),
                                           by = "day"),
                             # only 6 because lol 10e6
                             ptile10 = sample(500:899, 6, replace = TRUE),
                             ptile50 = sample(500:899, 6, replace = TRUE),
                             ptile90 = sample(500:899, 6, replace = TRUE))
So far, my attempts have resulted in this error:
Error in summarise_impl(.data, dots) :
  Evaluation error: 'arg' must be NULL or a character vector.
When using formula notation:
ptiles <- df %>%
  group_by(dt_time) %>%
  plyr::ddply(~dt_time, dplyr::summarize,
              ptile10 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .1, na.rm = TRUE),
              ptile50 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .5, na.rm = TRUE),
              ptile90 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .9, na.rm = TRUE))
# error message:
Error in summarise_impl(.data, dots) :
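  Evaluation error: 'arg' must be NULL or a character vector.
Am I going about this the wrong way? I've seen approaches that use split(), but that seems tedious. Is there a data.table approach that allows summarizing with wtd.quantile() in this way?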
Thanks!
Answered 2018-09-20 07:08:37
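You don't need group_by when using ddply, because ddply already splits the data by group. Also, once the data is grouped you don't need to pass the data itself into the calls inside summarize; refer to the columns directly.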
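As a side note on where the error message likely comes from: the sketch below uses a stand-in function, wq_signature, whose argument order copies the documented signature of Hmisc::wtd.quantile (that order is an assumption of this sketch; check ?Hmisc::wtd.quantile). It shows that in the call from the question the unnamed `.$score` is matched to `type`, and a non-character `type` is what base R's match.arg() complains about. Whether Hmisc reaches match.arg() in exactly this way is also an assumption.

```r
# Stand-in with the same argument order as the documented signature of
# Hmisc::wtd.quantile (hypothetical helper; only the formals matter here).
wq_signature <- function(x, weights = NULL, probs = c(0, .25, .5, .75, 1),
                         type = c("quantile", "(i-1)/(n-1)", "i/(n+1)", "i/n"),
                         normwt = FALSE, na.rm = TRUE) NULL

# The call from the question, matched against that signature.
matched <- match.call(
  wq_signature,
  quote(wq_signature(., .$score, weights = .$weight, probs = .1, na.rm = TRUE))
)
print(matched)

# `.` takes the first formal (x); weights, probs and na.rm are named, so the
# unnamed `.$score` falls into the next free formal, which is `type`.
type_arg <- deparse(matched$type)

# match.arg() rejects a non-character value with the message seen in the question.
msg <- tryCatch(
  match.arg(600, c("quantile", "i/n")),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Dropping the leading `.` and the `.$` prefixes, as in the code that follows, puts score in the `x` position and leaves `type` at its default.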
This works:
ptiles <- df %>%
  group_by(dt_time) %>%
  summarize(ptile10 = wtd.quantile(score, weights = weight,
                                   probs = .1, na.rm = TRUE),
            ptile50 = wtd.quantile(score, weights = weight,
                                   probs = .5, na.rm = TRUE),
            ptile90 = wtd.quantile(score, weights = weight,
                                   probs = .9, na.rm = TRUE))
> ptiles
# A tibble: 2,923 x 4
dt_time ptile10 ptile50 ptile90
<date> <dbl> <dbl> <dbl>
1 2010-01-01 539.0 697 859.0
2 2010-01-02 538.0 704 861.7
3 2010-01-03 541.0 706 862.0
4 2010-01-04 541.0 702 859.0
5 2010-01-05 540.0 706 860.0
6 2010-01-06 537.0 695 859.0
7 2010-01-07 539.0 696 859.0
8 2010-01-08 536.0 700 857.0
9 2010-01-09 538.0 694 861.0
10 2010-01-10 538.4 701 859.0
# ... with 2,913 more rows
Answered 2018-09-20 11:02:09
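Here is an alternative solution that uses a function (based on @LAP's solution) to iterate over a set of quantile values you supply (i.e. c(.1, .5, .9)) and automatically create the appropriate columns: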
require(Hmisc)
require(plyr)
require(tidyverse)
# for reproducibility
set.seed(5)
# example dataset
df <- data.frame(type = sample(letters[1:2], 50000, replace = TRUE),
                 score = sample(500:899, 50000, replace = TRUE),
                 dt_time = sample(seq(as.Date('2010/01/01'),
                                      as.Date('2018/01/01'),
                                      by = "day"), 50000, replace = TRUE),
                 weight = sample(1.0:2.0, 50000, replace = TRUE))
# function to get quantiles per date
f = function(x) {
  df %>%
    group_by(dt_time) %>%
    dplyr::summarize(value = wtd.quantile(score, weights = weight,
                                          probs = x, na.rm = TRUE)) %>%
    mutate(ptile = x)
}
# input quantiles of interest
q = c(.1, .5, .9)
# apply function to each quantile of interest
# and reshape
map_df(q, f) %>%
spread(ptile, value, sep = "_")
# # A tibble: 2,923 x 4
# dt_time ptile_0.1 ptile_0.5 ptile_0.9
# <date> <dbl> <dbl> <dbl>
# 1 2010-01-01 543 670. 857
# 2 2010-01-02 549 800 868.
# 3 2010-01-03 529. 705 829.
# 4 2010-01-04 543 690 850.
# 5 2010-01-05 528. 642. 870.
# 6 2010-01-06 534. 649 810.
# 7 2010-01-07 512. 668 801.
# 8 2010-01-08 554 664 869.
# 9 2010-01-09 533. 650 765
#10 2010-01-10 535. 696 843.
# # ... with 2,913 more rows
https://stackoverflow.com/questions/52418838
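The question also asked whether data.table can do this. Below is a minimal sketch, assuming the data.table and Hmisc packages are installed; the objects dt and ptiles_dt, the 10-day date range, and the 1,000-row sample are made up for illustration and do not come from the answers above. It calls wtd.quantile() once per date with a vector of probs and spreads the result into columns.

```r
library(data.table)

set.seed(1)
dates <- seq(as.Date("2010-01-01"), as.Date("2010-01-10"), by = "day")

# Small illustrative sample (hypothetical; much smaller than the question's data)
dt <- data.table(
  score   = sample(500:899, 1000, replace = TRUE),
  dt_time = sample(dates, 1000, replace = TRUE),
  weight  = sample(1:2, 1000, replace = TRUE)
)

# One wtd.quantile() call per date returns all three quantiles as a vector;
# as.list() turns that vector into one column per quantile.
ptiles_dt <- dt[, as.list(Hmisc::wtd.quantile(score, weights = weight,
                                              probs = c(.1, .5, .9),
                                              na.rm = TRUE)),
                keyby = dt_time]

# Replace the quantile labels with the column names used in the question.
setnames(ptiles_dt, c("dt_time", "ptile10", "ptile50", "ptile90"))
print(ptiles_dt)
```

Passing all probabilities in one call means each date group is processed once, rather than once per quantile.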