我使用以下脚本从一个表中查找统计摘要:
select group_id, --group_id=0 a constant
percentile_approx(A , 0.5) as A_mdn,
percentile_approx(A , 0.25) as A_Q1,
percentile_approx(A , 0.75) as A_Q3,
percentile_approx(A , array(0.2,0.15, 0.1,0.05,0.025,0.001)) as A_i,
min(A) as min_A,
percentile_approx(B , 0.5) as B_mdn,
percentile_approx(B , 0.25) as B_Q1,
percentile_approx(B , 0.75) as B_Q3,
percentile_approx(B , array(0.8,0.85, 0.9, 0.95,0.975)) as B_i
from table
group by group_id;我得到的结果是:
0
73.21058033222496
73.21058033222496
462.16968382794516
[73.21058033222496,73.21058033222496,73.21058033222496,73.21058033222496,73.21058033222496,73.21058033222496]
0.0
1.0
1.0
2.0
[2.0,3.0,4.0,8.11278644563614,17.0]然后,我将代码修改如下:
select group_id, --group_id=0 a constant
percentile(cast(A as bigint), 0.5) as A_mdn,
percentile(cast(A as bigint), 0.25) as A_Q1,
percentile(cast(A as bigint), 0.75) as A_Q3,
percentile(cast(A as bigint), array(0.2,0.15, 0.1,0.05,0.025,0.001)) as A_i,
min(A) as min_A,
percentile(cast(B as bigint), 0.5) as B_mdn,
percentile(cast(B as bigint), 0.25) as B_Q1,
percentile(cast(B as bigint), 0.75) as B_Q3,
percentile(cast(B as bigint), array(0.8,0.85, 0.9, 0.95,0.975)) as B_i
from table
group by group_id新的结果是:
0
72.0
6.0
762.0
[3.0,1.0,1.0,0.0,0.0,0.0]
0.0
1.0
1.0
2.0
[2.0,3.0,4.0,9.0,17.0]为了再次验证真相,我还将此表加载到R。下面是R-结果:
A:
Min 0
Q1: 6
Median: 72
Q3: 762
0.2 quantile: 3
0.15 quantile: 1.5
0.1 quantile: 1
0.05 quantile: 0
0.025 quantile:0
0.001 quantile:0
B
Q1: 1
Median: 1
Q3: 2
0.8 quantile: 2
0.85 quantile: 3
0.9 quantile: 4
0.95 quantile: 9
0.975 quantile:17显然,R的结果与百分位数函数是一致的,但是percentile_approx给出了错误的答案。
发布于 2020-06-23 00:50:04
是的,percentile_approx没有任何近似保证,除非您将accuracy设置为大于或等于数据点的#。
通过快速阅读,要点是它创建了accuracy桶,然后当桶用完时,它通过找到两个最近的桶并将它们与加权和结合起来来合并桶。
不过,这将与各种输入中断。特别是,如果您的数据点非常高/非常低,并且彼此间隔很远,则会破坏算法。如果您首先将数据剪辑到不存在很多异常值的范围内,那么它的性能应该会更好。
但是,如果您的数据太偏斜,您可以考虑随机抽样数据并计算非接近百分位数。
发布于 2016-11-30 01:45:43
如果“所有”值都是整数,则此函数返回一个真值。你说过几乎所有的A和B都是整数。
尝试将完整的A列转换为int,看看是否接近了答案。
我不认为,你会得到和R完全一样的答案,因为R的百分位数函数很可能也是非整数的。
获得确切答案的一种方法是编写自己的UDF并使用它。
希望这能有所帮助!
https://stackoverflow.com/questions/40877784
复制相似问题