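I'm trying to calculate the median time people spend on a particular category. The full dataset has about 500k rows, but I've summarized a small piece of it below.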

person   category       time spent (in mins)
roger    dota           20
jim      dota           50
joe      call of duty   5
jim      fallout        25
kathy    GTA            40
alicia   fallout        100

I've tried the query below, but I'm not getting anywhere with it.

SELECT x1.person, x1.time spent
from data x1, data x2
GROUP BY x1.val
HAVING SUM(SIGN(1-SIGN(x2.val-x1.val))) = (COUNT(*)+1)/2

Posted on 2015-11-18 07:21:08

A self-join on 500,000 rows is likely to be expensive. Why not just enumerate the rows and grab the one in the middle?

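select d.*
from (select d.*, (@rn := @rn + 1) as rn
      from data d cross join
           (select @rn := 0) params
      order by d.val
     ) d
where 2*rn in (@rn, @rn + 1);

The odd-looking where clause picks the middle row. With an even number of rows this is only an approximation, because it returns the lower of the two middle rows. The approximation is needed here because you want an actual row from the table. The usual calculation of the median value itself would be:
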
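select avg(d.val)
from (select d.*, (@rn := @rn + 1) as rn
      from data d cross join
           (select @rn := 0) params
      order by d.val
     ) d
-- @rn now holds the total row count n: an odd n matches rn = (n + 1) / 2,
-- an even n matches rn = n / 2 and rn = n / 2 + 1
where 2*rn in (@rn, @rn + 1, @rn + 2);

Edit:
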
The same logic works per person, with a bit more logic to get the count for each person:

select d.person, avg(d.val) as median
from (select d.*,
             -- restart the row number at 1 whenever the person changes
             (@rn := if(@p = person, @rn + 1,
                        if(@p := person, 1, 1)
                       )
             ) as rn
      from data d cross join
           (select @rn := 0, @p := '') params
      order by person, d.val
     ) d join
     (select person, count(*) as cnt
      from data
      group by person
     ) p
     on d.person = p.person
-- cnt comes from the per-person count subquery p, not from d
where 2*rn in (p.cnt, p.cnt + 1, p.cnt + 2)
group by d.person;

https://stackoverflow.com/questions/33768889
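
The row-number arithmetic behind these where clauses is easy to get off by one, so it can help to check it outside the database. The following is a minimal Python sketch, not part of the original answer. The helper names `middle_positions` and `median_by_row_number` are hypothetical, and the sample values are the six "time spent" figures from the question. It assumes a non-empty list and 1-based row numbers, as produced by the `@rn` counter. It illustrates that selecting rows with `2*rn in (n, n + 1, n + 2)` keeps the single middle row when the count `n` is odd and the two middle rows when it is even.

```python
import statistics


def middle_positions(n, offsets=(0, 1, 2)):
    """Return the 1-based row numbers rn for which 2*rn is in {n + k for k in offsets}."""
    targets = {n + k for k in offsets}
    return [rn for rn in range(1, n + 1) if 2 * rn in targets]


def median_by_row_number(values):
    """Average the middle row(s) of the sorted values, mirroring avg(d.val)."""
    ordered = sorted(values)
    picked = [ordered[rn - 1] for rn in middle_positions(len(ordered))]
    return sum(picked) / len(picked)


# Sample "time spent" values taken from the question's table.
times = [20, 50, 5, 25, 40, 100]

print(middle_positions(5))  # odd count: a single middle row
print(middle_positions(6))  # even count: the two middle rows
print(median_by_row_number(times), statistics.median(times))
```

Passing `offsets=(0, 1)` reproduces the two-element list from the row-returning query, which keeps exactly one row for any count. Comparing the result with `statistics.median` on a few lists of different lengths is a cheap way to confirm a where clause before running it on the full table.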