我有一个timeseries数据集,如下所示:
uid | geom | date | count |
-------------------------------------
1 | FeatureA | 2016-02-01 | 1 |
2 | FeatureA | 2016-02-02 | 2 |
3 | FeatureA | 2016-02-03 | 3 |
4 | FeatureA | 2016-02-04 | 4 |
5 | FeatureA | 2016-02-05 | 5 |
6 | FeatureA | 2016-02-06 | 9 |
7 | FeatureA | 2016-02-07 | 11 |
8 | FeatureA | 2016-02-08 | 15 |
9 | FeatureA | 2016-02-09 | 17 |
10 | FeatureA | 2016-02-10 | 20 |
11 | FeatureB | 2016-02-01 | 2 |
12 | FeatureB | 2016-02-02 | 2 |
13 | FeatureB | 2016-02-03 | 8 |
14 | FeatureB | 2016-02-04 | 4 |
15 | FeatureB | 2016-02-05 | 5 |
16 | FeatureB | 2016-02-06 | 15 |
17 | FeatureB | 2016-02-07 | 11 |
18 | FeatureB | 2016-02-08 | 15 |
19 | FeatureB | 2016-02-09 | 19 |
20 | FeatureB | 2016-02-10 | 25 | 我想计算数据集中的每个特征的7天滚动平均值(~2000特征)。为了计算Postgres的7天滚动平均值,可以使用描述的这里窗口。
以下代码几乎有效:
SELECT geom,date,count,
AVG(count)
OVER(PARTITION BY geom ORDER BY geom, date ROWS BETWEEN CURRENT ROW AND 7 Following) AS rolling_avg_count
FROM features;这提供了以下输出:
geom | date | count | rolling_avg_count
--------------------------------------------------
FeatureA | 2016-02-01 | 1 | 6.25
FeatureA | 2016-02-02 | 2 | 8.25
FeatureA | 2016-02-03 | 3 | 10.5
FeatureA | 2016-02-04 | 4 | 11.57
FeatureA | 2016-02-05 | 5 | 12.83
FeatureA | 2016-02-06 | 9 | 14.4
FeatureA | 2016-02-07 | 11 | 15.75
FeatureA | 2016-02-08 | 15 | 17.33
FeatureA | 2016-02-09 | 17 | 18.5
FeatureA | 2016-02-10 | 20 | 20
FeatureB | 2016-02-01 | 2 | 7.75
FeatureB | 2016-02-02 | 2 | 9.875
FeatureB | 2016-02-03 | 8 | 12.75
FeatureB | 2016-02-04 | 4 | 13.43
FeatureB | 2016-02-05 | 5 | 15
FeatureB | 2016-02-06 | 15 | 17
FeatureB | 2016-02-07 | 11 | 17.5
FeatureB | 2016-02-08 | 15 | 19.67
FeatureB | 2016-02-10 | 25 | 25但是,输出将继续计算平均值,直到分区结束为止。例如,uid 10的滚动平均值为20 (仅使用一条记录计算)。如果后面的行少于7行,我希望计算停止。
理想情况下,输出应该如下所示:
geom | date | count | rolling_avg_count
--------------------------------------------------
FeatureA | 2016-02-01 | 1 | 6.25
FeatureA | 2016-02-02 | 2 | 8.25
FeatureA | 2016-02-03 | 3 | 10.5
FeatureA | 2016-02-04 | 4 | 11.57
FeatureA | 2016-02-05 | 5 |
FeatureA | 2016-02-06 | 9 |
FeatureA | 2016-02-07 | 11 |
FeatureA | 2016-02-08 | 15 |
FeatureA | 2016-02-09 | 17 |
FeatureA | 2016-02-10 | 20 |
FeatureB | 2016-02-01 | 2 | 7.75
FeatureB | 2016-02-02 | 2 | 9.875
FeatureB | 2016-02-03 | 8 | 12.75
FeatureB | 2016-02-04 | 4 | 13.43
FeatureB | 2016-02-05 | 5 |
FeatureB | 2016-02-06 | 15 |
FeatureB | 2016-02-07 | 11 |
FeatureB | 2016-02-08 | 15 |
FeatureB | 2016-02-10 | 25 | 发布于 2020-05-22 00:02:04
这将产生所需的结果--在调整了-1的错误后,2:
SELECT geom, date, count
, CASE WHEN rn < 7 THEN NULL -- ②
ELSE round(rolling_avg_count, 2) END AS rolling_avg_count
FROM (
SELECT geom, date, count
, AVG(count) OVER (PARTITION BY geom
ORDER BY date -- ①
ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING -- ②
) AS rolling_avg_count
, row_number() OVER (PARTITION BY geom ORDER BY date DESC) AS rn
FROM features
) sub;db<>fiddle 这里
1使用PARTITION BY geom,您不需要在ORDER BY中使用geom。
你犯了一个错误(S):
我想计算一下7天的滚动平均值。
但是,您显示了8天滚动平均值(1 + 7)的计算。
如果后面的行少于7行,我希望计算停止。
但是您只显示了以下6行的结果(显示值11.57和13.43)。
我使它成为一个实际的7天平均,停止时,有不到6行以下。
由于第二个窗口函数中的降序排序顺序,上述内容为查询计划添加了一个额外的排序。
这个替代查询避免了使用附加的count():
SELECT geom, date, count
, CASE WHEN rn < 6 THEN NULL -- ③
ELSE round(rolling_avg_count, 2) END AS rolling_avg_count
FROM (
SELECT geom, date, count
, avg(count) OVER (PARTITION BY geom
ORDER BY date
ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING
) AS rolling_avg_count
, count(*) OVER (PARTITION BY geom)
- row_number() OVER (PARTITION BY geom ORDER BY date) AS rn
FROM features
) sub;db<>fiddle 这里
再详细一点,添加一个额外的窗口函数。但我希望它能表现得更好。
3调整为0基行号。(比添加+ 1便宜。)
https://dba.stackexchange.com/questions/267610
复制相似问题