I have a dataframe like this:
STOREID VARIANT_ARTICLE PO_DATE UNITSUM
0 st123 12345 20200427 9.0
1 st123 12345 20200428 3.0
2 st123 12345 20200429 13.0
3 st123 12345 20200430 7.0
4 st123 12345 20200501 16.0
5 st123 12345 20200502 3.0
6 st123 12345 20200503 5.0
7 st123 12345 20200504 10.0
8 st123 12345 20200505 3.0
9 st123 12345 20200506 7.0
10 st123 12345 20200507 29.0
11 st123 12345 20200508 4.0
12 st123 12345 20200509 9.0
13 st123 12345 20200510 8.0
14 st123 12345 20200511 5.0
15 st123 12345 20200513 8.0
16 st123 12345 20200514 2.0
17 st123 12345 20200515 2.0
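18 st123 12345 20200516 2.0
I want to compute a rolling sum and average of the UNITSUM column. The catch is that the window has to cover the past 4 days (for example), not the previous 4 records. For row 15 in my example, that means the date range to aggregate is 20200510 - 20200513. Because there is no entry for 20200512, the aggregation should run over the 3 available rows and must not include 20200509 (which is what pandas does when the rolling window is a row count).
Is there a way to do this?
Edit: I have to use the dask dataframe API to do this.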
Posted on 2020-06-20 02:45:36
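Dask dataframes have the same syntax as the pandas API. The session here uses an integer window, rolling(4), which counts rows: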
In [38]: ddf = dask.datasets.timeseries()

In [39]: ddf.head()
Out[39]:
                       id    name         x         y
timestamp
2000-01-01 00:00:00  1003  George -0.287285  0.773949
2000-01-01 00:00:01   992  Oliver -0.738190  0.893916
2000-01-01 00:00:02   972   Jerry  0.080410 -0.972037
2000-01-01 00:00:03   970  George -0.402327  0.034718
2000-01-01 00:00:04  1034   Alice -0.694517  0.646178

In [40]: ddf.x.rolling(4).agg({'sum': 'sum', 'mean': 'mean'}).head()
Out[40]:
                          sum      mean
timestamp
2000-01-01 00:00:00       NaN       NaN
2000-01-01 00:00:01       NaN       NaN
2000-01-01 00:00:02       NaN       NaN
2000-01-01 00:00:03 -1.347393 -0.336848
2000-01-01 00:00:04 -1.754625 -0.438656

https://stackoverflow.com/questions/62465202