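Pandas version: 1.1.0
Hi, I am trying to find the difference in value between the last two date_collected dates. The code works well when the collection dates are consecutive, but I am stuck on how to handle weekends: no data is collected then, which leaves a 2-day gap and causes groupby.diff() to ignore those rows.
With the code below, I was able to slice out two specific dates and get the result using .reindex_like().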
# (working example when specifically slicing on 2 dates)
prior_date = df.loc[df['date_collected'] == '2020-11-06']
current = df.loc[df['date_collected'] == '2020-11-09']
prior_date = prior_date.set_index('date')['value']
current = current.set_index('date')['value']
prior_date = prior_date.reindex_like(current).fillna(0)
df = (current - prior_date).reset_index()
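change = df[df['value'] != 0].dropna(axis=0)
However, when I try to reproduce these results for the entire DataFrame, I cannot find a way to use reindex_like on a MultiIndex. I tried using .last(), but then realized that the missing weekends become a problem.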
# (current result down below)
chng = df.set_index(['date_collected', 'date'])['value']
chng = chng.groupby(level=1).diff().reset_index()
last = df.groupby('date_collected')[['date', 'value']].last().reset_index()
chng = chng.set_index(['date_collected', 'value'])
last = last.set_index(['date_collected', 'value'])
chng = chng.fillna(last)
chng = chng[chng['value'] != 0].dropna()
# input data
+----------------+------------+-------+
| date_collected | date | value |
+----------------+------------+-------+
| 2020-11-06 | 2020-11-01 | 4 |
| 2020-11-06 | 2020-11-02 | 5 |
| 2020-11-06 | 2020-11-03 | 1 |
| 2020-11-06 | 2020-11-04 | 3 |
| 2020-11-06 | 2020-11-05 | 1 |
| 2020-11-09 | 2020-11-04 | 3 |
| 2020-11-09 | 2020-11-05 | 3 |
| 2020-11-09 | 2020-11-06 | 5 |
| 2020-11-09 | 2020-11-07 | 1 |
| 2020-11-09 | 2020-11-08 | 1 |
| 2020-11-10 | 2020-11-05 | 3 |
| 2020-11-10 | 2020-11-06 | 5 |
| 2020-11-10 | 2020-11-07 | 1 |
| 2020-11-10 | 2020-11-08 | 3 |
| 2020-11-10 | 2020-11-09 | 2 |
+----------------+------------+-------+
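To make the problem easier to reproduce, here is a minimal sketch that rebuilds the input table above and runs a plain per-date diff. The column name `diff` and the variable `unmatched` are only illustrative and not part of the original code. It assumes both date columns are parsed as datetimes.

```python
import pandas as pd

# Rebuild the input table from the question.
rows = [
    ("2020-11-06", "2020-11-01", 4), ("2020-11-06", "2020-11-02", 5),
    ("2020-11-06", "2020-11-03", 1), ("2020-11-06", "2020-11-04", 3),
    ("2020-11-06", "2020-11-05", 1), ("2020-11-09", "2020-11-04", 3),
    ("2020-11-09", "2020-11-05", 3), ("2020-11-09", "2020-11-06", 5),
    ("2020-11-09", "2020-11-07", 1), ("2020-11-09", "2020-11-08", 1),
    ("2020-11-10", "2020-11-05", 3), ("2020-11-10", "2020-11-06", 5),
    ("2020-11-10", "2020-11-07", 1), ("2020-11-10", "2020-11-08", 3),
    ("2020-11-10", "2020-11-09", 2),
]
df = pd.DataFrame(rows, columns=["date_collected", "date", "value"])
df["date_collected"] = pd.to_datetime(df["date_collected"])
df["date"] = pd.to_datetime(df["date"])

# Difference against the previous collection of the same date.
df["diff"] = df.groupby("date")["value"].diff()

# Rows whose date has no earlier collection have nothing to subtract from,
# so diff() leaves them as NaN (illustrative name: unmatched).
unmatched = df[df["diff"].isna()]
print(unmatched[["date_collected", "date", "value"]])
```

The rows for 2020-11-06 through 2020-11-08 under the 2020-11-09 collection show up in `unmatched`, which seems to be why they vanish once NaN rows are dropped.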
# wanted results
+----------------+------------+-------+
| date_collected | date | value |
+----------------+------------+-------+
| 2020-11-06 | 2020-11-05 | 1 |
| 2020-11-09 | 2020-11-05 | 2 |
| 2020-11-09 | 2020-11-06 | 5 |
| 2020-11-09 | 2020-11-07 | 1 |
| 2020-11-09 | 2020-11-08 | 1 |
| 2020-11-10 | 2020-11-08 | 2 |
| 2020-11-10 | 2020-11-09 | 2 |
+----------------+------------+-------+
# current results
+----------------+------------+-------+
| date_collected | date | value |
+----------------+------------+-------+
| 2020-11-06 | 2020-11-05 | 1 |
| 2020-11-09 | 2020-11-05 | 2 |
| 2020-11-09 | 2020-11-08 | 1 |
| 2020-11-10 | 2020-11-08 | 2 |
| 2020-11-10 | 2020-11-09 | 2 |
+----------------+------------+-------+
Posted on 2020-11-17 18:47:31
I was able to reindex the MultiIndex and get the desired result with the following code:
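# Build the full grid: every collection date paired with every calendar date in range
dates = pd.date_range(df['date'].min(), df['date'].max())
new_idx = pd.MultiIndex.from_product([df['date_collected'].unique(), dates])
df = df.set_index(['date_collected', 'date'])
# Pairs that were never collected become 0 instead of being absent
df = df.reindex(new_idx).fillna(0)
# Difference per date across successive collection dates
chng = df.groupby(level=1).diff()
chng = chng[chng['value'] != 0].dropna()
https://stackoverflow.com/questions/64857309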
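For anyone who wants to try the reindexing idea in isolation, here is a self-contained sketch of it. The function name `value_changes` and the four-row `sample` frame are made up for illustration (the sample is a trimmed subset of the question's data). It assumes both date columns are datetimes and that each (date_collected, date) pair occurs at most once.

```python
import pandas as pd

def value_changes(df):
    """Per-date change in value between successive collection dates (hypothetical helper)."""
    # Every calendar date between the earliest and latest 'date'.
    dates = pd.date_range(df["date"].min(), df["date"].max())
    # Collection dates in chronological order so diff() compares neighbours.
    collected = pd.Index(df["date_collected"].unique()).sort_values()
    full_idx = pd.MultiIndex.from_product(
        [collected, dates], names=["date_collected", "date"]
    )
    # Pairs that were never collected are filled with 0.
    values = df.set_index(["date_collected", "date"])["value"]
    filled = values.reindex(full_idx, fill_value=0)
    # Difference against the previous collection of the same date.
    chng = filled.groupby(level="date").diff()
    # Keep non-zero changes; the first collection has no predecessor (NaN) and is dropped.
    return chng[chng != 0].dropna().reset_index(name="value")

# Trimmed sample: two collection dates with a weekend gap between them.
sample = pd.DataFrame(
    {
        "date_collected": pd.to_datetime(
            ["2020-11-06", "2020-11-06", "2020-11-09", "2020-11-09"]
        ),
        "date": pd.to_datetime(
            ["2020-11-04", "2020-11-05", "2020-11-05", "2020-11-06"]
        ),
        "value": [3, 1, 3, 5],
    }
)
print(value_changes(sample))
```

Two things to keep in mind with this approach: a date that drops out of a later collection shows up as a negative change (2020-11-04 in the sample), and rows from the very first collection are removed because they have nothing to be compared against.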