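I have a dataset with 30 million rows representing 4,000 different time series. I now want to create 100 new columns containing deltas against shifted values (the difference to the value 1 day earlier, 2 days earlier, and so on).
Data / MWE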
import pandas as pd
df = pd.DataFrame({"dtime":["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05",
"2022-01-06", "2022-01-07", "2022-01-08", "2022-01-09", "2022-01-10"],
"value":[5,6,1,2,3,6,8,7,2,3]})
+----+------------+---------+
| | dtime | value |
|----+------------+---------|
| 0 | 2022-01-01 | 5 |
| 1 | 2022-01-02 | 6 |
| 2 | 2022-01-03 | 1 |
| 3 | 2022-01-04 | 2 |
| 4 | 2022-01-05 | 3 |
| 5 | 2022-01-06 | 6 |
| 6 | 2022-01-07 | 8 |
| 7 | 2022-01-08 | 7 |
| 8 | 2022-01-09 | 2 |
| 9 | 2022-01-10 | 3 |
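+----+------------+---------+
Attempted solution (but not workable)
I could do it one column at a time, but since I need 100 (or even more) columns for the real-world data, I would rather not do it this way. From what I know about pandas, there must be a way to do this in one, two, or three lines :-)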
df["1d"] = df["value"] - df["value"].shift(1)
df["2d"] = df["value"] - df["value"].shift(2)
df["3d"] = df["value"] - df["value"].shift(3)
df["4d"] = df["value"] - df["value"].shift(4)
Expected output
+----+------------+---------+------+------+------+------+
| | dtime | value | 1d | 2d | 3d | 4d |
|----+------------+---------+------+------+------+------|
| 0 | 01.01.2022 | 5 | nan | nan | nan | nan |
| 1 | 02.01.2022 | 6 | 1 | nan | nan | nan |
| 2 | 03.01.2022 | 1 | -5 | -4 | nan | nan |
| 3 | 04.01.2022 | 2 | 1 | -4 | -3 | nan |
| 4 | 05.01.2022 | 3 | 1 | 2 | -3 | -2 |
| 5 | 06.01.2022 | 6 | 3 | 4 | 5 | 0 |
| 6 | 07.01.2022 | 8 | 2 | 5 | 6 | 7 |
| 7 | 08.01.2022 | 7 | -1 | 1 | 4 | 5 |
| 8 | 09.01.2022 | 2 | -5 | -6 | -4 | -1 |
| 9 | 10.01.2022 | 3 | 1 | -4 | -5 | -3 |
+----+------------+---------+------+------+------+------+
Posted on 2022-03-19 11:06:28
I think a simple for loop is enough:
for i in range(1, 5):
    df[f'{i}d'] = df['value'].diff(i)

print(df)
dtime value 1d 2d 3d 4d
0 2022-01-01 5 NaN NaN NaN NaN
1 2022-01-02 6 1.0 NaN NaN NaN
2 2022-01-03 1 -5.0 -4.0 NaN NaN
3 2022-01-04 2 1.0 -4.0 -3.0 NaN
4 2022-01-05 3 1.0 2.0 -3.0 -2.0
5 2022-01-06 6 3.0 4.0 5.0 0.0
6 2022-01-07 8 2.0 5.0 6.0 7.0
7 2022-01-08 7 -1.0 1.0 4.0 5.0
8 2022-01-09 2 -5.0 -6.0 -4.0 -1.0
9 2022-01-10 3 1.0 -4.0 -5.0 -3.0
Posted on 2022-03-19 12:46:40
If you care about efficiency, you can use numpy.
Here is one approach:
import numpy as np
N = 4
# convert the column to array
a = df['value'].to_numpy(dtype='float')
# craft an indexing ndarray
b = np.tile(np.arange(len(a)), (N,1)).T-np.arange(1,N+1)
# slice and compute the difference
c = a[:,None]-a[b]
# mask the upper triangle
c[b<0] = np.nan
# assign back to DataFrame
df2 = df.join(pd.DataFrame(c))
Output:
dtime value 0 1 2 3
0 2022-01-01 5 NaN NaN NaN NaN
1 2022-01-02 6 1.0 NaN NaN NaN
2 2022-01-03 1 -5.0 -4.0 NaN NaN
3 2022-01-04 2 1.0 -4.0 -3.0 NaN
4 2022-01-05 3 1.0 2.0 -3.0 -2.0
5 2022-01-06 6 3.0 4.0 5.0 0.0
6 2022-01-07 8 2.0 5.0 6.0 7.0
7 2022-01-08 7 -1.0 1.0 4.0 5.0
8 2022-01-09 2 -5.0 -6.0 -4.0 -1.0
9 2022-01-10 3 1.0 -4.0 -5.0 -3.0
Note: on the provided dataset, this runs 5-6 times faster than the loop.
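The joined columns above come out labelled 0 to 3 rather than 1d to 4d. A minimal sketch of the same indexing idea with named columns, cross-checked against the diff-based loop, might look like the following. The helper name `lagged_deltas` and the variable `n_lags` are hypothetical and not part of either answer, and clamping the indices with `np.maximum` is an added assumption so that the helper also works when there are more lags than rows.

```python
import numpy as np
import pandas as pd


def lagged_deltas(values, n_lags):
    """Return a 2-D array whose column k-1 holds values[i] - values[i-k]."""
    a = np.asarray(values, dtype=float)
    # idx[i, k-1] = i - k: the row position k steps before row i
    idx = np.arange(len(a))[:, None] - np.arange(1, n_lags + 1)
    # clamp negative positions to 0 so the lookup never goes out of bounds
    out = a[:, None] - a[np.maximum(idx, 0)]
    # rows that have no value k steps back get NaN
    out[idx < 0] = np.nan
    return out


df = pd.DataFrame({"value": [5, 6, 1, 2, 3, 6, 8, 7, 2, 3]})
n_lags = 4
names = [f"{k}d" for k in range(1, n_lags + 1)]

# build all delta columns at once and attach them in a single join
deltas = pd.DataFrame(lagged_deltas(df["value"], n_lags),
                      columns=names, index=df.index)
result = df.join(deltas)

# the same columns computed with Series.diff, for comparison
expected = pd.concat({f"{k}d": df["value"].diff(k)
                      for k in range(1, n_lags + 1)}, axis=1)
pd.testing.assert_frame_equal(deltas, expected)
```

Both this sketch and the loop take differences across consecutive rows regardless of which series a row belongs to, so a frame holding several time series would need the deltas computed per series (for example via `groupby`).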
https://stackoverflow.com/questions/71537565