我有一个面板数据集/时间序列。我想为明年的gcp准备机器学习预测的数据集。我的数据如下所示:
ID,year,age,area,debt_ratio,gcp
654001,2013,49,East,0.14,0
654001,2014,50,East,0.17,0
654001,2015,51,East,0.23,1
654001,2016,52,East,0.18,0
112089,2013,39,West,0.13,0
112089,2014,40,West,0.15,0
112089,2015,41,West,0.18,1
112089,2016,42,West,0.21,1我想要的是这样的:
ID,year,age,area,debt_ratio,gcp,gcp-1,gcp-2,gcp-3
654001,2013,49,East,0.14,0,NA,NA,NA
654001,2014,50,East,0.17,0,0,NA,NA
654001,2015,51,East,0.23,1,0,0,NA
654001,2016,52,East,0.18,0,1,0,0
112089,2013,39,West,0.13,0,NA,NA,NA
112089,2014,40,West,0.15,0,0,NA,NA
112089,2015,41,West,0.18,1,0,0,NA
112089,2016,42,West,0.21,1,1,0,0我试过熊猫熔化功能,但它不起作用。我在网上搜索,发现这篇文章正是我想要做的,但它是用R完成的:
https://stackoverflow.com/questions/19813077/prepare-time-series-for-machine-learning-long-to-wide-format有人知道如何在Python Pandas中做到这一点吗?任何建议都将不胜感激!
发布于 2019-10-08 20:10:42
在循环中使用DataFrameGroupBy.shift:
for i in range(1, 4):
df[f'gcp-{i}'] = df.groupby('ID')['gcp'].shift(i)
print (df)
ID year age area debt_ratio gcp gcp-1 gcp-2 gcp-3
0 654001 2013 49 East 0.14 0 NaN NaN NaN
1 654001 2014 50 East 0.17 0 0.0 NaN NaN
2 654001 2015 51 East 0.23 1 0.0 0.0 NaN
3 654001 2016 52 East 0.18 0 1.0 0.0 0.0
4 112089 2013 39 West 0.13 0 NaN NaN NaN
5 112089 2014 40 West 0.15 0 0.0 NaN NaN
6 112089 2015 41 West 0.18 1 0.0 0.0 NaN
7 112089 2016 42 West 0.21 1 1.0 0.0 0.0更动态的解决方案是获取最大组数并传递给range
N = df['ID'].value_counts().max()
for i in range(1, N):
df[f'gcp-{i}'] = df.groupby('ID')['gcp'].shift(i)
print (df)
ID year age area debt_ratio gcp gcp-1 gcp-2 gcp-3
0 654001 2013 49 East 0.14 0 NaN NaN NaN
1 654001 2014 50 East 0.17 0 0.0 NaN NaN
2 654001 2015 51 East 0.23 1 0.0 0.0 NaN
3 654001 2016 52 East 0.18 0 1.0 0.0 0.0
4 112089 2013 39 West 0.13 0 NaN NaN NaN
5 112089 2014 40 West 0.15 0 0.0 NaN NaN
6 112089 2015 41 West 0.18 1 0.0 0.0 NaN
7 112089 2016 42 West 0.21 1 1.0 0.0 0.0https://stackoverflow.com/questions/58285992
复制相似问题