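I have a large data table that contains a few duplicates. The goal is to complement these duplicates so that I end up with rows carrying the maximum amount of information. Imagine a pandas DataFrame with the following structure: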
>>> import pandas as pd
>>> import numpy as np
>>> d = {'id': [1,1,1,2,2,3], 'col1': [25,np.nan,25,3,np.nan,1], 'col2':[np.nan,'rrt',np.nan,np.nan,'sdf',np.nan], 'col3':[300,300,np.nan,500,500,600]}
>>> df = pd.DataFrame(data=d)
>>> df
   id  col1 col2   col3
0   1  25.0  NaN  300.0
1   1   NaN  rrt  300.0
2   1  25.0  NaN    NaN
3   2   3.0  NaN  500.0
4   2   NaN  sdf  500.0
5   3   1.0  NaN  600.0

I would like to obtain the following result:
   id  col1 col2   col3
0   1  25.0  rrt  300.0
1   1  25.0  NaN  300.0
2   1   NaN  rrt  300.0
3   1  25.0  NaN    NaN
4   2   3.0  sdf  500.0
5   2   3.0  NaN  500.0
6   2   NaN  sdf  500.0
7   3   1.0  NaN  600.0

(Afterwards, I can use the solution from the thread "How do I select the group with the least number of null values in a groupby?" to obtain:
   id  col1 col2   col3
0   1  25.0  rrt  300.0
1   2   3.0  sdf  500.0
3   3   1.0  NaN  600.0
)
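The follow-up selection step mentioned in the parenthesis can be sketched as below. This is only an illustration, not the solution from the linked thread: the helper name `most_complete_row_per_group` is hypothetical, and it assumes that "least number of null values" means keeping, for each `id`, the row with the most non-null cells (ties go to the first such row).

```python
import numpy as np
import pandas as pd


def most_complete_row_per_group(frame: pd.DataFrame, key: str) -> pd.DataFrame:
    """Hypothetical helper: keep, per value of `key`, the row with the most non-null cells."""
    # Count the non-null cells in every row.
    filled = frame.notna().sum(axis=1)
    # For each group, find the index label of the row with the highest count
    # (idxmax returns the first label when several rows tie).
    best = filled.groupby(frame[key]).idxmax()
    return frame.loc[best].reset_index(drop=True)


# The desired intermediate table from the question.
expanded = pd.DataFrame({
    'id':   [1, 1, 1, 1, 2, 2, 2, 3],
    'col1': [25, 25, np.nan, 25, 3, 3, np.nan, 1],
    'col2': ['rrt', np.nan, 'rrt', np.nan, 'sdf', np.nan, 'sdf', np.nan],
    'col3': [300, 300, 300, np.nan, 500, 500, 500, 600],
})

result = most_complete_row_per_group(expanded, 'id')
print(result)
```

Counting non-null cells per row and taking `idxmax` within each group avoids a Python-level loop over the groups.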
Posted on 2022-10-14 18:19:29
You can use this. I hope it helps.
import pandas as pd
import numpy as np

d = {'id': [1,1,1,2,2,3], 'col1': [25,np.nan,25,3,np.nan,1], 'col2':[np.nan,'rrt',np.nan,np.nan,'sdf',np.nan], 'col3':[300,300,np.nan,500,500,600]}
df1 = pd.DataFrame(data=d)
# One row per id holding the maximum of each numeric column (NaN is skipped).
df2 = df1.groupby('id').agg({'col1': 'max', 'col3': 'max'}).reset_index()
# For each id, take the first non-null col2 value, or NaN if the group has none.
col = []
for group_id in df2['id']:
    group_list = df1.groupby('id')['col2'].get_group(group_id).tolist()
    new_list = [item for item in group_list if pd.notna(item)]
    if len(new_list) < 1:
        new_list.append(np.nan)
    col.append(new_list[0])
df2['col2'] = col
# Append the fully populated combined rows to the original rows and re-sort.
final_df = pd.concat([df1, df2.dropna()], axis=0).sort_values(by=['id', 'col3']).reset_index().drop(columns='index')

The final data is:
   id  col1 col2   col3
0   1  25.0  NaN  300.0
1   1   NaN  rrt  300.0
2   1  25.0  rrt  300.0
3   1  25.0  NaN    NaN
4   2   3.0  NaN  500.0
5   2   NaN  sdf  500.0
6   2   3.0  sdf  500.0
7   3   1.0  NaN  600.0

The df2 output is:
   id  col1   col3 col2
0   1  25.0  300.0  rrt
1   2   3.0  500.0  sdf
2   3   1.0  600.0  NaN

https://stackoverflow.com/questions/74072062
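As a possible shortcut for building the combined row per `id`, the loop in the answer can be replaced with `GroupBy.first()`, which by default skips missing values and returns the first non-null entry of each column within each group. This is a sketch under one assumption: the non-null values within a group agree with each other, as they do in the sample data. Where they differ, `first()` and the answer's `max` aggregation would give different results. The names `combined` and `final_df` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2, 3],
    'col1': [25, np.nan, 25, 3, np.nan, 1],
    'col2': [np.nan, 'rrt', np.nan, np.nan, 'sdf', np.nan],
    'col3': [300, 300, np.nan, 500, 500, 600],
})

# First non-null value of every column within each id group.
combined = df1.groupby('id').first().reset_index()

# As in the answer: append only the combined rows that are fully populated,
# then order the result by id and col3.
final_df = (
    pd.concat([df1, combined.dropna()], ignore_index=True)
    .sort_values(by=['id', 'col3'])
    .reset_index(drop=True)
)
print(final_df)
```

Because `first()` handles every column at once, no column needs to be treated separately from the numeric ones.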