I have the following DataFrame:
>>> df
uniprot_id protein_group protein_family protein_subfamily
0 Q8TAS1 Other KIS NaN
1 P35916 TK VEGFR NaN
2 Q96SB4 CMGC SRPK NaN
3 Q6P3W7 Other SCY1 NaN
4 Q9UKI8 Other TLK NaN
.. ... ... ... ...
561 Q96S53 TKL LISK TESK
562 Q13163 STE STE7 NaN
563 P45985 STE STE7 NaN
564 Q5VT25 AGC DMPK GEK
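565 O00141 AGC SGK NaN

Some values in the uniprot_id column are duplicated. I want to merge those rows so that identical values are collapsed into one and differing values are joined with a semicolon, because the rows that share a uniprot_id are similar but not identical.
After applying the code below I don't get the result I want, and I'd like to know what I'm doing wrong: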
df2 = df.groupby(['uniprot_id'])['protein_group','protein_family','protein_subfamily'].apply(lambda x: '; '.join(set(x))).reset_index()
>>> print(df2)
uniprot_id 0
0 A0A0B4J2F2 protein_subfamily; protein_family; protein_group
1 A4QPH2 protein_subfamily; protein_family; protein_group
2 B5MCJ9 protein_subfamily; protein_family; protein_group
3 O00141 protein_subfamily; protein_family; protein_group
4 O00238 protein_subfamily; protein_family; protein_group
.. ... ...
547 Q9Y616 protein_subfamily; protein_family; protein_group
548 Q9Y6E0 protein_subfamily; protein_family; protein_group
549 Q9Y6M4 protein_subfamily; protein_family; protein_group
550 Q9Y6R4 protein_subfamily; protein_family; protein_group
551 Q9Y6S9 protein_subfamily; protein_family; protein_group

I need the duplicated rows to be combined so that the result looks like this:
uniprot_id protein_group protein_family protein_subfamily
133 Q9UK32 Other RSK; RSKb RSKp90; RSKb

Posted on 2022-01-17 07:25:05
Use GroupBy.agg and drop the missing values with Series.dropna before joining:
df2 = (df.groupby(['uniprot_id'])[['protein_group','protein_family','protein_subfamily']]
         .agg(lambda x: '; '.join(set(x.dropna())))
         .reset_index())
print(df2)
uniprot_id protein_group protein_family protein_subfamily
0 O00141 AGC SGK
1 P35916 TK VEGFR
2 P45985 STE STE7
3 Q13163 STE STE7
4 Q5VT25 AGC DMPK GEK
5 Q6P3W7 Other SCY1
6 Q8TAS1 Other KIS
7 Q96S53 TKL LISK TESK
8 Q96SB4 CMGC SRPK
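9 Q9UKI8 Other TLK

If the order of the values matters, don't use set, because its iteration order is not defined. Use the dict.fromkeys trick instead, which removes duplicates while keeping first-seen order: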
df2 = (df.groupby(['uniprot_id'])[['protein_group','protein_family','protein_subfamily']]
         .agg(lambda x: '; '.join(dict.fromkeys(x.dropna()).keys()))
         .reset_index())

https://stackoverflow.com/questions/70737778
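
As for what went wrong in the question's attempt, a likely explanation is that GroupBy.apply hands the lambda each group as a whole DataFrame, and iterating over a DataFrame yields its column labels rather than its values, so set(x) collects the three column names. GroupBy.agg instead calls the function once per column, so x is a Series of values. Newer pandas versions also appear to reject selecting several columns with ['a','b','c'] on a groupby and expect a list, [['a','b','c']]. The sketch below illustrates both points on a small made-up frame; the sample rows, the cols list and the join_unique helper are assumptions for illustration, not part of the original post.

```python
import pandas as pd

# Hypothetical sample data: Q9UK32 appears twice with differing values.
df = pd.DataFrame({
    "uniprot_id": ["Q9UK32", "Q9UK32", "P35916"],
    "protein_group": ["Other", "Other", "TK"],
    "protein_family": ["RSK", "RSKb", "VEGFR"],
    "protein_subfamily": ["RSKp90", "RSKb", None],
})
cols = ["protein_group", "protein_family", "protein_subfamily"]

# Iterating over a DataFrame yields column labels, not cell values,
# so a set built from a group DataFrame contains the column names.
group = df[df["uniprot_id"] == "Q9UK32"][cols]
print(set(group))


def join_unique(series):
    """Join the non-missing values of one column, keeping first-seen order."""
    return "; ".join(dict.fromkeys(series.dropna()))


# agg() calls join_unique once per column within each group.
df2 = df.groupby("uniprot_id")[cols].agg(join_unique).reset_index()
print(df2)
```

Groups whose values are all missing end up as an empty string here; replace it afterwards if NaN is preferred.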