Any suggestions? Here is my data, with application number, applicant, gender, application title, and multiple keywords per application.
import pandas as pd
df = pd.DataFrame({'Application number': [1,1,1,1,1,1,2,2,2,2,2,3,3,3],
'Applicant': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson','John Johnson','Ryan Rosling','Ryan Rosling','Ryan Rosling','Ryan Rosling','Ryan Rosling','Gee Gettysbeurg','Gee Gettysbeurg','Gee Gettysbeurg'],
'Gender': ['m','m','m','m','m','m','m','m','m','m','m','v','v','v'],
'Title':['Symplectic symmetry','Symplectic symmetry','Symplectic symmetry','Symplectic symmetry','Symplectic symmetry','Symplectic symmetry', 'Exploring protein synthesis', 'Exploring protein synthesis', 'Exploring protein synthesis', 'Exploring protein synthesis', 'Exploring protein synthesis', 'auxin-response pathway','auxin-response pathway','auxin-response pathway' ],
'Keyword':['Maths', 'trignometry', 'Analysis', 'Algebra', 'Hypothesis', 'DNA', 'mitochondria', 'RNA', 'Brain', 'Cellstructure', 'Life', 'Blood', 'analysis', 'blood circulation'] })
df.head()
I used to pivot the table with the code below, so that the keywords end up in a single cell per applicant:
df_pv2=pd.pivot_table(df,index=['Application number', 'Applicant', 'Gender', 'Title'],columns=df.groupby(['Application number', 'Applicant', 'Gender', 'Title']).cumcount().add(1),values=['Keyword'], aggfunc='sum')
df_pv2.columns=df_pv2.columns.map('{0[0]}{0[1]}'.format)
But now it no longer pivots the keyword columns. For completeness, the code below cleans up the table by removing the unneeded columns:
#joining all keywords together, separated by comma
df_pv2['Keywords'] = df_pv2.loc[:, 'Keyword1':'Keyword6'].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1)
df_pv2=df_pv2.drop(df_pv2.loc[:, 'Keyword1':'Keyword6'].columns, axis = 1)
#resetting index
df_pv2=df_pv2.reset_index()
df_pv2.head()
Posted on 2021-08-24 15:16:13
aggfunc='sum' only operates on numeric values, so the following code:
df_pv2 = pd.pivot_table(df,
                        index=['Application number', 'Applicant', 'Gender',
                               'Title'],
                        columns=df.groupby(
                            ['Application number', 'Applicant', 'Gender',
                             'Title']
                        ).cumcount().add(1),
                        values=['Keyword'],
                        aggfunc='sum')
produces:
Empty DataFrame
Columns: []
Index: [(1, John Johnson, m, Symplectic symmetry), (2, Ryan Rosling, m, Exploring protein synthesis), (3, Gee Gettysbeurg, v, auxin-response pathway)]
This is why the resulting DataFrame has no keywords.
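The columns argument is not the culprit here: it only numbers the keywords within each application, and those numbers later become the Keyword1, Keyword2, ... column names. A minimal sketch on a small made-up frame (toy, position and labels are illustrative names, not from the original post):

```python
import pandas as pd

# Hypothetical data: application 1 has three keywords, application 2 has two.
toy = pd.DataFrame({
    'Application number': [1, 1, 1, 2, 2],
    'Keyword': ['Maths', 'Algebra', 'DNA', 'RNA', 'Brain'],
})

# cumcount() numbers the rows within each group starting at 0;
# add(1) shifts the numbering so it starts at 1.
position = toy.groupby('Application number').cumcount().add(1)
print(position.tolist())  # [1, 2, 3, 1, 2]

# Combined with the value name, these positions give the wide column labels.
labels = ('Keyword' + position.astype(str)).tolist()
print(labels)  # ['Keyword1', 'Keyword2', 'Keyword3', 'Keyword1', 'Keyword2']
```

So each keyword gets a unique (application, position) slot, and the aggregation function only has to pass a single string through unchanged.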
One fix is to use aggfunc='first':
df_pv2 = pd.pivot_table(df,
                        index=['Application number', 'Applicant', 'Gender',
                               'Title'],
                        columns=df.groupby(
                            ['Application number', 'Applicant', 'Gender',
                             'Title']
                        ).cumcount().add(1),
                        values=['Keyword'],
                        aggfunc='first')
df_pv2.columns = df_pv2.columns.map('{0[0]}{0[1]}'.format)
# joining all keywords together, separated by comma
df_pv2['Keywords'] = df_pv2.loc[:, 'Keyword1':'Keyword6'].apply(
    lambda x: ', '.join(x[x.notnull()]), axis=1)
df_pv2 = df_pv2.drop(df_pv2.loc[:, 'Keyword1':'Keyword6'].columns, axis=1)
# resetting index
df_pv2 = df_pv2.reset_index()
which results in:
Application number Applicant Gender Title Keywords
0 1 John Johnson m Symplectic symmetry Maths, trignometry, Analysis, Algebra, Hypothesis, DNA
1 2 Ryan Rosling m Exploring protein synthesis mitochondria, RNA, Brain, Cellstructure, Life
2 3 Gee Gettysbeurg v auxin-response pathway Blood, analysis, blood circulation
However, the same result can be obtained directly from df with a simple groupby agg, without needing pivot_table.
df_pv2 = df.groupby(
    ['Application number', 'Applicant', 'Gender', 'Title'],
)['Keyword'].agg(', '.join).reset_index(name='Keywords')
df_pv2
Application number Applicant Gender Title Keywords
0 1 John Johnson m Symplectic symmetry Maths, trignometry, Analysis, Algebra, Hypothesis, DNA
1 2 Ryan Rosling m Exploring protein synthesis mitochondria, RNA, Brain, Cellstructure, Life
2 3 Gee Gettysbeurg v auxin-response pathway Blood, analysis, blood circulation
https://stackoverflow.com/questions/68909651
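One caveat on the groupby version: ', '.join raises a TypeError if a group contains a missing keyword (NaN), whereas the pivot-based code filtered those out with x.notnull(). The sample data has no missing keywords, so the following is only a sketch under the assumption that real data might; toy and keywords are illustrative names, not from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing keyword in application 1.
toy = pd.DataFrame({
    'Application number': [1, 1, 1, 2, 2],
    'Keyword': ['Maths', np.nan, 'DNA', 'RNA', 'Brain'],
})

# Drop missing values inside each group before joining,
# mirroring the x[x.notnull()] filter of the pivot-based code.
keywords = (
    toy.groupby('Application number')['Keyword']
       .agg(lambda s: ', '.join(s.dropna()))
       .reset_index(name='Keywords')
)
print(keywords['Keywords'].tolist())  # ['Maths, DNA', 'RNA, Brain']
```

If the keywords are guaranteed to be non-missing strings, the plain .agg(', '.join) form is shorter and does the same job.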