I have a DataFrame with the following structure:
Debtor ID | AccountRating | AccountRatingDate | AmountOutstanding | AmountPastDue
John Snow | Closed | 2017-03-01 | 0 | 0
John Snow | Delayed | 2017-04-22 | 2000 | 500
John Snow | Closed | 2017-05-23 | 0 | 0
John Snow | Delayed | 2017-07-15 | 6000 | 300
Sarah Parker | Closed | 2017-02-01 | 0 | 0
Edward Hall | Closed | 2017-05-01 | 0 | 0
Douglas Core | Delayed | 2017-01-01 | 1000 | 200
Douglas Core | Delayed | 2017-06-01 | 1000 | 400
What I want to achieve is:
Debtor ID | Incidents of delay | TheMostRecentOutstanding | TheMostRecentPastDue
John Snow | 2 | 6000 | 300
Sarah Parker | 0 | 0 | 0
Edward Hall | 0 | 0 | 0
Douglas Core | 2 | 1000 | 400
Counting the incidents of delay is straightforward:
df_account["pastDuebool"] = df_account["AmountPastDue"] > 0
new_df = pd.DataFrame(index=df_account.groupby("Debtor ID").groups.keys())
new_df["Incidents of delay"] = df_account.groupby("Debtor ID")["pastDuebool"].sum()
I have been struggling to get the most recent outstanding and past-due amounts. My code looks like this:
new_df["TheMostRecentOutstanding"] = df_account.loc[df_account[df_account["AccountRating"]=='Delayed'].groupby('Debtor ID')["AccountRatingDate"].idxmax(),"AmountOutstanding"]
new_df["TheMostRecentPastDue"] = df_account.loc[df_account[df_account["AccountRating"]=='Delayed'].groupby('Debtor ID')["AccountRatingDate"].idxmax(),"AmountPastDue"]
But both return a Series containing only NaN values. Please help: what am I doing wrong here?
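A likely cause of the NaN values, offered as a sketch rather than a confirmed diagnosis: the Series selected with .loc keeps the original integer row labels, while new_df is indexed by debtor name, so the assignment finds no matching labels. The reproduction below rebuilds the sample data by hand. The names misaligned and latest are illustrative, and filling debtors without a "Delayed" record with 0 is an assumption taken from the desired output.

```python
import pandas as pd

# Sample rows copied from the question; row labels are pandas' default 0..7.
df_account = pd.DataFrame({
    "Debtor ID": ["John Snow"] * 4 + ["Sarah Parker", "Edward Hall"] + ["Douglas Core"] * 2,
    "AccountRating": ["Closed", "Delayed", "Closed", "Delayed",
                      "Closed", "Closed", "Delayed", "Delayed"],
    "AccountRatingDate": pd.to_datetime([
        "2017-03-01", "2017-04-22", "2017-05-23", "2017-07-15",
        "2017-02-01", "2017-05-01", "2017-01-01", "2017-06-01"]),
    "AmountOutstanding": [0, 2000, 0, 6000, 0, 0, 1000, 1000],
    "AmountPastDue": [0, 500, 0, 300, 0, 0, 200, 400],
})

# Result frame indexed by debtor name, as in the question.
new_df = pd.DataFrame(index=df_account["Debtor ID"].unique())

delayed = df_account[df_account["AccountRating"] == "Delayed"]
# Row label of the latest "Delayed" record for each debtor.
idx = delayed.groupby("Debtor ID")["AccountRatingDate"].idxmax()

# This Series is labelled by row number, not by debtor name, so assigning it
# to new_df aligns on nothing and every value becomes NaN.
new_df["misaligned"] = df_account.loc[idx, "AmountOutstanding"]

# Re-label the selected rows by Debtor ID before assigning.
latest = df_account.loc[idx].set_index("Debtor ID")
new_df["TheMostRecentOutstanding"] = latest["AmountOutstanding"]
new_df["TheMostRecentPastDue"] = latest["AmountPastDue"]

# Debtors with no "Delayed" record have no match; 0 is an assumed default.
new_df = new_df.fillna({"TheMostRecentOutstanding": 0, "TheMostRecentPastDue": 0})
print(new_df)
```

Because fillna runs after NaN values were introduced, the two amount columns end up as floats; casting them with astype(int) afterwards is optional.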
Answered 2019-02-27 15:21:44
You can try this:
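df.sort_values('AccountRatingDate')\
  .query('AccountRating == "Delayed"')\
  .groupby('Debtor ID')[['AccountRating','AmountOutstanding','AmountPastDue']]\
  .agg({'AccountRating':'count',
        'AmountOutstanding':'last',
        'AmountPastDue':'last'})\
  .reindex(df['Debtor ID'].unique(), fill_value=0)\
  .reset_index()
Output:
      Debtor ID  AccountRating  AmountOutstanding  AmountPastDue
0     John Snow              2               6000            300
1  Sarah Parker              0                  0              0
2   Edward Hall              0                  0              0
3  Douglas Core              2               1000            400
Details: the chain sorts the rows by AccountRatingDate, keeps only the "Delayed" rows, groups them by Debtor ID, counts the rows in each group and takes the last (most recent) amounts, and then reindexes on every unique Debtor ID so that debtors with no delays are filled with 0.
You can also rename the columns by passing a dictionary to rename: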
df.sort_values('AccountRatingDate')\
  .query('AccountRating == "Delayed"')\
  .groupby('Debtor ID')[['AccountRating','AmountOutstanding','AmountPastDue']]\
  .agg({'AccountRating':'count',
        'AmountOutstanding':'last',
        'AmountPastDue':'last'})\
  .reindex(df['Debtor ID'].unique(), fill_value=0)\
  .rename(columns={'AccountRating':'Incidents of delay',
                   'AmountOutstanding':'TheMostRecentOutstanding',
                   'AmountPastDue':'TheMostRecentPastDue'})\
  .reset_index()
Output:
      Debtor ID  Incidents of delay  TheMostRecentOutstanding  TheMostRecentPastDue
0     John Snow                   2                      6000                   300
1  Sarah Parker                   0                         0                     0
2   Edward Hall                   0                         0                     0
3  Douglas Core                   2                      1000                   400
Answered 2019-02-27 15:37:06
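This flags the rows with a past-due amount, sorts the values by date (so the last row kept for each debtor is the most recent one), sums the boolean True values per debtor, and then drops the duplicated index entries. Finally it drops the unwanted columns and gives new_df: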
import pandas as pd

df = pd.read_table('fun.txt')
# Flag rows that have a past-due amount
df['pastDuebool'] = df['AmountPastDue'] > 0
df = df.set_index('DebtorID').sort_values('AccountRatingDate')
# Count the flagged rows per debtor; the result aligns on the DebtorID index
df['Incidents of Delay'] = df.groupby('DebtorID')['pastDuebool'].sum()
# Keep only the most recent row for each debtor
df = df[~df.index.duplicated(keep='last')]
df = df.drop(['pastDuebool', 'AccountRatingDate', 'AccountRating'], axis=1)
new_df = df.rename(columns={'AmountOutstanding':'TheMostRecentOutstanding',
                            'AmountPastDue':'TheMostRecentPastDue'})
print(new_df)
Output:
             TheMostRecentOutstanding  TheMostRecentPastDue  Incidents of Delay
DebtorID
SarahParker                         0                     0                 0.0
EdwardHall                          0                     0                 0.0
DouglasCore                      1000                   400                 2.0
JohnSnow                         6000                   300                 2.0
https://stackoverflow.com/questions/54908602
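To try the sort, filter, group, and reindex approach without an input file, here is a self-contained sketch that builds the question's sample rows by hand. It swaps the dictionary passed to agg for named aggregation, which assumes pandas 0.25 or newer and produces the final column names in one step. The variable name summary is illustrative.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Debtor ID": ["John Snow"] * 4 + ["Sarah Parker", "Edward Hall"] + ["Douglas Core"] * 2,
    "AccountRating": ["Closed", "Delayed", "Closed", "Delayed",
                      "Closed", "Closed", "Delayed", "Delayed"],
    "AccountRatingDate": pd.to_datetime([
        "2017-03-01", "2017-04-22", "2017-05-23", "2017-07-15",
        "2017-02-01", "2017-05-01", "2017-01-01", "2017-06-01"]),
    "AmountOutstanding": [0, 2000, 0, 6000, 0, 0, 1000, 1000],
    "AmountPastDue": [0, 500, 0, 300, 0, 0, 200, 400],
})

summary = (
    df.sort_values("AccountRatingDate")        # oldest first, so "last" is the most recent
      .query('AccountRating == "Delayed"')     # keep only delayed records
      .groupby("Debtor ID")
      .agg(**{                                 # named aggregation: new name -> (column, function)
          "Incidents of delay": ("AccountRating", "count"),
          "TheMostRecentOutstanding": ("AmountOutstanding", "last"),
          "TheMostRecentPastDue": ("AmountPastDue", "last"),
      })
      .reindex(df["Debtor ID"].unique(), fill_value=0)  # add debtors with no delays as 0
      .rename_axis("Debtor ID")                # make sure the index name survives the reindex
      .reset_index()
)
print(summary)
```

This version counts "Delayed" ratings, whereas the question counts rows with AmountPastDue > 0. The two agree on the sample data but could differ on other data.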