
Question: Getting the most recent values from a pandas groupby

Stack Overflow user

Asked 2019-02-27 15:11:43

Answers: 2 | Views: 1K | Followers: 0 | Votes: 3

I have a DataFrame with the following structure:

Debtor ID    | AccountRating    | AccountRatingDate   | AmountOutstanding    |AmountPastDue
John Snow      Closed             2017-03-01            0                     0
John Snow      Delayed            2017-04-22            2000                  500
John Snow      Closed             2017-05-23            0                     0
John Snow      Delayed            2017-07-15            6000                  300
Sarah Parker   Closed             2017-02-01            0                     0
Edward Hall    Closed             2017-05-01            0                     0
Douglas Core   Delayed            2017-01-01            1000                  200
Douglas Core   Delayed            2017-06-01            1000                  400

What I want to achieve is:

Debtor ID    | Incidents of delay    | TheMostRecentOutstanding    | TheMostRecentPastDue
John Snow      2                       6000                          300
Sarah Parker   0                       0                             0
Edward Hall    0                       0                             0
Douglas Core   2                       1000                          400

Counting the incidents of delay is straightforward:

df_account["pastDuebool"] = df_account['AmountPastDue'] > 0
new_df = pd.DataFrame(index = df_account.groupby("Debtor ID").groups.keys())
new_df['Incidents of delay'] = df_account.groupby("Debtor ID")["pastDuebool"].sum()

I am struggling to get the most recent outstanding and past-due amounts. My code looks like this:

new_df["TheMostRecentOutstanding"] = df_account.loc[df_account[df_account["AccountRating"]=='Delayed'].groupby('Debtor ID')["AccountRatingDate"].idxmax(),"AmountOutstanding"]
new_df["TheMostRecentPastDue"] = df_account.loc[df_account[df_account["AccountRating"]=='Delayed'].groupby('Debtor ID')["AccountRatingDate"].idxmax(),"AmountPastDue"]

But both return a Series with all NaN values. Please help: what am I doing wrong here?
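A likely cause, shown as a minimal sketch on the sample data above: `idxmax()` returns integer row labels, so the Series pulled out with `.loc` is indexed by those row labels, while `new_df` is indexed by debtor name. Column assignment aligns on the index, no labels match, and every value becomes NaN. The names `broken`, `fixed`, `latest_rows`, and `latest` are illustrative and not from the original post.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df_account = pd.DataFrame({
    "Debtor ID": ["John Snow"] * 4
                 + ["Sarah Parker", "Edward Hall", "Douglas Core", "Douglas Core"],
    "AccountRating": ["Closed", "Delayed", "Closed", "Delayed",
                      "Closed", "Closed", "Delayed", "Delayed"],
    "AccountRatingDate": pd.to_datetime([
        "2017-03-01", "2017-04-22", "2017-05-23", "2017-07-15",
        "2017-02-01", "2017-05-01", "2017-01-01", "2017-06-01"]),
    "AmountOutstanding": [0, 2000, 0, 6000, 0, 0, 1000, 1000],
    "AmountPastDue": [0, 500, 0, 300, 0, 0, 200, 400],
})

debtors = df_account["Debtor ID"].unique()

# Row label of the latest 'Delayed' record per debtor (index: debtor name).
delayed = df_account[df_account["AccountRating"] == "Delayed"]
latest_rows = delayed.groupby("Debtor ID")["AccountRatingDate"].idxmax()

# Pattern from the question: the right-hand side is indexed by row labels,
# the left-hand side by debtor names, so alignment finds no matches.
broken = pd.DataFrame(index=debtors)
broken["TheMostRecentOutstanding"] = df_account.loc[latest_rows, "AmountOutstanding"]

# One possible fix: re-index the selected rows by debtor name before assigning.
latest = df_account.loc[latest_rows].set_index("Debtor ID")
fixed = pd.DataFrame(index=debtors)
fixed["TheMostRecentOutstanding"] = latest["AmountOutstanding"]
fixed["TheMostRecentPastDue"] = latest["AmountPastDue"]
fixed = fixed.fillna(0).astype(int)  # debtors with no delay get 0

print(broken)
print(fixed)
```

The `fillna(0)` step assumes that a debtor without any delayed record should report 0, as in the desired output shown in the question.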

python

pandas

pandas-groupby

2 Answers

Stack Overflow user

Accepted answer

Posted 2019-02-27 15:21:44

You can try this:

df.sort_values('AccountRatingDate')\
  .query('AccountRating == "Delayed"')\
  .groupby('Debtor ID')[['AccountRating','AmountOutstanding','AmountPastDue']]\
  .agg({'AccountRating':'count',
        'AmountOutstanding':'last',
        'AmountPastDue':'last'})\
  .reindex(df['Debtor ID'].unique(), fill_value=0)\
  .reset_index()

Output:

      Debtor ID  AccountRating  AmountOutstanding  AmountPastDue
0     John Snow              2               6000            300
1  Sarah Parker              0                  0              0
2   Edward Hall              0                  0              0
3  Douglas Core              2               1000            400

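Details:

First, sort the data by AccountRatingDate so that the latest date is the last record in each group.
Filter the data to only the rows where AccountRating equals "Delayed".
Group by Debtor ID, select the columns to aggregate, then use agg with a dictionary to say how each column is aggregated.
Reindex with the unique values of Debtor ID, filling zeros for debtors that have no delays.
Finally, reset the index.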
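The steps above can be sketched one at a time on the question's sample data, which makes each intermediate result easy to inspect. This is a minimal sketch, not a replacement for the chained version: the intermediate names `ordered`, `delayed`, `summary`, and `result` are illustrative, and the DataFrame is rebuilt here from the table in the question.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Debtor ID": ["John Snow"] * 4
                 + ["Sarah Parker", "Edward Hall", "Douglas Core", "Douglas Core"],
    "AccountRating": ["Closed", "Delayed", "Closed", "Delayed",
                      "Closed", "Closed", "Delayed", "Delayed"],
    "AccountRatingDate": pd.to_datetime([
        "2017-03-01", "2017-04-22", "2017-05-23", "2017-07-15",
        "2017-02-01", "2017-05-01", "2017-01-01", "2017-06-01"]),
    "AmountOutstanding": [0, 2000, 0, 6000, 0, 0, 1000, 1000],
    "AmountPastDue": [0, 500, 0, 300, 0, 0, 200, 400],
})

# Step 1: sort by date so 'last' later means 'most recent'.
ordered = df.sort_values("AccountRatingDate")

# Step 2: keep only the delayed records.
delayed = ordered.query('AccountRating == "Delayed"')

# Step 3: one row per debtor: number of delays and the latest amounts.
summary = delayed.groupby("Debtor ID").agg({
    "AccountRating": "count",
    "AmountOutstanding": "last",
    "AmountPastDue": "last",
})

# Step 4: bring back debtors that never had a delay, filled with zeros.
summary = summary.reindex(df["Debtor ID"].unique(), fill_value=0)

# Step 5: turn the debtor index back into an ordinary column.
result = summary.reset_index()
print(result)
```

At step 3 only John Snow and Douglas Core are present, because the filter in step 2 removed every debtor without a delayed record; step 4 is what restores Sarah Parker and Edward Hall.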

In addition, you can rename the columns using rename with a dictionary:

df.sort_values('AccountRatingDate')\
  .query('AccountRating == "Delayed"')\
  .groupby('Debtor ID')[['AccountRating','AmountOutstanding','AmountPastDue']]\
  .agg({'AccountRating':'count',
        'AmountOutstanding':'last',
        'AmountPastDue':'last'})\
  .reindex(df['Debtor ID'].unique(), fill_value=0)\
  .rename(columns={'AccountRating':'Incidents of delay',
                   'AmountOutstanding':'TheMostRecentOutstanding',
                   'AmountPastDue':'TheMostRecentPastDue'})\
  .reset_index()

Output:

      Debtor ID  Incidents of delay  TheMostRecentOutstanding  TheMostRecentPastDue
0     John Snow                   2                      6000                   300
1  Sarah Parker                   0                         0                     0
2   Edward Hall                   0                         0                     0
3  Douglas Core                   2                      1000                   400
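As an alternative to a separate rename step, pandas 0.25 and later support named aggregation, where each output column is declared together with its source column and function. This is a different technique from the one in the answer, sketched here on the question's sample data. Because the target names contain spaces, they are passed by unpacking a dictionary into `agg`; the variable name `summary` is illustrative.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Debtor ID": ["John Snow"] * 4
                 + ["Sarah Parker", "Edward Hall", "Douglas Core", "Douglas Core"],
    "AccountRating": ["Closed", "Delayed", "Closed", "Delayed",
                      "Closed", "Closed", "Delayed", "Delayed"],
    "AccountRatingDate": pd.to_datetime([
        "2017-03-01", "2017-04-22", "2017-05-23", "2017-07-15",
        "2017-02-01", "2017-05-01", "2017-01-01", "2017-06-01"]),
    "AmountOutstanding": [0, 2000, 0, 6000, 0, 0, 1000, 1000],
    "AmountPastDue": [0, 500, 0, 300, 0, 0, 200, 400],
})

summary = (
    df.sort_values("AccountRatingDate")
      .query('AccountRating == "Delayed"')
      .groupby("Debtor ID")
      # Named aggregation: output name -> (source column, function).
      .agg(**{
          "Incidents of delay": ("AccountRating", "count"),
          "TheMostRecentOutstanding": ("AmountOutstanding", "last"),
          "TheMostRecentPastDue": ("AmountPastDue", "last"),
      })
      # Restore debtors without any delayed record, filled with zeros.
      .reindex(df["Debtor ID"].unique(), fill_value=0)
      .reset_index()
)
print(summary)
```

Declaring the final names inside `agg` removes the chance of a rename key silently failing to match a column.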

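Votes: 5

Stack Overflow user

Posted 2019-02-27 15:37:06

This sorts the values by date (so that the last value per debtor is the most recent one), sums the boolean True values to count the delays, and then drops duplicated index entries, keeping the last. Finally it drops the unwanted columns and gives 'new_df':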

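import pandas as pd

# 'fun.txt' is presumably a local copy of the sample data; here the ID column is named 'DebtorID'
df = pd.read_table('fun.txt')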
df['pastDuebool'] = df['AmountPastDue'] > 0

df = df.set_index('DebtorID').sort_values('AccountRatingDate')
df['Incidents of Delay'] = df.groupby('DebtorID')['pastDuebool'].sum()
df = df[~df.index.duplicated(keep='last')]

df = df.drop(['pastDuebool', 'AccountRatingDate', 'AccountRating'], axis=1)
new_df = df.rename(columns={'AmountOutstanding':'TheMostRecentOutstanding',
                            'AmountPastDue':'TheMostRecentPastDue'})

print(new_df)

Output:

             TheMostRecentOutstanding  TheMostRecentPastDue  Incidents of Delay
DebtorID
SarahParker                         0                     0                 0.0
EdwardHall                          0                     0                 0.0
DouglasCore                      1000                   400                 2.0
JohnSnow                         6000                   300                 2.0

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/54908602
