Q: Extracting values from earlier rows with a different date in a pandas DataFrame

Stack Overflow user

Asked 2019-09-19 08:56:22

Answers: 3 | Views: 162 | Followers: 0 | Votes: 3

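An extension of my previous question. I have a source dataframe with three columns: Customer, Date and Item. I want to add a new column, Item History, holding an array of all of that customer's items that appear in earlier rows (earlier as defined by date). Where a customer has made multiple purchases on the same date, neither row's item should be listed in the other row's item history.

So, given this sample data: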

import numpy as np
import pandas as pd

df = pd.DataFrame({'Customer': ['Bert', 'Bert', 'Bert', 'Bert', 'Bert', 'Ernie', 'Ernie', 'Ernie', 'Ernie', 'Steven', 'Steven'],
                   'Date': ['01/01/2019', '15/01/2019', '20/01/2019', '20/01/2019', '22/01/2019', '01/01/2019', '15/01/2019', '20/01/2019', '22/01/2019', '01/01/2019', '15/01/2019'],
                   'Item': ['Bread', 'Cheese', 'Apples', 'Pears', 'Toothbrush', 'Toys', 'Shellfish', 'Dog', 'Yoghurt', 'Toilet', 'Dominos']})

Customer    Date    Item
Bert    01/01/2019  Bread
Bert    15/01/2019  Cheese
Bert    20/01/2019  Apples
Bert    20/01/2019  Pears
Bert    22/01/2019  Toothbrush
Ernie   01/01/2019  Toys
Ernie   15/01/2019  Shellfish
Ernie   20/01/2019  Dog
Ernie   22/01/2019  Yoghurt
Steven  01/01/2019  Toilet
Steven  15/01/2019  Dominos

The output I'd like to see is:

Customer    Date    Item        Item History
Bert    01/01/2019  Bread       NaN
Bert    15/01/2019  Cheese      [Bread]
Bert    20/01/2019  Apples      [Bread, Cheese]
Bert    20/01/2019  Pears       [Bread, Cheese]
Bert    22/01/2019  Toothbrush  [Bread, Cheese, Apples, Pears]
Ernie   01/01/2019  Toys        NaN
Ernie   15/01/2019  Shellfish   [Toys]
Ernie   20/01/2019  Dog         [Toys, Shellfish]
Ernie   22/01/2019  Yoghurt     [Toys, Shellfish, Dog]
Steven  01/01/2019  Toilet      NaN
Steven  15/01/2019  Dominos     [Toilet]

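Note that for Bert's two purchases on 20/01/2019, neither row's Item History contains the other row's item. For his purchase on 22/01/2019, both items from 20/01/2019 are included.

The answer to the previous question was a neat list comprehension of the form: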

df['Item History'] = [x.Item[:i].tolist() for j, x in df.groupby('Customer') 
                                          for i in range(len(x))]

df.loc[~df['Item History'].astype(bool), 'Item History']= np.nan

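But clearly the "i" in x.Item[:i] needs to be worked out as the position of the last row whose date differs from the current row's date. Any suggestions for achieving this are much appreciated.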

python

pandas

dataframe

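3 Answers

Stack Overflow user

Accepted answer

Posted 2019-09-19 09:21:59

The idea is to use DataFrame.duplicated to flag the rows that repeat a (Customer, Date) pair, replace their histories with NaN, and then forward-fill those missing values from the first row of the same date.

The first row of each group is never flagged as a duplicate and always holds an empty list, so the forward fill does not need to be done per group: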

df['Item History'] = [x.Item[:i].tolist() for j, x in df.groupby('Customer') 
                                          for i in range(len(x))]

df['Item History'] = df['Item History'].mask(df.duplicated(['Customer','Date'])).ffill()

df.loc[~df['Item History'].astype(bool), 'Item History']= np.nan
print (df)
   Customer        Date        Item                    Item History
0      Bert  01/01/2019       Bread                             NaN
1      Bert  15/01/2019      Cheese                         [Bread]
2      Bert  20/01/2019      Apples                 [Bread, Cheese]
3      Bert  20/01/2019       Pears                 [Bread, Cheese]
4      Bert  22/01/2019  Toothbrush  [Bread, Cheese, Apples, Pears]
5     Ernie  01/01/2019        Toys                             NaN
6     Ernie  15/01/2019   Shellfish                          [Toys]
7     Ernie  20/01/2019         Dog               [Toys, Shellfish]
8     Ernie  22/01/2019     Yoghurt          [Toys, Shellfish, Dog]
9    Steven  01/01/2019      Toilet                             NaN
10   Steven  15/01/2019     Dominos                        [Toilet]
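To make the mask-and-fill step easier to follow, here is a minimal sketch on a smaller, made-up frame that keeps the intermediate values in separate variables. The names `naive`, `is_repeat` and `history` are illustrative only. It assumes, as the answer above does, that rows are already sorted by customer and then by date, so that same-date rows sit next to each other.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer": ["Bert", "Bert", "Bert", "Bert"],
    "Date": ["01/01/2019", "20/01/2019", "20/01/2019", "22/01/2019"],
    "Item": ["Bread", "Apples", "Pears", "Toothbrush"],
})

# Naive history: every earlier row of the same customer, ignoring dates.
naive = pd.Series(
    [g["Item"].iloc[:i].tolist()
     for _, g in df.groupby("Customer")
     for i in range(len(g))],
    index=df.index,
)

# True for the second and later rows of each (Customer, Date) pair.
is_repeat = df.duplicated(["Customer", "Date"])

# Blank out the repeats, then copy down the history of the first row of that date.
history = naive.mask(is_repeat).ffill()
print(history.tolist())
```

Here the Pears row initially gets `['Bread', 'Apples']`, is flagged as a repeat of 20/01/2019, and ends up with the Apples row's history instead.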

Votes: 2

Stack Overflow user

Posted 2019-09-19 09:15:46

Another way, using apply and np.cumsum():

import itertools
import numpy as np

# aggregate Item into a list per 'Customer' and 'Date'
m=df.groupby(['Customer','Date'])['Item'].apply(lambda x: 
                               [*itertools.chain.from_iterable([x])])
# within each Customer, cumulatively concatenate the lists, then shift down one date
n=m.groupby(level=0).apply(lambda x:np.cumsum(x).shift())
# assign back; note that this overwrites the Item column with the history
df.set_index(['Customer','Date']).assign(Item=n).reset_index()

   Customer        Date                            Item
0      Bert  01/01/2019                             NaN
1      Bert  15/01/2019                         [Bread]
2      Bert  20/01/2019                 [Bread, Cheese]
3      Bert  20/01/2019                 [Bread, Cheese]
4      Bert  22/01/2019  [Bread, Cheese, Apples, Pears]
5     Ernie  01/01/2019                             NaN
6     Ernie  15/01/2019                          [Toys]
7     Ernie  20/01/2019               [Toys, Shellfish]
8     Ernie  22/01/2019          [Toys, Shellfish, Dog]
9    Steven  01/01/2019                             NaN
10   Steven  15/01/2019                        [Toilet]
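The same two steps (collect items per date, then take a running concatenation shifted by one date) can also be spelled out with `itertools.accumulate`, which avoids relying on a cumulative sum over list values. This is only a sketch under stated assumptions: `add_item_history` and the temporary `_date` column are hypothetical names, dates are assumed to be `dd/mm/yyyy` strings, and a customer's first date gets an empty list rather than NaN.

```python
from itertools import accumulate

import pandas as pd


def add_item_history(df):
    """Hypothetical helper: attach each customer's items from strictly earlier dates."""
    out = df.copy()
    # Parse day-first strings so dates order chronologically, not alphabetically.
    out["_date"] = pd.to_datetime(out["Date"], format="%d/%m/%Y")

    # One list of items per (Customer, date); groupby sorts the keys.
    per_date = out.groupby(["Customer", "_date"])["Item"].agg(list)

    history = {}
    for _, lists in per_date.groupby(level="Customer"):
        # Running concatenation: entry k holds every item up to and including date k.
        running = list(accumulate(lists.tolist(), lambda a, b: a + b))
        # Shift by one so each date only sees strictly earlier dates.
        shifted = [[]] + running[:-1]
        history.update(zip(lists.index, shifted))

    # Look up each row's history by its (Customer, date) key.
    keys = zip(out["Customer"], out["_date"])
    out["Item History"] = pd.Series([history[k] for k in keys], index=out.index)
    return out.drop(columns="_date")
```

Calling `add_item_history(df)` on the sample frame keeps the original Item column and row order, and rows that share a date receive the same history.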

Votes: 2

Stack Overflow user

Posted 2019-09-19 10:23:40

A possibly simpler answer using apply, though it may be slower than the other approaches:

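# Note: Date holds 'dd/mm/yyyy' strings, so `<` compares them as text. That works for the
# sample data, but in general convert the column with pd.to_datetime first.
df['item history'] = df.apply(
    lambda x: df.loc[(df.Date < x.Date) & (df.Customer == x.Customer), 'Item'].tolist(),
    axis=1)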

Result:

   Customer               ...                                  item history
0      Bert               ...                                            []
1      Bert               ...                                       [Bread]
2      Bert               ...                               [Bread, Cheese]
3      Bert               ...                               [Bread, Cheese]
4      Bert               ...                [Bread, Cheese, Apples, Pears]
5     Ernie               ...                                            []
6     Ernie               ...                                        [Toys]
7     Ernie               ...                             [Toys, Shellfish]
8     Ernie               ...                        [Toys, Shellfish, Dog]
9    Steven               ...                                            []
10   Steven               ...                                      [Toilet]

If you need a list of unique items, you may want to wrap the result in list(set()).
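One caveat with `list(set(...))` is that a set does not keep purchase order. A small sketch of an order-preserving variant follows, using `dict.fromkeys` and comparing parsed dates instead of strings. The frame is made-up data chosen so that string comparison would give the wrong order, and `unique_history` is a hypothetical helper name.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer": ["Bert", "Bert", "Bert", "Bert"],
    "Date": ["25/01/2019", "02/02/2019", "02/02/2019", "10/02/2019"],
    "Item": ["Bread", "Cheese", "Bread", "Pears"],
})

# Compare real dates: as strings, "25/01/2019" would sort after "02/02/2019".
dates = pd.to_datetime(df["Date"], format="%d/%m/%Y")


def unique_history(row):
    # Rows of the same customer with a strictly earlier date.
    earlier = (dates < dates[row.name]) & (df["Customer"] == row["Customer"])
    # dict.fromkeys drops repeats while keeping first-seen order.
    return list(dict.fromkeys(df.loc[earlier, "Item"]))


df["item history"] = df.apply(unique_history, axis=1)
print(df)
```

The last row sees Bread twice among its earlier rows but lists it once, ahead of Cheese.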

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/58007296
