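This is an extension of my previous question. I have a source dataframe with three columns: Customer, Date and Item. I want to add a new column containing the item history, which is an array of all of that customer's items found in earlier rows (as defined by date). If a customer made several purchases on the same date, neither row's item should be listed in the item history of the other.
So, given this sample data: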
import numpy as np
import pandas as pd

df = pd.DataFrame({'Customer':['Bert', 'Bert', 'Bert', 'Bert', 'Bert', 'Ernie', 'Ernie', 'Ernie', 'Ernie', 'Steven', 'Steven'], 'Date':['01/01/2019', '15/01/2019', '20/01/2019', '20/01/2019', '22/01/2019', '01/01/2019', '15/01/2019', '20/01/2019', '22/01/2019', '01/01/2019' ,'15/01/2019'], 'Item':['Bread', 'Cheese', 'Apples', 'Pears', 'Toothbrush', 'Toys', 'Shellfish', 'Dog', 'Yoghurt', 'Toilet', 'Dominos']})

Customer Date Item
Bert 01/01/2019 Bread
Bert 15/01/2019 Cheese
Bert 20/01/2019 Apples
Bert 20/01/2019 Pears
Bert 22/01/2019 Toothbrush
Ernie 01/01/2019 Toys
Ernie 15/01/2019 Shellfish
Ernie 20/01/2019 Dog
Ernie 22/01/2019 Yoghurt
Steven 01/01/2019 Toilet
Steven 15/01/2019 Dominos

The output I would like to see is:
Customer Date Item Item History
Bert 01/01/2019 Bread NaN
Bert 15/01/2019 Cheese [Bread]
Bert 20/01/2019 Apples [Bread, Cheese]
Bert 20/01/2019 Pears [Bread, Cheese]
Bert 22/01/2019 Toothbrush [Bread, Cheese, Apples, Pears]
Ernie 01/01/2019 Toys NaN
Ernie 15/01/2019 Shellfish [Toys]
Ernie 20/01/2019 Dog [Toys, Shellfish]
Ernie 22/01/2019 Yoghurt [Toys, Shellfish, Dog]
Steven 01/01/2019 Toilet NaN
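Steven 15/01/2019 Dominos [Toilet]

Note that for Bert's two purchases on 20/01/2019, neither row's history column contains the other row's item. For his purchase on 22/01/2019, both items from 20/01/2019 are included.
The answer to the previous question was a neat list comprehension of the form: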
df['Item History'] = [x.Item[:i].tolist() for j, x in df.groupby('Customer')
for i in range(len(x))]
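df.loc[~df['Item History'].astype(bool), 'Item History']= np.nan

But obviously the "i" in x.Item[:i] needs to work out the last row whose date is not the same as the current row's. Any suggestions on how to achieve this would be much appreciated.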
Posted on 2019-09-19 09:21:59
The idea is to flag the repeated values within each group with DataFrame.duplicated, replace them with NaN, and then forward-fill the missing values.
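To see the mask-and-forward-fill step in isolation, here is a minimal sketch on a toy frame. The frame `toy`, the name `is_repeat` and the placeholder strings `h0` to `h3` are made up for illustration; the strings stand in for the per-row history lists.

```python
import pandas as pd

# Hypothetical toy frame: rows 1 and 2 share the same (Customer, Date) pair.
toy = pd.DataFrame({
    'Customer': ['Bert', 'Bert', 'Bert', 'Bert'],
    'Date': ['15/01/2019', '20/01/2019', '20/01/2019', '22/01/2019'],
    'History': ['h0', 'h1', 'h2', 'h3'],  # stand-ins for the per-row lists
})

# True for every repeat of a (Customer, Date) pair after its first occurrence.
is_repeat = toy.duplicated(['Customer', 'Date'])

# Blank out the repeats, then copy the previous surviving value down.
toy['History'] = toy['History'].mask(is_repeat).ffill()
print(toy['History'].tolist())
```

Row 2 ends up with the value of row 1, which is how both same-date rows come to share one history. This assumes the rows are already ordered by customer and date.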
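The first value of each group is always an empty list and is never flagged as a duplicate, so the replacement does not have to be done per group: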
df['Item History'] = [x.Item[:i].tolist() for j, x in df.groupby('Customer')
for i in range(len(x))]
df['Item History'] = df['Item History'].mask(df.duplicated(['Customer','Date'])).ffill()
df.loc[~df['Item History'].astype(bool), 'Item History']= np.nan
print (df)
Customer Date Item Item History
0 Bert 01/01/2019 Bread NaN
1 Bert 15/01/2019 Cheese [Bread]
2 Bert 20/01/2019 Apples [Bread, Cheese]
3 Bert 20/01/2019 Pears [Bread, Cheese]
4 Bert 22/01/2019 Toothbrush [Bread, Cheese, Apples, Pears]
5 Ernie 01/01/2019 Toys NaN
6 Ernie 15/01/2019 Shellfish [Toys]
7 Ernie 20/01/2019 Dog [Toys, Shellfish]
8 Ernie 22/01/2019 Yoghurt [Toys, Shellfish, Dog]
9 Steven 01/01/2019 Toilet NaN
10 Steven 15/01/2019 Dominos [Toilet]

Posted on 2019-09-19 09:15:46
Another way, using apply and np.cumsum():
import itertools
import numpy as np

# aggregate Item into a list per 'Customer' and 'Date'
m=df.groupby(['Customer','Date'])['Item'].apply(lambda x:
    [*itertools.chain.from_iterable([x])])
# within each Customer, cumulatively concatenate the lists, then shift down one row
n=m.groupby(level=0).apply(lambda x:np.cumsum(x).shift())
# assign the result back
df.set_index(['Customer','Date']).assign(Item=n).reset_index()

Customer Date Item
0 Bert 01/01/2019 NaN
1 Bert 15/01/2019 [Bread]
2 Bert 20/01/2019 [Bread, Cheese]
3 Bert 20/01/2019 [Bread, Cheese]
4 Bert 22/01/2019 [Bread, Cheese, Apples, Pears]
5 Ernie 01/01/2019 NaN
6 Ernie 15/01/2019 [Toys]
7 Ernie 20/01/2019 [Toys, Shellfish]
8 Ernie 22/01/2019 [Toys, Shellfish, Dog]
9 Steven 01/01/2019 NaN
10 Steven 15/01/2019 [Toilet]

Posted on 2019-09-19 10:23:40
A possibly simpler answer using apply, though it is likely slower than the other approaches:
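# Note: Date holds strings here, so "<" compares them lexicographically;
# that matches chronological order only while all dates share a month and year.
df['item history'] = df.apply(lambda x:
    list(df.loc[(df.Date<x.Date)&(df.Customer==x.Customer),'Item']), axis=1)

Result: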
Customer ... item history
0 Bert ... []
1 Bert ... [Bread]
2 Bert ... [Bread, Cheese]
3 Bert ... [Bread, Cheese]
4 Bert ... [Bread, Cheese, Apples, Pears]
5 Ernie ... []
6 Ernie ... [Toys]
7 Ernie ... [Toys, Shellfish]
8 Ernie ... [Toys, Shellfish, Dog]
9 Steven ... []
10 Steven ... [Toilet]

If you need a list of unique items, you may want to wrap the result in list(set()).
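Two caveats, sketched under assumptions. The Date column in the question holds day-first strings, so a string comparison would place '03/02/2019' before '25/01/2019'. And list(set()) does not keep purchase order. The sketch below parses the dates first and uses dict.fromkeys as an order-preserving alternative. The sample rows and the helper name `history` are hypothetical.

```python
import pandas as pd

# Hypothetical sample spanning two months, with a repeated item.
df = pd.DataFrame({
    'Customer': ['Bert', 'Bert', 'Bert', 'Bert'],
    'Date': ['25/01/2019', '03/02/2019', '03/02/2019', '10/02/2019'],
    'Item': ['Bread', 'Cheese', 'Bread', 'Pears'],
})

# Assumption: dates are day-first strings; parse them so "<" is chronological.
dates = pd.to_datetime(df['Date'], format='%d/%m/%Y')

def history(row):
    # Rows of the same customer with a strictly earlier date.
    earlier = (dates < dates[row.name]) & (df['Customer'] == row['Customer'])
    # dict.fromkeys drops repeats while keeping first-seen order.
    return list(dict.fromkeys(df.loc[earlier, 'Item']))

df['item history'] = df.apply(history, axis=1)
print(df['item history'].tolist())
# [[], ['Bread'], ['Bread'], ['Bread', 'Cheese']]
```

Like the apply answer above, this rescans the whole frame for every row, so it trades speed for readability.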
https://stackoverflow.com/questions/58007296