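I have a slightly unusual pandas groupby problem.
I have a source dataframe with three columns: Customer, Date and Item. I want to add a new column containing the item history: an array of all of that customer's items from earlier rows (earlier as defined by Date). For example, given this source dataframe: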
Customer  Date        Item
Bert      01/01/2019  Bread
Bert      15/01/2019  Cheese
Bert      20/01/2019  Apples
Bert      22/01/2019  Pears
Ernie     01/01/2019  Buzz Lightyear
Ernie     15/01/2019  Shellfish
Ernie     20/01/2019  A pet dog
Ernie     22/01/2019  Yoghurt
Steven    01/01/2019  A golden toilet
Steven    15/01/2019  Dominoes
I want to create this Item History feature:
Customer  Date        Item             Item History
Bert      01/01/2019  Bread            NaN
Bert      15/01/2019  Cheese           [Bread]
Bert      20/01/2019  Apples           [Bread, Cheese]
Bert      22/01/2019  Pears            [Bread, Cheese, Apples]
Ernie     01/01/2019  Buzz Lightyear   NaN
Ernie     15/01/2019  Shellfish        [Buzz Lightyear]
Ernie     20/01/2019  A pet dog        [Buzz Lightyear, Shellfish]
Ernie     22/01/2019  Yoghurt          [Buzz Lightyear, Shellfish, A pet dog]
Steven    01/01/2019  A golden toilet  NaN
Steven    15/01/2019  Dominoes         [A golden toilet]
I can do the following to get a history per date:
df.groupby(['Customer', 'Date']).agg(lambda x: tuple(x)).applymap(list).reset_index()
With this, if a customer bought multiple items on one day they are all listed in one array, and a customer who bought only a single item gets it in an array of its own. But I can't figure out how to concatenate these with the preceding rows.
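To make the per-day grouping above concrete, here is a minimal sketch on a cut-down version of the sample data. The extra "Butter" row is made up purely to show a same-day purchase, and `.agg(list)` is used in place of the tuple/applymap round trip; neither comes from the original post.

```python
import pandas as pd

# A few sample rows, plus one hypothetical same-day purchase (Butter).
df = pd.DataFrame({
    "Customer": ["Bert", "Bert", "Bert", "Ernie"],
    "Date": ["01/01/2019", "01/01/2019", "15/01/2019", "01/01/2019"],
    "Item": ["Bread", "Butter", "Cheese", "Buzz Lightyear"],
})

# Parse the day-first strings so ordering is chronological, not lexical.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# One row per (Customer, Date), with that day's items collected into a list.
per_day = df.groupby(["Customer", "Date"])["Item"].agg(list).reset_index()
print(per_day)
```

This gives one list per customer per day, which is the starting point described above; it does not yet carry items forward from earlier dates.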
Posted on 2019-09-17 12:39:37
Use a custom lambda function with GroupBy.transform, then replace the empty lists with NaN at the end:
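import numpy as np

# For each row of a group, collect the items from the group's earlier rows.
f = lambda x: [x[:i].tolist() for i in range(len(x))]
df['Item History'] = df.groupby('Customer')['Item'].transform(f)
Another solution, using a list comprehension:
df['Item History'] = [x.Item[:i].tolist() for j, x in df.groupby('Customer')
                      for i in range(len(x))]
# Empty lists are falsy, so this mask selects the rows with no history.
df.loc[~df['Item History'].astype(bool), 'Item History'] = np.nan
print (df)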
Customer Date Item \
0 Bert 01/01/2019 Bread
1 Bert 15/01/2019 Cheese
2 Bert 20/01/2019 Apples
3 Bert 22/01/2019 Pears
4 Ernie 01/01/2019 Buzz Lightyear
5 Ernie 15/01/2019 Shellfish
6 Ernie 20/01/2019 A pet dog
7 Ernie 22/01/2019 Yoghurt
8 Steven 01/01/2019 A golden toilet
9 Steven 15/01/2019 Dominoes
Item History
0 NaN
1 [Bread]
2 [Bread, Cheese]
3 [Bread, Cheese, Apples]
4 NaN
5 [Buzz Lightyear]
6 [Buzz Lightyear, Shellfish]
7 [Buzz Lightyear, Shellfish, A pet dog]
8 NaN
9 [A golden toilet]
Posted on 2020-01-30 10:30:00
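I used @jezrael's answer for quite a while, but it ultimately proved too slow for the size of dataset I have. To improve on this, I wrote a function that does the same thing: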
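def buildItemHistoryPy(customers, items):
    # Assumes each customer's rows are contiguous and already in date order.
    output = []
    customer_ix = 0  # index of the current customer's first row
    for i in range(len(customers)):
        if customers[i] == customers[i-1]:
            # Same customer as the previous row: history runs from customer_ix up to i.
            output.append(items[customer_ix:i])
        else:
            # New customer: reset the start index, so the history slice is empty.
            customer_ix = i
            output.append(items[customer_ix:i])
    return output
df['Item History'] = buildItemHistoryPy(df.CustomerAccountNum.values, df.ItemID.values)
My intention was to use this as the basis for a Cython function (which I expected to be much faster), but to my surprise the plain Python function was already much faster on its own. I went ahead and wrote the Cython version anyway: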
%%cython
import numpy as np
cimport numpy as np
cpdef list buildItemHistoryCy(np.ndarray customers, np.ndarray items):
    cdef list output = []
    cdef int customer_ix = 0
    for i in range(len(customers)):
        if customers[i] == customers[i-1]:
            output.append(items[customer_ix:i])
        else:
            customer_ix = i
            output.append(items[customer_ix:i])
    return output
Both functions are substantially faster, but the Cython one is the best, by a modest margin:
%timeit -n5 df['Item History1'] = [x.ItemID[:i].tolist() for j, x in df.groupby('CustomerAccountNum') for i in range(len(x))]
%timeit -n5 df['Item History2'] = buildItemHistoryPy(df.CustomerAccountNum.values, df.ItemID.values)
%timeit -n5 df['Item History3'] = buildItemHistoryCy(df.CustomerAccountNum.values, df.ItemID.values)
7.46 s ± 346 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
53.5 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
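23.6 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
My requirements changed slightly, so I no longer needed to replace the empty lists with NaN. If you do need that, the functions would have to be changed to append items[customer_ix:i].tolist() instead.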
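As a rough sketch of that variant, the loop below appends `.tolist()` slices and marks an empty history with NaN. The name `build_item_history`, the use of NaN as the "no history" marker inside the loop, and the explicit sort step are assumptions for illustration, not part of the functions above; the sort is there because the loop relies on each customer's rows being contiguous and in date order.

```python
import numpy as np
import pandas as pd


def build_item_history(customers, items):
    """Return, for each row, the list of earlier items for the same customer.

    Assumes rows are sorted so that each customer's rows are contiguous
    and in date order.
    """
    output = []
    start = 0  # index of the first row of the current customer
    for i in range(len(customers)):
        if i == 0 or customers[i] != customers[i - 1]:
            start = i
        history = items[start:i].tolist()
        # Hypothetical choice: NaN marks "no earlier purchases".
        output.append(history if history else np.nan)
    return output


df = pd.DataFrame({
    "Customer": ["Bert", "Bert", "Bert", "Steven", "Steven"],
    "Date": ["01/01/2019", "15/01/2019", "20/01/2019", "01/01/2019", "15/01/2019"],
    "Item": ["Bread", "Cheese", "Apples", "A golden toilet", "Dominoes"],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df = df.sort_values(["Customer", "Date"]).reset_index(drop=True)

df["Item History"] = build_item_history(df["Customer"].to_numpy(), df["Item"].to_numpy())
print(df)
```

Doing the NaN substitution inside the loop avoids a second pass over the column, at the cost of mixing floats and lists in the returned list.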
https://stackoverflow.com/questions/57974133