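Hello, I have data with regions, customers, and some deliveries. One column is used as the purchase type: the first and last purchases are marked "FIRST" and "LAST", and sometimes there are intermediate deliveries marked "DELIVERY". I need to flag the customer and region combinations that have no intermediate deliveries at all, as a column in the expected output. Flagging the intermediate deliveries on individual rows is not hard, but the whole customer/region group needs to be flagged.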
import pandas as pd
data = [['NY', 'A','FIRST', 10], ['NY', 'A','DELIVERY', 20], ['NY', 'A','DELIVERY', 30], ['NY', 'A','LAST', 25],
['NY', 'B','FIRST', 15], ['NY', 'B','DELIVERY', 10], ['NY', 'B','LAST', 20],
['FL', 'A','FIRST', 15], ['FL', 'A','DELIVERY', 10], ['FL', 'A','DELIVERY', 12], ['FL', 'A','DELIVERY', 25], ['FL', 'A','LAST', 20],
['FL', 'C','FIRST', 15], ['FL', 'C','LAST', 10],
['FL', 'D','FIRST', 10], ['FL', 'D','DELIVERY', 20], ['FL', 'D','LAST', 30],
['FL', 'E','FIRST', 20], ['FL', 'E','LAST', 20]
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['region', 'customer', 'purchaseType', 'price'])
# print dataframe.
df
Prints:
region customer purchaseType price
0 NY A FIRST 10
1 NY A DELIVERY 20
2 NY A DELIVERY 30
3 NY A LAST 25
4 NY B FIRST 15
5 NY B DELIVERY 10
6 NY B LAST 20
7 FL A FIRST 15
8 FL A DELIVERY 10
9 FL A DELIVERY 12
10 FL A DELIVERY 25
11 FL A LAST 20
12 FL C FIRST 15
13 FL C LAST 10
14 FL D FIRST 10
15 FL D DELIVERY 20
16 FL D LAST 30
17 FL E FIRST 20
18 FL E LAST 20
Expected output:
region customer purchaseType price noDeliveryFlag
0 NY A FIRST 10 0
1 NY A DELIVERY 20 0
2 NY A DELIVERY 30 0
3 NY A LAST 25 0
4 NY B FIRST 15 0
5 NY B DELIVERY 10 0
6 NY B LAST 20 0
7 FL A FIRST 15 0
8 FL A DELIVERY 10 0
9 FL A DELIVERY 12 0
10 FL A DELIVERY 25 0
11 FL A LAST 20 0
12 FL C FIRST 15 1
13 FL C LAST 10 1
14 FL D FIRST 10 0
15 FL D DELIVERY 20 0
16 FL D LAST 30 0
17 FL E FIRST 20 1
18 FL E LAST 20 1
Thank you very much!
Posted on 2020-11-28 00:02:46
I think I figured it out:
df['noDeliveryFlag'] = df['purchaseType'] != 'DELIVERY'
df['noDeliveryFlag'] = df.groupby(['region','customer'])['noDeliveryFlag'].transform('min').astype(int)
print(df)
If anyone has a more efficient way, I would appreciate it.
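One possible variation on the same idea, offered as a sketch rather than a benchmarked improvement: build the boolean mask once and broadcast a per-group `any` back to the rows, so no helper column is needed. The name `has_delivery` and the shortened sample frame are introduced here for illustration only.

```python
import pandas as pd

# Shortened sample: NY/A has a DELIVERY row, FL/C does not
df = pd.DataFrame(
    [['NY', 'A', 'FIRST', 10], ['NY', 'A', 'DELIVERY', 20], ['NY', 'A', 'LAST', 25],
     ['FL', 'C', 'FIRST', 15], ['FL', 'C', 'LAST', 10]],
    columns=['region', 'customer', 'purchaseType', 'price'])

# True on every row of a (region, customer) group that has at least one DELIVERY row
has_delivery = (df['purchaseType'].eq('DELIVERY')
                .groupby([df['region'], df['customer']])
                .transform('any'))

# Invert the mask and store it as 0/1, matching the expected output in the question
df['noDeliveryFlag'] = (~has_delivery).astype(int)
print(df)
```

Because `transform` returns a result aligned with the original index, the row order of `df` is left unchanged.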
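Posted on 2020-11-28 00:00:39
First, we work out the delivery status by region and customer. To do that, we group by region and customer, then check within each group whether 'DELIVERY' appears in that group's part of the purchaseType series. If there is no delivery, we assign the group 1, otherwise 0 (using True/False might be more natural here, but this sticks to the question).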
delivery_status = (df.groupby(['region', 'customer'], sort=False)['purchaseType']
.apply(lambda d: 1*('DELIVERY' not in d.values))
.rename('noDeliveryFlag')
)
delivery_status
This produces
region customer
NY A 0
B 0
FL A 0
C 1
D 0
E 1
Name: noDeliveryFlag, dtype: int64
Then we join it back into the original df.
(df.set_index(['region', 'customer'])
.join(delivery_status,how = 'left', sort=False)
.reset_index()
)
to get
region customer purchaseType price noDeliveryFlag
-- -------- ---------- -------------- ------- ----------------
0 FL A FIRST 15 0
1 FL A DELIVERY 10 0
2 FL A DELIVERY 12 0
3 FL A DELIVERY 25 0
4 FL A LAST 20 0
5 FL C FIRST 15 1
6 FL C LAST 10 1
7 FL D FIRST 10 0
8 FL D DELIVERY 20 0
9 FL D LAST 30 0
10 FL E FIRST 20 1
11 FL E LAST 20 1
12 NY A FIRST 10 0
13 NY A DELIVERY 20 0
14 NY A DELIVERY 30 0
15 NY A LAST 25 0
16 NY B FIRST 15 0
17 NY B DELIVERY 10 0
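18 NY B LAST 20 0
Note that this solution does not check whether there is a delivery between FIRST and LAST; it only checks whether the region/customer has no deliveries at all.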
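If the position of the delivery matters, a positional check could be sketched as follows. It assumes the rows of each group are already in chronological order, and it treats a group missing FIRST or LAST as having no bracketed delivery; both are assumptions, not something the question states. The helper `no_delivery_between` and customer `X` (a DELIVERY recorded after LAST) are hypothetical.

```python
import pandas as pd

def no_delivery_between(types):
    """Return 1 if no DELIVERY row sits between the first FIRST and the last LAST."""
    vals = types.tolist()
    if 'FIRST' not in vals or 'LAST' not in vals:
        return 1  # assumption: an incomplete group counts as having no bracketed delivery
    start = vals.index('FIRST')
    end = len(vals) - 1 - vals[::-1].index('LAST')
    return int('DELIVERY' not in vals[start + 1:end])

df = pd.DataFrame(
    [['NY', 'A', 'FIRST', 10], ['NY', 'A', 'DELIVERY', 20], ['NY', 'A', 'LAST', 25],
     ['FL', 'C', 'FIRST', 15], ['FL', 'C', 'LAST', 10],
     # hypothetical customer X: the DELIVERY comes after LAST
     ['FL', 'X', 'FIRST', 5], ['FL', 'X', 'LAST', 7], ['FL', 'X', 'DELIVERY', 9]],
    columns=['region', 'customer', 'purchaseType', 'price'])

# One 0/1 value per (region, customer) group
flags = (df.groupby(['region', 'customer'], sort=False)['purchaseType']
         .apply(no_delivery_between)
         .rename('noDeliveryFlag'))

# Join on the key columns; a left join keeps df's original row order and index
result = df.join(flags, on=['region', 'customer'])
print(result)
```

Joining with `on=` rather than moving the keys into the index is a design choice here: it avoids the `set_index`/`reset_index` round trip and leaves the rows in their original order.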
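Posted on 2020-11-28 00:09:21
You can use transform with size in a groupby operation.
This approach assumes that a group with only 2 purchaseType rows has no deliveries; it does not account for deliveries still in progress.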
import numpy as np
# flag every row of a (customer, region) group that has exactly two rows
df['noDeliveryFlag'] = np.where(df.groupby(['customer','region'])
['purchaseType'].transform('size').eq(2),1,0)
region customer purchaseType price noDeliveryFlag
0 NY A FIRST 10 0
1 NY A DELIVERY 20 0
2 NY A DELIVERY 30 0
3 NY A LAST 25 0
4 NY B FIRST 15 0
5 NY B DELIVERY 10 0
6 NY B LAST 20 0
7 FL A FIRST 15 0
8 FL A DELIVERY 10 0
9 FL A DELIVERY 12 0
10 FL A DELIVERY 25 0
11 FL A LAST 20 0
12 FL C FIRST 15 1
13 FL C LAST 10 1
14 FL D FIRST 10 0
15 FL D DELIVERY 20 0
16 FL D LAST 30 0
17 FL E FIRST 20 1
18 FL E DELIVERY 20 1
https://stackoverflow.com/questions/65044916
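To make the caveat of the size-based approach concrete, here is a small sketch comparing it with a membership check. Customer `Z` (a FIRST and a DELIVERY, but no LAST yet) and the column names `flagBySize` and `flagByType` are hypothetical and only serve the comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['FL', 'C', 'FIRST', 15], ['FL', 'C', 'LAST', 10],
     # hypothetical customer Z: delivery in progress, no LAST row yet
     ['FL', 'Z', 'FIRST', 20], ['FL', 'Z', 'DELIVERY', 20]],
    columns=['region', 'customer', 'purchaseType', 'price'])

# Size-based flag: 1 whenever the group has exactly two rows
sizes = df.groupby(['customer', 'region'])['purchaseType'].transform('size')
df['flagBySize'] = np.where(sizes.eq(2), 1, 0)

# Membership-based flag: 1 only when the group contains no DELIVERY row
has_delivery = (df['purchaseType'].eq('DELIVERY')
                .groupby([df['customer'], df['region']])
                .transform('any'))
df['flagByType'] = (~has_delivery).astype(int)

print(df)
```

Both groups have two rows, so the size-based flag marks both, while the membership check marks only customer C.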