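I need to create a new column as follows:
If an item's frequency is greater than or equal to 5, set it to "best seller"; if the frequency is between 2 (inclusive) and 5 (exclusive), set it to "ok"; otherwise set it to "bad".
Suppose my dataset looks like this: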
Items Date
calzini 2020/02/23
cintura 2020/02/21
maglietta 2020/02/23
maglietta 2020/02/22
cappello 2020/02/23
jeans 2020/02/23
cappello 2020/02/22
maglietta 2020/02/22
maglietta 2020/02/22
jeans 2020/02/22
jeans 2020/02/23
maglietta 2020/02/23
jeans 2020/02/22
jeans 2020/02/23
I would like to get:
Items Category
calzini bad
cintura bad
maglietta best seller
maglietta best seller
jeans best seller
cappello ok
jeans best seller
cappello ok
maglietta best seller
maglietta best seller
jeans best seller
maglietta best seller
jeans best seller
jeans best seller
I have already determined the frequencies of these items as follows:
sold_items = df.groupby(['Items'])['Date'].count().sort_values(ascending=False)  # the items should be counted overall, not using a specific Date! It is about how many items were sold
I would like to ask how to use these values to create a new column.
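For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample data and the frequency count. The rows are copied from the table above; the `Date` values are kept as plain strings, which is an assumption since the question does not state the column dtype.

```python
import pandas as pd

# Sample data copied from the question (Date kept as strings: an assumption)
df = pd.DataFrame({
    "Items": ["calzini", "cintura", "maglietta", "maglietta", "cappello",
              "jeans", "cappello", "maglietta", "maglietta", "jeans",
              "jeans", "maglietta", "jeans", "jeans"],
    "Date": ["2020/02/23", "2020/02/21", "2020/02/23", "2020/02/22",
             "2020/02/23", "2020/02/23", "2020/02/22", "2020/02/22",
             "2020/02/22", "2020/02/22", "2020/02/23", "2020/02/23",
             "2020/02/22", "2020/02/23"],
})

# Overall frequency of each item, regardless of the date it was sold on
sold_items = df.groupby("Items")["Date"].count().sort_values(ascending=False)
print(sold_items)
```

With this data, maglietta and jeans each occur 5 times, cappello twice, and calzini and cintura once.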
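Answered on 2020-06-26 19:35:18
You can use GroupBy.transform and np.select:
import numpy as np
# Per-row count of how often each item occurs in the whole frame
vals = df['Items'].groupby(df['Items']).transform('count')
condlist = [vals.ge(5), vals.ge(2) & vals.lt(5)]
choicelist = ['best seller', 'ok']
# Counts below 2 match no condition and fall through to the string default
df.assign(category = np.select(condlist, choicelist, default='bad'))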
Items Date category
0 calzini 2020/02/23 bad
1 cintura 2020/02/21 bad
2 maglietta 2020/02/23 best seller
3 maglietta 2020/02/22 best seller
4 cappello 2020/02/23 ok
5 jeans 2020/02/23 best seller
6 cappello 2020/02/22 ok
7 maglietta 2020/02/22 best seller
8 maglietta 2020/02/22 best seller
9 jeans 2020/02/22 best seller
10 jeans 2020/02/23 best seller
11 maglietta 2020/02/23 best seller
12 jeans 2020/02/22 best seller
13 jeans 2020/02/23 best seller
Answered on 2020-06-26 19:40:51
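The code below should work, assuming sold_items is a column holding each row's item frequency and the largest frequency is above 4 (otherwise the bin edges are not increasing):
df['sold_items'] = df.groupby('Items')['Items'].transform('count')  # per-row frequency of each item
df['category'] = pd.cut(df['sold_items'], bins=[0, 1, 4, df['sold_items'].max()], labels=['bad', 'ok', 'best seller'])
Answered on 2020-06-26 19:38:39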
You can use cut on value_counts:
pd.cut(df['Items'].value_counts(),bins=[0,1,4,10])
maglietta (4, 10]
jeans (4, 10]
cappello (1, 4]
calzini (0, 1]
cintura (0, 1]
Name: Items, dtype: category
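Categories (3, interval[int64]): [(0, 1] < (1, 4] < (4, 10]]
So each bin excludes its lower edge (the round bracket on the left) and includes its upper edge (the square bracket on the right). Now we convert these labels into the ones you need:
cats = pd.cut(df['Items'].value_counts(),bins=[0,1,4,10],labels=['bad','ok','best seller'])
Just map the values by category and use .to_numpy() to assign them to a new column (thanks @Ch3steR for pointing it out, see the comments):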
df['Category'] = cats[df['Items']].to_numpy()
df
Items Date Category
0 calzini 2020/02/23 bad
1 cintura 2020/02/21 bad
2 maglietta 2020/02/23 best seller
3 maglietta 2020/02/22 best seller
4 cappello 2020/02/23 ok
5 jeans 2020/02/23 best seller
6 cappello 2020/02/22 ok
7 maglietta 2020/02/22 best seller
8 maglietta 2020/02/22 best seller
9 jeans 2020/02/22 best seller
10 jeans 2020/02/23 best seller
11 maglietta 2020/02/23 best seller
12 jeans 2020/02/22 best seller
13 jeans 2020/02/23 best seller
You can also use df['Category'] = df['Items'].map(cats)
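As a sanity check, here is a self-contained sketch that runs the np.select idea and the cut-plus-map idea from this thread on the sample data and compares the results. Two details are my own assumptions rather than part of the answers: `right=False` makes the bins read [0, 2), [2, 5), [5, inf), which mirrors the wording of the question, and `np.inf` replaces the hard-coded upper edge of 10 so that larger counts still land in the last bin.

```python
import numpy as np
import pandas as pd

# Items column copied from the question
df = pd.DataFrame({
    "Items": ["calzini", "cintura", "maglietta", "maglietta", "cappello",
              "jeans", "cappello", "maglietta", "maglietta", "jeans",
              "jeans", "maglietta", "jeans", "jeans"],
})

# Approach 1: per-row counts, then np.select (the first matching condition wins)
counts = df.groupby("Items")["Items"].transform("count")
by_select = np.select(
    [counts >= 5, counts >= 2],
    ["best seller", "ok"],
    default="bad",
)

# Approach 2: one count per item, binned with left-closed intervals, then mapped back
freq = df["Items"].value_counts()
cats = pd.cut(freq, bins=[0, 2, 5, np.inf], right=False,
              labels=["bad", "ok", "best seller"])
df["Category"] = df["Items"].map(cats).astype(str)

# True when both approaches label every row identically
print((df["Category"] == by_select).all())
print(df)
```

The map version computes the category once per distinct item and then looks it up per row, which keeps the thresholds in a single place if they need to change later.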
https://stackoverflow.com/questions/62601489