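In my machine learning projects, counting how often each value occurs seems to serve several purposes: it can count duplicates, it can be used for feature extraction, and fortunately it works for both numeric and categorical data (Ridit analysis).
My data seems to contain a lot of duplicates, and I want to check. Here is my data: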
No  feature_1  feature_2  feature_3
1.  67         45         56
2.  67         40         56
3.  67         40         51
This is what I want:
No  feature_1  feature_2  feature_3  duplication_1  duplication_2  duplication_3
1.  67         45         56         3              1              2
2.  67         40         56         3              2              2
3.  67         40         51         3              2              1
What I did:
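# Count the occurrences of each value in feature_1 and merge the counts back.
# The key column keeps the feature's own name so that the merge key exists in df.
df1 = df.groupby(['feature_1']).size().reset_index()
df1.columns = ['feature_1', 'duplication_1']
df = df.merge(df1, on='feature_1', how='left')
# Same steps for feature_2.
df2 = df.groupby(['feature_2']).size().reset_index()
df2.columns = ['feature_2', 'duplication_2']
df = df.merge(df2, on='feature_2', how='left')
# Same steps for feature_3.
df3 = df.groupby(['feature_3']).size().reset_index()
df3.columns = ['feature_3', 'duplication_3']
df = df.merge(df3, on='feature_3', how='left')
But I am looking for a better alternative, especially when there are a large number of features.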
Posted on 2018-07-24 17:24:15
For each column, use map with value_counts, or alternatively groupby with transform:
for i, x in enumerate(df.columns):
    df['duplication_{}'.format(i + 1)] = df[x].map(df[x].value_counts())
    #alternative
    #df['duplication_{}'.format(i + 1)] = df.groupby(x)[x].transform('size')
print (df)
     feature_1  feature_2  feature_3  duplication_1  duplication_2  duplication_3
No
1.0         67         45         56              3              1              2
2.0         67         40         56              3              2              2
3.0         67         40         51              3              2              1
https://stackoverflow.com/questions/51494967
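When there are many features, the loop above could be wrapped in a small helper so that only selected columns are counted and the original frame is left untouched. The following is a minimal sketch rather than part of the original answer: the function name add_duplication_counts and its columns and prefix parameters are hypothetical, and the sample frame simply rebuilds the three rows from the question.

```python
import pandas as pd


def add_duplication_counts(df, columns=None, prefix="duplication_"):
    """Return a copy of df with one value-count column per selected feature.

    Hypothetical helper for illustration; the name and parameters are assumptions.
    """
    # Fix the column list before adding anything, so new columns are not revisited.
    columns = list(df.columns) if columns is None else list(columns)
    out = df.copy()
    for i, col in enumerate(columns, start=1):
        # value_counts() maps each value to its frequency; map() looks it up per row.
        # value_counts() skips NaN by default, so missing values get NaN as count.
        out[f"{prefix}{i}"] = out[col].map(out[col].value_counts())
    return out


# Sample data rebuilt from the question, with "No" as the index.
df = pd.DataFrame(
    {
        "feature_1": [67, 67, 67],
        "feature_2": [45, 40, 40],
        "feature_3": [56, 56, 51],
    },
    index=pd.Index([1, 2, 3], name="No"),
)

result = add_duplication_counts(df)
print(result)
```

Passing columns=['feature_1', 'feature_3'] would restrict the counting to those two features; the output columns are then numbered by position in that list.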