
Find similar groups based on the intersection of values in a different column

Stack Overflow user
Asked 2017-02-24 04:11:45
Answers: 2 · Views: 801 · Followers: 0 · Votes: 2

I have a df that looks like this:

Group   Attribute

Cheese  Dairy
Cheese  Food
Cheese  Curd
Cow     Dairy
Cow     Food
Cow     Animal
Cow     Hair
Cow     Stomachs
Yogurt  Dairy
Yogurt  Food
Yogurt  Curd
Yogurt  Fruity

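What I want to do for each group is find the group most similar to it, based on the intersection of their attributes. The final form I'm after is: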

Group   TotalCount   LikeGroup   CommonWords  PCT

Cheese  3            Yogurt      3            100.0
Cow     5            Cheese      2            40.0
Yogurt  4            Cheese      3            75.0

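I realize this may be asking too much in one question. I can do a lot of it, but I'm stuck on how to count the attribute intersections, even between just one Group and another. If I could find the intersection between Cheese and Yogurt, that would point me in the right direction.

Is it possible to do this within a dataframe? I can see making several lists and taking the intersection between all pairs of lists, then using the lengths of the resulting lists to get the percentages.

For example, for Yogurt: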

>>> Yogurt = ['Dairy', 'Food', 'Curd', 'Fruity']
>>> Cheese = ['Dairy', 'Food', 'Curd']
>>> # Fraction of Yogurt's attributes shared with Cheese
>>> # (true division; on Python 2 use `from __future__ import division`)
>>> Yogurt_Cheese = len(set(Yogurt) & set(Cheese)) / len(Yogurt)
>>> Yogurt_Cheese
0.75

>>> Cow = ['Dairy', 'Food', 'Animal', 'Hair', 'Stomachs']
>>> Yogurt_Cow = len(set(Yogurt) & set(Cow)) / len(Yogurt)
>>> Yogurt_Cow
0.5

>>> max(Yogurt_Cheese, Yogurt_Cow)
0.75

2 Answers

Stack Overflow user

Accepted answer

Posted 2017-02-24 05:50:29

I created my own smaller version of your sample array.

import pandas as pd
from itertools import permutations

df = pd.DataFrame(
    data=[['cheese', 'dairy'], ['cheese', 'food'], ['cheese', 'curd'],
          ['cow', 'dairy'], ['cow', 'food'],
          ['yogurt', 'dairy'], ['yogurt', 'food'], ['yogurt', 'curd'], ['yogurt', 'fruity']],
    columns=['Group', 'Attribute'])

# Number of attributes per group (the TotalCount), used later.
# Selecting the column first yields a flat {group: count} dict on Python 2 and 3.
count_dct = df.groupby('Group')['Attribute'].count().to_dict()

unique_grp = df['Group'].unique()      # the unique groups
unique_atr = df['Attribute'].unique()  # the unique attributes

# All ordered pairs of distinct groups (permutations, not combinations)
combos = list(permutations(unique_grp, 2))
comp_df = pd.DataFrame(data=combos, columns=['Group', 'LikeGroup'])  # table to hold the comparison data
comp_df['CommonWords'] = 0

for atr in unique_atr:
    # Rows of the dataframe that carry the attribute examined in this iteration
    temp_df = df[df['Attribute'] == atr]

    # Ordered pairs of groups that have this attribute in common
    myl = list(permutations(temp_df['Group'], 2))
    for comb in myl:
        # Increment CommonWords where Group equals the first entry of the
        # tuple and LikeGroup equals the second
        comp_df.loc[(comp_df['Group'] == comb[0]) & (comp_df['LikeGroup'] == comb[1]), 'CommonWords'] += 1

# Put the previously computed TotalCount into the comparison dataframe
comp_df['TotalCount'] = comp_df['Group'].map(count_dct)

comp_df['PCT'] = (comp_df['CommonWords'] * 100.0 / comp_df['TotalCount']).round().astype(int)

For my sample data, I get the output

    Group LikeGroup  CommonWords  TotalCount  PCT
0  cheese       cow            2           3   67
1  cheese    yogurt            3           3  100
2     cow    cheese            2           2  100
3     cow    yogurt            2           2  100
4  yogurt    cheese            3           4   75
5  yogurt       cow            2           4   50

This seems to be correct.
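
To get from this pairwise table to the one-row-per-group form asked for in the question, one option might be to keep the row with the highest PCT for each group. The sketch below is an assumption about how that could be done, not part of the original answer; it rebuilds the output table by hand so that it runs on its own, and `best_idx` and `best` are illustrative names.

```python
import pandas as pd

# The comparison table printed above, rebuilt by hand
comp_df = pd.DataFrame({
    'Group': ['cheese', 'cheese', 'cow', 'cow', 'yogurt', 'yogurt'],
    'LikeGroup': ['cow', 'yogurt', 'cheese', 'yogurt', 'cheese', 'cow'],
    'CommonWords': [2, 3, 2, 2, 3, 2],
    'TotalCount': [3, 3, 2, 2, 4, 4],
    'PCT': [67, 100, 100, 100, 75, 50],
})

# Index label of the highest-PCT row within each group;
# on ties, idxmax keeps the first row it encounters
best_idx = comp_df.groupby('Group')['PCT'].idxmax()

columns = ['Group', 'TotalCount', 'LikeGroup', 'CommonWords', 'PCT']
best = comp_df.loc[best_idx, columns].reset_index(drop=True)
print(best)
```

In this sample, cow matches cheese and yogurt equally well, so which one is reported depends purely on row order.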

Votes: 4

Stack Overflow user

Posted 2017-02-24 04:41:20

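It looks like you should be able to craft an aggregation strategy to crack this. Try looking at these coding samples and think about how to construct keys and aggregate functions across your dataframe, rather than tackling it piecemeal as in your example.

Try running this in your Python environment (it was created in Jupyter notebooks using Python 2.7) and see if it gives you some ideas for your code: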

import numpy as np
import pandas as pd

np.random.seed(10)    # optional .. makes sure you get same random
                      # numbers used in the original experiment
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

df  # in a notebook, this displays the dataframe
group = df.groupby('key1')             # group on a single key
group2 = df.groupby(['key1', 'key2'])  # group on a pair of keys
group2.agg(['count', 'sum', 'min', 'max', 'mean', 'std'])
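
Applied to the data in the question, the keys-and-aggregation idea might look like the following. This is a hedged sketch, not the answerer's code: it assumes a self-merge on `Attribute` to pair up groups, and the names `pairs`, `common`, `total` and the `_other` suffix are illustrative.

```python
import pandas as pd

# The sample data from the question
df = pd.DataFrame({
    'Group': ['Cheese'] * 3 + ['Cow'] * 5 + ['Yogurt'] * 4,
    'Attribute': ['Dairy', 'Food', 'Curd',
                  'Dairy', 'Food', 'Animal', 'Hair', 'Stomachs',
                  'Dairy', 'Food', 'Curd', 'Fruity'],
})

# Self-merge on Attribute: one row per (group, other group, shared attribute)
pairs = df.merge(df, on='Attribute', suffixes=('', '_other'))
pairs = pairs[pairs['Group'] != pairs['Group_other']]

# Aggregate with the pair of groups as the key: shared attributes per pair
common = (pairs.groupby(['Group', 'Group_other'])
               .size()
               .rename('CommonWords')
               .reset_index())

# Attribute count per group, joined back on the Group key
total = df.groupby('Group')['Attribute'].nunique().rename('TotalCount')
common = common.join(total, on='Group')
common['PCT'] = common['CommonWords'] * 100.0 / common['TotalCount']
print(common)
```

Pairs of groups with no attribute in common never appear in the merge result, so they would be missing from `common` rather than listed with a count of zero.
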
Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/42425273
