Q: Obtain the variables that yield the highest/lowest Pearson correlation while looping over variables

Stack Overflow user

Asked 2020-11-18 10:13:30

1 answer · 88 views · 0 followers · 1 vote

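I am trying to do the following:

I have a DataFrame with many columns that contain metrics, plus a few dimensions such as country, device and name. Each of these three dimensions has a handful of unique values, which I use to filter the data before calling DataFrame.corr().

To demonstrate, I will use the Titanic dataset.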

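import seaborn as sns
df_test = sns.load_dataset('titanic')

# dropna() skips the missing embark_town values; numeric_only=True is needed
# on pandas >= 2.0, where corr() no longer drops non-numeric columns silently.
for who in df_test['who'].dropna().unique():
    for where in df_test['embark_town'].dropna().unique():
        mask = (df_test['who'] == who) & (df_test['embark_town'] == where)
        print(df_test[mask].corr(numeric_only=True))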

This produces df_test['who'].nunique() * df_test['embark_town'].nunique() = 9 different correlation matrices.

Here is one of them:

            survived    pclass       age     sibsp     parch      fare
survived    1.000000 -0.198092  0.062199 -0.046691 -0.071417  0.108706
pclass     -0.198092  1.000000 -0.438377  0.008843 -0.015523 -0.485546
age         0.062199 -0.438377  1.000000 -0.049317  0.077529  0.199062
sibsp      -0.046691  0.008843 -0.049317  1.000000  0.464033  0.358680
parch      -0.071417 -0.015523  0.077529  0.464033  1.000000  0.415207
fare        0.108706 -0.485546  0.199062  0.358680  0.415207  1.000000
adult_male       NaN       NaN       NaN       NaN       NaN       NaN
alone       0.030464  0.133638 -0.022396 -0.629845 -0.506964 -0.411392

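I am trying to get data that answers this question:

For each pair of variables, which filter combination in my setup gives the highest and the lowest correlation? The output can be a list, dict or df, along these lines: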

output = {'highest_corr_survived_p_class':['who = man', 'embark_town = Southampton', 0.65],
         'lowest_corr_survived_p_class':['who = man', 'embark_town = Cherbourg',-0.32],
         'highest_corr_survived_age':['who = female', 'embark_town = Cherbourg',0.75],
         'lowest_corr_survived_age':['who = man', 'embark_town = Cherbourg',-0.3],
         # ...
         'lowest_corr_alone_fare':['who = man', 'embark_town = Cherbourg',-0.7]}

Where I am stuck is finding a good way to build this data and to place it in a df.

What I have tried:

output = {}

for who in df_test['who'].dropna().unique():
    for where in df_test['embark_town'].dropna().unique():
        mask = (df_test['who'] == who) & (df_test['embark_town'] == where)
        # numeric_only=True is required on pandas >= 2.0
        output[f'{who}_{where}_corr'] = df_test[mask].corr(numeric_only=True).loc['survived', 'pclass']

This produces the following output:

{'man_Southampton_corr': -0.19809207465001574,
 'man_Cherbourg_corr': -0.2102998217386208,
 'man_Queenstown_corr': 0.06717166132798494,
 'woman_Southampton_corr': -0.5525868192717193,
 'woman_Cherbourg_corr': -0.5549942419871897,
 'woman_Queenstown_corr': -0.16896381511084563,
 'child_Southampton_corr': -0.5086941796202842,
 'child_Cherbourg_corr': -0.2390457218668788,
 'child_Queenstown_corr': nan}

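This approach does not keep track of which correlation is the max or the min, which is my actual goal.

I am not sure how to use loc[] to cover every possible combination of columns, or whether there is a better or simpler way to put everything into a df.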

python

pandas

correlation

1 Answer

Stack Overflow user

Accepted answer

Answered 2020-11-18 12:13:37

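You can combine DataFrameGroupBy.corr with DataFrame.stack, remove the rows equal to 1 and -1 (the self-correlations on the diagonal), get the index of the maximum and minimum per variable pair with SeriesGroupBy.idxmax and SeriesGroupBy.idxmin, select those rows with Series.loc, join both results with concat, and finally build the dict with a dictionary comprehension.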
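To see the idxmax / loc step in isolation, here is a minimal sketch on a hand-built Series. The labels g1, g2, a, b, c and the values are made up for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical correlations: level "group" is the filter combination,
# levels "var_1"/"var_2" identify the variable pair.
idx = pd.MultiIndex.from_tuples(
    [("g1", "a", "b"), ("g2", "a", "b"), ("g1", "a", "c"), ("g2", "a", "c")],
    names=["group", "var_1", "var_2"],
)
s = pd.Series([0.2, -0.5, 0.9, 0.1], index=idx)

by_pair = s.groupby(level=["var_1", "var_2"])

# idxmax/idxmin return the full index tuple of the extreme row in each pair,
# so selecting with .loc keeps the group label next to the value.
highest = s.loc[by_pair.idxmax().tolist()]
lowest = s.loc[by_pair.idxmin().tolist()]

print(highest)
print(lowest)
```

Because the result keeps the whole index tuple, the group that produced each extreme value can be read straight off the index.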

import pandas as pd
import seaborn as sns
df_test = sns.load_dataset('titanic')
# print (df_test)

# numeric_only=True is required on pandas >= 2.0
s = df_test.groupby(['who','embark_town']).corr(numeric_only=True).stack()
# drop the diagonal (self-correlation) entries
s = s[~s.isin([1, -1])]
s = (pd.concat([s.loc[s.groupby(level=[2,3]).idxmax()], 
                s.loc[s.groupby(level=[2,3]).idxmin()]], keys=('highest','lowest'))
       .sort_index(level=[3,4], sort_remaining=False))
print (s)
         who    embark_town                  
highest  child  Queenstown   age       alone     0.877346
lowest   woman  Queenstown   age       alone    -0.767493
highest  woman  Queenstown   age       fare      0.520461
lowest   child  Queenstown   age       fare     -0.877346
highest  woman  Queenstown   age       parch     0.633627
                                       ...
lowest   woman  Queenstown   survived  parch    -0.433029
highest  man    Queenstown   survived  pclass    0.067172
lowest   woman  Cherbourg    survived  pclass   -0.554994
highest  man    Queenstown   survived  sibsp     0.232685
lowest   child  Southampton  survived  sibsp    -0.692578
Length: 84, dtype: float64

output = {f'{k[0]}_corr_{k[3]}_{k[4]}':
          [f'who = {k[1]}', f'embark_town = {k[2]}',v] for k, v in s.items()}

print(output)
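Since the question also asks how to place the result in a df, the following is one possible sketch of the same idea that ends in a DataFrame and keeps each variable pair only once. It runs on randomly generated toy data so that it does not depend on seaborn; the column names town, x, y, z, var_1, var_2, r and kind are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real frame:
# two grouping columns and three numeric metrics.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "who": rng.choice(["man", "woman"], size=n),
    "town": rng.choice(["A", "B", "C"], size=n),
    "x": rng.normal(size=n),
    "y": rng.normal(size=n),
    "z": rng.normal(size=n),
})

# One correlation matrix per (who, town) group, reshaped to long form:
# one row per group and ordered variable pair.
corr = (df.groupby(["who", "town"])[["x", "y", "z"]]
          .corr()
          .rename_axis(index=["who", "town", "var_1"]))
long = corr.reset_index().melt(
    id_vars=["who", "town", "var_1"], var_name="var_2", value_name="r"
)

# Keep each unordered pair once; this also drops the diagonal.
long = long[long["var_1"] < long["var_2"]].dropna(subset=["r"])

# For every variable pair, pick the group with the highest / lowest r.
by_pair = long.groupby(["var_1", "var_2"])["r"]
highest = long.loc[by_pair.idxmax()].assign(kind="highest")
lowest = long.loc[by_pair.idxmin()].assign(kind="lowest")

result = (pd.concat([highest, lowest])
            .sort_values(["var_1", "var_2", "kind"])
            .reset_index(drop=True))
print(result)
```

Filtering on var_1 < var_2 is a design choice: a correlation matrix is symmetric, so without it every pair would appear twice, once as (x, y) and once as (y, x).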

EDIT: For the top 3 and bottom 3, sort the values and use GroupBy.head and GroupBy.tail:

import pandas as pd
import seaborn as sns
df_test = sns.load_dataset('titanic')
# print (df_test)

# numeric_only=True is required on pandas >= 2.0
s = df_test.groupby(['who','embark_town']).corr(numeric_only=True).stack()
# ascending sort: head(3) is the three lowest, tail(3) the three highest
s = s[~s.isin([1, -1])].sort_values()

s = (pd.concat([s.groupby(level=[2,3]).head(3), 
                s.groupby(level=[2,3]).tail(3)], keys=('lowest','highest'))
        .sort_index(level=[3,4], sort_remaining=False)
        )
print (s)
         who    embark_town                 
lowest   woman  Queenstown   age       alone   -0.767493
                Cherbourg    age       alone   -0.073881
         man    Queenstown   age       alone   -0.069001
highest  child  Southampton  age       alone    0.169244
                Cherbourg    age       alone    0.361780
                                       ...
lowest   woman  Southampton  survived  sibsp   -0.252524
         man    Southampton  survived  sibsp   -0.046691
highest  man    Cherbourg    survived  sibsp    0.125276
         woman  Queenstown   survived  sibsp    0.143025
         man    Queenstown   survived  sibsp    0.232685
Length: 252, dtype: float64
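As a cross-check of which end of a sorted frame head and tail pick up, here is a small sketch on made-up numbers. The column names group, pair, corr and kind, and all values, are hypothetical. It sorts in descending order so that head(n) corresponds to the highest rows.

```python
import pandas as pd

# Hypothetical long-form correlations: one row per (group, variable pair).
long = pd.DataFrame({
    "group": ["g1", "g2", "g3", "g4", "g1", "g2", "g3", "g4"],
    "pair":  ["a~b"] * 4 + ["a~c"] * 4,
    "corr":  [0.2, -0.5, 0.7, 0.1, 0.9, 0.3, -0.1, -0.8],
})

n = 2
ordered = long.sort_values("corr", ascending=False)

# After a descending sort, head(n) holds the n highest rows of each pair
# and tail(n) the n lowest.
top = ordered.groupby("pair").head(n).assign(kind="highest")
bottom = ordered.groupby("pair").tail(n).assign(kind="lowest")

result = pd.concat([top, bottom]).sort_values(["pair", "kind"])
print(result)
```

With an ascending sort the roles are reversed: head(n) returns the lowest rows and tail(n) the highest. If a pair has fewer than 2 * n groups, the same row can appear under both labels.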

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/64891010
