
Removing highly correlated columns from a DataFrame

Stack Overflow user
Asked 2018-03-14 15:42:36
2 answers, 11.1K views, 0 followers, 3 votes

I have a DataFrame like this:

import numpy as np
import pandas as pd

dict_ = {'Date': ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05'],
         'Col1': [1, 2, 3, 4, 5],
         'Col2': [1.1, 1.2, 1.3, 1.4, 1.5],
         'Col3': [0.33, 0.98, 1.54, 0.01, 0.99]}
df = pd.DataFrame(dict_, columns=dict_.keys())

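I then compute the Pearson correlation between the columns and filter out every column whose correlation with another column is above my threshold of 0.95.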

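def trimm_correlated(df_in, threshold):
    # Correlate numeric columns only; recent pandas versions raise on string columns such as Date.
    df_corr = df_in.select_dtypes(include='number').corr(method='pearson', min_periods=1)
    # Mask the diagonal (self-correlation), then flag columns with no |corr| above the threshold.
    df_not_correlated = ~(df_corr.mask(np.eye(len(df_corr), dtype=bool)).abs() > threshold).any()
    un_corr_idx = df_not_correlated[df_not_correlated].index
    df_out = df_in[un_corr_idx]
    return df_out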

This yields:

uncorrelated_factors = trimm_correlated(df, 0.95)
print(uncorrelated_factors)

    Col3
0   0.33
1   0.98
2   1.54
3   0.01
4   0.99

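So far I am happy with the result, but I would like to keep one column from each correlated pair. In the example above, that means including either Col1 or Col2, to get something like this: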

    Col1   Col3
0    1     0.33
1    2     0.98
2    3     1.54
3    4     0.01
4    5     0.99

As a side note, is there any further evaluation I could do to decide which of the correlated columns to keep?

Thanks


2 Answers

Stack Overflow user

Accepted answer

Posted 2018-03-14 16:18:53

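You can use np.tril() instead of np.eye() for the mask. With the lower triangle and the diagonal masked out, each column is only compared against the columns that come before it, so the first column of every correlated pair is kept: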

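def trimm_correlated(df_in, threshold):
    # Correlate numeric columns only; recent pandas versions raise on string columns such as Date.
    df_corr = df_in.select_dtypes(include='number').corr(method='pearson', min_periods=1)
    # Mask the lower triangle and the diagonal, leaving only entries (i, j) with i < j.
    lower = np.tril(np.ones([len(df_corr)] * 2, dtype=bool))
    df_not_correlated = ~(df_corr.mask(lower).abs() > threshold).any()
    un_corr_idx = df_not_correlated[df_not_correlated].index
    df_out = df_in[un_corr_idx]
    return df_out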

Output:

    Col1    Col3
0   1       0.33
1   2       0.98
2   3       1.54
3   4       0.01
4   5       0.99
9 votes
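
The accepted answer drops the second column of each correlated pair, but it does not report which pairs triggered the drop. A minimal sketch of the same upper-triangle idea that also returns that mapping is shown here. The function name `drop_correlated` and the `dropped` dictionary are hypothetical, and unlike the answer's function this version keeps non-numeric columns such as Date in the result.

```python
import numpy as np
import pandas as pd

def drop_correlated(df_in, threshold=0.95):
    """Drop each numeric column that is highly correlated with an earlier one.

    Hypothetical helper: returns the trimmed frame plus a dict mapping every
    dropped column to the earlier columns it is correlated with.
    """
    corr = df_in.select_dtypes(include="number").corr(method="pearson").abs()
    # Keep only the strict upper triangle, i.e. entries (i, j) with i < j.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    dropped = {}
    for col in upper.columns:
        # NaN (masked) entries compare as False, so only earlier columns match.
        partners = upper[col][upper[col] > threshold].index.tolist()
        if partners:
            dropped[col] = partners
    return df_in.drop(columns=list(dropped)), dropped

df = pd.DataFrame({
    "Date": ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05"],
    "Col1": [1, 2, 3, 4, 5],
    "Col2": [1.1, 1.2, 1.3, 1.4, 1.5],
    "Col3": [0.33, 0.98, 1.54, 0.01, 0.99],
})
kept, dropped = drop_correlated(df, 0.95)
print(kept.columns.tolist())
print(dropped)
```

Because only earlier columns are checked, the column order decides which member of a pair survives; reordering the columns before the call is one way to express a preference.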

Stack Overflow user

Posted 2018-11-08 19:31:20

Use this directly on a DataFrame to list its highest correlation values in sorted order.

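import pandas as pd
import numpy as np

def correl(X_train):
    cor = X_train.corr()
    # Same correlation matrix as a NumPy array, with the diagonal zeroed out.
    corrm = np.corrcoef(X_train.transpose())
    corr = corrm - np.diagflat(corrm.diagonal())
    print("max corr:", corr.max(), ", min corr: ", corr.min())
    # Flatten to (column, column) pairs, sort, and drop repeated values.
    c1 = cor.stack().sort_values(ascending=False).drop_duplicates()
    # Remove the self-correlations, which are exactly 1.
    high_cor = c1[c1.values != 1]
    ## change this value to get more correlation results
    thresh = 0.9
    # print() is used so the function also works outside a notebook.
    print(high_cor[high_cor > thresh])

# X is the answerer's own numeric DataFrame; it is not shown in the answer.
correl(X)

Output: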

max corr: 0.9821068918331252 , min corr:  -0.2993837739125243 

object at 0x0000017712D504E0>
count_rech_2g_8   sachet_2g_8         0.982107
count_rech_2g_7   sachet_2g_7         0.979492
count_rech_2g_6   sachet_2g_6         0.975892
arpu_8            total_rech_amt_8    0.946617
arpu_3g_8         arpu_2g_8           0.942428
isd_og_mou_8      isd_og_mou_7        0.938388
arpu_2g_6         arpu_3g_6           0.933158
isd_og_mou_6      isd_og_mou_8        0.931683
arpu_3g_7         arpu_2g_7           0.930460
total_rech_amt_6  arpu_6              0.930103
isd_og_mou_7      isd_og_mou_6        0.926571
arpu_7            total_rech_amt_7    0.926111
dtype: float64
0 votes
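
One caveat with the snippet above: `drop_duplicates()` works on the correlation values, so two different pairs with an identical coefficient collapse into one, and the `!= 1` filter also discards pairs that are perfectly correlated. A possible alternative, sketched here under the assumption that listing each pair exactly once is the goal, masks the matrix to its strict upper triangle before stacking. The function name `correlated_pairs` is hypothetical.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df_in, threshold=0.9):
    """List each pair of numeric columns once, keeping those above the threshold."""
    corr = df_in.select_dtypes(include="number").corr()
    # Strict upper triangle: every unordered pair appears once, no diagonal.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # Masked entries are NaN and compare as False, so they never pass the filter.
    pairs = pairs[pairs.abs() > threshold]
    return pairs.sort_values(ascending=False)

# Sample data taken from the question.
df = pd.DataFrame({
    "Col1": [1, 2, 3, 4, 5],
    "Col2": [1.1, 1.2, 1.3, 1.4, 1.5],
    "Col3": [0.33, 0.98, 1.54, 0.01, 0.99],
})
pairs = correlated_pairs(df, 0.9)
print(pairs)
```

On the question's sample data this reports only the (Col1, Col2) pair, which the value-based filter would have removed because its coefficient is 1 up to floating-point rounding.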
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/49282049
