I have a dataset with 6 columns, and I had pandas compute its correlation matrix, with the following result:
               age  earnings    height     hours  siblings    weight
age       1.000000  0.026032  0.040002  0.024118  0.155894  0.048655
earnings  0.026032  1.000000  0.276373  0.224283  0.126651  0.092299
height    0.040002  0.276373  1.000000  0.235616  0.077551  0.572538
hours     0.024118  0.224283  0.235616  1.000000  0.067797  0.143160
siblings  0.155894  0.126651  0.077551  0.067797  1.000000  0.018367
weight    0.048655  0.092299  0.572538  0.143160  0.018367  1.000000

How can I get the combinations of distinct columns whose correlation is greater than 0.5, for example? In this case, the output should look like this:
[('height', 'weight')]

I tried implementing this with for loops, but I don't think it is the right or most efficient approach:
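# columns is assumed to hold the column names of df, e.g. df.columns
correlated = []
for column1 in columns:
    for column2 in columns:
        if column1 != column2:
            correlation = df[column1].corr(df[column2])
            # Skip a pair if its mirror image was already recorded
            if correlation > 0.5 and (column2, column1) not in correlated:
                correlated.append((column1, column2))

Here df is my original data. This outputs the expected result: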
[(u'height', u'weight')]

Posted on 2014-10-20 11:15:58
How about the following, which uses numpy and assumes you already have the correlation matrix in df?
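import numpy as np

# Row and column positions of every entry above the threshold
indices = np.where(df > 0.5)
# x < y keeps only the upper triangle, which drops the diagonal
# and the mirrored duplicate of each pair
indices = [(df.index[x], df.columns[y]) for x, y in zip(*indices)
           if x < y]

This results in indices containing: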
[('height', 'weight')]

https://stackoverflow.com/questions/26463714
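For anyone who wants to try the answer's approach end to end, here is a minimal self-contained sketch. It rebuilds the correlation matrix from the question by hand and selects the upper triangle with np.triu instead of comparing x and y. The function name correlated_pairs and its threshold parameter are made up for this illustration and are not part of pandas or numpy.

```python
import numpy as np
import pandas as pd


def correlated_pairs(corr, threshold=0.5):
    """Return (row, column) label pairs whose correlation exceeds threshold.

    corr is assumed to be a square, symmetric correlation matrix, such as
    the one returned by DataFrame.corr().
    """
    # True strictly above the diagonal, so each pair is seen once and
    # self-correlations are skipped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    rows, cols = np.where((corr.to_numpy() > threshold) & upper)
    return [(corr.index[r], corr.columns[c]) for r, c in zip(rows, cols)]


# Correlation matrix copied from the question, for illustration only
labels = ["age", "earnings", "height", "hours", "siblings", "weight"]
values = [
    [1.000000, 0.026032, 0.040002, 0.024118, 0.155894, 0.048655],
    [0.026032, 1.000000, 0.276373, 0.224283, 0.126651, 0.092299],
    [0.040002, 0.276373, 1.000000, 0.235616, 0.077551, 0.572538],
    [0.024118, 0.224283, 0.235616, 1.000000, 0.067797, 0.143160],
    [0.155894, 0.126651, 0.077551, 0.067797, 1.000000, 0.018367],
    [0.048655, 0.092299, 0.572538, 0.143160, 0.018367, 1.000000],
]
corr = pd.DataFrame(values, index=labels, columns=labels)

pairs = correlated_pairs(corr)
print(pairs)  # [('height', 'weight')]
```

If you start from the raw data rather than the matrix, calling df.corr() once and passing the result in avoids recomputing each pairwise correlation inside nested loops.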