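I read data from Excel into a Pandas DataFrame, so that each column represents a different variable and each row a different sample. I use the function below to identify potentially highly correlated variables in the DataFrame; what counts as "highly correlated" is determined by the threshold argument passed when the function is called.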

import pandas as pd

def find_highly_correlated_variables(dataframe, threshold):
    '''
    Parameters
    ----------
    dataframe : pandas.DataFrame
    threshold : float, representing minimal absolute value for correlation between variables to be selected

    Output
    ------
    string : reading how no highly correlated variables have been found, if none have been found
    list : containing pair(s) of highly correlated variables if one or more have been found
    '''
    # Initialization of variables and lists to work with.
    df = dataframe
    th = threshold
    column_names = list(df.columns.values)
    highly_correlated_indices = []
    highly_correlated_variables = []

    # Correlation matrix is created, so that correlation values can be accessed easily.
    correlation_array = df.corr().values.tolist()

    for i_column, column in enumerate(correlation_array):
        for i_element, element in enumerate(column):
            if (abs(element) >= th) & (abs(element) != 1.0):
                # Prevent duplicate information from being added.
                if [i_column, i_element] not in highly_correlated_indices:
                    highly_correlated_indices.append([i_element, i_column])

    # 'Translate' element and column indices into the variable names.
    for indices in highly_correlated_indices:
        highly_correlated_variables.append([column_names[indices[0]], column_names[indices[1]]])

    if len(highly_correlated_indices) == 0:
        print("No highly correlated variables found.")
    else:
        return highly_correlated_variables

I know that nested for loops are not ideal in terms of time complexity, so I tried to get around them with the 'zip' function, along the lines of: for index, (column, element) in enumerate(zip(correlation_array, column)), but I have not managed to get such a solution working.
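For that reason I am curious whether this part of the code can be improved so that it runs faster than it does now.
Of course, I am happy to hear other suggestions for improvement as well (for example, parts that could be more compact), so please do not hesitate to share them.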
Answered on 2023-06-01 23:03:34
Nested interpreted loops
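"nested for loops are not ideal in terms of time complexity."
You mean "elapsed time". Time complexity describes how the work grows with the input size, and any approach has to look at every entry of the correlation matrix.
But profiling will reveal that, as you observed, the time spent interpreting bytecode tends to dominate the time spent in numpy's compiled C code.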
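A rough way to see the interpreter overhead for yourself is to time the same comparison done both ways. This is only a sketch: the 200 x 200 random matrix stands in for a correlation matrix, and the helper names count_interpreted and count_vectorized are made up for illustration.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
m = rng.random((200, 200))  # arbitrary stand-in for a correlation matrix
th = 0.5


def count_interpreted(matrix, threshold):
    # Every comparison and increment runs as Python bytecode.
    total = 0
    for row in matrix.tolist():
        for element in row:
            if element >= threshold:
                total += 1
    return total


def count_vectorized(matrix, threshold):
    # One comparison and one sum, both executed in numpy's compiled loops.
    return int((matrix >= threshold).sum())


print("interpreted:", timeit.timeit(lambda: count_interpreted(m, th), number=20))
print("vectorized: ", timeit.timeit(lambda: count_vectorized(m, th), number=20))
```

The absolute numbers depend on the machine, so compare the two timings with each other and do not read much into either value alone.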
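
highly_correlated_indices = []
...
for i_column, column in ...:
    for i_element, element in ...:
        if (abs(element) >= th) & (abs(element) != 1.0):
            if [i_column, i_element] not in highly_correlated_indices:
                highly_correlated_indices.append(...)

At first glance this looks quadratic in the number of columns. But it is worse than that: each in test against a list is a linear scan over the pairs collected so far, so the cost also grows with the number of correlated pairs. What you want is not a list but a set, so that the in test completes in O(1) constant time.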
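As a minimal sketch of that change, the loop below keeps the original logic but records the index pairs it has seen in a set of tuples (lists are not hashable, so they cannot be set members). The function name find_pairs_with_set and the sample data are invented for illustration.

```python
import pandas as pd


def find_pairs_with_set(dataframe, threshold):
    """Loop-based variant that tracks seen index pairs in a set (hypothetical helper)."""
    corr = dataframe.corr().values.tolist()
    names = list(dataframe.columns)
    seen = set()   # membership tests are O(1) on average
    pairs = []     # preserves the order in which pairs are discovered
    for i_column, column in enumerate(corr):
        for i_element, element in enumerate(column):
            if abs(element) >= threshold and abs(element) != 1.0:
                # Same trick as the original: store the mirrored pair,
                # so the mirrored cell is skipped when it comes up later.
                if (i_column, i_element) not in seen:
                    seen.add((i_element, i_column))
                    pairs.append([names[i_element], names[i_column]])
    return pairs


# Made-up sample data: b is roughly 2 * a, c is only weakly related to both.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.9],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
print(find_pairs_with_set(df, 0.9))
```

This only removes the list scan; the two interpreted loops over the matrix are still there.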
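You are right that, rather than having the interpreter examine one value at a time, it is better to perform a vectorized, broadcast operation over the whole matrix. Here is one approach: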

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((8, 3)))
c = df.corr().abs()
th = .4
z = c[(c != 1) & (c > th)]

>>> np.round(z, 2)
    0     1     2
0 NaN   NaN   NaN
1 NaN   NaN  0.44
2 NaN  0.44   NaN

I found it convenient to round to two places, but obviously you don't have to do that.
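To go from such a mask to a list of variable pairs without a Python-level double loop, one option is to keep only the strict upper triangle and ask numpy for the positions that remain. This is a sketch, not the code from the question: the helper name correlated_pairs is made up, and it uses the strict > comparison from the snippet above, whereas the question's function uses >=.

```python
import numpy as np
import pandas as pd


def correlated_pairs(dataframe, threshold):
    """Return (name_a, name_b, |r|) for each pair with |r| > threshold (hypothetical helper)."""
    c = dataframe.corr().abs()
    # Strict upper triangle: drops the diagonal and the mirrored duplicates.
    upper = np.triu(np.ones(c.shape, dtype=bool), k=1)
    mask = upper & (c.to_numpy() > threshold)
    rows, cols = np.nonzero(mask)
    names = c.columns
    return [(names[i], names[j], float(c.iat[i, j])) for i, j in zip(rows, cols)]


rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((8, 3)))
print(correlated_pairs(df, 0.4))
```

Because the diagonal is excluded by position, this version does not need the comparison against 1, so a genuinely perfect correlation between two different variables is still reported.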
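At this point you can easily iterate over the positive columns, that is, the columns whose sum is greater than zero:

>>> np.round(z.sum(axis=0), 2)
0    0.00
1    0.44
2    0.44

https://codereview.stackexchange.com/questions/285293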