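As the title says, I'm trying to find a way to assess multicollinearity in PySpark. Normally I would use the VIF from statsmodels, but I can't see an equivalent function in PySpark.
Any suggestions on how I can calculate multicollinearity would be greatly appreciated.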
Posted on 2018-04-09 13:14:58
You can compute the correlation matrix:
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.stat import Statistics
# Reuse the active SparkContext (e.g. in the pyspark shell) or create one.
sc = SparkContext.getOrCreate()
seriesX = sc.parallelize([1.0, 2.0, 3.0, 3.0, 5.0])  # a series
# seriesY must have the same number of partitions and cardinality as seriesX
seriesY = sc.parallelize([11.0, 22.0, 33.0, 33.0, 555.0])
# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method.
# If a method is not specified, Pearson's method will be used by default.
print("Correlation is: " + str(Statistics.corr(seriesX, seriesY, method="pearson")))
data = sc.parallelize(
    [np.array([1.0, 10.0, 100.0]), np.array([2.0, 20.0, 200.0]), np.array([5.0, 33.0, 366.0])]
)  # an RDD of Vectors
# Calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
# If a method is not specified, Pearson's method will be used by default.
print(Statistics.corr(data, method="pearson"))
Documentation: https://spark.apache.org/docs/latest/mllib-statistics.html
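A correlation matrix can also be turned into VIF values directly, since the VIF of feature j equals the j-th diagonal entry of the inverse of the feature correlation matrix. The following is only a minimal sketch, not a built-in PySpark function: `vif_from_corr` is a hypothetical helper name, the matrix values are made up for illustration, and it assumes the matrix fits on the driver as a NumPy array (as returned by `Statistics.corr`) and is not singular.

```python
import numpy as np


def vif_from_corr(corr):
    """Return one VIF per feature from a Pearson correlation matrix.

    Hypothetical helper: uses the identity VIF_j = (R^-1)_jj.
    Raises numpy.linalg.LinAlgError if the matrix is exactly singular
    (perfectly collinear features).
    """
    corr = np.asarray(corr, dtype=float)
    return np.diag(np.linalg.inv(corr))


# Illustrative correlation matrix (made-up values); in practice this
# would be the result of Statistics.corr(data, method="pearson").
corr = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

vif = vif_from_corr(corr)
for j, value in enumerate(vif):
    print("feature %d: VIF = %.2f" % (j, value))
```

Note that with very few rows (fewer observations than features plus one) the correlation matrix is singular, so the inverse and therefore the VIFs are not defined. A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity.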
https://stackoverflow.com/questions/49723626