Q: Implementation of Principal Component Analysis from scratch orients the data differently from scikit-learn

Stack Overflow user

Asked 2021-06-11 06:23:30

Answers: 2, Views: 513, Followers: 0, Votes: 1

Based on Sebastian Raschka's guide to implementing PCA in Python, I am building the PCA algorithm from scratch for my research. The class definition is:

import numpy as np

class PCA(object):
    """Dimension Reduction using Principal Component Analysis (PCA)

    It is the process of computing the principal components that explain the
    maximum variation of the dataset using fewer components.

    :type  n_components: int, optional
    :param n_components: Number of components to consider, if not set then
                         `n_components = min(n_samples, n_features)`, where
                         `n_samples` is the number of samples, and
                         `n_features` is the number of features (i.e.,
                         dimension of the dataset).

    Attributes
    ==========
        :type  covariance_: np.ndarray
        :param covariance_: Covariance Matrix

        :type  eig_vals_: np.ndarray
        :param eig_vals_: Calculated Eigen Values

        :type  eig_vecs_: np.ndarray
        :param eig_vecs_: Calculated Eigen Vectors

        :type  explained_variance_: np.ndarray
        :param explained_variance_: Explained Variance of Each Principal Component

        :type  cum_explained_variance_: np.ndarray
        :param cum_explained_variance_: Cumulative Explained Variance
    """

    def __init__(self, n_components : int = None):
        """Default Constructor for Initialization"""

        self.n_components = n_components

    def fit_transform(self, X : np.ndarray):
        """Fit the PCA algorithm into the Dataset"""

        if not self.n_components:
            self.n_components = min(X.shape)

        self.covariance_ = np.cov(X.T)

        # calculate eigens
        self.eig_vals_, self.eig_vecs_ = np.linalg.eig(self.covariance_)

        # explained variance
        _tot_eig_vals = sum(self.eig_vals_)
        self.explained_variance_ = np.array([(i / _tot_eig_vals) * 100 for i in sorted(self.eig_vals_, reverse = True)])
        self.cum_explained_variance_ = np.cumsum(self.explained_variance_)

        # define `W` as `d x k`-dimension
        self.W_ = self.eig_vecs_[:, :self.n_components]

        print(X.shape, self.W_.shape)
        return X.dot(self.W_)

Using the iris dataset as a test case, I ran the PCA and visualized it as follows:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# loading iris data, and normalize
from sklearn.datasets import load_iris
iris = load_iris()

from sklearn.preprocessing import MinMaxScaler
X, y = iris.data, iris.target
X = MinMaxScaler().fit_transform(X)

# using the PCA function (defined above)
# to fit_transform the X value
# naming the PCA object as dPCA (d = defined)
dPCA = PCA()
principalComponents = dPCA.fit_transform(X)

# creating a pandas dataframe for the principal components
# and visualize the data using scatter plot
PCAResult = pd.DataFrame(principalComponents, columns = [f"PCA-{i}" for i in range(1, dPCA.n_components + 1)])
PCAResult["target"] = y # possible as original order does not change

sns.scatterplot(x = "PCA-1", y = "PCA-2", data = PCAResult, hue = "target", s = 50)
plt.show()

The output is as follows:

Now I want to verify the output, so I used the sklearn library, and the output is as follows:

from sklearn.decomposition import PCA # note the same name
sPCA = PCA() # consider all the components

principalComponents_ = sPCA.fit_transform(X)
PCAResult_ = pd.DataFrame(principalComponents_, columns = [f"PCA-{i}" for i in range(1, 5)])
PCAResult_["target"] = y # possible as original order does not change

sns.scatterplot(x = "PCA-1", y = "PCA-2", data = PCAResult_, hue = "target", s = 50)
plt.show()

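I don't understand why the output is oriented in a different direction, with slightly different values. I studied several implementations [1, 2, 3], and all of them have the same issue. My questions:

What is different in sklearn that makes the plot different? I have also tried a different dataset, with the same problem.
Is there a way to fix this issue?

I was not able to study the sklearn.decomposition.PCA algorithm, as I am new to OOP concepts in Python.

The output in Sebastian Raschka's blog post also shows a minor variation. Figure below: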

Tags: python, algorithm, numpy, scikit-learn, pca

2 Answers

Stack Overflow user

Accepted answer

Posted 2021-06-11 12:52:14

When calculating an eigenvector, you may flip its sign and the solution is still valid.

So any PCA axis can be reversed and the solution remains valid.
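The point about signs can be checked numerically. The sketch below uses a small made-up symmetric matrix as a stand-in for a covariance matrix (the values are an assumption chosen for illustration) and confirms that both `v` and `-v` satisfy the eigenvector equation.

```python
import numpy as np

# Small symmetric matrix standing in for a covariance matrix (made-up values).
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order.
eig_vals, eig_vecs = np.linalg.eigh(C)

v = eig_vecs[:, -1]   # eigenvector belonging to the largest eigenvalue
lam = eig_vals[-1]

# Both v and -v satisfy C @ v = lam * v, so either sign is a valid answer.
v_ok = np.allclose(C @ v, lam * v)
neg_v_ok = np.allclose(C @ (-v), lam * (-v))
print(v_ok, neg_v_ok)
```

Which of the two signs a solver returns is an implementation detail, so two correct PCA implementations can produce mirrored plots.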

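Nevertheless, you may wish to impose a positive correlation between a PCA axis and one of the original variables in the dataset, inverting the axis if needed.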
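One possible way to do this is sketched below. The helper `orient_components` and its `ref_feature` argument are hypothetical names introduced here for illustration, and the data is synthetic; the idea is only to flip every axis whose scores are negatively correlated with a chosen original feature.

```python
import numpy as np

def orient_components(X, scores, W, ref_feature=0):
    """Flip each PCA axis so that its scores correlate non-negatively with
    one original feature. `ref_feature` is an illustrative choice."""
    Xc = X - X.mean(axis=0)
    Sc = scores - scores.mean(axis=0)
    # The sign of the covariance equals the sign of the correlation.
    cov = Xc[:, ref_feature] @ Sc            # one value per component
    signs = np.where(cov < 0, -1.0, 1.0)
    return scores * signs, W * signs

# Synthetic correlated data (an assumption, not the iris set from the question).
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.5, 0.1],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.5]])
X = rng.normal(size=(50, 3)) @ mix
Xc = X - X.mean(axis=0)

_, W = np.linalg.eigh(np.cov(Xc.T))
W = W[:, ::-1]                               # largest eigenvalue first
scores = Xc @ W

scores_o, W_o = orient_components(X, scores, W)
print(Xc[:, 0] @ scores_o)                   # every entry is now >= 0
```

The scores and the projection matrix are flipped together, so `Xc @ W_o` still reproduces `scores_o`.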

Votes: 2

Stack Overflow user

Posted 2021-06-11 14:09:44

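The different values come from how PCA is implemented in sklearn, which uses an SVD decomposition. sklearn has a function, svd_flip, that flips the signs of the PCs, which explains the flip you are seeing.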

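More details from the help page:

It uses the LAPACK implementation of the full SVD or a randomized truncated SVD by the method of Halko et al. 2009, depending on the shape of the input data and the number of components to extract.

You can read about the relationship between SVD and PCA here.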
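As a rough illustration of that relationship, the sketch below runs both routes on synthetic data (the data and variable names are assumptions, not part of the original answer). It checks that the covariance eigenvalues equal the squared singular values divided by `n - 1`, and that the axes and projected scores agree up to sign.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data with clearly different variances per feature.
X = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix, sorted descending.
eig_vals, eig_vecs = np.linalg.eigh(np.cov(Xc.T))
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Route 2: SVD of the centered data (singular values are already descending).
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

same_variances = np.allclose(eig_vals, S**2 / (n - 1))
same_axes_up_to_sign = np.allclose(np.abs(eig_vecs), np.abs(Vt.T))
same_scores_up_to_sign = np.allclose(np.abs(Xc @ eig_vecs), np.abs(U * S))
print(same_variances, same_axes_up_to_sign, same_scores_up_to_sign)
```

The comparison uses absolute values because each route is free to return either sign for every axis.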

We first run your example dataset:

from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA 
from sklearn.datasets import load_iris
from sklearn.utils.extmath import svd_flip
import pandas as pd
import numpy as np
import scipy.linalg

iris = load_iris()

X, y = iris.data, iris.target
X = MinMaxScaler().fit_transform(X)

n_components = 4

sPCA = PCA(n_components,svd_solver="full")
sklearnPCs = pd.DataFrame(sPCA.fit_transform(X))

We now perform SVD on your centered matrix:

U,S,Vt = scipy.linalg.svd(X - X.mean(axis=0))
U = U[:,:n_components]
U, Vt = svd_flip(U, Vt)

svdPCs =  pd.DataFrame(U*S)

Results:

sklearnPCs
            0         1         2         3
0   -0.630703  0.107578 -0.018719 -0.007307
1   -0.622905 -0.104260 -0.049142 -0.032359
2   -0.669520 -0.051417  0.019644 -0.007434
3   -0.654153 -0.102885  0.023219  0.020114
4   -0.648788  0.133488  0.015116  0.011786
..        ...       ...       ...       ...
145  0.551462  0.059841  0.086283 -0.110092
146  0.407146 -0.171821 -0.004102 -0.065241
147  0.447143  0.037560  0.049546 -0.032743
148  0.488208  0.149678  0.239209  0.002864
149  0.312066 -0.031130  0.118672  0.052505


svdPCs 
            0         1         2         3
0   -0.630703  0.107578 -0.018719 -0.007307
1   -0.622905 -0.104260 -0.049142 -0.032359
2   -0.669520 -0.051417  0.019644 -0.007434
3   -0.654153 -0.102885  0.023219  0.020114
4   -0.648788  0.133488  0.015116  0.011786
..        ...       ...       ...       ...
145  0.551462  0.059841  0.086283 -0.110092
146  0.407146 -0.171821 -0.004102 -0.065241
147  0.447143  0.037560  0.049546 -0.032743
148  0.488208  0.149678  0.239209  0.002864
149  0.312066 -0.031130  0.118672  0.052505

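You can implement it without the flip. The values will be the same up to the sign of each component, and your PCA will be valid, as pointed out in the other answer.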
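If matching signs between implementations matters, one option is to apply the same sign convention after either route. The sketch below is a NumPy-only illustration on synthetic data. `fix_signs`, `pca_eig` and `pca_svd` are hypothetical helpers, and the convention (largest-magnitude loading of each axis made positive) is an assumption that is similar in spirit to svd_flip but not necessarily identical to it. The sketch also centers the data and sorts the eigenvectors by eigenvalue, two steps the class in the question skips.

```python
import numpy as np

def fix_signs(scores, W):
    """Illustrative convention (an assumption, not sklearn's exact rule):
    make the largest-magnitude loading of every axis positive."""
    rows = np.argmax(np.abs(W), axis=0)
    signs = np.sign(W[rows, np.arange(W.shape[1])])
    return scores * signs, W * signs

def pca_eig(X, n_components=None):
    """PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                              # center the data first
    eig_vals, eig_vecs = np.linalg.eigh(np.cov(Xc.T))
    order = np.argsort(eig_vals)[::-1][:n_components]    # largest variance first
    W = eig_vecs[:, order]
    return fix_signs(Xc @ W, W)

def pca_svd(X, n_components=None):
    """PCA via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T
    return fix_signs(Xc @ W, W)

# Synthetic data with a non-zero mean (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4)) * np.array([4.0, 2.0, 1.0, 0.5]) + 5.0

scores_eig, W_eig = pca_eig(X, n_components=2)
scores_svd, W_svd = pca_svd(X, n_components=2)
routes_agree = np.allclose(scores_eig, scores_svd)

# Projecting the uncentered data only shifts every row by the same vector.
offset = X @ W_eig - scores_eig
constant_shift = np.allclose(offset, offset[0])
print(routes_agree, constant_shift)
```

The last check suggests one likely source of the small differences in values: projecting `X` without subtracting the column means moves every point by the same constant offset along each axis.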

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/67932137
