Q: Python clustering "purity" metric

Stack Overflow user

Asked 2015-12-03 00:14:17

3 answers, 20.5K views, 0 followers, 11 votes

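I am using a Gaussian Mixture Model (GMM) from sklearn.mixture to cluster my data set.

I can use the function score() to compute the log probability under the model.

However, I am looking for a metric called "purity", which is defined in this article.

How can I implement it in Python? My current implementation looks like this: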

import numpy as np
# GMM was removed in scikit-learn 0.20; GaussianMixture is its replacement.
from sklearn.mixture import GaussianMixture

# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)

clusterer = GaussianMixture(n_components=3, covariance_type='diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)

# Now I can count the labels for each cluster..
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)

But I cannot loop over each cluster to compute the confusion matrix (as in this question).

python

scikit-learn

cluster-analysis

3 Answers

Stack Overflow user

Answered 2015-12-03 00:29:02

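sklearn does not implement a cluster purity metric. You have two options:

1. Implement the measure yourself using sklearn data structures. This and this have some Python source for measuring purity, but either your data or the function bodies need to be adapted so that they are compatible with each other.
2. Use the (much less mature) PML library, which does implement cluster purity.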
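
For the first option, one possible sketch (not taken from the linked sources) builds the contingency table with sklearn.metrics.cluster.contingency_matrix and credits each cluster with its most frequent true label. The function name purity_from_contingency and the toy labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def purity_from_contingency(y_true, y_pred):
    # Rows are true classes, columns are predicted clusters.
    cm = contingency_matrix(y_true, y_pred)
    # Each cluster contributes the count of its most frequent true class.
    return cm.max(axis=0).sum() / cm.sum()

# Toy labels (illustrative only): cluster 0 is pure,
# cluster 1 holds one stray sample of class 0.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1])
print(purity_from_contingency(y_true, y_pred))  # 5 of 6 samples match their cluster's majority
```

In the question's setting, y_pred would be cluster_labels, and y_true would be the known MNIST digit labels for the same samples.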

Votes: 5

Stack Overflow user

Answered 2017-07-06 08:21:37

A very late contribution.

You can try to implement it like this, much as in this gist:

import numpy as np
from sklearn.metrics import accuracy_score

def purity_score(y_true, y_pred):
    """Purity score
        Args:
            y_true(np.ndarray): n*1 matrix Ground truth labels
            y_pred(np.ndarray): n*1 matrix Predicted clusters

        Returns:
            float: Purity score
    """
    # Flatten to 1-D so boolean masks and accuracy_score see matching shapes
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # Ordering labels
    ## Labels might be missing, e.g. with a set like 0,2 where 1 is missing
    ## Map the unique labels to an ordered set: 0,2 should become 0,1
    ## return_inverse builds the remapped copy without modifying the caller's array
    labels, y_true = np.unique(y_true, return_inverse=True)
    # array which will hold the majority-voted labels
    y_voted_labels = np.zeros(y_true.shape, dtype=y_true.dtype)
    # We use n_classes+1 bin edges so that each class is counted in its own
    # bin between two consecutive edges, the bigger being excluded
    # [bin_i, bin_i+1[ (NumPy closes only the last bin)
    bins = np.arange(labels.shape[0] + 1)

    for cluster in np.unique(y_pred):
        hist, _ = np.histogram(y_true[y_pred == cluster], bins=bins)
        # Find the most present label in the cluster
        winner = np.argmax(hist)
        y_voted_labels[y_pred == cluster] = winner

    return accuracy_score(y_true, y_voted_labels)

Votes: 4

Stack Overflow user

Answered 2019-07-13 03:31:37

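The currently top voted answer implements the purity metric correctly, but purity may not be the most appropriate metric in every case, because it does not ensure that each predicted cluster label is assigned to a true label only once.

For example, consider a very imbalanced data set with 99 examples of one label and 1 example of another label. Any clustering (e.g. two equal clusters of size 50) will then achieve a purity of at least 0.99, which makes the metric useless.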
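
The arithmetic can be checked with a small NumPy-only sketch. The helper name purity and the particular split into clusters are assumptions made for illustration, and the helper assumes non-negative integer labels.

```python
import numpy as np

def purity(y_true, y_pred):
    # Sum, over clusters, the size of the largest true class inside the cluster.
    total = 0
    for cluster in np.unique(y_pred):
        members = y_true[y_pred == cluster]
        # bincount assumes non-negative integer labels
        total += np.bincount(members).max()
    return total / len(y_true)

# 99 samples of label 0 and a single sample of label 1.
y_true = np.array([0] * 99 + [1])
# An arbitrary split into two clusters of 50, unrelated to the labels.
y_pred = np.array([0] * 50 + [1] * 50)
print(purity(y_true, y_pred))  # 0.99
```

Both clusters are dominated by label 0 (50 and 49 samples), so 99 of the 100 samples count as correct even though the clustering carries no information about the labels.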

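Instead, when the number of clusters equals the number of labels, cluster accuracy may be more appropriate. It has the advantage of mirroring classification accuracy in an unsupervised setting. To compute cluster accuracy, we need the Hungarian algorithm to find the optimal matching between cluster labels and true labels. The SciPy function linear_sum_assignment does this: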

import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)

    # Find optimal one-to-one mapping between cluster labels and true labels
    row_ind, col_ind = linear_sum_assignment(-contingency_matrix)

    # Return cluster accuracy
    return contingency_matrix[row_ind, col_ind].sum() / np.sum(contingency_matrix)
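
As a rough check, the sketch below applies the same matching to the imbalanced 99-to-1 example described in this answer. The contingency table is built by hand with NumPy so the snippet stands alone, and the variable names are illustrative only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# 99 samples of label 0, one of label 1, split into two clusters of 50.
y_true = np.array([0] * 99 + [1])
y_pred = np.array([0] * 50 + [1] * 50)

# Contingency table built by hand: rows = true labels, columns = clusters.
table = np.zeros((2, 2), dtype=int)
np.add.at(table, (y_true, y_pred), 1)

# One-to-one matching that maximizes the matched count (hence the negation).
rows, cols = linear_sum_assignment(-table)
accuracy = table[rows, cols].sum() / table.sum()
# Purity lets every cluster pick its majority label independently.
purity = table.max(axis=0).sum() / table.sum()
print(accuracy, purity)  # 0.51 0.99
```

Because each true label can be matched to only one cluster, the second cluster can no longer claim label 0 as well, and the score drops from 0.99 to 0.51.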

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/34047540
