Don't just rely on some heuristic that someone proposed for a very different problem.

The key to clustering is to carefully consider the problem you are working on. What is the proper way of preprocessing the data? How should you scale it (or should you not scale it at all)? How do you measure the similarity of two records so that it quantifies something meaningful for your domain?

It is not about choosing the right algorithm; your task is to do the math that relates your domain problem to what the algorithm does. Don't treat it as a black box. Choosing the approach based on the evaluation step does not work: by then it is already too late. You have probably already made some bad decisions in preprocessing, such as using the wrong distance, scaling, or other parameters.
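To illustrate how much a preprocessing decision can matter before any evaluation step, here is a minimal sketch using scikit-learn. The synthetic data, the feature scales, and the helper name `kmeans_labels` are assumptions made for this example only; the group labels are known solely because the data is generated, and they are used just to show the effect.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: two features carry the group structure,
# a third feature is pure noise measured on a much larger scale.
informative = np.vstack([
    rng.normal(0.0, 0.5, size=(n, 2)),
    rng.normal(10.0, 0.5, size=(n, 2)),
])
noise = rng.normal(0.0, 1000.0, size=(2 * n, 1))
X = np.hstack([informative, noise])
groups = np.repeat([0, 1], n)  # known only because the data is synthetic


def kmeans_labels(data):
    """Cluster into two groups with k-means (Euclidean distance)."""
    return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)


# Same algorithm, same data, different scaling decision.
ari_raw = adjusted_rand_score(groups, kmeans_labels(X))
ari_scaled = adjusted_rand_score(
    groups, kmeans_labels(StandardScaler().fit_transform(X))
)
print(f"ARI without scaling: {ari_raw:.2f}")
print(f"ARI with standardization: {ari_scaled:.2f}")
```

On the raw data the Euclidean distance is dominated by the large-scale noise feature, so k-means splits along it. Standardization is not automatically the right choice either; the point is only that this decision changes what "similar" means, and it has to come from the domain.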

If you are looking for more unsupervised clustering metrics besides the ones you mentioned (in order to be more confident in your findings), you can give the following a try:

<ul>
<li>Gap statistic: you can view the <a href="https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467-9868.00293" rel="nofollow noreferrer">paper here</a>, and here is <a href="https://anaconda.org/milesgranger/gap-statistic/notebook" rel="nofollow noreferrer">an implementation</a>.</li>
<li>Dunn index: you can read more about it <a href="https://en.wikipedia.org/wiki/Dunn_index" rel="nofollow noreferrer">here</a> and <a href="https://clusteval.sdu.dk/1/clustering_quality_measures/3" rel="nofollow noreferrer">here</a>. I found two implementations in Python, <a href="https://github.com/jqmviegas/jqm_cvi" rel="nofollow noreferrer">here</a> and <a href="https://gist.github.com/douglasrizzo/cd7e792ff3a2dcaf27f6" rel="nofollow noreferrer">here</a>.</li>
<li>Davies-Bouldin index: you can read more about the metric <a href="https://stackoverflow.com/questions/48036593/is-my-python-implementation-of-the-davies-bouldin-index-correct">here</a>, <a href="https://www.hackerearth.com/problem/approximate/davies-bouldin-index/" rel="nofollow noreferrer">here</a> and <a href="https://tomron.net/2016/11/30/davies-bouldin-index/" rel="nofollow noreferrer">here</a>. I found two implementations, <a href="https://stackoverflow.com/a/48189218/8160718">here</a> and <a href="https://github.com/akankshadara/Davies_Bouldin_Index_KMeans" rel="nofollow noreferrer">here</a>.</li>
</ul>
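As a rough illustration of the first and third metrics, here is a minimal sketch of the gap statistic built on k-means, with a uniform reference distribution over the bounding box of the data. It is not the linked implementation: the function names `log_wk`, `gap_statistic` and `choose_k`, the defaults, and the synthetic blobs are assumptions for this example. Unlike the Silhouette or Davies-Bouldin scores, the gap statistic is defined for k = 1, so its selection rule can return a single cluster. The Davies-Bouldin value comes from `sklearn.metrics.davies_bouldin_score`, which needs at least two clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score


def log_wk(data, k, seed=0):
    """Log of the pooled within-cluster sum of squares of a k-means fit."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
    return np.log(model.inertia_)


def gap_statistic(data, k_max=5, n_refs=10, seed=0):
    """Return Gap(k) and s(k) for k = 1..k_max.

    Reference sets are drawn uniformly from the bounding box of the data.
    """
    rng = np.random.default_rng(seed)
    lo, hi = data.min(axis=0), data.max(axis=0)
    gaps, s = np.zeros(k_max), np.zeros(k_max)
    for k in range(1, k_max + 1):
        ref = np.array([
            log_wk(rng.uniform(lo, hi, size=data.shape), k, seed)
            for _ in range(n_refs)
        ])
        gaps[k - 1] = ref.mean() - log_wk(data, k, seed)
        s[k - 1] = ref.std() * np.sqrt(1.0 + 1.0 / n_refs)
    return gaps, s


def choose_k(gaps, s):
    """Smallest k with Gap(k) >= Gap(k+1) - s(k+1); else the largest k tried."""
    for k in range(1, len(gaps)):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k
    return len(gaps)


# Hypothetical data: two well-separated blobs, and a single blob.
X_two, _ = make_blobs(n_samples=300, centers=[[-5, 0], [5, 0]],
                      cluster_std=1.0, random_state=0)
X_one, _ = make_blobs(n_samples=300, centers=[[0, 0]],
                      cluster_std=1.0, random_state=0)

gaps_two, s_two = gap_statistic(X_two)
k_two = choose_k(gaps_two, s_two)
gaps_one, s_one = gap_statistic(X_one)
k_one = choose_k(gaps_one, s_one)
print("two blobs -> chosen k:", k_two)
print("one blob  -> chosen k:", k_one)

# Davies-Bouldin (lower is better) is only defined for two or more clusters.
labels_two = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_two)
db_two = davies_bouldin_score(X_two, labels_two)
print("Davies-Bouldin for k=2 on two blobs:", db_two)
```

The reference distribution and the number of reference sets are themselves modelling choices, so treat the chosen k as one piece of evidence rather than a verdict.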

I'm clustering data (trying out multiple algorithms) and trying to evaluate the coherence/integrity of the clusters each algorithm produces. I do not have any ground truth labels, which rules out quite a few metrics for analysing performance.

So far, I've been using the Silhouette score as well as the Calinski-Harabasz score (from sklearn). With these scores, however, I can only compare the integrity of a clustering if the labels produced by an algorithm contain at least 2 clusters, but some of my algorithms propose that one cluster is the most reliable.

Thus, if you don't have any ground truth labels, how do you assess whether the clustering proposed by an algorithm is better than assigning all of the data to just one cluster?

How to analyse the integrity of clustering with no ground truth labels?
