When you have an unbalanced data set, the algorithm weights its success on each data point equally, so the majority class ends up counting for much more than the minority class. The typical solution is to down-sample the majority class until it's the same size as the minority class; an alternative (similar) solution is to adjust the cost function so that the minority class is weighted appropriately.
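As a rough illustration of both ideas, here is a minimal standard-library sketch. The function names `downsample_majority` and `inverse_frequency_weights` and the toy data are made up for this example. Shrinking every class to the size of the rarest one is my assumption about how to extend "same size as the minority class" to more than two classes. The weighting formula `n_samples / (n_classes * class_count)` is one common heuristic, not the only reasonable choice.

```python
import random
from collections import Counter


def downsample_majority(rows, labels, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    # Size of the rarest class: every class is cut down to this many rows.
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical toy data: 990 "no failure" rows (0) and 10 "failure" rows (1).
labels = [0] * 990 + [1] * 10
rows = [[i] for i in range(len(labels))]

balanced_rows, balanced_labels = downsample_majority(rows, labels)
weights = inverse_frequency_weights(labels)
```

With these weights, each class contributes the same total weight to the cost function, however many rows it has. If you are using scikit-learn, the `class_weight` parameter on estimators such as `DecisionTreeClassifier` and `RandomForestClassifier` accepts a dictionary like this one, or the string `"balanced"`, which applies the same formula.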

See these similar questions for more:

<ul>
<li><a href="https://datascience.stackexchange.com/questions/810/should-i-go-for-a-balanced-dataset-or-a-representative-dataset/">Should I go for a 'balanced' dataset or a 'representative' dataset?</a></li>
<li><a href="https://datascience.stackexchange.com/questions/1107/quick-guide-into-training-highly-imbalanced-data-sets/">Quick Guide into training highly imbalanced data sets</a></li>
<li><a href="https://datascience.stackexchange.com/questions/454/what-are-the-implications-for-training-a-tree-ensemble-with-highly-biased-datase">What are the implications for training a tree ensemble with highly biased datasets?</a></li>
<li><a href="https://datascience.stackexchange.com/questions/736/skewed-multi-class-data/">Skewed multi-class data</a></li>
<li><a href="https://datascience.stackexchange.com/questions/6939/ratio-of-positive-to-negative-sample-in-data-set-for-best-classification/">Ratio of positive to negative sample in data set for best classification</a></li>
</ul>

I'm trying to predict rare events, meaning fewer than 1% of cases are positive. Specifically, I'm trying to predict whether a subject will have 0, 1, 2, ..., 6, or more than 6 failures (there are cases in all of those categories).

I've tried several algorithms:

<ul>
<li>decision trees</li>
<li>random forest</li>
<li>AdaBoost</li>
<li>grouping using k-means clustering and finding associations with failures (which group has the most failures)</li>
</ul>

In every case, the model either predicts no failures at all or has too much variance (leading to poor results on the cross-validation set).

Do you know any machine learning algorithms which are better suited for rare events?

Or is it surprising that I get such bad results with those algorithms, which would mean that my feature list is not good?

Thanks a lot.

Predictive analysis of rare events
