First, you can tune on a smaller random subset of rows to reduce computation time.
Then, you can tune on just the initial rounds instead of the full number of rounds. Generally speaking, settings that work well in the initial rounds also work well in the later ones.
However, I recommend testing a few candidate settings on the full number of rounds and the complete dataset at the end, to confirm that the chosen setting really is a good one.
Finally, use fast GPUs. If they are too expensive, cloud services like <a href="https://www.paperspace.com/" rel="nofollow noreferrer">Paperspace</a> offer good ones.
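
As a rough illustration of the row-subsampling step, here is a minimal sketch using NumPy only. The helper name `sample_rows` and the synthetic data are assumptions made for the example (the data is scaled down from 500K rows to keep it light); they are not part of the original answer.

```python
import numpy as np

def sample_rows(X, y, n_rows, seed=0):
    """Return a random subset of n_rows rows, drawn without replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n_rows, replace=False)
    return X[idx], y[idx]

# Synthetic stand-in for the real dataset (assumption, for illustration only).
rng = np.random.default_rng(42)
X_full = rng.normal(size=(50_000, 30))
y_full = (X_full[:, 0] + rng.normal(size=50_000) > 0).astype(int)

# Tune on the small subset, then re-check the winners on the full data.
X_small, y_small = sample_rows(X_full, y_full, n_rows=5_000)
print(X_small.shape, y_small.shape)
```

The same idea applies to the number of rounds: search with a reduced round count first, then re-run only the top candidates with the full count.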

You can use scikit-learn's <code>GridSearchCV</code>, which has a parameter called <code>n_jobs</code>. According to the documentation:
<blockquote>
n_jobs : int, default=None
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
</blockquote>
So by setting <code>n_jobs=-1</code> you run several experiments at the same time, one per available processor, which reduces the overall time needed to find the best hyperparameters.
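
A minimal sketch of a parallel grid search follows. To keep it dependency-free it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (an assumption of this example); `xgboost.XGBClassifier` follows the same estimator interface and could probably be passed in its place. The grid values and the synthetic data are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset with 30 features (stand-in for the real data).
X, y = make_classification(n_samples=600, n_features=30, random_state=0)

# Illustrative grid; real searches usually cover more values.
param_grid = {"max_depth": [2, 3], "learning_rate": [0.05, 0.1]}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    param_grid,
    cv=3,
    n_jobs=-1,  # use all available processors
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```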
However, there are better techniques than grid search, for example Bayesian optimization. With Bayesian optimization, the results of past rounds guide where to look next for the best hyperparameters, so you can get better results than grid search with fewer iterations. Optuna is a Python library that lets you do Bayesian optimization and also lets you <a href="https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/004_distributed.html" rel="nofollow noreferrer">easily parallelize the tuning</a>.

Successive halving is a fairly new way to speed up grid search with cross-validation in scikit-learn: <a href="https://scikit-learn.org/stable/modules/grid_search.html#successive-halving-user-guide" rel="nofollow noreferrer">https://scikit-learn.org/stable/modules/grid_search.html#successive-halving-user-guide</a>
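
A minimal sketch of successive halving with scikit-learn's `HalvingGridSearchCV` might look like the following. The estimator is a scikit-learn stand-in for XGBoost, and the grid, `factor`, and data are assumptions for illustration. The class is still marked experimental, so it needs the explicit `enable_halving_search_cv` import.

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import HalvingGridSearchCV

X, y = make_classification(n_samples=1200, n_features=30, random_state=0)

param_grid = {"max_depth": [2, 3, 4], "learning_rate": [0.05, 0.1, 0.2]}

search = HalvingGridSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    param_grid,
    resource="n_samples",  # candidates start on few rows; survivors get more
    factor=3,              # keep roughly the best third at each iteration
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
print(search.n_resources_)  # number of samples used at each iteration
```

All candidates are evaluated cheaply on a small sample first, and only the most promising ones are re-evaluated on larger samples, which is the subsetting idea from the question applied automatically.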

Yes: take subsets of your data.
Given that you have 500K rows, one approach is to randomly sample two blocks of 5K rows each. Run grid search (see the other answers for suggestions on that) on your first block. Then repeat the experiment on your second block. If the chosen hyperparameters are the same the second time, smile, and use them. If they differ, choose a bigger sample and compare your two candidates on it.
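
The two-block procedure can be sketched roughly as below. Block sizes are scaled down, the data is synthetic, and the scikit-learn estimator stands in for XGBoost; the helper `best_params_on` is a hypothetical name introduced for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=4000, n_features=30, random_state=0)

# Two disjoint random blocks of rows.
rng = np.random.default_rng(0)
perm = rng.permutation(len(X))
block_a, block_b = perm[:1000], perm[1000:2000]

param_grid = {"max_depth": [2, 4], "learning_rate": [0.05, 0.2]}

def best_params_on(idx):
    """Run the same grid search on the rows selected by idx."""
    search = GridSearchCV(
        GradientBoostingClassifier(n_estimators=30, random_state=0),
        param_grid,
        cv=3,
        n_jobs=-1,
    )
    search.fit(X[idx], y[idx])
    return search.best_params_

params_a = best_params_on(block_a)
params_b = best_params_on(block_b)

if params_a == params_b:
    print("Both blocks agree:", params_a)
else:
    print("Blocks disagree; compare on a larger sample:", params_a, params_b)
```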
There are loads of variations on this basic idea. You could use the first two runs to narrow down the hyperparameters, but then increase the sample sizes to fine-tune them.
I said random samples above, but also consider taking representative samples. This would mean making sure the distribution of each of your 30 attributes in your sample roughly matches its distribution in the entire 500K rows. Also, if your dataset is unbalanced, or has lots of missing values, this can require some careful thinking about what you are trying to achieve.
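
One simple way to get partway toward a representative sample is to stratify on the target, as in this sketch. It assumes a binary label with about 10% positives (synthetic data, made up for the example). Note that `stratify=y` only preserves the class proportions; the per-attribute check at the end is a crude comparison of means, not a guarantee that all 30 distributions match.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption, for illustration only).
rng = np.random.default_rng(0)
n = 50_000
X = rng.normal(size=(n, 30))
y = (rng.random(n) < 0.1).astype(int)  # roughly 10% positives

# Draw 5,000 rows while keeping the class proportions of the full data.
X_s, _, y_s, _ = train_test_split(
    X, y, train_size=5_000, stratify=y, random_state=0
)

print("positive rate, full data:", y.mean())
print("positive rate, sample:   ", y_s.mean())

# Crude check that each attribute's mean in the sample is close to the full data.
max_mean_gap = np.abs(X.mean(axis=0) - X_s.mean(axis=0)).max()
print("largest gap in attribute means:", max_mean_gap)
```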

Say I have a dataset with 30 attributes, all of which are vital for my prediction, and 500K rows. I would like to run a grid search for the best hyperparameters of an XGB model. How can I speed up the hyperparameter search, given that finding the best parameters would take a long time? Would subsetting the data be useful?

Hyperparameter search
