Here is what I have done:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score, train_test_split
import lightgbm as lgb

param_test = {
    'learning_rate': [0.01, 0.02, 0.03, 0.04, 0.05, 0.08, 0.1, 0.2, 0.3, 0.4]
}
clf = lgb.LGBMClassifier(
    boosting_type='gbdt',
    num_leaves=31,
    max_depth=-1,
    n_estimators=100,
    subsample_for_bin=200000,
    objective='multiclass',
    class_weight='balanced',  # must be the string 'balanced', not a bare name
    min_split_gain=0.0,
    min_child_weight=0.001,
    min_child_samples=20,
    subsample=1.0,
    subsample_freq=0,
    colsample_bytree=1.0,
    reg_alpha=0.0,
    reg_lambda=0.0,
    random_state=None,
    n_jobs=-1,
    silent=True,
    importance_type='split'
)
gs = GridSearchCV(
    estimator=clf,
    param_grid=param_test,
    scoring='roc_auc',
    cv=3
)
gs.fit(X_train, y_train_lbl["target_encoded"].values)

I got the following error:
/home/cdsw/.local/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _score(estimator, X_test, y_test, scorer, is_multimetric)
597 """
598 if is_multimetric:
--> 599 return _multimetric_score(estimator, X_test, y_test, scorer)
600 else:
601 if y_test is None:
/home/cdsw/.local/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _multimetric_score(estimator, X_test, y_test, scorers)
627 score = scorer(estimator, X_test)
628 else:
--> 629 score = scorer(estimator, X_test, y_test)
630
631 if hasattr(score, 'item'):
/home/cdsw/.local/lib/python3.6/site-packages/sklearn/metrics/scorer.py in __call__(self, clf, X, y, sample_weight)
173 y_type = type_of_target(y)
174 if y_type not in ("binary", "multilabel-indicator"):
--> 175 raise ValueError("{0} format is not supported".format(y_type))
176
177 if is_regressor(clf):
**ValueError: multiclass format is not supported**

So this "multiclass format is not supported" error has me confused. Am I missing something fundamental? I used AUC as the metric. Should it be multi_logloss instead? I tried that too, with no result.
Posted on 2019-11-06 15:14:32
roc_auc cannot be used as a metric for multiclass models in scikit-learn; it only works with binary classifiers or a one-vs-rest classifier. The scikit-learn documentation discusses this.
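As a side note, newer scikit-learn releases (0.22 and later) do add multiclass support to roc_auc_score via one-vs-rest or one-vs-one averaging, and GridSearchCV accepts the matching scorer strings 'roc_auc_ovr' and 'roc_auc_ovo'. A minimal sketch, assuming scikit-learn >= 0.22 (the values below are made up for illustration):

```python
# Sketch assuming scikit-learn >= 0.22, where roc_auc_score gained
# multiclass support through one-vs-rest ('ovr') / one-vs-one ('ovo')
# averaging. For GridSearchCV you can pass scoring='roc_auc_ovr'.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 2, 2, 1, 0])
# predicted class probabilities; each row must sum to 1
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
    [0.3, 0.5, 0.2],
    [0.7, 0.2, 0.1],
])
auc = roc_auc_score(y_true, y_prob, multi_class='ovr')
```

On the scikit-learn 0.20 install from the traceback above, however, this option does not exist, so a custom scorer (as the next answer shows) is the way to go.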
Posted on 2019-11-06 17:05:50
I can't run your code, but I suspect the problem is your choice of scoring, specifically roc_auc. That metric is only defined for binary classification, while you have a multiclass problem. You could fall back on accuracy_score, but it performs poorly when the classes in the dataset are imbalanced. People on Kaggle often use multiclass log loss for this kind of problem. Here is the code, which I found here:
import numpy as np

def multiclass_log_loss(y_true, y_pred, eps=1e-15):
    """Multi class version of Logarithmic Loss metric.
    https://www.kaggle.com/wiki/MultiClassLogLoss

    idea from this post:
    http://www.kaggle.com/c/emc-data-science/forums/t/2149/is-anyone-noticing-difference-betwen-validation-and-leaderboard-error/12209#post12209

    Parameters
    ----------
    y_true : array, shape = [n_samples]
    y_pred : array, shape = [n_samples, n_classes]

    Returns
    -------
    loss : float
    """
    # clip probabilities away from 0 and 1 so log() stays finite
    predictions = np.clip(y_pred, eps, 1 - eps)
    # normalize row sums to 1
    predictions /= predictions.sum(axis=1)[:, np.newaxis]
    # one-hot encode the true labels
    actual = np.zeros(y_pred.shape)
    rows = actual.shape[0]
    actual[np.arange(rows), y_true.astype(int)] = 1
    vsota = np.sum(actual * np.log(predictions))
    return -1.0 / rows * vsota

However, I think this on its own is not enough for GridSearchCV. You can use a custom scorer like the function above, but you need to wrap it with make_scorer:
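As a quick sanity check of the snippet above (not part of the original answer), scikit-learn's built-in sklearn.metrics.log_loss computes the same quantity, so you can compare against it on a tiny hand-made example:

```python
# Hedged check: sklearn.metrics.log_loss computes the same multiclass
# log loss as the snippet above. The numbers here are made up.
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 2, 1])
y_prob = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.10, 0.80],
    [0.25, 0.50, 0.25],
])
loss = log_loss(y_true, y_prob, labels=[0, 1, 2])
# equals -(log 0.7 + log 0.8 + log 0.5) / 3
```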
Note that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each. (from the sklearn documentation)
from sklearn.metrics import make_scorer

@make_scorer
def multiclass_log_loss(y_true, y_pred, eps=1e-15):
    # function body from the snippet above
    ...

Note that the model's output and this function's input must match: the shapes should be the same. I also suggest reading the GridSearchCV documentation; it may help as well.
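A runnable sketch of plugging a multiclass-safe score into GridSearchCV. Rather than the custom wrapper, the simplest route is the built-in 'neg_log_loss' scorer string (log loss negated so that GridSearchCV can maximize it). LogisticRegression on the iris dataset stands in for the LightGBM model purely to keep the sketch self-contained; the same scoring argument works with lgb.LGBMClassifier:

```python
# Sketch: multiclass-safe grid search using the built-in 'neg_log_loss'
# scorer. LogisticRegression is a stand-in model; swap in
# lgb.LGBMClassifier(objective='multiclass') for the original setup.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

gs = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1.0]},
    scoring='neg_log_loss',  # defined for multiclass, higher is better
    cv=3,
)
gs.fit(X, y)
# gs.best_score_ is a negated log loss, so it is always below zero
```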
https://datascience.stackexchange.com/questions/62765