
Q: Hyperparameter tuning on the whole dataset?

Stack Overflow user

Asked 2018-04-11 14:23:04

Answers: 1 | Views: 443 | Followers: 0 | Votes: 2

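This may be an odd question, because I do not fully understand hyperparameter tuning yet.

Currently I am using sklearn's GridSearchCV to tune the parameters of a RandomForestClassifier, like this: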

gs = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42), param_grid={'max_depth': range(5, 25, 4), 'min_samples_leaf': range(5, 40, 5),'criterion': ['entropy', 'gini']}, scoring=scoring, cv=3, refit='Accuracy', n_jobs=-1)
gs.fit(X_Distances, Y)
results = gs.cv_results_

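After that, I inspect best_params_ and best_score_ on the gs object. I then instantiate a RandomForestClassifier with those best parameters and run stratified cross-validation again to record metrics and print a confusion matrix: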

rf = RandomForestClassifier(n_estimators=1000, min_samples_leaf=7, max_depth=18, criterion='entropy', random_state=42)
accuracy = []
metrics = {'accuracy':[], 'precision':[], 'recall':[], 'fscore':[], 'support':[]}
counter = 0

print('################################################### RandomForest ###################################################')
for train_index, test_index in skf.split(X_Distances,Y):
    X_train, X_test = X_Distances[train_index], X_Distances[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)

    precision, recall, fscore, support = np.round(score(y_test, y_pred), 2)
    metrics['accuracy'].append(round(accuracy_score(y_test, y_pred), 2))
    metrics['precision'].append(precision)
    metrics['recall'].append(recall)
    metrics['fscore'].append(fscore)
    metrics['support'].append(support)

    print(classification_report(y_test, y_pred))
    matrix = confusion_matrix(y_test, y_pred)
    methods.saveConfusionMatrix(matrix, ('confusion_matrix_randomforest_distances_' + str(counter) +'.png'))
    counter = counter+1

meanAcc= round(np.mean(np.asarray(metrics['accuracy'])),2)*100
print('meanAcc: ', meanAcc)

Is this a reasonable approach, or am I doing something completely wrong?

Edit:

I just tested the following:

gs = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42), param_grid={'max_depth': range(5, 25, 4), 'min_samples_leaf': range(5, 40, 5),'criterion': ['entropy', 'gini']}, scoring=scoring, cv=3, refit='Accuracy', n_jobs=-1)
gs.fit(X_Distances, Y)

This yields best_score = 0.5362903225806451 at best_index = 28. When I check the accuracy of the three folds at index 28, I get:

split0: 0.5185929648241207
split1: 0.526686807653575
split2: 0.5637651821862348

This gives a mean test accuracy of 0.5362903225806451. best_params: {'criterion': 'entropy', 'max_depth': 21, 'min_samples_leaf': 5}
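For readers who want to reproduce this kind of per-fold check, here is a minimal sketch. It assumes that `scoring` in the question is a dict containing an 'Accuracy' entry (suggested by refit='Accuracy', but not shown), and it uses synthetic data and a much smaller grid as stand-ins for X_Distances, Y, and the original search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_Distances and Y (assumption: the real data is not shown).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=42)

# Assumption: the question's `scoring` is a dict with an 'Accuracy' key.
scoring = {"Accuracy": "accuracy"}

gs = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid={"max_depth": [3, 5], "min_samples_leaf": [5, 10]},
    scoring=scoring,
    cv=3,
    refit="Accuracy",
)
gs.fit(X, y)

# With multi-metric scoring, per-fold results are stored under keys of the
# form 'split<i>_test_<metric name>'.
best = gs.best_index_
split_scores = [gs.cv_results_[f"split{i}_test_Accuracy"][best] for i in range(3)]

print("per-fold accuracy:", split_scores)
print("mean_test_Accuracy:", gs.cv_results_["mean_test_Accuracy"][best])
print("best_score_:", gs.best_score_)
```

best_score_ is the mean_test_Accuracy entry at best_index_, so comparing it with the per-fold values shows how the folds were combined.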

Now I run the code below, which uses those best_params with a stratified 3-fold split (as in GridSearchCV):

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, max_depth=21, criterion='entropy', random_state=42)
accuracy = []
metrics = {'accuracy':[], 'precision':[], 'recall':[], 'fscore':[], 'support':[]}
counter = 0
print('################################################### RandomForest_Gini ###################################################')
for train_index, test_index in skf.split(X_Distances,Y):
    X_train, X_test = X_Distances[train_index], X_Distances[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)

    precision, recall, fscore, support = np.round(score(y_test, y_pred))
    metrics['accuracy'].append(accuracy_score(y_test, y_pred))
    metrics['precision'].append(precision)
    metrics['recall'].append(recall)
    metrics['fscore'].append(fscore)
    metrics['support'].append(support)

    print(classification_report(y_test, y_pred))
    matrix = confusion_matrix(y_test, y_pred)
    methods.saveConfusionMatrix(matrix, ('confusion_matrix_randomforest_distances_' + str(counter) +'.png'))
    counter = counter+1

meanAcc= np.mean(np.asarray(metrics['accuracy']))
print('meanAcc: ', meanAcc)

The accuracies in the metrics dictionary are exactly the same (split0: 0.5185929648241207, split1: 0.526686807653575, split2: 0.5637651821862348).

However, the mean I compute is slightly off: 0.5363483182213101.
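One possible explanation for the small gap, offered as an assumption rather than something stated in the question: older scikit-learn releases had an `iid` option on GridSearchCV (since removed) that, when enabled, averaged fold scores weighted by the number of test samples in each fold, whereas np.mean weights every fold equally. The fold sizes below (995, 993, 988) are not given in the question; they are inferred from the reported accuracies, which match 516/995, 523/993 and 557/988. A short sketch of the arithmetic:

```python
# Per-fold accuracies reported in the question.
scores = [0.5185929648241207, 0.526686807653575, 0.5637651821862348]

# Assumption: test-fold sizes inferred from the scores, not stated in the question.
fold_sizes = [995, 993, 988]

# Plain mean: every fold counts equally (what np.mean does).
plain_mean = sum(scores) / len(scores)

# Weighted mean: every test sample counts equally.
weighted_mean = sum(s * n for s, n in zip(scores, fold_sizes)) / sum(fold_sizes)

print("plain mean:   ", plain_mean)
print("weighted mean:", weighted_mean)
```

Under these assumed fold sizes, the plain mean agrees with the hand-computed 0.53634... and the weighted mean agrees with the reported best_score of 0.53629..., so the difference would come from the averaging scheme rather than from the model.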

python

machine-learning

hyperparameters

1 Answer

Stack Overflow user

Answered 2018-04-11 14:38:19

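Although this looks like a promising approach, you are taking a risk: you are tuning and then evaluating the result of that tuning on the same dataset.

In some cases that is a legitimate approach, but I would carefully check how far the metrics you end up with are from the reported best_score. If they are far apart, you should tune your model on the training set only (right now you are using all of the data for tuning). In practice, this means performing the split beforehand and making sure GridSearchCV never sees the test set.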

You can do it like this:

train_x, val_x, train_y, val_y = train_test_split(X_Distances, Y, test_size=0.3, random_state=42)

Then run the tuning and training on train_x, train_y.

On the other hand, if the two scores are close, I think you are fine.
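The comparison described above can be sketched end to end as follows. This is only an illustration under stated assumptions: the data is synthetic (a stand-in for X_Distances and Y), the grid is deliberately tiny, plain accuracy is used as the metric, and stratify=y is an addition that the original snippet does not include.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for X_Distances and Y (assumption: the real data is not shown).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=42)

# Hold out 30% before any tuning. Note the return order:
# X_train, X_test, y_train, y_test.
train_x, val_x, train_y, val_y = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Tune on the training split only; GridSearchCV never sees val_x / val_y.
gs = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid={"max_depth": [3, 5], "min_samples_leaf": [5, 10]},
    scoring="accuracy",
    cv=3,
)
gs.fit(train_x, train_y)

# refit=True (the default) retrains the best model on all of train_x,
# so gs.predict uses that refitted estimator.
cv_score = gs.best_score_
held_out_score = accuracy_score(val_y, gs.predict(val_x))

print("best params:", gs.best_params_)
print("CV accuracy on training split:", round(cv_score, 3))
print("accuracy on held-out split:   ", round(held_out_score, 3))
```

If the two printed accuracies are close, the cross-validated estimate is probably not overly optimistic; a large gap would suggest the tuning has overfit to the folds.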

Votes: 3

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/49777618
