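I'm trying to train a LightGBM model on the Kaggle Iowa housing dataset, and I wrote a small script that randomly tries different parameters within given ranges. I'm not sure what is wrong with my code, but the script returns the same score with different parameters, which shouldn't happen. I tried the same script with CatBoost and it works as expected, so I guess the problem is on the LGBM side.
Code: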
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from random import choice, randrange, uniform

complete_train = pd.read_csv("train.csv", encoding="UTF-8", index_col="Id")
complete_test = pd.read_csv("test.csv", encoding="UTF-8", index_col="Id")


def encode_impute(*datasets):
    # Fill missing values with -999 and turn text columns into categoricals.
    for dataset in datasets:
        for column in dataset.columns:
            dataset[column] = dataset[column].fillna(-999)
            if dataset[column].dtype == "object":
                dataset[column] = dataset[column].astype("category")


encode_impute(complete_train, complete_test)

X = complete_train.drop(columns="SalePrice")
y = complete_train["SalePrice"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y)


def objective():
    # Endless random search: sample parameters, fit, and print the validation MAE.
    while True:
        params = {
            "boosting_type": choice(["gbdt", "goss", "dart", "rf"]),
            "num_leaves": randrange(10000),
            "learning_rate": uniform(0.01, 1),
            "subsample_for_bin": randrange(100000000),
            "min_data_in_leaf": randrange(100000000),
            "reg_alpha": uniform(0, 1),
            "reg_lambda": uniform(0, 1),
            "feature_fraction": uniform(0, 1),
            "bagging_fraction": uniform(0, 1),
            "bagging_freq": randrange(1, 100)}
        # Force bagging_fraction to 1.0 when the boosting type is goss.
        params["bagging_fraction"] = (
            1.0 if params["boosting_type"] == "goss" else params["bagging_fraction"])
        model = LGBMRegressor().set_params(**params)
        model.fit(X_train, y_train)
        predictions = model.predict(X_valid)
        error_rate = mean_absolute_error(y_valid, predictions)
        print(f"Score = {error_rate} with parameters: {params}", "\n" * 5)


objective()

Example of the output I get:
Score = 55967.70375930444 with parameters: {'boosting_type': 'gbdt', 'num_leaves': 6455, 'learning_rate': 0.2479700848039991, 'subsample_for_bin': 83737077, 'min_data_in_leaf': 51951103, 'reg_alpha': 0.1856001984332697, 'reg_lambda': 0.7849262049058852, 'feature_fraction': 0.10550627738309537, 'bagging_fraction': 0.2613298736131875, 'bagging_freq': 96}
Score = 55967.70375930444 with parameters: {'boosting_type': 'dart', 'num_leaves': 9678, 'learning_rate': 0.28670432435369037, 'subsample_for_bin': 24246091, 'min_data_in_leaf': 559094, 'reg_alpha': 0.07261459695501371, 'reg_lambda': 0.8834743560240725, 'feature_fraction': 0.5361519020265366, 'bagging_fraction': 0.9120030047714073, 'bagging_freq': 10}
Score = 55967.70375930444 with parameters: {'boosting_type': 'goss', 'num_leaves': 4898, 'learning_rate': 0.09237499846487345, 'subsample_for_bin': 32620066, 'min_data_in_leaf': 71317820, 'reg_alpha': 0.9818297737748625, 'reg_lambda': 0.11638265354331834, 'feature_fraction': 0.4230342728468828, 'bagging_fraction': 1.0, 'bagging_freq': 64}
Posted on 2020-06-18 14:56:55
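One thing I would point out is that the min_data_in_leaf parameter looks far too high in all of your trials. The sampled values (559094 to 71317820 in the examples shown) are much larger than the number of rows in the training set, so no split can leave that many rows in each leaf. I suspect the model is not learning anything and is just returning the mean of the response variable from a tree that has only a root node, which would explain why every trial gives the same score.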
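To make that concrete, here is a minimal sketch of the arithmetic using only the standard library. It assumes the standard Kaggle train.csv with 1460 rows and the default 75/25 train_test_split, which leaves 1095 training rows. The helpers can_split and sample_size_params are hypothetical names made up for this illustration, and the sampling ranges are only examples, not recommended values.

```python
import random

# Assumption: train.csv has 1460 rows and the default train_test_split
# keeps 75% of them for training.
N_TRAIN = 1095


def can_split(n_rows, min_data_in_leaf):
    """True if n_rows can be divided into two leaves that each hold
    at least min_data_in_leaf rows."""
    return n_rows >= 2 * min_data_in_leaf


def sample_size_params(n_train, rng):
    """Hypothetical sampler whose ranges are tied to the training-set size."""
    return {
        "num_leaves": rng.randrange(2, 256),
        "min_data_in_leaf": rng.randrange(1, max(2, n_train // 20)),
    }


# min_data_in_leaf values taken from the three outputs in the question.
question_values = [51951103, 559094, 71317820]
print([can_split(N_TRAIN, m) for m in question_values])  # [False, False, False]

# With bounded ranges, the root node can always be split at least once.
rng = random.Random(0)
params = sample_size_params(N_TRAIN, rng)
print(params, can_split(N_TRAIN, params["min_data_in_leaf"]))
```

Swapping a bounded range like this into the params dictionary of the original script should let the trees grow beyond the root, so different trials would be expected to produce different scores.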
https://stackoverflow.com/questions/62436635