Q: How do I use GridSearchCV to test combinations of preprocessing steps in a nested pipeline?

Stack Overflow user

Asked 2020-09-02 03:56:31

2 answers, 1.2K views, 0 followers, score 1

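I have been working on a classification problem, using sklearn's Pipeline to combine a preprocessing step (scaling) and a cross-validation step (GridSearchCV) with logistic regression.

Here is the simplified code: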

# import dependencies
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler   

# scaler and encoder options
scaler = StandardScaler()   # there are 3 options that I want to try
encoder = OneHotEncoder()   # only one option, no need to GridSearch it

# use ColumnTransformer to apply different preprocesses to numerical and categorical columns
preprocessor = ColumnTransformer(transformers = [('categorical', encoder, cat_columns),
                                                 ('numerical', scaler, num_columns),
                                                ])

# combine the preprocessor with LogisticRegression() using Pipeline 
full_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                                  ('log_reg', LogisticRegression())])

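What I want to do is try different scaling methods (for example standard scaling, robust scaling, and so on) and, after trying all of them, pick the one that yields the best metric (i.e. accuracy). However, I don't know how to do this with GridSearchCV: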

from sklearn.model_selection import GridSearchCV

# set params combination I want to try
scaler_options = {'numerical':[StandardScaler(), RobustScaler(), MinMaxScaler()]}

# initialize GridSearchCV using full_pipeline as final estimator
grid_cv = GridSearchCV(full_pipeline, param_grid = scaler_options, cv = 5)

# fit the data 
grid_cv.fit(X_train, y_train)

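I know the code above won't work, specifically because of the scaler_options I passed as param_grid. I realize that GridSearchCV cannot handle the scaler_options I set. Why? Because it is not a hyperparameter of the pipeline (unlike 'log_reg__C', which is a hyperparameter of LogisticRegression() that GridSearchCV can access). Instead, it is a component of the ColumnTransformer that I nested inside full_pipeline.

So the main question is: how do I automate GridSearchCV to test all the scaler options, given that the scaler is a component of a sub-pipeline (i.e. the ColumnTransformer above)?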
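One way to see which names GridSearchCV can address is to list the keys returned by get_params() on the pipeline. The sketch below rebuilds the question's pipeline with made-up column names (cat_columns and num_columns here are hypothetical stand-ins) and prints the keys related to the scaler and to log_reg__C. It is only an illustration of the naming scheme, not part of the original question.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names, standing in for the question's cat_columns / num_columns
cat_columns = ["colour"]
num_columns = ["height", "weight"]

preprocessor = ColumnTransformer(transformers=[
    ("categorical", OneHotEncoder(), cat_columns),
    ("numerical", StandardScaler(), num_columns),
])
full_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("log_reg", LogisticRegression()),
])

# Every key returned by get_params() can be used as a param_grid key
param_names = sorted(full_pipeline.get_params().keys())
for name in param_names:
    if name.startswith("preprocessor__numerical") or name == "log_reg__C":
        print(name)
```

Nested names are joined with a double underscore, so the scaler sits under 'preprocessor__numerical' rather than under the bare name 'numerical' used in scaler_options.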

Tags: logistic-regression, grid-search, python, machine-learning, pipeline

2 Answers

Stack Overflow user

Posted 2020-12-27 15:51:54

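As you suggested, you can create a class that takes the scaler you want to use as an __init__() parameter.

You can then specify, in the grid search parameters, the scaler that your class should be initialized with.

I wrote this; I hope it helps: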

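from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class ScalerSelector(BaseEstimator, TransformerMixin):
    """Thin wrapper that exposes the wrapped scaler as a tunable parameter."""

    def __init__(self, scaler=StandardScaler()):
        super().__init__()
        self.scaler = scaler

    def fit(self, X, y=None):
        # Fit the wrapped scaler, then return self as the scikit-learn API expects
        self.scaler.fit(X)
        return self

    def transform(self, X, y=None):
        return self.scaler.transform(X)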

Here is a complete example that you can run to test it:

# import dependencies
from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.datasets import load_breast_cancer

from sklearn.base import BaseEstimator, TransformerMixin

import pandas as pd

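class ScalerSelector(BaseEstimator, TransformerMixin):
    """Thin wrapper that exposes the wrapped scaler as a tunable parameter."""

    def __init__(self, scaler=StandardScaler()):
        super().__init__()
        self.scaler = scaler

    def fit(self, X, y=None):
        # Fit the wrapped scaler, then return self as the scikit-learn API expects
        self.scaler.fit(X)
        return self

    def transform(self, X, y=None):
        return self.scaler.transform(X)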


data = load_breast_cancer()
features = data["data"]
target = data["target"]
data = pd.DataFrame(data['data'], columns=data['feature_names'])
col_names = data.columns.tolist()

# scaler and encoder options
my_scaler = ScalerSelector()

preprocessor = ColumnTransformer(transformers = [('numerical', my_scaler, col_names)
                                                ])

# combine the preprocessor with LogisticRegression() using Pipeline 
full_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                                  ('log_reg', LogisticRegression())
                                  ])

# set params combination I want to try
scaler_options = {'preprocessor__numerical__scaler':[StandardScaler(), RobustScaler(), MinMaxScaler()]}

# initialize GridSearchCV using full_pipeline as final estimator
grid_cv = GridSearchCV(full_pipeline, param_grid = scaler_options)

# fit the data 
grid_cv.fit(data, target)

# best params (printed so the result also shows up outside a notebook)
print(grid_cv.best_params_)
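best_params_ only reports the winner. To compare every scaler that was tried, the cv_results_ attribute can be tabulated, as in the sketch below. For brevity it sets the scaler directly through the 'preprocessor__numerical' key rather than through the wrapper class, and it raises max_iter to 1000 to reduce convergence warnings; both choices are assumptions of this sketch, not part of the answer above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# All features in this dataset are numerical
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

preprocessor = ColumnTransformer(transformers=[
    ("numerical", StandardScaler(), X.columns.tolist()),
])
full_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("log_reg", LogisticRegression(max_iter=1000)),  # max_iter is this sketch's choice
])

# Replace the whole 'numerical' transformer with each candidate scaler
param_grid = {
    "preprocessor__numerical": [StandardScaler(), RobustScaler(), MinMaxScaler()],
}
grid_cv = GridSearchCV(full_pipeline, param_grid=param_grid, cv=5)
grid_cv.fit(X, y)

# One row per candidate, ordered from best to worst mean accuracy
results = pd.DataFrame(grid_cv.cv_results_)
summary = results[[
    "param_preprocessor__numerical",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]].sort_values("rank_test_score")
print(summary)
```

Looking at std_test_score next to mean_test_score helps judge whether the gap between scalers is larger than the fold-to-fold variation.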

Score: 2

Stack Overflow user

Posted 2021-03-17 15:31:10

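You can achieve what you want without creating a custom transformer. You can even pass 'passthrough' in the param_grid to experiment with the scenario where you don't want any scaling at that step.

In this example, suppose we want to investigate whether the model does better when a Scaler transformer is applied to the numerical features, num_features.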

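# Imports the snippet relies on; `selector` is assumed to be make_column_selector
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler

# `train` is assumed to be a pandas DataFrame with a 'target' column
cat_features = selector(dtype_exclude='number')(train.drop('target', axis=1))
num_features = selector(dtype_include='number')(train.drop('target', axis=1))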

cat_preprocessor = Pipeline(steps=[
    ('oh', OneHotEncoder(handle_unknown='ignore')),
    ('ss', StandardScaler()) 
])
num_preprocessor = Pipeline(steps=[ 
    ('pt', PowerTransformer(method='yeo-johnson')),
    ('ss', StandardScaler()) # Create a place holder for your test here !!!                                   
]) 
preprocessor = ColumnTransformer(transformers=[ 
    ('cat', cat_preprocessor, cat_features),
    ('num', num_preprocessor, num_features)                                                       
])
model = Pipeline(steps=[
    ('prep', preprocessor),
    ('clf', RidgeClassifier())
])
X = train.drop('target', axis=1)
y = train['target']
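# Key path: 'prep' (pipeline step) -> 'cat' (ColumnTransformer entry) -> 'ss' (sub-pipeline step).
# As written, this grid swaps the scaler in the categorical branch; with_mean=False is
# needed there because the one-hot output is sparse. Use 'prep__num__ss' to target the
# numerical branch instead.
param_grid = {
    'prep__cat__ss': ['passthrough', StandardScaler(with_mean=False)]
}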
gs = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='roc_auc',
    n_jobs=-1,
    cv=2
)
gs.fit(X, y)
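Since train is not defined in the snippet above, here is a minimal self-contained sketch of the same idea on synthetic data. The DataFrame, its column names and its values are made up for illustration, and the grid targets the numerical branch ('prep__num__ss') with 'passthrough' as the no-scaling option.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler

# Synthetic stand-in for `train`; columns and values are hypothetical
rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.lognormal(10, 1, n),
    "city": rng.choice(["A", "B", "C"], n),
})
train["target"] = (train["age"] + rng.normal(0, 5, n) > 40).astype(int)

X = train.drop("target", axis=1)
y = train["target"]

# 'ss' is the placeholder step that the grid search will replace
num_preprocessor = Pipeline(steps=[("ss", StandardScaler())])
preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), selector(dtype_exclude="number")),
    ("num", num_preprocessor, selector(dtype_include="number")),
])
model = Pipeline(steps=[("prep", preprocessor), ("clf", RidgeClassifier())])

# 'passthrough' skips the step entirely, i.e. no scaling of the numerical features
param_grid = {"prep__num__ss": ["passthrough", StandardScaler(), RobustScaler()]}
gs = GridSearchCV(model, param_grid=param_grid, scoring="roc_auc", cv=3)
gs.fit(X, y)
print(gs.best_params_)
```

Keeping the scaler as a named step inside a sub-pipeline means other steps of that branch can be tuned in the same grid by adding further 'prep__num__...' keys.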

Score: 0

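The original page content is provided by Stack Overflow; translation support was provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: https://stackoverflow.com/questions/63698484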
