How to use SMOTENC inside a pipeline (error: Some of the categorical indices are out of range)?

Stack Overflow user
Asked 2019-01-24 08:47:43
4 answers, 3.3K views, 0 followers, 3 votes

I would be very grateful if you could let me know how to use SMOTENC. I wrote:

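# Imports needed by this snippet. MultiColumn, rg and param_grid are
# defined elsewhere in the asker's script and are not shown here.
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import make_pipeline

# Data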
XX = pd.read_csv('Financial Distress.csv')
y = np.array(XX['Financial Distress'].values.tolist())
y = np.array([0 if i > -0.50 else 1 for i in y])
Na = np.array(pd.read_csv('Na.csv', header=None).values)

XX = XX.iloc[:, 3:127]

# Use get-dummies to convert categorical features into dummy ones
dis_features = ['x121']
X = pd.get_dummies(XX, columns=dis_features)

# # Divide Data into Train and Test
indices = np.arange(y.shape[0])
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(X, y, indices, stratify=y, test_size=0.3,
                                                                         random_state=42)
num_indices=list(X)[:X.shape[1]-37]
cat_indices=list(X)[X.shape[1]-37:]
num_indices1 = list(X.iloc[:,np.r_[0:94,95,97,100:123]].columns.values)
cat_indices1 = list(X.iloc[:,np.r_[94,96,98,99,123:160]].columns.values)
print(len(num_indices1))
print(len(cat_indices1))

pipeline=Pipeline(steps= [
    # Categorical features
    ('feature_processing', FeatureUnion(transformer_list = [
            ('categorical', MultiColumn(cat_indices)),

            #numeric
            ('numeric', Pipeline(steps = [
                ('select', MultiColumn(num_indices)),
                ('scale', StandardScaler())
                        ]))
        ])),
    ('clf', rg)
    ]
)
pipeline_with_resampling = make_pipeline(SMOTENC(categorical_features=cat_indices1), pipeline)

# # Grid Search to determine best params
cv=StratifiedKFold(n_splits=5,random_state=42)
rg_cv = GridSearchCV(pipeline_with_resampling, param_grid, cv=cv, scoring = 'f1')
rg_cv.fit(X_train, y_train)

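So, as the code shows, I have 5 categorical features. In fact, indices 123 to 160 all belong to a single categorical feature with 37 possible values, which get_dummies converts into 37 columns. Unfortunately, this raises the following error: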

Traceback (most recent call last):
  File "D:/mifs-master_2/MU/learning-from-imbalanced-classes-master/learning-from-imbalanced-classes-master/continuous/Final Logit/SMOTENC/logit-final - Copy.py", line 424, in <module>
    rg_cv.fit(X_train, y_train)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 722, in fit
    self._run_search(evaluate_candidates)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 1191, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 711, in evaluate_candidates
    cv.split(X, y, groups)))
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 917, in __call__
    if self.dispatch_one_batch(iterator):
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 549, in __init__
    self.results = batch()
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 528, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\pipeline.py", line 237, in fit
    Xt, yt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\pipeline.py", line 200, in _fit
    cloned_transformer, Xt, yt, **fit_params_steps[name])
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\externals\joblib\memory.py", line 342, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\pipeline.py", line 576, in _fit_resample_one
    X_res, y_res = sampler.fit_resample(X, y, **fit_params)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\base.py", line 85, in fit_resample
    output = self._fit_resample(X, y)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\over_sampling\_smote.py", line 940, in _fit_resample
    self._validate_estimator()
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\imblearn\over_sampling\_smote.py", line 933, in _validate_estimator
    ' should be between 0 and {}'.format(self.n_features_))
ValueError: Some of the categorical indices are out of range. Indices should be between 0 and 160

Thanks in advance.

4 Answers

Stack Overflow user

Accepted answer

Posted 2019-10-25 13:54:27

Two pipelines should be used, as shown below:

num_indices1 = list(X.iloc[:,np.r_[0:94,95,97,100:120,121:123]].columns.values)
cat_indices1 = list(X.iloc[:,np.r_[94,96,98,99,120]].columns.values)
print(len(num_indices1))
print(len(cat_indices1))
cat_indices = [94, 96, 98, 99, 120]

from imblearn.pipeline import make_pipeline

pipeline=Pipeline(steps= [
    # Categorical features
    ('feature_processing', FeatureUnion(transformer_list = [
            ('categorical', MultiColumn(cat_indices1)),

            #numeric
            ('numeric', Pipeline(steps = [
                ('select', MultiColumn(num_indices1)),
                ('scale', StandardScaler())
                        ]))
        ])),
    ('clf', rg)
    ]
)
pipeline_with_resampling = make_pipeline(SMOTENC(categorical_features=cat_indices), pipeline)
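
As far as can be told from the traceback, the error arises because the question passes column names to `categorical_features`, whereas the code above passes integer positions. A minimal sketch of converting names to positions with pandas, assuming a small hypothetical DataFrame (the column names and values are made up for illustration and are not the asker's data):

```python
import pandas as pd

# Hypothetical frame standing in for the asker's X (names and values are made up).
X = pd.DataFrame({
    "x1": [0.1, 0.2, 0.3],
    "x2": [1.0, 2.0, 3.0],
    "x80": [0, 1, 0],
    "x3": [5.0, 6.0, 7.0],
    "x121": [3, 7, 3],
})

# Names of the categorical columns, kept as single (non-dummy) columns.
cat_names = ["x80", "x121"]

# Index.get_indexer returns the integer position of each name (-1 if a name is missing).
cat_positions = X.columns.get_indexer(cat_names).tolist()
if -1 in cat_positions:
    raise KeyError("some categorical column names are not present in X")

print(cat_positions)  # [2, 4]
```

The resulting list of integers is the kind of value that `SMOTENC(categorical_features=...)` is given in the answer above.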

Votes: 2

Stack Overflow user

Posted 2019-08-09 07:56:19

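You cannot dummy-encode your categorical variables and then pass them to SMOTENC, because SMOTENC already performs the equivalent of get_dummies inside its algorithm; encoding them beforehand will bias your model. Alternatively, I suggest using SMOTE() instead of SMOTENC(), but in that case you have to apply get_dummies first.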

Votes: 1

Stack Overflow user

Posted 2021-02-24 16:33:36

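You cannot use a scikit-learn pipeline together with an imblearn pipeline. The imblearn pipeline implements fit_sample and fit_predict. The sklearn pipeline only implements fit_predict. You cannot combine them.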

Votes: 0
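The original content of this page is provided by Stack Overflow. Translation support was provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link: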

https://stackoverflow.com/questions/54342569
