AttributeError when using SMOTENC in an imblearn pipeline

Stack Overflow user
Asked 2021-03-03 11:44:19
Answers: 1 · Views: 538 · Followers: 0 · Score: 1

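I am trying to build a pipeline with FAMD, SMOTENC, and other preprocessing steps. However, it raises an error every time. If I remove FAMD from the pipeline, it works fine.

My code: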

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Separate the dataset into numeric and categorical columns
num_df = X_train_new.select_dtypes(include=[np.number]).columns
cat_df = X_train_new.select_dtypes(exclude=[np.number]).columns

# Create a mask for categorical features
categorical_feature_mask = X_train_new.dtypes == object
print(categorical_feature_mask)

from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector as selector

# Create a pipeline to automate the preprocessing steps and SMOTENC together
num_pipe = make_pipeline(SimpleImputer(strategy='median'))
cat_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'),
                         OneHotEncoder(handle_unknown='ignore'))
transformer = make_column_transformer((num_pipe, selector(dtype_include='number')),
                                      (cat_pipe, selector(dtype_include='object')), n_jobs=2)

# Oversampling with SMOTENC
from imblearn.over_sampling import SMOTENC
smote = SMOTENC(categorical_features=categorical_feature_mask, random_state=99)

!pip install prince
from prince import FAMD
famd = FAMD(n_components=4, random_state=99)

from imblearn.pipeline import make_pipeline as imb_pipeline
# Fit the random forest learner
rf = RandomForestClassifier(n_estimators=300, random_state=99)
pipe = imb_pipeline(transformer, smote, famd, rf)
pipe.fit(X_train_new, y_train_new)
print('Training Accuracy:%s' % pipe.score(X_train_new, y_train_new))

Error:

AttributeError                            Traceback (most recent call last)

<ipython-input-24-2b7ea084a318> in <module>()
      3 rf=RandomForestClassifier(n_estimators=300,max_features=3,criterion='entropy',random_state=99)
      4 pipe=imb_pipeline(transformer,smote,famd,rf)
----> 5 pipe.fit(X_train_new,y_train_new)
      6 print('Training Accuracy:%s'%pipe.score(X_train_new,y_train_new))

6 frames

/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py in fit(self, X, y, **fit_params)
    235 
    236         """
--> 237         Xt, yt, fit_params = self._fit(X, y, **fit_params)
    238         if self._final_estimator is not None:
    239             self._final_estimator.fit(Xt, yt, **fit_params)

/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py in _fit(self, X, y, **fit_params)
    195                     Xt, fitted_transformer = fit_transform_one_cached(
    196                         cloned_transformer, None, Xt, yt,
--> 197                         **fit_params_steps[name])
    198                 elif hasattr(cloned_transformer, "fit_resample"):
    199                     Xt, yt, fitted_transformer = fit_resample_one_cached(

/usr/local/lib/python3.7/dist-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py in _fit_transform_one(transformer, weight, X, y, **fit_params)
    564 def _fit_transform_one(transformer, weight, X, y, **fit_params):
    565     if hasattr(transformer, 'fit_transform'):
--> 566         res = transformer.fit_transform(X, y, **fit_params)
    567     else:
    568         res = transformer.fit(X, y, **fit_params).transform(X)

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    572         else:
    573             # fit method of arity 2 (supervised transformation)
--> 574             return self.fit(X, y, **fit_params).transform(X)
    575 
    576 

/usr/local/lib/python3.7/dist-packages/prince/famd.py in fit(self, X, y)
     27 
     28         # Separate numerical columns from categorical columns
---> 29         num_cols = X.select_dtypes(np.number).columns.tolist()
     30         cat_cols = list(set(X.columns) - set(num_cols))
     31 

/usr/local/lib/python3.7/dist-packages/scipy/sparse/base.py in __getattr__(self, attr)
    689             return self.getnnz()
    690         else:
--> 691             raise AttributeError(attr + " not found")
    692 
    693     def transpose(self, axes=None, copy=False):

AttributeError: select_dtypes not found

1 Answer

Stack Overflow user

Accepted answer

Answered 2021-03-04 15:23:34

tl;dr: Try adding sparse=False to OneHotEncoder. Consider raising an issue with prince about handling sparse inputs.
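To illustrate the suggestion, here is a minimal sketch of switching OneHotEncoder to dense output. The toy DataFrame is hypothetical and not the asker's data. One assumption worth flagging: the answer's sparse=False spelling applies to the scikit-learn release in the traceback, while scikit-learn 1.2 and later name the parameter sparse_output, so the sketch picks whichever name the installed version accepts.

```python
import inspect

import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with two categorical columns (illustration only).
df = pd.DataFrame(
    {"color": ["red", "blue", "red"], "size": ["S", "M", "L"]},
    dtype=object,
)

# By default the encoder returns a SciPy sparse matrix.
default_out = OneHotEncoder(handle_unknown="ignore").fit_transform(df)

# The dense switch is called `sparse` in older releases and
# `sparse_output` from scikit-learn 1.2 on; use whichever exists.
params = inspect.signature(OneHotEncoder).parameters
dense_kwarg = "sparse_output" if "sparse_output" in params else "sparse"
dense_encoder = OneHotEncoder(handle_unknown="ignore", **{dense_kwarg: False})
dense_out = dense_encoder.fit_transform(df)

print("default output is sparse:", sparse.issparse(default_out))
print("dense output is sparse:", sparse.issparse(dense_out))
print("dense output shape:", dense_out.shape)
```

In the question's pipeline, the same keyword would go on the OneHotEncoder inside cat_pipe.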

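From the traceback, we can see that the problem is that FAMD.fit calls X.select_dtypes to separate the categorical data from the numeric data. select_dtypes is a pandas function, so ordinarily I would assume that prince is meant to operate on DataFrames rather than on the NumPy arrays that scikit-learn uses internally (after converting from DataFrames where necessary). However, looking at the source code, prince converts a NumPy array to a DataFrame a few lines earlier. The final clue is that the last frame of the traceback comes from SciPy. That implies your X is probably a sparse matrix. Indeed, OneHotEncoder (earlier in the pipeline) outputs sparse matrices by default, and ColumnTransformer decides whether to return sparse or dense output based on its component transformers and its sparse_threshold parameter.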
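The diagnosis can be checked without prince or imblearn. The sketch below rebuilds a preprocessing step shaped like the one in the question on hypothetical toy data, then compares ColumnTransformer output at the default sparse_threshold of 0.3 with sparse_threshold=0, which forces a dense result. The column names and the helper build_transformer are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numeric column and one categorical column with
# many distinct values, so the one-hot block is mostly zeros.
X = pd.DataFrame({
    "amount": np.arange(10, dtype=float),
    "city": pd.Series([f"city_{i}" for i in range(10)], dtype=object),
})


def build_transformer(sparse_threshold):
    """Preprocessing shaped like the question's, with a configurable threshold."""
    num_pipe = make_pipeline(SimpleImputer(strategy="median"))
    cat_pipe = make_pipeline(SimpleImputer(strategy="most_frequent"),
                             OneHotEncoder(handle_unknown="ignore"))
    return make_column_transformer(
        (num_pipe, selector(dtype_include="number")),
        (cat_pipe, selector(dtype_include="object")),
        sparse_threshold=sparse_threshold,
    )


# Default threshold: the stacked output has low density, so it stays sparse.
default_out = build_transformer(0.3).fit_transform(X)
# Threshold 0: the stacked output is always converted to a dense array.
dense_out = build_transformer(0).fit_transform(X)

print("default output is sparse:", sparse.issparse(default_out))
print("has select_dtypes:", hasattr(default_out, "select_dtypes"))
print("dense output shape:", dense_out.shape)

# A dense array can be wrapped in a DataFrame, which does offer select_dtypes.
numeric_cols = pd.DataFrame(dense_out).select_dtypes(np.number).columns.tolist()
print("numeric columns found:", len(numeric_cols))
```

Either route (a dense OneHotEncoder or sparse_threshold=0) should stop a sparse matrix from reaching FAMD. Whether the later steps then behave as intended on one-hot encoded columns is a separate question this sketch does not address.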

Score: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/66456410
