Is there a way to use SMOTE with NaNs?
Below is a dummy program that tries to use SMOTE when NaN values are present:
# Imports
from collections import Counter
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import Imputer
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from imblearn.combine import SMOTEENN
# Load data
bc = load_breast_cancer()
X, y = bc.data, bc.target
# Initial number of samples per class
print('Number of samples for both classes: {} and {}.'.format(*Counter(y).values()))
# SMOTEd class distribution
print('Dataset has %s missing values.' % np.isnan(X).sum())
_, y_resampled = SMOTE().fit_sample(X, y)
print('Number of samples for both classes: {} and {}.'.format(*Counter(y_resampled).values()))
# Generate artificial missing values
X[X > 1.0] = np.nan
print('Dataset has %s missing values.' % np.isnan(X).sum())
#_, y_resampled = make_pipeline(Imputer(), SMOTE()).fit_sample(X, y)
sm = SMOTE(ratio = 'auto',k_neighbors = 5, n_jobs = -1)
smote_enn = SMOTEENN(smote = sm)
x_train_res, y_train_res = smote_enn.fit_sample(X, y)
print('Number of samples for both classes: {} and {}.'.format(*Counter(y_resampled).values()))
I get the following output/error:
Number of samples for both classes: 212 and 357.
Dataset has 0 missing values.
Number of samples for both classes: 357 and 357.
Dataset has 6051 missing values.
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Posted on 2019-08-12 18:35:09
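You have already included the answer in your question: the commented-out make_pipeline line. Note that the method is fit_resample, not fit_sample. You should use make_pipeline as follows: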
# Imports
import numpy as np
from collections import Counter
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from imblearn.combine import SMOTEENN
# Load data
bc = load_breast_cancer()
X, y = bc.data, bc.target
X[X > 1.0] = np.nan
# Over-sampling
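# 'ratio' was renamed to 'sampling_strategy' in imbalanced-learn 0.4; recent releases no longer accept n_jobs here
smote = SMOTE(sampling_strategy='auto', k_neighbors=5)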
smote_enn = make_pipeline(SimpleImputer(), SMOTEENN(smote=smote))
_, y_res = smote_enn.fit_resample(X, y)
# Class distribution
print('Number of samples for both classes: {} and {}.'.format(*Counter(y_res).values()))
Also check your imbalanced-learn version.
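For the version check, a minimal sketch may help. It reads the installed version through the standard library and, purely as an illustration, wraps the method-name difference mentioned above. The helper names `installed_version` and `resample` are hypothetical and not part of imbalanced-learn.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def resample(sampler, X, y):
    """Hypothetical helper: prefer fit_resample, fall back to the older fit_sample."""
    if hasattr(sampler, "fit_resample"):
        return sampler.fit_resample(X, y)
    return sampler.fit_sample(X, y)


# Prints the version string, or None when the package is not installed
print("imbalanced-learn:", installed_version("imbalanced-learn"))
```

In practice, upgrading and calling fit_resample directly is simpler than carrying a fallback like this around.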
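Posted on 2019-08-12 16:33:07
Generally no. SMOTE prepares the dataset for subsequent model fitting.
Common models (random forests, for example) do not accept NA in the label variable, because what would you actually be predicting? The same goes for NA in the predictor variables, where most algorithms either fail or simply ignore the cases containing NA.
So the error is largely by design: you cannot and should not have missing values in the training set for your algorithm. Logically, you do not want to "balance" the cases with missing values; you only want to SMOTE the cases with valid labels.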
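To see why a NaN cannot simply pass through, recall that SMOTE builds each synthetic sample by interpolating between a minority sample and one of its nearest neighbours. The sketch below is a toy illustration in plain Python, not imbalanced-learn's implementation. The functions `euclidean` and `interpolate` are made up for this example. It shows that one missing feature poisons both the distance and the synthetic point.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def interpolate(sample, neighbour, lam):
    """SMOTE-style synthetic point: sample + lam * (neighbour - sample)."""
    return [s + lam * (n - s) for s, n in zip(sample, neighbour)]


a = [0.2, math.nan]   # second feature is missing
b = [0.5, 0.9]

distance = euclidean(a, b)          # NaN propagates through the sum
synthetic = interpolate(a, b, 0.5)  # the missing feature stays NaN

print(math.isnan(distance))
print(synthetic)
```

Since the neighbour search relies on such distances, rejecting NaN input up front is a reasonable design choice.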
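If you feel that the missing labels still carry valid information that should be balanced (for example, you actually want to oversample the NA class because you think it is underrepresented), then it should not be a missing value but a defined value such as "Unknown", marking a known class whose characteristic is "NA". That said, I can't really see any research question where this would make sense.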
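If someone did want to go down that route, recoding missing labels as an explicit class could look roughly like this. The function name `encode_missing_labels` and the example labels are invented for illustration.

```python
import math
from collections import Counter


def encode_missing_labels(labels, placeholder="Unknown"):
    """Replace None/NaN labels with an explicit placeholder class."""
    def is_missing(label):
        return label is None or (isinstance(label, float) and math.isnan(label))

    return [placeholder if is_missing(label) else label for label in labels]


labels = ["benign", "malignant", None, "benign", float("nan")]
encoded = encode_missing_labels(labels)

# The placeholder is now an ordinary class that a sampler can count and balance
print(Counter(encoded))
```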
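Update 1:
Another approach is to impute the missing values first, so that fitting the model actually involves three steps: imputing the missing values, resampling with SMOTE, and fitting the model.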
https://stackoverflow.com/questions/57456475