Q: Output of the training-set shape after oversampling with imbalanced-learn

Stack Overflow user

Asked on 2019-07-02 23:14:18

Answers: 1 · Views: 656 · Followers: 0 · Votes: 3

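I am using imbalanced-learn to oversample my data. I want to know how many entries each class has after the oversampling method has been applied. This code works well: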

from collections import Counter

from imblearn.over_sampling import SMOTE

def oversample(x_values, y_values):
    # n_jobs was deprecated for SMOTE in imbalanced-learn 0.10, so it is omitted here
    oversampler = SMOTE(random_state=42)
    x_oversampled, y_oversampled = oversampler.fit_resample(x_values, y_values)
    # Report the class counts before and after resampling, plus the sampler's class name
    print("Oversampling training set from {0} to {1} using {2}".format(
        dict(Counter(y_values)), dict(Counter(y_oversampled)), type(oversampler).__name__))
    return x_oversampled, y_oversampled

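But I have switched to a pipeline so that I can use GridSearchCV to find the best oversampling method (out of ADASYN, SMOTE and BorderlineSMOTE). As a result, I never call fit_resample myself, and with code like the following I lose that output: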

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier

# The sampler runs inside pipe.fit, so fit_resample is never called directly
pipe = Pipeline([('scaler', MinMaxScaler()), ('sampler', SMOTE(random_state=42)), ('estimator', RandomForestClassifier())])
pipe.fit(x_values, y_values)

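The oversampling works, but I no longer get the output showing how many entries each class has in the training set.

Is there a way to get output similar to the first example while using a pipeline?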

Tags: python, python-3.x, scikit-learn, oversampling, imblearn

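1 Answer

Stack Overflow user

Accepted answer

Posted on 2019-08-17 21:40:01

In principle, yes. When an over-sampler is fitted, it gets an attribute sampling_strategy_ that holds the number of minority-class samples to be generated when fit_resample is called. You can use it to get output similar to your example above. Here is a modified example based on your code: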

# Imports
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Create toy dataset and find the label of the least frequent class
X, y = make_classification(weights=[0.20, 0.80], random_state=0)
init_class_distribution = Counter(y)
min_class_label, _ = init_class_distribution.most_common()[-1]
print(f'Initial class distribution: {dict(init_class_distribution)}')

# Create and fit pipeline (n_jobs omitted: deprecated for SMOTE in imbalanced-learn 0.10)
pipe = Pipeline([('scaler', MinMaxScaler()), ('sampler', SMOTE(random_state=42)), ('estimator', RandomForestClassifier(random_state=23))])
pipe.fit(X, y)

# sampling_strategy_ maps each resampled class to the number of samples to generate
sampling_strategy = pipe.named_steps['sampler'].sampling_strategy_
expected_n_samples = sampling_strategy.get(min_class_label)
print(f'Expected number of generated samples: {expected_n_samples}')

# Fit and resample a pipeline made of every step except the final estimator
sampler_pipe = Pipeline(pipe.steps[:-1])
X_res, y_res = sampler_pipe.fit_resample(X, y)
actual_class_distribution = Counter(y_res)
print(f'Actual class distribution: {dict(actual_class_distribution)}')

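Votes: 2

Original page content provided by Stack Overflow. Translation support for the Chinese page was provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link: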

https://stackoverflow.com/questions/56855496
