
Q: How can I stratify data with sklearn's train_test_split for multi-label classification?

Data Science user

Asked 2019-02-06 15:51:53

4 answers · 27.7K views · 0 followers · 11 votes

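I'm trying to mirror a machine learning program by Ahmed Besbes, but scaled up for multi-label classification. Any attempt to stratify the data seems to return the following error: The least populated class in y has only 1 member, which is too few. The minimum number of labels for any class cannot be less than 2.

In my dataset, I have one column that contains clean, tokenized text. The other 8 columns are classifications based on the content of that text. Note that columns 1-4 have significantly more samples than columns 5-8 (which are more obscure classifications derived from the text).

Here is a generalized example from my code: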

from sklearn.model_selection import train_test_split

x = data['cleaned_text']
y = data[['car','truck','ford','chevy','black','white','parked', 'driving']]

x_train, x_test, y_train, y_test = train_test_split(x,
                                                    y,
                                                    test_size=0.1,
                                                    random_state=42)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

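Output: (6293,) (700,) (6293, 8) (700, 8)

Adding stratify=y to train_test_split returns the error mentioned above. Even if I restrict y to a single column, I still get the error.

How can I stratify my data so that the program gets a fair look at every class in the training set?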

machine-learning

scikit-learn

multilabel-classification

4 Answers

Data Science user

Answered 2019-11-28 06:42:48

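Try this:

from skmultilearn.model_selection import iterative_train_test_split

# As far as I know, iterative_train_test_split slices its inputs NumPy-style,
# so pass a 2-D array for X and an array for y rather than pandas objects.
X_train, y_train, X_test, y_test = iterative_train_test_split(x.values.reshape(-1, 1), y.values, test_size=0.1)

Since you're doing multi-label classification, it's very likely that some combinations of labels occur only once, and that is what causes the sklearn error: with several label columns, stratification treats each distinct combination as its own class. You have to use a dedicated library to get a multi-label stratified split.

See the scikit-multilearn documentation for more details on how to use it.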
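If installing another library isn't an option, a rough fallback that stays within sklearn is to stratify on the label combination itself and pool the rare combinations into one bucket. This is a swapped-in technique, not the iterative stratification that scikit-multilearn implements, and it only balances combinations, not individual labels. The helper name `combo_stratified_split`, the `"rare"` bucket and the toy data are all assumptions made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def combo_stratified_split(x, y, test_size=0.1, min_count=2, random_state=42):
    """Split x/y, stratifying on label combinations (hypothetical helper).

    Combinations seen fewer than `min_count` times share one "rare" bucket.
    """
    # One string key per row, e.g. "1-0-1-0"
    keys = pd.Series(["-".join(map(str, row)) for row in y.to_numpy()], index=y.index)
    # Number of rows that share each row's combination
    counts = keys.map(keys.value_counts())
    keys = keys.where(counts >= min_count, "rare")
    # If the pooled bucket is still too small, fold it into the most common combination
    n_rare = int((keys == "rare").sum())
    if 0 < n_rare < min_count:
        most_common = keys[keys != "rare"].mode().iloc[0]
        keys = keys.where(keys != "rare", most_common)
    return train_test_split(
        x, y, test_size=test_size, stratify=keys, random_state=random_state
    )


# Toy data (assumption): two common combinations, one small one, two singletons
labels = ["car", "truck", "black", "white"]
rows = (
    [[1, 0, 1, 0]] * 18
    + [[0, 1, 0, 1]] * 18
    + [[1, 1, 0, 0]] * 2
    + [[1, 0, 0, 1]]
    + [[0, 1, 1, 0]]
)
y = pd.DataFrame(rows, columns=labels)
x = pd.Series([f"document {i}" for i in range(len(y))], name="cleaned_text")

x_train, x_test, y_train, y_test = combo_stratified_split(x, y, test_size=0.25)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```

Note that sklearn also requires the test set to contain at least as many rows as there are strata, so with many distinct combinations and a small `test_size` you may need a larger `min_count`.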

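Votes: 9

Data Science user

Answered 2019-02-11 16:13:44

The error you're getting indicates that it cannot perform a stratified split because one of your classes has only one sample. You need at least two samples of each class so that one can go in the training split and the other in the test split. You should check your class distribution to find the culprit.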
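A minimal sketch of that check with pandas, using a made-up label frame in place of the 8 columns from the question (the column names and values here are assumptions). It counts the positives per label and the number of rows per distinct label combination, since a multi-column `stratify=y` treats each combination as one class.

```python
import pandas as pd

# Toy stand-in for the label columns in the question (assumption)
y = pd.DataFrame(
    {
        "car":   [1, 1, 1, 0, 0, 1],
        "truck": [0, 0, 0, 1, 1, 0],
        "black": [1, 1, 0, 0, 0, 0],
        "white": [0, 0, 1, 1, 1, 0],
    }
)

# Positives per label column; a column with a single positive (or a single
# negative) cannot be stratified even on its own
positives_per_label = y.sum()

# Rows per distinct label combination; any combination seen only once
# triggers the "least populated class" error when stratifying on all columns
combo_counts = y.groupby(list(y.columns)).size()
singletons = combo_counts[combo_counts < 2]

print(positives_per_label)
print(singletons)
```

Any row combination listed in `singletons` is a candidate culprit; the same check on a single column explains the error when `y` is restricted to one label.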

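Votes: 4

Data Science user

Answered 2019-11-28 07:39:58

There is a separate module for class stratification, and nobody is going to suggest that you use train_test_split for this. It can be done as follows: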

import numpy as np
from sklearn.model_selection import StratifiedKFold


train_all = []
evaluate_all = []
skf = StratifiedKFold(n_splits=cv_total, random_state=1234, shuffle=True)
for train_index, evaluate_index in skf.split(train_df.index.values, train_df.coverage_class):
    train_all.append(train_index)
    evaluate_all.append(evaluate_index)
    print(train_index.shape, evaluate_index.shape)  # fold sizes can differ slightly between folds; that is expected

# Getting each batch
def get_cv_data(cv_index):
    train_index = train_all[cv_index-1]
    evaluate_index = evaluate_all[cv_index-1]
    x_train = np.array(train_df.images[train_index].map(upsample).tolist()).reshape(-1, img_size_target, img_size_target, 1)
    y_train = np.array(train_df.masks[train_index].map(upsample).tolist()).reshape(-1, img_size_target, img_size_target, 1)
    x_valid = np.array(train_df.images[evaluate_index].map(upsample).tolist()).reshape(-1, img_size_target, img_size_target, 1)
    y_valid = np.array(train_df.masks[evaluate_index].map(upsample).tolist()).reshape(-1, img_size_target, img_size_target, 1)
    return x_train,y_train,x_valid,y_valid

# Training loop
for cv_index in range(cv_total):
    x_train, y_train, x_valid, y_valid =  get_cv_data(cv_index+1)
    history = model.fit(x_train, y_train,
                        validation_data=(x_valid, y_valid),
                        epochs=epochs)

This is a simple snippet showing how to use StratifiedKFold in your code. Just replace the parameters and hyperparameters as needed.
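The snippet above relies on names defined elsewhere (`train_df`, `cv_total`, `upsample`, `model` and so on), so here is a smaller self-contained sketch of the same idea on made-up data. It stratifies on a single label column; as far as I know, StratifiedKFold expects a binary or multiclass target and does not accept a multi-label indicator matrix directly.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (assumption): 20 samples, one imbalanced binary label
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)
folds = list(skf.split(X, y))

for fold, (train_idx, valid_idx) in enumerate(folds, start=1):
    # Each validation fold keeps roughly the class ratio of the full data
    n_pos = int(y[valid_idx].sum())
    print(f"fold {fold}: train={len(train_idx)} valid={len(valid_idx)} positives={n_pos}")
```

With 15 negatives and 5 positives over 5 folds, every validation fold ends up with 3 negatives and 1 positive.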

Votes: 1

Original page content provided by Data Science. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://datascience.stackexchange.com/questions/45174
