Q: Why am I getting an error with GroupShuffleSplit (train/test split)?

Stack Overflow user

Asked 2021-07-10 02:49:10

1 answer, 30 views, 0 followers, 0 votes

I have 2 datasets and have applied 5 different ML models to them.

Dataset 1:

def dataset_1():
    ...
    ...
    bike_data_hours = bike_data_hours[:500]
    X = bike_data_hours.iloc[:, :-1].values
    y = bike_data_hours.iloc[:, -1].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    return X_train, X_test, y_train.reshape(-1, 1), y_test.reshape(-1, 1)

The shapes are (400, 14) (100, 14) (400, 1) (100, 1). dtypes: object (int64, float64).

Dataset 2:

def dataset_2():
    ...
    ...
    final_movie_df = final_movie_df[:500]
    X = final_movie_df.iloc[:, :-1]
    y = final_movie_df.iloc[:, -1]
    gs = GroupShuffleSplit(n_splits=2, test_size=0.2)
    train_ix, test_ix = next(gs.split(X, y, groups=X.UserID))
    X_train = X.iloc[train_ix]
    y_train = y.iloc[train_ix]
    X_test = X.iloc[test_ix]
    y_test = y.iloc[test_ix]
    return X_train.shape, X_test.shape, y_train.values.reshape(-1,1).shape, y_test.values.reshape(-1,1).shape

The shapes are (400, 25) (100, 25) (400, 1) (100, 1). dtypes: object (int64, float64).

I am using different models. The code is:

    X_train, X_test, y_train, y_test = dataset
    fold_residuals, fold_dfs = [], []
    kf = KFold(n_splits=k, shuffle=True)
    for train_index, _ in kf.split(X_train):
        if reg_name == "RF" or reg_name == "SVR":
            preds = regressor.fit(X_train[train_index], y_train[train_index].ravel()).predict(X_test)
        elif reg_name == "Knn-5":
            preds = regressor.fit(X_train[train_index], np.ravel(y_train[train_index], order="C")).predict(X_test)
        else:
            preds = regressor.fit(X_train[train_index], y_train[train_index]).predict(X_test)

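But I get a common error, similar to the ones discussed in this, this, and this post. I have read all of those posts but still cannot work out the cause. I have already used iloc and values, as the linked solutions suggest.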

preds = regressor.fit(X_train[train_index], y_train[train_index]).predict(X_test)
  File "/home/fgd/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3030, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "/home/fgd/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1266, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
  File "/home/fgd/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1308, in _validate_read_indexer
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Int64Index([  0,   1,   3,   4,   5,   6,   7,   9,  10,  11,\n            ...\n            387, 388, 389, 390, 391, 392, 393, 395, 397, 399],\n           dtype='int64', length=320)] are in the [columns]"

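If I use train_test_split instead of GroupShuffleSplit here, the code works. However, I want to use GroupShuffleSplit based on UserID so that the same user never appears in both the training and the test set. Can you tell me how to fix this when using GroupShuffleSplit?

Can you also tell me why I get the error for dataset_2 when dataset_1 works perfectly, even though the shapes and dtypes are the same for both datasets?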

python

python-3.x

machine-learning

scikit-learn

train-test-split

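1 Answer

Stack Overflow user

Accepted answer

Answered 2021-07-10 02:57:20

Your dataset_2 must use .values as well. X.iloc[train_ix] is still a pandas DataFrame, so X_train[train_index] in the KFold loop is treated as a column lookup, which is why the KeyError mentions [columns]. dataset_1 already converts to NumPy arrays with .values, so the same expression selects rows there. Make this change: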

    X_train = X.iloc[train_ix].values
    y_train = y.iloc[train_ix].values
    X_test = X.iloc[test_ix].values
    y_test = y.iloc[test_ix].values
    return X_train.shape, X_test.shape, y_train.reshape(-1,1).shape, y_test.reshape(-1,1).shape
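To see why the conversion matters, here is a minimal sketch of the two indexing behaviours. The toy frame and its column names are hypothetical stand-ins for final_movie_df, and train_index mimics the positional indices that KFold yields.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the feature columns of final_movie_df.
X_df = pd.DataFrame({"UserID": [1, 1, 2, 2, 3], "Rating": [5, 3, 4, 2, 1]})
train_index = np.array([0, 2, 4])  # positional row indices, as KFold yields

# DataFrame: [] with an integer array is a column-label lookup, so it fails
# because no column is named 0, 2, or 4.
try:
    X_df[train_index]
    frame_lookup_failed = False
except KeyError:
    frame_lookup_failed = True

# NumPy array: [] with an integer array selects rows by position.
rows_from_array = X_df.values[train_index]

# Staying in pandas also works if the lookup is made positional with iloc.
rows_from_iloc = X_df.iloc[train_index]
```

Either approach should work in the cross-validation loop; converting once with .values keeps the loop code identical for both datasets.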

Hopefully it works now.
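As a fuller sketch, the split can be wrapped in a helper that returns the arrays themselves rather than their shapes, since the training loop unpacks X_train, X_test, y_train, y_test from the dataset. The function name group_split_as_arrays, the demo columns, and the seed are assumptions for illustration, not part of the original code; the target is assumed to be the last column, as in the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def group_split_as_arrays(df, group_col="UserID", test_size=0.2, seed=0):
    """Split df by group (last column is the target) and return NumPy arrays."""
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]
    gs = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_ix, test_ix = next(gs.split(X, y, groups=X[group_col]))
    # .values turns the pandas objects into arrays, so later positional
    # indexing such as X_train[train_index] selects rows.
    X_train, X_test = X.iloc[train_ix].values, X.iloc[test_ix].values
    y_train = y.iloc[train_ix].values.reshape(-1, 1)
    y_test = y.iloc[test_ix].values.reshape(-1, 1)
    return X_train, X_test, y_train, y_test


# Hypothetical data: 10 users with 5 rows each.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "UserID": np.repeat(np.arange(10), 5),
    "Feature": rng.normal(size=50),
    "Rating": rng.integers(1, 6, size=50),
})
X_train, X_test, y_train, y_test = group_split_as_arrays(demo_df)

# UserID is column 0 of X; no user should appear on both sides of the split.
train_users = set(X_train[:, 0])
test_users = set(X_test[:, 0])
```

Note that test_size in GroupShuffleSplit is a proportion of groups, not of rows, so the row counts match an 80/20 split only when the groups are of similar size.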

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/68321527
