Q: Python scikit-learn: cross-validation with a MultiIndex

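Stack Overflow user

Asked 2018-12-03 10:33:41

2 answers · 1.4K views · 0 followers · 2 votes

Hi, I would like to use one of scikit-learn's cross-validation functions. I want the folds to be determined by one of the index levels. For example, suppose my data is indexed by "Month" and "Day":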

Month    Day   Feature_1 
January   1      10
          2      20
February  1      30 
          2      40 
March     1      50 
          2      60 
          3      70 
April     1      80 
          2      90

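Suppose I want 1/4 of the data to be the test set in each validation round, and I want the folds to be separated by the first index level, i.e. Month. In that case the test set would be one of the months, and the remaining 3 months would be the training set. For example, one of the train/test splits would look like this: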

TEST SET:
Month    Day   Feature_1 
January   1      10
          2      20

TRAINING SET:
Month    Day   Feature_1 
February  1      30 
          2      40 
March     1      50 
          2      60 
          3      70 
April     1      80 
          2      90

How can I do this? Thanks.

python

pandas

scikit-learn

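2 Answers

Stack Overflow user

Accepted answer

Posted 2018-12-03 14:11:29

This is called group-based splitting. See the scikit-learn user guide to learn more about it:

... To measure this, we need to ensure that all the samples in the validation fold come from groups that are not represented at all in the paired training fold. ...

You can use GroupKFold or one of the other strategies that have "Group" in their name. A sample could be: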

# reset_index() turns the index levels (Month, Day) into ordinary columns
df = df.reset_index()

print(df)
# Month     Day    Feature_1
# January    1           10
# January    2           20
# February   1           30
# February   2           40
# March      1           50
# March      2           60
# March      3           70

from sklearn.model_selection import GroupKFold

groups = df['Month']
# Feature columns; pass a target y as the second argument to split() if you have one
X = df[['Feature_1']]

gkf = GroupKFold(n_splits=3)
for train, test in gkf.split(X, groups=groups):
    # "train" and "test" are positional indices,
    # so use "iloc" to get the actual rows
    print("%s %s" % (train, test))

    print(df.iloc[train, :])
    print(df.iloc[test, :])
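
As a self-contained variant of the idea above, the following sketch rebuilds the question's sample data and takes the group labels straight from the first index level, so `reset_index()` is not needed. The helper name `month_folds` is hypothetical, and `n_splits=4` is an assumption chosen so that each of the four months becomes the test set once.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Rebuild the question's sample data with a (Month, Day) MultiIndex.
index = pd.MultiIndex.from_tuples(
    [("January", 1), ("January", 2),
     ("February", 1), ("February", 2),
     ("March", 1), ("March", 2), ("March", 3),
     ("April", 1), ("April", 2)],
    names=["Month", "Day"],
)
df = pd.DataFrame({"Feature_1": range(10, 100, 10)}, index=index)

# Use the first index level as the group label for every row.
groups = df.index.get_level_values("Month")


def month_folds(frame, groups, n_splits):
    """Hypothetical helper: list of (train_months, test_months) per fold."""
    gkf = GroupKFold(n_splits=n_splits)
    folds = []
    for train_pos, test_pos in gkf.split(frame, groups=groups):
        # train_pos / test_pos are positional indices into frame
        folds.append((set(groups[train_pos]), set(groups[test_pos])))
    return folds


folds = month_folds(df, groups, n_splits=4)
for train_months, test_months in folds:
    print("test:", sorted(test_months), "train:", sorted(train_months))
```

With four months and four splits, every fold holds out exactly one month; with fewer splits than groups, GroupKFold places several months in the same test fold.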

Update: To use this with a cross-validation function, just pass the month data to its groups parameter, as follows:

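from sklearn.model_selection import GroupKFold, cross_val_predict

gkf = GroupKFold(n_splits=3)
# groups must contain one label per row of X_train
y_pred = cross_val_predict(estimator, X_train, y_train, cv=gkf, groups=df['Month'])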
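
For a version that runs end to end, here is a minimal sketch. The question gives no target column or estimator, so the target `y` and the use of `LinearRegression` are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_val_predict

# Flat version of the question's data (as it looks after reset_index()).
df = pd.DataFrame({
    "Month": ["January"] * 2 + ["February"] * 2 + ["March"] * 3 + ["April"] * 2,
    "Day": [1, 2, 1, 2, 1, 2, 3, 1, 2],
    "Feature_1": [10, 20, 30, 40, 50, 60, 70, 80, 90],
})

X = df[["Feature_1"]]
# Hypothetical target, invented for this example only.
y = 2.0 * df["Feature_1"] + 1.0

# One month per test fold; groups has one label per row of X.
gkf = GroupKFold(n_splits=4)
y_pred = cross_val_predict(LinearRegression(), X, y, cv=gkf, groups=df["Month"])
print(y_pred)
```

Each prediction comes from a model that never saw any row of that row's month during fitting.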

2 votes

Stack Overflow user

Posted 2018-12-03 10:56:41

Use:

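import numpy as np

# Unique labels of the first index level (Month)
indices = df.index.levels[0]

# Randomly pick 75% of the months for training; the rest form the test set
train_indices = np.random.choice(indices, size=int(len(indices) * 0.75), replace=False)
test_indices = np.setdiff1d(indices, train_indices)

# np.isin replaces np.in1d, which is deprecated in recent NumPy releases
train = df[np.isin(df.index.get_level_values(0), train_indices)]
test = df[np.isin(df.index.get_level_values(0), test_indices)]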

Output

Train

              Feature_1
Month    Day           
January  1           10
         2           20
February 1           30
         2           40
March    1           50
         2           60
         3           70

Test

           Feature_1
Month Day           
April 1           80
      2           90

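Explanation

indices = df.index.levels[0] gets all the unique values of the level=0 index: Index(['April', 'February', 'January', 'March'], dtype='object', name='Month')

train_indices = np.random.choice(indices,size=int(len(indices)*0.75), replace=False) randomly samples 75% of the index values obtained in the previous step.

Next, we take the remaining index values as test_indices.

Finally, we split the data into train and test sets accordingly.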
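
The steps above can be wrapped into a reproducible function. This is only a sketch: the name `split_by_level`, its parameters, and the fixed seed are assumptions, not part of the answer. It also produces a single hold-out split rather than a full cross-validation; scikit-learn's GroupShuffleSplit covers a similar use case.

```python
import numpy as np
import pandas as pd

# Rebuild the question's sample data with a (Month, Day) MultiIndex.
index = pd.MultiIndex.from_tuples(
    [("January", 1), ("January", 2),
     ("February", 1), ("February", 2),
     ("March", 1), ("March", 2), ("March", 3),
     ("April", 1), ("April", 2)],
    names=["Month", "Day"],
)
df = pd.DataFrame({"Feature_1": range(10, 100, 10)}, index=index)


def split_by_level(frame, test_fraction=0.25, seed=0):
    """Hold out a random subset of level-0 labels as the test set."""
    rng = np.random.default_rng(seed)
    level_values = frame.index.get_level_values(0)
    labels = level_values.unique()
    # Always hold out at least one label.
    n_test = max(1, int(round(len(labels) * test_fraction)))
    test_labels = rng.choice(labels, size=n_test, replace=False)
    is_test = level_values.isin(test_labels)
    return frame[~is_test], frame[is_test]


train, test = split_by_level(df)
print(train)
print(test)
```

Seeding the generator makes the split repeatable; changing `seed` selects a different month as the test set.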

1 vote

Original page content provided by Stack Overflow; the Chinese translation was provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/53591919
