Q: gbm function: subscript out of bounds

Stack Overflow user

Asked 2013-09-05 23:18:47

2 answers · 14.1K views · 0 followers · 12 votes

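I have a strange problem. I ran this code successfully on my laptop, but when I first tried to run it on another machine I got the warning "Distribution not specified, assuming bernoulli ...", which I expected, and then this error: Error in object$var.levels[[i]] : subscript out of bounds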

library(gbm)
gbm.tmp <- gbm(subxy$presence ~ btyme + stsmi + styma + bathy,
                data=subxy,
                var.monotone=rep(0, length= 4), n.trees=2000, interaction.depth=3,
                n.minobsinnode=10, shrinkage=0.01, bag.fraction=0.5, train.fraction=1,
                verbose=F, cv.folds=10)

Can anyone help? The data structure is exactly the same, and so are the code and the R version. I am not even using a subscript here.

EDIT: traceback()

6: predict.gbm(model, newdata = my.data, n.trees = best.iter.cv)
5: predict(model, newdata = my.data, n.trees = best.iter.cv)
4: predict(model, newdata = my.data, n.trees = best.iter.cv)
3: gbmCrossValPredictions(cv.models, cv.folds, cv.group, best.iter.cv, 
       distribution, data[i.train, ], y)
2: gbmCrossVal(cv.folds, nTrain, n.cores, class.stratify.cv, data, 
       x, y, offset, distribution, w, var.monotone, n.trees, interaction.depth, 
       n.minobsinnode, shrinkage, bag.fraction, var.names, response.name, 
       group)
1: gbm(subxy$presence ~ btyme + stsmi + styma + bathy, data = subxy,var.monotone = rep(0, length = 4), n.trees = 2000, interaction.depth = 3, n.minobsinnode = 10, shrinkage = 0.01, bag.fraction = 0.5, train.fraction = 1, verbose = F, cv.folds = 10)

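Could it be because I moved the saved R workspace to another machine?

EDIT 2: OK, I have updated the gbm package on the machine where the code was working, and now I get the same error there. So at this point I think either the older gbm package did not have this check in place, or the newer version has a problem. I don't understand gbm well enough to say.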

gbm

2 Answers

Stack Overflow user

Accepted answer

Posted 2013-09-06 02:34:22

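Just a hunch, since I can't see your data, but I believe this error occurs when the test set contains factor levels that do not exist in the training set.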

This can easily happen when a factor variable has a large number of levels, or when one of its levels has very few instances.
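A quick way to see whether a factor is at risk is to count the observations per level. The sketch below uses a made-up data frame; the column name `site` and the threshold `min_count` are assumptions for illustration, not part of the question's data.

```r
# Hypothetical data: `site` is a factor with two levels that occur only once.
df <- data.frame(
  y    = c(0, 1, 0, 1, 1, 0, 0, 1),
  site = factor(c("a", "a", "a", "b", "b", "b", "c", "d"))
)

# Count how often each level occurs.
level_counts <- table(df$site)

# Levels with fewer than `min_count` observations can easily end up
# entirely inside a single hold-out fold.
min_count <- 2
rare_levels <- names(level_counts)[level_counts < min_count]
print(rare_levels)
```

Any level reported here is a candidate for being merged with another level or dropped before cross-validation.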

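Since you are using CV folds, it is possible that the holdout set in one of the loops contains levels that are foreign to the training data.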
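To make the idea concrete, here is a minimal base-R sketch that reports, for each fold, the levels present in the holdout rows but absent from the training rows. The helper name `unseen_levels_by_fold` and the fold assignment are hypothetical; gbm assigns its folds internally, so this only illustrates the mechanism and does not reproduce gbm's own split.

```r
# For each fold, return the levels of `f` that occur in the hold-out rows
# but not in the remaining (training) rows.
unseen_levels_by_fold <- function(f, fold_id) {
  f <- as.character(f)
  lapply(split(seq_along(f), fold_id), function(test_rows) {
    setdiff(unique(f[test_rows]), unique(f[-test_rows]))
  })
}

x2      <- factor(c("a", "a", "b", "b", "c", "a"))
fold_id <- c(1, 2, 1, 2, 3, 3)  # hypothetical fold assignment

res <- unseen_levels_by_fold(x2, fold_id)
print(res)
```

In this toy data, level "c" occurs once, so the fold that holds it out trains without ever seeing it.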

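I'd suggest either:

A) using model.matrix() to one-hot encode your factor variables, or

B) setting different seeds until you get a CV split in which this error does not occur.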
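Option A can be sketched as follows. The data frame and its column names are made up for illustration; the point is that after encoding, every fold sees the same set of numeric columns, so no fold can encounter an unknown level.

```r
# Hypothetical data with one numeric and one factor predictor.
df <- data.frame(
  x1 = c(0.5, -1.2, 0.3, 2.0),
  x2 = factor(c("a", "b", "c", "a"))
)

# "~ x2 - 1" drops the intercept, so x2 gets one indicator column per level.
X <- model.matrix(~ x2 - 1, data = df)

# Combine the numeric predictor with the indicator columns.
encoded <- cbind(df["x1"], X)
print(encoded)
```

The resulting `encoded` data frame contains only numeric columns and can be passed to the model in place of the original factor.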

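EDIT: Yep, with that traceback, your third CV holdout has a factor level in its test set that does not exist in the training set. So the predict function sees a foreign value and does not know what to do with it.

EDIT 2: Here is a quick example of what I mean by a factor level that appears in the test set but not in the training set.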

#Example data with low occurrences of a factor level:

set.seed(222)
data = data.frame(cbind( y = sample(0:1, 10, replace = TRUE), x1 = rnorm(10), x2 = as.factor(sample(0:10, 10, replace = TRUE))))
data$x2 = as.factor(data$x2)
data

      y         x1 x2
 [1,] 1 -0.2468959  2
 [2,] 0 -1.2155609  6
 [3,] 0  1.5614051  1
 [4,] 0  0.4273102  5
 [5,] 1 -1.2010235  5
 [6,] 1  1.0524585  8
 [7,] 0 -1.3050636  6
 [8,] 0 -0.6926076  4
 [9,] 1  0.6026489  3
[10,] 0 -0.1977531  7

#CV fold.  This splits a model to be trained on 80% of the data, then tests against the remaining 20%.  This is a simpler version of what happens when you call gbm's CV fold.

CV_train_rows = sample(1:10, 8, replace = FALSE) ; CV_test_rows = setdiff(1:10, CV_train_rows)
CV_train = data[CV_train_rows,] ; CV_test = data[CV_test_rows,]

#build a model on the training... 

CV_model = lm(y ~ ., data = CV_train)
summary(CV_model)
#note here: as the model has been built, it was only fed factor levels (3, 4, 5, 6, 7, 8) for variable x2

CV_test$x2
#in the test set, there are only levels 1 and 2.

#attempt to predict on the test set
predict(CV_model, CV_test)

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
factor x2 has new levels 1, 2
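One way to avoid the error in a plain lm() setting like the one above is to check the test rows against the levels stored in the fitted model before predicting. This is a sketch with its own small made-up data, not the output of the example above; `fit$xlevels` is where lm() records the factor levels seen during training.

```r
# Hypothetical training and test data; levels "1" and "2" are unseen in training.
train <- data.frame(
  y  = c(0.1, 0.4, 0.35, 0.8, 0.9, 1.2),
  x2 = factor(c("3", "4", "5", "3", "4", "5"))
)
test <- data.frame(x2 = factor(c("1", "4", "2")))

fit <- lm(y ~ x2, data = train)

# Levels the model saw during training.
known <- fit$xlevels$x2

# Predict only for rows whose level is known; leave the others as NA.
ok   <- as.character(test$x2) %in% known
safe <- test[ok, , drop = FALSE]
safe$x2 <- factor(as.character(safe$x2), levels = known)

pred <- rep(NA_real_, nrow(test))
pred[ok] <- predict(fit, newdata = safe)
print(pred)
```

Rows with an unseen level get NA instead of stopping the whole prediction, which makes it easy to see how many rows are affected.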

13 votes

Stack Overflow user

Posted 2017-01-24 03:21:45

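I ran into the same problem and eventually solved it by changing a hidden function in the gbm package called predict.gbm. During cross-validation, this function uses the gbm object trained on the training portion of each split to predict on that split's test set.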

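The problem is that the test set passed in should contain only the columns that correspond to the features, so you should modify the function accordingly.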
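The answer does not show the patch itself. A less invasive sketch, assuming the extra columns really are the cause, is to restrict the data to the response and the predictors before fitting. The toy `subxy` below is a stand-in that reuses the column names from the question plus a hypothetical extra column `id`. Writing the response as a bare column name (`presence` rather than `subxy$presence`) is also an assumption here, not something either answer confirms.

```r
# Hypothetical stand-in for the asker's data, with one unused extra column.
subxy <- data.frame(
  presence = c(0, 1, 0, 1),
  btyme    = c(1.2, 0.7, 1.9, 0.4),
  stsmi    = c(10, 12, 9, 11),
  styma    = c(14, 15, 13, 16),
  bathy    = c(-50, -80, -20, -65),
  id       = 1:4
)

# Formula written with bare column names.
f <- presence ~ btyme + stsmi + styma + bathy

# Extract the predictor and response names from the formula.
predictors <- all.vars(delete.response(terms(f)))
response   <- all.vars(f)[1]

# Keep only the columns the model needs.
model_data <- subxy[, c(response, predictors)]
newdata    <- subxy[, predictors, drop = FALSE]

# The model would then be fitted on `model_data`, for example:
# gbm(f, data = model_data, distribution = "bernoulli", ...)
print(names(model_data))
```

Whether this resolves the original error depends on the installed gbm version, so treat it as something to try rather than a confirmed fix.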

4 votes

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/18640169
