I am trying to fit many xgboost models with different parameters (e.g., for parameter tuning). They need to run in parallel to cut down the runtime. However, when running the %dopar% command I get the following error: Error in unserialize(socklist[[n]]) : error reading from connection.
Below is a reproducible example. It must be related to xgboost, since any other computation involving global variables works fine inside the %dopar% loop. Can anyone point out what is missing or wrong in this approach?
#### Load packages
library(xgboost)
library(parallel)
library(foreach)
library(doParallel)
#### Data Sim
n = 1000
X = cbind(runif(n,10,20), runif(n,0,10))
y = 10 + 2*X[,1] + 3*X[,2] + rnorm(n,0,1)
#### Init XGB
train = xgb.DMatrix(data = X[-((n-10):n),], label = y[-((n-10):n)])
test = xgb.DMatrix(data = X[(n-10):n,], label = y[(n-10):n])
watchlist = list(train = train, test = test)
#### Init parallel & run
numCores = detectCores()
cl = parallel::makeCluster(numCores)
doParallel::registerDoParallel(cl)
clusterEvalQ(cl, {
  library(xgboost)
})
pred = foreach(i = 1:10, .packages = c("xgboost")) %dopar% {
  xgb.train(data = train, watchlist = watchlist, max_depth = i, nrounds = 1000, early_stopping_rounds = 10)$best_score
  # if xgb.train is replaced with anything else, e.g. 1 + y, it works
}
stopCluster(cl)
As HenrikB pointed out in the comments, xgb.DMatrix objects cannot be shared with parallel workers. A DMatrix is an external pointer into the memory of the R process that created it; when the socket cluster serializes it for a worker, the pointer arrives as a null handle, and the worker dies when xgb.train tries to use it, which the master then reports as the unserialize error above.
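A minimal sketch of the underlying problem (assuming the DMatrix object is an external pointer, as in the xgboost versions I have seen; the serialize/unserialize round trip below just simulates what the cluster does when shipping objects to a worker):
#### Sketch: why a DMatrix does not survive the trip to a worker
library(xgboost)
m <- xgb.DMatrix(data = matrix(rnorm(20), nrow = 10), label = rnorm(10))
typeof(m)
#> [1] "externalptr"
m2 <- unserialize(serialize(m, NULL)) # what a socket-cluster worker receives
# Using m2 now fails (e.g. dim(m2) or xgb.train(data = m2, ...)):
# the pointer no longer refers to memory in any living process.
The fix is therefore to build the DMatrix objects inside the foreach loop, so that only ordinary R data travels to the workers: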
#### Load packages
library(xgboost)
library(parallel)
library(foreach)
library(doParallel)
#> Loading required package: iterators
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
#### Init parallel & run
numCores = detectCores()
cl = parallel::makeCluster(numCores, setup_strategy = "sequential")
doParallel::registerDoParallel(cl)
pred = foreach(i = 1:10, .packages = c("xgboost")) %dopar% {
  # BRING CREATION OF XGB MATRIX INSIDE OF foreach
  dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
  dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
  watchlist = list(dtrain = dtrain, dtest = dtest)
  param <- list(max_depth = i, eta = 0.01, verbose = 0,
                objective = "binary:logistic", eval_metric = "auc")
  bst <- xgb.train(param, dtrain, nrounds = 100, watchlist, early_stopping_rounds = 10)
  bst$best_score
}
stopCluster(cl)
pred
#> [[1]]
#> dtest-auc
#> 0.892138
#>
#> [[2]]
#> dtest-auc
#> 0.987974
#>
#> [[3]]
#> dtest-auc
#> 0.986255
#>
#> [[4]]
#> dtest-auc
#> 1
#> ...

Benchmarking:
Since xgb.train is itself already multithreaded, it is interesting to look at the speed difference between giving the threads to xgboost versus using them to run the tuning rounds in parallel.
To do this, I wrapped the code in a function and benchmarked different combinations:
tune_par <- function(xgbthread, doparthread) {
  data(agaricus.train, package = 'xgboost')
  data(agaricus.test, package = 'xgboost')
  #### Init parallel & run
  cl = parallel::makeCluster(doparthread, setup_strategy = "sequential")
  doParallel::registerDoParallel(cl)
  clusterEvalQ(cl, {
    data(agaricus.train, package = 'xgboost')
    data(agaricus.test, package = 'xgboost')
  })
  pred = foreach(i = 1:10, .packages = c("xgboost")) %dopar% {
    dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
    dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
    watchlist = list(dtrain = dtrain, dtest = dtest)
    param <- list(max_depth = i, eta = 0.01, verbose = 0, nthread = xgbthread,
                  objective = "binary:logistic", eval_metric = "auc")
    bst <- xgb.train(param, dtrain, nrounds = 100, watchlist, early_stopping_rounds = 10)
    bst$best_score
  }
  stopCluster(cl)
  pred
}

In my tests, evaluation was faster when xgboost used more threads and fewer threads ran the tuning rounds in parallel. What works best will likely depend on system specs and the amount of data.
# 16 logical cores split between xgb threads and threads in dopar cluster:
microbenchmark::microbenchmark(
  xgb16par1 = tune_par(xgbthread = 16, doparthread = 1),
  xgb8par2  = tune_par(xgbthread = 8,  doparthread = 2),
  xgb4par4  = tune_par(xgbthread = 4,  doparthread = 4),
  xgb2par8  = tune_par(xgbthread = 2,  doparthread = 8),
  xgb1par16 = tune_par(xgbthread = 1,  doparthread = 16),
  times = 5
)
#> Unit: seconds
#>       expr      min       lq     mean   median       uq      max neval cld
#>  xgb16par1 2.295529 2.431110 2.500170 2.519277 2.527914 2.727021     5 a
#>   xgb8par2 2.301189 2.308377 2.407767 2.363422 2.465446 2.600402     5 a
#>   xgb4par4 2.632711 2.778304 2.875816 2.825471 2.849003 3.293593     5  b
#>   xgb2par8 4.508485 4.682284 4.752776 4.810461 4.822566 4.940085     5   c
#>  xgb1par16 8.493378 8.550609 8.679931 8.768008 8.779718 8.807943     5    d
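Given results like these, one rule of thumb is to keep xgbthread * doparthread at or below the number of logical cores so the machine is never oversubscribed. A minimal helper sketch for picking the split (plan_threads is an illustrative name, not part of the original answer):
#### Sketch: split logical cores between the dopar cluster and xgboost
plan_threads <- function(doparthread, total = parallel::detectCores()) {
  list(doparthread = doparthread,
       xgbthread = max(1L, total %/% doparthread))
}
plan_threads(2) # on the 16-logical-core machine above: doparthread = 2, xgbthread = 8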
Source: https://stackoverflow.com/questions/66661306