实际并行化插入符号取决于R、插入符号和doMC包。如Parallelizing Caret code所述
有没有人像我一样在类似的环境下工作?当R插入符号并行化正常工作时,最大的R版本是多少?
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] caret_6.0-52 ggplot2_1.0.1 lattice_0.20-31 doMC_1.3.3 iterators_1.0.7 foreach_1.4.2 RStudioAMI_0.2
loaded via a namespace (and not attached):
[1] Rcpp_0.12.1 magrittr_1.5 splines_3.2.1 MASS_7.3-41 munsell_0.4.2 colorspace_1.2-6
[7] minqa_1.2.4 car_2.1-0 stringr_1.0.0 plyr_1.8.3 tools_3.2.1 pbkrtest_0.4-2
[13] nnet_7.3-9 grid_3.2.1 gtable_0.1.2 nlme_3.1-120 mgcv_1.8-6 quantreg_5.19
[19] MatrixModels_0.4-1 gtools_3.5.0 lme4_1.1-9 digest_0.6.8 Matrix_1.2-0 nloptr_1.0.4
[25] reshape2_1.4.1 codetools_0.2-11 stringi_0.5-5 BradleyTerry2_1.0-6 scales_0.3.0 stats4_3.2.1
[31] SparseM_1.7 brglm_0.5-9 proto_0.3-10更新1:我的代码如下:
library(doMC) ; registerDoMC(cores=4)
library(caret)
classification_formula <- as.formula(paste("target" ,"~",
paste(names(m_input_data)[!names(m_input_data)=='target'],collapse="+")))
CVfolds <- 2
CVreps <- 5
ma_control <- trainControl(method = "repeatedcv",
number = CVfolds,
repeats = CVreps ,
returnResamp = "final" ,
classProbs = T,
summaryFunction = twoClassSummary,
allowParallel = TRUE,verboseIter = TRUE)
rf_tuneGrid = expand.grid(mtry = seq(2,32, length.out = 6))
rf <- train(classification_formula , data = m_input_data , method = "rf", metric="ROC" ,trControl = ma_control, tuneGrid = rf_tuneGrid , ntree = 101)更新2:当我从命令行运行时,当我从Rstudio运行这些脚本时,只有一个核心在工作,并行正在工作,因为我看到4个进程通过顶部。但在此之后的一秒钟,错误发生了:
Error in names(resamples) <- gsub("^\\.", "", names(resamples)) :
attempt to set an attribute on NULL 更新4:
嗨,似乎问题出在被终止的R会话中。每次启动AWS实例时,我都会运行R代码,现在刷新R引擎。现在,每次我刷新Rstudio浏览器时,我都会执行Session ->重启R。看起来它在运行。我现在正在检查从Ubuntu命令行运行脚本是否也是如此。
通常情况下,它在运行时没有结束。Caret在数据级别上是并行的。这意味着它能够在不同的过程中处理每个重采样。但如果样本仍然很大( 100,000 /2(折叠数= 2) X 2,000个特征),这可能很难为每个处理器单元完成。我说的对吗?
我认为并行性必须在算法级别上。这意味着每个算法都可能在多个内核上运行。如果这样的算法实现在插入符号中可用?
发布于 2015-09-14 16:25:43
我有针对Linux平台的最新版本,R版本3.2.2 (2015-08-14,Fire Safety),并行化工作正常。你能提供你不能并行工作的代码吗?
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] kernlab_0.9-22 doMC_1.3.3 iterators_1.0.7 foreach_1.4.2 caret_6.0-52 ggplot2_1.0.1 lattice_0.20-33
loaded via a namespace (and not attached):
[1] Rcpp_0.12.0 compiler_3.2.2 nloptr_1.0.4 plyr_1.8.3 tools_3.2.2 digest_0.6.8
[7] lme4_1.1-9 nlme_3.1-122 gtable_0.1.2 mgcv_1.8-7 Matrix_1.2-2 brglm_0.5-9
[13] SparseM_1.7 proto_0.3-10 BradleyTerry2_1.0-6 stringr_1.0.0 gtools_3.5.0 MatrixModels_0.4-1
[19] stats4_3.2.2 grid_3.2.2 nnet_7.3-10 minqa_1.2.4 reshape2_1.4.1 car_2.0-26
[25] magrittr_1.5 scales_0.3.0 codetools_0.2-11 MASS_7.3-43 splines_3.2.2 pbkrtest_0.4-2
[31] colorspace_1.2-6 quantreg_5.18 stringi_0.5-5 munsell_0.4.2 我在我的本地机器上使用了您的BreastCancer数据集的代码,它可以并行工作,没有任何问题。我使用的是RStudio版本0.98.1103。
library(caret)
library(mlbench)
data(BreastCancer)
library(doMC)
registerDoMC(cores=2)
classification_formula <- as.formula(paste("Class" ,"~",
paste(names(BreastCancer)[!names(BreastCancer)=='Class'],collapse="+")))
CVfolds <- 2
CVreps <- 5
ma_control <- trainControl(method = "repeatedcv",
number = CVfolds,
repeats = CVreps ,
returnResamp = "final" ,
classProbs = T,
summaryFunction = twoClassSummary,
allowParallel = TRUE,verboseIter = TRUE)
rf_tuneGrid = expand.grid(mtry = seq(2,32, length.out = 6))
#Notice, it might be easier just to use Class~.
#instead of classification_formula
rf <- train(classification_formula ,
data = BreastCancer ,
method = "rf",
metric="ROC" ,
trControl = ma_control,
tuneGrid = rf_tuneGrid ,
ntree = 101)
> rf
Random Forest
699 samples
10 predictors
2 classes: 'benign', 'malignant'
No pre-processing
Resampling: Cross-Validated (2 fold, repeated 5 times)
Summary of sample sizes: 341, 342, 342, 341, 342, 341, ...
Resampling results across tuning parameters:
mtry ROC Sens Spec ROC SD Sens SD Spec SD
2 0.9867820 1.0000000 0.0000000 0.005007691 0.000000000 0.000000000
8 0.9899107 0.9549550 0.9640196 0.002243649 0.006714919 0.017247716
14 0.9907072 0.9558559 0.9631933 0.003028258 0.012345228 0.008019979
20 0.9909514 0.9635135 0.9556513 0.003268291 0.006864342 0.010471005
26 0.9911480 0.9630631 0.9539706 0.003384987 0.005113930 0.010628533
32 0.9911485 0.9657658 0.9522969 0.002973508 0.004842197 0.004090206
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 32.
> https://stackoverflow.com/questions/32514370
复制相似问题