We are trying to use XGBoost-Spark for our project, and we are running into problems when training the model on large data (the same pipeline works fine on small data). The training stage runs for about 2 hours and all tasks finish almost simultaneously. After roughly 1200 tasks complete, all of the remaining executors start failing, and we see the same error from every one of them. Note: we are data engineers who are new to machine learning, trying to build a production version of a prototype created by our data scientists, so our exposure to ML concepts is quite limited.
Jars used: xgboost4j-spark-0.72-criteo-20180518_2.11.jar and xgboost4j-0.72-criteo-20180518_2.10-linux.jar

Error from one of the executor logs:
Container id: container_e109_1529510504264_41133_01_000223
Exit code: 255
Shell output: main : command provided 1
main : run as user is svccaddv
main : requested yarn user is svccaddv
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /u/applic/data/hdfs1/hadoop/yarn/local/nmPrivate/application_1529510504264_41133/container_e109_1529510504264_41133_01_000223/container_e109_1529510504264_41133_01_000223.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...
Container exited with a non-zero exit code 255. Last 4096 bytes of stderr
:ter_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:45] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:47] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:47] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:47] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:47] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:50] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:50] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:50] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:50] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:52] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:52] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:52] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:52] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:55] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:55] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:55] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:55] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:55] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:57] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:57] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:57] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:57] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:59] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:59] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:59] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:59] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:50:02] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:50:02] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:50:02] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:50:02] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
Socket RecvAll Error:Connection reset by peer
Socket RecvAll Error:Connection reset by peer

The code snippet we are using:
MLUtils.saveAsLibSVMFile(newtrainingData.rdd, inputTrainPath)

val trainSess = spark.sqlContext.read.format("libsvm").option("numFeatures", "10").load(inputTrainPath)

val paramMap = List(
  "eta" -> 0.003,
  "max_depth" -> 6,
  "subsample" -> 0.8,
  "colsample_bytree" -> 0.8,
  "silent" -> 0,
  "numEarlyStoppingRounds" -> 100,
  "objective" -> "reg:linear").toMap

val numRound = 1500

val xgboostModel = XGBoost.trainWithDataFrame(trainSess, paramMap, numRound, nWorkers = trainSess.rdd.getNumPartitions, useExternalMemory = false)

Table size: ~21 GB (stored as ORC with SNAPPY compression)
SVM file size: ~160 GB
Input size of the Spark stage that does the training: ~460 GB
Tasks spawned in the training stage: 4044
Executors: 515 (approximate; we use dynamic allocation)
executor-cores: 4
executor-memory: 4G
executor-memory-overhead: 1200 MB
driver-memory: 10G
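For reference, the executor settings listed above correspond to a spark-submit invocation roughly like the following. This is a sketch only: the flag values are copied from the question, the exact overhead property name depends on the Spark version (it was `spark.yarn.executor.memoryOverhead` before Spark 2.3), and the application jar/class are placeholders.

```shell
# Sketch of the submit command implied by the settings in the question.
# <main-class> and <app-jar> are placeholders, not from the original post.
spark-submit \
  --master yarn \
  --executor-cores 4 \
  --executor-memory 4G \
  --driver-memory 10G \
  --conf spark.yarn.executor.memoryOverhead=1200 \
  --conf spark.dynamicAllocation.enabled=true \
  --class <main-class> <app-jar>
```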
Posted on 2018-07-12 22:13:32
We found a workaround. We reduced the number of partitions, and therefore the number of tasks, using coalesce(). Earlier we had used repartition() to reduce the partitions, but we still got the error. Even with coalesce(), the job fails if the partition count exceeds roughly 1000; for some medium-sized datasets it ran fine with 1200 and 1500 partitions, but we settled on 1000 partitions and the job runs well. Previously we had increased the partition count to 3k-4k to improve parallelism and hence performance, yet performance with 1k partitions turned out to be no worse.
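The workaround above can be sketched as follows. This is a hypothetical outline reusing the variable names from the question; the cap value of 1000 is the one we settled on, and only the small capping helper is new.

```scala
// Cap the number of partitions at 1000 before handing the data to XGBoost.
// coalesce() merges existing partitions without a full shuffle, which is
// why we preferred it over repartition() here.
val partitionCap = 1000

def cappedPartitions(current: Int, cap: Int = partitionCap): Int =
  math.min(current, cap)

// Applied to the training code from the question, this looks roughly like:
//   val capped = trainSess.coalesce(cappedPartitions(trainSess.rdd.getNumPartitions))
//   val xgboostModel = XGBoost.trainWithDataFrame(capped, paramMap, numRound,
//     nWorkers = capped.rdd.getNumPartitions, useExternalMemory = false)
```

Keeping nWorkers equal to the (now capped) partition count avoids an extra repartition inside the trainer.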
For anyone looking for other workarounds, see the suggestions given by the XGBoost team at https://github.com/dmlc/xgboost/issues/3462 (I have not tried these).
https://stackoverflow.com/questions/51269503