
Spark with GPU: how to force one task per executor

Stack Overflow user
Asked 2017-02-17 16:45:17
1 answer · 2.8K views · 0 followers · 4 votes

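I am running Spark 2.1.0 on a cluster with N slave nodes. Each node has 16 cores (8 cores/CPU and 2 CPUs) and 1 GPU. I want to use the map step to launch a GPU kernel. Since there is only one GPU per node, I need to make sure that two executors never sit on the same node and try to use the GPU at the same time, and that two tasks are never submitted to the same executor at the same time.

How can I force Spark to have one executor per node?

I have tried the following:

- Setting spark.executor.cores 16 in $SPARK_HOME/conf/spark-defaults.conf

- Setting SPARK_WORKER_CORES = 16 and SPARK_WORKER_INSTANCES = 1 in $SPARK_HOME/conf/spark-env.sh

and,

- Setting conf = SparkConf().set('spark.executor.cores', 16).set('spark.executor.instances', 6) directly in my Spark script (when I wanted N=6 for debugging).

These options create 6 executors on different nodes as desired, but every task appears to be assigned to the same executor.

Here are some snippets from my most recent output (which led me to believe it should work the way I want).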

17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/0 on worker-20170217110853-10.128.14.208-35771 (10.128.14.208:35771) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/0 on hostPort 10.128.14.208:35771 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/1 on worker-20170217110853-10.128.9.95-59294 (10.128.9.95:59294) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/1 on hostPort 10.128.9.95:59294 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/2 on worker-20170217110853-10.128.3.71-47507 (10.128.3.71:47507) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/2 on hostPort 10.128.3.71:47507 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/3 on worker-20170217110853-10.128.9.96-50800 (10.128.9.96:50800) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/3 on hostPort 10.128.9.96:50800 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/4 on worker-20170217110853-10.128.3.73-60194 (10.128.3.73:60194) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/4 on hostPort 10.128.3.73:60194 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/5 on worker-20170217110853-10.128.3.74-42793 (10.128.3.74:42793) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/5 on hostPort 10.128.3.74:42793 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/1 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/3 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/4 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/2 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/0 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/5 is now RUNNING
17/02/17 11:09:11 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 

My RDD has 6 partitions.

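The important points are that 6 executors were started, each with a different IP address, and each was given 16 cores (exactly what I expected). The line My RDD has 6 partitions. comes from a print statement in my code, issued after repartitioning my RDD (to make sure there is one partition per executor).

Then this happened: each of the 6 tasks was sent to the same executor!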

17/02/17 11:09:12 INFO TaskSchedulerImpl: Adding task set 0.0 with 6 tasks
17/02/17 11:09:17 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.128.9.95:34059) with ID 1
17/02/17 11:09:17 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.128.9.95, executor 1, partition 0, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.128.9.95, executor 1, partition 1, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.128.9.95, executor 1, partition 2, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 10.128.9.95, executor 1, partition 3, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 10.128.9.95, executor 1, partition 4, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 10.128.9.95, executor 1, partition 5, PROCESS_LOCAL, 6095 bytes)

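Why does this happen, and how can I fix it? The problem is that, at this point, all 6 tasks compete for the same GPU, and the GPU cannot be shared.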
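One possible reading of the log above, offered as an assumption rather than a confirmed diagnosis: the task set is added at 11:09:12, only executor 1 is shown registering with the driver (at 11:09:17), and all six tasks start in that same second. With the default of one core per task, a 16-core executor has room for all six. The toy model below illustrates that arithmetic. It is not Spark's scheduler code, and `assign_tasks` and its parameters are hypothetical names.

```python
def assign_tasks(num_tasks, free_cores, cpus_per_task=1):
    """Toy model of task placement; NOT Spark's actual scheduler.

    free_cores maps executor ID -> free cores, for executors that have
    already registered with the driver. Each pass hands at most one task
    to every executor that still has enough free cores. Returns a dict of
    task index -> executor ID; tasks that do not fit are omitted (they
    would stay pending until more resources appear).
    """
    remaining = dict(free_cores)
    placement = {}
    next_task = 0
    launched = True
    while launched and next_task < num_tasks:
        launched = False
        for executor_id in remaining:
            if next_task < num_tasks and remaining[executor_id] >= cpus_per_task:
                remaining[executor_id] -= cpus_per_task
                placement[next_task] = executor_id
                next_task += 1
                launched = True
    return placement


# Only executor "1" (16 cores) has registered when the 6 tasks are offered.
only_one_registered = assign_tasks(6, {"1": 16})
print(only_one_registered)

# If all six executors had registered first, the model spreads the tasks.
all_registered = assign_tasks(6, {str(i): 16 for i in range(6)})
print(all_registered)
```

In this model, the first call places every task on executor "1", which matches the TaskSetManager lines in the log, while the second call places one task on each executor.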


1 Answer

Stack Overflow user

Answered 2017-03-27 12:42:27

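I tried the suggestions in the comment from Samson Scharfrichter, but they did not seem to work. However, I found http://spark.apache.org/docs/latest/configuration.html#scheduling, which includes spark.task.cpus. If I set that to 16 and also set spark.executor.cores to 16, then I appear to get one task assigned to each executor.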
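A minimal sketch of why that pairing works, assuming the number of tasks an executor can run at once is spark.executor.cores divided by spark.task.cpus (rounded down). The helper names `one_task_per_executor_conf` and `task_slots` are hypothetical and used only for illustration.

```python
def one_task_per_executor_conf(cores_per_executor):
    """Build settings where each task claims every core of an executor."""
    return {
        "spark.executor.cores": str(cores_per_executor),
        "spark.task.cpus": str(cores_per_executor),
    }


def task_slots(conf):
    """Concurrent tasks per executor under the stated assumption."""
    return int(conf["spark.executor.cores"]) // int(conf["spark.task.cpus"])


gpu_conf = one_task_per_executor_conf(16)
default_conf = {"spark.executor.cores": "16", "spark.task.cpus": "1"}

print(gpu_conf, "->", task_slots(gpu_conf), "slot per executor")
print(default_conf, "->", task_slots(default_conf), "slots per executor")

# With PySpark installed, the pairs could be applied with something like:
#   conf = SparkConf().setAll(list(gpu_conf.items()))
```

The trade-off is that the other 15 cores on each node sit idle from Spark's point of view while the GPU task runs, which may be acceptable when the GPU is the bottleneck.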

Votes: 4
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/42303188
