Why do Spark jobs fail when multiple Hive scripts are executed in parallel?

Stack Overflow user

Asked 2017-04-20 11:31:02

Answers: 2 · Views: 731 · Followers: 0 · Votes: 1

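I have 25 Hive scripts with 200 Hive queries in each. I run each hql file with the spark-sql command on my AWS cluster, and I launch all the spark-sql commands in parallel using the & operator. I can run the same hql files successfully with Hive on Tez, and I am trying Spark to improve performance. With Spark, however, only 2-3 scripts execute fine; the rest fail with a "Connection reset by peer" error. I believe this is due to insufficient resources for Spark in the YARN cluster.

When I look at the YARN console, I can see that the full memory of the cluster is being used, even though I specified executor and driver memory in the command.

Can you help me find the exact cause of this issue?

Below is my EMR cluster configuration: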

Data Nodes : 6
RAM per Node : 56 GB
Cores per Node: 32
Instance Type: M4*4xLarge

Commands used in unix:

spark-sql --master yarn --num-executors 12 --executor-memory 20G --executor-cores 15 --driver-memory 10G -f hql1.hql &
spark-sql --master yarn --num-executors 12 --executor-memory 20G --executor-cores 15 --driver-memory 10G -f hql2.hql &
spark-sql --master yarn --num-executors 12 --executor-memory 20G --executor-cores 15 --driver-memory 10G -f hql3.hql .....
spark-sql --master yarn --num-executors 12 --executor-memory 20G --executor-cores 15 --driver-memory 10G -f hql25.hql

When I run all of the above commands in parallel, only 2 to 3 jobs execute correctly; the rest fail with the following error.

 05:>            (0 + 0) / 30800]
[Stage 904:=>       (6818 + 31) / 30800][Stage 905:>            (0 + 0) / 30800]
[Stage 904:==>      (7743 + 31) / 30800][Stage 905:>            (0 + 0) / 30800]
[Stage 904:==>      (8271 + 32) / 30800][Stage 905:>            (0 + 0) / 30800]
17/04/13 11:35:10 WARN TransportChannelHandler: Exception in connection from /10.134.22.114:47550
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:745)
17/04/13 11:35:10 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /10.134.22.114:47550 is closed
17/04/13 11:35:10 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint: Sending RequestExecutors(53329,61600,Map(ip-10-134-22-6.eu-central-1.compute.internal -> 12262, ip-10-134-22-67.eu-central-1.compute.internal -> 16940, ip-10-134-22-106.eu-central-1.compute.internal -> 17876, ip-10-134-22-46.eu-central-1.compute.internal -> 16400, ip-10-134-22-114.eu-central-1.compute.internal -> 14902, ip-10-134-22-105.eu-central-1.compute.internal -> 44820)) to AM was unsuccessful
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)

hql

apache-spark-sql

emr

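2 Answers

Stack Overflow user

Accepted answer

Answered 2017-04-20 12:36:20

"I believe this is due to insufficient resources for Spark in the YARN cluster."

I think so too, and I strongly recommend using the YARN UI to see how resources are being used.

Regardless of what you see in the YARN UI, I did some calculations, and it looks like you have far too few resources to run all 25 scripts at once.

Given: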

Data Nodes : 6
RAM per Node : 56 GB
Cores per Node: 32
Instance Type: M4*4xLarge

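it looks like you have 6 x 56 GB = 336 GB of RAM and 6 x 32 cores = 192 cores.

After the following command:

spark-sql --master yarn --num-executors 12 --executor-memory 20G --executor-cores 15 --driver-memory 10G -f hql1.hql

you have already reserved 240 GB and 180 cores, which is more than half of the available resources, and that is for the first spark-sql alone.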
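To make the arithmetic above easy to re-check with other settings, here is a minimal shell sketch. The variable names are mine, and it counts executor memory only, ignoring the driver and any YARN container overhead, so the real headroom would be smaller still.

```shell
#!/usr/bin/env bash
# Cluster totals, taken from the EMR configuration in the question.
nodes=6
ram_per_node_gb=56
cores_per_node=32
total_ram_gb=$(( nodes * ram_per_node_gb ))
total_cores=$(( nodes * cores_per_node ))

# One spark-sql invocation, with the flags used in the question.
num_executors=12
executor_mem_gb=20
executor_cores=15
job_ram_gb=$(( num_executors * executor_mem_gb ))   # executor memory only
job_cores=$(( num_executors * executor_cores ))

# Optimistic count of such jobs that fit at once (integer division).
fit_by_ram=$(( total_ram_gb / job_ram_gb ))
fit_by_cores=$(( total_cores / job_cores ))

echo "cluster: ${total_ram_gb} GB, ${total_cores} cores"
echo "per job: ${job_ram_gb} GB, ${job_cores} cores"
echo "jobs that fit: by RAM ${fit_by_ram}, by cores ${fit_by_cores}"
```

With the question's flags both quotients come out as 1, so the cluster can hold only a single such job at its requested size.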

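I think the problem is that a single spark-sql put into the background already claims most of the cluster, so with 25 spark-sql processes you run into a lack of resources. I'm not surprised.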
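One way to avoid oversubscribing the cluster, sketched here as my own assumption rather than something from the question, is to launch the scripts in small batches instead of all 25 at once. The function name run_in_batches and the SPARK_SQL override variable are hypothetical; the spark-sql flags are the ones from the question.

```shell
#!/usr/bin/env bash
# Run hql files in batches of at most $1 concurrent spark-sql processes.
# SPARK_SQL is a hypothetical override for the binary (handy for dry runs).
run_in_batches() {
  max="$1"
  shift
  running=0
  for f in "$@"; do
    "${SPARK_SQL:-spark-sql}" --master yarn --num-executors 12 \
      --executor-memory 20G --executor-cores 15 --driver-memory 10G \
      -f "$f" &
    running=$(( running + 1 ))
    if [ "$running" -ge "$max" ]; then
      wait          # block until every job in the current batch has exited
      running=0
    fi
  done
  wait              # wait for the last, possibly partial, batch
  # Note: exit codes of individual jobs are not collected in this sketch.
}
```

For example, `run_in_batches 1 hql*.hql` would run the scripts one at a time. A larger batch size probably only makes sense after shrinking the per-job request, since one job with these flags already takes most of the cluster.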

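Votes: 1

Stack Overflow user

Answered 2017-04-20 16:43:12

Setting Spark dynamic allocation to false should fix this issue.

Even if we set executor memory in the command, with dynamic allocation enabled Spark can still claim more resources (by requesting additional executors) whenever the cluster has them available. To limit usage to what the command specifies, set the spark.dynamicAllocation.enabled parameter to false.

You can change it directly in the Spark configuration file, or pass it to the command as a configuration parameter.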

spark-sql --master yarn --num-executors 1 --executor-memory 20G --executor-cores 20 --driver-memory 4G --conf spark.dynamicAllocation.enabled=false -f hive1.hql
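As a rough check of how many jobs sized like the command above could run side by side on the cluster from the question, here is a small sketch. The variable names are mine, and the result is an optimistic upper bound: it ignores YARN container overhead and assumes all node memory is available to containers.

```shell
#!/usr/bin/env bash
# Cluster totals from the question: 6 nodes x 56 GB, 6 nodes x 32 cores.
total_ram_gb=$(( 6 * 56 ))
total_cores=$(( 6 * 32 ))

# Per-job request from the command above: 1 executor (20G, 20 cores) + 4G driver.
job_ram_gb=$(( 1 * 20 + 4 ))
job_cores=$(( 1 * 20 ))

max_by_ram=$(( total_ram_gb / job_ram_gb ))
max_by_cores=$(( total_cores / job_cores ))

# The tighter of the two limits is the optimistic concurrency ceiling.
if [ "$max_by_ram" -lt "$max_by_cores" ]; then
  max_jobs="$max_by_ram"
else
  max_jobs="$max_by_cores"
fi
echo "at most ${max_jobs} concurrent jobs (RAM: ${max_by_ram}, cores: ${max_by_cores})"
```

Under these assumptions the ceiling is 9 jobs (14 by memory, 9 by cores), so even with this smaller request the 25 scripts would not all fit at the same time.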

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/43518342
