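We have a Flink job that reads data from Hive and joins it with streaming data from Kafka.
It runs successfully on YARN, but when we run it on Kubernetes with exactly the same memory settings, it fails with the following error: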
java.io.IOException: Insufficient number of network buffers: required 2, but only 1 available. The total number of network buffers is currently set to 57343 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.memory.network.fraction', 'taskmanager.memory.network.min', and 'taskmanager.memory.network.max'.
    at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalCreateBufferPool(NetworkBufferPool.java:340)
    at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.createBufferPool(NetworkBufferPool.java:322)
    at org.apache.flink.runtime.io.network.partition.ResultPartitionFactory.lambda$createBufferPoolFactory$0(ResultPartitionFactory.java:215)
    at org.apache.flink.runtime.io.network.partition.ResultPartition.setup(ResultPartition.java:139)
    at org.apache.flink.runtime.taskmanager.ConsumableNotifyingResultPartitionWriterDecorator.setup(ConsumableNotifyingResultPartitionWriterDecorator.java:88)
    at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:869)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:635)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:543)
    at java.lang.Thread.run(Thread.java:748)
I followed the suggestion and increased taskmanager.memory.network.fraction, but the job then failed with an out-of-memory error:
Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
    at di.flink.shadow.org.apache.parquet.bytes.HeapByteBufferAllocator.allocate(HeapByteBufferAllocator.java:32)
    at di.flink.shadow.org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1166)
    at di.flink.shadow.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
    at org.apache.flink.formats.parquet.utils.ParquetRecordReader.readNextRecord(ParquetRecordReader.java:226)
    at org.apache.flink.formats.parquet.utils.ParquetRecordReader.reachEnd(ParquetRecordReader.java:207)
    at org.apache.flink.formats.parquet.ParquetInputFormat.reachedEnd(ParquetInputFormat.java:233)
    at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:193)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:719)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:543)
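    at java.lang.Thread.run(Thread.java:748)
I even increased the task manager process size on Kubernetes from 16 GB to 32 GB and still got the same error. The Kubernetes pod resource usage metrics show that 3-5 pods consume far more memory than the average, and their memory usage keeps growing while the job runs.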
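I would like to know whether there are any known memory usage issues on Kubernetes, particularly with network buffers, and where I can check the relevant metrics for debugging.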
Posted on 2021-03-17 07:48:07
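I found the problem. In the Docker entrypoint script, the task manager runs
TASK_MANAGER_NUMBER_OF_TASK_SLOTS=${TASK_MANAGER_NUMBER_OF_TASK_SLOTS:-$(grep -c ^processor /proc/cpuinfo)}
to override taskmanager.numberOfTaskSlots in flink-conf. However, /proc/cpuinfo lists all physical CPU cores of the host, not just the ones allocated to the container. In my case taskmanager.numberOfTaskSlots was therefore set to 32, so a few containers had to do most of the work while the rest sat idle.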
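One possible way to avoid counting host cores is to derive the default from the container's CPU quota instead. The sketch below is an assumption, not part of the Flink image: the function names slots_from_quota and container_cpu_slots are hypothetical, and the cgroup paths are the usual defaults for cgroup v2 and v1, which may differ on your nodes. It also only helps when the pod has a CPU limit set; without one the quota is unlimited and the code falls back to the /proc/cpuinfo count.

```shell
#!/bin/sh
# Hypothetical helper: turn a CFS quota/period pair into a slot count.
# Prints the fallback when the quota is unlimited ("max" in cgroup v2,
# -1 in cgroup v1) or when either value is missing.
slots_from_quota() {
    quota=$1; period=$2; fallback=$3
    case "$quota" in
        ''|max|-1) echo "$fallback"; return ;;
    esac
    case "$period" in
        ''|0) echo "$fallback"; return ;;
    esac
    # Round up so that a fractional limit such as 1.5 CPUs yields 2 slots.
    echo $(( ($quota + $period - 1) / $period ))
}

# Hypothetical helper: read the CPU limit from whichever cgroup layout is
# mounted. The paths below are assumed defaults, not guaranteed locations.
container_cpu_slots() {
    fallback=$(grep -c ^processor /proc/cpuinfo)
    if [ -r /sys/fs/cgroup/cpu.max ]; then
        # cgroup v2: a single line "<quota> <period>"
        read -r quota period < /sys/fs/cgroup/cpu.max
    elif [ -r /sys/fs/cgroup/cpu/cpu.cfs_quota_us ]; then
        # cgroup v1: quota and period live in separate files
        quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
        period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
    else
        quota=; period=
    fi
    slots_from_quota "$quota" "$period" "$fallback"
}
```

With these functions defined, the entrypoint default could become TASK_MANAGER_NUMBER_OF_TASK_SLOTS=${TASK_MANAGER_NUMBER_OF_TASK_SLOTS:-$(container_cpu_slots)}. The simpler alternative is to set TASK_MANAGER_NUMBER_OF_TASK_SLOTS (or taskmanager.numberOfTaskSlots) explicitly in the deployment, so the /proc/cpuinfo fallback is never reached.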
https://stackoverflow.com/questions/66648804