
Model parallelism implementation for the GPT-2 model.
As I understand it, the parallelism is implemented as shown in the picture below. The marked blocks are computed in parallel.
<a href="https://i.stack.imgur.com/KFhte.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/KFhte.png" alt="enter image description here" /></a>
The caption of Fig. (a), the MLP block, says:
<blockquote>
f and g are conjugate, f is an identity operator in the forward pass
and all-reduce in the backward pass while g is an all-reduce in
forward and identity in backward.
</blockquote>
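To make the roles of f and g concrete, here is a minimal sketch of the forward pass only, in plain Python with nested lists rather than real GPUs. It assumes the first weight matrix is split by columns and the second by rows, as the paper describes for the MLP block; the helper names (`split_columns`, `split_rows`, `mlp_model_parallel`) are hypothetical and are not part of the MegatronLM code base. The loop over shards stands in for workers running in parallel, and the final sum stands in for the all-reduce performed by g.

```python
import math


def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def matmul(a, b):
    # Plain-Python matrix product of two nested lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mlp_reference(x, a, b):
    # Unsplit MLP: Z = GeLU(X A) B
    y = [[gelu(v) for v in row] for row in matmul(x, a)]
    return matmul(y, b)


def split_columns(m, parts):
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]


def split_rows(m, parts):
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]


def mlp_model_parallel(x, a, b, parts):
    partials = []
    for a_i, b_i in zip(split_columns(a, parts), split_rows(b, parts)):
        # f is the identity in the forward pass: every worker sees the full X.
        y_i = [[gelu(v) for v in row] for row in matmul(x, a_i)]
        # GeLU is elementwise, so no communication is needed before the second GEMM.
        partials.append(matmul(y_i, b_i))
    # g is an all-reduce in the forward pass: sum the partial outputs.
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]


# Toy data (illustrative values only).
x = [[1.0, -2.0], [0.5, 3.0]]
a = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.6, -0.7, 0.8]]
b = [[1.0, 0.0], [0.5, -1.0], [2.0, 0.25], [-0.5, 1.5]]
```

Calling `mlp_model_parallel(x, a, b, parts)` with `parts` equal to 1, 2, or 4 should match `mlp_reference(x, a, b)` up to floating-point rounding. The backward pass, where f becomes the all-reduce and g the identity, is not simulated here.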
The self-attention block works similarly, as shown in the picture below:
<a href="https://i.stack.imgur.com/H00ht.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/H00ht.png" alt="enter image description here" /></a>
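As a rough illustration of how the self-attention block can be divided, the sketch below assigns whole attention heads to workers, so the Q/K/V projection is split by columns and the output projection by rows. The function `shard_attention` and the example values of 32 heads and 8 workers are assumptions for illustration, not taken from the MegatronLM source; only the per-head size of 96 comes from the paper quote.

```python
def shard_attention(num_heads, head_dim, world_size):
    """Describe which heads and weight shapes each worker would hold.

    Hypothetical helper: it only does bookkeeping, no actual computation.
    """
    if num_heads % world_size != 0:
        raise ValueError("num_heads must be divisible by world_size")
    heads_per_rank = num_heads // world_size
    hidden = num_heads * head_dim
    shards = []
    for rank in range(world_size):
        first = rank * heads_per_rank
        shards.append({
            "rank": rank,
            # Heads whose attention is computed locally, with no communication.
            "heads": list(range(first, first + heads_per_rank)),
            # Column-parallel Q, K, V projection: full input, local heads only.
            "qkv_weight_shape": (hidden, 3 * heads_per_rank * head_dim),
            # Row-parallel output projection: partial results are summed by an all-reduce.
            "out_proj_weight_shape": (heads_per_rank * head_dim, hidden),
        })
    return shards


# Assumed example: 32 heads of size 96 split 8 ways.
for shard in shard_attention(num_heads=32, head_dim=96, world_size=8):
    print(shard["rank"], shard["heads"], shard["qkv_weight_shape"])
```

Under these assumptions every worker holds a slice of the same attention block, so an 8-way split means each block is divided into 8 parts rather than the workers being shared between different blocks.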
According to the experiments in the <a href="https://arxiv.org/pdf/1909.08053.pdf" rel="nofollow noreferrer">paper</a>, a 1.2-billion-parameter model fits on a single GPU, whereas an 8-billion-parameter model requires 8 GPUs (8-way model parallelism).
96 is a constant used as the hidden size per attention head.
As per Table 2 of the <a href="https://arxiv.org/pdf/1909.08053.pdf" rel="nofollow noreferrer">paper</a>, the total hidden size appears to scale with the parameter count.
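The relationship between the per-head size of 96, the total hidden size, and the parameter count can be sketched with some quick arithmetic. The head and layer counts below are assumptions used for illustration and should be checked against Table 2 of the paper. The parameter estimate uses the common approximation of about 12·L·h² weights for L transformer layers of hidden size h, which ignores embeddings, biases, and layer norms.

```python
HEAD_DIM = 96  # hidden size per attention head, as quoted from the paper


def hidden_size(num_heads, head_dim=HEAD_DIM):
    # Total hidden size is the number of heads times the per-head size.
    return num_heads * head_dim


def approx_transformer_params(num_layers, hidden):
    # Roughly 4*h^2 weights for attention (Q, K, V, output projection)
    # plus 8*h^2 for the MLP (h -> 4h -> h), per layer.
    return 12 * num_layers * hidden * hidden


# ASSUMED (heads, layers) configurations; verify against Table 2.
configs = [(16, 40), (32, 72)]
for heads, layers in configs:
    h = hidden_size(heads)
    billions = approx_transformer_params(layers, h) / 1e9
    print(f"heads={heads} layers={layers} hidden={h} approx_params={billions:.2f}B")
```

With these assumed configurations the estimates land close to the 1.2-billion and 8-billion figures mentioned above, which is consistent with the hidden size growing as heads are added at a fixed 96 per head.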

I am trying to understand the implementation details of <a href="https://github.com/NVIDIA/Megatron-LM#inverse-cloze-task-ict-pretraining" rel="nofollow noreferrer">MegatronLM</a>, which supports both model and data parallelism. On their <a href="https://nv-adlr.github.io/MegatronLM" rel="nofollow noreferrer">site</a> and in their research <a href="https://arxiv.org/pdf/1909.08053.pdf" rel="nofollow noreferrer">paper</a>, they describe how they use intra-layer parallelism, which is similar to Mesh TensorFlow. I am confused about some details.
As shown in the picture below, my understanding is that the computation inside each of the 4 red circles can be parallelized by intra-layer splitting. However, the MLP must run after self-attention, so only 2 of the red-circled blocks can be parallelized at the same time. The paper says the model parallelism is 8-way. My first question is: does this mean they split each block into 4 intra-layer parts (8-way divided by 2 blocks)?
<a href="https://i.stack.imgur.com/5agbD.jpg" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/5agbD.jpg" alt="enter image description here" /></a>
The paper also mentions:
<blockquote>
To have consistent GEMM sizes in the self attention layer, the hidden size per attention head is kept constant at 96 while the number of heads and layers are varied to obtain configurations ranging from 1 billion to 8 billion parameters.
</blockquote>
My second question is: what does the hidden size of 96 refer to here?
I am totally new to distributed training, so I have probably misunderstood something. Any clarification on this topic would be much appreciated. Thanks!

How is model parallelism implemented for GPT2 in MegatronLM?


