From the column GiantPandaCV
What do you make of DeepSeek-V2, the MoE large model released by DeepSeek?
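… expert top-2. By contrast, DeepSeek-V2 makes highly influential innovations in model architecture: MLA, plus a large number of small experts combined with complex routing: [shared 2 experts + top … Tensor Parallel has to be redesigned. Training infra challenge 2: Unbalanced Pipeline Parallelism. Training infra challenge 3: shared 2 experts + top … MoE Shared Expert: while other vendors' MoE layers select the top 2 out of 8 or 16 experts, DeepSeek-V2 designs a scheme of 2 Shared Experts + Routed Experts top … When running in parallel, it also restricts each token (top-6) to being dispatched to at most 3 GPUs.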
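The routing constraint mentioned above (each token picks 6 routed experts, but may reach at most 3 GPUs) can be illustrated with a minimal sketch. The constants 2, 6, and 3 come from the text; everything else is an assumption made for illustration: the function name `device_limited_top_k`, the contiguous expert-to-device layout, and the rule of ranking devices by their best-scoring expert. This is not DeepSeek's actual implementation.

```python
from typing import List, Sequence

NUM_SHARED_EXPERTS = 2  # from the text: shared experts, applied to every token
TOP_K = 6               # from the text: routed experts chosen per token
MAX_DEVICES = 3         # from the text: a token reaches at most 3 GPUs


def device_limited_top_k(
    scores: Sequence[float],
    experts_per_device: int,
    top_k: int = TOP_K,
    max_devices: int = MAX_DEVICES,
) -> List[int]:
    """Choose top_k routed experts for one token on at most max_devices devices.

    Assumption (hypothetical layout): expert i lives on device
    i // experts_per_device.
    """
    if experts_per_device <= 0 or len(scores) % experts_per_device != 0:
        raise ValueError("scores must split evenly across devices")
    num_devices = len(scores) // experts_per_device
    if top_k > min(max_devices, num_devices) * experts_per_device:
        raise ValueError("top_k exceeds the experts available on the allowed devices")

    def best_score_on(device: int) -> float:
        start = device * experts_per_device
        return max(scores[start:start + experts_per_device])

    # Step 1 (assumed rule): keep the devices hosting the highest-scoring experts.
    devices = sorted(range(num_devices), key=best_score_on, reverse=True)[:max_devices]

    # Step 2: ordinary top-k, restricted to experts on the kept devices.
    candidates = [
        d * experts_per_device + j for d in devices for j in range(experts_per_device)
    ]
    chosen = sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]
    return sorted(chosen)


# Toy example: 4 devices with 3 routed experts each (sizes are made up).
example_scores = [0.9, 0.1, 0.2, 0.8, 0.7, 0.05, 0.6, 0.5, 0.4, 0.55, 0.56, 0.57]
example_choice = device_limited_top_k(example_scores, experts_per_device=3)
```

In this toy example an unrestricted top-6 would pick experts 10 and 11 on the fourth device; the device limit swaps them for lower-scoring experts on devices already in use, trading some routing quality for less cross-GPU communication. The 2 shared experts are not routed at all, so they do not appear in the selection.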
Edited on 2025-02-03