
Chris Lattner, founder of Modular and creator of LLVM, Clang, Swift, and MLIR
DeepSeek’s recent breakthrough has upended assumptions about AI’s compute demands, showing that better hardware utilization can dramatically reduce the need for expensive GPUs.
This is Part 1 of Modular’s “Democratizing AI Compute” series. For more, see:
Part 1: DeepSeek’s Impact on AI (this article)
Part 2: What exactly is “CUDA”?
Part 3: How did CUDA succeed?
Part 4: CUDA is the incumbent, but is it any good?
Part 5: What about CUDA C++ alternatives like OpenCL?
Part 6: What about AI compilers (TVM and XLA)?
Part 7: What about Triton and Python eDSLs?
For years, leading AI companies have insisted that only those with vast compute resources can drive cutting-edge research, reinforcing the idea that it is “hopeless to catch up” unless you have billions of dollars to spend on infrastructure. But DeepSeek’s success tells a different story: novel ideas can unlock efficiency breakthroughs that accelerate AI and enable smaller, highly focused teams to challenge industry giants–and even level the playing field.
We believe DeepSeek’s efficiency breakthrough signals a coming surge in demand for AI applications. If AI is to continue advancing, we must drive down the Total Cost of Ownership (TCO)–by expanding access to alternative hardware, maximizing efficiency on existing systems, and accelerating software innovation. Otherwise, we risk a future where AI’s benefits are bottlenecked–either by hardware shortages or by developers struggling to effectively utilize the diverse hardware that is available.
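To make the utilization point concrete, here is a minimal back-of-the-envelope sketch in Python. Every constant in it–total training FLOPs, per-GPU peak throughput, the wall-clock window–is a hypothetical number chosen for illustration, not a figure from DeepSeek or any hardware vendor; the only point is that the GPU count scales inversely with sustained utilization.

```python
# Back-of-the-envelope: how utilization changes the GPUs needed for a fixed run.
# All numbers are illustrative assumptions, not DeepSeek's actual figures.

TRAIN_FLOPS = 3e24                       # total FLOPs for a hypothetical training run
PEAK_FLOPS_PER_GPU = 1e15                # peak FLOP/s of a hypothetical accelerator
WALL_CLOCK_SECONDS = 60 * 60 * 24 * 30   # a one-month training window

def gpus_needed(utilization: float) -> float:
    """GPUs required when each one sustains `utilization` of its peak FLOP/s."""
    sustained_flops = PEAK_FLOPS_PER_GPU * utilization
    return TRAIN_FLOPS / (sustained_flops * WALL_CLOCK_SECONDS)

for u in (0.2, 0.4, 0.6):
    print(f"utilization {u:.0%}: ~{gpus_needed(u):,.0f} GPUs")
```

Doubling sustained utilization halves the number of GPUs needed for the same run, which is exactly how software efficiency shows up directly in TCO.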
This isn’t just an abstract problem–it's a challenge I’ve spent my career working to solve.
My passion for compute + developer efficiency
I've spent the past 25 years working to unlock computing power for the world. I founded and led the development of LLVM, a compiler infrastructure that opened CPUs to new applications of compiler technology. Today, LLVM is the foundation for performance-oriented programming languages like C++, Rust, Swift, and more. It powers nearly all iOS and Android apps, as well as the infrastructure behind major internet services from Google and Meta.
This work paved the way for several key innovations I led at Apple, including the creation of OpenCL, an early accelerator framework now widely adopted across the industry, the rebuild of Apple’s CPU and GPU software stack using LLVM, and the development of the Swift programming language. These experiences reinforced my belief in the power of shared infrastructure, the importance of co-designing hardware and software, and how intuitive, developer-friendly tools unlock the full potential of advanced hardware.
Falling in love with AI
In 2017, I became fascinated by AI’s potential and joined Google to lead software development for the TPU platform. At the time, the hardware was ready, but the software wasn’t functional. Over the next two and a half years, through intense team effort, we launched TPUs in Google Cloud, scaled them to ExaFLOPS of compute, and built a research platform that enabled breakthroughs like Attention Is All You Need and BERT.
Yet, this journey revealed deeper troubles in AI software. Despite TPUs' success, they remain only semi-compatible with AI frameworks like PyTorch–an issue Google overcomes with vast economic and research resources. A common customer question was, “Can TPUs run arbitrary AI models out of the box?” The hard truth? No–because we didn’t have CUDA, the de facto standard for AI development.
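A minimal PyTorch sketch of what “semi-compatible” means in practice: the CUDA path below is ordinary PyTorch, while the TPU path goes through the separate torch_xla bridge. The torch_xla calls are shown only in comments, as an assumption about one public API shape rather than a definitive recipe.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)

# On NVIDIA hardware this "just works": CUDA is a first-class PyTorch backend.
if torch.cuda.is_available():
    model = model.to("cuda")

# TPUs are not a built-in device. They go through the separate torch_xla
# bridge, roughly (names from the public torch_xla project):
#   import torch_xla.core.xla_model as xm
#   model = model.to(xm.xla_device())
# and even then, many models need adaptation to run (and to run well).
```

The asymmetry between those two code paths is the customer question in miniature.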
I’m not one to shy away from tackling major industry problems: my recent work has focused on creating next-generation technologies to scale into this new era of hardware and accelerators. This includes the MLIR compiler framework, now widely adopted for AI compilers across the industry. The Modular team has also spent the last three years building something special–but we’ll share more about that later, when the time is right.
How do GPUs and next-generation compute move forward?
Because of my background and relationships across the industry, I’m often asked about the future of compute. Today, countless groups are innovating in hardware (fueled in part by NVIDIA’s soaring market cap), while many software teams are adopting MLIR to enable new architectures. At the same time, senior leaders are questioning why–despite massive investments–the AI software problem remains unsolved. The challenge isn’t a lack of motivation or resources. So why does the industry feel stuck?
I don’t believe we are stuck. But we do face difficult, foundational problems.
To move forward, we need to better understand the underlying industry dynamics. Compute is a deeply technical field, evolving rapidly, and filled with jargon, codenames, and press releases designed to make every new product sound revolutionary. Many people cut through that noise by looking at the forest instead of the trees, but to truly understand where we’re going, we need to examine the roots–the fundamental building blocks that hold everything together.
This post is the first in a multipart series where we’ll help answer these critical questions in a straightforward, accessible way:
🧐 What exactly is CUDA?
🎯 Why has CUDA been so successful?
⚖️ Is CUDA any good?
❓ Why do other hardware makers struggle to provide comparable AI software?
⚡ Why haven’t existing technologies like Triton or OneAPI or OpenCL solved this?
🚀 How can we as an industry move forward?
I hope this series sparks meaningful discussions and raises the level of understanding around these complex issues. The rapid advancements in AI–like DeepSeek’s recent breakthroughs–remind us that software and algorithmic innovation are still driving forces. A deep understanding of low-level hardware continues to unlock “10x” breakthroughs.
AI is advancing at an unprecedented pace–but there’s still so much left to unlock. Together we can break it down, challenge assumptions, and push the industry forward. Let’s dive in!
-Chris