Dynamo 1.0: NVIDIA's Operating System for AI Factories Enters Production
Dynamo 1.0 is NVIDIA's inference-serving software, delivering up to a 7x performance boost on Blackwell GPUs. It rearchitects inference by disaggregating the two phases of a request: prefill (prompt processing and attention) and decode (token generation) run on separate GPU pools.
Key Takeaways
1. Delivers up to a 7x inference performance boost on Blackwell GPUs (same hardware, updated software stack).
2. Uses disaggregated inference: runs compute-bound prefill/attention and memory-bandwidth-bound decode/generation on separate GPU pools.
3. Already adopted by AWS, Azure, Google Cloud, and Oracle, plus Cursor, Perplexity, ByteDance, PayPal, and Pinterest.
4. Core modules are available standalone: KVBM (memory management), NIXL (GPU-to-GPU data movement), Grove (scaling).
The Core Insight
Jensen Huang explained that throughput and latency are "enemies of each other" in chip design. Dynamo addresses this via disaggregated inference: the compute-bound prefill/attention phase and the memory-bandwidth-bound decode/token-generation phase run on separate GPU pools, each provisioned for its workload.
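The split described above can be sketched in a few lines. This is a hypothetical, heavily simplified illustration of the disaggregated pattern, not Dynamo's actual API: a prefill worker processes the whole prompt once and emits the KV cache, which is then handed off to a decode worker that generates tokens against it.

```python
# Conceptual sketch of disaggregated inference (hypothetical, simplified):
# prefill (compute-bound) and decode (bandwidth-bound) run on separate
# worker pools, with the KV cache handed off between them.
from dataclasses import dataclass

@dataclass
class KVCache:
    prompt_len: int
    blocks: list  # stand-in for per-layer key/value tensors

class PrefillWorker:
    """Processes the full prompt once and emits the KV cache."""
    def prefill(self, prompt_tokens):
        # In a real system this is one large, compute-heavy batched pass.
        return KVCache(prompt_len=len(prompt_tokens),
                       blocks=[f"kv[{i}]" for i in range(len(prompt_tokens))])

class DecodeWorker:
    """Generates tokens one at a time against a transferred KV cache."""
    def decode(self, kv: KVCache, max_new_tokens: int):
        out = []
        for step in range(max_new_tokens):
            # Each step reads the whole KV cache: memory-bandwidth-bound.
            out.append(f"tok{kv.prompt_len + step}")
            kv.blocks.append(f"kv[{kv.prompt_len + step}]")
        return out

# A router hands the request to a prefill pool, ships the cache, then decodes.
prompt = ["The", "quick", "brown", "fox"]
kv = PrefillWorker().prefill(prompt)
tokens = DecodeWorker().decode(kv, max_new_tokens=3)
```

Because the two phases stress different resources, separating them lets each pool be batched and scheduled independently instead of forcing one GPU to alternate between incompatible workloads.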
Performance
- Up to 7x inference performance boost on Blackwell GPUs
- Token throughput jumps from 700 to nearly 5,000 tokens per second (same hardware, updated software stack)
- Token generation speed across a one-gigawatt factory: 2 million → 700 million, a 350x increase in two years
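The quoted figures are consistent with the headline multipliers; a quick arithmetic check using only the numbers above:

```python
# Sanity-check the quoted rates against the headline speedup claims.
per_gpu_before, per_gpu_after = 700, 5_000              # tokens/s on Blackwell
factory_before, factory_after = 2_000_000, 700_000_000  # factory-scale rate

gpu_speedup = per_gpu_after / per_gpu_before       # just over 7x ("up to 7x")
factory_speedup = factory_after / factory_before   # 350x, as stated

print(f"per-GPU: {gpu_speedup:.1f}x, factory: {factory_speedup:.0f}x")
```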
Adoption
Already deployed across:
- Cloud providers: AWS, Microsoft Azure, Google Cloud, Oracle
- AI-native companies: Cursor, Perplexity, Baseten, Deep Infra, Fireworks
- Enterprise: ByteDance, Meituan, PayPal, Pinterest
Core modules are available standalone: KVBM (KV-cache memory management), NIXL (GPU-to-GPU data movement), and Grove (scaling). NVIDIA also contributes TensorRT-LLM CUDA kernels to the FlashInfer project for integration with vLLM, SGLang, and LangChain.
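To make the KVBM idea concrete, here is a toy block-pool allocator. It is an illustration of the concept only (not KVBM's real API): KV-cache memory is carved into fixed-size blocks that are leased to sequences and reclaimed when a sequence finishes, so memory is reused rather than fragmented.

```python
# Toy illustration (not KVBM's actual interface) of KV-cache block
# management: a fixed pool of blocks is leased to sequences and
# reclaimed on completion, so capacity is recycled across requests.
class BlockPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # indices of available blocks
        self.owned = {}                       # seq_id -> list of block ids

    def allocate(self, seq_id: str, n: int):
        if n > len(self.free):
            raise MemoryError("KV pool exhausted")
        blocks = [self.free.pop() for _ in range(n)]
        self.owned.setdefault(seq_id, []).extend(blocks)
        return blocks

    def release(self, seq_id: str):
        # Finished sequences return their blocks for reuse.
        self.free.extend(self.owned.pop(seq_id, []))

pool = BlockPool(num_blocks=8)
pool.allocate("req-1", 5)
pool.allocate("req-2", 3)      # pool is now fully leased
pool.release("req-1")          # req-1 finishes; 5 blocks return
spare = pool.allocate("req-3", 4)
```

Block-level pooling like this is what lets a serving system admit a new request the moment an old one completes, instead of waiting for a contiguous region to free up.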
