How to Run DiffusionGemma Locally
DiffusionGemma ships as open weights under Apache 2.0, and because its Mixture-of-Experts design activates only ~3.8B of its 25.2B parameters per step, it is genuinely runnable outside a datacenter — quantized builds fit in about 18 GB of VRAM. This guide walks through every officially supported path: Hugging Face transformers, vLLM serving, and community quantizations for llama.cpp, Ollama, and LM Studio. Commands come from the official model card and the vLLM announcement; where the ecosystem is still settling we say so instead of guessing.
1. Hardware Requirements
Sizing in one paragraph: the sparse MoE keeps the active compute small, but the full 25.2B-parameter weights still need to sit in memory. Quantized community builds run in roughly 18 GB of VRAM, which puts 24 GB consumer cards (RTX 3090/4090 class) comfortably in range. For full-speed inference, the published benchmarks used H100 and H200 GPUs — the FP8 checkpoint reaches 1,008 tokens/second on H100 and 1,288 tokens/second on H200. CPU-only inference works through llama.cpp GGUF builds but sacrifices the speed that is the model's whole point.
2. Hugging Face Transformers
The reference path. Install current packages, then load the model with the dedicated block-diffusion class — note it is not the usual CausalLM class:
pip install -U transformers torch acceleratefrom transformers import DiffusionGemmaForBlockDiffusion, AutoProcessor
MODEL_ID = "google/diffusiongemma-26B-A4B-it"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = DiffusionGemmaForBlockDiffusion.from_pretrained(
MODEL_ID,
dtype="auto",
device_map="auto",
)device_map="auto" spreads the experts across available GPUs (or offloads to CPU when VRAM runs short). The processor handles the multimodal input side — text, image, and video in; text out.
3. vLLM (Recommended for Serving)
DiffusionGemma is the first diffusion LLM natively supported in vLLM, and vLLM is where the headline throughput numbers were measured. The team maintains an official deployment recipe at recipes.vllm.ai — follow it for the current serve flags rather than copying possibly-stale commands. Two ready-made quantized checkpoints are published alongside:
RedHatAI/diffusiongemma-26B-A4B-it-FP8-dynamic— the FP8 build behind the 1,288 tok/s H200 benchmark.RedHatAI/diffusiongemma-26B-A4B-it-NVFP4— NVFP4 for newer NVIDIA accelerators.
4. llama.cpp, Ollama, and LM Studio (GGUF)
If you want the model on a laptop or a single consumer GPU, quantized community builds are the way in. At launch the Hugging Face model page already listed 18 quantized variants targeting llama.cpp, Ollama, LM Studio, and Jan. The ecosystem is moving fast (the model is days old), so rather than hardcoding a quant name here: open the model page, click the Quantizations panel, and pick the newest build for your runtime and VRAM budget. Community Unsloth-based fine-tuning workflows are appearing through the same channel — search the suggest-friendly terms "diffusion gemma unsloth" or "diffusion gemma gguf" on Hugging Face to find them.
5. Vertex AI (Managed Google Cloud)
For teams that want managed infrastructure without owning GPUs, the model is available through Vertex AI. You deploy it to your own Google Cloud project endpoints — distinct from a public pay-per-token API, which does not exist for DiffusionGemma yet. Google also publishes a Kubernetes deployment tutorial and the hackable_diffusion repository for fine-tuning experiments.
6. Troubleshooting and Staying Current
DiffusionGemma is days old, and that shapes how you should debug it. If DiffusionGemmaForBlockDiffusion import fails, your transformers build predates the model — upgrade with pip install -U transformers before anything else, since the class only exists in releases that shipped alongside the model. If loading succeeds but you run out of memory, drop to a quantized checkpoint rather than fighting offload settings: the FP8 and NVFP4 builds exist precisely because the full-precision weights are the heavy path. And if generation quality looks rough, that is currently expected — early community testing places DiffusionGemma's output below standard Gemma 4, so judge it on speed, infilling, and editing rather than prose polish.
Three places worth watching as the ecosystem settles: the Hugging Face model page's discussion tab (where quantization releases and loading fixes land first), the vLLM recipe page (serve flags are still being tuned), and Google's own DiffusionGemma docs for inference guides and the fine-tuning repository. Whatever you build this week, re-check those sources before promoting it — flags and best practices for a model this new can change between minor releases.
7. Which Path Should You Pick?
A quick decision table, matching each common situation to the route that wastes the least of your time:
- Prototyping prompts: the free playground on our homepage (runs Gemini 2.5 Flash, labeled as such) — zero setup.
- Single consumer GPU / laptop: GGUF quantization via llama.cpp, Ollama, or LM Studio.
- Production-style serving / benchmarks: vLLM with the FP8 checkpoint on H100-class hardware.
- Research and fine-tuning: transformers + hackable_diffusion.
- Managed cloud: Vertex AI deployment.
FAQ
Can I run DiffusionGemma on a consumer GPU?
Yes — quantized builds fit in roughly 18 GB of VRAM, so 24 GB consumer cards (RTX 3090/4090 class) can run it. Full-precision serving is happiest on an H100-class accelerator.
Is there a DiffusionGemma GGUF for llama.cpp or Ollama?
Yes. The Hugging Face model page lists community quantizations (18 variants at launch) covering llama.cpp, Ollama, LM Studio, and Jan. Open the model page and use the “Quantizations” panel to pick a build for your runtime.
Which vLLM version supports DiffusionGemma?
DiffusionGemma is the first diffusion LLM natively supported in vLLM, announced June 10, 2026. Use the latest vLLM release and follow the official recipe at recipes.vllm.ai for current flags; the team also published FP8 and NVFP4 quantized checkpoints under the RedHatAI org.
Is there a hosted DiffusionGemma API I can call instead?
Not yet. As of mid-June 2026 no inference provider serves native DiffusionGemma. Your options are self-hosting (this guide) or Vertex AI model deployment on Google Cloud.