
Technology & Innovation | Neutral

95% confidence

Google's DiffusionGemma Open Model Generates 1000 Tokens/Second for Free

Google released DiffusionGemma, an open-weight AI model using text diffusion to generate 256-token blocks at 1000 tokens/second on H100 GPUs. It requires a custom drafter for local inference, limiting consumer use, and prioritizes speed over quality. It's the first major open text diffusion model.

Jun 10, 2026, 10:01 PM UTC | Decrypt | Jose Antonio Lanz

Quick Take

Google's DiffusionGemma hits 1000 tokens/sec via parallel text diffusion.

Open-weight, Apache 2.0 licensed, available on Hugging Face.

Requires custom drafter not yet in public runtimes, limiting use.

Trails Gemma 4 in quality; emphasizes speed for structured tasks.

Market Impact Analysis

Neutral

The article focuses on an AI model release with no direct crypto market implications.

Timeframe: short

Speculation Analysis

Factuality: 90/100 (scale: Rumors to Verified)

Speculation Trigger: 10/100 (scale: Minimal to Extreme FOMO)

Key Takeaways

DiffusionGemma outputs 256-token blocks at more than 1,000 tokens/second on an H100, about 4x the speed of standard autoregressive models.
Open-weight under Apache 2.0, the model is live on Hugging Face but lacks local runtime support.
A custom drafter module required for inference hasn't been integrated into any public framework yet.
Speed comes at a quality cost: it trails Gemma 4 but excels at structured tasks like code infilling.

Speed: 1,000 tok/s on NVIDIA H100

Block Size: 256 tokens per forward pass

License: Apache 2.0, open-weight

Sudoku Accuracy: 80% after fine-tuning

What Happened

Google released DiffusionGemma, an open-weight language model that departs from autoregressive decoding. Instead of generating one token at a time, it starts from noise and refines an entire 256-token block in parallel. On an NVIDIA H100 it generates more than 1,000 tokens per second, about four times the speed of standard autoregressive models. The weights are free under Apache 2.0 and available on Hugging Face. There is a catch: the custom drafter module it needs for inference is not yet included in any public local runtime, so for now it is better suited to cloud deployments than to desktop use.
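To make the idea of refining a whole block in parallel concrete, here is a toy sketch in Python. It is not DiffusionGemma's algorithm: the `<mask>` placeholder, the function name `toy_denoise_block`, the step count, and the use of a known target in place of a model are all assumptions made so the loop can run without weights. It only shows the control flow, in which several positions are resolved per step rather than one token at a time.

```python
import random

MASK = "<mask>"  # stand-in for a fully noised position (assumption)


def toy_denoise_block(target, steps=4, seed=0):
    """Fill a block in parallel over a fixed number of refinement steps.

    `target` plays the role of the model: a real denoiser would predict
    every position at once and commit its most confident guesses. Here
    the positions to commit are picked at random so the loop runs
    without any weights.
    """
    rng = random.Random(seed)
    block = [MASK] * len(target)
    hidden = list(range(len(target)))
    rng.shuffle(hidden)
    per_step = -(-len(target) // steps)  # ceiling division
    history = []
    for _ in range(steps):
        commit, hidden = hidden[:per_step], hidden[per_step:]
        for i in commit:  # several positions are resolved in the same step
            block[i] = target[i]
        history.append(list(block))
    return block, history


target = "def add ( a , b ) :".split()
final, history = toy_denoise_block(target, steps=4)
for step, snapshot in enumerate(history, start=1):
    print(step, " ".join(snapshot))
```

In this toy, an 8-token block is finished in 4 steps instead of 8 sequential ones. The step count and the number of positions committed per step are illustrative and are not taken from the article.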

The Numbers

DiffusionGemma's headline figure is 1,000+ tokens per second on an H100, with 700+ tokens per second on a consumer RTX 5090. Each forward pass handles a block of 256 tokens and uses bidirectional attention, which the causal, left-to-right attention of sequential models does not allow. On NVIDIA NIM, context length is capped at 8,192 tokens, well below the 64,000 needed for many agentic workflows. Quality benchmarks show it lagging behind Gemma 4, but a fine-tuned version reached 80% accuracy on Sudoku puzzles, showcasing its aptitude for structured output where the end shapes the beginning.
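The figures above can be put in proportion with some quick arithmetic. The sketch below uses only numbers quoted in the article. The variable names are illustrative, the throughput values are treated as flat rates even though the article gives them as lower bounds ("1,000+", "700+"), and the autoregressive baseline is only what the "quadruple" claim implies, not a measured figure.

```python
BLOCK_TOKENS = 256          # tokens per block, from the article
H100_TOK_PER_S = 1000       # quoted as "1,000+" on an NVIDIA H100
RTX_5090_TOK_PER_S = 700    # quoted as "700+" on an RTX 5090
NIM_CONTEXT = 8192          # context cap on NVIDIA NIM
AGENTIC_CONTEXT = 64000     # context the article says agentic workflows need

# Time to produce one 256-token block at each quoted rate.
seconds_per_block_h100 = BLOCK_TOKENS / H100_TOK_PER_S
seconds_per_block_5090 = BLOCK_TOKENS / RTX_5090_TOK_PER_S

# Baseline implied by the "quadruple the speed" claim (derived, not measured).
implied_autoregressive_tok_per_s = H100_TOK_PER_S / 4

# How many 256-token blocks fit in the NIM context cap, and how far the
# cap falls short of the 64,000-token figure.
blocks_in_nim_context = NIM_CONTEXT // BLOCK_TOKENS
context_shortfall_factor = AGENTIC_CONTEXT / NIM_CONTEXT

print(f"H100: {seconds_per_block_h100:.3f} s per block")
print(f"RTX 5090: {seconds_per_block_5090:.3f} s per block")
print(f"Implied autoregressive baseline: {implied_autoregressive_tok_per_s:.0f} tok/s")
print(f"Blocks in NIM context: {blocks_in_nim_context}")
print(f"Context shortfall: {context_shortfall_factor:.2f}x")
```

At the quoted rates, one block takes about a quarter of a second on an H100, the NIM cap holds 32 such blocks, and the cap is roughly 7.8 times smaller than the 64,000-token figure.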

Why It Happened

Text diffusion has lingered in academic labs for years. Google's move is an attempt to push it into the mainstream. Parallel token generation sidesteps the sequential bottleneck of autoregressive models, slashing latency for bulk text generation. Bidirectional attention also makes it a natural fit for tasks like code completion or constraint-heavy generation. By open-sourcing the weights, Google is betting that community tooling will eventually catch up, mirroring the path image diffusion models took from research curiosities to industry workhorses.
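The difference between causal and bidirectional attention, and why it matters for infilling, can be shown with a small mask comparison. This is a generic illustration in plain Python, not code from DiffusionGemma. The function names and the 5-token example are assumptions chosen for clarity.

```python
def causal_mask(n):
    """Position i may attend only to positions 0..i (left-to-right decoding)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position in the block."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask, i):
    """Indices that position i is allowed to attend to under `mask`."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


n = 5
gap = 2  # a position to infill in the middle of the block

print("causal:", visible_positions(causal_mask(n), gap))
print("bidirectional:", visible_positions(bidirectional_mask(n), gap))
```

Under the causal mask, the gap at index 2 sees only indices 0 to 2, so nothing to its right can influence it. Under the bidirectional mask it sees all five positions, which is why tasks such as code infilling, where the text after the gap constrains what belongs in it, are a natural fit.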

Broader Impact

This could be a tipping point for diffusion-based language models. With an Apache 2.0 license and Hugging Face integration, developers can already tinker. As local runtimes add drafter support, the speed-optimized model may find a niche in real-time applications and edge computing. It also raises the stakes for other labs: Inception's Mercury 2 proved commercial viability, but Google's open release could spawn a wave of fine-tuned diffusion LLMs tailored for structured tasks.

What to Watch Next

Community drafter support: Keep an eye on mlx-lm, LM Studio, and vLLM for integration that would unlock local inference.
Quality benchmarks: Upcoming comparisons will show if diffusion models can narrow the quality gap while retaining speed.
Structured task fine-tuning: Expect specialized variants for code infilling, data formatting, and puzzle-solving to hit Hugging Face.

This article is for informational purposes only and does not constitute financial advice.

Source: Decrypt
