CAISI Claims DeepSeek 8 Months Behind, Experts Push Back
A US government institute, CAISI, evaluated DeepSeek V4 Pro, claiming it lags 8 months behind US frontier models like GPT-5.5. The assessment relied on private benchmarks and filtered cost comparisons, but public benchmarks show a much smaller gap, fueling debate over AI competition measurement.
Quick Take
- CAISI claims DeepSeek V4 Pro lags the US frontier by 8 months based on private benchmarks.
- Experts note the public benchmark gap is only 2.7% and that the cost comparison was selectively filtered.
- DeepSeek scored 90% on GPQA-Diamond, close to Opus 4.6, raising methodology questions.
- The evaluation sparked debate on whether the AI race gap is real or an artifact.
Market Impact Analysis
Neutral. The article discusses AI model evaluations and does not pertain to crypto markets; therefore, no market impact.
Key Takeaways
- CAISI claims DeepSeek V4 Pro trails the U.S. AI frontier by eight months, using private benchmarks that can't be independently verified.
- Public benchmarks show a narrow 2.7% gap between U.S. and Chinese AI, contradicting CAISI's dire assessment.
- DeepSeek scored 90% on GPQA-Diamond, nearly matching Anthropic's Opus 4.6 at 91%, raising doubts about the methodology.
- Cost comparisons were filtered to exclude expensive or weak U.S. models, leaving only GPT-5.4 mini against DeepSeek.
What Happened
On May 1, the U.S. government’s Center for AI Standards and Innovation (CAISI), a NIST unit, dropped an evaluation of China’s DeepSeek V4 Pro. The verdict: the open-weight model lags the U.S. frontier by roughly eight months. CAISI used Item Response Theory (IRT) to estimate latent capability, placing DeepSeek’s Elo score around 800—far below GPT-5.5’s 1,260 and Opus 4.6’s 999. Almost immediately, the methodology sparked backlash. Two of the nine benchmark datasets are private, making the results impossible to replicate. Critics argue the gap is an artifact of cherry-picked tests rather than a real capability deficit.
The Numbers
On public benchmarks, the picture looks very different. DeepSeek hit 90% on GPQA-Diamond (PhD-level science reasoning), one point shy of Opus 4.6’s 91%. On math olympiad tests, it scored 96-97%. SWE-Bench Verified showed a 74% bug-fix resolution rate versus GPT-5.5’s 81%. Meanwhile, Stanford’s 2026 AI Index pegs the U.S.-China public leaderboard gap at just 2.7%. CAISI’s cost comparison filtered out all U.S. models except GPT-5.4 mini, against which DeepSeek was cheaper on five of seven benchmarks—a narrow metric of questionable relevance.
Why It Happened
CAISI’s evaluation was designed to measure capability through IRT, which weights problem difficulty—think SAT scoring for AI. The method amplifies gaps when models fail harder questions, and the private benchmarks likely contain challenges that favor U.S. models. By excluding costlier competitors, the cost analysis sidestepped more powerful but expensive U.S. offerings. Public leaderboards have been converging for years, so a claim of an eight-month lag is hard to square with that trend. The selective methodology suggests institutional reluctance to acknowledge parity.
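CAISI has not published its scoring code, but the difficulty-weighting effect described above can be sketched with a toy two-parameter logistic (2PL) IRT model. All item parameters and response patterns below are made up for illustration; they are not CAISI's data. The point is that two models with identical raw accuracy can receive very different latent-ability estimates depending on which items they miss:

```python
import math

# Hypothetical item pool: (difficulty b, discrimination a) for a 2PL IRT model.
# Harder items are given higher discrimination, so failing them pulls the
# ability estimate down more -- the amplification effect critics attribute
# to CAISI's private benchmarks. These numbers are illustrative only.
ITEMS = [(-1.0, 0.8), (0.0, 1.0), (1.0, 1.5), (2.0, 2.0)]

def p_correct(theta, b, a):
    """2PL probability that a model with ability theta solves item (b, a)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, items=ITEMS, iters=200, lr=0.1):
    """Maximum-likelihood ability estimate via gradient ascent on the
    2PL log-likelihood (concave in theta, so ascent converges)."""
    theta = 0.0
    for _ in range(iters):
        grad = sum(a * (r - p_correct(theta, b, a))
                   for r, (b, a) in zip(responses, items))
        theta += lr * grad
    return theta

# Two models with the same raw score (2 of 4) but different patterns:
easy_only = estimate_ability([1, 1, 0, 0])  # solves only the easy items
hard_only = estimate_ability([0, 0, 1, 1])  # solves only the hard items
# hard_only > easy_only: solving harder items earns a higher latent ability
# despite identical accuracy, which is how a benchmark's item mix can
# stretch or shrink the apparent gap between two models.
```

Under this model, the choice of items does as much work as the models' raw scores, which is why unpublished, non-replicable benchmarks draw the criticism described above.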
Broader Impact
The dispute puts AI evaluation standards under a microscope. If government assessments rely on secret tests, it undermines trust in U.S.-China tech comparisons. Model progression metrics could shift toward public, reproducible benchmarks, forcing policymakers to reconcile divergent data. For the open-source community, DeepSeek’s strong public scores validate the global diffusion of AI capabilities, complicating containment narratives.
What to Watch Next
- Third-party auditors may attempt to replicate CAISI’s private benchmarks—success will validate or debunk the gap.
- DeepSeek’s next model iteration could close the public benchmark gap entirely, testing the narrative of a permanent U.S. lead.
- Regulatory moves by the U.S. government to restrict open-weight releases may gain momentum if the security case weakens.
This article is for informational purposes only and does not constitute financial advice.
Disclaimer: Bytewit is an independent media outlet that delivers news, research, and data.
© 2026 Bytewit. All Rights Reserved.