CAISI Claims DeepSeek 8 Months Behind, Experts Push Back
A US government institute, CAISI, evaluated DeepSeek V4 Pro, claiming it lags 8 months behind US frontier models like GPT-5.5. The assessment relied on private benchmarks and filtered cost comparisons, but public benchmarks show a much smaller gap, fueling debate over AI competition measurement.
Quick Take
- CAISI claims DeepSeek V4 Pro lags the US frontier by 8 months based on private benchmarks.
- Experts note the public benchmark gap is only 2.7% and that the cost comparison was selectively filtered.
- DeepSeek scored 90% on GPQA-Diamond, close to Opus 4.6, raising methodology questions.
- The evaluation sparked debate on whether the AI race gap is real or an artifact.
Market Impact Analysis
Neutral. The article discusses AI model evaluations and does not pertain to crypto markets; therefore, no market impact.
Key Takeaways
- CAISI claims DeepSeek V4 Pro trails the U.S. AI frontier by eight months, using private benchmarks that can't be independently verified.
- Public benchmarks show a narrow 2.7% gap between U.S. and Chinese AI, contradicting CAISI's dire assessment.
- DeepSeek scored 90% on GPQA-Diamond, nearly matching Anthropic's Opus 4.6 at 91%, raising doubts about the methodology.
- Cost comparisons were filtered to exclude expensive or weak U.S. models, leaving only GPT-5.4 mini against DeepSeek.
What Happened
On May 1, the U.S. government’s Center for AI Standards and Innovation (CAISI), a NIST unit, dropped an evaluation of China’s DeepSeek V4 Pro. The verdict: the open-weight model lags the U.S. frontier by roughly eight months. CAISI used Item Response Theory (IRT) to estimate latent capability, placing DeepSeek’s Elo score around 800—far below GPT-5.5’s 1,260 and Opus 4.6’s 999. Almost immediately, the methodology sparked backlash. Two of the nine benchmark datasets are private, making the results impossible to replicate. Critics argue the gap is an artifact of cherry-picked tests rather than a real capability deficit.
The Numbers
On public benchmarks, the picture looks very different. DeepSeek hit 90% on GPQA-Diamond (PhD-level science reasoning), one point shy of Opus 4.6’s 91%. On math olympiad tests, it scored 96-97%. SWE-Bench Verified showed a 74% bug-fix resolution rate versus GPT-5.5’s 81%. Meanwhile, Stanford’s 2026 AI Index pegs the U.S.-China public leaderboard gap at just 2.7%. CAISI’s cost comparison filtered out all U.S. models except GPT-5.4 mini, against which DeepSeek was cheaper on five of seven benchmarks—a narrow metric of questionable relevance.
Why It Happened
CAISI’s evaluation was designed to measure capability through IRT, which weights problem difficulty—think SAT scoring for AI. The method amplifies gaps when models fail harder questions, and the private benchmarks likely contain challenges that favor U.S. models. By excluding costlier competitors, the cost analysis sidestepped more powerful but expensive U.S. offerings. Public leaderboards have been converging for years, so a claim of an eight-month lag is hard to square with that trend. The selective methodology suggests institutional reluctance to acknowledge parity.
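CAISI has not published its scoring code, but the difficulty-weighting effect described above can be sketched with a toy two-parameter logistic (2PL) IRT model. All item parameters and response patterns below are made up for illustration; they are not CAISI's data. The point is that two models with identical raw accuracy can receive very different latent-ability estimates depending on which items they miss:

```python
import math

# Hypothetical item pool: (difficulty b, discrimination a) for a 2PL IRT model.
# Harder items are given higher discrimination, so failing them pulls the
# ability estimate down more -- the amplification effect critics attribute
# to CAISI's private benchmarks. These numbers are illustrative only.
ITEMS = [(-1.0, 0.8), (0.0, 1.0), (1.0, 1.5), (2.0, 2.0)]

def p_correct(theta, b, a):
    """2PL probability that a model with ability theta solves item (b, a)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, items=ITEMS, iters=200, lr=0.1):
    """Maximum-likelihood ability estimate via gradient ascent on the
    2PL log-likelihood (concave in theta, so ascent converges)."""
    theta = 0.0
    for _ in range(iters):
        grad = sum(a * (r - p_correct(theta, b, a))
                   for r, (b, a) in zip(responses, items))
        theta += lr * grad
    return theta

# Two models with the same raw score (2 of 4) but different patterns:
easy_only = estimate_ability([1, 1, 0, 0])  # solves only the easy items
hard_only = estimate_ability([0, 0, 1, 1])  # solves only the hard items
# hard_only > easy_only: solving harder items earns a higher latent ability
# despite identical accuracy, which is how a benchmark's item mix can
# stretch or shrink the apparent gap between two models.
```

Under this model, the choice of items does as much work as the models' raw scores, which is why unpublished, non-replicable benchmarks draw the criticism described above.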
Broader Impact
The dispute puts AI evaluation standards under a microscope. If government assessments rely on secret tests, it undermines trust in U.S.-China tech comparisons. Model progression metrics could shift toward public, reproducible benchmarks, forcing policymakers to reconcile divergent data. For the open-source community, DeepSeek’s strong public scores validate the global diffusion of AI capabilities, complicating containment narratives.
What to Watch Next
- Third-party auditors may attempt to replicate CAISI’s private benchmarks—success will validate or debunk the gap.
- DeepSeek’s next model iteration could close the public benchmark gap entirely, testing the narrative of a permanent U.S. lead.
- Regulatory moves by the U.S. government to restrict open-weight releases may gain momentum if the security case weakens.
This article is for informational purposes only and does not constitute financial advice.
Disclaimer: Bytewit is an independent media outlet that delivers news, research, and data.
© 2026 Bytewit. All Rights Reserved.