
CAISI Claims DeepSeek 8 Months Behind, Experts Push Back

A US government institute, CAISI, evaluated DeepSeek V4 Pro and claimed it lags eight months behind US frontier models such as GPT-5.5. The assessment relied on private benchmarks and filtered cost comparisons, but public benchmarks show a much smaller gap, fueling debate over how AI competition is measured.

Decrypt · Jose Antonio Lanz

Quick Take

1. CAISI claims DeepSeek V4 Pro lags the US frontier by 8 months, based on private benchmarks.
2. Experts note the public benchmark gap is only 2.7% and the cost comparison was selectively filtered.
3. DeepSeek scored 90% on GPQA-Diamond, close to Opus 4.6, raising methodology questions.
4. The evaluation sparked debate on whether the AI race gap is real or an artifact.

Market Impact Analysis

Neutral

The article discusses AI model evaluations and does not pertain to crypto markets; therefore, no market impact.

Timeframe: short

Speculation Analysis

Factuality: 80/100
Rumors: Verified
Speculation Trigger: 0/100 (Minimal)

Key Takeaways

  • CAISI claims DeepSeek V4 Pro trails the U.S. AI frontier by eight months, using private benchmarks that can't be independently verified.
  • Public benchmarks show a narrow 2.7% gap between U.S. and Chinese AI, contradicting CAISI's dire assessment.
  • DeepSeek scored 90% on GPQA-Diamond, nearly matching Anthropic's Opus 4.6 at 91%, raising doubts about the methodology.
  • Cost comparisons were filtered to exclude expensive or weak U.S. models, leaving only GPT-5.4 mini against DeepSeek.
  • DeepSeek Elo score: ~800 vs. GPT-5.5 at 1,260
  • Public gap: 2.7% (Stanford 2026 AI Index)
  • GPQA-Diamond: 90% (PhD-level reasoning)
  • Cost filter: excluded most U.S. models; only GPT-5.4 mini qualified

What Happened

On May 1, the U.S. government’s Center for AI Standards and Innovation (CAISI), a NIST unit, dropped an evaluation of China’s DeepSeek V4 Pro. The verdict: the open-weight model lags the U.S. frontier by roughly eight months. CAISI used Item Response Theory (IRT) to estimate latent capability, placing DeepSeek’s Elo score around 800—far below GPT-5.5’s 1,260 and Opus 4.6’s 999. Almost immediately, the methodology sparked backlash. Two of the nine benchmark datasets are private, making the results impossible to replicate. Critics argue the gap is an artifact of cherry-picked tests rather than a real capability deficit.
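To put the reported Elo numbers in perspective, the standard Elo formula converts a rating gap into an expected head-to-head score. This is an illustrative calculation using the article's figures, not CAISI's actual methodology:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard 400-point Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Ratings as reported in the article
print(round(elo_expected_score(800, 1260), 3))  # DeepSeek vs. GPT-5.5 ≈ 0.066
print(round(elo_expected_score(800, 999), 3))   # DeepSeek vs. Opus 4.6 ≈ 0.241
```

On this scale, an 800-vs-1,260 gap implies DeepSeek would be preferred in only about 7% of head-to-head comparisons, which is part of why critics find the figure hard to square with near-parity on public benchmarks.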

The Numbers

On public benchmarks, DeepSeek paints a very different picture. It hit 90% on GPQA-Diamond (PhD-level science reasoning), one point shy of Opus 4.6’s 91%. On math olympiad tests, it scored 96-97%. SWE-Bench Verified showed a 74% bug-fix resolution rate versus GPT-5.5’s 81%. Meanwhile, Stanford’s 2026 AI Index pegs the U.S.-China public leaderboard gap at just 2.7%. CAISI’s cost comparison filtered out all U.S. models except GPT-5.4 mini, against which DeepSeek was cheaper on five of seven benchmarks—a narrow metric of questionable relevance.

Why It Happened

CAISI’s evaluation was designed to measure capability through IRT, which weights problem difficulty—think SAT scoring for AI. The method amplifies gaps when models fail harder questions, and the private benchmarks likely contain challenges that favor U.S. models. By excluding costlier competitors, the cost analysis sidestepped more powerful but expensive U.S. offerings. Public leaderboards have been closing for years, so a claim of an eight-month lag flies in the face of converging open-source and private performance. The selective methodology suggests institutional caution in acknowledging parity.
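CAISI's item parameters are not public, but a minimal two-parameter logistic (2PL) IRT sketch, with purely hypothetical ability and difficulty values, shows why the choice of test items matters: items whose difficulty sits near the models' ability levels separate them most, so a benchmark mix can widen or shrink an apparent gap.

```python
import math

def p_correct(theta: float, difficulty: float, discrimination: float = 1.0) -> float:
    """2PL IRT: probability that a model with latent ability theta answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# Hypothetical latent abilities for two models (not CAISI's estimates)
strong, weak = 2.0, 1.5

for b in (0.0, 1.75, 4.0):  # easy, well-matched, and very hard items
    gap = p_correct(strong, b) - p_correct(weak, b)
    print(f"difficulty={b}: success-probability gap = {gap:.3f}")
```

With these toy numbers the gap is small on easy items (both models pass) and on very hard items (both fail), and largest on items pitched between the two abilities, so a private test set weighted toward such items would maximize the measured separation.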

Broader Impact

The dispute puts AI evaluation standards under a microscope. If government assessments rely on secret tests, it undermines trust in U.S.-China tech comparisons. Model progression metrics could shift toward public, reproducible benchmarks, forcing policymakers to reconcile divergent data. For the open-source community, DeepSeek’s strong public scores validate the global diffusion of AI capabilities, complicating containment narratives.

What to Watch Next

  • Third-party auditors may attempt to replicate CAISI’s private benchmarks—success will validate or debunk the gap.
  • DeepSeek’s next model iteration could close the public benchmark gap entirely, testing the narrative of a permanent U.S. lead.
  • Regulatory moves by the U.S. government to restrict open-weight releases may gain momentum if the security case weakens.
Source: Decrypt

This article is for informational purposes only and does not constitute financial advice.



Disclaimer: Bytewit is an independent media outlet that delivers news, research, and data.

© 2026 Bytewit. All Rights Reserved. This article is for informational purposes only.

May 4, 2026, 9:15 PM UTC · Decrypt