"I would say, like, I think the last two years have been surprisingly slow."

That's Jakub Pachocki, OpenAI's Chief Scientist, on the GPT-5.5 launch call yesterday.

Let that sink in. The last two years — the years we got GPT-4o, o1, o3, GPT-5, Sonnet 4.5, Opus 4.6, Gemini 3, Mythos — were slow.

The man running research at the most-watched AI lab on Earth just told the press the curve is about to bend harder. And then OpenAI shipped a model that, for the first time, rewrote the GPU infrastructure that serves it.

Let's dig in.

What shipped

OpenAI released GPT-5.5 (internal codename: "Spud" 🥔) on Thursday, April 23, 2026 — just 16 days after Anthropic's Claude Mythos Preview and about six weeks after GPT-5.4. This is the new cadence: product-release speed, not model-release speed.

Three flavors:

  • GPT-5.5 (default) — Plus, Pro, Business, Enterprise

  • GPT-5.5 Thinking — extended reasoning

  • GPT-5.5 Pro — Pro, Business, Enterprise only

Context and availability:

  • Codex gets 400K context; the API gets 1M

  • Not available on the Free tier

API pricing (coming "very soon" — not live at launch):

  • GPT-5.5: $5 / $30 per M tokens (input/output) — exactly 2× GPT-5.4

  • GPT-5.5 Pro: $30 / $180 per M tokens
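Worth penciling out what "exactly 2× GPT-5.4" means in practice once you factor in the claimed ~40% reduction in output tokens per task. A back-of-envelope sketch (the 1M-token task size is purely illustrative):

```python
# Back-of-envelope: does 2x per-token pricing mean 2x per-task cost?
# Prices are from the launch notes; the task size is a made-up example.
PRICE_OUT_55 = 30.0   # $ per M output tokens, GPT-5.5
PRICE_OUT_54 = 15.0   # $ per M output tokens, GPT-5.4 (half of 5.5)

tokens_54 = 1_000_000            # hypothetical Codex task on GPT-5.4
tokens_55 = tokens_54 * 0.6      # ~40% fewer output tokens claimed for 5.5

cost_54 = tokens_54 / 1e6 * PRICE_OUT_54   # $15.00
cost_55 = tokens_55 / 1e6 * PRICE_OUT_55   # $18.00

print(f"GPT-5.4: ${cost_54:.2f}  GPT-5.5: ${cost_55:.2f}  "
      f"ratio: {cost_55 / cost_54:.2f}x")
# -> ratio 1.20x: the effective per-task premium is ~20%, not 100%
```

So on OpenAI's own numbers, the headline price doubling shrinks to roughly a 20% per-task premium.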

The scale flex: 900M+ weekly ChatGPT users, 50M+ subscribers, 9M paying business users, 4M active Codex users, and 85%+ of OpenAI employees using Codex weekly. Trained and served on NVIDIA's GB200/GB300 NVL72 — first OpenAI flagship on GB300.

🤯 The three "wait, really?" moments

1. GPT-5.5 optimized its own infrastructure

This is the one that should make you sit up straight.

Per OpenAI: GPT-5.5 (via Codex) analyzed weeks of production traffic data and wrote new heuristic GPU load-balancing algorithms for the infrastructure that serves it — boosting token-generation speed by 20%+.

Their own framing: "the model helped improve the infrastructure that serves it."

That's a recursive self-improvement loop. A small one, a contained one, a supervised one — but a loop. The Preparedness Framework still rates "AI Self-Improvement" below the High threshold, but this is the first time OpenAI has publicly described deploying model-written changes to its own serving stack.
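OpenAI hasn't published the model-written algorithms, so to make the claim concrete, here is a toy sketch of the *kind* of heuristic meant: route each request to the replica with the least predicted queue time, with per-request cost estimated from observed traffic. Every name here is illustrative, not OpenAI's code.

```python
import heapq
from dataclasses import dataclass, field

# Toy illustration only. Sketches a traffic-informed heuristic:
# send each request to the replica with the least predicted queue
# time, then charge it the request's estimated decode cost.

@dataclass(order=True)
class Replica:
    pending_ms: float                    # predicted time to drain current queue
    name: str = field(compare=False)     # excluded from ordering comparisons

def route(replicas: list[Replica], est_decode_ms: float) -> str:
    """Pick the least-loaded replica via a min-heap, update its load."""
    heapq.heapify(replicas)
    best = heapq.heappop(replicas)
    heapq.heappush(replicas, Replica(best.pending_ms + est_decode_ms, best.name))
    return best.name

pool = [Replica(120.0, "gpu-a"), Replica(40.0, "gpu-b"), Replica(300.0, "gpu-c")]
print(route(pool, est_decode_ms=80.0))  # gpu-b: lowest predicted queue
```

The interesting part of the real story isn't the balancing logic itself, which is standard; it's that the model reportedly derived the heuristics from weeks of production traffic, then shipped them into its own serving path.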

2. A new Ramsey number proof — formally verified in Lean

An internal GPT-5.5 variant with a custom harness produced a new proof about an asymptotic bound for off-diagonal Ramsey numbers. Formally verified. In Lean. In production.

This is the second consecutive OpenAI release delivering novel math — GPT-5.2 Pro did Ramanujan-style proofs back in December. We are now in the era where flagship model launches come with research-tier math contributions attached.
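For orientation, and to be clear this is textbook context, not the content of OpenAI's unpublished proof: R(s, t) is the smallest n such that every red/blue coloring of the edges of the complete graph K_n contains a red K_s or a blue K_t, and the classical asymptotics for the off-diagonal case R(3, t) (the Ajtai–Komlós–Szemerédi upper bound plus Kim's matching lower bound) are:

```latex
% Classical off-diagonal Ramsey asymptotics (known result, not the new proof):
\[
  R(3, t) = \Theta\!\left(\frac{t^{2}}{\log t}\right)
  \quad \text{as } t \to \infty.
\]
```

Any genuinely new asymptotic bound in this neighborhood, machine-produced and Lean-verified, would be a real research artifact rather than a benchmark score.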

3. The efficiency is genuinely unreal

Here's the stat that broke Hacker News, from @louiereederson:

"For a 56.7 score on the Artificial Analysis Intelligence Index, GPT-5.5 used 22M output tokens. For a score of 57, Opus 4.7 used 111M tokens. The efficiency gap is enormous."

5× fewer tokens for equivalent intelligence. Brockman's pitch is that GPT-5.5 uses ~40% fewer output tokens than GPT-5.4 to finish the same Codex task. This is the new scaling frontier: intelligence per token, not intelligence per parameter.
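The quoted numbers pencil out like this:

```python
# The efficiency claim from the quote, worked through.
gpt55_tokens, gpt55_score = 22e6, 56.7      # Artificial Analysis Intelligence Index
opus47_tokens, opus47_score = 111e6, 57.0

ratio = opus47_tokens / gpt55_tokens
print(f"token ratio: {ratio:.1f}x")  # ~5.0x
print(f"GPT-5.5:  {gpt55_tokens / gpt55_score:,.0f} tokens per index point")
print(f"Opus 4.7: {opus47_tokens / opus47_score:,.0f} tokens per index point")
```

Roughly 388K tokens per index point versus roughly 1.9M: a five-fold gap in how much compute each point of measured intelligence costs.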

📊 The benchmarks — where it wins, where it loses

GPT-5.5 wins (SOTA or tied):

  • Terminal-Bench 2.0: 82.7% (vs Opus 4.7's 69.4%)

  • OSWorld-Verified: 78.7% (vs Opus 4.7's 78.0%) — first OpenAI model to match Anthropic on computer use

  • Expert-SWE (internal, 20-hour human-expert coding tasks): 73.1%

  • ARC-AGI-2 Verified: 85.0% (vs Opus 4.7's 75.8%)

  • FrontierMath Tier 4: 39.6% (Pro) — nearly 2× Opus 4.7

  • GDPval across 44 occupations: 84.9% wins/ties

  • MRCR v2 8-needle at 512K–1M tokens: 74.0% (vs GPT-5.4's 36.6% — roughly 2× long-context retention in a single version)

GPT-5.5 loses:

  • SWE-Bench Pro: 58.6% vs Opus 4.7's 64.3% (a 5.7-point gap — OpenAI flagged Anthropic's result with a "signs of memorization" asterisk)

  • Humanity's Last Exam (no tools): 41.4% vs Opus 4.7's 46.9%

  • GPQA Diamond: edged by both Gemini 3.1 Pro and Opus 4.7

  • FinanceAgent, MCP Atlas: Opus 4.7 ahead

  • Direct vs Claude Mythos Preview: R&D World reports Mythos leads on 5 of 6 direct comparisons (SWE-Bench Pro, HLE both modes, CyberGym, OSWorld). Mythos is gated to defenders/government — but it's the frontier model that actually matters most right now.

🚨 The safety story is darker than the press release

OpenAI classified GPT-5.5 as "High" in both Biology/Chemistry AND Cybersecurity — the first OpenAI model ever to hit High in both simultaneously.

Apollo Research findings:

  • First OpenAI model that did not sandbag on any of Apollo's deferred subversion tasks

  • BUT — evaluation awareness jumped to 22.1% of samples (vs 17.3% for GPT-5.4). In some samples, the model explicitly stated that it was inside an evaluation

  • 🔥 GPT-5.5 lied about completing an impossible programming task in 29% of samples — up from 7% for GPT-5.4 and 10% for GPT-5.3 Codex

  • Exceeded human baselines on all hard-difficulty sabotage tasks

UK AISI red team:

  • Solved a 32-step corporate-network attack estimated at 20 human-expert hours in 1 of 10 attempts. Prior models: zero

  • AISI: "may indicate autonomous end-to-end cyberattack capability against at least small-scale enterprise networks"

  • Found a universal jailbreak that elicited violative content across all malicious cyber queries, including in multi-turn agentic settings. Took 6 hours to develop. OpenAI patched; AISI couldn't verify the fix due to a "configuration issue."

The spicy hallucination paradox (Artificial Analysis AA-Omniscience): GPT-5.5 xhigh has the highest accuracy ever recorded (57%) — AND an 86% hallucination rate, vs 36% for Opus 4.7 and 50% for Gemini 3.1 Pro. Jake Handy's framing: "knows more, lies more."

For legal drafting, medical lit review, financial due diligence — this matters a lot.
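To see why the paradox bites, here is the arithmetic under one common reading of the AA-Omniscience metric (an assumption on my part: hallucination rate = the share of questions the model doesn't get right where it asserts a wrong answer instead of abstaining). The Opus and Gemini accuracies below are also assumed for illustration; the article only reports GPT-5.5's.

```python
# Why "knows more, lies more" matters for high-stakes work.
# Assumed metric reading: hallucination rate = share of not-correct
# questions answered confidently wrong rather than abstained on.

def confident_wrong_rate(accuracy: float, halluc_rate: float) -> float:
    """Fraction of ALL questions that get a confident wrong answer."""
    return (1.0 - accuracy) * halluc_rate

models = [("GPT-5.5 xhigh",  0.57, 0.86),   # accuracy from the article
          ("Opus 4.7",       0.50, 0.36),   # accuracy assumed, for illustration
          ("Gemini 3.1 Pro", 0.48, 0.50)]   # accuracy assumed, for illustration

for name, acc, hal in models:
    print(f"{name:15s} confident-wrong: {confident_wrong_rate(acc, hal):.0%}")
# -> GPT-5.5 ~37%, vs ~18% and ~26% under the assumed accuracies
```

Under that reading, the most accurate model on record would still hand you a confident wrong answer on roughly a third of all questions, which is exactly the failure mode legal, medical, and financial workflows can't absorb.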

🎬 The demos that went viral

  • Ethan Mollick's harbor town sim: one prompt → procedurally generated 3D simulation of a harbor town evolving from 3000 BCE to 3000 CE. GPT-5.5 Pro was the only model that actually modeled evolution (competitors just swapped buildings). 20 min vs GPT-5.4 Pro's 33 min.

  • Mollick's near-PhD paper from 4 prompts: STATA + CSV + XLS + Word files → full academic paper with real lit review and sophisticated stats. Mollick: "would have been very happy if this was a 2nd-year PhD project."

  • Mollick's 101-page illustrated tabletop RPG with playtest sim — from one prompt.

  • Algebraic geometry app in 11 minutes: Bartosz Naskręcki built a full interactive Riemann-Roch visualization from a single prompt.

  • Derya Unutmaz (Jackson Labs): analyzed a 62-sample, ~28,000-gene expression dataset in what he estimated would have been months of team work.

  • One dev tweeted that GPT-5.5 fixed a problem in three minutes that had held them up for four hours.

And Mollick's kicker: "It is a big deal. It is a big deal because it indicates that we are not done with the rapid improvement in AI."

💬 What builders are actually saying

  • Greg Brockman: "a new class of intelligence... a big step towards more agentic and intuitive computing"

  • Simon Willison: "A fast, effective and highly capable model. I ask it to build things and it builds exactly what I ask for."

  • Matt Shumer: "GPT-5.5 feels more Opusified." But flagged GPT-5.5 Pro as "a regression" for writing. On security: "GPT-5.5 found vulnerabilities in codebases that previous GPT models and Opus did not find. Everyone should be running this against their codebases."

  • Dan Shipper (Every): "the first coding model I've used that has serious conceptual clarity"

  • Leigh-Ann Russell (BNY CIO): "We are seeing a step change with this model." Deployed across 220+ BNY use cases.

  • Brandon White (Axiom Bio CEO): "If OpenAI keeps cooking like this, the foundations of drug discovery will change by the end of the year."

HN zinger of the day (@ativzzz): "I like that they waited for Opus 4.7 to come out first so they had a few days to find the benchmarks that GPT-5.5 is better at."

🎯 The read between the lines

Prediction markets aren't sold. Despite the launch, Polymarket's "Which company has best AI model by end of June?" market still has Anthropic at 51%. The release did not flip sentiment. That's significant.

But the roadmap breadcrumbs are loud:

  • Pachocki: "last two years were slow" + "extremely significant improvements in the medium term"

  • Leaked Codex checkpoints: "arcanine," "glacier-alpha," "glacier-alpha-block-cy3/cy4," "oai-2.1"

  • Polymarket has GPT-6 at 83% to ship by Dec 31, 2026

  • OpenAI's product division reportedly renamed "AGI Deployment"

  • Sora is reportedly shutting down April 26 — compute redirected to Spud

  • Altman teasing neural interfaces as a BCI side-project

The race is bifurcating:

  • Anthropic = research-tier depth + gated cyber play (Project Glasswing, ~$100M in credits to AWS/Apple/Cisco/CrowdStrike/Google/JPMC/MSFT/NVDA/Palo Alto)

  • OpenAI = distribution (900M weekly) + efficiency + product cadence

  • Google = no answer yet, I/O on May 19

  • Open-weights = 5×+ cheaper per token, but closing the capability gap more slowly than expected

🧠 The bottom line

GPT-5.5 isn't a capability explosion. It's an efficiency-and-agency release — the first OpenAI flagship pitched primarily as an agent runtime, not a chat model.

The wow moments (Ramsey proof, 20%-faster self-optimized infrastructure, 5× token-efficiency over Opus, 20-hour human-expert coding tasks, 84.9% GDPval across 44 occupations) are about doing real work autonomously, not passing harder tests.

But the "knows more, lies more" hallucination paradox, the direct losses to Claude Mythos on 5 of 6 benches, and the fact that prediction markets didn't budge — that's the balance sheet.

What I keep coming back to is Pachocki's line: "the last two years have been surprisingly slow."

Translation: buckle up.


See you in the next one.

— Wes
