OpenAI announced GPT-5.5 and GPT-5.5 Pro on April 23, 2026, with API availability the following day. CEO branding aside ("smartest and most intuitive to use model" yet), the release is most notable for the cadence — it lands only weeks after GPT-5.4 — and for the gap between the standard and Pro tiers.

What it does

OpenAI is positioning GPT-5.5 as a step toward agentic computer use: writing and debugging code, online research, data analysis, document and spreadsheet creation, and operating across multiple tools to finish a task end-to-end.
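
For developers, the practical entry point is the API, which went live a day after the announcement. Below is a minimal sketch of the kind of agent loop OpenAI is describing, written against the OpenAI Python SDK's standard Chat Completions tool-calling shape. The model identifier `gpt-5.5` follows the product name but is unconfirmed, and the `run_shell` tool and its schema are hypothetical, for illustration only.

```python
import json
import subprocess
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical single tool: run a shell command and return its output.
# The tool name and schema are illustrative, not from OpenAI's docs.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return stdout and stderr.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

messages = [{"role": "user",
             "content": "List the five largest files in the current directory."}]

while True:
    # "gpt-5.5" is an assumed identifier; check OpenAI's model list.
    resp = client.chat.completions.create(
        model="gpt-5.5", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    if not msg.tool_calls:      # no further tool calls: the task is finished
        print(msg.content)
        break
    messages.append(msg)        # keep the assistant turn in the transcript
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        out = subprocess.run(args["command"], shell=True,
                             capture_output=True, text=True)
        messages.append({       # return the tool result to the model
            "role": "tool",
            "tool_call_id": call.id,
            "content": out.stdout + out.stderr,
        })
```

The loop pattern, not the specific tool, is the point: the model plans, calls out, reads the result, and iterates until it stops requesting tools.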

Context window and pricing

  • Context window: 1M tokens
  • GPT-5.5: $5 / 1M input tokens, $30 / 1M output tokens
  • GPT-5.5 Pro: $30 / 1M input tokens, $180 / 1M output tokens

The Pro tier is 6× the input price and 6× the output price of standard 5.5 — a wider gap than past Pro/standard splits.
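
The per-request math is easy to run yourself. A quick sketch at the listed rates (the token counts in the example are arbitrary):

```python
# Per-1M-token prices from the bullets above, in dollars.
PRICES = {
    "gpt-5.5":     {"input": 5.00,  "output": 30.00},
    "gpt-5.5-pro": {"input": 30.00, "output": 180.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-1M-token rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 200K-token context producing a 10K-token answer.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 200_000, 10_000):.2f}")
# gpt-5.5:     $1.30
# gpt-5.5-pro: $7.80  (6x on both sides, so 6x the total)
```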

Availability

  • GPT-5.5: ChatGPT Plus, Pro, Business, and Enterprise tiers, plus Codex
  • GPT-5.5 Pro: Pro, Business, and Enterprise only

The benchmark numbers

OpenAI's headline number is Terminal-Bench 2.0, a test of multi-step command-line workflows with planning, iteration, and tool coordination, where GPT-5.5 hits a state-of-the-art 82.7%. The full head-to-head against Anthropic's Opus 4.7, released a week earlier, is more interesting than the headline number.

Coding & terminal

| Benchmark | GPT-5.5 | Opus 4.7 |
|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 69.4% |
| Expert-SWE (OpenAI internal) | 73.1% | — |
| SWE-bench Pro | 58.6% | **64.3%**¹ |
| SWE-bench Verified | — | 87.6% |

Agents & tool use

| Benchmark | GPT-5.5 | Opus 4.7 |
|---|---|---|
| MCP-Atlas (tool orchestration) | 75.3% | 79.1% |
| OSWorld-Verified (computer use) | 78.7% | 78.0% |
| BrowseComp (agentic search) | 84.4% | 79.3% |
| GDPval (knowledge work) | 84.9% | 80.3% |
| CyberGym (security) | 81.8% | 73.8% |

Reasoning & math

| Benchmark | GPT-5.5 | Opus 4.7 |
|---|---|---|
| GPQA Diamond | 93.6% | 94.2% |
| ARC-AGI-2 | 85.0% | 75.8% |
| FrontierMath Tier 4 | 35.4% | 22.9% |
| Humanity's Last Exam (no tools) | 41.4% | 46.9% |
| Humanity's Last Exam (with tools) | 52.2% | 54.7% |

Long context

| Benchmark | GPT-5.5 | Opus 4.7 |
|---|---|---|
| MRCR v2 (128K–256K) | 87.5% | 59.2% |
| MRCR v2 (512K–1M) | 74.0% | 32.2% |

¹ Anthropic's reported figure; OpenAI's announcement notes contamination concerns on this benchmark.

The pattern is clear once you stop reading row-by-row: GPT-5.5 wins on terminal and agentic command-line work, the hardest math (FrontierMath Tier 4), and long-context retrieval. Opus 4.7 wins on real-world software engineering (SWE-bench), graduate-level science reasoning (GPQA Diamond), and Humanity's Last Exam in both configurations. Neither model dominates; they trade wins by category.

How it scores by category

The third-party benchmark aggregator BenchLM places GPT-5.5 at #5 of 112 tracked models, with these category averages out of 100:

  • Reasoning — 100.0 (MuSR, LongBench v2, MRCR v2, ARC-AGI-2)
  • Agentic — 99.5 (Terminal-Bench 2.0, GAIA, TAU-bench, WebArena)
  • Knowledge — 98.6 (GPQA, SuperGPQA, MMLU-Pro, HLE, FrontierScience, SimpleQA)
  • Math — 97.7 (AIME 2025, MATH-500, FrontierMath, BRUNO 2025)
  • Coding — 85.6 (SWE-bench Verified, LiveCodeBench, SWE-bench Pro, SciCode)
  • Multimodal — 57.2 (MMMU-Pro, OfficeQA Pro, CharXiv)

The multimodal score is the surprise — much weaker than the rest of the model's profile, and the spot where Opus 4.7's vision improvements could open a real gap.
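
BenchLM's aggregation method isn't spelled out in its listing, so as a back-of-the-envelope check only (not BenchLM's methodology), an unweighted mean of the published category scores shows how much that one category drags the profile:

```python
# Category averages as published by BenchLM (out of 100).
scores = {
    "Reasoning": 100.0, "Agentic": 99.5, "Knowledge": 98.6,
    "Math": 97.7, "Coding": 85.6, "Multimodal": 57.2,
}

mean_all = sum(scores.values()) / len(scores)
mean_ex_mm = sum(v for k, v in scores.items()
                 if k != "Multimodal") / (len(scores) - 1)

print(f"unweighted mean, all six categories: {mean_all:.1f}")    # 89.8
print(f"unweighted mean, ex-multimodal:      {mean_ex_mm:.1f}")  # 96.3
```

One category costs the model roughly six and a half points of unweighted average, which is consistent with multimodal being the place a competitor could open a real gap.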

Safety

OpenAI describes the launch as carrying "its strongest set of safeguards to date," with red-teaming, targeted cybersecurity and biology testing, and feedback from roughly 200 early-access partners.

What we'd watch

For working professionals — particularly engineers, analysts, and writers — the question isn't whether 5.5 is better than 5.4 (it should be) but whether the Pro tier's price premium pays off for any workflow short of pure research. We'll have notes in the next Friday Dispatch.

Sources: openai.com/index/introducing-gpt-5-5 · GPT-5.5 system card · TechCrunch · benchmark aggregation via BenchLM · head-to-head via DigitalApplied