OpenAI announced GPT-5.5 and GPT-5.5 Pro on April 23, 2026, with API availability the following day. CEO branding aside (the "smartest and most intuitive to use model" yet), the release is most notable for the cadence — it lands only weeks after GPT-5.4 — and for the gap between the standard and Pro tiers.
## What it does
OpenAI is positioning GPT-5.5 as a step toward agentic computer use: writing and debugging code, online research, data analysis, document and spreadsheet creation, and operating across multiple tools to finish a task end-to-end.
## Context window and pricing
- Context window: 1M tokens
- GPT-5.5: $5 / 1M input tokens, $30 / 1M output tokens
- GPT-5.5 Pro: $30 / 1M input tokens, $180 / 1M output tokens
The Pro tier is 6× the input price and 6× the output price of standard 5.5 — a wider gap than past Pro/standard splits.
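The per-request impact of that split is easy to sketch. The prices below are the list prices from the announcement; the model-name strings and the workload numbers are hypothetical, chosen only to illustrate the arithmetic:

```python
# List prices from the announcement (USD per 1M tokens).
# The model-id strings here are illustrative, not official API names.
PRICES = {
    "gpt-5.5":     {"input": 5.0,  "output": 30.0},
    "gpt-5.5-pro": {"input": 30.0, "output": 180.0},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at list prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical agentic task: 200K tokens in, 20K tokens out.
std = request_cost("gpt-5.5", 200_000, 20_000)      # $1.60
pro = request_cost("gpt-5.5-pro", 200_000, 20_000)  # $9.60
```

Because both input and output carry the same 6× multiplier, the total for any token mix is exactly six times the standard price — there is no workload shape that softens the premium.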
## Availability
- GPT-5.5: ChatGPT Plus, Pro, Business, and Enterprise tiers, plus Codex
- GPT-5.5 Pro: Pro, Business, and Enterprise only
## The benchmark numbers
OpenAI's headline number is Terminal-Bench 2.0 — a test of multi-step command-line workflows with planning, iteration, and tool coordination — where GPT-5.5 hits a state-of-the-art 82.7%. The full head-to-head against Anthropic's week-earlier Opus 4.7 release is more interesting than the headline.
### Coding & terminal
| Benchmark | GPT-5.5 | Opus 4.7 |
|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 69.4% |
| Expert-SWE (OpenAI internal) | 73.1% | — |
| SWE-bench Pro | 58.6% | **64.3%**¹ |
| SWE-bench Verified | — | 87.6% |
### Agents & tool use
| Benchmark | GPT-5.5 | Opus 4.7 |
|---|---|---|
| MCP-Atlas (tool orchestration) | 75.3% | 79.1% |
| OSWorld-Verified (computer use) | 78.7% | 78.0% |
| BrowseComp (agentic search) | 84.4% | 79.3% |
| GDPval (knowledge work) | 84.9% | 80.3% |
| CyberGym (security) | 81.8% | 73.8% |
### Reasoning & math
| Benchmark | GPT-5.5 | Opus 4.7 |
|---|---|---|
| GPQA Diamond | 93.6% | 94.2% |
| ARC-AGI-2 | 85.0% | 75.8% |
| FrontierMath Tier 4 | 35.4% | 22.9% |
| Humanity's Last Exam (no tools) | 41.4% | 46.9% |
| Humanity's Last Exam (with tools) | 52.2% | 54.7% |
### Long context
| Benchmark | GPT-5.5 | Opus 4.7 |
|---|---|---|
| MRCR v2 (128K–256K) | 87.5% | 59.2% |
| MRCR v2 (512K–1M) | 74.0% | 32.2% |
¹ Anthropic's reported figure; OpenAI's announcement notes contamination concerns on this benchmark.
The pattern is clear once you stop reading row-by-row: GPT-5.5 wins on terminal/agentic command-line work, hardest math, and long-context retrieval. Opus 4.7 wins on real-world software engineering benchmarks (SWE-bench), graduate science reasoning, and knowledge-work evals on harder questions. Neither model dominates — they're trading wins by category.
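A quick tally over the fourteen benchmarks both models reported (using the numbers from the tables above, including Anthropic's contested SWE-bench Pro figure) makes the split concrete:

```python
# Head-to-head scores from the tables above: (GPT-5.5, Opus 4.7).
# Only benchmarks with figures for both models are included.
scores = {
    "Terminal-Bench 2.0":  (82.7, 69.4),
    "SWE-bench Pro":       (58.6, 64.3),
    "MCP-Atlas":           (75.3, 79.1),
    "OSWorld-Verified":    (78.7, 78.0),
    "BrowseComp":          (84.4, 79.3),
    "GDPval":              (84.9, 80.3),
    "CyberGym":            (81.8, 73.8),
    "GPQA Diamond":        (93.6, 94.2),
    "ARC-AGI-2":           (85.0, 75.8),
    "FrontierMath Tier 4": (35.4, 22.9),
    "HLE (no tools)":      (41.4, 46.9),
    "HLE (with tools)":    (52.2, 54.7),
    "MRCR v2 (128K-256K)": (87.5, 59.2),
    "MRCR v2 (512K-1M)":   (74.0, 32.2),
}
gpt_wins = sum(g > o for g, o in scores.values())    # 9
opus_wins = sum(o > g for g, o in scores.values())   # 5
```

A raw 9–5 win count flatters GPT-5.5, though: the margins matter more than the tally, and Opus 4.7's wins cluster in the software-engineering and science categories that many buyers weight heaviest.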
## How it scores by category
The third-party benchmark aggregator BenchLM places GPT-5.5 at #5 of 112 tracked models, with these category averages out of 100:
- Reasoning — 100.0 (MuSR, LongBench v2, MRCR v2, ARC-AGI-2)
- Agentic — 99.5 (Terminal-Bench 2.0, GAIA, TAU-bench, WebArena)
- Knowledge — 98.6 (GPQA, SuperGPQA, MMLU-Pro, HLE, FrontierScience, SimpleQA)
- Math — 97.7 (AIME 2025, MATH-500, FrontierMath, BRUNO 2025)
- Coding — 85.6 (SWE-bench Verified, LiveCodeBench, SWE-bench Pro, SciCode)
- Multimodal — 57.2 (MMMU-Pro, OfficeQA Pro, CharXiv)
The multimodal score is the surprise — much weaker than the rest of the model's profile, and the spot where Opus 4.7's vision improvements could open a real gap.
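BenchLM doesn't publish a single composite for this list, but an unweighted mean of the six category averages (a rough measure, my calculation rather than theirs) shows how much that one score drags:

```python
# BenchLM category averages for GPT-5.5, copied from the list above.
categories = {
    "Reasoning": 100.0, "Agentic": 99.5, "Knowledge": 98.6,
    "Math": 97.7, "Coding": 85.6, "Multimodal": 57.2,
}

# Unweighted mean, with and without the multimodal outlier.
overall = sum(categories.values()) / len(categories)
without_mm = sum(v for k, v in categories.items() if k != "Multimodal") / 5
print(round(overall, 1), round(without_mm, 1))  # 89.8 96.3
```

Roughly six and a half points of average score hinge on a single category — which is why a vision-focused rival release matters here.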
## Safety
OpenAI describes the launch as carrying "its strongest set of safeguards to date," with red-teaming, targeted cybersecurity and biology testing, and feedback from roughly 200 early-access partners.
## What we'd watch
For working professionals — particularly engineers, analysts, and writers — the question isn't whether 5.5 is better than 5.4 (it should be) but whether the Pro tier's price premium pays off for any workflow short of pure research. We'll have notes in the next Friday Dispatch.
Sources: openai.com/index/introducing-gpt-5-5 · GPT-5.5 system card · TechCrunch · benchmark aggregation via BenchLM · head-to-head via DigitalApplied