A benchmark is a fixed set of tasks used to score an AI model so it can be compared to others — coding challenges, exam questions, reasoning puzzles, agentic tasks. Every model launch leads with them: "88% on this, state-of-the-art on that." They're the closest thing the field has to a common yardstick.

They're also easy to over-read. A benchmark measures performance on its tasks, which may look nothing like your work. Models can be tuned to score well on famous tests ("teaching to the test"), and a single percentage hides where a model is strong or brittle.

Why it matters at your desk. Benchmarks are how you make sense of the firehose of model news: the tables in launches like Opus 4.8, GPT-5.5, and Grok 4.3 are benchmark results. The literacy move is to ask which benchmark is being quoted and whether its tasks resemble yours. A jump on a coding benchmark means little to a lawyer, while an "all-pass" legal eval (a document counts only if every clause is correct, not just most) maps far better to real document work. For research claims, tools like Consensus help you check whether a result actually replicates.
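
To make the "all-pass" idea concrete, here is a minimal sketch of how the same graded results give two very different scores depending on how they are counted. The document names, the clause grades, and both function names are made up for illustration; they do not come from any real benchmark.

```python
# Hypothetical grading results: one list per document, one True/False per clause.
results = {
    "contract_a": [True, True, True, True],
    "contract_b": [True, True, False, True],
    "contract_c": [True, True, True, False],
}

def per_clause_accuracy(results):
    """Share of individual clauses graded correct, pooled across all documents."""
    clauses = [ok for doc in results.values() for ok in doc]
    return sum(clauses) / len(clauses)

def all_pass_rate(results):
    """Share of documents in which every single clause was graded correct."""
    return sum(all(doc) for doc in results.values()) / len(results)

print(f"per-clause accuracy: {per_clause_accuracy(results):.0%}")  # 83%
print(f"all-pass rate:       {all_pass_rate(results):.0%}")        # 33%
```

In this toy data the model gets 10 of 12 clauses right, which sounds strong, yet only one of the three documents is usable without correction. That gap is why the stricter count tends to track real document work more closely.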

What to watch for: treat benchmarks as directional, not definitive. The only benchmark that fully counts is your own task — pilot a model on your real work before trusting the leaderboard.