Multimodal describes an AI that handles more than one kind of input or output: text plus images, audio, or video. An earlier generation of models only read and wrote words. A multimodal model can look at a screenshot, listen to a recording, or watch a clip and respond in kind, treating a picture or a sound as just another thing it can reason about.

This is less a feature than a change in what counts as a question. You no longer have to describe the chart, the X-ray, or the mockup in words — you show it.

Why it matters at your desk. For a designer, multimodal is what lets Figma Make turn a visual idea into a working layout. For anyone working in audio and video, it is what lets Descript treat a recording as editable material rather than an opaque file. The frontier is moving fast toward real-time interaction: Gemini's live audio preview points at assistants you can simply talk to and show things to, in the moment.

What to watch for. A model reading an image is not the same as a model understanding it correctly. Multimodal models hallucinate too, and a confident misread of a medical scan or a contract screenshot carries the same risk as a confident wrong sentence. Give what the model sees the same verification you give what it says.