Multimodal

Multimodality refers to a model's ability to understand and generate content across multiple data types, such as text, images, audio, and video. Multimodal models combine these data sources to interpret complex contexts, which enables more comprehensive and nuanced responses.

Jan 25, 2026
Audio, Evals, Responses, Speech
Realtime Eval Guide

Dec 16, 2025
Images, Vision
Gpt-image-1.5 Prompting Guide

Nov 20, 2025
Audio, Speech
Transcribing User Audio with a Separate Realtime Request

Aug 28, 2025
Audio, Responses, Speech
Realtime Prompting Guide

Jul 17, 2025
Images
Generate images with high input fidelity

Jul 15, 2025
Evals, Images
Using Evals API on Image Inputs

May 29, 2025
Audio, Speech
Practical guide to data-intensive apps with the Realtime API

May 16, 2025
Images, Responses, Vision
Image Understanding with RAG

May 10, 2025
Audio, Speech, Tiktoken
Context Summarization with Realtime API
