Multimodal

Multimodality refers to a model's ability to understand and generate content across multiple data types, such as text, images, audio, and video. Multimodal models combine these data sources to interpret complex contexts, enabling more comprehensive and nuanced responses.

Dec 16, 2025
Images, Vision
Gpt-image-1.5 Prompting Guide

Nov 20, 2025
Audio, speech
Transcribing User Audio with a Separate Realtime Request

Aug 28, 2025
Audio, Responses, speech
Realtime Prompting Guide

Jul 17, 2025
Images
Generate images with high input fidelity

Jul 15, 2025
Evals, Images
Using Evals API on Image Inputs

May 29, 2025
Audio, speech
Practical guide to data-intensive apps with the Realtime API

May 16, 2025
Images, Responses, Vision
Image Understanding with RAG

May 10, 2025
Audio, speech, Tiktoken
Context Summarization with Realtime API

May 1, 2025
Audio, speech
ElatoAI - Realtime Speech AI Agents for ESP32 on Arduino
