Multimodal

Multimodality is a model's ability to understand and generate content across multiple data types, such as text, images, audio, and video. Multimodal models combine these data sources to interpret complex contexts, enabling more comprehensive and nuanced responses.

May 29, 2025
Audio, speech
Practical guide to data-intensive apps with the Realtime API

May 16, 2025
Images, Responses, Vision
Image Understanding with RAG

May 10, 2025
Audio, speech, Tiktoken
Context Summarization with Realtime API

May 1, 2025
Audio, speech
ElatoAI - Realtime Speech AI Agents for ESP32 on Arduino

Apr 29, 2025
Agents SDK, Audio, speech
Comparing Speech-to-Text Methods with the OpenAI API

Apr 23, 2025
Images
Generate images with GPT Image

Apr 22, 2025
Responses, speech, Vision
Processing and narrating a video with GPT-4.1-mini's visual capabilities and GPT-4o TTS API

Mar 27, 2025
Audio, Responses, speech
Building a Voice Assistant with the Agents SDK

Mar 24, 2025
Audio, speech
Multi-Language One-Way Translation with the Realtime API
