Today, we’re releasing gpt-realtime, our most capable speech-to-speech model yet in the API, and announcing the general availability of the Realtime API.

Speech-to-speech systems are essential for enabling voice as a core AI interface. The new release enhances robustness and usability, giving enterprises the confidence to deploy mission-critical voice agents at scale.

The new gpt-realtime model delivers stronger instruction following, more reliable tool calling, noticeably better voice quality, and an overall smoother feel. These gains make it practical to move from chained approaches to true realtime experiences, cutting latency and producing responses that sound more natural and expressive.

Realtime models benefit from prompting techniques that don't directly apply to text-based models. This prompting guide starts with a suggested prompt skeleton, then walks through each part with practical tips, small patterns you can copy, and examples you can adapt to your use case.

# !pip install ipython jupyterlab
from IPython.display import Audio, display

General Tips

Iterate relentlessly: Small wording changes can make or break behavior.
- Example: In the unclear-audio instruction, we swapped “inaudible” for “unintelligible”, which improved handling of noisy input.
Prefer bullets over paragraphs: Clear, short bullets outperform long paragraphs.
Guide with examples: The model closely follows sample phrases.
Be precise: Ambiguous or conflicting instructions degrade performance, as they do with GPT-5.
Control language: Pin output to a target language if you see unwanted language switching.
Reduce repetition: Add a Variety rule to reduce robotic phrasing.
Use capitalized text for emphasis: Capitalizing key rules makes them stand out and makes them easier for the model to follow.
Convert non-text rules to text: Instead of writing "IF x > 3 THEN ESCALATE", write "IF MORE THAN THREE FAILURES THEN ESCALATE".
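The last tip can be applied mechanically when rules live in code or config. The following is a minimal sketch, not part of the original guide: `rule_to_text`, `NUMBER_WORDS`, and `COMPARATORS` are hypothetical names, and the helper only covers small whole-number thresholds.

```python
# Hypothetical helper (not from the guide): renders a symbolic threshold
# rule as capitalized plain text that a speech model can follow.
NUMBER_WORDS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
                6: "six", 7: "seven", 8: "eight", 9: "nine", 10: "ten"}
COMPARATORS = {">": "more than", ">=": "at least",
               "<": "fewer than", "<=": "at most", "==": "exactly"}


def rule_to_text(subject: str, op: str, threshold: int, action: str) -> str:
    """Turn e.g. ("failures", ">", 3, "escalate") into a spoken-style rule."""
    if op not in COMPARATORS or threshold not in NUMBER_WORDS:
        raise ValueError("unsupported operator or threshold for this sketch")
    sentence = f"IF {COMPARATORS[op]} {NUMBER_WORDS[threshold]} {subject} THEN {action}"
    return sentence.upper()  # capitalized for emphasis, as suggested above


print(rule_to_text("failures", ">", 3, "escalate"))
```

Spelling out the number and the comparison keeps symbols and digits out of the prompt, which is the point of the tip.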

Prompt Structure

Organizing your prompt makes it easier for the model to understand context and stay consistent across turns. It also makes it easier for you to iterate on and modify problematic sections.

What it does: Use clear, labeled sections in your system prompt so the model can find and follow them. Keep each section focused on one thing.
How to adapt: Add domain-specific sections (e.g., Compliance, Brand Policy). Remove sections you don’t need (e.g., Reference Pronunciations if the model is not struggling with pronunciation).

Example

# Role & Objective        — who you are and what “success” means  
# Personality & Tone      — the voice and style to maintain  
# Context                 — retrieved context, relevant info
# Reference Pronunciations — phonetic guides for tricky words  
# Tools                   — names, usage rules, and preambles  
# Instructions / Rules    — do’s, don’ts, and approach  
# Conversation Flow       — states, goals, and transitions  
# Safety & Escalation     — fallback and handoff logic
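One way to keep the skeleton easy to iterate on is to store each section separately and join them at the end. This is only a sketch: `build_prompt` is a hypothetical helper, and the section bodies are illustrative placeholders rather than recommended wording.

```python
def build_prompt(sections: dict[str, str]) -> str:
    """Join labeled sections into one system prompt, skipping empty ones."""
    parts = []
    for title, body in sections.items():
        body = body.strip()
        if not body:
            continue  # drop sections you don't need
        parts.append(f"# {title}\n{body}")
    return "\n\n".join(parts)


# Illustrative content only; dict insertion order sets the section order.
sections = {
    "Role & Objective": "You are a customer service agent. Success means the caller's question is answered.",
    "Personality & Tone": "- Warm, concise, confident.",
    "Reference Pronunciations": "",  # left empty, so it is omitted
    "Safety & Escalation": "- IF MORE THAN THREE FAILURES THEN ESCALATE.",
}

prompt = build_prompt(sections)
print(prompt)
```

Keeping sections as separate entries means a problematic section can be edited or swapped without touching the rest of the prompt.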

Role and Objective

This section defines who the agent is and what “done” means. The examples show two different identities to demonstrate how tightly the model will adhere to role and objective when they’re explicit.

When to use: The model is not taking on the persona, role, or task scope you need.
What it does: Pins the identity of the voice agent so that its responses are conditioned on that role description.
How to adapt: Modify the role based on your use case.

Example (model takes on a specific accent)

# Role & Objective
You are a French Quebecois-speaking customer service bot. Your task is to answer the user's question.

This is the audio from our previous model, gpt-4o-realtime-preview-2025-06-03:

Audio("./data/audio/obj_06.mp3")

This is the audio from our new GA model, gpt-realtime:

Audio("./data/audio/obj_07.mp3")

Example (model takes on a character)

# Role & Objective
You are a high-energy game-show host guiding the caller to guess a secret number from 1 to 100 to win $1,000,000.

This is the audio from our previous model, gpt-4o-realtime-preview-2025-06-03:

Audio("./data/audio/obj_2_06.mp3")

This is the audio from our new GA model, gpt-realtime:

Audio("./data/audio/obj_2_07.mp3")

The new realtime model is able to better enact the role.
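To try a role prompt like the ones above, it has to be sent to the session as instructions. The sketch below only builds the JSON payload and does not open a connection. The event shape (a `session.update` client event carrying an `instructions` string) is an assumption about the Realtime API, and the `"type": "realtime"` field in particular should be checked against the current API reference. `make_session_update` and `ROLE_PROMPT` are hypothetical names.

```python
import json

# Shortened version of the game-show role from the example above.
ROLE_PROMPT = (
    "# Role & Objective\n"
    "You are a high-energy game-show host guiding the caller "
    "to guess a secret number from 1 to 100."
)


def make_session_update(instructions: str) -> str:
    """Serialize a session.update event that sets the system prompt.

    Field names are assumptions; verify them against the API reference.
    """
    event = {
        "type": "session.update",
        "session": {
            "type": "realtime",  # assumed to be required by the GA interface
            "instructions": instructions,
        },
    }
    return json.dumps(event)


payload = make_session_update(ROLE_PROMPT)
print(payload)
```

The resulting string would be sent over whatever transport you use for the session (for example a WebSocket connection), which is left out here.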

Personality and Tone

The new model snapshot is very good at following instructions to imitate a particular personality or tone. You can tailor the voice experience and delivery depending on what your use case expects.

When to use: Responses feel flat, overly verbose, or inconsistent across turns.
What it does: Sets voice, brevity, and pacing so replies sound natural and consistent.
How to adapt: Tune warmth/formality and default length. For regulated domains, favor neutral precision. Add other subsections that are relevant to your use case.
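As a rough illustration of tuning warmth, formality, and length, the sketch below renders a Personality & Tone section from a few parameters. The subsection names (`Personality`, `Tone`, `Length`, `Pacing`) and the sample wording are assumptions made for this example, and `personality_section` is a hypothetical helper.

```python
def personality_section(personality: str, tone: str, length: str, pacing: str = "") -> str:
    """Render a Personality & Tone section with one bullet per subsection."""
    lines = [
        "# Personality & Tone",
        "## Personality", f"- {personality}",
        "## Tone", f"- {tone}",
        "## Length", f"- {length}",
    ]
    if pacing:  # optional subsection, added only when provided
        lines += ["## Pacing", f"- {pacing}"]
    return "\n".join(lines)


# A warmer default for a consumer-facing agent (illustrative wording).
consumer = personality_section(
    "Friendly, calm customer service assistant.",
    "Warm, concise, confident.",
    "Two to three sentences per turn.",
)

# A neutral, precise variant for a regulated domain (illustrative wording).
regulated = personality_section(
    "Professional support agent for a regulated service.",
    "Neutral, precise, formal.",
    "One to two sentences per turn.",
    pacing="Speak at a steady, unhurried pace.",
)

print(consumer)
print(regulated)
```

Generating both variants from the same function makes it straightforward to compare how the model sounds under each setting.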

Aug 28, 2025

Realtime Prompting Guide