Purpose: This notebook demonstrates how to use the Realtime model itself to transcribe user audio out-of-band over the same websocket session, avoiding the errors and inconsistencies that are common when relying on a separate transcription model (gpt-4o-transcribe or whisper-1).
We call this out-of-band transcription with the Realtime model: a separate Realtime model request that transcribes the user's audio outside the live conversation.
It covers how to build a server-to-server client that:
- Streams microphone audio to an OpenAI Realtime voice agent (a minimal sketch follows this list).
- Plays back the agent's spoken replies.
- After each user turn, generates a high-quality text-only transcript using the same Realtime model.
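As a rough sketch of the first bullet, microphone audio can be pushed to the session with `input_audio_buffer.append` events. The helper name, the assumption of an open `websockets` connection `ws`, and the PCM16 chunking are illustrative; they are not the notebook's full client.

```python
# Minimal sketch: send one chunk of raw PCM16 microphone audio to the
# Realtime session. `ws` is assumed to be an already-open websocket to the
# Realtime API; chunk size and audio capture are handled elsewhere.
import base64
import json

async def stream_mic_chunk(ws, pcm16_bytes: bytes) -> None:
    """Append one chunk of microphone audio to the session's input buffer."""
    await ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    }))
```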
The out-of-band transcript is generated via a secondary response.create request:
```json
{
  "type": "response.create",
  "response": {
    "conversation": "none",
    "output_modalities": ["text"],
    "instructions": transcription_instructions
  }
}
```
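As a minimal sketch, this request can be sent over the already-open session websocket once the user's turn has completed. The helper name, the `websockets`-style `ws.send` call, and the placeholder instructions string are assumptions for illustration; the payload mirrors the JSON above.

```python
# Sketch only: issue the out-of-band transcription request on the same
# websocket used for the live conversation. `ws` is assumed to be an open
# connection to the Realtime API; the instructions string is a placeholder.
import json

transcription_instructions = (
    "Transcribe the user's most recent audio turn verbatim. "
    "Return only the transcript text, with no extra commentary."
)

async def request_transcript(ws) -> None:
    await ws.send(json.dumps({
        "type": "response.create",
        "response": {
            "conversation": "none",          # keep this response out of session state
            "output_modalities": ["text"],   # text-only, no audio generated
            "instructions": transcription_instructions,
        },
    }))
```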
Using the Realtime model itself for transcription offers:
- Context-aware transcription: Uses the full session context to improve transcript accuracy.
- Non-intrusive: Runs outside the live conversation, so the transcript is never added back to session state.
- Customizable instructions: Transcription prompts can be tailored to specific use cases, and the Realtime model follows such instructions better than the dedicated transcription models (see the example below).
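For instance, the transcription prompt can encode use-case-specific rules. The wording below is purely illustrative, not part of the API or the notebook's required prompt.

```python
# Illustrative only: a use-case-tailored transcription prompt, e.g. for a
# customer-support agent. Every rule here is an example, not a requirement.
transcription_instructions = """\
Transcribe the user's latest audio turn exactly as spoken.
- Keep product and brand names verbatim.
- Write numbers, dates, and order IDs as digits.
- Mark unintelligible speech as [inaudible] rather than guessing.
Return only the transcript text.
"""
```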
