Nov 20, 2025

Transcribing User Audio with a Separate Realtime Request

Purpose: This notebook demonstrates how to use the Realtime model itself to accurately transcribe user audio out-of-band over the same WebSocket session, avoiding the errors and inconsistencies that are common when relying on a separate transcription model (gpt-4o-transcribe / whisper-1).

We call this out-of-band transcription with the Realtime model: a separate Realtime model request that transcribes the user's audio outside the live Realtime conversation.

It covers how to build a server-to-server client that:

  • Streams microphone audio to an OpenAI Realtime voice agent.
  • Plays back the agent's spoken replies.
  • After each user turn, generates a high-quality text-only transcript using the same Realtime model.

This is achieved via a secondary response.create request:

{
    "type": "response.create",
    "response": {
        "conversation": "none",
        "output_modalities": ["text"],
        "instructions": transcription_instructions
    }
}

This notebook demonstrates using the Realtime model itself for transcription:

  • Context-aware transcription: Uses the full session context to improve transcript accuracy.
  • Non-intrusive: Runs outside the live conversation, so the transcript is never added back to session state.
  • Customizable instructions: Lets you tailor the transcription prompt to your specific use case; the Realtime model follows instructions better than the dedicated transcription model.

1. Why use out-of-band transcription?

The Realtime API offers built-in user input transcription, but this relies on a separate ASR model (e.g., gpt-4o-transcribe). Using different models for transcription and response generation can lead to discrepancies. For example:

  • User speech transcribed as: I had otoo accident
  • Realtime response interpreted correctly as: Got it, you had an auto accident

Accurate transcriptions can be very important, particularly when:

  • Transcripts trigger downstream actions (e.g., tool calls), where errors propagate through the system.
  • Transcripts are summarized or passed to other components, risking context pollution.
  • Transcripts are displayed to end users, leading to poor user experiences if errors occur.

The potential advantages of using out-of-band transcription include:

  • Reduced Mismatch: The same model is used for both transcription and generation, minimizing inconsistencies between what the user says and how the agent responds.
  • Greater Steerability: The Realtime model is more steerable, can better follow custom instructions for higher transcription quality, and is not limited by a 1024-token input maximum.
  • Session Context Awareness: The model has access to the full session context, so, for example, if you mention your name multiple times, it will transcribe it correctly.

In terms of trade-offs:

  • Realtime Model (for transcription):

    • Audio Input → Text Output: $32.00 per 1M audio tokens + $16.00 per 1M text tokens out.

    • Cached Session Context: $0.40 per 1M cached context tokens (typically negligible).

    • Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $48.00

  • GPT-4o Transcription:

    • Audio Input: $6.00 per 1M audio tokens

    • Text Input: $2.50 per 1M tokens (prompt capped at 1024 tokens, so effectively negligible)

    • Text Output: $10.00 per 1M tokens

    • Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $16.00

  • Direct Cost Comparison:

    • Realtime Transcription: ~$48.00

    • GPT-4o Transcription: ~$16.00

    • Absolute Difference: $48.00 − $16.00 = $32.00

    • Cost Ratio: $48.00 / $16.00 = 3×

    Note: Costs related to cached session context ($0.40 per 1M tokens) and the capped text input tokens for GPT-4o ($2.50 per 1M tokens) are negligible and therefore excluded from the calculations above. A minimal cost-calculator sketch follows this list.

  • Other Considerations:

    • Implementing transcription via the realtime model might be slightly more complex compared to using the built-in GPT-4o transcription option through the Realtime API.
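
If it helps to sanity-check the arithmetic, here is a minimal sketch using the per-1M-token prices quoted above (verify against current pricing before relying on these numbers):

# Back-of-the-envelope comparison using the per-1M-token prices listed above.
REALTIME_AUDIO_IN, REALTIME_TEXT_OUT = 32.00, 16.00
TRANSCRIBE_AUDIO_IN, TRANSCRIBE_TEXT_OUT = 6.00, 10.00

realtime_total = REALTIME_AUDIO_IN + REALTIME_TEXT_OUT        # ~$48 per 1M audio tokens in + 1M text tokens out
transcribe_total = TRANSCRIBE_AUDIO_IN + TRANSCRIBE_TEXT_OUT  # ~$16 per 1M audio tokens in + 1M text tokens out
print(realtime_total, transcribe_total, realtime_total / transcribe_total)  # 48.0 16.0 3.0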

Note: Out-of-band responses using the realtime model can be used for other use cases beyond user turn transcription. Examples include generating structured summaries, triggering background actions, or performing validation tasks without affecting the main conversation.
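
For example, a structured-summary request could follow the same out-of-band pattern as the transcription request; the instructions string and metadata label below are only illustrative:

summary_request = {
    "type": "response.create",
    "response": {
        "conversation": "none",                            # out-of-band: nothing is written back to the session
        "output_modalities": ["text"],
        "metadata": {"purpose": "Conversation summary"},   # hypothetical label, used only for log filtering
        "instructions": "Summarize the conversation so far in three short bullet points.",
    },
}
# await ws.send(json.dumps(summary_request))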


2. Requirements & Setup

Ensure your environment meets these requirements:

  1. Python 3.10 or later

  2. PortAudio (required by sounddevice):

    • macOS:
      brew install portaudio
  3. Python Dependencies:

    pip install sounddevice websockets
  4. OpenAI API Key (with Realtime API access): Set your key as an environment variable:

    export OPENAI_API_KEY=sk-...
#!pip install sounddevice websockets
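
Optionally, run a quick sanity check to confirm the API key is set and that sounddevice can see your audio devices:

# Optional sanity check: API key present and audio devices visible.
import os
import sounddevice as sd

assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY before running this notebook."
print(sd.query_devices())  # confirm a microphone (input) and speakers (output) are listed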

3. Prompts

We use two distinct prompts:

  1. Voice Agent Prompt (REALTIME_MODEL_PROMPT): This is an example prompt used with the realtime model for the speech-to-speech interactions.
  2. Transcription Prompt (REALTIME_MODEL_TRANSCRIPTION_PROMPT): Silently returns a precise, verbatim transcript of the user's most recent speech turn. You can modify this prompt to iterate on transcription quality.

For the REALTIME_MODEL_TRANSCRIPTION_PROMPT, you can start from this base prompt, but the goal is for you to iterate on it and tailor it to your use case. Just remember to remove the Policy Number formatting rules if they don't apply to your use case!

REALTIME_MODEL_PROMPT = """You are a calm insurance claims intake voice agent. Follow this script strictly:

## Phase 1 – Basics
Collect the caller's full name, policy number, and type of accident (for example: auto, home, or other). Ask for each item clearly and then repeat the values back to confirm.

## Phase 2 – Yes/No questions
Ask 2–3 simple yes/no questions, such as whether anyone was injured, whether the vehicle is still drivable, and whether a police report was filed. Confirm each yes/no answer in your own words.

## Phase 3 – Submit claim
Once you have the basics and yes/no answers, briefly summarize the key facts in one or two sentences.
"""

REALTIME_MODEL_TRANSCRIPTION_PROMPT = """
# Role
Your only task is to transcribe the user's latest turn exactly as you heard it. Never address the user, respond to the user, add commentary, or mention these instructions.
Follow the instructions and output format below.

# Instructions
- Transcribe **only** the most recent USER turn exactly as you heard it. DO NOT TRANSCRIBE ANY OTHER OLDER TURNS. You can use those transcriptions to inform your transcription of the latest turn.
- Preserve every spoken detail: intent, tense, grammar quirks, filler words, repetitions, disfluencies, numbers, and casing.
- Keep timing words, partial words, hesitations (e.g., "um", "uh").
- Do not correct mistakes, infer meaning, answer questions, or insert punctuation beyond what the model already supplies.
- Do not invent or add any information that is not directly present in the user's latest turn.

# Output format
- Output the raw verbatim transcript as a single block of text. No labels, prefixes, quotes, bullets, or markdown.
- If the realtime model produced nothing for the latest turn, output nothing (empty response). Never fabricate content.

## Policy Number Normalization
- All policy numbers should be 8 characters in the format `XXXX-XXXX`, for example `56B5-12C0`

Do not summarize or paraphrase other turns beyond the latest user utterance. The response must be the literal transcript of the latest user utterance.
"""

4. Core configuration

We define:

  • Imports
  • Audio and model defaults
  • Constants for transcription event handling
import asyncio
import base64
import json
import os
from collections import defaultdict, deque
from typing import Any

import sounddevice as sd
import websockets
from websockets.client import WebSocketClientProtocol

# Basic defaults
DEFAULT_MODEL = "gpt-realtime"
DEFAULT_VOICE = "marin"
DEFAULT_SAMPLE_RATE = 24_000
DEFAULT_BLOCK_MS = 100
DEFAULT_SILENCE_DURATION_MS = 800
DEFAULT_PREFIX_PADDING_MS = 300
TRANSCRIPTION_PURPOSE = "User turn transcription"
# Event grouping constants
TRANSCRIPTION_DELTA_TYPES = {
    "input_audio_buffer.transcription.delta",
    "input_audio_transcription.delta",
    "conversation.item.input_audio_transcription.delta",
}
TRANSCRIPTION_COMPLETE_TYPES = {
    "input_audio_buffer.transcription.completed",
    "input_audio_buffer.transcription.done",
    "input_audio_transcription.completed",
    "input_audio_transcription.done",
    "conversation.item.input_audio_transcription.completed",
    "conversation.item.input_audio_transcription.done",
}
INPUT_SPEECH_END_EVENT_TYPES = {
    "input_audio_buffer.speech_stopped",
    "input_audio_buffer.committed",
}
RESPONSE_AUDIO_DELTA_TYPES = {
    "response.output_audio.delta",
    "response.audio.delta",
}
RESPONSE_TEXT_DELTA_TYPES = {
    "response.output_text.delta",
    "response.text.delta",
}
RESPONSE_AUDIO_TRANSCRIPT_DELTA_TYPES = {
    "response.output_audio_transcript.delta",
    "response.audio_transcript.delta",
}

5. Building the Realtime session & the out‑of‑band request

The Realtime session (session.update) configures:

  • Audio input/output
  • Server‑side VAD
  • Built-in input transcription (input_audio_transcription_model)
    • We enable this so that we can compare it against the realtime model transcription

The out‑of‑band transcription is a response.create triggered after the user's input audio is committed (input_audio_buffer.committed):

  • conversation: "none" – use the session state as context, but don’t write anything back to the main conversation
  • output_modalities: ["text"] – get a text transcript only

Note: The REALTIME_MODEL_TRANSCRIPTION_PROMPT is not passed to the gpt-4o-transcribe model because the Realtime API enforces a 1024-token maximum for transcription prompts.

def build_session_update(
    instructions: str,
    voice: str,
    vad_threshold: float,
    silence_duration_ms: int,
    prefix_padding_ms: int,
    idle_timeout_ms: int | None,
    input_audio_transcription_model: str | None = None,
) -> dict[str, object]:
    """Configure the Realtime session: audio in/out, server VAD, etc."""

    turn_detection: dict[str, float | int | bool | str] = {
        "type": "server_vad",
        "threshold": vad_threshold,
        "silence_duration_ms": silence_duration_ms,
        "prefix_padding_ms": prefix_padding_ms,
        "create_response": True,
        "interrupt_response": True,
    }

    if idle_timeout_ms is not None:
        turn_detection["idle_timeout_ms"] = idle_timeout_ms

    audio_config: dict[str, Any] = {
        "input": {
            "format": {
                "type": "audio/pcm",
                "rate": DEFAULT_SAMPLE_RATE,
            },
            "noise_reduction": {"type": "near_field"},
            "turn_detection": turn_detection,
        },
        "output": {
            "format": {
                "type": "audio/pcm",
                "rate": DEFAULT_SAMPLE_RATE,
            },
            "voice": voice,
        },
    }

    # Optional: built-in transcription model for comparison
    if input_audio_transcription_model:
        audio_config["input"]["transcription"] = {
            "model": input_audio_transcription_model,
        }

    session: dict[str, object] = {
        "type": "realtime",
        "output_modalities": ["audio"],
        "instructions": instructions,
        "audio": audio_config,
    }

    return {
        "type": "session.update",
        "session": session,
    }


def build_transcription_request(transcription_instructions: str) -> dict[str, object]:
    """Ask the SAME Realtime model for an out-of-band transcript of the latest user turn."""

    return {
        "type": "response.create",
        "response": {
            "conversation": "none",  # <--- out-of-band
            "output_modalities": ["text"],
            "metadata": {"purpose": TRANSCRIPTION_PURPOSE}, # <--- we add metadata so it is easier to identify the event in the logs
            "instructions": transcription_instructions,
        },
    }

6. Audio streaming: mic → Realtime → speakers

We now define:

  • encode_audio – base64 helper
  • playback_audio – play assistant audio on the default output device
  • send_audio_from_queue – send buffered mic audio to input_audio_buffer
  • stream_microphone_audio – capture PCM16 from the mic and feed the queue
def encode_audio(chunk: bytes) -> str:
    """Base64-encode a PCM audio chunk for WebSocket transport."""
    return base64.b64encode(chunk).decode("utf-8")


async def playback_audio(
    playback_queue: asyncio.Queue,
    stop_event: asyncio.Event,
) -> None:
    """Stream assistant audio back to the speakers in (near) real time."""

    try:
        with sd.RawOutputStream(
            samplerate=DEFAULT_SAMPLE_RATE,
            channels=1,
            dtype="int16",
        ) as stream:
            while not stop_event.is_set():
                chunk = await playback_queue.get()
                if chunk is None:
                    break
                try:
                    stream.write(chunk)
                except Exception as exc:
                    print(f"Audio playback error: {exc}", flush=True)
                    break
    except Exception as exc:
        print(f"Failed to open audio output stream: {exc}", flush=True)


async def send_audio_from_queue(
    ws: WebSocketClientProtocol,
    queue: asyncio.Queue[bytes | None],
    stop_event: asyncio.Event,
) -> None:
    """Push raw PCM chunks into input_audio_buffer via the WebSocket."""

    while not stop_event.is_set():
        chunk = await queue.get()
        if chunk is None:
            break
        encoded_chunk = encode_audio(chunk)
        message = {"type": "input_audio_buffer.append", "audio": encoded_chunk}
        await ws.send(json.dumps(message))

    if not ws.closed:
        commit_payload = {"type": "input_audio_buffer.commit"}
        await ws.send(json.dumps(commit_payload))


async def stream_microphone_audio(
    ws: WebSocketClientProtocol,
    stop_event: asyncio.Event,
    shared_state: dict,
    block_ms: int = DEFAULT_BLOCK_MS,
) -> None:
    """Capture live microphone audio and send it to the realtime session."""

    loop = asyncio.get_running_loop()
    audio_queue: asyncio.Queue[bytes | None] = asyncio.Queue()
    blocksize = int(DEFAULT_SAMPLE_RATE * (block_ms / 1000))

    def on_audio(indata, frames, time_info, status):  # type: ignore[override]
        """Capture a mic callback chunk and enqueue it unless the mic is muted."""
        if status:
            print(f"Microphone status: {status}", flush=True)
        # Simple echo protection: mute mic when assistant is talking
        if not stop_event.is_set() and not shared_state.get("mute_mic", False):
            data = bytes(indata)
            loop.call_soon_threadsafe(audio_queue.put_nowait, data)

    print(
        f"Streaming microphone audio at {DEFAULT_SAMPLE_RATE} Hz (mono). "
        "Speak naturally; server VAD will stop listening when you pause."
    )
    sender = asyncio.create_task(send_audio_from_queue(ws, audio_queue, stop_event))

    with sd.RawInputStream(
        samplerate=DEFAULT_SAMPLE_RATE,
        blocksize=blocksize,
        channels=1,
        dtype="int16",
        callback=on_audio,
    ):
        await stop_event.wait()

    await audio_queue.put(None)
    await sender

7. Extracting and comparing transcripts

The function below enables us to generate two transcripts for each user turn:

  • Realtime model transcript: from our out-of-band response.create call.
  • Built-in ASR transcript: from the standard transcription model (input_audio_transcription_model).

We align and display both clearly in the terminal:

=== User turn (Realtime transcript) ===
...

=== User turn (Transcription model) ===
...
def flush_pending_transcription_prints(shared_state: dict) -> None:
    """Whenever we've printed a realtime transcript, print the matching transcription-model output."""

    pending_prints: deque | None = shared_state.get("pending_transcription_prints")
    input_transcripts: deque | None = shared_state.get("input_transcripts")

    if not pending_prints or not input_transcripts:
        return

    while pending_prints and input_transcripts:
        comparison_text = input_transcripts.popleft()
        pending_prints.popleft()
        print("=== User turn (Transcription model) ===")
        if comparison_text:
            print(comparison_text, flush=True)
            print()
        else:
            print("<not available>", flush=True)
            print()
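
To see the pairing behavior in isolation, here is a small illustrative snippet (the deque contents are made up):

# Illustrative only: one realtime transcript has already been printed, and one built-in
# ASR transcript has arrived, so the function prints the matching comparison block.
from collections import deque

demo_state = {
    "pending_transcription_prints": deque([object()]),
    "input_transcripts": deque(["My policy number is X077-B025."]),
}
flush_pending_transcription_prints(demo_state)
# === User turn (Transcription model) ===
# My policy number is X077-B025.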

8. Listening for Realtime events

listen_for_events drives the session:

  • Watches for speech_started / speech_stopped / committed
  • Sends the out‑of‑band transcription request when a user turn finishes (input_audio_buffer.committed)
  • Streams assistant audio to the playback queue
  • Buffers text deltas per response_id
async def listen_for_events(
    ws: WebSocketClientProtocol,
    stop_event: asyncio.Event,
    transcription_instructions: str,
    max_turns: int | None,
    playback_queue: asyncio.Queue,
    shared_state: dict,
) -> None:
    """Print assistant text + transcripts and coordinate mic muting."""

    responses: dict[str, dict[str, bool]] = {}
    buffers: defaultdict[str, str] = defaultdict(str)
    transcription_model_buffers: defaultdict[str, str] = defaultdict(str)
    completed_main_responses = 0
    awaiting_transcription_prompt = False
    input_transcripts = shared_state.setdefault("input_transcripts", deque())
    pending_transcription_prints = shared_state.setdefault(
        "pending_transcription_prints", deque()
    )

    async for raw in ws:
        if stop_event.is_set():
            break

        message = json.loads(raw)
        message_type = message.get("type")

        # --- User speech events -------------------------------------------------
        if message_type == "input_audio_buffer.speech_started":
            print("\n[client] Speech detected; streaming...", flush=True)
            awaiting_transcription_prompt = True

        elif message_type in INPUT_SPEECH_END_EVENT_TYPES:
            if message_type == "input_audio_buffer.speech_stopped":
                print("[client] Detected silence; preparing transcript...", flush=True)

            # This is where the out-of-band transcription request is sent. <-------
            if awaiting_transcription_prompt:
                request_payload = build_transcription_request(
                    transcription_instructions
                )
                await ws.send(json.dumps(request_payload))
                awaiting_transcription_prompt = False

        # --- Built-in transcription model stream -------------------------------
        elif message_type in TRANSCRIPTION_DELTA_TYPES:
            buffer_id = message.get("buffer_id") or message.get("item_id") or "default"
            delta_text = (
                message.get("delta")
                or (message.get("transcription") or {}).get("text")
                or ""
            )
            if delta_text:
                transcription_model_buffers[buffer_id] += delta_text

        elif message_type in TRANSCRIPTION_COMPLETE_TYPES:
            buffer_id = message.get("buffer_id") or message.get("item_id") or "default"
            final_text = (
                (message.get("transcription") or {}).get("text")
                or message.get("transcript")
                or ""
            )
            if not final_text:
                final_text = transcription_model_buffers.pop(buffer_id, "").strip()
            else:
                transcription_model_buffers.pop(buffer_id, None)

            if not final_text:
                item = message.get("item")
                if item:
                    final_text = item.get("transcription")
                final_text = final_text or ""

            final_text = final_text.strip()
            if final_text:
                input_transcripts.append(final_text)
                flush_pending_transcription_prints(shared_state)

        # --- Response lifecycle (Realtime model) --------------------------------
        elif message_type == "response.created":
            response = message.get("response", {})
            response_id = response.get("id")
            metadata = response.get("metadata") or {}
            responses[response_id] = {
                "is_transcription": metadata.get("purpose") == TRANSCRIPTION_PURPOSE,
                "done": False,
            }

        elif message_type in RESPONSE_AUDIO_DELTA_TYPES:
            response_id = message.get("response_id")
            if response_id is None:
                continue
            b64_audio = message.get("delta") or message.get("audio")
            if not b64_audio:
                continue
            try:
                audio_chunk = base64.b64decode(b64_audio)
            except Exception:
                continue

            if (
                response_id in responses
                and not responses[response_id]["is_transcription"]
            ):
                shared_state["mute_mic"] = True

            await playback_queue.put(audio_chunk)

        elif message_type in RESPONSE_TEXT_DELTA_TYPES:
            response_id = message.get("response_id")
            if response_id is None:
                continue
            buffers[response_id] += message.get("delta", "")
            

        elif message_type in RESPONSE_AUDIO_TRANSCRIPT_DELTA_TYPES:
            response_id = message.get("response_id")
            if response_id is None:
                continue
            buffers[response_id] += message.get("delta", "")        

        elif message_type == "response.done":
            response = message.get("response", {})
            response_id = response.get("id")
            if response_id is None:
                continue
            if response_id not in responses:
                responses[response_id] = {"is_transcription": False, "done": False}
            responses[response_id]["done"] = True

            is_transcription = responses[response_id]["is_transcription"]
            text = buffers.get(response_id, "").strip()
            if text:
                if is_transcription:
                    print("\n=== User turn (Realtime transcript) ===")
                    print(text, flush=True)
                    print()
                    pending_transcription_prints.append(object())
                    flush_pending_transcription_prints(shared_state)
                else:
                    print("\n=== Assistant response ===")
                    print(text, flush=True)
                    print()

            if not is_transcription:
                shared_state["mute_mic"] = False
                completed_main_responses += 1

                if max_turns is not None and completed_main_responses >= max_turns:
                    stop_event.set()
                    break

        elif message_type == "error":
            print(f"Error from server: {message}")

        else:
            pass

        await asyncio.sleep(0)

9. Run Script

In this step, we run the code, which lets us compare the realtime model transcription with the transcription model's output. The code does the following:

  • Loads configuration and prompts
  • Establishes a WebSocket connection
  • Starts concurrent tasks:
    • listen_for_events (handles incoming messages, mutes the mic while the assistant is speaking, and prints the realtime and transcription model transcripts once both are available, using shared_state to pair them)
    • stream_microphone_audio (sends microphone audio)
    • playback_audio (plays assistant responses)
  • Runs the session until you interrupt it

Output should look like:

[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...
 
=== User turn (Realtime transcript) ===
Hello.
 
=== User turn (Transcription model) ===
Hello
 
 
=== Assistant response ===
Hello, and thank you for calling. Let's start with your full name, please.
async def run_realtime_session(
    api_key: str | None = None,
    server: str = "wss://api.openai.com/v1/realtime",
    model: str = DEFAULT_MODEL,
    voice: str = DEFAULT_VOICE,
    instructions: str = REALTIME_MODEL_PROMPT,
    transcription_instructions: str = REALTIME_MODEL_TRANSCRIPTION_PROMPT,
    input_audio_transcription_model: str | None = "gpt-4o-transcribe",
    silence_duration_ms: int = DEFAULT_SILENCE_DURATION_MS,
    prefix_padding_ms: int = DEFAULT_PREFIX_PADDING_MS,
    vad_threshold: float = 0.6,
    idle_timeout_ms: int | None = None,
    max_turns: int | None = None,
    timeout_seconds: int = 0,
) -> None:
    """Connect to the Realtime API, stream audio both ways, and print transcripts."""
    api_key = api_key or os.environ.get("OPENAI_API_KEY")
    ws_url = f"{server}?model={model}"
    headers = {
        "Authorization": f"Bearer {api_key}",
    }

    session_update_payload = build_session_update(
        instructions=instructions,
        voice=voice,
        vad_threshold=vad_threshold,
        silence_duration_ms=silence_duration_ms,
        prefix_padding_ms=prefix_padding_ms,
        idle_timeout_ms=idle_timeout_ms,
        input_audio_transcription_model=input_audio_transcription_model,
    )
    stop_event = asyncio.Event()
    playback_queue: asyncio.Queue = asyncio.Queue()
    shared_state: dict = {
        "mute_mic": False,
        "input_transcripts": deque(),
        "pending_transcription_prints": deque(),
    }

    async with websockets.connect(
        ws_url, additional_headers=headers, max_size=None
    ) as ws:
        await ws.send(json.dumps(session_update_payload))

        listener_task = asyncio.create_task(
            listen_for_events(
                ws,
                stop_event=stop_event,
                transcription_instructions=transcription_instructions,
                max_turns=max_turns,
                playback_queue=playback_queue,
                shared_state=shared_state,
            )
        )
        mic_task = asyncio.create_task(
            stream_microphone_audio(ws, stop_event, shared_state=shared_state)
        )
        playback_task = asyncio.create_task(playback_audio(playback_queue, stop_event))

        try:
            if timeout_seconds and timeout_seconds > 0:
                await asyncio.wait_for(stop_event.wait(), timeout=timeout_seconds)
            else:
                await stop_event.wait()
        except asyncio.TimeoutError:
            print("Timed out waiting for responses; closing.")
        except asyncio.CancelledError:
            print("Session cancelled; closing.")
        finally:
            stop_event.set()
            await playback_queue.put(None)
            await ws.close()
            await asyncio.gather(
                listener_task, mic_task, playback_task, return_exceptions=True
            )
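
run_realtime_session also accepts overrides if you want to bound the demo; for example (the values below are arbitrary):

# Optional: customize the run (example values only).
# await run_realtime_session(
#     max_turns=3,           # stop after 3 completed assistant turns
#     timeout_seconds=120,   # or stop after 2 minutes of wall-clock time
#     vad_threshold=0.5,     # lower threshold = more sensitive speech detection
# )

Here we run it with the defaults: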
await run_realtime_session()
Streaming microphone audio at 24000 Hz (mono). Speak naturally; server VAD will stop listening when you pause.

[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
Hello.

=== User turn (Transcription model) ===
Hello


=== Assistant response ===
Hello! Let's get started with your claim. Can you tell me your full name, please?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
My name is M I N H A J U L H O Q U E

=== User turn (Transcription model) ===
My name is Minhajul Hoque.


=== Assistant response ===
Thank you. Just to confirm, I heard your full name as Minhajul Hoque. Is that correct?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
Yep.

=== User turn (Transcription model) ===
Yep.


=== Assistant response ===
Great, thank you for confirming. Now, could you provide your policy number, please?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
My policy number is X077-B025.

=== User turn (Transcription model) ===
My policy number is X077B025.


=== Assistant response ===
Thank you. Let me confirm: I have your policy number as X077B025. Is that correct?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== Assistant response ===
Of course. Your full name is Minhajul Hoque. Now, let’s move on. What type of accident are you reporting—auto, home, or something else?


=== User turn (Realtime transcript) ===
Yeah, can you ask me my name again?

=== User turn (Transcription model) ===
Can you ask me my name again?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
No, can you ask me my name again, this is important.

=== User turn (Transcription model) ===
No, can you ask me by name again?


=== Assistant response ===
Understood. Let me repeat your full name again to confirm. Your name is Minhajul Hoque. Is that correct?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
My name is Minhajul Hoque.

=== User turn (Transcription model) ===
My name is Minhaj ul Haq.

Session cancelled; closing.

From the example above, we can notice:

  • The realtime model's transcription quality matches or surpasses that of the transcription model across several turns. In one turn, the transcription model drops "this is important." while the realtime transcript captures it.
  • The realtime model correctly applies the policy number formatting rule (XXXX-XXXX).
  • With context from the entire session, including earlier turns where I spelled out my name, the realtime model transcribes my name accurately when the assistant asks for it again, while the transcription model makes errors (e.g., "Minhaj ul Haq").

Conclusion

Exploring out-of-band transcription could be beneficial for your use case if:

  • You're still experiencing unreliable transcriptions, even after optimizing the transcription model prompt.
  • You need a more reliable and steerable method for generating transcriptions.
  • The current transcripts fail to normalize entities correctly, causing downstream issues.

If you decide to pursue this method, make sure you:

  • Set up the transcription trigger correctly, ensuring it fires after the audio commit (input_audio_buffer.committed).
  • Carefully iterate and refine the prompt to align closely with your specific use case and needs.