GPT-4o-Realtime supports a 128k-token context window, though in certain use cases you may notice performance degrade as you stuff more tokens into it.
Token window = all tokens (words and audio tokens) the model currently keeps in memory for the session.
import os
import openai

# Set your API key safely
openai.api_key = os.getenv("OPENAI_API_KEY", "")
if not openai.api_key:
    raise ValueError("OPENAI_API_KEY not found – please set env var or edit this cell.")
Large token windows are precious: every extra token you use costs latency and money.
For audio, the input token count grows much faster than for plain text because amplitude, timing, and other acoustic details must be represented.
In practice you’ll often see ≈10× more tokens for the same sentence in audio versus text.
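To build a rough intuition, the sketch below compares the text-token count of a sentence (using tiktoken's o200k_base encoding, the tokenizer family used for GPT-4o text) with a back-of-the-envelope audio estimate. The AUDIO_TOKENS_PER_SECOND constant and the speech duration are illustrative assumptions, not official rates.

import tiktoken

# Rough, illustrative estimate only -- the real audio tokenizer is internal to the model.
AUDIO_TOKENS_PER_SECOND = 10   # assumption for back-of-the-envelope math
SPEECH_SECONDS = 2.5           # roughly how long the sentence takes to say aloud

enc = tiktoken.get_encoding("o200k_base")
sentence = "Can you tell me a quick story?"

text_tokens = len(enc.encode(sentence))
audio_tokens_est = int(SPEECH_SECONDS * AUDIO_TOKENS_PER_SECOND)

print(f"text tokens:  {text_tokens}")
print(f"audio tokens: ~{audio_tokens_est} (estimate)")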
GPT-4o Realtime accepts up to 128k tokens, and as the context grows, instruction adherence can drift.
Every user/assistant turn consumes tokens → the window only grows.
Strategy: Summarise older turns into a single assistant message, keep the last few verbatim turns, and continue.
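In pure-Python terms, the idea looks roughly like this toy sketch; the summarise argument is a stand-in for the LLM call implemented later in the notebook.

from typing import Callable, List, Tuple

def compress_history(
    turns: List[Tuple[str, str]],     # (role, text) pairs, oldest first
    summarise: Callable[[str], str],  # stand-in for the LLM summariser
    keep_last: int = 2,
) -> List[Tuple[str, str]]:
    """Toy sketch: fold everything except the last `keep_last` turns into one summary."""
    if len(turns) <= keep_last:
        return turns
    old, recent = turns[:-keep_last], turns[-keep_last:]
    summary = summarise("\n".join(f"{role}: {text}" for role, text in old))
    return [("assistant", summary)] + recent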
Unlike HTTP-based Chat Completions, the Realtime API maintains an open, stateful session with two key components:
| Component | Purpose |
| --- | --- |
| Session | Controls global settings — model, voice, modalities, VAD, etc. |
| Conversation | Stores turn-by-turn messages between user and assistant — both audio and text. |
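Concretely, session-wide settings travel in a session.update event, while individual turns are conversation items created with conversation.item.create (both event types appear in the code later in this notebook). The payloads below illustrate the two shapes; the field values are placeholders.

# Session-level settings travel in a single session.update event …
session_update = {
    "type": "session.update",
    "session": {
        "voice": "shimmer",
        "modalities": ["audio", "text"],
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
    },
}

# … while each turn is a conversation item (here a text message; values are illustrative).
item_create = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": "Hello!"}],
    },
}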
This notebook wraps these components inside a simple ConversationState object to keep your logic clean, track history, and manage summarization when context windows fill up.
import asyncio
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class Turn:
    """One utterance in the dialogue (user **or** assistant)."""
    role: Literal["user", "assistant"]
    item_id: str              # Server-assigned identifier
    text: str | None = None   # Filled once transcript is ready

@dataclass
class ConversationState:
    """All mutable data the session needs — nothing more, nothing less."""
    history: List[Turn] = field(default_factory=list)                 # Ordered log
    waiting: dict[str, asyncio.Future] = field(default_factory=dict)  # Pending transcript fetches
    summary_count: int = 0
    latest_tokens: int = 0      # Window size after last reply
    summarising: bool = False   # Guard so we don’t run two summaries at once
A quick helper to peek at the transcript:
def print_history(state) -> None:
    """Pretty-print the running transcript so far."""
    print("—— Conversation so far ———————————————")
    for turn in state.history:
        text_preview = (turn.text or "").strip().replace("\n", " ")
        print(f"[{turn.role:<9}] {text_preview} ({turn.item_id})")
    print("——————————————————————————————————————————")
Our mic task is now filling an asyncio.Queue with raw PCM‑16 blocks.
Next step: pull chunks off that queue, base‑64 encode them (the protocol requires JSON‑safe text), and ship each block to the Realtime WebSocket as an input_audio_buffer.append event.
import base64
import json
import asyncio
import websockets

# Helper function to encode audio chunks in base64
b64 = lambda blob: base64.b64encode(blob).decode()

async def queue_to_websocket(pcm_queue: asyncio.Queue[bytes], ws):
    """Read audio chunks from the queue and send them as JSON events."""
    try:
        while (chunk := await pcm_queue.get()) is not None:
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": b64(chunk),
            }))
    except websockets.ConnectionClosed:
        print("WebSocket closed – stopping uploader")
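The mic-capture coroutine referred to above (mic_to_queue) was defined earlier in the notebook. Purely for reference, a minimal sketch of such a task might look like the following, assuming the sounddevice library and the SAMPLE_RATE_HZ / CHUNK_DURATION_MS constants configured later in this notebook.

import asyncio
import sounddevice as sd

async def mic_to_queue_sketch(pcm_queue: asyncio.Queue) -> None:
    """Illustrative mic-capture task: push raw PCM-16 chunks onto the queue."""
    loop = asyncio.get_running_loop()
    blocksize = SAMPLE_RATE_HZ * CHUNK_DURATION_MS // 1000  # frames per chunk

    def callback(indata, frames, time_info, status):
        # Runs on sounddevice's audio thread → hand the bytes to the event loop safely.
        loop.call_soon_threadsafe(pcm_queue.put_nowait, bytes(indata))

    with sd.RawInputStream(samplerate=SAMPLE_RATE_HZ, blocksize=blocksize,
                           dtype="int16", channels=1, callback=callback):
        while True:              # keep the stream open until the task is cancelled
            await asyncio.sleep(0.1)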
Once audio reaches the server, the Realtime API pushes a stream of JSON events back over the same WebSocket.
Understanding these events is critical for tracking the conversation turn by turn, knowing when a reply (and its token usage) is complete, and deciding when to summarise.
The Realtime model keeps a large 128k-token window, but quality can drift long before that limit as you stuff more context into the model.
Our goal: auto-summarise once the running window nears a safe threshold (a default of 2,000 tokens for this notebook), then prune the superseded turns both locally and server-side.
We monitor latest_tokens returned in response.done. When it exceeds SUMMARY_TRIGGER and we have more than KEEP_LAST_TURNS, we spin up a background summarisation coroutine.
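Stripped of the surrounding code, the trigger check is essentially the following condensed view of the logic that appears inside the event loop later in this notebook:

# Inside the response.done handler (condensed view)
state.latest_tokens = event["response"]["usage"]["total_tokens"]
if (state.latest_tokens >= SUMMARY_TRIGGER
        and len(state.history) > KEEP_LAST_TURNS
        and not state.summarising):
    asyncio.create_task(summarise_and_prune(ws, state))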
We compress everything except the last 2 turns into a single French paragraph, then:
Insert that paragraph as a new assistant message at the top of the conversation.
Delete the message items that were used for the summary.
We will later ask the voice agent what language the summary was in, to test whether the summary was successfully inserted into the Realtime API conversation context.
async def run_summary_llm(text: str) -> str:
    """Call a lightweight model to summarise `text`."""
    resp = await asyncio.to_thread(
        lambda: openai.chat.completions.create(
            model=SUMMARY_MODEL,
            temperature=0,
            messages=[
                {"role": "system", "content": "Summarise in French the following conversation "
                                              "in one concise paragraph so it can be used as "
                                              "context for future dialogue."},
                {"role": "user", "content": text},
            ],
        )
    )
    return resp.choices[0].message.content.strip()
async def summarise_and_prune(ws, state):
    """Summarise old turns, delete them server-side, and prepend a single
    summary turn locally + remotely."""
    state.summarising = True
    print(f"⚠️ Token window ≈{state.latest_tokens} ≥ {SUMMARY_TRIGGER}. Summarising…", )
    old_turns, recent_turns = state.history[:-KEEP_LAST_TURNS], state.history[-KEEP_LAST_TURNS:]
    convo_text = "\n".join(f"{t.role}: {t.text}" for t in old_turns if t.text)

    if not convo_text:
        print("Nothing to summarise (transcripts still pending).")
        state.summarising = False
        return  # Bail out rather than inserting an empty summary

    summary_text = await run_summary_llm(convo_text) if convo_text else ""
    state.summary_count += 1
    summary_id = f"sum_{state.summary_count:03d}"
    state.history[:] = [Turn("assistant", summary_id, summary_text)] + recent_turns
    print_history(state)

    # Create summary on server
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "previous_item_id": "root",
        "item": {
            "id": summary_id,
            "type": "message",
            "role": "assistant",
            "content": [{"type": "text", "text": summary_text}],
        },
    }))

    # Delete old items
    for turn in old_turns:
        await ws.send(json.dumps({
            "type": "conversation.item.delete",
            "item_id": turn.item_id,
        }))

    print(f"✅ Summary inserted ({summary_id})")
    state.summarising = False
The following function lets us poll for transcripts over time. This is useful in cases where the user's audio hasn't been transcribed immediately, so we can retrieve the final result later.
async def fetch_full_item(
    ws, item_id: str, state: ConversationState, attempts: int = 1
):
    """
    Ask the server for a full conversation item; retry up to 5× if the
    transcript field is still null. Resolve the waiting future when done.
    """
    # If there is already a pending fetch, just await it
    if item_id in state.waiting:
        return await state.waiting[item_id]

    fut = asyncio.get_running_loop().create_future()
    state.waiting[item_id] = fut

    await ws.send(json.dumps({
        "type": "conversation.item.retrieve",
        "item_id": item_id,
    }))

    item = await fut

    # If transcript still missing, retry (max 5×)
    if attempts < 5 and not item.get("content", [{}])[0].get("transcript"):
        await asyncio.sleep(0.4 * attempts)
        # Clear the resolved future first so the retry issues a fresh retrieve request
        state.waiting.pop(item_id, None)
        return await fetch_full_item(ws, item_id, state, attempts + 1)

    # Done – remove the marker
    state.waiting.pop(item_id, None)
    return item
Run the two cells below to launch an interactive session. Interrupt the cell to stop recording.
Note:
This notebook uses SUMMARY_TRIGGER = 2000 and KEEP_LAST_TURNS = 2 to make summarization easier to demo quickly.
In production, you should tune these values based on your application's needs.
A typical SUMMARY_TRIGGER falls between 20,000 and 32,000 tokens, depending on how performance degrades with larger context for your use case.
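For illustration only, a production-leaning configuration might look like the sketch below; these particular numbers are hypothetical and should be tuned against your own latency and quality measurements.

# Hypothetical production-style values (illustrative only -- tune empirically)
SUMMARY_TRIGGER = 24_000   # Summarise once the window passes ~24k tokens
KEEP_LAST_TURNS = 6        # Keep more recent turns verbatim for continuity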
# Audio/config knobs
SAMPLE_RATE_HZ = 24_000        # Required by pcm16
CHUNK_DURATION_MS = 40         # Chunk size for audio capture
BYTES_PER_SAMPLE = 2           # pcm16 = 2 bytes/sample
SUMMARY_TRIGGER = 2_000        # Summarise when context ≥ this
KEEP_LAST_TURNS = 2            # Keep these turns verbatim
SUMMARY_MODEL = "gpt-4o-mini"  # Cheaper, fast summariser
# --------------------------------------------------------------------------- #
# 🎤 Realtime session                                                          #
# --------------------------------------------------------------------------- #
async def realtime_session(model="gpt-4o-realtime-preview", voice="shimmer", enable_playback=True):
    """
    Main coroutine: connects to the Realtime endpoint, spawns helper tasks,
    and processes incoming events in a big async-for loop.
    """
    state = ConversationState()  # Reset state for each run
    pcm_queue: asyncio.Queue[bytes] = asyncio.Queue()
    assistant_audio: List[bytes] = []

    # ----------------------------------------------------------------------- #
    # Open the WebSocket connection to the Realtime API                        #
    # ----------------------------------------------------------------------- #
    url = f"wss://api.openai.com/v1/realtime?model={model}"
    headers = {"Authorization": f"Bearer {openai.api_key}", "OpenAI-Beta": "realtime=v1"}

    async with websockets.connect(url, extra_headers=headers, max_size=1 << 24) as ws:
        # ------------------------------------------------------------------- #
        # Wait until server sends session.created                              #
        # ------------------------------------------------------------------- #
        while json.loads(await ws.recv())["type"] != "session.created":
            pass
        print("session.created ✅")

        # ------------------------------------------------------------------- #
        # Configure session: voice, modalities, audio formats, transcription   #
        # ------------------------------------------------------------------- #
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "voice": voice,
                "modalities": ["audio", "text"],
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {"model": "gpt-4o-transcribe"},
            },
        }))

        # ------------------------------------------------------------------- #
        # Launch background tasks: mic capture → queue → websocket             #
        # ------------------------------------------------------------------- #
        mic_task = asyncio.create_task(mic_to_queue(pcm_queue))
        upl_task = asyncio.create_task(queue_to_websocket(pcm_queue, ws))
        print("🎙️ Speak now (Ctrl‑C to quit)…")

        try:
            # --------------------------------------------------------------- #
            # Main event loop: process incoming events from the websocket      #
            # --------------------------------------------------------------- #
            async for event_raw in ws:
                event = json.loads(event_raw)
                etype = event["type"]

                # ----------------------------------------------------------- #
                # User just spoke ⇢ conversation.item.created (role = user)    #
                # ----------------------------------------------------------- #
                if etype == "conversation.item.created" and event["item"]["role"] == "user":
                    item = event["item"]
                    text = None
                    if item["content"]:
                        text = item["content"][0].get("transcript")
                    state.history.append(Turn("user", event["item"]["id"], text))
                    # If transcript not yet available, fetch it later
                    if text is None:
                        asyncio.create_task(fetch_full_item(ws, item["id"], state))

                # ----------------------------------------------------------- #
                # Transcript fetched ⇢ conversation.item.retrieved             #
                # ----------------------------------------------------------- #
                elif etype == "conversation.item.retrieved":
                    content = event["item"]["content"][0]
                    # Fill missing transcript in history
                    for t in state.history:
                        if t.item_id == event["item"]["id"]:
                            t.text = content.get("transcript")
                            break
                    # Resolve the pending future so fetch_full_item can return
                    fut = state.waiting.get(event["item"]["id"])
                    if fut is not None and not fut.done():
                        fut.set_result(event["item"])

                # ----------------------------------------------------------- #
                # Assistant audio arrives in deltas                            #
                # ----------------------------------------------------------- #
                elif etype == "response.audio.delta":
                    assistant_audio.append(base64.b64decode(event["delta"]))

                # ----------------------------------------------------------- #
                # Assistant reply finished ⇢ response.done                     #
                # ----------------------------------------------------------- #
                elif etype == "response.done":
                    for item in event["response"]["output"]:
                        if item["role"] == "assistant":
                            txt = item["content"][0]["transcript"]
                            state.history.append(Turn("assistant", item["id"], txt))
                            # print(f"\n🤖 {txt}\n")
                    state.latest_tokens = event["response"]["usage"]["total_tokens"]
                    print(f"—— response.done (window ≈{state.latest_tokens} tokens) ——")
                    print_history(state)

                    # Fetch any still-missing user transcripts
                    for turn in state.history:
                        if (turn.role == "user" and turn.text is None
                                and turn.item_id not in state.waiting):
                            asyncio.create_task(
                                fetch_full_item(ws, turn.item_id, state)
                            )

                    # Playback collected audio once reply completes
                    if enable_playback and assistant_audio:
                        simpleaudio.play_buffer(
                            b"".join(assistant_audio), 1, BYTES_PER_SAMPLE, SAMPLE_RATE_HZ
                        )
                        assistant_audio.clear()

                    # Summarise if context too large – fire in background so we don't block dialogue
                    if (state.latest_tokens >= SUMMARY_TRIGGER
                            and len(state.history) > KEEP_LAST_TURNS
                            and not state.summarising):
                        asyncio.create_task(summarise_and_prune(ws, state))

        except KeyboardInterrupt:
            print("\nStopping…")
        finally:
            mic_task.cancel()
            await pcm_queue.put(None)
            await upl_task
# Run the realtime session (this cell blocks until you stop it)
await realtime_session()
session.created ✅
🎙️ Speak now (Ctrl‑C to quit)…
—— response.done (window ≈979 tokens) ——
—— Conversation so far ———————————————
[user     ] Can you tell me a quick story? (item_BTuMOcpUqp8qknKhLzlkA)
[assistant] Once upon a time, in a cozy little village, there was a cat named Whiskers who was always getting into trouble. One sunny day, Whiskers found a mysterious glowing stone in the garden. Curious, he pawed at it, and poof! The stone granted him the ability to talk to birds. Whiskers and his new bird friends had grand adventures, solving mysteries and exploring the village. And from that day on, Whiskers was known as the most adventurous cat in the village. The end. (item_BTuMPRWxqpv0ph6QM46DK)
——————————————————————————————————————————
—— response.done (window ≈2755 tokens) ——
—— Conversation so far ———————————————
[user     ] Can you tell me a quick story? (item_BTuMOcpUqp8qknKhLzlkA)
[assistant] Once upon a time, in a cozy little village, there was a cat named Whiskers who was always getting into trouble. One sunny day, Whiskers found a mysterious glowing stone in the garden. Curious, he pawed at it, and poof! The stone granted him the ability to talk to birds. Whiskers and his new bird friends had grand adventures, solving mysteries and exploring the village. And from that day on, Whiskers was known as the most adventurous cat in the village. The end. (item_BTuMPRWxqpv0ph6QM46DK)
[user     ] Can you tell me three extremely funny stories? (item_BTuNN64LdULM21OyC4vzN)
[assistant] Sure, let's dive into some giggle-worthy tales: **Story One:** There was a forgetful baker named Benny who baked a hundred cakes for a big wedding. But on the big day, he forgot where he put them! The entire town joined in to find the missing cakes, only to discover Benny had stored them in his neighbor's garage, thinking it was his pantry. The wedding turned into a town-wide cake feast! **Story Two:** A mischievous dog named Sparky loved to play pranks. One day, he swapped his owner's phone with a squeaky toy, causing a hilarious mix-up of barks, squeaks, and confused calls. Sparky's owner ended up having a full conversation with the mailman, all in squeaks! **Story Three:** In a small town, a parrot named Polly became a local celebrity for reciting tongue twisters. One day, Polly challenged the mayor to a tongue twister duel. The mayor, tongue-tied and laughing, declared Polly the official town jester. Polly squawked with pride, and the town rang with laughter for days. (item_BTuNNpNxki5ynSQ5c3Xsa)
——————————————————————————————————————————
⚠️ Token window ≈2755 ≥ 2000. Summarising…
—— Conversation so far ———————————————
[assistant] L'utilisateur a demandé une histoire rapide, et l'assistant a raconté celle d'un chat nommé Whiskers qui, après avoir trouvé une pierre mystérieuse dans son jardin, a obtenu le pouvoir de parler aux oiseaux. Avec ses nouveaux amis oiseaux, Whiskers a vécu de grandes aventures, résolvant des mystères et explorant le village, devenant ainsi le chat le plus aventurier du village. (sum_001)
[user     ] Can you tell me three extremely funny stories? (item_BTuNN64LdULM21OyC4vzN)
[assistant] Sure, let's dive into some giggle-worthy tales: **Story One:** There was a forgetful baker named Benny who baked a hundred cakes for a big wedding. But on the big day, he forgot where he put them! The entire town joined in to find the missing cakes, only to discover Benny had stored them in his neighbor's garage, thinking it was his pantry. The wedding turned into a town-wide cake feast! **Story Two:** A mischievous dog named Sparky loved to play pranks. One day, he swapped his owner's phone with a squeaky toy, causing a hilarious mix-up of barks, squeaks, and confused calls. Sparky's owner ended up having a full conversation with the mailman, all in squeaks! **Story Three:** In a small town, a parrot named Polly became a local celebrity for reciting tongue twisters. One day, Polly challenged the mayor to a tongue twister duel. The mayor, tongue-tied and laughing, declared Polly the official town jester. Polly squawked with pride, and the town rang with laughter for days. (item_BTuNNpNxki5ynSQ5c3Xsa)
——————————————————————————————————————————
✅ Summary inserted (sum_001)
—— response.done (window ≈2147 tokens) ——
—— Conversation so far ———————————————
[assistant] L'utilisateur a demandé une histoire rapide, et l'assistant a raconté celle d'un chat nommé Whiskers qui, après avoir trouvé une pierre mystérieuse dans son jardin, a obtenu le pouvoir de parler aux oiseaux. Avec ses nouveaux amis oiseaux, Whiskers a vécu de grandes aventures, résolvant des mystères et explorant le village, devenant ainsi le chat le plus aventurier du village. (sum_001)
[user     ] Can you tell me three extremely funny stories? (item_BTuNN64LdULM21OyC4vzN)
[assistant] Sure, let's dive into some giggle-worthy tales: **Story One:** There was a forgetful baker named Benny who baked a hundred cakes for a big wedding. But on the big day, he forgot where he put them! The entire town joined in to find the missing cakes, only to discover Benny had stored them in his neighbor's garage, thinking it was his pantry. The wedding turned into a town-wide cake feast! **Story Two:** A mischievous dog named Sparky loved to play pranks. One day, he swapped his owner's phone with a squeaky toy, causing a hilarious mix-up of barks, squeaks, and confused calls. Sparky's owner ended up having a full conversation with the mailman, all in squeaks! **Story Three:** In a small town, a parrot named Polly became a local celebrity for reciting tongue twisters. One day, Polly challenged the mayor to a tongue twister duel. The mayor, tongue-tied and laughing, declared Polly the official town jester. Polly squawked with pride, and the town rang with laughter for days. (item_BTuNNpNxki5ynSQ5c3Xsa)
[user     ] (item_BTuPLaCv8ATdIwAQ2rLgO)
[assistant] Sure! The first summary I provided between us was in French. (item_BTuPLa7BaSQToGCVOmfBK)
We had a conversation with our Voice AI. After several turns, the total token count crossed SUMMARY_TRIGGER, which triggered the conversation summarization step. This generated a summary of the earlier messages.
Since there were N = 4 total messages, we summarized the first N - 2 = 2 messages:
—— Conversation so far ———————————————
[user     ] Can you tell me a quick story? (item_BTuMOcpUqp8qknKhLzlkA)
[assistant] Once upon a time, in a cozy little village, there was a cat named Whiskers who was always getting into trouble. One sunny day, Whiskers found a mysterious glowing stone in the garden. Curious, he pawed at it, and poof! The stone granted him the ability to talk to birds. Whiskers and his new bird friends had grand adventures, solving mysteries and exploring the village. And from that day on, Whiskers was known as the most adventurous cat in the village. The end. (item_BTuMPRWxqpv0ph6QM46DK)
We then created a summary in French and inserted it at the start of the conversation history by sending a conversation.item.create event with "previous_item_id": "root". This ensured the summary appeared as the first message in the conversation. After that, we deleted the original items that had been summarized, using "type": "conversation.item.delete" events.
To validate the summary insertion, we asked the Voice AI what language the summary was in. It correctly responded:
[assistant] Sure! The first summary I provided between us was in French. (item_BTuPLa7BaSQToGCVOmfBK)