bug(voice): agent completion readback unreliable (all agents; worst on stream-json flavors) #681

@heavygee

Summary

When a coding agent finishes (ready event), the ElevenLabs voice assistant often fails to read back the agent's answer. The user may need to ask several times, and ConvAI sometimes hallucinates a partial summary.

This affects all agent flavors, but not equally - see below.

Two related gaps in web/src/realtime/hooks/contextFormatters.ts:

  1. formatReadyEvent (older behavior): the injected text told ConvAI "the previous message(s) are the summary" without embedding the assistant text inline. This affects all flavors: ConvAI is pointed at context that may not contain the answer, with no inline fallback.
  2. formatMessage: does not format stream-json / codex-family payloads ({ type: "codex", data: { type: "message", message: "..." } }), so onMessages and session history context updates are empty for those sessions. Claude-style content arrays (text blocks, plain strings) do work.
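As a rough illustration of the second gap, the sketch below extracts speakable text from both payload shapes described above. The type names and the `speakableText` helper are hypothetical, inferred from the shapes quoted in this issue; the real types in contextFormatters.ts may differ.

```typescript
// Hypothetical payload shapes, inferred from the issue text only.
type TextBlock = { type: "text"; text: string };
type ClaudeStyleContent = string | Array<TextBlock | { type: string }>;
type CodexEnvelope = {
  type: "codex";
  data: { type: string; message?: string };
};
type AgentPayload = { content: ClaudeStyleContent } | CodexEnvelope;

function isCodexEnvelope(p: AgentPayload): p is CodexEnvelope {
  return (p as CodexEnvelope).type === "codex";
}

// Returns trimmed speakable text, or null when the payload carries none.
function speakableText(payload: AgentPayload): string | null {
  if (isCodexEnvelope(payload)) {
    // stream-json / codex-family path: only "message" events carry text.
    const { data } = payload;
    if (data.type === "message" && typeof data.message === "string") {
      const trimmed = data.message.trim();
      return trimmed.length > 0 ? trimmed : null;
    }
    return null;
  }

  // Claude-style path: a plain string or an array of content blocks.
  const { content } = payload;
  if (typeof content === "string") {
    const trimmed = content.trim();
    return trimmed.length > 0 ? trimmed : null;
  }
  const joined = content
    .filter(
      (b): b is TextBlock =>
        b.type === "text" && typeof (b as TextBlock).text === "string",
    )
    .map((b) => b.text)
    .join("\n")
    .trim();
  return joined.length > 0 ? joined : null;
}
```

Returning null rather than an empty string lets the caller distinguish "nothing to say" from a real message, which matters when deciding whether to send a context update at all.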

Affected agents (how)

| Flavor | Ready inject gap | Live context (onMessages / history) | Typical outcome |
| --- | --- | --- | --- |
| Claude | Yes | Usually populated (text blocks reach formatMessage) | Unreliable readback - context may help, but ready inject does not embed text; still fails in dogfood |
| Codex | Yes | Empty (stream-json not formatted) | Readback usually fails; hallucination risk if user presses for a summary |
| Cursor | Yes | Empty (same stream-json shape as Codex) | Same as Codex |
| Gemini, Kimi, OpenCode | Yes | Empty if using codex-family stream-json envelopes | Same as Codex/Cursor |
| All | Yes | Varies | Even when live context exists, old ready text did not paste the final answer inline |

Bottom line: the ready-event bug is universal. Codex/Cursor (and other stream-json flavors) also had silent context - no history and no live updates - so they were hit hardest.

Expected

On ready, voice should receive the last assistant speakable text inline (e.g. wrapped in <text>…</text>) so ConvAI can summarize immediately without the user re-prompting. Stream-json messages should flow through formatMessage for live updates and session history.
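A minimal sketch of what the ready inject could look like, assuming the last assistant text has already been extracted. The function name `formatReadyInject`, the prompt wording, and the 4000-character cap are all assumptions for illustration, not the project's actual values.

```typescript
// Assumed cap so a very long answer does not flood the voice context.
const MAX_INLINE_CHARS = 4000;

// Builds the text injected into the voice session when the agent is ready.
// Hypothetical helper; the wording here is illustrative only.
function formatReadyInject(lastAssistantText: string | null): string {
  if (lastAssistantText === null || lastAssistantText.trim() === "") {
    // No answer available: say so explicitly instead of pointing at
    // "previous messages" that may not exist.
    return "The agent has finished, but no final answer text is available. Do not guess at its contents.";
  }
  const body = lastAssistantText.trim().slice(0, MAX_INLINE_CHARS);
  return `The agent has finished. Its final answer follows:\n<text>${body}</text>`;
}
```

The explicit fallback branch is a design choice aimed at the confabulation case: when nothing can be embedded, the inject tells the assistant not to improvise.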

Repro

Any flavor (ready gap):

  1. Open a session with voice enabled
  2. Ask a factual question
  3. Agent replies in chat with a substantive answer
  4. Voice signals "finished" but does not convey the answer unless the user re-prompts

Codex / Cursor / stream-json flavors (additional context gap):

Same steps; voice also lacks interim agent text in context between messages, so readback fails more consistently and confabulation is more likely.

Evidence

Dogfood on ElevenLabs ConvAI: the ready inject appeared as a fake user turn with no embedded assistant text. When context was thin (especially in Codex/Cursor sessions), the assistant confabulated instead of reading hub message content. A Cursor session with the fix applied showed embedded <text>…</text> at ready and accurate readback one second later (conv conv_4501ksdt0athfhfr189tq3jehkcq).

Suggested fix (minimal)

  • extractLastAssistantSpeakable() + embed in formatReadyEvent() (ready hook) - all flavors
  • Teach formatMessage() the codex stream-json path so live onMessages context is not empty - Codex/Cursor/stream-json flavors
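One way the extraction step might work is a backward scan over the session messages. This is only a sketch: the `SessionMessage` shape and the `toText` callback are hypothetical stand-ins for the hub's real message type and for whatever per-payload text extraction formatMessage ends up using.

```typescript
// Hypothetical message shape; the real hub type may differ.
type SessionMessage = { role: "user" | "assistant"; payload: unknown };

// Walks backward and returns the most recent assistant text that is
// actually speakable, skipping tool calls and other text-free events.
function extractLastAssistantSpeakable(
  messages: readonly SessionMessage[],
  toText: (payload: unknown) => string | null,
): string | null {
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i];
    if (!m || m.role !== "assistant") continue;
    const text = toText(m.payload);
    if (text !== null && text.trim() !== "") return text.trim();
  }
  return null;
}
```

As written, the scan keeps going past user turns, so it can return an answer from an earlier exchange if the latest one produced no text. Stopping at the most recent user message would avoid a stale readback, at the cost of sometimes returning nothing.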

Might be related to open PR #640 (Codex voice context) - not sure if that covers ready readback specifically.

Draft fix: PR #682
