Voice-First Document Editing with Gemini Live: Architecture & Tradeoffs

Architecture April 10, 2026 · 7 min read

Most "voice-controlled editor" demos you've seen are doing something quietly expensive: the audio goes through a transcription model, the transcript goes through an instruction model, and the instruction model emits tool calls into the editor. Three round-trips, two LLMs, and the surface that the voice model sees is not the surface the editor exposes.

AgentDoc takes a different route. We connect Gemini 3.1 Flash Live – the native-audio variant – directly to the same MCP server the text agent uses. There is no transcription proxy, no second LLM, and no separate tool surface for voice. This post walks through why we chose this design, what it costs, and where the rough edges still are.

The naïve design (and why it loses)

The transcription-proxy pattern looks like this:

mic → STT → transcript → text-LLM → tool-call → MCP → editor
                              ↑
                        (instructions, tool list, history)

It has three real problems. First, latency is the sum of three sequential round-trips (STT, text LLM, tool call), and none of them can start before the previous one finishes; you feel the pause every time you finish speaking. Second, the STT step throws away prosody – which is exactly the signal that distinguishes "italic 'really'" from "really, italic". Third, you maintain two divergent prompt surfaces, the text agent's tool docs and the voice agent's, and they drift apart over time.

The native-audio design

Gemini Live's native-audio path takes raw audio, emits tool calls against a typed schema, and streams audio back. The diagram collapses to:

mic → Gemini Live → tool-call → MCP → editor
                      ↑
                (same tool list as text agent)

Two consequences fall out:

Single source of truth for the tool surface. The voice agent calls insert_text with the same schema the text agent uses. We don't maintain a parallel "voice tool list".
Prosody survives. The model receives the audio directly, so emphasis and pauses inform the tool-call selection without us writing rules.
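
To make the "single source of truth" point concrete, here is a minimal sketch of one tool registry feeding both agents. It is illustrative, not AgentDoc's actual code: the fields of insert_text (offset, text), the ToolSpec class, and the two *_agent_config helpers are all assumptions, since the post names the tool but does not show its schema.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    parameters: dict[str, Any]  # JSON-Schema-style parameter object


# Hypothetical schema: only the tool name insert_text comes from the post.
TOOL_REGISTRY: tuple[ToolSpec, ...] = (
    ToolSpec(
        name="insert_text",
        description="Insert text at a character offset in the document.",
        parameters={
            "type": "object",
            "properties": {
                "offset": {"type": "integer"},
                "text": {"type": "string"},
            },
            "required": ["offset", "text"],
        },
    ),
)


def tool_declarations() -> list[dict[str, Any]]:
    """Render the registry into plain dicts that either agent can send."""
    return [
        {"name": t.name, "description": t.description, "parameters": t.parameters}
        for t in TOOL_REGISTRY
    ]


def text_agent_config() -> dict[str, Any]:
    # Hypothetical config shape for the text agent.
    return {"modality": "text", "tools": tool_declarations()}


def voice_agent_config() -> dict[str, Any]:
    # Same tool list; only the modality differs.
    return {"modality": "audio", "tools": tool_declarations()}
```

Because both configs are derived from TOOL_REGISTRY, adding or changing a tool in one place changes it for both modalities, which is the property that prevents the drift described earlier.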

What we had to build

Three integration points needed work:

WebSocket bridging. Gemini Live speaks one WS protocol; the editor uses another for real-time document sync. The agent service runs both and translates tool-call results into editor-render events. See the agent/ directory.
Tool-call observability. The voice model needs structured feedback to avoid index drift after mutations. We feed back the same dirty-range descriptors the text agent gets (covered in the tool granularity post).
FSM gating, voice variant. The State-Constrained ReAct FSM doesn't care about the modality – the lock on write tools after a mutation works identically for voice. We just had to make sure the audio output channel doesn't deadlock when the FSM forces a re-read.
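
The gating idea can be sketched in a few lines. This is a simplified stand-in for the State-Constrained ReAct FSM, not its real implementation: the state names, the ToolGate class, and the tool names delete_range and read_document are hypothetical. What it does show is why modality is irrelevant: the gate only ever sees tool names.

```python
from enum import Enum, auto


class State(Enum):
    READY = auto()        # document view is fresh; any known tool may run
    MUST_REREAD = auto()  # a mutation happened; write tools are locked


# Hypothetical tool names, apart from insert_text, which the post mentions.
WRITE_TOOLS = {"insert_text", "delete_range"}
READ_TOOLS = {"read_document"}


class ToolGate:
    """Modality-agnostic gate: it inspects tool names, never audio or text."""

    def __init__(self) -> None:
        self.state = State.READY

    def allows(self, tool: str) -> bool:
        if tool in WRITE_TOOLS:
            return self.state is State.READY
        # Reads are always allowed; unknown tools never are.
        return tool in READ_TOOLS

    def record(self, tool: str) -> None:
        """Advance the FSM after a tool call has been executed."""
        if not self.allows(tool):
            raise PermissionError(f"{tool} is locked in state {self.state.name}")
        if tool in WRITE_TOOLS:
            self.state = State.MUST_REREAD  # lock writes until a re-read
        else:
            self.state = State.READY        # a read unlocks writes again
```

A voice session and a text session would drive the same gate: after insert_text is recorded, a second write is refused until read_document has run.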

Latency: the win we expected and the win we didn't

End-to-end latency, measured from "I stopped speaking" to "document mutates", drops from roughly 2.4s in the proxy design to ~700ms with native audio. That is the win we expected.

The unexpected win is what happens during multi-turn flows. Because we don't pay the STT round-trip per turn, the model can hold a longer context of spoken conversation cheaply. Compound instructions ("make the title bold, then make the next paragraph italic, then export") that previously required us to chunk the transcript into separate LLM calls now run as a single tool-call sequence in one Live session.
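
To see how the per-utterance difference adds up, here is a small back-of-the-envelope calculation. The two latency figures are the ones quoted above; the session length of 200 spoken commands is a made-up number for illustration, not a measurement.

```python
PROXY_LATENCY_S = 2.4   # "roughly 2.4s" in the proxy design
NATIVE_LATENCY_S = 0.7  # "~700ms" with native audio


def waiting_time_s(utterances: int, per_utterance_s: float) -> float:
    """Total time spent waiting for the document to react."""
    return utterances * per_utterance_s


def session_saving_s(utterances: int) -> float:
    """Waiting time saved by native audio over a whole session."""
    return (waiting_time_s(utterances, PROXY_LATENCY_S)
            - waiting_time_s(utterances, NATIVE_LATENCY_S))


# Hypothetical session of 200 spoken commands.
saved = session_saving_s(200)
print(f"{saved:.0f} s saved (~{saved / 60:.1f} min)")
```

Under that assumption the saving is about 1.7 seconds per utterance, or a little under six minutes of pure waiting across the session.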

What the rough edges look like

Three things still bite:

Live API quota. Native audio sessions count against a different quota than text completions, and a long voice session can consume it faster than a typist would expect. Our quota-status pill (the new quota_status.js module) is the user-visible answer.
Disambiguation across homophones. "Insert here" and "insert hear" sound identical, and a noisy room only makes near-homophones worse. We added a small set of spoken confirmation prompts on irreversible operations – for example "deleting paragraph 3, confirm?" – that gate the actual mutation.
Reconnects. The Live WebSocket can drop on flaky networks, and the audio buffer state at the moment of the drop is not always recoverable. We currently replay the last ~3 seconds of confirmed text into a fresh session, which is good enough for most users but is next on the polish list.
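
The confirmation gate on irreversible operations might look roughly like the sketch below. The names here (ConfirmationGate, PendingOp, the delete_paragraph tool, and the execute callback) are hypothetical; the only detail taken from the post is the shape of the spoken prompt.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical set of tools treated as irreversible.
IRREVERSIBLE = {"delete_paragraph"}


@dataclass
class PendingOp:
    tool: str
    args: dict
    prompt: str


@dataclass
class ConfirmationGate:
    execute: Callable[[str, dict], None]  # applies the mutation to the editor
    pending: Optional[PendingOp] = None

    def request(self, tool: str, args: dict, description: str) -> Optional[str]:
        """Run reversible tools immediately; park irreversible ones and
        return the prompt that should be spoken to the user."""
        if tool not in IRREVERSIBLE:
            self.execute(tool, args)
            return None
        self.pending = PendingOp(tool, args, f"{description}, confirm?")
        return self.pending.prompt

    def answer(self, confirmed: bool) -> bool:
        """Apply or drop the parked operation. Returns True if it ran."""
        op, self.pending = self.pending, None
        if op is None or not confirmed:
            return False
        self.execute(op.tool, op.args)
        return True
```

The design choice worth noting is that the mutation is parked rather than executed and undone: nothing reaches the editor until the user answers, so a misheard command costs one extra spoken turn instead of a lost paragraph.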

Why this matters for accessibility

The original motivation for the voice-first build is on the accessibility page: AgentDoc is intended to be usable by people who cannot or do not want to use a mouse and keyboard. The transcription-proxy design always felt wrong for that audience because it imposed a delay of more than two seconds on every utterance, which compounds badly across a long writing session.

Native audio is not just an engineering preference. It's the difference between voice as a novelty input and voice as a primary one.

What's next

Ongoing work: better reconnect handling, a quota-aware fallback to text-mode when the Live budget is low, and a study comparing voice-only vs. text-only completion times across the 13 benchmark scenarios. That last one will get its own write-up here when the data lands.
