Voice-First Document Editing with Gemini Live: Architecture & Tradeoffs

Architecture · 7 min read

Most "voice-controlled editor" demos you've seen are doing something quietly expensive: the audio goes through a transcription model, the transcript goes through an instruction model, and the instruction model emits tool calls into the editor. Three round-trips, two LLMs, and the surface that the voice model sees is not the surface the editor exposes.

AgentDoc takes a different route. We connect Gemini 3.1 Flash Live – the native-audio variant – directly to the same MCP server the text agent uses. There is no transcription proxy, no second LLM, and no separate tool surface for voice. This post walks through why we chose this design, what it costs, and where the rough edges still are.

The naïve design (and why it loses)

The transcription-proxy pattern looks like this:

mic → STT → transcript → text-LLM → tool-call → MCP → editor
                              ↑
                        (instructions, tool list, history)

It has three real problems. First, latency is the sum of three sequential model calls; you feel the pause every time you finish speaking. Second, the STT step throws away prosody – which is exactly the signal that distinguishes "italic 'really'" from "really, italic". Third, you maintain two divergent prompt surfaces: the text agent's tool docs and the voice agent's, which drift apart over time.

The native-audio design

Gemini Live's native-audio path takes raw audio, runs tool calls against a typed schema, and streams audio back. The diagram collapses to:

mic → Gemini Live → tool-call → MCP → editor
                      ↑
                (same tool list as text agent)
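
To make "same tool list as text agent" concrete, here is a minimal sketch of one tool registry feeding both agents. It is an illustration, not AgentDoc's code: the `Tool` shape, the `set_bold` tool, and the `tool_declarations` and `dispatch` helpers are all hypothetical, and no real MCP or Gemini SDK call is made.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Tool:
    """One typed tool definition shared by every agent (hypothetical shape)."""
    name: str
    description: str
    parameters: Dict[str, Any]      # JSON-Schema-style parameter spec
    handler: Callable[..., str]


def set_bold(target: str) -> str:
    # Stand-in for a real editor mutation; it only returns a log line.
    return f"bold:{target}"


# Single source of truth for the tool surface.
TOOLS: List[Tool] = [
    Tool(
        name="set_bold",
        description="Make the target block bold.",
        parameters={
            "type": "object",
            "properties": {"target": {"type": "string"}},
            "required": ["target"],
        },
        handler=set_bold,
    ),
]


def tool_declarations(tools: List[Tool]) -> List[Dict[str, Any]]:
    """Render the registry into the declaration list handed to an agent."""
    return [
        {"name": t.name, "description": t.description, "parameters": t.parameters}
        for t in tools
    ]


# Both agents are configured from the same registry, so their tool docs
# cannot drift apart.
text_agent_tools = tool_declarations(TOOLS)
voice_agent_tools = tool_declarations(TOOLS)


def dispatch(name: str, args: Dict[str, Any]) -> str:
    """Route a tool call from either agent to the one shared handler."""
    by_name = {t.name: t for t in TOOLS}
    return by_name[name].handler(**args)
```

Whichever agent emits `set_bold`, the call lands in the same handler; adding a tool means editing `TOOLS` once.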

Two consequences fall out:

What we had to build

Three integration points needed work:

Latency: the win we expected and the win we didn't

End-to-end "I stopped speaking → document mutates" latency drops from roughly 2.4s in the proxy design to ~700ms with native audio. That is the win we expected.
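
As a back-of-the-envelope check on those two figures, the sketch below works out the per-utterance saving, the speedup, and how the wait adds up over a session. The 2.4s and 700ms values come from the paragraph above; the 200-utterance session length is an assumption chosen only to make the totals concrete.

```python
PROXY_MS = 2400    # "roughly 2.4s" for the transcription-proxy design
NATIVE_MS = 700    # "~700ms" for the native-audio design


def saved_ms_per_utterance(proxy_ms: int = PROXY_MS, native_ms: int = NATIVE_MS) -> int:
    """Milliseconds saved on each utterance by the native-audio path."""
    return proxy_ms - native_ms


def speedup(proxy_ms: int = PROXY_MS, native_ms: int = NATIVE_MS) -> float:
    """How many times faster the native-audio path is."""
    return proxy_ms / native_ms


def session_wait_seconds(utterances: int, per_utterance_ms: int) -> float:
    """Total time spent waiting across a session, in seconds."""
    return utterances * per_utterance_ms / 1000


# Hypothetical session length, not a measured value.
UTTERANCES = 200

print(saved_ms_per_utterance())                      # 1700
print(round(speedup(), 2))                           # 3.43
print(session_wait_seconds(UTTERANCES, PROXY_MS))    # 480.0
print(session_wait_seconds(UTTERANCES, NATIVE_MS))   # 140.0
```

Under that assumed session length, the proxy design spends eight minutes waiting against a little over two for native audio.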

The unexpected win is what happens during multi-turn flows. Because we don't pay the STT round-trip per turn, the model can hold a longer context of spoken conversation cheaply. Compound instructions ("make the title bold, then make the next paragraph italic, then export") that previously required us to chunk the transcript into separate LLM calls now run as a single tool-call sequence in one Live session.
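
The compound instruction above can be pictured as one ordered list of tool calls applied in a single pass. This is a toy sketch under stated assumptions: the `set_style` and `export` tool names, the `Document` class, and the target identifiers are hypothetical, and a plain list stands in for the calls a Live session would emit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A tool call is (tool name, arguments); this shape is an assumption.
ToolCall = Tuple[str, Dict[str, str]]


@dataclass
class Document:
    """Toy document: a style per block plus an exported flag."""
    styles: Dict[str, str] = field(default_factory=dict)
    exported: bool = False


def apply_call(doc: Document, call: ToolCall) -> None:
    """Apply one tool call to the document."""
    name, args = call
    if name == "set_style":
        doc.styles[args["target"]] = args["style"]
    elif name == "export":
        doc.exported = True
    else:
        raise ValueError(f"unknown tool: {name}")


def run_sequence(doc: Document, calls: List[ToolCall]) -> int:
    """Apply the calls in order within one session; return how many ran."""
    for call in calls:
        apply_call(doc, call)
    return len(calls)


# "make the title bold, then make the next paragraph italic, then export"
calls: List[ToolCall] = [
    ("set_style", {"target": "title", "style": "bold"}),
    ("set_style", {"target": "paragraph_1", "style": "italic"}),
    ("export", {}),
]

doc = Document()
ran = run_sequence(doc, calls)
```

Order matters here: the export runs last, so it sees both style changes.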

What the rough edges look like

Three things still bite:

Why this matters for accessibility

The original motivation for the voice-first build is on the accessibility page: AgentDoc is intended to be usable by people who cannot or do not want to use a mouse and keyboard. The transcription-proxy design always felt wrong for that audience because it imposed a delay of roughly two seconds on every utterance, which compounds badly across a long writing session.

Native audio is not just an engineering preference. It's the difference between voice as a novelty input and voice as a primary one.

What's next

Ongoing work: better reconnect handling, a quota-aware fallback to text-mode when the Live budget is low, and a study comparing voice-only vs. text-only completion times across the 13 benchmark scenarios. That last one is on the research page and will get its own write-up here when the data lands.
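
One way the quota-aware fallback mentioned above might be decided is sketched here. Everything in it is an assumption rather than a description of the planned implementation: the `LiveBudget` class, measuring the budget in seconds, and the 60-second threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LiveBudget:
    """Remaining Live allowance; tracking it in seconds is an assumption."""
    remaining_seconds: float
    low_water_seconds: float = 60.0   # hypothetical threshold


def choose_mode(budget: LiveBudget) -> str:
    """Return "text" once the budget reaches the threshold, else "voice"."""
    if budget.remaining_seconds <= budget.low_water_seconds:
        return "text"
    return "voice"


print(choose_mode(LiveBudget(remaining_seconds=300.0)))  # voice
print(choose_mode(LiveBudget(remaining_seconds=30.0)))   # text
```

Because both agents share one tool surface, a switch like this would change only the input modality, not what the agent can do to the document.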
