What is an agent-first UI?

An agent-first UI is a user interface designed primarily to be driven by an autonomous LLM agent – every operation that exists in the UI is exposed as a typed, agent-callable tool with explicit arguments and structured return values. Humans drive the same UI through chat, voice, or direct manipulation, but the canonical control plane is the tool surface, not the keyboard or mouse.

How does tool granularity affect LLM agent reliability?

Coarser composite tools (e.g. one find_and_replace tool) reduce hallucinated index calculations but can hide failure modes. Finer atomic tools (separate find, delete_substring, insert_string) expose every step but multiply opportunities for index drift. Our benchmark across 20 workflow configurations shows neither extreme wins – atomic primitives plus a small set of well-named macros (replace_all, format_all_matches) gives the best accuracy / token tradeoff.

Why does an agent-first editor need a ReAct FSM?

Pure ReAct prompting tells the agent to verify after every mutation, but does not enforce it. A finite-state machine that gates which tools are unlocked in each phase (SEARCH → MUTATE → VERIFY → SEARCH) makes verification structurally inevitable: write tools auto-lock until a get_document_context call confirms the new state. This eliminates index drift after multi-step edits even when the LLM forgets the verification instruction.

How does Gemini Live integrate with an MCP document editor?

Gemini 3.1 Flash Live (preview, native audio) connects directly to the MCP server with the same tool list as the text agent. Each voice turn becomes a tool-call sequence against typed schemas; the model receives observing feedback (e.g. 'INDEX SHIFT – re-read before next mutation') so it can reason over results without round-tripping to a separate text agent. This eliminates the double-LLM overhead of proxy designs.

Building agent-first UI for document editors

AgentDoc is a working reference implementation of an agent-first document editor: every operation is a typed MCP tool, the canonical control plane is the tool surface, and humans drive it through chat, voice, or direct manipulation on top of the same primitives. It also doubles as a public benchmark – 20 evaluated workflow configurations comparing tool granularity, ReAct FSM gating, macro composition, observing feedback, and tool bloat across 13 scenarios.

If you are designing an agent-first UI or MCP tools, or evaluating LLM agent reliability for document-style tasks, this page is a tour of the design decisions, the failure modes that drove them, and where to read or fork the code.


What "agent-first UI" means

An agent-first UI is one where every visible operation has a corresponding typed tool with explicit arguments and structured return values, and where the agent is treated as a first-class user – not a bolted-on chat sidebar. Three concrete consequences:

Tool surface is the canonical control plane. The UI buttons call the same tools the agent does. There is no "agent-only" path and no "human-only" path.
State is observable. A read tool (get_document_context) returns both raw Markdown and rendered HTML in one call so the agent and the human see the same thing.
Every mutation is verifiable. Returns include enough metadata (success, range, observable side effects) that the agent can reason over results without a screenshot or DOM scrape.

MCP tool design patterns

Five patterns we converged on after evaluating 20 workflow variants:

Stand-off formatting. The agent never writes raw HTML or inline-style spans. It calls format_text(start_index, end_index, format_type, format_value) with semantic tokens such as format_type="color", format_value="blue"; the backend stores formatting as separate [start, end, css_class] triples that survive text edits.
Integer document IDs, not UUIDs. Auto-incrementing integers eliminate character-drop hallucinations when agents copy IDs from one tool call to another.
Invisible token injection. JWT auth tokens and the active doc_id are auto-injected by the MCP wrapper. The LLM never sees security parameters in its tool schema, so it cannot accidentally expose or misformat them.
Atomic primitives + a small macro layer. 21 primitives (find, insert, delete, format) cover any operation. 12 macros (e.g. macro_replace_all, macro_format_all_matches) atomically batch reverse-order operations to prevent index drift on bulk edits. We measured both extremes – pure primitives, all-macros – and the hybrid wins.
Observing feedback over silent enforcement. When indices shift after a mutation, the response includes "observation": "INDEX SHIFT – cached indices are stale". The agent reads it, the human ignores it.
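To make the stand-off idea concrete, here is a minimal sketch of how [start, end, css_class] triples could be kept valid across an insertion. The Span class, the shift_spans_for_insert helper, and the rule that an insertion inside a span grows that span are all assumptions for illustration; the actual backend may resolve these cases differently.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One stand-off formatting triple (hypothetical representation)."""
    start: int       # inclusive character index
    end: int         # exclusive character index
    css_class: str


def shift_spans_for_insert(spans, index, length):
    """Return new spans adjusted for `length` characters inserted at `index`.

    Assumed policy: an insertion at or before a span moves it right,
    an insertion strictly inside a span grows it, anything later leaves it alone.
    """
    adjusted = []
    for span in spans:
        if index <= span.start:
            adjusted.append(Span(span.start + length, span.end + length, span.css_class))
        elif index < span.end:
            adjusted.append(Span(span.start, span.end + length, span.css_class))
        else:
            adjusted.append(Span(span.start, span.end, span.css_class))
    return adjusted


# The word "world" is blue; inserting text at the front must not break that.
text = "Hello world"
spans = [Span(6, 11, "color-blue")]

prefix = "Oh, "
text = prefix + text
spans = shift_spans_for_insert(spans, 0, len(prefix))

formatted = text[spans[0].start:spans[0].end]
```

Because the formatting lives next to the text rather than inside it, the text mutation and the span adjustment stay two small, separately testable steps.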

Example – a typed FastMCP tool

from typing import Literal

# Assumption: the standalone FastMCP package; the server name is illustrative.
from fastmcp import FastMCP

mcp = FastMCP("AgentDoc")


@mcp.tool()
async def format_text(
    start_index: int,
    end_index: int,
    format_type: Literal["color", "size", "highlight", "font",
                         "indent", "align", "script", "decoration",
                         "letterspacing", "linespacing", "link"],
    format_value: str,
    target_area: Literal["body", "header", "footer"] = "body",
) -> dict:
    """Apply stand-off formatting to a character range. Index-preserving:
    safe to call repeatedly without re-reading the document."""
    ...
    # Structured return value the agent can reason over.
    return {"status": "success", "formatted_range": [start_index, end_index]}

MCP server compatibility

The MCP server at https://agent-doc-edit.com/mcp/sse speaks the standard Model Context Protocol over SSE. Any MCP client can connect – including custom agents built with GPT, Gemini, or open-source models. The built-in voice and text agents use Google Gemini, but the 86 tools are model-agnostic.

Available MCP tools

The backend registers 86 MCP tools organized in six categories. Depending on the workflow configuration, the agent sees between 21 and 83 tools simultaneously.

get_document_context

Returns both raw Markdown and rendered HTML in a single call. The primary read tool – reduces roundtrips versus separate get_text / get_html.

find

Regex-powered, case-insensitive search. Returns all matches with exact [start, end) indices and 150-char context. Supports body, header, and footer areas.
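A rough sketch of what such a search tool might return, written in plain Python rather than as a registered MCP tool. The field names and the way the 150-character context is split (up to 75 characters on each side of the match) are assumptions, not the documented response schema.

```python
import re


def find(text, pattern, context_chars=150):
    """Case-insensitive regex search returning half-open [start, end) indices."""
    half = context_chars // 2
    matches = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        start, end = m.span()  # end is exclusive, so text[start:end] is the match
        matches.append({
            "start": start,
            "end": end,
            "match": m.group(0),
            # Assumed context policy: up to `half` characters on each side.
            "context": text[max(0, start - half):end + half],
        })
    return {"status": "success", "count": len(matches), "matches": matches}


result = find("Total due. The total is 40.", r"total")
ranges = [(m["start"], m["end"]) for m in result["matches"]]
```

Half-open ranges mean the pair can be passed straight into a slice or into a delete call without off-by-one adjustments.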

insert_string / delete_substring

Index-based text mutations. Combined with find, these enable any text transformation. Header and footer variants operate on isolated content areas.

format_text

Apply stand-off formatting: 15 colors, 10 highlights, 12 fonts, 7 sizes, bold, italic, underline, strikethrough, sub/superscript, letter & line spacing, links. The agent never writes raw HTML.

format_table

Style Markdown tables with border style/color/width, background colors, text alignment, column widths, cell padding, and row/column striping – all via stand-off metadata.

Macro tools (12)

Composite operations in a single atomic call: macro_replace_styled_text, macro_format_all_matches, macro_replace_all, macro_delete_all, macro_insert_at_all, and more. Process all matches in reverse order to prevent index drift.
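The reverse-order trick can be sketched in a few lines. This is an illustration of the technique, not the source of macro_replace_all: the function name, the literal (non-regex) matching, and the return fields are assumptions.

```python
import re


def replace_all(text, needle, replacement):
    """Replace every case-insensitive occurrence of `needle`, last match first.

    All match positions are computed once against the original text. Applying
    them from the end of the document backwards means each edit only moves
    characters that come after the matches still waiting to be processed,
    so the earlier indices stay valid.
    """
    spans = [m.span() for m in re.finditer(re.escape(needle), text, flags=re.IGNORECASE)]
    for start, end in reversed(spans):
        text = text[:start] + replacement + text[end:]
    return {"status": "success", "replaced": len(spans), "text": text}


result = replace_all("cat sat, cat ran", "cat", "tiger")
```

Applying the same spans front to back would corrupt the second replacement, because the first one lengthens the text by two characters.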

Scratchpad tools

plan_tasks, mark_task_done, abort_task – persistent per-session memory with self-conditioning detection. Warns the agent after 2 repeated failures, recommends abort after 3.
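One way the warn-after-2, abort-after-3 behaviour could work is a per-session failure counter. In the sketch below the Scratchpad class, the record_failure method, the message wording, and the definition of a "repeated" failure (same tool name with the same arguments) are all assumptions.

```python
import json
from collections import Counter


class Scratchpad:
    """Per-session task list plus a repeated-failure detector (hypothetical)."""

    WARN_AFTER = 2    # thresholds taken from the description above
    ABORT_AFTER = 3

    def __init__(self):
        self.tasks = {}            # task description -> done flag
        self.failures = Counter()  # identical failing call -> count

    def plan_tasks(self, tasks):
        self.tasks = {task: False for task in tasks}
        return {"status": "success", "open_tasks": list(self.tasks)}

    def mark_task_done(self, task):
        if task not in self.tasks:
            return {"status": "error", "message": f"unknown task: {task}"}
        self.tasks[task] = True
        remaining = [t for t, done in self.tasks.items() if not done]
        return {"status": "success", "open_tasks": remaining}

    def record_failure(self, tool_name, arguments):
        """Count identical failing calls and return a hint once thresholds are hit."""
        key = tool_name + json.dumps(arguments, sort_keys=True)
        self.failures[key] += 1
        count = self.failures[key]
        if count >= self.ABORT_AFTER:
            return f"ABORT RECOMMENDED: {tool_name} failed {count} times with the same arguments"
        if count >= self.WARN_AFTER:
            return f"WARNING: {tool_name} failed {count} times with the same arguments"
        return None


pad = Scratchpad()
pad.plan_tasks(["fix title", "add table of contents"])
hints = [pad.record_failure("delete_substring", {"start": 5, "end": 2}) for _ in range(3)]
```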

Document management

create_document, rename_document, set_active_document, navigate_to_page, set_page_layout, trigger_pdf_download, generate_table_of_contents.

Architecture overview

Containerized microservice stack

Docker Compose: Frontend (Nginx + vanilla JS), Backend (FastAPI + embedded FastMCP on port 8091), Agent (Gemini 3 Flash + MCP Client), PostgreSQL, and Redis. Only port 8080 is exposed, and it is published through a Cloudflare tunnel; all backend services communicate over an internal Docker bridge network.

Key design decisions

Embedded MCP server: FastMCP runs inside the FastAPI lifespan – zero network hop between tools and database (8–15ms per tool call)
Native Markdown storage: Documents are single Markdown strings, not block hierarchies. HTML is rendered on-the-fly with custom Python-Markdown extensions
Stand-off formatting: Formatting stored as index-based [start, end, css_class] triples, separate from content. The agent uses semantic tokens (e.g. "color blue") that the backend translates to CSS classes
Integer document IDs: Auto-incrementing integers instead of UUIDs – eliminates character-drop hallucinations when agents pass IDs to tools
Invisible token injection: JWT auth tokens and doc_id are auto-injected into MCP tool calls – the LLM never sees security parameters in its schema
Persistent agent memory: Full tool-call history stored in PostgreSQL and replayed as native Gemini Content objects, including thought signatures for Gemini 3.x models
Dual WebSocket sync: A Document WebSocket pushes content updates after every mutation. A User WebSocket handles account-level events (e.g. active document changed by the agent)
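Invisible token injection can be illustrated with a plain-Python wrapper. This is a sketch of the idea only, not how the AgentDoc wrapper or FastMCP implements it: hide_and_inject, the parameter names auth_token and doc_id, and the session lookup callable are all hypothetical.

```python
import inspect
from functools import wraps

HIDDEN_PARAMS = ("auth_token", "doc_id")  # assumed names of the injected parameters


def hide_and_inject(tool_fn, get_session):
    """Wrap a tool so hidden parameters vanish from its signature and are
    filled in from server-side session state at call time."""
    signature = inspect.signature(tool_fn)
    visible = [p for name, p in signature.parameters.items() if name not in HIDDEN_PARAMS]

    @wraps(tool_fn)
    def wrapper(**llm_args):
        leaked = set(llm_args) & set(HIDDEN_PARAMS)
        if leaked:
            raise ValueError(f"model supplied hidden parameters: {sorted(leaked)}")
        session = get_session()
        injected = {name: session[name] for name in HIDDEN_PARAMS
                    if name in signature.parameters}
        return tool_fn(**llm_args, **injected)

    # A schema generator that introspects the wrapper sees only visible parameters.
    wrapper.__signature__ = signature.replace(parameters=visible)
    wrapper.__annotations__ = {k: v for k, v in tool_fn.__annotations__.items()
                               if k not in HIDDEN_PARAMS}
    return wrapper


def insert_string(index: int, text: str, auth_token: str, doc_id: int) -> dict:
    """Demo tool: the real handler would validate the JWT and touch the database."""
    authorized = auth_token == "secret-jwt"
    return {"status": "success" if authorized else "denied", "doc_id": doc_id, "index": index}


tool = hide_and_inject(insert_string, lambda: {"auth_token": "secret-jwt", "doc_id": 7})
visible_names = list(inspect.signature(tool).parameters)
response = tool(index=0, text="Hi")
```

The model-facing schema and the handler signature are deliberately different objects, which is what keeps the security parameters out of the prompt.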

Twenty workflow configurations – tool granularity, ReAct FSM, tool bloat

AgentDoc dynamically controls which tools the agent sees and when. Each workflow is a controlled experiment in tool granularity (primitives vs. macros vs. atomic decomposition), ReAct FSM gating (state-machine enforcement vs. instructions-only), and tool bloat (how many tools the model can sustainably reason over). Each workflow is selected via a single string parameter in the chat request. The Agent Scratchpad (plan_tasks, mark_task_done, abort_task) is active in A–K and O; workflows L, M, N, P, Q, R, S, T disable it. Workflow P is the production default for both the text agent and the voice agent.

Workflow A (24 tools) – Static baseline: all primitive tools, no restrictions, neutral scratchpad
Workflow B (83 tools) – Tool Bloat benchmark: monolithic format_text replaced with 46 atomic endpoints + 12 macros, expanded literal enums
Workflow C (27 tools, 10 initial) – Active Discovery: agent starts with read-only tools and must call discover_tools to unlock mutations
Workflow D (26 tools, 6 initial) – Enforced Verification loop: write tools auto-lock after every mutation, forcing a verification read
Workflow E (38 tools) – Macro composite tools: primitives + 12 macros including bulk operations
Workflow F (27 tools, 7 initial) – Finite-State Machine: 8-state FSM controlling tool access through explicit state transitions (SEARCH → MUTATE → VERIFY)
Workflow G (39 tools, 7 initial) – FSM + Macros: combines F's state machine with E's macros. Atomic macro calls skip the MUTATE state
Workflow H (25 tools) – Instructed ReAct: same tools as A, plus an explicit verification instruction in the system prompt. No infrastructure enforcement
Workflow I (38 tools) – Instructed ReAct + Macros: combines H's prompt discipline with E's macros
Workflow J (25 tools) – Observing FSM: state feedback after every action, no tool blocking. Tests whether feedback alone produces verification discipline
Workflow K (38 tools) – Observing FSM + Macros: J's feedback plus E's macros
Workflow L (21 tools) – Bare primitives: no scratchpad, no feedback, no verification. Minimal-baseline isolating the scratchpad effect
Workflow M (35 tools) – Observing FSM + Macros without scratchpad. Pareto-optimal at 100% accuracy and ~66k tokens / step
Workflow N (35 tools) – Macros only, no scratchpad and no feedback. Isolates the value of feedback when macros are present
Workflow O (25 tools) – Few-Shot ReAct: H's tools, but verification taught via a few-shot example trajectory (closest to original Yao et al. 2023 ReAct)
Workflow P (35 tools) – Instructed ReAct + Macros without scratchpad. Production default for the text and voice agent: 100% accuracy at ~72k tokens / step on Bench 12
Workflow Q (22 tools) – Instructed ReAct + Primitives without scratchpad. Completes the 2×2 scratchpad-isolation matrix
Workflow R (~35 tools, 7 initial) – Optimized FSM: F's enforcement plus macros, batch mutations, and intent inference. Skips the IDLE state
Workflow S (~35 tools, 7 initial) – Smart Verification FSM: index-shifting operations (insert_string, delete_substring, replace_substring) force a verify; index-preserving formatting does not
Workflow T (~35 tools, 7 initial) – Smart Observing FSM: like S, but feedback only ("INDEX SHIFT – cached indices are stale"), no blocking
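The gating idea behind the FSM workflows can be sketched as a small state machine. This is a three-state simplification in the spirit of Workflow S, not the 8-state FSM of Workflow F: the ToolGate class, the tool groupings, and the exact transition rules are assumptions.

```python
from enum import Enum


class State(Enum):
    SEARCH = "SEARCH"
    MUTATE = "MUTATE"
    VERIFY = "VERIFY"


READ_TOOLS = {"find", "get_document_context"}
INDEX_SHIFTING = {"insert_string", "delete_substring", "replace_substring"}
INDEX_PRESERVING = {"format_text"}


class ToolGate:
    """Decides which tools are unlocked and advances the state after each call."""

    def __init__(self):
        self.state = State.SEARCH

    def allowed_tools(self):
        if self.state is State.SEARCH:
            return set(READ_TOOLS)
        if self.state is State.MUTATE:
            return READ_TOOLS | INDEX_SHIFTING | INDEX_PRESERVING
        return {"get_document_context"}  # VERIFY: only the confirming read

    def call(self, tool):
        if tool not in self.allowed_tools():
            raise PermissionError(f"{tool} is locked in state {self.state.value}")
        if self.state is State.SEARCH and tool == "find":
            self.state = State.MUTATE
        elif self.state is State.MUTATE and tool in INDEX_SHIFTING:
            self.state = State.VERIFY   # write tools lock until a verifying read
        elif self.state is State.VERIFY and tool == "get_document_context":
            self.state = State.SEARCH
        # Index-preserving formatting and plain reads leave the state unchanged.
        return self.state


gate = ToolGate()
trace = [gate.call(t).value for t in ("find", "format_text", "insert_string")]
```

After the insert_string call the gate sits in VERIFY, so a second mutation is rejected until get_document_context has been called.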

Gemini Live + MCP voice agent – direct tool calling, no proxy

The Voice Agent runs Gemini 3.1 Flash Live (Preview, native audio) and connects directly to the FastMCP server – the same 35 tools as Workflow P, with the same system prompt. Each tool call returns observing feedback (e.g. "INDEX SHIFT – re-read before next mutation") so the voice model can reason over results without a hop to a separate text agent. This replaces the earlier pure-proxy design (send_to_text_agent via HTTP roundtrip) and eliminates the double-LLM overhead, while still producing the same edits as the text chat. Voice activity detection (VAD) sensitivity is tuned to prevent accidental interruptions during long agent responses.

Common questions developers ask

Why does my LLM agent hallucinate tool calls?

Three reliable causes in our benchmark: (1) tool bloat – past ~40 visible tools, accuracy degrades sharply on most models even when individual tools are well-named; (2) ambiguous tool names – update_text and edit_text both visible at the same time invite invented signatures; (3) missing post-mutation feedback – the agent assumes its previous read is still valid and fabricates indices into a now-shifted document. Fixes: cap visible tools, use one canonical verb per operation, and either FSM-gate verification or surface index-shift observations in tool returns.

How do I prevent index drift in agent text edits?

Index drift happens when an agent caches start and end indices from a find call, then performs an insert_string earlier in the document, which invalidates every cached index after the insertion point. Three fixes, in increasing infrastructure cost: (1) instruct the agent to re-find before every mutation (works ~85% of the time); (2) emit observing feedback in tool returns so the agent sees when a shift occurred; (3) FSM-gate write tools so they auto-lock after every mutation, forcing a verification read before the next edit. Workflows S and T in our benchmark apply this only to index-shifting operations (S by forcing a verify, T through feedback alone), leaving format-only edits unblocked – Pareto-optimal in our measurements.
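A small reproduction of the failure and of fix (2) may help. The helper names and the dict-based document are hypothetical stand-ins; only the observation string is taken from the tool responses described on this page.

```python
INDEX_SHIFT = "INDEX SHIFT – cached indices are stale"


def find_first(doc, needle):
    """Return the half-open (start, end) range of the first occurrence."""
    start = doc["text"].find(needle)
    return (start, start + len(needle))


def insert_string(doc, index, text):
    """Insert text and tell the caller that earlier reads are now invalid."""
    doc["text"] = doc["text"][:index] + text + doc["text"][index:]
    return {
        "status": "success",
        "inserted_range": [index, index + len(text)],
        "observation": INDEX_SHIFT,  # feedback only; nothing is blocked
    }


doc = {"text": "Invoice total: 40 EUR"}
cached = find_first(doc, "40")                 # indices the agent remembers
response = insert_string(doc, 0, "Final ")     # mutation earlier in the document

stale_slice = doc["text"][cached[0]:cached[1]]  # what the cached range points at now
fresh = find_first(doc, "40")                   # what a re-find returns
```

The cached range no longer covers "40" after the insert, which is exactly the situation the observation field warns about; re-running the search yields the shifted range.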

Is there an open benchmark for agent document editing?

Yes – AgentDoc's harness includes 13 scenarios with deterministic expected-state assertions, runs your agent across the 20 workflow configurations, and reports Levenshtein distance, character-diff, tool-call counts, token usage, and per-step latency. Output is CSV plus per-run JSON traces for qualitative inspection. Useful for evaluating new MCP tool designs, comparing models, or reproducing tool-bloat thresholds on your own stack.
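For readers unfamiliar with the first metric, this is the textbook dynamic-programming form of Levenshtein distance between an expected and an actual document string. It is a generic sketch; how the harness itself computes or normalizes the score is not specified here.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string `a` into string `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


expected = "Dear Ms. Lee,"
actual = "Dear Mr. Lee,"
distance = levenshtein(expected, actual)
```

A distance of 0 means the agent reproduced the expected state exactly; small non-zero values usually point to a single misplaced edit rather than a failed plan.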

How does this compare to Cursor or Copilot?

Cursor and Copilot are agent-first code editors. AgentDoc applies the same architecture to prose documents – paginated A4 pages, headers/footers, formal letter formatting, table styling, PDF export. The agent-first patterns transfer (typed tools, FSM gating, observing feedback) but the tool surface differs: instead of open_file + edit_file, the primitives are find + insert_string + format_text + generate_table_of_contents.

Get started

The MCP server is reachable at https://agent-doc-edit.com/mcp/sse via the Nginx reverse proxy. To connect your own agent, point your MCP client at that endpoint.

The benchmark harness, evaluation scripts, and full source for the 20 workflow configurations are documented in the accompanying thesis (linked from /research). Interested in forking, citing, or contributing? Reach out directly:

Contact via Email Read the research →