Building agent-first UI for document editors
AgentDoc is a working reference implementation of an agent-first document editor: every operation is a typed MCP tool, the canonical control plane is the tool surface, and humans drive it through chat, voice, or direct manipulation on top of the same primitives. It also doubles as a public benchmark – 20 evaluated workflow configurations comparing tool granularity, ReAct FSM gating, macro composition, observing feedback, and tool bloat across 13 scenarios.
If you are designing an agent-first UI, building MCP tools, or evaluating LLM agent reliability for document-style tasks, this page is a tour of the design decisions, the failure modes that drove them, and where to read or fork the code.
What "agent-first UI" means
An agent-first UI is one where every visible operation has a corresponding typed tool with explicit arguments and structured return values, and where the agent is treated as a first-class user – not a bolted-on chat sidebar. Three concrete consequences:
- Tool surface is the canonical control plane. The UI buttons call the same tools the agent does. There is no "agent-only" path and no "human-only" path.
- State is observable. A read tool (get_document_context) returns both raw Markdown and rendered HTML in one call so the agent and the human see the same thing.
- Every mutation is verifiable. Returns include enough metadata (success, range, observable side effects) that the agent can reason over results without a screenshot or DOM scrape.
MCP tool design patterns
Five patterns we converged on after evaluating 20 workflow variants:
- Stand-off formatting. The agent never writes raw HTML or inline-style spans. It calls format_text(start, end, format_type, format_value) with semantic tokens like color="blue"; the backend stores formatting as separate [start, end, css_class] triples that survive text edits.
- Integer document IDs, not UUIDs. Auto-incrementing integers eliminate character-drop hallucinations when agents copy IDs from one tool call to another.
- Invisible token injection. JWT auth tokens and the active doc_id are auto-injected by the MCP wrapper. The LLM never sees security parameters in its tool schema, so it cannot accidentally expose or misformat them.
- Atomic primitives + a small macro layer. 21 primitives (find, insert, delete, format) cover any operation. 12 macros (e.g. macro_replace_all, macro_format_all_matches) atomically batch reverse-order operations to prevent index drift on bulk edits. We measured both extremes – pure primitives, all-macros – and the hybrid wins.
- Observing feedback over silent enforcement. When indices shift after a mutation, the response includes "observation": "INDEX SHIFT – cached indices are stale". The agent reads it, the human ignores it.
Example – a typed FastMCP tool
    from typing import Literal

    # Assumption: the standalone fastmcp package; the original snippet omits imports.
    from fastmcp import FastMCP

    mcp = FastMCP("AgentDoc")  # server name is illustrative


    @mcp.tool()
    async def format_text(
        start_index: int,
        end_index: int,
        format_type: Literal["color", "size", "highlight", "font",
                             "indent", "align", "script", "decoration",
                             "letterspacing", "linespacing", "link"],
        format_value: str,
        target_area: Literal["body", "header", "footer"] = "body",
    ) -> dict:
        """Apply stand-off formatting to a character range. Index-preserving:
        safe to call repeatedly without re-reading the document."""
        ...  # formatting logic elided in the original
        return {"status": "success", "formatted_range": [start_index, end_index]}
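The stand-off storage behind format_text can be illustrated with a small sketch of how [start, end, css_class] triples might be adjusted when text is inserted. The function name shift_spans_on_insert and the rule that an insert inside a span makes the span grow are assumptions for illustration, not AgentDoc's actual backend logic.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, css_class), end is exclusive


def shift_spans_on_insert(spans: List[Span], pos: int, length: int) -> List[Span]:
    """Return spans adjusted for `length` characters inserted at index `pos`."""
    shifted = []
    for start, end, css_class in spans:
        if pos <= start:
            # Insert lands before the span: move the whole span right.
            shifted.append((start + length, end + length, css_class))
        elif pos < end:
            # Insert lands inside the span: the span grows (assumed policy).
            shifted.append((start, end + length, css_class))
        else:
            # Insert lands after the span: nothing changes.
            shifted.append((start, end, css_class))
    return shifted


text = "Hello world"
spans = [(6, 11, "color-blue")]  # "world" is formatted blue

insert_at, new_text = 0, ">> "
text = text[:insert_at] + new_text + text[insert_at:]
spans = shift_spans_on_insert(spans, insert_at, len(new_text))

start, end, _ = spans[0]
print(text[start:end])  # the span still covers "world"
```

Because the content string never contains markup, a text edit only needs this index arithmetic to keep formatting attached to the right characters.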
MCP server compatibility
The MCP server at https://agent-doc-edit.com/mcp/sse speaks the standard Model Context Protocol over SSE. Any MCP client can connect – including custom agents built with GPT, Gemini, or open-source models. The built-in voice and text agents use Google Gemini, but the 86 tools are model-agnostic.
Available MCP tools
The backend registers 86 MCP tools organized in six categories. Depending on the workflow configuration, the agent sees between 21 and 83 tools simultaneously.
get_document_context
Returns both raw Markdown and rendered HTML in a single call. The primary read tool – reduces roundtrips versus separate get_text / get_html.
find
Regex-powered, case-insensitive search. Returns all matches with exact [start, end) indices and 150-char context. Supports body, header, and footer areas.
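As a rough sketch of what such a search tool returns, the function below does a case-insensitive regex scan and reports half-open [start, end) indices with surrounding context. It is a simplified stand-in, not the real find tool: the return field names are hypothetical, and whether the 150 characters apply per side or in total is an assumption (per side here).

```python
import re
from typing import Dict, List


def find(text: str, pattern: str, context_chars: int = 150) -> List[Dict]:
    """Case-insensitive regex search returning half-open [start, end) matches."""
    matches = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        ctx_start = max(0, m.start() - context_chars)
        ctx_end = min(len(text), m.end() + context_chars)
        matches.append({
            "start": m.start(),
            "end": m.end(),  # exclusive, so text[start:end] is the match
            "match": m.group(0),
            "context": text[ctx_start:ctx_end],
        })
    return matches


doc = "Invoice total: 40 EUR. The invoice is due in 14 days."
hits = find(doc, r"invoice")
for hit in hits:
    print(hit["start"], hit["end"], hit["match"])
```

Half-open ranges let the agent pass a match's start and end straight into a slice-style mutation without off-by-one adjustments.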
insert_string / delete_substring
Index-based text mutations. Combined with find, these enable any text transformation. Header and footer variants operate on isolated content areas.
format_text
Apply stand-off formatting: 15 colors, 10 highlights, 12 fonts, 7 sizes, bold, italic, underline, strikethrough, sub/superscript, letter & line spacing, links. The agent never writes raw HTML.
format_table
Style Markdown tables with border style/color/width, background colors, text alignment, column widths, cell padding, and row/column striping – all via stand-off metadata.
Macro tools (12)
Composite operations in a single atomic call: macro_replace_styled_text, macro_format_all_matches, macro_replace_all, macro_delete_all, macro_insert_at_all, and more. Process all matches in reverse order to prevent index drift.
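The reverse-order trick is easy to see in isolation. The sketch below is a plain-Python approximation of a replace-all macro; the function name, the return shape, and treating the replacement as literal text are assumptions, and the real macro_replace_all also has to update stand-off formatting, which is omitted here.

```python
import re
from typing import Tuple


def replace_all(text: str, pattern: str, replacement: str) -> Tuple[str, int]:
    """Replace every match, applying edits from the last match to the first."""
    matches = list(re.finditer(pattern, text, flags=re.IGNORECASE))
    for m in reversed(matches):
        # Editing right-to-left leaves the indices of earlier matches untouched,
        # so the match positions computed up front stay valid.
        text = text[:m.start()] + replacement + text[m.end():]
    return text, len(matches)


new_text, count = replace_all("cat, Cat, and cat", r"cat", "tiger")
print(new_text, count)
```

Applying the same edits left to right would require re-adding the length difference to every later index after each replacement, which is exactly the bookkeeping agents tend to get wrong.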
Scratchpad tools
plan_tasks, mark_task_done, abort_task – persistent per-session memory with self-conditioning detection. Warns the agent after 2 repeated failures, recommends abort after 3.
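One way to picture the warn-then-abort thresholds is a per-session counter keyed on the failing call. This is a sketch under assumptions: the class name, the message wording, and the choice to treat "same tool with same arguments" as a repeated failure are all hypothetical, since the source does not describe how repetition is detected.

```python
from collections import Counter
from typing import Optional


class FailureTracker:
    """Counts identical failing tool calls within one session."""

    WARN_AFTER = 2   # thresholds taken from the description above
    ABORT_AFTER = 3

    def __init__(self) -> None:
        self._failures: Counter = Counter()

    def record_failure(self, tool_name: str, args: dict) -> Optional[str]:
        # Same tool plus same arguments counts as the same repeated attempt.
        key = (tool_name, repr(sorted(args.items())))
        self._failures[key] += 1
        n = self._failures[key]
        if n >= self.ABORT_AFTER:
            return f"{tool_name} failed {n} times with the same arguments; consider abort_task."
        if n >= self.WARN_AFTER:
            return f"{tool_name} failed {n} times with the same arguments; try a different approach."
        return None


tracker = FailureTracker()
notes = [tracker.record_failure("delete_substring", {"start": 10, "end": 5}) for _ in range(3)]
for note in notes:
    print(note)
```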
Document management
create_document, rename_document, set_active_document, navigate_to_page, set_page_layout, trigger_pdf_download, generate_table_of_contents.
Architecture overview
Containerized microservice stack
Docker Compose: Frontend (Nginx + vanilla JS), Backend (FastAPI + embedded FastMCP on port 8091), Agent (Gemini 3 Flash + MCP Client), PostgreSQL, and Redis. Only port 8080 is exposed – all backend services communicate over an internal Docker bridge network, publicly tunneled via Cloudflare.
Key design decisions
- Embedded MCP server: FastMCP runs inside the FastAPI lifespan – zero network hop between tools and database (8–15ms per tool call)
- Native Markdown storage: Documents are single Markdown strings, not block hierarchies. HTML is rendered on-the-fly with custom Python-Markdown extensions
- Stand-off formatting: Formatting stored as index-based [start, end, css_class] triples, separate from content. The agent uses semantic tokens (e.g. "color blue") that the backend translates to CSS classes
- Integer document IDs: Auto-incrementing integers instead of UUIDs – eliminates character-drop hallucinations when agents pass IDs to tools
- Invisible token injection: JWT auth tokens and doc_id are auto-injected into MCP tool calls – the LLM never sees security parameters in its schema
- Persistent agent memory: Full tool-call history stored in PostgreSQL and replayed as native Gemini Content objects, including thought signatures for Gemini 3.x models
- Dual WebSocket sync: A Document WebSocket pushes content updates after every mutation. A User WebSocket handles account-level events (e.g. active document changed by the agent)
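The invisible token injection decision can be sketched with a plain Python decorator that removes parameters from the signature a schema generator would inspect and fills them in at call time. Everything here is illustrative: inject_hidden, the session dict, and the assumption that the tool schema is derived from the function signature are not taken from AgentDoc's code.

```python
import functools
import inspect
from typing import Any, Callable, Dict


def inject_hidden(hidden: Dict[str, Callable[[], Any]]):
    """Hide selected parameters from a tool's signature and fill them at call time."""
    def decorator(func):
        sig = inspect.signature(func)
        visible = [p for name, p in sig.parameters.items() if name not in hidden]

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Drop anything the model tried to pass for a hidden parameter.
            clean = {k: v for k, v in kwargs.items() if k not in hidden}
            injected = {name: provider() for name, provider in hidden.items()}
            return func(*args, **clean, **injected)

        # Anything that builds a schema from the signature now sees only
        # the visible parameters.
        wrapper.__signature__ = sig.replace(parameters=visible)
        return wrapper
    return decorator


session = {"doc_id": 42, "jwt": "<token>"}  # hypothetical per-request context


@inject_hidden({"doc_id": lambda: session["doc_id"], "jwt": lambda: session["jwt"]})
def insert_string(index: int, text: str, *, doc_id: int, jwt: str) -> dict:
    return {"status": "success", "doc_id": doc_id, "index": index}


print(list(inspect.signature(insert_string).parameters))
print(insert_string(index=0, text="Hi"))
```

Keeping the hidden parameters keyword-only means a stray positional argument from the model can never land in a security-relevant slot.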
Twenty workflow configurations – tool granularity, ReAct FSM, tool bloat
AgentDoc dynamically controls which tools the agent sees and when. Each workflow is a controlled experiment in tool granularity (primitives vs. macros vs. atomic decomposition), ReAct FSM gating (state-machine enforcement vs. instructions-only), and tool bloat (how many tools the model can sustainably reason over). Each workflow is selected via a single string parameter in the chat request. The Agent Scratchpad (plan_tasks, mark_task_done, abort_task) is active in A–K and O; workflows L, M, N, P, Q, R, S, T disable it. Workflow P is the production default for both the text agent and the voice agent.
- Workflow A (24 tools) – Static baseline: all primitive tools, no restrictions, neutral scratchpad
- Workflow B (83 tools) – Tool Bloat benchmark: monolithic format_text replaced with 46 atomic endpoints + 12 macros, expanded literal enums
- Workflow C (27 tools, 10 initial) – Active Discovery: agent starts with read-only tools and must call discover_tools to unlock mutations
- Workflow D (26 tools, 6 initial) – Enforced Verification loop: write tools auto-lock after every mutation, forcing a verification read
- Workflow E (38 tools) – Macro composite tools: primitives + 12 macros including bulk operations
- Workflow F (27 tools, 7 initial) – Finite-State Machine: 8-state FSM controlling tool access through explicit state transitions (SEARCH → MUTATE → VERIFY)
- Workflow G (39 tools, 7 initial) – FSM + Macros: combines F's state machine with E's macros. Atomic macro calls skip the MUTATE state
- Workflow H (25 tools) – Instructed ReAct: same tools as A, plus an explicit verification instruction in the system prompt. No infrastructure enforcement
- Workflow I (38 tools) – Instructed ReAct + Macros: combines H's prompt discipline with E's macros
- Workflow J (25 tools) – Observing FSM: state feedback after every action, no tool blocking. Tests whether feedback alone produces verification discipline
- Workflow K (38 tools) – Observing FSM + Macros: J's feedback plus E's macros
- Workflow L (21 tools) – Bare primitives: no scratchpad, no feedback, no verification. Minimal-baseline isolating the scratchpad effect
- Workflow M (35 tools) – Observing FSM + Macros without scratchpad. Pareto-optimal at 100% accuracy and ~66k tokens / step
- Workflow N (35 tools) – Macros only, no scratchpad and no feedback. Isolates the value of feedback when macros are present
- Workflow O (25 tools) – Few-Shot ReAct: H's tools, but verification taught via a few-shot example trajectory (closest to original Yao et al. 2023 ReAct)
- Workflow P (35 tools) – Instructed ReAct + Macros without scratchpad. Production default for the text and voice agent: 100% accuracy at ~72k tokens / step on Bench 12
- Workflow Q (22 tools) – Instructed ReAct + Primitives without scratchpad. Completes the 2×2 scratchpad-isolation matrix
- Workflow R (~35 tools, 7 initial) – Optimized FSM: F's enforcement plus macros, batch mutations, and intent inference. Skips the IDLE state
- Workflow S (~35 tools, 7 initial) – Smart Verification FSM: index-shifting operations (insert_string, delete_substring, replace_substring) force a verify; index-preserving formatting does not
- Workflow T (~35 tools, 7 initial) – Smart Observing FSM: like S, but feedback only ("INDEX SHIFT – cached indices are stale"), no blocking
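The difference between blocking (S) and observing (T) gating can be sketched as a tiny two-state gate. This is a simplification under stated assumptions: the real workflows use a larger FSM, READ_TOOLS is a guess at which tools count as verification, the blocked-response wording is invented, and blocking every non-read tool while indices are stale is one possible reading of the S description.

```python
from typing import Callable, Dict

INDEX_SHIFTING = {"insert_string", "delete_substring", "replace_substring"}
READ_TOOLS = {"get_document_context", "find"}  # assumed verification tools
STALE_NOTE = "INDEX SHIFT – cached indices are stale"


class SmartGate:
    def __init__(self, blocking: bool) -> None:
        self.blocking = blocking  # True resembles workflow S, False workflow T
        self.stale = False        # set by index-shifting edits, cleared by reads

    def call(self, name: str, tool: Callable[..., Dict], **kwargs) -> Dict:
        if name in READ_TOOLS:
            self.stale = False
            return tool(**kwargs)
        if self.stale and self.blocking:
            # Hypothetical blocked response; the tool itself is never invoked.
            return {"status": "blocked",
                    "observation": "VERIFY REQUIRED: read before the next edit"}
        result = dict(tool(**kwargs))
        if name in INDEX_SHIFTING:
            self.stale = True
            result["observation"] = STALE_NOTE
        return result


def stub(**kwargs) -> Dict:
    """Stand-in for a real tool; always succeeds."""
    return {"status": "success"}


gate = SmartGate(blocking=True)
trace = [
    gate.call("format_text", stub)["status"],    # formatting does not lock anything
    gate.call("insert_string", stub)["status"],  # shifts indices, sets the stale flag
    gate.call("format_text", stub)["status"],    # refused until a read happens
    gate.call("find", stub)["status"],           # a read clears the flag
    gate.call("format_text", stub)["status"],
]
print(trace)
```

With blocking=False the same gate never refuses a call and only attaches the observation to index-shifting results, leaving the decision to re-read with the model.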
Gemini Live + MCP voice agent – direct tool calling, no proxy
The Voice Agent runs Gemini 3.1 Flash Live (Preview, native audio) and connects directly to the FastMCP server – the same 35 tools as Workflow P, with the same system prompt. Each tool call returns observing feedback (e.g. "INDEX SHIFT – re-read before next mutation") so the voice model can reason over results without a hop to a separate text agent. This replaces the earlier pure-proxy design (send_to_text_agent via HTTP roundtrip) and eliminates the double-LLM overhead, while still producing the same edits as the text chat. VAD sensitivity is tuned to prevent accidental interruptions during long agent responses.
Common questions developers ask
Why does my LLM agent hallucinate tool calls?
Three reliable causes in our benchmark: (1) tool bloat – past ~40 visible tools, accuracy degrades sharply on most models even when individual tools are well-named; (2) ambiguous tool names – update_text and edit_text both visible at the same time invite invented signatures; (3) missing post-mutation feedback – the agent assumes its previous read is still valid and fabricates indices into a now-shifted document. Fixes: cap visible tools, use one canonical verb per operation, and either FSM-gate verification or surface index-shift observations in tool returns.
How do I prevent index drift in agent text edits?
Index drift happens when an agent caches start and end from a find call, then performs an insert_string earlier in the document, which invalidates every cached index after the insertion point. Three fixes, in increasing infrastructure cost: (1) instruct the agent to re-find before every mutation (works ~85% of the time); (2) emit observing feedback in tool returns so the agent sees when a shift occurred; (3) FSM-gate write tools so they auto-lock after every mutation, forcing a verification read before the next edit. Workflows S and T in our benchmark apply this only to index-shifting operations (S by blocking, T by feedback alone), leaving format-only edits unblocked – Pareto-optimal in our measurements.
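A few lines of Python reproduce the failure and the second fix. The insert_string below is a local stand-in that operates on a string, not the MCP tool, and the exact wording and fields of its return value are assumptions modelled on the observation message quoted earlier.

```python
def insert_string(doc: str, index: int, text: str):
    """Insert text and report how far later indices moved."""
    new_doc = doc[:index] + text + doc[index:]
    result = {
        "status": "success",
        "inserted_range": [index, index + len(text)],
        "observation": (
            f"INDEX SHIFT: indices at or after {index} moved by +{len(text)}; "
            "cached indices are stale"
        ),
    }
    return new_doc, result


doc = "Dear team, the report is ready."
start = doc.find("report")
end = start + len("report")          # cached range for "report"

doc, result = insert_string(doc, 0, "URGENT: ")
stale_slice = doc[start:end]         # cached range now points at other text
print(repr(stale_slice))
print(result["observation"])

start = doc.find("report")           # re-find after the mutation
end = start + len("report")
fresh_slice = doc[start:end]
print(repr(fresh_slice))
```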
Is there an open benchmark for agent document editing?
Yes – AgentDoc's harness includes 13 scenarios with deterministic expected-state assertions, runs your agent across the 20 workflow configurations, and reports Levenshtein distance, character-diff, tool-call counts, token usage, and per-step latency. Output is CSV plus per-run JSON traces for qualitative inspection. Useful for evaluating new MCP tool designs, comparing models, or reproducing tool-bloat thresholds on your own stack.
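For readers unfamiliar with the headline metric, here is a compact reference implementation of Levenshtein distance between an expected and an actual document state. It is the textbook dynamic-programming algorithm, not the harness's own scoring code, which is not shown on this page.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


expected = "Hello, world!"
actual = "Hello world"
print(levenshtein(expected, actual))  # 2: the comma and the exclamation mark
```

A distance of 0 means the agent's final document matches the expected state exactly; small non-zero values usually point at stray whitespace or punctuation rather than a failed task.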
How does this compare to Cursor or Copilot?
Cursor and Copilot are agent-first code editors. AgentDoc applies the same architecture to prose documents – paginated A4 pages, headers/footers, formal letter formatting, table styling, PDF export. The agent-first patterns transfer (typed tools, FSM gating, observing feedback) but the tool surface differs: instead of open_file + edit_file, the primitives are find + insert_string + format_text + generate_table_of_contents.
Get started
The MCP server is reachable at https://agent-doc-edit.com/mcp/sse via the Nginx reverse proxy. To connect your own agent, point your MCP client at that endpoint.
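A minimal connection sketch using the official MCP Python SDK (the mcp package) might look like the following. It assumes the endpoint accepts an unauthenticated SSE session, which this page does not confirm; if the server requires credentials, sse_client also accepts a headers argument.

```python
import asyncio

MCP_URL = "https://agent-doc-edit.com/mcp/sse"


async def list_tool_names(url: str = MCP_URL) -> list:
    """Open an SSE session, run the MCP handshake, and return the tool names."""
    # Imported lazily so this module loads even where the SDK is not installed.
    from mcp import ClientSession
    from mcp.client.sse import sse_client

    async with sse_client(url) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            result = await session.list_tools()
            return [tool.name for tool in result.tools]


# Usage (needs network access and `pip install mcp`):
#   print(asyncio.run(list_tool_names()))
```

Which tools come back depends on the workflow configuration active for the session, so the list may be shorter than the full set registered by the backend.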
The benchmark harness, evaluation scripts, and full source for the 20 workflow configurations are documented in the accompanying thesis (linked from /research). Interested in forking, citing, or contributing? Reach out directly:
Contact via Email · Read the research →
Further reading
- Tool Granularity in LLM Agents – findings from 20 workflow configurations × 13 scenarios on atomic vs. composite MCP tool design.
- Rebuilding PDF + DOCX Export – WeasyPrint, Pandoc Lua filters, and trusting the frontend's pagination map.
- Voice-First Document Editing with Gemini Live – architecture and tradeoffs of native-audio MCP tool calling.
- April 2026 Patch Notes – renderer fixes, decoration-class composition, TOGGLE semantics on apply_* tools.