Building agent-first UI for document editors

AgentDoc is a working reference implementation of an agent-first document editor: every operation is a typed MCP tool, the canonical control plane is the tool surface, and humans drive it through chat, voice, or direct manipulation on top of the same primitives. It also doubles as a public benchmark – 20 evaluated workflow configurations comparing tool granularity, ReAct FSM gating, macro composition, observing feedback, and tool bloat across 13 scenarios.

If you are designing an agent-first UI or MCP tools, or evaluating LLM agent reliability for document-style tasks, this page is a tour of the design decisions, the failure modes that drove them, and where to read or fork the code.


What "agent-first UI" means

An agent-first UI is one where every visible operation has a corresponding typed tool with explicit arguments and structured return values, and where the agent is treated as a first-class user – not a bolted-on chat sidebar. Three concrete consequences:

MCP tool design patterns

Five patterns we ended up converging on after evaluating 20 workflow variants:

Example – a typed FastMCP tool

from typing import Literal

# Assumption: FastMCP as shipped in the official MCP Python SDK.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("agentdoc")  # server name is illustrative


@mcp.tool()
async def format_text(
    start_index: int,
    end_index: int,
    format_type: Literal["color", "size", "highlight", "font",
                         "indent", "align", "script", "decoration",
                         "letterspacing", "linespacing", "link"],
    format_value: str,
    target_area: Literal["body", "header", "footer"] = "body",
) -> dict:
    """Apply stand-off formatting to a character range. Index-preserving:
    safe to call repeatedly without re-reading the document."""
    ...
    return {"status": "success", "formatted_range": [start_index, end_index]}

MCP server compatibility

The MCP server at https://agent-doc-edit.com/mcp/sse speaks the standard Model Context Protocol over SSE. Any MCP client can connect – including custom agents built with GPT, Gemini, or open-source models. The built-in voice and text agents use Google Gemini, but the 86 tools are model-agnostic.

Available MCP tools

The backend registers 86 MCP tools organized in six categories. Depending on the workflow configuration, the agent sees between 21 and 83 tools simultaneously.

get_document_context

Returns both raw Markdown and rendered HTML in a single call. The primary read tool – reduces roundtrips versus separate get_text / get_html.

find

Regex-powered, case-insensitive search. Returns all matches with exact [start, end) indices and 150-char context. Supports body, header, and footer areas.
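
A minimal sketch of how such a search could compute its results, assuming the document is a plain in-memory string. The helper name find_matches, the result field names, and the even split of the context window around each match are assumptions for illustration, not AgentDoc's actual implementation.

```python
import re


def find_matches(text: str, pattern: str, context_chars: int = 150) -> list[dict]:
    """Case-insensitive regex search returning half-open [start, end) spans."""
    half = context_chars // 2  # assumption: context is split evenly around the match
    results = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        start, end = m.span()  # end is exclusive, so text[start:end] is the match
        results.append({
            "start": start,
            "end": end,
            "match": m.group(0),
            "context": text[max(0, start - half):end + half],
        })
    return results


doc = "Dear Ms. Lee, the Invoice is attached. Please pay the invoice by Friday."
hits = find_matches(doc, r"invoice")
```

Because the spans are half-open, a caller can pass start and end straight into a slice or a range-based mutation without adjusting by one.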

insert_string / delete_substring

Index-based text mutations. Combined with find, these enable any text transformation. Header and footer variants operate on isolated content areas.
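
To illustrate why two index-based primitives are enough for arbitrary text transformations, here is a small sketch on plain strings. The functions reuse the tool names, but their signatures, the bounds checks, and the replace_range helper are assumptions made for this example.

```python
def insert_string(text: str, index: int, value: str) -> str:
    """Return text with value inserted before position index."""
    if not 0 <= index <= len(text):
        raise ValueError(f"index {index} out of range")
    return text[:index] + value + text[index:]


def delete_substring(text: str, start: int, end: int) -> str:
    """Return text with the half-open range [start, end) removed."""
    if not 0 <= start <= end <= len(text):
        raise ValueError(f"range [{start}, {end}) out of range")
    return text[:start] + text[end:]


def replace_range(text: str, start: int, end: int, value: str) -> str:
    """Hypothetical helper: a replacement is a delete followed by an insert."""
    return insert_string(delete_substring(text, start, end), start, value)


doc = "The colour palette"
start = doc.find("colour")  # stands in for a call to the find tool
doc = replace_range(doc, start, start + len("colour"), "color")
```

After the last line, doc holds "The color palette". Any edit that can be described as "remove this range, put this text there" reduces to the same two calls.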

format_text

Apply stand-off formatting: 15 colors, 10 highlights, 12 fonts, 7 sizes, bold, italic, underline, strikethrough, sub/superscript, letter & line spacing, links. The agent never writes raw HTML.
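
The idea behind stand-off formatting can be sketched as a text buffer plus a separate list of annotated ranges. The class and field names below are hypothetical, and the "later span wins" rule for overlapping spans is an assumption; the point is only that formatting never touches the text, so character indices stay valid.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FormatSpan:
    start: int          # inclusive
    end: int            # exclusive
    format_type: str
    format_value: str


@dataclass
class StandoffDocument:
    text: str
    spans: list[FormatSpan] = field(default_factory=list)

    def format_text(self, start: int, end: int,
                    format_type: str, format_value: str) -> FormatSpan:
        """Record formatting for [start, end) without modifying the text."""
        if not 0 <= start < end <= len(self.text):
            raise ValueError("range out of bounds")
        span = FormatSpan(start, end, format_type, format_value)
        self.spans.append(span)
        return span

    def formats_at(self, index: int) -> dict[str, str]:
        """Formats active at one character; later spans override earlier ones."""
        active: dict[str, str] = {}
        for span in self.spans:
            if span.start <= index < span.end:
                active[span.format_type] = span.format_value
        return active


doc = StandoffDocument("Hello world")
doc.format_text(0, 5, "color", "red")
doc.format_text(0, 11, "font", "serif")
```

Since doc.text is unchanged after both calls, indices returned by an earlier search remain usable for further formatting.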

format_table

Style Markdown tables with border style/color/width, background colors, text alignment, column widths, cell padding, and row/column striping – all via stand-off metadata.

Macro tools (12)

Composite operations in a single atomic call: macro_replace_styled_text, macro_format_all_matches, macro_replace_all, macro_delete_all, macro_insert_at_all, and more. Process all matches in reverse order to prevent index drift.
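
A minimal sketch of the reverse-order idea, not the macro tools' actual code. The helper replace_all_reverse is hypothetical and treats the replacement as literal text; all match positions are collected first and then applied from the end of the document towards the start.

```python
import re


def replace_all_reverse(text: str, pattern: str, replacement: str) -> tuple[str, int]:
    """Replace every case-insensitive match, editing from last match to first."""
    matches = list(re.finditer(pattern, text, flags=re.IGNORECASE))
    for m in reversed(matches):
        start, end = m.span()
        # Editing at `start` only moves characters after it, and every match
        # still waiting to be processed lies before it, so its span stays valid.
        text = text[:start] + replacement + text[end:]
    return text, len(matches)


result, count = replace_all_reverse("cat, Cat, CAT", "cat", "tiger")
```

Applying the same cached spans front to back would corrupt the second and third replacements, because "tiger" is two characters longer than "cat" and shifts everything after the first edit.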

Scratchpad tools

plan_tasks, mark_task_done, abort_task – persistent per-session memory with self-conditioning detection. Warns the agent after 2 repeated failures, recommends abort after 3.
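
The warn-then-abort thresholds could be tracked along these lines. This is a sketch under assumptions: the FailureTracker class, the message wording, and the choice to count consecutive failures per task (resetting on success) are illustrative, not taken from the AgentDoc source.

```python
from collections import Counter
from typing import Optional

WARN_AFTER = 2   # thresholds taken from the description above
ABORT_AFTER = 3


class FailureTracker:
    """Counts consecutive failures per task (hypothetical helper)."""

    def __init__(self) -> None:
        self._failures: Counter = Counter()

    def record(self, task_id: str, succeeded: bool) -> Optional[str]:
        """Record one attempt and return advice for the agent, if any."""
        if succeeded:
            self._failures.pop(task_id, None)  # success resets the streak
            return None
        self._failures[task_id] += 1
        count = self._failures[task_id]
        if count >= ABORT_AFTER:
            return f"ABORT RECOMMENDED: task {task_id!r} failed {count} times"
        if count >= WARN_AFTER:
            return f"WARNING: task {task_id!r} failed {count} times"
        return None


tracker = FailureTracker()
advice = [tracker.record("rename-heading", succeeded=False) for _ in range(3)]
```

Here advice holds None for the first failure, a warning for the second, and an abort recommendation for the third.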

Document management

create_document, rename_document, set_active_document, navigate_to_page, set_page_layout, trigger_pdf_download, generate_table_of_contents.

Architecture overview

Containerized microservice stack

Docker Compose: Frontend (Nginx + vanilla JS), Backend (FastAPI + embedded FastMCP on port 8091), Agent (Gemini 3 Flash + MCP Client), PostgreSQL, and Redis. Only port 8080 is exposed – all backend services communicate over an internal Docker bridge network, publicly tunneled via Cloudflare.

Key design decisions

Twenty workflow configurations – tool granularity, ReAct FSM, tool bloat

AgentDoc dynamically controls which tools the agent sees and when. Each workflow is a controlled experiment in tool granularity (primitives vs. macros vs. atomic decomposition), ReAct FSM gating (state-machine enforcement vs. instructions-only), and tool bloat (how many tools the model can sustainably reason over). Each workflow is selected via a single string parameter in the chat request. The Agent Scratchpad (plan_tasks, mark_task_done, abort_task) is active in A–K and O; workflows L, M, N, P, Q, R, S, T disable it. Workflow P is the production default for both the text agent and the voice agent.
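
As a rough illustration of per-workflow tool visibility, the sketch below encodes only the scratchpad rule stated above. The function visible_tools and its base_tools parameter are hypothetical; the real per-workflow tool sets are not reproduced here.

```python
import string

WORKFLOWS = list(string.ascii_uppercase[:20])  # "A" through "T"
SCRATCHPAD_TOOLS = ("plan_tasks", "mark_task_done", "abort_task")
SCRATCHPAD_WORKFLOWS = set("ABCDEFGHIJK") | {"O"}  # A-K and O, per the text above


def visible_tools(workflow: str, base_tools: list[str]) -> list[str]:
    """Return the tool names exposed to the agent for one workflow.

    base_tools is a placeholder for the workflow's own tool set,
    which this sketch does not try to model.
    """
    if workflow not in WORKFLOWS:
        raise ValueError(f"unknown workflow: {workflow!r}")
    tools = list(base_tools)
    if workflow in SCRATCHPAD_WORKFLOWS:
        tools.extend(SCRATCHPAD_TOOLS)
    return tools


with_scratchpad = visible_tools("O", ["find", "insert_string"])
without_scratchpad = visible_tools("P", ["find", "insert_string"])
```

Keeping the selection behind a single string makes each benchmark run reproducible: the same request with a different workflow letter changes only the tool surface.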

Gemini Live + MCP voice agent – direct tool calling, no proxy

The Voice Agent runs Gemini 3.1 Flash Live (Preview, native audio) and connects directly to the FastMCP server – the same 35 tools as Workflow P, with the same system prompt. Each tool call returns observing feedback (e.g. "INDEX SHIFT – re-read before next mutation") so the voice model can reason over results without a hop to a separate text agent. This replaces the earlier pure-proxy design (send_to_text_agent via HTTP roundtrip) and eliminates the double-LLM overhead, while still producing the same edits as the text chat. VAD sensitivity is tuned to prevent accidental interruptions during long agent responses.
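
Observing feedback can be pictured as a tool result that describes its own side effects. In this sketch the function name insert_with_feedback and the result fields are assumptions; only the observation text is taken from the example above.

```python
INDEX_SHIFT_NOTE = "INDEX SHIFT – re-read before next mutation"


def insert_with_feedback(text: str, index: int, value: str) -> tuple[str, dict]:
    """Insert value at index and describe the effect in the tool result."""
    new_text = text[:index] + value + text[index:]
    result = {"status": "success", "inserted_at": index, "delta": len(value)}
    if value:
        # Every previously cached index at or after `index` is now stale.
        result["observation"] = INDEX_SHIFT_NOTE
    return new_text, result


new_text, result = insert_with_feedback("Hello world", 5, ",")
```

The model reads the observation in the same turn as the result, so no second agent is needed to notice that earlier indices are out of date.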

Common questions developers ask

Why does my LLM agent hallucinate tool calls?

Three reliable causes in our benchmark: (1) tool bloat – past ~40 visible tools, accuracy degrades sharply on most models even when individual tools are well-named; (2) ambiguous tool names – update_text and edit_text both visible at the same time invite invented signatures; (3) missing post-mutation feedback – the agent assumes its previous read is still valid and fabricates indices into a now-shifted document. Fixes: cap visible tools, use one canonical verb per operation, and either FSM-gate verification or surface index-shift observations in tool returns.

How do I prevent index drift in agent text edits?

Index drift happens when an agent caches the start and end indices from a find call, then performs an insert_string earlier in the document – invalidating every subsequent cached index. Three fixes, in increasing infrastructure cost: (1) instruct the agent to re-find before every mutation (works ~85% of the time); (2) emit observing feedback in tool returns so the agent sees when a shift occurred; (3) FSM-gate write tools so they auto-lock after every mutation, forcing a verification read before the next edit. Workflows S and T in our benchmark only force re-verification on index-shifting operations, leaving format-only edits unblocked – Pareto-optimal in our measurements.
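
The selective gating idea can be sketched as a two-state machine. This is an illustration under assumptions, not the benchmark's FSM: the WriteGate class, the grouping of tools into read and index-shifting sets, and the choice to block all writes while locked are ours.

```python
READ_TOOLS = {"find", "get_document_context"}
INDEX_SHIFTING_TOOLS = {"insert_string", "delete_substring"}  # assumed grouping


class WriteGate:
    """Two states: unlocked (writes allowed) and locked (a read is required)."""

    def __init__(self) -> None:
        self.locked = False

    def allow(self, tool_name: str) -> bool:
        """Return True if the call may run, updating the gate state."""
        if tool_name in READ_TOOLS:
            self.locked = False      # a verification read unlocks writes
            return True
        if self.locked:
            return False             # stale indices: refuse writes until a read
        if tool_name in INDEX_SHIFTING_TOOLS:
            self.locked = True       # lock only after index-shifting mutations
        return True                  # format-only edits never set the lock


gate = WriteGate()
decisions = [gate.allow(name) for name in (
    "insert_string",  # allowed, then locks the gate
    "insert_string",  # refused: no read since the last mutation
    "find",           # allowed, unlocks
    "format_text",    # allowed, does not lock
    "format_text",    # still allowed
)]
```

A stricter variant would set the lock after every write tool; the selective version trades a little safety margin for fewer forced reads on formatting-heavy tasks.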

Is there an open benchmark for agent document editing?

Yes – AgentDoc's harness includes 13 scenarios with deterministic expected-state assertions, runs your agent across the 20 workflow configurations, and reports Levenshtein distance, character-diff, tool-call counts, token usage, and per-step latency. Output is CSV plus per-run JSON traces for qualitative inspection. Useful for evaluating new MCP tool designs, comparing models, or reproducing tool-bloat thresholds on your own stack.
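
For readers unfamiliar with the first metric, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is the textbook dynamic-programming version, shown only to make the metric concrete; it is not the harness's own code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")
```

A distance of 0 between the agent's final document and the expected state means an exact match; larger values indicate how far the edit sequence drifted.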

How does this compare to Cursor or Copilot?

Cursor and Copilot are agent-first code editors. AgentDoc applies the same architecture to prose documents – paginated A4 pages, headers/footers, formal letter formatting, table styling, PDF export. The agent-first patterns transfer (typed tools, FSM gating, observing feedback) but the tool surface differs: instead of open_file + edit_file, the primitives are find + insert_string + format_text + generate_table_of_contents.

Get started

The MCP server is reachable at https://agent-doc-edit.com/mcp/sse via the Nginx reverse proxy. To connect your own agent, point your MCP client at that endpoint.

The benchmark harness, evaluation scripts, and full source for the 20 workflow configurations are documented in the accompanying thesis (linked from /research). Interested in forking, citing, or contributing? Reach out by email.

Further reading