Tool Granularity in LLM Agents: What 20 Workflow Configurations Taught Us

Engineering · 9 min read

The folk wisdom around LLM agent tools tends to converge on one of two slogans: "give the model fewer tools so it doesn't get confused", or "expose every operation as a tool so the model can compose freely". Both are wrong in interesting ways. We built AgentDoc – an MCP-driven docs editor – partly to have a credible testbed for the question, and we ran 20 workflow configurations (A–T) across 13 benchmark scenarios to see what actually moves the needle.

This post is a write-up of what we found. The full thesis lives on the research page; here we focus on the practical takeaways for anyone designing tools for LLM agents – whether for an MCP server, a function-calling API, or a custom orchestrator.

The setup, briefly

Each configuration is a different combination of (a) tool surface – atomic vs. composite vs. mixed; (b) workflow control – pure ReAct prompting vs. State-Constrained ReAct FSM; and (c) verification policy – optional vs. mandatory get_document_context after every mutation. Configurations span from a monolithic zero-shot baseline (one giant edit_document tool, no FSM) all the way to a tightly gated FSM with 35 atomic primitives.

Scoring uses Levenshtein distance on the final document, plus token consumption and explicit hallucination counts (tool calls referencing indices that don't exist).
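For readers who have not implemented it, Levenshtein distance can be computed with a standard two-row dynamic program. The sketch below is a generic textbook version, not AgentDoc's scoring harness; the function name is ours, chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible

    # previous[j] = distance between the processed prefix of `a` and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


expected = "# Usage\nRun it.\n"
produced = "# Usage\nRu#n it.\n"
score = levenshtein(expected, produced)  # one stray character
```

A lower score means the agent's final document is closer to the reference; a score of 0 is an exact match.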

Finding 1: "fewer tools" wins on tokens, loses on accuracy

The monolithic edit_document baseline is by far the cheapest in tokens – the model emits one big call and we're done. But its Levenshtein scores are roughly 3× worse than the atomic-tool configurations on multi-step tasks, because the model has to plan the entire mutation in a single forward pass with no opportunity to verify intermediate state.

On long documents, this regime fails outright: the model emits an edit referencing a heading that no longer exists at the assumed offset (because earlier hypothetical edits "in the same plan" shifted indices), and the result is a corrupted document that is hard to reason about even for the next agent turn.
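A toy reproduction makes the failure concrete. The sketch below is not AgentDoc code: the plan format and both function names are hypothetical, and the shift-tracking version assumes the planned edits do not overlap. Both offsets in the plan are computed against the original document, which is exactly what a single-pass plan does.

```python
doc = "# Intro\nOld text.\n# Usage\nRun it.\n"

# A single-pass plan: every offset refers to the ORIGINAL document.
plan = [
    ("delete", doc.index("Old text.\n"), len("Old text.\n")),
    ("insert", doc.index("# Usage"), "#"),  # meant to turn "# Usage" into "## Usage"
]


def apply_naively(text, plan):
    """Apply each edit at its planned offset, ignoring earlier shifts."""
    for op, offset, arg in plan:
        if op == "delete":
            text = text[:offset] + text[offset + arg:]
        else:
            text = text[:offset] + arg + text[offset:]
    return text


def apply_tracking_shifts(text, plan):
    """Rebase each planned offset by the length changes of earlier edits."""
    shifts = []  # (original offset, delta) for edits already applied
    for op, offset, arg in plan:
        adjusted = offset + sum(delta for pos, delta in shifts if pos <= offset)
        if op == "delete":
            text = text[:adjusted] + text[adjusted + arg:]
            shifts.append((offset, -arg))
        else:
            text = text[:adjusted] + arg + text[adjusted:]
            shifts.append((offset, len(arg)))
    return text


corrupted = apply_naively(doc, plan)
intended = apply_tracking_shifts(doc, plan)
```

In the naive run the "#" lands inside "Run it." because the heading moved ten characters to the left after the deletion. The model has no way to notice this inside a single call.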

Finding 2: "every operation as a tool" wins on accuracy, loses on tokens – and on cliffs

Pure atomic tools (separate find, delete_substring, insert_string, etc.) push verification into the model's reasoning loop. With instruction-tuned frontier models the accuracy is high – but token counts balloon, and we observe a sharp cliff around 25-tool surfaces: above that, the model starts mis-selecting between near-synonymous tools (insert_paragraph vs. insert_text vs. append_text), which is its own failure mode that doesn't appear in the 8-tool configurations.

Tool count interacts with naming entropy. Twenty tools whose names cluster semantically (insert_*, append_*, prepend_*) are harder for the model than thirty tools that span obviously distinct verbs.
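One crude way to catch name clustering at design time is to flag tool names that share a token. This is our own lint heuristic for illustration, not the measure used in the study, and it only sees lexical overlap: it will not notice that insert and append are near-synonyms unless the names also share a word.

```python
from itertools import combinations


def confusable_pairs(tool_names):
    """Return (name_a, name_b, shared_tokens) for every pair of tool
    names that share at least one underscore-separated token."""
    pairs = []
    for a, b in combinations(sorted(tool_names), 2):
        shared = set(a.split("_")) & set(b.split("_"))
        if shared:
            pairs.append((a, b, sorted(shared)))
    return pairs


tools = ["insert_paragraph", "insert_text", "append_text", "find", "delete_substring"]
flagged = confusable_pairs(tools)
```

Each flagged pair is a candidate for renaming, merging, or a sharper description in the tool schema.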

Finding 3: a small set of well-named macros is the sweet spot

The configurations that ranked highest on the joint accuracy/token score share a structure: atomic primitives for the long tail, plus three to five well-named macros for the obvious common paths (replace_all, format_all_matches, convert_paragraph_to_heading). The macros remove the most common multi-step sequences – cutting 4 round-trips down to 1 – without forcing the model to compose them from scratch.

The trick is not the number of macros; it's that each macro must correspond to a natural-language request that users actually make. Macros invented for code elegance ("composeFormatBatch") get ignored by the model in favor of the atomic primitives. Macros named after common requests ("replace all X with Y") get used immediately and correctly.
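To show what "collapsing round-trips" means, here is a minimal sketch of a replace_all macro built on the atomic primitives named earlier. The tool names come from this post, but the Document class, the method signatures, and the return value are assumptions made for illustration, not AgentDoc's actual API.

```python
class Document:
    """Tiny in-memory stand-in for the document behind the tool surface."""

    def __init__(self, text: str) -> None:
        self.text = text

    # Atomic primitives: each would be one model round-trip if exposed alone.
    def find(self, needle: str, start: int = 0) -> int:
        return self.text.find(needle, start)

    def delete_substring(self, index: int, length: int) -> None:
        self.text = self.text[:index] + self.text[index + length:]

    def insert_string(self, index: int, value: str) -> None:
        self.text = self.text[:index] + value + self.text[index:]


def replace_all(doc: Document, old: str, new: str) -> int:
    """Macro: run the find/delete/insert loop server-side in one tool call.
    Returns the number of replacements made."""
    if not old:
        raise ValueError("old must be a non-empty string")
    count = 0
    pos = doc.find(old)
    while pos != -1:
        doc.delete_substring(pos, len(old))
        doc.insert_string(pos, new)
        count += 1
        # Resume after the inserted text so a replacement that contains
        # `old` is never matched again.
        pos = doc.find(old, pos + len(new))
    return count


doc = Document("foo bar foo")
replaced = replace_all(doc, "foo", "food")
```

The model issues one call that maps directly onto "replace all X with Y", and the index bookkeeping never leaves the server.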

Finding 4: structural verification beats prompted verification

Telling the model "always re-read the document before mutating" works some of the time. A State-Constrained ReAct FSM that locks the write tools until a get_document_context call completes works all the time. The FSM configurations cut hallucinated-index errors from ~14% in the prompt-only condition to under 2% – and the cost is a single extra round-trip every few turns, which is cheap relative to a corrupted edit.

This is the strongest practical takeaway in the whole study: if your agent runs against index-sensitive state (a document, a buffer, a syntax tree), do not rely on prompts alone. Make verification structurally inevitable in the tool surface itself.
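A minimal sketch of such a gate, assuming a two-state machine that re-locks the write tools after every mutation. The class, state names, and argument names are hypothetical; only the tool names are taken from this post.

```python
from enum import Enum, auto


class State(Enum):
    NEEDS_CONTEXT = auto()   # write tools are locked
    READY_TO_WRITE = auto()  # a fresh read has completed


class GatedDocument:
    READ_TOOLS = {"get_document_context"}
    WRITE_TOOLS = {"insert_string", "delete_substring"}

    def __init__(self, text: str) -> None:
        self.text = text
        self.state = State.NEEDS_CONTEXT

    def available_tools(self) -> set:
        """The tool list the orchestrator would advertise on this turn."""
        if self.state is State.READY_TO_WRITE:
            return self.READ_TOOLS | self.WRITE_TOOLS
        return set(self.READ_TOOLS)

    def call(self, tool: str, **args):
        if tool not in self.available_tools():
            raise PermissionError(f"{tool} is locked in state {self.state.name}")
        if tool == "get_document_context":
            self.state = State.READY_TO_WRITE
            return {"text": self.text, "length": len(self.text)}
        if tool == "insert_string":
            i = args["index"]
            self.text = self.text[:i] + args["text"] + self.text[i:]
        elif tool == "delete_substring":
            i, n = args["index"], args["length"]
            self.text = self.text[:i] + self.text[i + n:]
        # Every mutation invalidates the model's view, so lock writes again.
        self.state = State.NEEDS_CONTEXT
        return {"ok": True}
```

The important property is that the write tools are simply absent until the read completes, so there is no instruction for the model to forget.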

Finding 5: index drift is a tool-design bug, not a model bug

We saw repeatedly that the worst-performing configurations were not "weak" by any capability metric – they were just exposing tools whose contracts allowed the model to succeed locally and fail globally. A tool that takes a character index and returns success without telling the model that everything after that index has shifted is the bug. Frontier models will eventually learn to compensate. Your tool surface should not require them to.

Concretely: every mutation tool in AgentDoc returns a structured payload that includes a dirty-range descriptor ({shifted_after: 412, delta: -23}). The next tool call sees that descriptor, and the FSM gates a re-read. Hallucinated-index errors dropped close to the floor as a result.
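As a sketch of how a caller might consume that descriptor, the code below has a mutation return one and uses it to rebase a previously held index. The field names come from the payload above; their exact semantics are our assumption (shifted_after is the offset where the edit happened, delta is the change in length), and both function names are hypothetical.

```python
def delete_substring(text: str, start: int, length: int):
    """Delete text[start:start + length] and report what moved."""
    new_text = text[:start] + text[start + length:]
    payload = {"ok": True, "shifted_after": start, "delta": -length}
    return new_text, payload


def rebase_index(index: int, dirty: dict):
    """Map an index into the pre-edit text onto the post-edit text.
    Returns None if the index pointed into text that was deleted."""
    start, delta = dirty["shifted_after"], dirty["delta"]
    if index < start:
        return index  # before the edit: unchanged
    if delta < 0 and index < start - delta:
        return None   # inside the deleted range: no longer exists
    return index + delta


text = "hello cruel world"
world_at = text.index("world")  # 12 in the original text
text, dirty = delete_substring(text, 6, len("cruel "))
new_world_at = rebase_index(world_at, dirty)
```

Returning None for an index inside the deleted range turns a silent corruption into an explicit signal that the caller must re-read before continuing.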

What this means if you're building an agent

Three rules of thumb worth more than any specific number above:

The next post in this series is on the voice-first architecture – including why we chose to wire Gemini Live directly to the MCP surface rather than through a transcription proxy. The decision is downstream of exactly the findings above.
