Tool Granularity in LLM Agents: What 20 Workflow Configurations Taught Us
The folk wisdom around LLM agent tools tends to converge on one of two slogans: "give the model fewer tools so it doesn't get confused", or "expose every operation as a tool so the model can compose freely". Both are wrong in interesting ways. We built AgentDoc, an MCP-driven docs editor, partly to have a credible testbed for the question, and we ran 20 workflow configurations (A through T) across 13 benchmark scenarios to see what actually moves the needle.
This post is a write-up of what we found. The full thesis lives on the research page; here we focus on the practical takeaways for anyone designing tools for LLM agents, whether for an MCP server, a function-calling API, or a custom orchestrator.
The setup, briefly
Each configuration is a different combination of (a) tool surface: atomic vs. composite vs. mixed; (b) workflow control: pure ReAct prompting vs. State-Constrained ReAct FSM; and (c) verification policy: optional vs. mandatory get_document_context after every mutation. Configurations span from a monolithic zero-shot baseline (one giant edit_document tool, no FSM) all the way to a tightly gated FSM with 35 atomic primitives.
Scoring uses Levenshtein distance on the final document, plus token consumption and explicit hallucination counts (tool calls referencing indices that don't exist).
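To make the two document-level metrics concrete, here is a minimal sketch of how they could be computed. It is not the AgentDoc scoring harness: the ToolCall record and the count_hallucinated_indices helper are hypothetical names introduced for illustration, and the edit distance is the textbook dynamic-programming version.

```python
from dataclasses import dataclass


def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance: insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or free match
            ))
        previous = current
    return previous[-1]


@dataclass
class ToolCall:
    # Hypothetical record: the index a call referenced, and the length of
    # the document at the moment the call was issued.
    index: int
    doc_length: int


def count_hallucinated_indices(calls: list[ToolCall]) -> int:
    """Count calls whose index lies outside the document they targeted."""
    return sum(1 for call in calls if not 0 <= call.index <= call.doc_length)
```

A lower distance between the agent's final document and the reference document means a more accurate edit; the hallucination count is tracked separately because a run can end close to the reference while still having issued invalid calls along the way.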
Finding 1: "fewer tools" wins on tokens, loses on accuracy
The monolithic edit_document baseline is by far the cheapest in tokens: the model emits one big call and we're done. But its Levenshtein scores are roughly 3× worse than the atomic-tool configurations on multi-step tasks, because the model has to plan the entire mutation in a single forward pass with no opportunity to verify intermediate state.
On long documents, this regime fails outright: the model emits an edit referencing a heading that no longer exists at the assumed offset (because earlier hypothetical edits "in the same plan" shifted indices), and the result is a corrupted document that is hard to reason about even for the next agent turn.
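The failure is easy to reproduce outside any model. The sketch below uses a toy document and two hypothetical helpers (delete_range, insert_at) to show a two-step plan whose second offset was computed against the original text.

```python
def delete_range(doc: str, start: int, end: int) -> str:
    """Remove doc[start:end]."""
    return doc[:start] + doc[end:]


def insert_at(doc: str, index: int, text: str) -> str:
    """Insert text so that it begins at position index."""
    return doc[:index] + text + doc[index:]


doc = "# Intro\nOld text.\n# Usage\nRun it.\n"

# The whole plan is made up front, against the original document.
planned_usage_offset = doc.index("# Usage")

# Step 1: delete the stale paragraph. Everything after it shifts left.
start = doc.index("Old text.\n")
after_step1 = delete_range(doc, start, start + len("Old text.\n"))

# Step 2 with the planned (now stale) offset lands inside "Run it."
stale = insert_at(after_step1, planned_usage_offset, "NOTE\n")

# Step 2 with an offset re-read from the current document lands correctly.
fresh = insert_at(after_step1, after_step1.index("# Usage"), "NOTE\n")
```

Here the stale insert splits the word "Run" in two, while the re-read insert lands just before the heading. A single-call plan has no point at which the second offset could have been refreshed.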
Finding 2: "every operation as a tool" wins on accuracy, loses on tokens (and on cliffs)
Pure atomic tools (separate find, delete_substring, insert_string, etc.) push verification into the model's reasoning loop. With instruction-tuned frontier models the accuracy is high, but token counts balloon, and we observe a sharp cliff around 25-tool surfaces: above that, the model starts mis-selecting between near-synonymous tools (insert_paragraph vs. insert_text vs. append_text), a failure mode that doesn't appear in the 8-tool configurations.
Tool count interacts with naming entropy. Twenty tools whose names cluster semantically (insert_*, append_*, prepend_*) are harder for the model than thirty tools that span obviously distinct verbs.
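One rough way to spot such clusters before shipping a tool surface is to group names by their leading verb. This is a lexical heuristic of our own, not a metric from the study: the VERB_SYNONYMS map and the max_per_family threshold are assumptions you would tune by hand.

```python
from collections import defaultdict

# Hypothetical synonym map: verbs we assume a model may confuse with each
# other. Maintained by hand; nothing here is measured.
VERB_SYNONYMS = {"append": "insert", "prepend": "insert"}


def crowded_families(tool_names, max_per_family=2):
    """Return verb families that contain more than max_per_family tools."""
    families = defaultdict(list)
    for name in tool_names:
        verb = name.split("_", 1)[0]  # leading token before the first "_"
        families[VERB_SYNONYMS.get(verb, verb)].append(name)
    return {
        verb: names
        for verb, names in families.items()
        if len(names) > max_per_family
    }
```

Called on insert_paragraph, insert_text, append_text, find and delete_substring, this flags a single crowded family under "insert". It only catches confusions you already wrote into the synonym map, so treat it as a lint rather than a measurement.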
Finding 3: a small set of well-named macros is the sweet spot
The configurations that ranked highest on the joint accuracy/token score share a structure: atomic primitives for the long tail, plus three to five well-named macros for the obvious common paths (replace_all, format_all_matches, convert_paragraph_to_heading). The macros remove the most common multi-step sequences, cutting 4 round-trips down to 1, without forcing the model to compose them from scratch.
The trick is not the number of macros; it's that each macro must correspond to a natural-language request that users actually make. Macros invented for code elegance ("composeFormatBatch") get ignored by the model in favor of the atomic primitives. Macros named after common requests ("replace all X with Y") get used immediately and correctly.
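As an illustration of what a macro collapses, here is a sketch of replace_all written on top of three atomic primitives. The tool names come from the configurations above, but the signatures and return values are assumptions made for this example, not AgentDoc's actual contracts.

```python
def find(doc: str, needle: str, start: int = 0) -> int:
    """Atomic: index of the next occurrence at or after start, or -1."""
    return doc.find(needle, start)


def delete_substring(doc: str, start: int, length: int) -> str:
    """Atomic: remove length characters beginning at start."""
    return doc[:start] + doc[start + length:]


def insert_string(doc: str, index: int, text: str) -> str:
    """Atomic: insert text so that it begins at index."""
    return doc[:index] + text + doc[index:]


def replace_all(doc: str, old: str, new: str) -> tuple[str, int]:
    """Macro: one call standing in for a repeated find/delete/insert loop.

    Returns the new document and the number of replacements made.
    """
    if not old:
        raise ValueError("old must be non-empty")
    count = 0
    position = find(doc, old)
    while position != -1:
        doc = delete_substring(doc, position, len(old))
        doc = insert_string(doc, position, new)
        count += 1
        # Resume after the inserted text, so a replacement that itself
        # contains `old` cannot be matched again.
        position = find(doc, old, position + len(new))
    return doc, count
```

With atomic tools only, each pass through that loop would be several model round-trips, and every one of them is a chance to reuse a stale index. The macro keeps the loop on the server side, where the indices are always current.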
Finding 4: structural verification beats prompted verification
Telling the model "always re-read the document before mutating" works some of the time. A State-Constrained ReAct FSM that locks the write tools until a get_document_context call completes enforces the re-read every time. The FSM configurations cut hallucinated-index errors from ~14% in the prompt-only condition to under 2%, and the cost is a single extra round-trip every few turns, which is cheap relative to a corrupted edit.
This is the strongest practical takeaway in the whole study: if your agent runs against index-sensitive state (a document, a buffer, a syntax tree), do not rely on prompts alone. Make verification structurally inevitable in the tool surface itself.
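A gate of this kind can be very small. The sketch below assumes the strictest policy (every mutation re-locks the write tools); the GatedDocument class, the State enum and WriteLockedError are hypothetical names, and a real MCP server would return a tool error to the model rather than raise a Python exception.

```python
from enum import Enum, auto


class State(Enum):
    NEEDS_CONTEXT = auto()   # write tools are locked
    READY_TO_WRITE = auto()  # a fresh read has happened


class WriteLockedError(RuntimeError):
    """Raised when a write tool is called without a fresh re-read."""


class GatedDocument:
    def __init__(self, text: str) -> None:
        self._text = text
        self._state = State.NEEDS_CONTEXT

    def get_document_context(self) -> str:
        """Read tool: always allowed, and unlocks the write tools."""
        self._state = State.READY_TO_WRITE
        return self._text

    def insert_string(self, index: int, text: str) -> None:
        """Write tool: refused unless the last call was a re-read."""
        if self._state is not State.READY_TO_WRITE:
            raise WriteLockedError("call get_document_context before writing")
        if not 0 <= index <= len(self._text):
            raise IndexError(f"index {index} is outside the document")
        self._text = self._text[:index] + text + self._text[index:]
        self._state = State.NEEDS_CONTEXT  # assumption: every write re-locks
```

The model cannot skip the check, because the transition lives in the tool surface rather than in the prompt. A looser policy, such as re-locking only after mutations that shift indices, would change the last line and nothing else.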
Finding 5: index drift is a tool-design bug, not a model bug
We saw repeatedly that the worst-performing configurations were not "weak" by any capability metric; they were just exposing tools whose contracts allowed the model to succeed locally and fail globally. A tool that takes a character index and returns success without telling the model that everything after that index has shifted is the bug. Frontier models will eventually learn to compensate. Your tool surface should not require them to.
Concretely: every mutation tool in AgentDoc returns a structured payload that includes a
dirty-range descriptor ({shifted_after: 412, delta: -23}). The next tool call
sees that, and the FSM gates a re-read. Hallucinations dropped close to floor as a result.
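A sketch of what such a payload might look like for a deletion follows. The shifted_after and delta field names come from the descriptor above; the function signature and the rebase_index helper are assumptions added to show how a consumer could interpret the payload. AgentDoc itself gates a re-read rather than asking the model to rebase.

```python
from typing import Optional


def delete_substring(doc: str, start: int, length: int) -> tuple[str, dict]:
    """Delete length characters at start and report what moved."""
    if length < 0 or not 0 <= start <= len(doc) - length:
        raise IndexError("deletion range is outside the document")
    new_doc = doc[:start] + doc[start + length:]
    # Surviving text at or after the deleted range has moved by `delta`.
    return new_doc, {"shifted_after": start, "delta": -length}


def rebase_index(index: int, dirty: dict) -> Optional[int]:
    """Translate a pre-mutation index into the post-mutation document.

    Returns None when the index pointed into text that no longer exists.
    """
    if index < dirty["shifted_after"]:
        return index  # before the edit: unaffected
    removed = -dirty["delta"] if dirty["delta"] < 0 else 0
    if index < dirty["shifted_after"] + removed:
        return None  # inside the deleted range
    return index + dirty["delta"]
```

Deleting 23 characters at offset 412 yields exactly the descriptor quoted above, and an index of 450 computed before the deletion maps to 427 afterwards. The point of the payload is that a bare "success" carries no such information.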
What this means if you're building an agent
Three rules of thumb worth more than any specific number above:
- Default to atomic primitives, add macros for natural-language paths. Don't pre-optimise for token cost; the extra tokens are cheaper than a corrupted edit.
- Keep the tool surface under 25 tools, or split into name-distinct families. Cluster ambiguity is a real failure mode.
- Make verification structural. An FSM that gates write-tools behind a recent re-read costs almost nothing and all but eliminates the entire class of index-drift bugs.
The next post in this series is on the voice-first architecture, including why we chose to wire Gemini Live directly to the MCP surface rather than through a transcription proxy. The decision is downstream of exactly the findings above.