How tool design changes LLM agent reliability — 20 workflows benchmarked
An open empirical study on autonomous document editing with Google Gemini 3 Flash. The Pareto-optimal tool design matches a 24-tool baseline at 100% pass rate while cutting tokens by ~54% and tool calls by ~60%. An 83-tool “bloated” workflow collapses to 0% on the hardest scenarios — not from Lost-in-the-Middle, but from parameter confusion across inflated literal enums.
Below: research questions, the 20 workflow configurations, per-step metrics across 13 benchmark scenarios, and the design rationale behind the production default (P / S).
Context
This benchmark is the testbed for a master’s thesis at HFT Stuttgart: Agent-Driven Voice-Only Interfaces — Evaluation of Tool-Granularity, Tool-Bloat and Workflow-Constraints on an Agent-First Document Editor. Data is being re-collected and published openly as the workflow set evolves.
Research questions
- FF1 – Agent-First Architecture: What architectural requirements must a document editor fulfill to support reliable autonomous agent operation?
- FF2 – Tool Bloat: How does the number of simultaneously available tools in the context window affect execution accuracy, and can dynamic tool restriction (Active Discovery) reduce the error rate?
- FF3 – Workflow Restrictions: How effectively do architecturally enforced verification loops (ReAct, FSM) reduce error rates compared to unregulated environments?
- FF4 – Tool Granularity: How does the choice between atomic primitives, composite macros, and bulk operations affect execution accuracy and latency?
Evaluation framework
The benchmarking engine (benchmark.py) autonomously runs tasks across all workflow configurations using Google Gemini 3 Flash as the LLM backend. For each step, it provisions a fresh document, sends a prompt to the agent, and compares the resulting HTML against a reference using Levenshtein distance after DOM normalization (BeautifulSoup4). Full tool-call traces are exported as JSON for qualitative analysis.
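The comparison step described above can be sketched as follows. This is a minimal illustration, not the code in benchmark.py: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup4 so that it runs without dependencies, and the function names (`normalize_html`, `levenshtein`, `step_distance`) and the normalization rules (sorted attributes, collapsed whitespace) are assumptions.

```python
from html.parser import HTMLParser


class _Normalizer(HTMLParser):
    """Re-serialize HTML with sorted attributes and collapsed whitespace."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Sort attributes by name so attribute order does not affect the distance.
        rendered = []
        for name, value in sorted(attrs, key=lambda pair: pair[0]):
            rendered.append(' {}="{}"'.format(name, value or ""))
        self.parts.append("<{}{}>".format(tag, "".join(rendered)))

    def handle_endtag(self, tag):
        self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # Collapse whitespace runs and drop whitespace-only text nodes.
        text = " ".join(data.split())
        if text:
            self.parts.append(text)


def normalize_html(html: str) -> str:
    parser = _Normalizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance using two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def step_distance(agent_html: str, reference_html: str) -> int:
    """Distance between normalized agent output and normalized reference."""
    return levenshtein(normalize_html(agent_html), normalize_html(reference_html))
```

Under these assumptions a step passes when `step_distance(...)` returns 0, so differences in attribute order or whitespace alone do not count as failures.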
Error classification (Lu et al., 2025)
Every error is categorized into one of three types:
- Planning errors: Agent selected a locked tool or aborted – detected via SYSTEM INTERCEPT responses and abort_task calls
- Execution errors: Agent selected the correct tool but passed invalid parameters – detected via backend validation errors
- Response generation errors: All tool calls succeeded, but the final document deviates from the reference – the agent misunderstood the task
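The three detection rules above could be applied to an exported tool-call trace roughly like this. The `ToolCall` record and its fields are hypothetical (the actual JSON trace schema is not shown here), and counting errors per call with at most one response generation error per step is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    # Hypothetical trace record; the real trace format may differ.
    name: str
    response: str = ""
    validation_error: bool = False


def classify_step(calls, distance):
    """Count errors per category for one benchmark step."""
    # Planning: the agent hit a locked tool or gave up.
    planning = sum(
        1 for c in calls
        if c.name == "abort_task" or "SYSTEM INTERCEPT" in c.response
    )
    # Execution: right tool, invalid parameters (backend rejected the call).
    execution = sum(1 for c in calls if c.validation_error)
    # Response generation: every call succeeded, yet the document deviates.
    response = int(planning == 0 and execution == 0 and distance > 0)
    return {
        "planning": planning,
        "execution": execution,
        "response_generation": response,
    }
```

A step with a Levenshtein distance of zero and no flagged calls yields all-zero counts, which corresponds to a pass.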
Workflow configurations
The Agent Scratchpad (persistent memory with self-conditioning detection, three tools plan_tasks / mark_task_done / abort_task) is active in workflows A–K and O. Workflows L, M, N, P, Q, R, S, and T disable it entirely — promoting the scratchpad from controlled constant to experimental variable along the 2×2 isolation matrix (verification × granularity).
| Workflow | Tools (exposed / initial) | Mechanism | Tests |
|---|---|---|---|
| A Baseline | 24 / 24 | All primitive tools, no restrictions, neutral scratchpad | Unregulated baseline |
| B Bloat | 83 / 83 | format_text replaced with 46 atomic endpoints + 12 macros, expanded literal enums | Schema bloat, literal-enum confusion |
| C Discovery | 27 / 10 | On-demand discover_tools unlocks mutations | Active tool acquisition (MCP-Zero) |
| D Enforced Verification | 26 / 6 | Write tools auto-lock after every mutation | Infrastructure-enforced verification loop |
| E Macros | 38 / 38 | Primitives + 12 composite macros incl. bulk ops | Tool granularity tradeoff |
| F FSM | 27 / 7 | 8-state finite-state machine (SEARCH → MUTATE → VERIFY) | State-constrained ReAct |
| G FSM + Macros | 39 / 7 | FSM + macros (atomic calls skip MUTATE) | Granularity × restriction interaction |
| H Instructed ReAct | 25 / 25 | Prompt-based verification instruction, no enforcement | Prompt discipline vs. infrastructure |
| I Instructed ReAct + Macros | 38 / 38 | Instructed ReAct + 12 composite macros | Prompt discipline with macro shortcuts |
| J Observing FSM | 25 / 25 | State feedback after every action, no blocking | Feedback alone vs. enforcement |
| K Observing FSM + Macros | 38 / 38 | J's feedback + E's macros | Feedback × macro interaction |
| L Bare Primitives | 21 / 21 | Primitives only, no scratchpad, no feedback, no verification | Minimal baseline (scratchpad isolation) |
| M Observing FSM + Macros, no Scratchpad | 35 / 35 | K without scratchpad | Pareto-optimal 100% workflow |
| N Macros, no Scratchpad | 35 / 35 | Macros without scratchpad and without feedback | Isolates the feedback effect within the L-to-M gap (N vs. M) |
| O Few-Shot ReAct | 25 / 25 | H's tools, but verification taught via few-shot example | Closest to original Yao et al. 2023 ReAct |
| P Instructed ReAct + Macros, no Scratchpad (Default) | 35 / 35 | I without scratchpad | Production default for text + voice agent |
| Q Instructed ReAct + Primitives, no Scratchpad | 22 / 22 | H without scratchpad | Completes 2×2 scratchpad-isolation matrix |
| R Optimized FSM, no Scratchpad | ~35 / 7 | FSM + macros + batch mutations + intent inference, no scratchpad | FSM roundtrip optimization |
| S Smart Verification FSM, no Scratchpad | ~35 / 7 | Selective enforcement: index-shifting ops force verify, formatting does not | Differentiated verification policy |
| T Smart Observing FSM, no Scratchpad | ~35 / 7 | Like S, but feedback only ("INDEX SHIFT — cached indices stale"), no blocking | Feedback vs. enforcement under index drift |
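To make the difference between enforced verification (D) and selective enforcement (S) concrete, here is a minimal sketch of a write-tool gate. It is not the editor's implementation. Except for format_text, the tool names are hypothetical, and the rule that a pending verification blocks all write tools until the next read is an assumption.

```python
# Hypothetical tool groupings; only format_text is named on this page.
INDEX_SHIFTING = {"insert_paragraph", "delete_paragraph"}
FORMATTING = {"format_text"}
VERIFY = {"read_document"}


class VerificationGate:
    """Locks write tools after a mutation until a read tool is called."""

    def __init__(self, enforce_all: bool):
        # enforce_all=True: every write locks (in the spirit of workflow D).
        # enforce_all=False: only index-shifting writes lock (in the spirit of S).
        self.enforce_all = enforce_all
        self.locked = False

    def call(self, tool: str) -> str:
        if tool in VERIFY:
            # Re-reading the document clears the pending verification.
            self.locked = False
            return "OK"
        is_write = tool in INDEX_SHIFTING or tool in FORMATTING
        if is_write and self.locked:
            return "SYSTEM INTERCEPT: write tools are locked until the document is re-read"
        if tool in INDEX_SHIFTING or (self.enforce_all and tool in FORMATTING):
            # This mutation requires a verification before the next write.
            self.locked = True
        return "OK"
```

With `enforce_all=False`, consecutive format_text calls go through without a verification roundtrip, while a write after an insert is intercepted until the agent reads the document again. That saved roundtrip on formatting-only steps is the kind of cost the selective policy is meant to avoid.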
Key metrics
- Pass rate – percentage of benchmark steps with Levenshtein distance zero (exact HTML match)
- Levenshtein distance – character-level edit distance between agent output and reference HTML (after DOM normalization)
- Relative success – normalized against BASELINE control group (0% = no change, 100% = perfect match)
- Planning / Execution / Response error counts – per Lu et al.'s taxonomy, including Context and Selection Hallucination sub-types
- Token consumption – total prompt + candidate tokens per step, with peak and average context window sizes
- Latency decomposition – total time split into Gemini inference (99.5%), MCP tool execution (0.4%), and orchestration overhead (0.1%)
- Tool calls per step – number of LLM roundtrips, correlated with workflow restrictions
- Context Hallucination rate – how often the agent attempted locked tools in dynamically restricted workflows
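As a rough illustration of how the per-step metrics above aggregate into one workflow row, consider the sketch below. The `StepResult` fields are hypothetical names, and the values in the usage note are illustrative, not benchmark data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StepResult:
    # Hypothetical per-step record; field names are assumptions.
    distance: int          # Levenshtein distance after DOM normalization
    prompt_tokens: int
    candidate_tokens: int
    tool_calls: int


def summarize(steps):
    """Aggregate per-step records into pass rate and per-step averages."""
    if not steps:
        raise ValueError("at least one step is required")
    passed = sum(1 for s in steps if s.distance == 0)
    return {
        # A step passes only on an exact match (distance zero).
        "pass_rate": 100.0 * passed / len(steps),
        "avg_tokens": mean(s.prompt_tokens + s.candidate_tokens for s in steps),
        "avg_tool_calls": mean(s.tool_calls for s in steps),
    }
```

For example, `summarize([StepResult(0, 1000, 200, 4), StepResult(3, 2000, 400, 6)])` gives a pass rate of 50.0, an average of 1800 tokens, and 5 tool calls per step.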
Theoretical foundations
The research builds on:
- ReAct (Yao et al., 2023) – Thought-Action-Observation loops as baseline reasoning
- StateAct (Rozanov & Rei, 2025) – Self-prompting and chain-of-states tracking
- Lost in the Middle (Liu et al., 2024) – U-shaped attention distribution in long contexts
- Self-Conditioning Effect (Sinha et al., 2026) – Error cascades from in-context conditioning
- MCP-Zero (Fei et al., 2025) – Active tool discovery reducing token overhead by 98%
- Tool hallucination taxonomy (Xu et al., 2025) – Selection vs. usage hallucinations, indecisive action space
- Lu et al. (2025) – Three-category agent error taxonomy: planning, execution, response generation
Benchmark results summary
Per-step averages across 13 benchmark scenarios. The workflow-restrictions benchmark (Bench 12) is the primary cross-workflow stress test; the tool-bloat workflows B and C are evaluated on the dedicated bloat benchmarks (Bench 10/11); the endurance benchmark (Bench 13, a 10-step risk report) is used to differentiate the high-performing workflows. Model: Google Gemini 3 Flash (Preview).
| Workflow | Pass Rate | Avg Tokens / step | Avg Tool Calls / step | Bench |
|---|---|---|---|---|
| A Static Baseline | 100% | 142,657 | 14.0 | 12 |
| B Tool Bloat | 45% | 360,372 | 17.8 | 10/11 |
| C Active Discovery | 80% | 333,368 | 20.5 | 10/11 |
| D Enforced Verification | 100% | 460,424 | 21.9 | 12 |
| E Macros | 94% | 98,238 | 9.4 | 12 |
| F FSM | 87% | 583,786 | 24.4 | 12 |
| G FSM + Macros | 90% | 316,846 | 16.6 | 12 |
| H Instructed ReAct | 100% | 272,537 | 18.9 | 12 |
| I Instructed ReAct + Macros | 100% | 182,967 | 13.1 | 12 |
| J Observing FSM | 100% | 174,106 | 15.2 | 12 |
| K Observing FSM + Macros | 100% | 156,221 | 13.3 | 12 |
| L Bare Primitives | 60% | 78,076 | 9.5 | 12 |
| M Observing FSM + Macros, no SP (Pareto) | 100% | 65,907 | 5.6 | 12 |
| N Macros, no SP / no feedback | 80% | 102,367 | 9.6 | 12 |
| O Few-Shot ReAct | 100% | 153,225 | 14.6 | 12 |
| P Instructed ReAct + Macros, no SP (Default) | 100% | 71,987 | 5.3 | 12 |
| Q Instructed ReAct + Primitives, no SP | 92% | 146,440 | 12.9 | 12 |
| R Optimized FSM, no SP | 80% | 93,866 | 7.5 | 12 |
| S Smart Verification FSM, no SP | 100% | 64,561 | 5.6 | 12 |
| T Smart Observing FSM, no SP | 92% | 63,304 | 5.6 | 12 |
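The headline reductions (~54% fewer tokens, ~60% fewer tool calls) can be re-derived from the Bench 12 rows above by comparing workflow M against baseline A. The snippet below is only a worked calculation over the table values; the helper name `reduction_pct` is an assumption.

```python
# (avg tokens per step, avg tool calls per step) copied from the Bench 12 rows.
BENCH_12 = {
    "A": (142_657, 14.0),
    "M": (65_907, 5.6),
    "P": (71_987, 5.3),
    "S": (64_561, 5.6),
}


def reduction_pct(baseline: float, value: float) -> float:
    """Percentage reduction of value relative to baseline."""
    return 100.0 * (1.0 - value / baseline)


base_tokens, base_calls = BENCH_12["A"]
for name in ("M", "P", "S"):
    tokens, calls = BENCH_12[name]
    print(
        f"{name}: tokens -{reduction_pct(base_tokens, tokens):.1f}%, "
        f"tool calls -{reduction_pct(base_calls, calls):.1f}%"
    )
```

For M this gives roughly 53.8% fewer tokens and 60% fewer tool calls than A, which matches the rounded figures quoted in the introduction.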
Key findings. Tool Bloat (B, 83 tools) collapses to 0% on Bench 11. The failure mode is not Lost-in-the-Middle on the tool list but parameter confusion across inflated literal enums (e.g., gold vs. yellow). Three verification mechanisms reach 100% on Bench 12: enforced gating (D), prompt instruction (H), and observing feedback (J), ranked from most to least token-expensive. Removing the Agent Scratchpad on the macro workflows (M, P, S) is essentially free in accuracy and cuts per-step cost to roughly 65k–72k tokens and 5–6 tool calls. P (Instructed ReAct + Macros, no scratchpad) is deployed as the production default for both the text and voice agent: it Pareto-dominates I across Bench 12 and Bench 13 (Endurance) at 100% pass rate / ~73k tokens / 4.9 calls per step. Smart-Verification FSM (S) replicates this Pareto point with infrastructural index-drift guarantees, while its observing-only sibling (T) drops to 92%: selective enforcement matters where the model mis-computes its own insert lengths.
Contact & source code
If you're working on LLM agent evaluation, tool use benchmarking, or MCP tooling, I'd love to connect.
Contact via Email
Companion blog posts
- Tool Granularity in LLM Agents — an accessible write-up of what 20 workflow configurations × 13 scenarios revealed about atomic vs. composite MCP tool design.
- Voice-First Document Editing with Gemini Live — native-audio architecture for the voice agent benchmarked alongside the text agent.
- Rebuilding PDF + DOCX Export — the export pipeline used to materialise benchmark documents.