How tool design changes LLM agent reliability — 20 workflows benchmarked

An open empirical study of autonomous document editing with Google Gemini 3 Flash. The Pareto-optimal tool design (workflow M) matches the 24-tool baseline (workflow A) at a 100% pass rate while cutting tokens by ~54% and tool calls by ~60%. An 83-tool “bloated” workflow (B) collapses to 0% on the hardest scenarios. The cause is not Lost-in-the-Middle but parameter confusion across inflated literal enums.

Below: research questions, the 20 workflow configurations, per-step metrics across 13 benchmark scenarios, and the design rationale behind the production default (P / S).


Context

This benchmark is the testbed for a master’s thesis at HFT Stuttgart: Agent-Driven Voice-Only Interfaces — Evaluation of Tool-Granularity, Tool-Bloat and Workflow-Constraints on an Agent-First Document Editor. Data is being re-collected and published openly as the workflow set evolves.

Research questions

  1. FF1 – Agent-First Architecture: What architectural requirements must a document editor fulfill to support reliable autonomous agent operation?
  2. FF2 – Tool Bloat: How does the number of simultaneously available tools in the context window affect execution accuracy, and can dynamic tool restriction (Active Discovery) reduce the error rate?
  3. FF3 – Workflow Restrictions: How effectively do architecturally enforced verification loops (ReAct, FSM) reduce error rates compared to unregulated environments?
  4. FF4 – Tool Granularity: How does the choice between atomic primitives, composite macros, and bulk operations affect execution accuracy and latency?

Evaluation framework

The benchmarking engine (benchmark.py) autonomously runs tasks across all workflow configurations using Google Gemini 3 Flash as the LLM backend. For each step, it provisions a fresh document, sends a prompt to the agent, and compares the resulting HTML against a reference using Levenshtein distance after DOM normalization (BeautifulSoup4). Full tool-call traces are exported as JSON for qualitative analysis.
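The comparison step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in benchmark.py: it swaps BeautifulSoup4 for the standard-library html.parser so the example runs without dependencies, and the names normalize_html, levenshtein, step_passes, and the max_distance threshold are hypothetical.

```python
from html.parser import HTMLParser


class _Normalizer(HTMLParser):
    """Re-serializes HTML with lowercase names, sorted attributes, collapsed whitespace."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # html.parser already lowercases tag and attribute names.
        ordered = sorted(attrs, key=lambda kv: kv[0])
        attr_text = "".join(f' {name}="{value or ""}"' for name, value in ordered)
        self.parts.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse runs of whitespace and drop whitespace-only text nodes.
        text = " ".join(data.split())
        if text:
            self.parts.append(text)


def normalize_html(html: str) -> str:
    """Return a canonical string form so cosmetic differences do not count as edits."""
    parser = _Normalizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping only two rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def step_passes(result_html: str, reference_html: str, max_distance: int = 0) -> bool:
    """Hypothetical pass criterion: normalized distance within an assumed threshold."""
    distance = levenshtein(normalize_html(result_html), normalize_html(reference_html))
    return distance <= max_distance
```

Normalizing before measuring distance means that attribute order, tag case, and incidental whitespace do not register as agent errors. The pass threshold actually used by the benchmark is not stated here, so max_distance is only a placeholder.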

Error classification (Lu et al., 2025)

Every error is categorized into one of three types:

Workflow configurations

The Agent Scratchpad (persistent memory with self-conditioning detection, exposed through three tools: plan_tasks, mark_task_done, and abort_task) is active in workflows A–K and O. Workflows L, M, N, P, Q, R, S, and T disable it entirely, which promotes the scratchpad from a controlled constant to an experimental variable along the 2×2 isolation matrix (verification × granularity).

| Workflow | Tools (exposed / initial) | Mechanism | Tests |
|---|---|---|---|
| A Baseline | 24 / 24 | All primitive tools, no restrictions, neutral scratchpad | Unregulated baseline |
| B Bloat | 83 / 83 | format_text replaced with 46 atomic endpoints + 12 macros, expanded literal enums | Schema bloat, literal-enum confusion |
| C Discovery | 27 / 10 | On-demand discover_tools unlocks mutations | Active tool acquisition (MCP-Zero) |
| D Enforced Verification | 26 / 6 | Write tools auto-lock after every mutation | Infrastructure-enforced verification loop |
| E Macros | 38 / 38 | Primitives + 12 composite macros incl. bulk ops | Tool granularity tradeoff |
| F FSM | 27 / 7 | 8-state finite-state machine (SEARCH → MUTATE → VERIFY) | State-constrained ReAct |
| G FSM + Macros | 39 / 7 | FSM + macros (atomic calls skip MUTATE) | Granularity × restriction interaction |
| H Instructed ReAct | 25 / 25 | Prompt-based verification instruction, no enforcement | Prompt discipline vs. infrastructure |
| I Instructed ReAct + Macros | 38 / 38 | Instructed ReAct + 12 composite macros | Prompt discipline with macro shortcuts |
| J Observing FSM | 25 / 25 | State feedback after every action, no blocking | Feedback alone vs. enforcement |
| K Observing FSM + Macros | 38 / 38 | J's feedback + E's macros | Feedback × macro interaction |
| L Bare Primitives | 21 / 21 | Primitives only, no scratchpad, no feedback, no verification | Minimal baseline (scratchpad isolation) |
| M Observing FSM + Macros, no Scratchpad | 35 / 35 | K without scratchpad | Pareto-optimal 100% workflow |
| N Macros, no Scratchpad | 35 / 35 | Macros without scratchpad and without feedback | Isolates feedback effect (L vs. M) |
| O Few-Shot ReAct | 25 / 25 | H's tools, but verification taught via few-shot example | Closest to original Yao et al. 2023 ReAct |
| P Instructed ReAct + Macros, no Scratchpad (Default) | 35 / 35 | I without scratchpad | Production default for text + voice agent |
| Q Instructed ReAct + Primitives, no Scratchpad | 22 / 22 | H without scratchpad | Completes 2×2 scratchpad-isolation matrix |
| R Optimized FSM, no Scratchpad | ~35 / 7 | FSM + macros + batch mutations + intent inference, no scratchpad | FSM roundtrip optimization |
| S Smart Verification FSM, no Scratchpad | ~35 / 7 | Selective enforcement: index-shifting ops force verify, formatting does not | Differentiated verification policy |
| T Smart Observing FSM, no Scratchpad | ~35 / 7 | Like S, but feedback only ("INDEX SHIFT — cached indices stale"), no blocking | Feedback vs. enforcement under index drift |
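Three of the verification policies in the table (D's auto-lock after every mutation, S's selective enforcement, and T's feedback-only variant) can be sketched as a small gate in front of the tool dispatcher. This is a minimal illustration, not the benchmark's implementation: the class VerificationGate, the policy names, the returned messages, and every tool name except format_text are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical tool groupings, chosen only for this example.
READ_ONLY = {"read_document", "search_text"}
INDEX_SHIFTING = {"insert_text", "delete_text"}  # change character offsets
FORMATTING = {"format_text"}                     # leave offsets untouched


@dataclass
class VerificationGate:
    """Decides whether a tool call may run, given an assumed verification policy.

    policy = "enforced":  every mutation locks writes until a read (D-like)
    policy = "smart":     only index-shifting mutations lock writes (S-like)
    policy = "observing": never blocks, only warns on index shifts (T-like)
    """

    policy: str = "enforced"
    needs_verify: bool = False

    def call(self, tool: str) -> str:
        if tool in READ_ONLY:
            # Any read counts as verification and unlocks writes again.
            self.needs_verify = False
            return "ok"

        if self.needs_verify and self.policy in ("enforced", "smart"):
            return "blocked: verify the document before the next mutation"

        # Anything that is not read-only is treated as a mutation.
        must_verify = self.policy == "enforced" or tool in INDEX_SHIFTING
        self.needs_verify = must_verify

        if must_verify and self.policy == "observing":
            return "ok, warning: index shift, cached indices may be stale"
        return "ok"
```

Under the "smart" policy two consecutive format_text calls go through, whereas a mutation that follows insert_text is rejected until the agent reads the document again. Under "observing" the same sequence succeeds but carries a warning, which mirrors the feedback-versus-enforcement contrast between T and S.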

Key metrics

Theoretical foundations

The research builds on:

Benchmark results summary

Per-step averages across 13 benchmark scenarios. The workflow-restrictions benchmark (Bench 12) is the primary cross-workflow stress test. The tool-bloat workflows B and C are evaluated on the dedicated bloat benchmarks (Bench 10/11), and the endurance benchmark (Bench 13, a 10-step risk report) is used to differentiate the high-performing workflows. Model: Google Gemini 3 Flash (Preview).

| Workflow | Pass Rate | Avg Tokens / step | Avg Tool Calls / step | Bench |
|---|---|---|---|---|
| A Static Baseline | 100% | 142,657 | 14.0 | 12 |
| B Tool Bloat | 45% | 360,372 | 17.8 | 10/11 |
| C Active Discovery | 80% | 333,368 | 20.5 | 10/11 |
| D Enforced Verification | 100% | 460,424 | 21.9 | 12 |
| E Macros | 94% | 98,238 | 9.4 | 12 |
| F FSM | 87% | 583,786 | 24.4 | 12 |
| G FSM + Macros | 90% | 316,846 | 16.6 | 12 |
| H Instructed ReAct | 100% | 272,537 | 18.9 | 12 |
| I Instructed ReAct + Macros | 100% | 182,967 | 13.1 | 12 |
| J Observing FSM | 100% | 174,106 | 15.2 | 12 |
| K Observing FSM + Macros | 100% | 156,221 | 13.3 | 12 |
| L Bare Primitives | 60% | 78,076 | 9.5 | 12 |
| M Observing FSM + Macros, no SP (Pareto) | 100% | 65,907 | 5.6 | 12 |
| N Macros, no SP / no feedback | 80% | 102,367 | 9.6 | 12 |
| O Few-Shot ReAct | 100% | 153,225 | 14.6 | 12 |
| P Instructed ReAct + Macros, no SP (Default) | 100% | 71,987 | 5.3 | 12 |
| Q Instructed ReAct + Primitives, no SP | 92% | 146,440 | 12.9 | 12 |
| R Optimized FSM, no SP | 80% | 93,866 | 7.5 | 12 |
| S Smart Verification FSM, no SP | 100% | 64,561 | 5.6 | 12 |
| T Smart Observing FSM, no SP | 92% | 63,304 | 5.6 | 12 |

Key findings.

Tool Bloat (B, 83 tools) collapses to 0% on Bench 11. The failure mode is not Lost-in-the-Middle on the tool list but parameter confusion across inflated literal enums (e.g., gold vs. yellow).

Three verification mechanisms reach 100% on Bench 12: enforced gating (D), prompt instruction (H), and observing feedback (J), ranked from most to least token-expensive.

Removing the Agent Scratchpad on the macro workflows (M, P, S) is essentially free in accuracy but cuts per-step tokens to ~65k and tool calls to ~5.

P (Instructed ReAct + Macros, no scratchpad) is deployed as the production default for both the text and voice agent. It Pareto-dominates I across Bench 12 and Bench 13 (Endurance) at 100% pass rate / ~73k tokens / 4.9 calls per step.

Smart-Verification FSM (S) replicates this Pareto point with infrastructural index-drift guarantees, while its observing-only sibling (T) drops to 92%: selective enforcement matters where the model mis-computes its own insert lengths.
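The dominance and savings claims can be checked by hand from the Bench 12 figures. The sketch below is illustrative only: it uses four rows copied from the results table (A, I, M, P), covers Bench 12 and not Bench 13, and the names Result, dominates, and reduction_pct are hypothetical helpers.

```python
from typing import NamedTuple


class Result(NamedTuple):
    pass_rate: float  # percent, higher is better
    tokens: float     # average tokens per step, lower is better
    calls: float      # average tool calls per step, lower is better


# Bench 12 rows taken from the results table.
BENCH_12 = {
    "A": Result(100, 142_657, 14.0),
    "I": Result(100, 182_967, 13.1),
    "M": Result(100, 65_907, 5.6),
    "P": Result(100, 71_987, 5.3),
}


def dominates(x: Result, y: Result) -> bool:
    """x Pareto-dominates y: no worse on every metric and strictly better on one."""
    no_worse = (x.pass_rate >= y.pass_rate
                and x.tokens <= y.tokens
                and x.calls <= y.calls)
    strictly_better = (x.pass_rate > y.pass_rate
                       or x.tokens < y.tokens
                       or x.calls < y.calls)
    return no_worse and strictly_better


def reduction_pct(baseline: float, value: float) -> float:
    """Relative saving of value against baseline, in percent."""
    return 100.0 * (1.0 - value / baseline)


if __name__ == "__main__":
    print(dominates(BENCH_12["P"], BENCH_12["I"]))
    print(round(reduction_pct(BENCH_12["A"].tokens, BENCH_12["M"].tokens), 1))
    print(round(reduction_pct(BENCH_12["A"].calls, BENCH_12["M"].calls), 1))
```

On these rows P dominates I, and M's savings against the baseline A come to roughly 54% of tokens and 60% of tool calls, in line with the headline figures. M and P do not dominate each other on Bench 12, since M uses fewer tokens and P fewer calls.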

Contact & source code

If you're working on LLM agent evaluation, tool use benchmarking, or MCP tooling, I'd love to connect.

