How tool design changes LLM agent reliability — 20 workflows benchmarked

An open empirical study on autonomous document editing with Google Gemini 3 Flash. The Pareto-optimal tool design matches a 24-tool baseline at 100% pass rate while cutting tokens by ~54% and tool calls by ~60%. An 83-tool “bloated” workflow collapses to 0% on the hardest scenarios — not from Lost-in-the-Middle, but from parameter confusion across inflated literal enums.

Below: research questions, the 20 workflow configurations, per-step metrics across 13 benchmark scenarios, and the design rationale behind the production default (P / S).


Context

This benchmark is the testbed for a master’s thesis at HFT Stuttgart: Agent-Driven Voice-Only Interfaces — Evaluation of Tool-Granularity, Tool-Bloat and Workflow-Constraints on an Agent-First Document Editor. Data is being re-collected and published openly as the workflow set evolves.

Research questions

FF1 – Agent-First Architecture: What architectural requirements must a document editor fulfill to support reliable autonomous agent operation?
FF2 – Tool Bloat: How does the number of simultaneously available tools in the context window affect execution accuracy, and can dynamic tool restriction (Active Discovery) reduce the error rate?
FF3 – Workflow Restrictions: How effectively do architecturally enforced verification loops (ReAct, FSM) reduce error rates compared to unregulated environments?
FF4 – Tool Granularity: How does the choice between atomic primitives, composite macros, and bulk operations affect execution accuracy and latency?

Evaluation framework

The benchmarking engine (benchmark.py) autonomously runs tasks across all workflow configurations using Google Gemini 3 Flash as the LLM backend. For each step, it provisions a fresh document, sends a prompt to the agent, and compares the resulting HTML against a reference using Levenshtein distance after DOM normalization (BeautifulSoup4). Full tool-call traces are exported as JSON for qualitative analysis.
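
The comparison step can be sketched as follows. This is a minimal illustration, not the code in benchmark.py: it uses the standard-library `html.parser` as a stand-in for the BeautifulSoup4 normalization, and the function names (`normalize_html`, `levenshtein`, `step_distance`) and the exact normalization rules (sorted attributes, collapsed whitespace) are assumptions.

```python
from html.parser import HTMLParser


class _Normalizer(HTMLParser):
    """Re-serialises HTML with sorted attributes and collapsed whitespace."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Sort attributes by name so attribute order does not affect the distance.
        ordered = sorted(attrs, key=lambda pair: pair[0])
        attr_text = "".join(f' {name}="{value or ""}"' for name, value in ordered)
        self.parts.append(f"<{tag}{attr_text}>")

    def handle_startendtag(self, tag, attrs):
        # Treat <br/> and <br> identically.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse runs of whitespace and drop whitespace-only text nodes.
        text = " ".join(data.split())
        if text:
            self.parts.append(text)


def normalize_html(html: str) -> str:
    parser = _Normalizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def step_distance(agent_html: str, reference_html: str) -> int:
    """Distance between agent output and reference after normalization."""
    return levenshtein(normalize_html(agent_html), normalize_html(reference_html))
```

Under these assumptions, a step passes when `step_distance` returns 0, so purely cosmetic differences such as attribute order or indentation do not count as failures.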

Error classification (Lu et al., 2025)

Every error is categorized into one of three types:

Planning errors: Agent selected a locked tool or aborted – detected via SYSTEM INTERCEPT responses and abort_task calls
Execution errors: Agent selected the correct tool but passed invalid parameters – detected via backend validation errors
Response generation errors: All tool calls succeeded, but the final document deviates from the reference – the agent misunderstood the task
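
A rough sketch of how the three detection rules above could be applied to one step's tool-call trace. The `ToolCall` record and its fields are hypothetical (the exported JSON schema is not described here), and counting one error per offending call is an assumption; only the `SYSTEM INTERCEPT` marker and the `abort_task` tool name come from the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ToolCall:
    # Hypothetical trace record; field names are assumptions for illustration.
    name: str
    response: str = ""
    validation_error: bool = False


def classify_step(calls: list[ToolCall], distance: int) -> Counter:
    """Count planning / execution / response generation errors for one step."""
    counts = Counter()
    for call in calls:
        if call.name == "abort_task" or "SYSTEM INTERCEPT" in call.response:
            counts["planning"] += 1      # locked tool attempted, or task aborted
        elif call.validation_error:
            counts["execution"] += 1     # right tool, invalid parameters
    # Every call succeeded, yet the document still deviates from the reference.
    if not counts and distance > 0:
        counts["response_generation"] += 1
    return counts
```

For example, a trace with one intercepted call and one rejected parameter set yields one planning and one execution error, while a clean trace with a non-zero distance yields a single response generation error.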

Workflow configurations

The Agent Scratchpad (persistent memory with self-conditioning detection, exposed as three tools: plan_tasks, mark_task_done, and abort_task) is active in workflows A–K and O. Workflows L, M, N, P, Q, R, S, and T disable it entirely, which promotes the scratchpad from a controlled constant to an experimental variable along the 2×2 isolation matrix (verification × granularity).

Workflow	Tools (exposed / initial)	Mechanism	Tests
A Baseline	24 / 24	All primitive tools, no restrictions, neutral scratchpad	Unregulated baseline
B Bloat	83 / 83	`format_text` replaced with 46 atomic endpoints + 12 macros, expanded literal enums	Schema bloat, literal-enum confusion
C Discovery	27 / 10	On-demand `discover_tools` unlocks mutations	Active tool acquisition (MCP-Zero)
D Enforced Verification	26 / 6	Write tools auto-lock after every mutation	Infrastructure-enforced verification loop
E Macros	38 / 38	Primitives + 12 composite macros incl. bulk ops	Tool granularity tradeoff
F FSM	27 / 7	8-state finite-state machine (SEARCH → MUTATE → VERIFY)	State-constrained ReAct
G FSM + Macros	39 / 7	FSM + macros (atomic calls skip MUTATE)	Granularity × restriction interaction
H Instructed ReAct	25 / 25	Prompt-based verification instruction, no enforcement	Prompt discipline vs. infrastructure
I Instructed ReAct + Macros	38 / 38	Instructed ReAct + 12 composite macros	Prompt discipline with macro shortcuts
J Observing FSM	25 / 25	State feedback after every action, no blocking	Feedback alone vs. enforcement
K Observing FSM + Macros	38 / 38	J's feedback + E's macros	Feedback × macro interaction
L Bare Primitives	21 / 21	Primitives only, no scratchpad, no feedback, no verification	Minimal baseline (scratchpad isolation)
M Observing FSM + Macros, no Scratchpad	35 / 35	K without scratchpad	Pareto-optimal 100% workflow
N Macros, no Scratchpad	35 / 35	Macros without scratchpad and without feedback	Isolates feedback effect (L vs. M)
O Few-Shot ReAct	25 / 25	H's tools, but verification taught via few-shot example	Closest to original Yao et al. 2023 ReAct
P Instructed ReAct + Macros, no Scratchpad (Default)	35 / 35	I without scratchpad	Production default for text + voice agent
Q Instructed ReAct + Primitives, no Scratchpad	22 / 22	H without scratchpad	Completes 2×2 scratchpad-isolation matrix
R Optimized FSM, no Scratchpad	~35 / 7	FSM + macros + batch mutations + intent inference, no scratchpad	FSM roundtrip optimization
S Smart Verification FSM, no Scratchpad	~35 / 7	Selective enforcement: index-shifting ops force verify, formatting does not	Differentiated verification policy
T Smart Observing FSM, no Scratchpad	~35 / 7	Like S, but feedback only ("INDEX SHIFT — cached indices stale"), no blocking	Feedback vs. enforcement under index drift
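
The difference between workflows S and T can be illustrated with a small gate object. This is a sketch under stated assumptions, not the project's FSM: the tool names `insert_text`, `delete_text`, and `read_document` are hypothetical (only `format_text` is named above), the rule that every write is blocked until the next read is an assumption, and the feedback string is modeled on the message quoted for T.

```python
# Hypothetical tool groups; only format_text is named in the table above.
INDEX_SHIFTING_TOOLS = {"insert_text", "delete_text"}
FORMATTING_TOOLS = {"format_text"}
VERIFY_TOOLS = {"read_document"}


class SmartVerificationGate:
    """enforce=True sketches workflow S (blocking), enforce=False workflow T."""

    def __init__(self, enforce: bool):
        self.enforce = enforce
        self.indices_stale = False

    def call(self, tool: str) -> str:
        if tool in VERIFY_TOOLS:
            # Re-reading the document refreshes the agent's cached indices.
            self.indices_stale = False
            return "OK"
        is_write = tool in INDEX_SHIFTING_TOOLS or tool in FORMATTING_TOOLS
        if is_write and self.indices_stale and self.enforce:
            # S: writes stay locked until the agent verifies.
            return "SYSTEM INTERCEPT: verify the document before the next write"
        if tool in INDEX_SHIFTING_TOOLS:
            self.indices_stale = True
            if not self.enforce:
                # T: warn only, never block.
                return "OK. INDEX SHIFT: cached indices stale"
        # Formatting does not move indices, so it never forces a verify.
        return "OK"
```

With `enforce=True`, an insert followed directly by a format call is intercepted until a read happens; with `enforce=False`, the same sequence goes through and the agent only receives the stale-index warning.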

Key metrics

Pass rate – percentage of benchmark steps with Levenshtein distance zero (exact HTML match)
Levenshtein distance – character-level edit distance between agent output and reference HTML (after DOM normalization)
Relative success – normalized against BASELINE control group (0% = no change, 100% = perfect match)
Planning / Execution / Response error counts – per Lu et al.'s taxonomy, including Context and Selection Hallucination sub-types
Token consumption – total prompt + candidate tokens per step, with peak and average context window sizes
Latency decomposition – total time split into Gemini inference (99.5%), MCP tool execution (0.4%), and orchestration overhead (0.1%)
Tool calls per step – number of LLM roundtrips, correlated with workflow restrictions
Context Hallucination rate – how often the agent attempted locked tools in dynamically restricted workflows
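
The first three metrics can be sketched in a few lines. The formula for relative success is an assumption: it reads the "BASELINE control group" as the distance between the untouched starting document and the reference, so that leaving the document unchanged scores 0% and an exact match scores 100%. Function and parameter names are illustrative.

```python
def pass_rate(distances: list[int]) -> float:
    """Percentage of steps whose Levenshtein distance to the reference is zero."""
    return 100.0 * sum(d == 0 for d in distances) / len(distances)


def relative_success(agent_distance: int, baseline_distance: int) -> float:
    """Share of the baseline gap the agent closed (assumed definition).

    baseline_distance: distance from the unmodified document to the reference.
    Values below zero would mean the agent moved the document further away.
    """
    if baseline_distance == 0:
        return 100.0  # nothing to change, so any exact match is a full success
    return 100.0 * (1 - agent_distance / baseline_distance)
```

Under this reading, an agent that reduces a 40-character gap to 10 characters scores 75% relative success but still fails the step, since the pass rate only counts exact matches.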

Theoretical foundations

The research builds on:

ReAct (Yao et al., 2023) – Thought-Action-Observation loops as baseline reasoning
StateAct (Rozanov & Rei, 2025) – Self-prompting and chain-of-states tracking
Lost in the Middle (Liu et al., 2024) – U-shaped attention distribution in long contexts
Self-Conditioning Effect (Sinha et al., 2026) – Error cascades from in-context conditioning
MCP-Zero (Fei et al., 2025) – Active tool discovery reducing token overhead by 98%
Tool hallucination taxonomy (Xu et al., 2025) – Selection vs. usage hallucinations, indecisive action space
Lu et al. (2025) – Three-category agent error taxonomy: planning, execution, response generation

Benchmark results summary

Per-step averages across 13 benchmark scenarios. The workflow-restrictions benchmark (Bench 12) is the primary cross-workflow stress test; the tool-bloat workflows B and C are evaluated on the dedicated bloat benchmarks (Bench 10/11); the endurance benchmark (Bench 13, a 10-step risk report) is used to differentiate the high-performing workflows. Model: Google Gemini 3 Flash (Preview).

Workflow	Pass Rate	Avg Tokens / step	Avg Tool Calls / step	Bench
A Static Baseline	100%	142,657	14.0	12
B Tool Bloat	45%	360,372	17.8	10/11
C Active Discovery	80%	333,368	20.5	10/11
D Enforced Verification	100%	460,424	21.9	12
E Macros	94%	98,238	9.4	12
F FSM	87%	583,786	24.4	12
G FSM + Macros	90%	316,846	16.6	12
H Instructed ReAct	100%	272,537	18.9	12
I Instructed ReAct + Macros	100%	182,967	13.1	12
J Observing FSM	100%	174,106	15.2	12
K Observing FSM + Macros	100%	156,221	13.3	12
L Bare Primitives	60%	78,076	9.5	12
M Observing FSM + Macros, no SP (Pareto)	100%	65,907	5.6	12
N Macros, no SP / no feedback	80%	102,367	9.6	12
O Few-Shot ReAct	100%	153,225	14.6	12
P Instructed ReAct + Macros, no SP (Default)	100%	71,987	5.3	12
Q Instructed ReAct + Primitives, no SP	92%	146,440	12.9	12
R Optimized FSM, no SP	80%	93,866	7.5	12
S Smart Verification FSM, no SP	100%	64,561	5.6	12
T Smart Observing FSM, no SP	92%	63,304	5.6	12
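
The headline reductions (~54% fewer tokens, ~60% fewer tool calls) appear to correspond to workflow M measured against baseline A. The short check below recomputes them from the table rows; the helper name `reduction_pct` is illustrative.

```python
def reduction_pct(baseline: float, candidate: float) -> float:
    """Percentage saved by the candidate relative to the baseline."""
    return 100.0 * (1 - candidate / baseline)


# Per-step averages copied from the Bench 12 rows above.
A = {"tokens": 142_657, "calls": 14.0}   # A Static Baseline
M = {"tokens": 65_907, "calls": 5.6}     # M Observing FSM + Macros, no SP

token_cut = reduction_pct(A["tokens"], M["tokens"])
call_cut = reduction_pct(A["calls"], M["calls"])

print(f"tokens: -{token_cut:.1f}%, tool calls: -{call_cut:.1f}%")
# prints "tokens: -53.8%, tool calls: -60.0%"
```

The same helper can be applied to the P and S rows to compare the other no-scratchpad macro workflows against the baseline.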

Key findings. Tool Bloat (B, 83 tools) collapses to 0% on Bench 11. The failure mode is not Lost-in-the-Middle on the tool list but parameter confusion across inflated literal enums (e.g., gold vs. yellow).

Three verification mechanisms reach 100% on Bench 12: enforced gating (D), prompt instruction (H), and observing feedback (J), ranked from most to least token-expensive (about 460k, 273k, and 174k tokens per step).

Removing the Agent Scratchpad on the macro workflows (M, P, S) is essentially free in accuracy but cuts per-step tokens to roughly 65k to 72k and tool calls to between 5 and 6.

P (Instructed ReAct + Macros, no scratchpad) is deployed as the production default for both the text and voice agent: it Pareto-dominates I across Bench 12 and Bench 13 (Endurance) at 100% pass rate / ~73k tokens / 4.9 calls per step. Smart-Verification FSM (S) replicates this Pareto point with infrastructural index-drift guarantees, while its observing-only sibling (T) drops to 92%, which suggests that selective enforcement matters where the model mis-computes its own insert lengths.

Contact & source code

If you're working on LLM agent evaluation, tool use benchmarking, or MCP tooling, I'd love to connect.


Companion blog posts

Tool Granularity in LLM Agents — an accessible write-up of what 20 workflow configurations × 13 scenarios revealed about atomic vs. composite MCP tool design.
Voice-First Document Editing with Gemini Live — native-audio architecture for the voice agent benchmarked alongside the text agent.
Rebuilding PDF + DOCX Export — the export pipeline used to materialise benchmark documents.