Scaling Agent Tool Catalogs: Discovery, Deferred Loading, and Context Efficiency
LLM agents degrade past roughly 30 tools: selection accuracy drops and context cost explodes. This guide covers dynamic tool discovery, deferred loading, and programmatic tool calling for generative AI agents scaling to thousands of tools, with production patterns and interview prep.
Why 30+ Tools Break Your Agent
Every tool definition injected into an LLM's context window costs tokens — typically 200–500 tokens per tool for name, description, and parameter schema. A modest multi-server MCP setup with GitHub, Slack, Sentry, Grafana, and Splunk easily consumes ~55,000 tokens in tool definitions alone, before the agent reads a single user message. That is 43% of a 128K context window spent on tool menus.
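To make the arithmetic concrete, here is a back-of-the-envelope budget check. The per-server tool counts and the ~300-token midpoint are illustrative assumptions, not measured values:

```python
# Back-of-the-envelope tool-context budget. Tool counts per server and the
# per-tool token cost are illustrative assumptions, not measured values.
TOKENS_PER_TOOL = 300      # assumed midpoint of the 200-500 token range
CONTEXT_WINDOW = 128_000

servers = {"github": 35, "slack": 25, "sentry": 30, "grafana": 40, "splunk": 50}

total_tools = sum(servers.values())                # 180 tools
definition_tokens = total_tools * TOKENS_PER_TOOL  # 54,000 tokens
print(f"{total_tools} tools -> {definition_tokens:,} tokens "
      f"({definition_tokens / CONTEXT_WINDOW:.0%} of a 128K window)")
# 180 tools -> 54,000 tokens (42% of a 128K window)
```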
But context cost is only half the problem. Tool selection accuracy degrades measurably past ~30 tools. The model must evaluate every tool description, weigh the plausible candidates against the request, and pick the right one, a task that gets combinatorially harder as the catalog grows. Independent benchmarks across ~4,000 tools report retrieval accuracy of only 56–64% with current search methods, well below production floors. When the model picks the wrong tool, the failure cascades: wrong tool → wrong result → wrong reasoning → confident wrong answer.
The third pressure is latency. Each tool definition lengthens the system prompt, which increases time-to-first-token proportionally. For a 500-tool catalog at ~300 tokens per tool, that is 150K tokens of prompt — adding 3–8 seconds of prefill latency on typical GPU serving infrastructure, before the model generates a single character.
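A first-order model treats prefill as linear in prompt length: latency ≈ prompt tokens / prefill throughput. The throughput figures below are assumptions for illustration; real numbers depend on hardware, batching, and the serving stack:

```python
# Prefill latency ~= prompt_tokens / prefill_throughput (first-order model).
# Throughput figures are illustrative assumptions, not benchmarks.
prompt_tokens = 500 * 300  # 500 tools at ~300 tokens each = 150,000 tokens

for label, tokens_per_sec in [("fast prefill", 50_000), ("slow prefill", 20_000)]:
    print(f"{label}: {prompt_tokens / tokens_per_sec:.1f} s to first token")
# fast prefill: 3.0 s to first token
# slow prefill: 7.5 s to first token
```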
These three forces — context bloat, selection degradation, and latency inflation — create an engineering ceiling. Naive "load everything" architectures work for 5–10 tools. Beyond that, you need an architecture that treats tool definitions as a retrieval problem, not a context-stuffing problem. The solutions parallel what we already know from document RAG: index, search, load on demand, and keep the hot path lean.
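A minimal sketch of the deferred pattern, assuming a hypothetical catalog: the only definition in the hot path is a `search_tools` meta-tool over an indexed registry, and matching definitions expand inline on demand. The keyword-overlap scorer is a stand-in for the embedding search a production index would use:

```python
from dataclasses import dataclass

@dataclass
class ToolDef:
    name: str
    description: str
    schema: dict  # JSON Schema for parameters

# Hypothetical catalog; only the search tool itself stays always-loaded.
CATALOG = [
    ToolDef("github_create_issue", "Create a GitHub issue in a repository", {}),
    ToolDef("slack_post_message", "Post a message to a Slack channel", {}),
    ToolDef("grafana_query_metrics", "Query Grafana dashboards for metrics", {}),
    # ... hundreds more, indexed but never loaded up front
]

def search_tools(query: str, top_k: int = 3) -> list[ToolDef]:
    """Meta-tool exposed to the model. Naive keyword overlap stands in for
    the embedding search a production index would use."""
    terms = set(query.lower().split())
    scored = [
        (len(terms & set(f"{t.name} {t.description}".lower().split())), t)
        for t in CATALOG
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for score, t in scored[:top_k] if score > 0]

# The agent calls search_tools(...) and only the matching definitions are
# expanded inline into the conversation body, not the system prompt.
for tool in search_tools("post a message to slack"):
    print(tool.name)  # best match first: slack_post_message
```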
What Interviewers Test on Tool Scaling
The interviewer is checking whether you understand that tool definitions are expensive context, not free metadata. A 6/10 answer says "give the agent all the tools it needs." A 9/10 answer explains why always-loaded catalogs break at scale (selection accuracy, context cost, latency), proposes a tiered architecture (critical tools always loaded, everything else deferred behind a search tool), and quantifies the tradeoff: ~55K tokens down to ~8.7K tokens with deferred loading, an ~84% reduction. Staff-level candidates also address prompt-cache preservation: deferred tools don't invalidate cache entries because they expand inline in the conversation body, not in the system-prompt prefix.
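One way to quantify that tradeoff in an answer. The tier sizes below are assumptions chosen to reproduce the figures above, not a prescription:

```python
# Tiered budget sketch with illustrative numbers (~300 tokens per tool;
# tool counts are assumptions picked to mirror the ~55K / ~8.7K figures).
TOKENS_PER_TOOL = 300
TOTAL_TOOLS = 183          # full catalog, naively always loaded
CRITICAL_TOOLS = 25        # tier 1: always in the system prompt
RETRIEVED_PER_TURN = 4     # tier 2: expanded inline after search_tools

naive = TOTAL_TOOLS * TOKENS_PER_TOOL                             # 54,900
tiered = (CRITICAL_TOOLS + RETRIEVED_PER_TURN) * TOKENS_PER_TOOL  # 8,700
print(f"naive: {naive:,}  tiered: {tiered:,}  "
      f"reduction: {1 - tiered / naive:.0%}")
# naive: 54,900  tiered: 8,700  reduction: 84%
# Retrieved definitions land in the conversation body, so the cached
# system-prompt prefix (tier 1) is never invalidated between turns.
```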