
GenAI & Agents · Advanced

Scaling Agent Tool Catalogs: Discovery, Deferred Loading, and Context Efficiency

LLM agents degrade past 30 tools — selection accuracy drops and context explodes. Master dynamic tool discovery, deferred loading, and programmatic calling for generative AI agents scaling to thousands of tools. Production patterns and interview prep.

30 min read · 3 sections · 1 interview question
Tool Search Tool, Programmatic Tool Calling, MCP, FAISS, Anthropic Claude, Context Window, Tool Registry, Deferred Loading, Tool Namespacing, Agent Tool Design, OpenAI Function Calling, BM25, Tool Discovery, Tool Catalog Scaling

Why 30+ Tools Break Your Agent

Every tool definition injected into an LLM's context window costs tokens — typically 200–500 tokens per tool for name, description, and parameter schema. A modest multi-server MCP setup with GitHub, Slack, Sentry, Grafana, and Splunk easily consumes ~55,000 tokens in tool definitions alone, before the agent reads a single user message. That is 43% of a 128K context window spent on tool menus.
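The arithmetic is easy to verify with a back-of-the-envelope cost model. The per-server tool counts and the per-tool token figure below are illustrative assumptions (chosen inside the 200–500 range above), not measured values:

```python
# Rough cost model for always-loaded tool definitions.
# Server tool counts and tokens-per-tool are assumed figures,
# not measurements of the real MCP servers.
SERVERS = {
    "github": 35,
    "slack": 11,
    "sentry": 5,
    "grafana": 60,
    "splunk": 20,
}
AVG_TOKENS_PER_TOOL = 420  # assumed average within the 200-500 range

def catalog_token_cost(servers, tokens_per_tool):
    """Tokens consumed by injecting every tool definition up front."""
    return sum(servers.values()) * tokens_per_tool

total = catalog_token_cost(SERVERS, AVG_TOKENS_PER_TOOL)
print(f"{sum(SERVERS.values())} tools -> ~{total:,} tokens "
      f"({total / 128_000:.0%} of a 128K window)")
```

With these assumed numbers, 131 tools at ~420 tokens each lands at ~55K tokens, roughly 43% of a 128K window, before any conversation content arrives.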

But context cost is only half the problem. Tool selection accuracy degrades measurably past ~30 tools. The model must evaluate every tool description, compare plausibility, and pick the right one — a task that becomes combinatorially harder as the catalog grows. Independent benchmarks across ~4,000 tools report retrieval accuracy of only 56–64% with current search methods, well below production floors. When the model picks the wrong tool, the failure cascades: wrong tool → wrong result → wrong reasoning → confident wrong answer.

The third pressure is latency. Each tool definition lengthens the system prompt, which increases time-to-first-token proportionally. For a 500-tool catalog at ~300 tokens per tool, that is 150K tokens of prompt — adding 3–8 seconds of prefill latency on typical GPU serving infrastructure, before the model generates a single character.
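The latency range follows directly from prefill throughput. The throughput figures below are assumptions standing in for "typical GPU serving infrastructure"; real numbers depend on model size, batching, and hardware:

```python
def prefill_latency_s(num_tools, tokens_per_tool, prefill_tps):
    """Time-to-first-token contribution from tool definitions alone.
    prefill_tps (tokens/sec processed during prefill) is an assumed
    figure, not a benchmark of any specific stack."""
    return num_tools * tokens_per_tool / prefill_tps

# 500 tools x ~300 tokens, at assumed 20K-50K tok/s prefill rates:
for tps in (50_000, 20_000):
    print(f"{tps:>6} tok/s -> {prefill_latency_s(500, 300, tps):.1f} s")
```

At 50K tok/s the 150K-token prompt adds 3.0 s of prefill; at 20K tok/s it adds 7.5 s, matching the 3–8 second range above.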

These three forces — context bloat, selection degradation, and latency inflation — create an engineering ceiling. Naive "load everything" architectures work for 5–10 tools. Beyond that, you need an architecture that treats tool definitions as a retrieval problem, not a context-stuffing problem. The solutions parallel what we already know from document RAG: index, search, load on demand, and keep the hot path lean.
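A tiered registry is one way to structure the "index, search, load on demand" approach: a small hot tier stays in the system prompt, and everything else is reachable only through a search tool and expanded into the conversation on demand. This is a minimal sketch under assumed names, not a real MCP or framework API:

```python
from dataclasses import dataclass, field

@dataclass
class Tool:
    name: str
    description: str
    schema_tokens: int  # rough size of the injected definition

@dataclass
class TieredRegistry:
    """Hot tier: always in the system prompt. Cold tier: discoverable
    via search, loaded only when needed. (Illustrative sketch.)"""
    hot: list = field(default_factory=list)
    cold: dict = field(default_factory=dict)
    loaded: list = field(default_factory=list)

    def baseline_context(self):
        # Only hot-tier definitions (e.g. the search tool itself)
        # are paid on every single request.
        return sum(t.schema_tokens for t in self.hot)

    def load(self, name):
        # Deferred tools expand inline in the conversation body,
        # leaving the cached system-prompt prefix untouched.
        tool = self.cold[name]
        self.loaded.append(tool)
        return tool

registry = TieredRegistry(
    hot=[Tool("search_tools", "Search the tool catalog", 250)],
    cold={
        "splunk_query": Tool("splunk_query", "Run a Splunk search", 400),
        "grafana_query": Tool("grafana_query", "Query Grafana metrics", 350),
    },
)
print(registry.baseline_context())  # hot tier only
registry.load("splunk_query")       # pulled in on demand
```

The key property is that the per-request baseline scales with the hot tier, not the catalog: adding a thousand cold tools leaves `baseline_context()` unchanged.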

IMPORTANT

What Interviewers Test on Tool Scaling

The interviewer is checking whether you understand that tool definitions are expensive context — not free metadata. A 6/10 answer says 'give the agent all the tools it needs.' A 9/10 answer explains why always-loaded catalogs break at scale (selection accuracy, context cost, latency), proposes a tiered architecture (critical tools always loaded, everything else deferred behind a search tool), and quantifies the tradeoff: ~55K tokens → ~8.7K tokens with deferred loading, an 85%+ reduction. Staff-level candidates also address prompt-cache preservation: deferred tools don't invalidate cache entries because they expand inline in the conversation body, not the system-prompt prefix.
