AI Agents & Agentic Systems Framework
Comprehensive guide to building production agentic AI systems — from ReAct patterns and tool design to multi-agent orchestration, memory, and evaluation. The fastest-growing area in AI engineering.
What is an AI Agent?
An AI agent is a system in which an LLM takes actions (tool calls) based on observations, iterating until a task is complete. Unlike a simple LLM call, agents can:

1. Use external tools (search, code execution, APIs)
2. Maintain state across multiple steps
3. Make decisions dynamically based on intermediate results
4. Complete open-ended, multi-step tasks autonomously

Key distinction: Chains (fixed sequences) vs Agents (dynamic decision-making). An agent decides WHICH tool to call next based on the current state.
The ReAct Pattern (Reason + Act)
Thought
The LLM reasons about the current state and what to do next. This is an internal monologue, not shown to the user.
Action
LLM selects a tool and provides arguments as structured JSON.
Observation
Tool executes and returns results. Results are added to the context.
Repeat
LLM sees the observation and decides next step. Continues until task is complete (no more tool calls needed).
ReAct Agent Loop — Reason, Act, Observe, Repeat
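The four steps above can be sketched as a minimal loop. This is an illustrative skeleton, not tied to any SDK: `call_llm` and `run_tool` are hypothetical stand-ins for your model client and tool-execution layer, and the reply format (a dict with `thought`, `action`, `args`, `answer`) is an assumption.

```python
import json

MAX_STEPS = 10  # hard guard against runaway loops

def react_loop(task, call_llm, run_tool):
    """Minimal ReAct loop: Thought -> Action -> Observation -> repeat.
    `call_llm` and `run_tool` are hypothetical stand-ins."""
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        reply = call_llm(messages)  # returns thought plus an optional action
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        if reply.get("action") is None:   # no tool call -> task is complete
            return reply["answer"]
        # Act: execute the chosen tool, then feed the observation back in
        observation = run_tool(reply["action"], reply.get("args", {}))
        messages.append({"role": "tool", "content": json.dumps(observation)})
    return "aborted: max steps reached"
```

Note the two termination paths: the model signals completion by emitting no action, and `MAX_STEPS` bounds the loop regardless of what the model does.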
Tool Design Principles
Tools are the actuators of an agent — the quality of tool design determines reliability. Each tool needs: (1) a clear, unambiguous name, (2) a precise description of when and how to use it, (3) minimal typed parameters, (4) predictable structured output. Common tool types: read_file, write_file, run_shell, web_search, database_query, send_email, call_api, run_tests, code_execution. Critical: make tools idempotent when possible, so that retrying a tool call is safe. Avoid side effects in read tools. Scope write tools to prevent catastrophic actions.
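A tool definition that follows the four principles might look like the sketch below. The schema format mirrors common function-calling APIs (JSON Schema parameters); the tool name and fields are illustrative, not tied to a specific SDK.

```python
# Illustrative tool spec: clear name, precise description of when/how to
# use it, minimal typed parameters, and a schema that rejects extras.
READ_FILE_TOOL = {
    "name": "read_file",
    "description": (
        "Read a UTF-8 text file inside the project root. "
        "Use list_files first if you are unsure of the path. "
        "Read-only: never modifies the filesystem."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Path relative to the project root",
            },
            "max_bytes": {"type": "integer", "default": 65536},
        },
        "required": ["path"],
        # Reject hallucinated parameters outright instead of ignoring them
        "additionalProperties": False,
    },
}
```

Setting `additionalProperties: false` turns invented arguments into a validation error the agent can see and correct, rather than a silently ignored field.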
Tool Error Contract — Recoverable vs Terminal
Most agent failures are not reasoning failures — they are tool contract failures: ambiguous names, weak schemas, and useless errors. A production tool must return a structured error contract that lets the agent reason about whether to retry, repair, or escalate.
Required fields: error_code (machine-readable), error_message (human-readable cause), recoverable (boolean — can the agent fix this and retry?), and retry_after_sec (for transient failures). "Error: file not found" leaves an agent looping; "FILE_NOT_FOUND, recoverable=true, hint: use list_files() to discover paths" gives the agent a path forward.
Minimal Tool Error Contract
{
"error_code": "RATE_LIMITED",
"error_message": "Too many requests for this tenant",
"recoverable": true,
"retry_after_sec": 2
}
Retry Classification — Three Error Classes
Transient (retry with backoff)
Timeout, rate-limit, network error → bounded exponential backoff with jitter. Cap at 3 retries; longer cap risks runaway cost.
Correctable (let the model repair args)
Schema mismatch, missing required field, invalid argument → return validation error and let the agent try once or twice with corrected arguments.
Terminal (no retry, escalate)
Permission denied, unknown tool, policy blocked → no retry; force alternate plan, return diagnostic, or escalate to human. Retrying terminal errors wastes cost and can amplify side effects.
Loop detection
Hash (tool, normalized_args). If the same failing call repeats 3+ times, abort with diagnostic. Same-args repetition is the signature of a stuck agent.
Idempotency for side-effects
Every mutating call carries a client-generated operation ID. The backend deduplicates repeated submissions, so retries cannot duplicate emails, charges, or deletes.
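The three error classes, backoff policy, and loop detector described above can be sketched in a few functions. The error-code names follow the contract examples earlier in this section; the exact sets are assumptions you would adapt to your own tools.

```python
import hashlib
import json
import random

# Error classes per the taxonomy above; the specific codes are illustrative.
TRANSIENT = {"RATE_LIMITED", "TIMEOUT", "NETWORK_ERROR"}
CORRECTABLE = {"SCHEMA_MISMATCH", "MISSING_FIELD", "INVALID_ARGUMENT"}
# Everything else (PERMISSION_DENIED, UNKNOWN_TOOL, POLICY_BLOCKED, ...)
# is treated as terminal: no retry, force an alternate plan or escalate.

def classify(error_code):
    if error_code in TRANSIENT:
        return "transient"
    if error_code in CORRECTABLE:
        return "correctable"
    return "terminal"

def backoff_delay(attempt, base=0.5, cap=8.0):
    """Bounded exponential backoff with full jitter, for transient errors."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_signature(tool, args):
    """Hash (tool, normalized_args) for loop detection: if the same
    signature fails 3+ times, abort with a diagnostic."""
    normalized = json.dumps(args, sort_keys=True)
    return hashlib.sha256(f"{tool}:{normalized}".encode()).hexdigest()
```

Key-sorting the arguments before hashing means `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` produce the same signature, so trivially reordered retries are still caught.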
Four Non-Negotiables for Production Agents
Every production agent needs four control mechanisms, not just one or two:

1. Termination semantics — explicit done action plus hard max-steps guard
2. Tool contract enforcement — strict schema validation before execution, machine-readable errors after
3. State discipline — scoped memory boundary (working vs persisted) with deterministic summarization
4. Observability + policy controls — per-step tracing, token/cost accounting, permission boundaries, HITL approval for destructive actions

Teams that skip any one of these ship fragile demos, not production systems.
Agent Patterns Comparison
| Pattern | Description | When to Use | Drawback |
|---|---|---|---|
| Sequential Chain | Fixed pipeline A→B→C | Dependent steps with known flow | No parallelism |
| Parallel Fan-Out | Multiple agents run concurrently | Independent subtasks | Coordination overhead |
| Supervisor/Worker | Orchestrator delegates to specialists | Complex tasks needing expertise routing | More complex debugging |
| Dynamic Graph (LangGraph) | Conditional routing based on state | Complex workflows with loops | Hardest to reason about |
| Critic/Refinement | Generator + evaluator loop | Quality-sensitive outputs | Risk of infinite loops |
Multi-Agent Architecture — Supervisor + Specialist Workers
When to Use Multi-Agent
Use multi-agent only when: (1) Task is too large for one context window, (2) Subtasks genuinely benefit from specialization, (3) Parallel execution provides significant wall-clock speedup. Don't add agents just because you can. A single well-prompted agent with good tools is often simpler, cheaper, and more reliable than a multi-agent system.
The Four Types of Agent Memory
Working Memory
Current conversation context window. Fast, but bounded by the model's context limit (on the order of ~128K tokens for many models). Gone after the session ends.
Episodic Memory
Past conversations stored in vector DB. Retrieved by semantic similarity + recency. Persists weeks/months.
Semantic Memory
Extracted facts and preferences about users/context. Structured key-value store. Persists indefinitely.
Procedural Memory
Learned patterns about HOW to respond. Stored as prompt templates or fine-tuned adapters. Most durable.
Context Window Management
The core engineering challenge in agentic systems: codebases, documents, and histories are far larger than any context window. You must be surgical. Hierarchy:

1. Always-in-context: system prompt, repo map, user facts.
2. Retrieved-on-query: semantically relevant files/chunks (~20K tokens).
3. On-demand: exact file contents when tool calls read them.

Avoid: putting everything in context, never summarizing, letting context grow unbounded. These lead to "lost in the middle" failures where the LLM ignores critical information because it's buried in a huge context.
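The tiered hierarchy above can be sketched as a budgeted context assembler. The token counter here is a crude chars/4 estimate standing in for a real tokenizer, and the budget numbers are illustrative defaults.

```python
def assemble_context(system_prompt, repo_map, user_facts, retrieved_chunks,
                     budget_tokens=128_000, retrieval_budget=20_000,
                     count_tokens=lambda s: len(s) // 4):
    """Sketch of tiered context assembly. `count_tokens` is a crude
    chars/4 estimate; swap in your model's tokenizer."""
    # Tier 1: always in context.
    always = [system_prompt, repo_map, user_facts]
    context = list(always)
    used = sum(count_tokens(p) for p in always)
    # Tier 2: retrieved chunks, ranked best-first, capped by both budgets.
    retrieved_used = 0
    for chunk in retrieved_chunks:
        t = count_tokens(chunk)
        if retrieved_used + t > retrieval_budget or used + t > budget_tokens:
            break  # stop before overflowing either budget
        context.append(chunk)
        used += t
        retrieved_used += t
    # Tier 3 (exact file contents) enters later via tool-call observations.
    return context
```

Because chunks are admitted best-first until the budget is hit, the most relevant material lands near the front of the context rather than being buried mid-window.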
Agent Failure Modes & Fixes
| Failure Mode | Symptom | Root Cause | Fix |
|---|---|---|---|
| Lost Context | Re-reads same files | Working memory overflow | Structured scratchpad + summarization |
| Tool Hallucination | Wrong tool arguments | LLM invents parameters | Strict JSON schema validation with retries |
| Infinite Loop | Same action repeatedly | No progress detection | Track attempted approaches; add loop detector |
| Over-confidence | Doesn't ask for clarification | Missing uncertainty modeling | Add "confidence" to tool outputs |
| Scope creep | Modifies unrelated files | Weak scoping | Explicit allowed-path restrictions |
| Stale Context | Acts on outdated observations | Too many steps between updates | Re-read key resources periodically |
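The "Tool Hallucination" fix in the table above, strict schema validation with bounded retries, can be sketched as follows. The hand-rolled validator keeps the example dependency-free (a library like `jsonschema` would do this properly), and `run_tool` and `ask_model_to_fix` are hypothetical stand-ins.

```python
def validate_args(schema, args):
    """Minimal required-field and unknown-field checks against a
    JSON-Schema-like dict. Returns a list of machine-readable errors."""
    errors = []
    for field in schema.get("required", []):
        if field not in args:
            errors.append(f"MISSING_FIELD: {field}")
    allowed = set(schema.get("properties", {}))
    for field in args:
        if field not in allowed:
            errors.append(f"INVALID_ARGUMENT: unknown field {field}")
    return errors

def execute_with_repair(tool, schema, args, run_tool, ask_model_to_fix,
                        max_repairs=2):
    """Validate before execution; on failure, let the model repair its
    arguments a bounded number of times, then give up with a structured error."""
    for _ in range(max_repairs + 1):
        errors = validate_args(schema, args)
        if not errors:
            return run_tool(tool, args)
        args = ask_model_to_fix(tool, args, errors)
    return {"error_code": "SCHEMA_MISMATCH", "recoverable": False}
```

Validation failures are returned to the model as named errors ("MISSING_FIELD: path"), which is what makes the repair loop work: the model sees exactly what to fix instead of a generic failure.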
Popular Agent Frameworks (2025-2026)
LangGraph: Best for complex workflows with conditional routing and cycles. Uses graph-based state machines. Production-ready.

CrewAI: Fast to prototype role-based multi-agent workflows. Good for supervisor/worker patterns.

AutoGen (Microsoft): Multi-agent conversation framework. Good for code-generation and iteration loops.

Custom Python + Anthropic/OpenAI SDK: Maximum control. Best for performance-critical or unique architectural requirements.
LLM Evaluation for Agents
Evaluating agents is harder than evaluating single LLM calls because: (1) tasks are open-ended, (2) multiple valid paths exist, (3) execution is non-deterministic. Key metrics: task completion rate (do tests pass?), turn efficiency (tool calls per task), tool error rate (% of tool calls that fail), quality (human or LLM-as-judge rating). Standard benchmark: SWE-bench (resolve real GitHub issues; a fix counts if the associated tests pass). State-of-the-art: ~50-60% pass rate as of 2026.
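The key metrics above can be aggregated from per-task run records with a few lines. The record shape (`completed`, `tool_calls`, `tool_errors`) is an assumption about what your tracing layer logs.

```python
def agent_metrics(runs):
    """Aggregate agent eval metrics from per-task run records.
    Each run: {"completed": bool, "tool_calls": int, "tool_errors": int}."""
    n = len(runs)
    total_calls = sum(r["tool_calls"] for r in runs)
    return {
        # Did the task succeed (e.g. the associated tests pass)?
        "task_completion_rate": sum(r["completed"] for r in runs) / n,
        # Tool calls per task: lower is better at equal completion rate.
        "turn_efficiency": total_calls / n,
        # Fraction of tool calls that failed.
        "tool_error_rate": sum(r["tool_errors"] for r in runs) / max(total_calls, 1),
    }
```

Quality (human or LLM-as-judge rating) would be a fourth field scored separately, since it is a judgment rather than a count.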
Safety Guardrails for Production Agents
Input validation
Check for prompt injection attempts. Sanitize user inputs before they reach tool calls.
Scope limiting
Restrict tool access to allowed paths, domains, APIs. Never allow unrestricted filesystem or network access.
Destructive action confirmation
For irreversible actions (delete, send email, push code), require explicit human confirmation.
Output validation
Check agent outputs before surfacing to users. Use separate safety classifier if needed.
Audit logging
Log every tool call with inputs/outputs. Essential for debugging and compliance.
Rate limiting
Limit tool calls per session to prevent runaway agents consuming excessive resources.