AI Agents & Agentic Systems Framework
Comprehensive guide to building production agentic AI systems — from ReAct patterns and tool design to multi-agent orchestration, memory, and evaluation. The fastest-growing area in AI engineering.
What is an AI Agent?
An AI agent is a system in which an LLM takes actions (tool calls) based on observations, iterating until a task is complete. Unlike a simple LLM call, agents can:

1. Use external tools (search, code execution, APIs)
2. Maintain state across multiple steps
3. Make decisions dynamically based on intermediate results
4. Complete open-ended, multi-step tasks autonomously

Key distinction: Chains (fixed sequences) vs Agents (dynamic decision-making). An agent decides WHICH tool to call next based on the current state.
The ReAct Pattern (Reason + Act)
Thought
The LLM reasons about the current state and what to do next. This is an internal monologue, not shown to the user.
Action
LLM selects a tool and provides arguments as structured JSON.
Observation
Tool executes and returns results. Results are added to the context.
Repeat
LLM sees the observation and decides next step. Continues until task is complete (no more tool calls needed).
ReAct Agent Loop — Reason, Act, Observe, Repeat
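The four steps above can be sketched as a minimal loop. This is an illustrative skeleton, not tied to any SDK: `call_llm` and `run_tool` are hypothetical stand-ins for your model client and tool-execution layer, and the reply format (a dict with `thought`, `action`, `args`, `answer`) is an assumption.

```python
import json

MAX_STEPS = 10  # hard guard against runaway loops

def react_loop(task, call_llm, run_tool):
    """Minimal ReAct loop: Thought -> Action -> Observation -> repeat.
    `call_llm` and `run_tool` are hypothetical stand-ins."""
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        reply = call_llm(messages)  # returns thought plus an optional action
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        if reply.get("action") is None:   # no tool call -> task is complete
            return reply["answer"]
        # Act: execute the chosen tool, then feed the observation back in
        observation = run_tool(reply["action"], reply.get("args", {}))
        messages.append({"role": "tool", "content": json.dumps(observation)})
    return "aborted: max steps reached"
```

Note the two termination paths: the model signals completion by emitting no action, and `MAX_STEPS` bounds the loop regardless of what the model does.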
Tool Design Principles
Tools are the actuators of an agent — the quality of tool design determines reliability. Each tool needs: (1) a clear, unambiguous name, (2) a precise description of when and how to use it, (3) minimal typed parameters, (4) predictable structured output. Common tool types: read_file, write_file, run_shell, web_search, database_query, send_email, call_api, run_tests, code_execution. Critical: make tools idempotent when possible, so that retrying a tool call is safe. Avoid side effects in read tools. Scope write tools to prevent catastrophic actions.
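A tool definition that follows the four principles might look like the sketch below. The schema format mirrors common function-calling APIs (JSON Schema parameters); the tool name and fields are illustrative, not tied to a specific SDK.

```python
# Illustrative tool spec: clear name, precise description of when/how to
# use it, minimal typed parameters, and a schema that rejects extras.
READ_FILE_TOOL = {
    "name": "read_file",
    "description": (
        "Read a UTF-8 text file inside the project root. "
        "Use list_files first if you are unsure of the path. "
        "Read-only: never modifies the filesystem."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Path relative to the project root",
            },
            "max_bytes": {"type": "integer", "default": 65536},
        },
        "required": ["path"],
        # Reject hallucinated parameters outright instead of ignoring them
        "additionalProperties": False,
    },
}
```

Setting `additionalProperties: false` turns invented arguments into a validation error the agent can see and correct, rather than a silently ignored field.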
Tool Error Contract — Recoverable vs Terminal
Most agent failures are not reasoning failures — they are tool contract failures: ambiguous names, weak schemas, and useless errors. A production tool must return a structured error contract that lets the agent reason about whether to retry, repair, or escalate.
Required fields: error_code (machine-readable), error_message (human-readable cause), recoverable (boolean — can the agent fix this and retry?), and retry_after_sec (for transient failures). "Error: file not found" leaves an agent looping; "FILE_NOT_FOUND, recoverable=true, hint: use list_files() to discover paths" gives the agent a path forward.
Minimal Tool Error Contract
{
"error_code": "RATE_LIMITED",
"error_message": "Too many requests for this tenant",
"recoverable": true,
"retry_after_sec": 2
}
Retry Classification — Three Error Classes
Transient (retry with backoff)
Timeout, rate-limit, network error → bounded exponential backoff with jitter. Cap at 3 retries; longer cap risks runaway cost.
Correctable (let the model repair args)
Schema mismatch, missing required field, invalid argument → return validation error and let the agent try once or twice with corrected arguments.
Terminal (no retry, escalate)
Permission denied, unknown tool, policy blocked → no retry; force alternate plan, return diagnostic, or escalate to human. Retrying terminal errors wastes cost and can amplify side effects.
Loop detection
Hash (tool, normalized_args). If the same failing call repeats 3+ times, abort with diagnostic. Same-args repetition is the signature of a stuck agent.
Idempotency for side-effects
Every mutating call carries a client-generated operation ID. The backend deduplicates repeated submissions, so retries cannot duplicate emails, charges, or deletes.
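The three error classes, backoff policy, and loop detector described above can be sketched in a few functions. The error-code names follow the contract examples earlier in this section; the exact sets are assumptions you would adapt to your own tools.

```python
import hashlib
import json
import random

# Error classes per the taxonomy above; the specific codes are illustrative.
TRANSIENT = {"RATE_LIMITED", "TIMEOUT", "NETWORK_ERROR"}
CORRECTABLE = {"SCHEMA_MISMATCH", "MISSING_FIELD", "INVALID_ARGUMENT"}
# Everything else (PERMISSION_DENIED, UNKNOWN_TOOL, POLICY_BLOCKED, ...)
# is treated as terminal: no retry, force an alternate plan or escalate.

def classify(error_code):
    if error_code in TRANSIENT:
        return "transient"
    if error_code in CORRECTABLE:
        return "correctable"
    return "terminal"

def backoff_delay(attempt, base=0.5, cap=8.0):
    """Bounded exponential backoff with full jitter, for transient errors."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_signature(tool, args):
    """Hash (tool, normalized_args) for loop detection: if the same
    signature fails 3+ times, abort with a diagnostic."""
    normalized = json.dumps(args, sort_keys=True)
    return hashlib.sha256(f"{tool}:{normalized}".encode()).hexdigest()
```

Key-sorting the arguments before hashing means `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` produce the same signature, so trivially reordered retries are still caught.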
Four Non-Negotiables for Production Agents
Every production agent needs four control mechanisms, not just one or two:

1. Termination semantics — explicit done action plus hard max-steps guard
2. Tool contract enforcement — strict schema validation before execution, machine-readable errors after
3. State discipline — scoped memory boundary (working vs persisted) with deterministic summarization
4. Observability + policy controls — per-step tracing, token/cost accounting, permission boundaries, HITL approval for destructive actions

Teams that skip any one of these ship fragile demos, not production systems.
Agent Patterns Comparison
| Pattern | Description | When to Use | Drawback |
|---|---|---|---|
| Sequential Chain | Fixed pipeline A→B→C | Dependent steps with known flow | No parallelism |
| Parallel Fan-Out | Multiple agents run concurrently | Independent subtasks | Coordination overhead |
| Supervisor/Worker | Orchestrator delegates to specialists | Complex tasks needing expertise routing | More complex debugging |
| Dynamic Graph (LangGraph) | Conditional routing based on state | Complex workflows with loops | Hardest to reason about |
| Critic/Refinement | Generator + evaluator loop | Quality-sensitive outputs | Risk of infinite loops |
Multi-Agent Architecture — Supervisor + Specialist Workers
When to Use Multi-Agent
Use multi-agent only when: (1) Task is too large for one context window, (2) Subtasks genuinely benefit from specialization, (3) Parallel execution provides significant wall-clock speedup. Don't add agents just because you can. A single well-prompted agent with good tools is often simpler, cheaper, and more reliable than a multi-agent system.
The Four Types of Agent Memory
Working Memory
Current conversation context window. Fast, but bounded by the model's context limit (on the order of ~128K tokens for many models). Gone after the session ends.
Episodic Memory
Past conversations stored in vector DB. Retrieved by semantic similarity + recency. Persists weeks/months.
Semantic Memory
Extracted facts and preferences about users/context. Structured key-value store. Persists indefinitely.
Procedural Memory
Learned patterns about HOW to respond. Stored as prompt templates or fine-tuned adapters. Most durable.
Context Window Management
The core engineering challenge in agentic systems: codebases, documents, and histories are far larger than any context window. You must be surgical. Hierarchy:

1. Always-in-context: system prompt, repo map, user facts.
2. Retrieved-on-query: semantically relevant files/chunks (~20K tokens).
3. On-demand: exact file contents when tool calls read them.

Avoid: putting everything in context, never summarizing, letting context grow unbounded. These lead to "lost in the middle" failures where the LLM ignores critical information because it's buried in a huge context.
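The tiered hierarchy above can be sketched as a budgeted context assembler. The token counter here is a crude chars/4 estimate standing in for a real tokenizer, and the budget numbers are illustrative defaults.

```python
def assemble_context(system_prompt, repo_map, user_facts, retrieved_chunks,
                     budget_tokens=128_000, retrieval_budget=20_000,
                     count_tokens=lambda s: len(s) // 4):
    """Sketch of tiered context assembly. `count_tokens` is a crude
    chars/4 estimate; swap in your model's tokenizer."""
    # Tier 1: always in context.
    always = [system_prompt, repo_map, user_facts]
    context = list(always)
    used = sum(count_tokens(p) for p in always)
    # Tier 2: retrieved chunks, ranked best-first, capped by both budgets.
    retrieved_used = 0
    for chunk in retrieved_chunks:
        t = count_tokens(chunk)
        if retrieved_used + t > retrieval_budget or used + t > budget_tokens:
            break  # stop before overflowing either budget
        context.append(chunk)
        used += t
        retrieved_used += t
    # Tier 3 (exact file contents) enters later via tool-call observations.
    return context
```

Because chunks are admitted best-first until the budget is hit, the most relevant material lands near the front of the context rather than being buried mid-window.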
Agent Failure Modes & Fixes
| Failure Mode | Symptom | Root Cause | Fix |
|---|---|---|---|
| Lost Context | Re-reads same files | Working memory overflow | Structured scratchpad + summarization |
| Tool Hallucination | Wrong tool arguments | LLM invents parameters | Strict JSON schema validation with retries |
| Infinite Loop | Same action repeatedly | No progress detection | Track attempted approaches; add loop detector |
| Over-confidence | Doesn't ask for clarification | Missing uncertainty modeling | Add "confidence" to tool outputs |
| Scope creep | Modifies unrelated files | Weak scoping | Explicit allowed-path restrictions |
| Stale Context | Acts on outdated observations | Too many steps between updates | Re-read key resources periodically |
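The "Tool Hallucination" fix in the table above, strict schema validation with bounded retries, can be sketched as follows. The hand-rolled validator keeps the example dependency-free (a library like `jsonschema` would do this properly), and `run_tool` and `ask_model_to_fix` are hypothetical stand-ins.

```python
def validate_args(schema, args):
    """Minimal required-field and unknown-field checks against a
    JSON-Schema-like dict. Returns a list of machine-readable errors."""
    errors = []
    for field in schema.get("required", []):
        if field not in args:
            errors.append(f"MISSING_FIELD: {field}")
    allowed = set(schema.get("properties", {}))
    for field in args:
        if field not in allowed:
            errors.append(f"INVALID_ARGUMENT: unknown field {field}")
    return errors

def execute_with_repair(tool, schema, args, run_tool, ask_model_to_fix,
                        max_repairs=2):
    """Validate before execution; on failure, let the model repair its
    arguments a bounded number of times, then give up with a structured error."""
    for _ in range(max_repairs + 1):
        errors = validate_args(schema, args)
        if not errors:
            return run_tool(tool, args)
        args = ask_model_to_fix(tool, args, errors)
    return {"error_code": "SCHEMA_MISMATCH", "recoverable": False}
```

Validation failures are returned to the model as named errors ("MISSING_FIELD: path"), which is what makes the repair loop work: the model sees exactly what to fix instead of a generic failure.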
Popular Agent Frameworks (2025-2026)
LangGraph: Best for complex workflows with conditional routing and cycles. Uses graph-based state machines. Production-ready.

CrewAI: Fast to prototype role-based multi-agent workflows. Good for supervisor/worker patterns.

AutoGen (Microsoft): Multi-agent conversation framework. Good for code-generation and iteration loops.

Custom Python + Anthropic/OpenAI SDK: Maximum control. Best for performance-critical or unique architectural requirements.
LLM Evaluation for Agents
Evaluating agents is harder than evaluating single LLM calls because: (1) tasks are open-ended, (2) multiple valid paths exist, (3) execution is non-deterministic. Key metrics: task completion rate (do tests pass?), turn efficiency (tool calls per task), tool error rate (% of tool calls that fail), quality (human or LLM-as-judge rating). Standard benchmark: SWE-bench (resolve real GitHub issues; a fix counts if the associated tests pass). State-of-the-art: ~50-60% pass rate as of 2026.
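The key metrics above can be aggregated from per-task run records with a few lines. The record shape (`completed`, `tool_calls`, `tool_errors`) is an assumption about what your tracing layer logs.

```python
def agent_metrics(runs):
    """Aggregate agent eval metrics from per-task run records.
    Each run: {"completed": bool, "tool_calls": int, "tool_errors": int}."""
    n = len(runs)
    total_calls = sum(r["tool_calls"] for r in runs)
    return {
        # Did the task succeed (e.g. the associated tests pass)?
        "task_completion_rate": sum(r["completed"] for r in runs) / n,
        # Tool calls per task: lower is better at equal completion rate.
        "turn_efficiency": total_calls / n,
        # Fraction of tool calls that failed.
        "tool_error_rate": sum(r["tool_errors"] for r in runs) / max(total_calls, 1),
    }
```

Quality (human or LLM-as-judge rating) would be a fourth field scored separately, since it is a judgment rather than a count.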
Safety Guardrails for Production Agents
Input validation
Check for prompt injection attempts. Sanitize user inputs before they reach tool calls.
Scope limiting
Restrict tool access to allowed paths, domains, APIs. Never allow unrestricted filesystem or network access.
Destructive action confirmation
For irreversible actions (delete, send email, push code), require explicit human confirmation.
Output validation
Check agent outputs before surfacing to users. Use separate safety classifier if needed.
Audit logging
Log every tool call with inputs/outputs. Essential for debugging and compliance.
Rate limiting
Limit tool calls per session to prevent runaway agents consuming excessive resources.