LLM & Agent Evaluation: Trajectories, RAGAS, LLM-as-Judge, and Hallucination Mitigation
Evaluating agents and LLM systems requires evaluating trajectories, not just outputs. Learn trajectory vs outcome evaluation, tool call accuracy, the GAIA benchmark gap (humans: 92%, best agents: ~50%), LLM-as-judge biases, RAGAS metrics for grounded generation, layered hallucination defense, non-deterministic CI gates, and how production teams run shadow evaluation and regression suites.
Why Agent Evaluation Is Fundamentally Harder Than Model Evaluation
Evaluating a standard LLM is relatively simple: give it an input, score its output against a reference answer or rubric. Accuracy on a benchmark is a single number.
Evaluating an agent is fundamentally different because agents produce trajectories — sequences of tool calls, reasoning steps, observations, and decisions — not single outputs. This creates evaluation problems that have no equivalent in standard model evaluation.
Correct answer, wrong path: An agent that solves a math problem by hallucinating an intermediate step and happening to get the right final answer will score 100% on outcome evaluation. But its reasoning is brittle — change the numbers slightly and it fails. A trajectory evaluator would catch that the intermediate step was wrong, flagging this agent as unreliable despite its apparent success.
Wrong answer, correct reasoning: An agent might follow exactly the right approach but fail because a retrieval tool returned outdated data. Outcome evaluation marks this as a failure; trajectory evaluation reveals the failure was in the environment (tool reliability), not the agent's reasoning. This distinction matters enormously for diagnosis and improvement.
The evaluation environment problem (the insight most candidates miss): Agents look better or worse depending on tool reliability. A flaky web search API that returns timeouts on 20% of calls will make the agent look unintelligent — it retries, gets inconsistent results, and eventually gives a confused answer. The failure is infrastructure, not reasoning. Production evaluation must measure agent quality holding environment quality constant.
What Interviewers Are Testing on Agent Evaluation
The interviewer wants to know if you can articulate (1) the trajectory vs outcome distinction, (2) how you'd set up a practical evaluation pipeline for a non-trivial agent, and (3) what you'd do when LLM-as-judge gives inconsistent results. Most candidates talk only about outcome metrics. The candidates who get offers discuss trajectory logging, step-level evaluation, and the contamination risks in published benchmarks.
Trajectory Evaluation vs Outcome Evaluation
Outcome evaluation asks a binary question: did the agent complete the task? Score = 1 if the final state matches the goal, 0 otherwise. This is easy to implement and aligns with user value (users care about whether the task got done), but it's a coarse signal that misses reasoning quality.
Trajectory evaluation evaluates each step in the agent's reasoning chain:
- Did the agent identify the right sub-tasks to solve?
- Did it call the right tools with the right parameters at each step?
- Did it correctly interpret tool outputs?
- Did it make a coherent next decision given the observation?
- Did it terminate at the right point (not too early, not too late)?
Trajectory evaluation requires a reference trajectory or a rubric that specifies what the optimal trajectory looks like. For complex open-ended tasks, generating reference trajectories is expensive — often requiring senior engineers to manually trace optimal paths. This is why trajectory evaluation is common in research (WebArena and GAIA provide reference trajectories) but underused in production.
The hybrid approach used in practice: Run outcome evaluation on all queries. For queries where the outcome fails, run trajectory evaluation to diagnose why it failed (wrong tool? wrong parameter? ignored observation? premature termination?). This targets the expensive trajectory evaluation where it provides the most signal.
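This hybrid flow fits in a few lines. The sketch below is illustrative: the step schema (`tool`, `args`), the `goal_check` predicate, and the helper names are assumptions, not any specific framework's API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalResult:
    task_id: str
    outcome_pass: bool
    diagnosis: Optional[str] = None  # populated only when the outcome fails

def diagnose_trajectory(trajectory, reference):
    """Walk the logged trajectory against a reference and report the first divergence."""
    for i, (step, ref) in enumerate(zip(trajectory, reference)):
        if step["tool"] != ref["tool"]:
            return f"wrong tool at step {i}: called {step['tool']}, expected {ref['tool']}"
        if step["args"] != ref["args"]:
            return f"wrong parameters at step {i} for tool {step['tool']}"
    if len(trajectory) < len(reference):
        return "premature termination"
    return "trajectory matched reference; failure likely in environment or final synthesis"

def hybrid_eval(task_id, goal_check, final_state, trajectory, reference):
    """Cheap outcome check on every task; expensive trajectory diagnosis only on failures."""
    if goal_check(final_state):
        return EvalResult(task_id, True)
    return EvalResult(task_id, False, diagnose_trajectory(trajectory, reference))
```

The key design choice is that `diagnose_trajectory` never runs on passing tasks, which is what keeps the hybrid approach affordable at volume.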
Tool Call Accuracy: Precision and Recall on Tool Selection
Tool call evaluation is the most granular form of trajectory evaluation. For each step in the agent's execution, you measure:
Tool selection precision: Of the tool calls the agent made, what fraction were the correct tool for the situation? A coding agent that calls web_search instead of run_tests to verify its code has low tool selection precision.
Tool selection recall: Of the tool calls the agent should have made, what fraction did it actually make? An agent that solves a problem without calling retrieve_context when context was clearly needed has low tool selection recall.
Parameter correctness: Even when the right tool is selected, were the parameters correct? vector_search(query="Python syntax") when the correct query is vector_search(query="Python asyncio event loop") selects the right tool but with a poorly specified parameter. This is measured by semantic similarity between the agent's parameter and the optimal parameter.
WebArena (Zhou et al., 2023) is the canonical benchmark that measures these metrics on web navigation tasks. Agents are scored on task completion rate, but the paper's analysis breaks down failures by: wrong element clicked (tool selection error), wrong text typed (parameter error), correct action but wrong timing (sequencing error).
Production implementation: Log all tool calls with their arguments. Build an automated evaluator that checks each log against a set of rules: "for query type X, tool Y should be called before tool Z", "parameter of type P should match schema S". Flag deviations for human review.
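A minimal sketch of these checks over a structured tool-call log: the metric treats each trajectory as a multiset of tool names, and the ordering rule encodes "tool Y should be called before tool Z". Tool names and the log shape are hypothetical.

```python
from collections import Counter

def tool_selection_metrics(actual_calls, expected_calls):
    """Precision/recall over tool names, treating each trajectory as a multiset of calls."""
    actual, expected = Counter(actual_calls), Counter(expected_calls)
    overlap = sum((actual & expected).values())  # multiset intersection
    precision = overlap / sum(actual.values()) if actual_calls else 0.0
    recall = overlap / sum(expected.values()) if expected_calls else 0.0
    return precision, recall

def ordering_rule_ok(calls, must_come_first, before_tool):
    """Rule check: `must_come_first` has to appear before the first use of `before_tool`."""
    if before_tool not in calls:
        return True  # rule is vacuous if the downstream tool was never called
    return must_come_first in calls[:calls.index(before_tool)]
```

For example, a coding agent that called `web_search` where `run_tests` was expected scores 0.5 precision on that trajectory; any deviation flagged by an ordering rule goes to human review.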
Hallucination Types in Agents: Impact and Detection
| Hallucination Type | Description | Example | Impact | Detection Method |
|---|---|---|---|---|
| Factual hallucination | Claims non-existent facts | Citing a paper that doesn't exist | Medium | Retrieval verification: check if citation exists in knowledge base |
| Tool hallucination | Claims to have called a tool it didn't | Says 'I searched and found X' without searching | High | Compare claimed tool calls to actual tool call log |
| Action hallucination | Claims to have taken an action it didn't | Says 'I sent the email' when no email was sent | Critical | Side-effect verification: check email queue, DB writes, API logs |
| Parameter hallucination | Called real tool with fabricated parameters | SQL query with non-existent column names | High | Tool execution error logs + parameter schema validation |
| Capability hallucination | Claims capabilities it doesn't have | Says 'I can see your screen' without vision tools | Medium | Tool registry validation: check claimed capabilities vs registered tools |
GAIA: The Benchmark That Reveals the Real Gap
GAIA (General AI Assistants benchmark, Mialon et al., 2023, Meta AI + HuggingFace) is the most revealing benchmark for production agent capability. It contains 466 real-world tasks requiring multi-step reasoning and tool use: web search, code execution, file reading, and mathematical reasoning. Tasks range from simple one-step queries to complex multi-step research tasks.
The numbers that matter:
- Human baseline: 92% success rate
- GPT-4 with tools (2023): ~30%
- Best agents as of 2024: ~50–60% (AutoGPT-style, GPT-4o with code interpreter)
- Human vs best agent gap: ~32–42 percentage points
This enormous gap reveals that current agents are not approaching human performance on general multi-step tasks — despite performing near-human on narrow benchmarks like HumanEval (code generation). The GAIA failures cluster around: finding obscure information requiring multiple search reformulations, composing tools in non-obvious sequences, and handling ambiguous task specifications that humans resolve with common sense.
Why GAIA is more honest than HumanEval or GSM8K: Those benchmarks test the LLM's parametric knowledge. GAIA tests the agent's ability to use tools to find information it doesn't know. It's the difference between a test where you can use the internet vs. a closed-book exam. Real-world agent deployments face GAIA-style tasks, not HumanEval-style tasks.
[Figure: Agent Evaluation Pipeline]
LLM-as-Judge: Capabilities, Biases, and When It Fails
LLM-as-judge (Zheng et al., 2023) uses a powerful LLM (typically GPT-4o or Claude 3.5 Sonnet) to evaluate another LLM's outputs. It's cheaper than human annotation (~$0.01–0.05 per evaluation vs. ~$1–5 for human) and faster (seconds vs. hours). Agreement with human ratings is ~0.85 on factual tasks and ~0.7 on creative/open-ended tasks.
Where it works well: Factual accuracy evaluation, format compliance, safety policy violations, response coherence. Any task with clear, objective criteria that can be specified in a rubric.
The four failure modes that matter in production:
Positional bias: LLMs rate the first response in a side-by-side comparison higher than the second, regardless of quality. In a study evaluating GPT-4-judge on response pairs, swapping the order changed the winner ~27% of the time. Mitigation: always evaluate in both orders; require the judge to justify its ranking; penalize positional wins without substantive explanation.
Self-preference bias (self-serving bias): GPT-4-judge rates GPT-4-generated content higher; Claude-judge rates Claude-generated content higher. The magnitude is ~5–15% on pairwise comparisons. Mitigation: use a different model family as judge than as generator. If your agent uses Claude 3.5, evaluate with GPT-4o as judge.
Verbosity bias: Longer, more detailed responses are rated higher even when shorter responses are more accurate. LLM judges are not immune to being fooled by confident, detailed-sounding responses. Mitigation: add explicit rubric criteria that penalize unnecessary verbosity; include a "conciseness" dimension with equal weight.
Score inflation over time: When LLM-as-judge is used to generate training data for RLHF or DPO, the resulting model learns to generate outputs that LLM-judge scores highly — including surface features like polite phrasing that judges reward but users don't value. Mitigation: periodically recalibrate judge scores against fresh human annotations.
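The positional-bias mitigation (judge each pair in both orders, discard positional wins) fits in a few lines. This is a sketch assuming `judge` is a callable that returns "first" or "second" for the position it prefers.

```python
def debiased_pairwise(judge, question, response_a, response_b):
    """Run the judge in both presentation orders; positional flip-flops count as ties."""
    verdict_ab = judge(question, response_a, response_b)  # A shown first
    verdict_ba = judge(question, response_b, response_a)  # B shown first
    if verdict_ab == "first" and verdict_ba == "second":
        return "A"  # A wins regardless of position
    if verdict_ab == "second" and verdict_ba == "first":
        return "B"
    return "tie"  # the 'winner' changed with position: no substantive preference
```

This doubles judge cost per comparison, which is usually acceptable given how cheap LLM judging is relative to human annotation.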
Agent Evaluation Setup Checklist
Define task scope and success criteria
Before any evaluation infrastructure, specify: what does 'task complete' mean in binary terms? What is the acceptable error rate? What types of failures are catastrophic vs. acceptable? For a coding agent, 'no compilation errors' and 'all tests pass' are binary outcome criteria. 'Code is readable' requires trajectory/judge evaluation. Write these criteria down as a rubric before building any evaluator.
Instrument trajectory logging
Every tool call, its arguments, its return value, and the LLM's subsequent reasoning must be logged in a structured format. Use LangSmith, W&B Weave, or a custom trace store. Ensure logs include: timestamp per step, model used, token counts, tool call success/failure, and the full message history at each step. You cannot do trajectory evaluation without this data.
Build a curated regression suite
Manually collect 50–500 representative tasks that cover: easy cases (should always pass), medium cases (pass rate ~70–80%), hard cases (pass rate ~30–50%), and known past failures. Run this suite on every agent version change. A regression alert triggers if any category's pass rate drops > 5% from the baseline. This suite is more valuable than any external benchmark because it reflects your actual use case.
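The per-category gate reduces to a comparison over aggregated pass rates. Category names and the 5-point threshold below follow the checklist item; the function shape is an illustrative sketch.

```python
def regression_alerts(baseline_rates, current_rates, max_drop=0.05):
    """Flag every category whose pass rate dropped more than `max_drop` from baseline."""
    return [
        {"category": cat, "baseline": base, "current": current_rates.get(cat, 0.0)}
        for cat, base in baseline_rates.items()
        if base - current_rates.get(cat, 0.0) > max_drop
    ]
```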
Set up outcome evaluation first
Implement automated outcome evaluation (binary pass/fail) before investing in trajectory evaluation. Outcome eval is cheap (often just checking if a file was created, a query returned the right answer, or a test passed), runs in seconds, and covers your full volume. Use it to establish a baseline pass rate. Only invest in trajectory evaluation for the failure cases that outcome eval catches.
Add LLM-as-judge for quality dimensions outcome can't measure
Outcome eval can't score 'did the agent ask the right clarifying questions?' or 'was the reasoning coherent?' Add LLM-as-judge for these dimensions with a written rubric. Use a different model family than your agent model to reduce self-preference bias. Calibrate the judge's scores against 50–100 human-annotated examples before trusting them in automated pipelines.
Implement shadow evaluation for agent version upgrades
When deploying a new agent version, run it in shadow mode: it processes real production traffic but its responses are not shown to users. Compare its trajectory logs and outcome scores against the production agent side-by-side. Promote the new version only if it shows a statistically significant improvement (p < 0.05 on a paired test) and no regressions on the curated suite. This prevents shipping agent upgrades that improve average performance but regress on specific important failure modes.
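The paired test needs per-query scores from both versions on the same traffic. A stdlib-only sketch of a paired permutation test (a two-sided sign-flip test on per-query differences); in practice you might reach for an off-the-shelf statistics library, but the logic is this:

```python
import random

def paired_permutation_pvalue(prod_scores, shadow_scores, n_iter=10_000, seed=0):
    """Two-sided sign-flip permutation test on per-query score differences."""
    rng = random.Random(seed)
    diffs = [s - p for p, s in zip(prod_scores, shadow_scores)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(n_iter):
        # Under the null, each per-query difference is equally likely to flip sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / len(diffs) >= observed:
            hits += 1
    return hits / n_iter
```

Promote the shadow version only when the p-value clears your threshold and the curated regression suite shows no per-category drop.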
Benchmark Contamination and Inflated Results
SWE-bench (Jimenez et al., 2023) measures agents' ability to fix real GitHub issues from popular Python repositories. It became the de facto benchmark for coding agents. But by late 2024, several labs reported inflated performance because their training data included GitHub issue discussions, commit messages, and PR descriptions that referenced the exact issues in the benchmark.
The contamination mechanism: An agent trained on data that includes "Issue #1234: Fix the IndexError in list_comprehension.py — Solved by removing the off-by-one in line 42" effectively memorizes the solution. On evaluation, it appears to "reason" its way to the answer but is actually recalling training data. This isn't fabricated — it happens naturally when training corpora include GitHub at scale.
GAIA's contamination resistance: GAIA tasks involve finding specific obscure facts (e.g., "What is the highest mountain in the country that borders both X and Y?") that require combining retrieved information, not recalling memorized answers. Contamination is harder because the task can't be solved by memorizing a fixed answer.
What this means for your evaluation design: When building internal benchmarks, source tasks from your actual production data, not from public benchmarks. Public benchmarks are subject to contamination if your training data includes the internet. Your internal regression suite (curated from real failures) is contamination-immune because it was generated after your training cutoff.
Evaluation Methods Comparison
| Method | Cost per Query | Throughput | Reliability | Applicability | Best For |
|---|---|---|---|---|---|
| Human annotation | $1–5 | 50–200/day | Highest (gold standard) | Any task type | Ground truth calibration, complex creative tasks |
| Automated outcome eval | $0.001–0.01 | Thousands/min | High (if criteria are clear) | Binary success tasks | Regression testing, A/B evaluation at scale |
| LLM-as-judge | $0.01–0.05 | Hundreds/min | Medium (~0.85 agreement) | Rubric-based quality | Quality dimensions, comparative evaluation |
| GAIA benchmark | N/A (public) | N/A (fixed set) | High (human-labeled) | General multi-step tasks | Capability benchmarking, model selection |
| WebArena | N/A (public) | N/A (fixed set) | High (env-verified) | Web navigation tasks | Browser agent evaluation |
| Shadow evaluation | $0.10–0.50 | Real traffic rate | High (real distribution) | Any production agent | Pre-release version validation |
The Most Dangerous Agent Evaluation Mistake
Action hallucination — where an agent claims to have taken an action it didn't — is the most dangerous failure mode and the hardest to catch with outcome evaluation. An agent that says 'I've sent the report to all stakeholders' without actually calling the email API will score a passing outcome if your evaluator only checks the agent's final message, not the actual email queue. Always verify agent-claimed actions against ground truth state: check the database, the API call logs, the file system, or the email queue. Never trust the agent's self-reported action list as proof of execution. This is the evaluation gap that causes real-world production incidents.
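Ground-truth verification can be as simple as diffing the agent's claimed actions against an execution log keyed on side effects. The log schema below is a hypothetical example, not a standard format.

```python
def unverified_actions(claimed_actions, execution_log):
    """Return every claimed action with no successful matching entry in the real log."""
    executed = {
        (entry["action"], entry["target"])
        for entry in execution_log
        if entry["status"] == "success"
    }
    return [a for a in claimed_actions if (a["action"], a["target"]) not in executed]
```

If the returned list is non-empty, the agent hallucinated at least one action: fail the evaluation regardless of how confident the final message sounds.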
The Answer That Shows Production Experience
When asked 'how would you evaluate this agent?', structure your answer in three layers: (1) automated outcome eval on all queries (cheap, catches regressions), (2) trajectory eval + LLM-as-judge on failures (diagnosis), (3) human annotation on 5% sample (calibration). Then mention the non-obvious insight: the evaluation environment matters as much as the agent. Control for tool reliability — if your search API has 20% error rate, the agent's apparent quality will be 20% worse than it actually is. Separating tool-caused failures from agent-caused failures is the most underrated part of agent evaluation.
RAG-Specific Evaluation: RAGAS Metrics
For agents that retrieve and ground responses (Agentic RAG, retrieval-heavy assistants), evaluation must measure retrieval quality and generation grounding separately — a correct final answer can come from broken retrieval (model fell back on parametric memory) and a wrong final answer can come from good retrieval (model ignored the context).
The RAGAS framework (Es et al., 2023) defines four production metrics:
- Faithfulness: fraction of claims in the response that are entailed by retrieved context. Detects hallucinated facts even when retrieval was good.
- Answer relevance: how well the response addresses the user's actual question. Detects topic drift and verbose-but-off-target answers.
- Context precision: fraction of retrieved chunks that are actually useful for answering. Low precision means noisy retrieval polluting context.
- Context recall: whether all needed facts are present in the retrieved context. Low recall means key documents weren't surfaced.
Run RAGAS metrics alongside outcome and trajectory evaluators. Diagnose by metric: low faithfulness + high context recall → generation problem (model ignored context); low context recall → retrieval problem (chunks missing or ranked wrong).
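That diagnosis rule fits in one routing function. The thresholds below are illustrative placeholders, not RAGAS defaults.

```python
def diagnose_rag(faithfulness, context_recall, context_precision,
                 recall_min=0.8, faith_min=0.8, precision_min=0.5):
    """Route a failing RAG query to the component most likely at fault."""
    if context_recall < recall_min:
        return "retrieval: needed facts never reached the context"
    if faithfulness < faith_min:
        return "generation: model ignored or contradicted good context"
    if context_precision < precision_min:
        return "retrieval: noisy context (low precision) may be distracting the model"
    return "pass"
```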
BLEU/ROUGE caveat: classical NLP metrics reward string overlap with a reference. Modern LLM outputs can be correct with very different wording, so BLEU/ROUGE under-score good answers and sometimes over-score wrong-but-similar text. Reject them as primary metrics.
Non-Deterministic CI: Statistical Gates Instead of Single-Run Pass/Fail
Run multiple seeds per test case
Single-run gates are brittle. Run each evaluation case 3–10 times with varied seeds (or temperature > 0) to get a sample distribution, not a point estimate.
Gate on confidence intervals, not means
Block release when the 95% CI of the new version's score lies fully below the baseline CI. Single-point regressions within noise band are not actionable.
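A stdlib-only percentile bootstrap and the gate built on it, as a sketch under the simple rule "block only when the intervals are disjoint in the wrong direction":

```python
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

def block_release(new_scores, baseline_scores):
    """Block only when the new version's CI lies fully below the baseline's CI."""
    _, new_hi = bootstrap_ci(new_scores)
    base_lo, _ = bootstrap_ci(baseline_scores)
    return new_hi < base_lo
```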
Layered gates
Gate 1 (deterministic): schema validity, citation format, safety policy violations. Hard block. Gate 2 (quality): RAGAS, judge scores on golden + replay sets. Statistical block. Gate 3 (stability): repeated runs, distribution checks across slices.
Slice by intent and risk class
Aggregate metrics hide regressions in specific intents. Track per-slice (e.g., 'refund_request', 'policy_lookup', 'multi_step_research') and gate on the worst slice, not the global mean.
Targeted human sampling
Random uniform human review wastes budget on easy cases. Sample 5–10% from high-risk, high-disagreement (judge ↔ outcome split), and boundary cases. Use these as calibration anchors for judge models.
Layered Hallucination Defense (for Generation-Heavy Agents)
Hallucination is a pipeline problem, not just a model problem: weak retrieval, noisy context, permissive prompts, and missing output checks all contribute. The interviewer trap is proposing a silver bullet ("just use lower temperature" or "just add RAG"). Production systems need layered defense where each layer catches a different failure class:
- Grounding layer — retrieval + citations + context pruning. Pulls the right evidence and removes noise.
- Generation layer — policy-aware prompt + structured output (JSON schema, claim IDs). Constrains the model from free-form drift.
- Verification layer — citation span check, NLI/entailment, claim-to-evidence alignment. Independent of the generator.
- Fallback layer — if any gate fails, return "I don't know" or HITL escalation rather than a speculative answer.
Each layer is independent; failure in one is caught by the next. Constitutional AI (Bai et al., 2022) belongs in layer 2 as a policy shaper — it improves behavior under self-critique and RLAIF, but it does not replace the verification layer. Treating Constitutional AI as your full reliability layer is a senior-level mistake.
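The four layers compose into a simple pipeline where any gate can veto. In this sketch, `retrieve`, `generate`, and `verify` are stand-ins for your actual components, not a real library's API.

```python
def layered_answer(query, retrieve, generate, verify):
    """Grounding -> generation -> verification, with a safe fallback if any gate fails."""
    context = retrieve(query)                 # layer 1: pull and prune evidence
    if not context:
        return {"answer": "I don't know.", "gate": "grounding_failed"}
    draft = generate(query, context)          # layer 2: constrained generation
    if not verify(draft, context):            # layer 3: generator-independent check
        return {"answer": "I don't know.", "gate": "verification_failed"}
    return {"answer": draft, "gate": "verified"}  # layer 4 (fallback) never triggered
```

The important property is that `verify` takes only the draft and the evidence, never the generator's internal state, so a generation-layer failure cannot corrupt the check.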
Mitigation by Hallucination Failure Type
| Failure Type | Detection | Mitigation | When It Fails |
|---|---|---|---|
| Unsupported factual claim | Faithfulness / NLI check | Require evidence-backed statements with claim IDs | Source itself is stale or wrong |
| Fabricated citation | Citation-to-source span alignment | Quote-span verification against retrieved chunks | Retriever returns near-duplicate docs |
| Tool/action hallucination | Compare claim vs execution log | Execution receipt requirement before final response | Missing observability on tool side |
| Unsafe content drift | Policy classifier on output | Constitutional constraints + refusal policy | Ambiguous policy boundaries |
Self-Reported Confidence Is Not a Guard
A common production failure: using the model's own 'I'm 90% confident' string as the gate. Confidence text is not calibrated reliability — it correlates weakly with correctness and can be elicited by prompt phrasing. Always bind claims to retrieved evidence or external tool truth. Use external signals (citation coverage, NLI score, side-effect verification) for routing decisions, not the model's self-report.
Tiered Gating — Balance Hallucination Reduction Against Over-Refusal
Verified high confidence
All checks pass (citation coverage > threshold, faithfulness > threshold, no policy hit) → return answer with sources.
Medium confidence
Some checks pass with warnings (e.g., partial citation coverage, mild contradiction). Return answer with explicit uncertainty marker and source list. Don't refuse — partial helpfulness beats forced refusal.
Low confidence or high risk
Faithfulness fails or policy hits high-severity classifier → safe fallback ("I don't know — here's what I checked") or HITL escalation.
Tune on the utility-safety frontier
Optimize task completion + user satisfaction jointly with policy-violation rate. A single refusal-rate target produces over-cautious systems that users abandon. Measure both, plot the frontier, and pick the operating point your product can support.
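The three tiers above reduce to a small router. Every threshold here is a placeholder you would tune on your own utility-safety frontier.

```python
def route_response(answer, citation_coverage, faithfulness, policy_severity,
                   cov_high=0.9, faith_high=0.85, faith_floor=0.5):
    """Map check results to one of the three tiers: verified, uncertain, or fallback."""
    if policy_severity == "high" or faithfulness < faith_floor:
        return ("fallback", "I don't know — here's what I checked.")
    if (citation_coverage >= cov_high and faithfulness >= faith_high
            and policy_severity == "none"):
        return ("verified", answer)
    # partial evidence: answer anyway, but flag the uncertainty to the user
    return ("uncertain", answer)
```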