GenAI & Agents · Advanced

LLM & Agent Evaluation: Trajectories, RAGAS, LLM-as-Judge, and Hallucination Mitigation

Evaluating agents and LLM systems requires evaluating trajectories, not just outputs. Learn trajectory vs outcome evaluation, tool call accuracy, the GAIA benchmark gap (humans: 92%, best agents: ~50%), LLM-as-judge biases, RAGAS metrics for grounded generation, layered hallucination defense, non-deterministic CI gates, and how production teams run shadow evaluation and regression suites.

Agent Evaluation · LLM Evaluation · GAIA Benchmark · LLM-as-Judge · Trajectory Evaluation · Tool Call Accuracy · WebArena · Hallucination Mitigation · RAGAS · SWE-bench · Agentic Systems · Benchmark Contamination · Constitutional AI · Non-Deterministic CI · Production ML

Why Agent Evaluation Is Fundamentally Harder Than Model Evaluation

Evaluating a standard LLM is relatively simple: give it an input, score its output against a reference answer or rubric. Accuracy on a benchmark is a single number.

Evaluating an agent is fundamentally different because agents produce trajectories — sequences of tool calls, reasoning steps, observations, and decisions — not single outputs. This creates evaluation problems that have no equivalent in standard model evaluation.

Correct answer, wrong path: An agent that solves a math problem by hallucinating an intermediate step and happening to get the right final answer will score 100% on outcome evaluation. But its reasoning is brittle — change the numbers slightly and it fails. A trajectory evaluator would catch that the intermediate step was wrong, flagging this agent as unreliable despite its apparent success.

Wrong answer, correct reasoning: An agent might follow exactly the right approach but fail because a retrieval tool returned outdated data. Outcome evaluation marks this as a failure; trajectory evaluation reveals the failure was in the environment (tool reliability), not the agent's reasoning. This distinction matters enormously for diagnosis and improvement.

The evaluation environment problem (the insight most candidates miss): Agents look better or worse depending on tool reliability. A flaky web search API that returns timeouts on 20% of calls will make the agent look unintelligent — it retries, gets inconsistent results, and eventually gives a confused answer. The failure is infrastructure, not reasoning. Production evaluation must measure agent quality holding environment quality constant.

IMPORTANT

What Interviewers Are Testing on Agent Evaluation

The interviewer wants to know if you can articulate (1) the trajectory vs outcome distinction, (2) how you'd set up a practical evaluation pipeline for a non-trivial agent, and (3) what you'd do when LLM-as-judge gives inconsistent results. Most candidates talk only about outcome metrics. The candidates who get offers discuss trajectory logging, step-level evaluation, and the contamination risks in published benchmarks.

Trajectory Evaluation vs Outcome Evaluation

Outcome evaluation asks a binary question: did the agent complete the task? Score = 1 if the final state matches the goal, 0 otherwise. This is easy to implement and aligns with user value (users care about whether the task got done), but it's a coarse signal that misses reasoning quality.

Trajectory evaluation evaluates each step in the agent's reasoning chain:

  • Did the agent identify the right sub-tasks to solve?
  • Did it call the right tools with the right parameters at each step?
  • Did it correctly interpret tool outputs?
  • Did it make a coherent next decision given the observation?
  • Did it terminate at the right point (not too early, not too late)?

Trajectory evaluation requires a reference trajectory or a rubric that specifies what the optimal trajectory looks like. For complex open-ended tasks, generating reference trajectories is expensive — often requiring senior engineers to manually trace optimal paths. This is why trajectory evaluation is common in research (WebArena, GAIA have reference trajectories) but underused in production.

The hybrid approach used in practice: Run outcome evaluation on all queries. For queries where the outcome fails, run trajectory evaluation to diagnose why it failed (wrong tool? wrong parameter? ignored observation? premature termination?). This targets the expensive trajectory evaluation where it provides the most signal.
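
A minimal sketch of this hybrid loop, assuming a hypothetical `run_agent` that returns both a final answer and a list of tool-call steps; the outcome check and the step-level diagnosis rules are illustrative stand-ins for whatever your task actually requires:

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str          # tool the agent called at this step
    args: dict         # arguments it passed
    observation: str   # what the tool returned

@dataclass
class Trajectory:
    answer: str
    steps: list        # list[Step]

def outcome_passes(task: dict, answer: str) -> bool:
    # Cheap binary check, e.g. exact match, "did the file get created", "do tests pass".
    return answer.strip() == task["expected"].strip()

def diagnose_trajectory(task: dict, steps: list) -> list:
    # Expensive step-level diagnosis, run only on failures.
    issues = []
    called_tools = [s.tool for s in steps]
    for tool in task.get("expected_tools", []):
        if tool not in called_tools:
            issues.append(f"missing tool call: {tool}")
    if steps and steps[-1].tool != "finish":
        issues.append("did not terminate explicitly")
    return issues

def evaluate(tasks: list, run_agent) -> tuple:
    results = []
    for task in tasks:
        traj = run_agent(task["query"])  # hypothetical agent runner
        passed = outcome_passes(task, traj.answer)
        diagnosis = [] if passed else diagnose_trajectory(task, traj.steps)
        results.append({"task": task["id"], "passed": passed, "diagnosis": diagnosis})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results
```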

Tool Call Accuracy: Precision and Recall on Tool Selection

Tool call evaluation is the most granular form of trajectory evaluation. For each step in the agent's execution, you measure:

Tool selection precision: Of the tool calls the agent made, what fraction were the correct tool for the situation? A coding agent that calls web_search instead of run_tests to verify its code has low tool selection precision.

Tool selection recall: Of the tool calls the agent should have made, what fraction did it actually make? An agent that solves a problem without calling retrieve_context when context was clearly needed has low tool selection recall.

Parameter correctness: Even when the right tool is selected, were the parameters correct? vector_search(query="Python syntax") when the correct query is vector_search(query="Python asyncio event loop") selects the right tool but with a poorly specified parameter. This is measured by semantic similarity between the agent's parameter and the optimal parameter.

WebArena (Zhou et al., 2023) is the canonical benchmark that measures these metrics on web navigation tasks. Agents are scored on task completion rate, but the paper's analysis breaks down failures by: wrong element clicked (tool selection error), wrong text typed (parameter error), correct action but wrong timing (sequencing error).

Production implementation: Log all tool calls with their arguments. Build an automated evaluator that checks each log against a set of rules: "for query type X, tool Y should be called before tool Z", "parameter of type P should match schema S". Flag deviations for human review.
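
One way to compute the tool-selection metrics from those logs, assuming each reference task lists the tool calls an optimal trajectory would make (the log and reference formats here are illustrative, not a specific library's API):

```python
import numpy as np

def tool_selection_metrics(logged_calls: list, reference_calls: list) -> tuple:
    """Precision/recall over tool names, ignoring order.

    logged_calls / reference_calls: lists of (tool_name, params) tuples.
    """
    logged_tools = [t for t, _ in logged_calls]
    reference_tools = [t for t, _ in reference_calls]

    # Precision: fraction of the agent's calls that match a needed tool.
    ref_remaining = list(reference_tools)
    correct = 0
    for tool in logged_tools:
        if tool in ref_remaining:
            correct += 1
            ref_remaining.remove(tool)
    precision = correct / len(logged_tools) if logged_tools else 0.0

    # Recall: fraction of needed tools the agent actually called.
    log_remaining = list(logged_tools)
    covered = 0
    for tool in reference_tools:
        if tool in log_remaining:
            covered += 1
            log_remaining.remove(tool)
    recall = covered / len(reference_tools) if reference_tools else 1.0
    return precision, recall

def parameter_similarity(agent_query: str, reference_query: str, embed) -> float:
    """Parameter correctness via cosine similarity; `embed` is any sentence-embedding
    function you already have (e.g. your retrieval encoder), assumed here, not prescribed."""
    a, b = np.asarray(embed(agent_query)), np.asarray(embed(reference_query))
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```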

Hallucination Types in Agents: Impact and Detection

| Hallucination Type | Description | Example | Impact | Detection Method |
|---|---|---|---|---|
| Factual hallucination | Claims non-existent facts | Citing a paper that doesn't exist | Medium | Retrieval verification: check if citation exists in knowledge base |
| Tool hallucination | Claims to have called a tool it didn't | Says "I searched and found X" without searching | High | Compare claimed tool calls to actual tool call log |
| Action hallucination | Claims to have taken an action it didn't | Says "I sent the email" when no email was sent | Critical | Side-effect verification: check email queue, DB writes, API logs |
| Parameter hallucination | Called real tool with fabricated parameters | SQL query with non-existent column names | High | Tool execution error logs + parameter schema validation |
| Capability hallucination | Claims capabilities it doesn't have | Says "I can see your screen" without vision tools | Medium | Tool registry validation: check claimed capabilities vs registered tools |
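
For the two highest-impact rows (tool and action hallucination), the check is mechanical once trajectories are logged: diff the actions the agent claims in its final message against the execution log and the real side-effect state. A sketch, where the claimed-action extraction and side-effect checkers are placeholders for whatever your stack provides:

```python
def verify_claimed_actions(claimed_actions: list, tool_log: list, side_effect_checks: dict) -> list:
    """Flag actions the agent claims to have taken but the logs cannot confirm.

    claimed_actions: action names extracted from the agent's final message
                     (by an LLM extractor or regex, not shown here).
    tool_log: entries like {"tool": "send_email", "status": "success"} from your trace store.
    side_effect_checks: map from action name to a zero-arg callable that checks
                        ground truth (email queue, DB row, file on disk). Illustrative.
    """
    executed = {e["tool"] for e in tool_log if e.get("status") == "success"}
    violations = []
    for action in claimed_actions:
        if action not in executed:
            violations.append({"action": action, "reason": "no matching tool call in log"})
        elif action in side_effect_checks and not side_effect_checks[action]():
            violations.append({"action": action, "reason": "tool call logged but side effect absent"})
    return violations

# Example: the agent claimed it sent an email, but only a web search appears in the log.
violations = verify_claimed_actions(
    claimed_actions=["send_email"],
    tool_log=[{"tool": "web_search", "status": "success"}],
    side_effect_checks={"send_email": lambda: False},  # e.g. check the outbound mail queue
)
# -> [{"action": "send_email", "reason": "no matching tool call in log"}]
```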

GAIA: The Benchmark That Reveals the Real Gap

GAIA (General AI Assistants benchmark, Mialon et al., 2023, Meta AI + HuggingFace) is the most revealing benchmark for production agent capability. It contains 466 real-world tasks requiring multi-step reasoning and tool use: web search, code execution, file reading, and mathematical reasoning. Tasks range from simple one-step queries to complex multi-day research tasks.

The numbers that matter:

  • Human baseline: 92% success rate
  • GPT-4 with tools (2023): ~30%
  • Best agents as of 2024: ~50–60% (AutoGPT-style, GPT-4o with code interpreter)
  • Human vs best agent gap: ~32–42 percentage points

This enormous gap reveals that current agents are not approaching human performance on general multi-step tasks — despite performing near-human on narrow benchmarks like HumanEval (code generation). The GAIA failures cluster around: finding obscure information requiring multiple search reformulations, composing tools in non-obvious sequences, and handling ambiguous task specifications that humans resolve with common sense.

Why GAIA is more honest than HumanEval or GSM8K: Those benchmarks test the LLM's parametric knowledge. GAIA tests the agent's ability to use tools to find information it doesn't know. It's the difference between a test where you can use the internet vs. a closed-book exam. Real-world agent deployments face GAIA-style tasks, not HumanEval-style tasks.

[Diagram: Agent Evaluation Pipeline]

LLM-as-Judge: Capabilities, Biases, and When It Fails

LLM-as-judge (Zheng et al., 2023) uses a powerful LLM (typically GPT-4o or Claude 3.5 Sonnet) to evaluate another LLM's outputs. It's cheaper than human annotation (~$0.01–0.05 per evaluation vs. ~$1–5 for human) and faster (seconds vs. hours). Agreement with human ratings is ~0.85 on factual tasks and ~0.7 on creative/open-ended tasks.

Where it works well: Factual accuracy evaluation, format compliance, safety policy violations, response coherence. Any task with clear, objective criteria that can be specified in a rubric.

The four failure modes that matter in production:

Positional bias: LLMs rate the first response in a side-by-side comparison higher than the second, regardless of quality. In a study evaluating GPT-4-judge on response pairs, swapping the order changed the winner ~27% of the time. Mitigation: always evaluate in both orders; require the judge to justify its ranking; penalize positional wins without substantive explanation.

Self-preference bias (self-serving bias): GPT-4-judge rates GPT-4-generated content higher; Claude-judge rates Claude-generated content higher. The magnitude is ~5–15% on pairwise comparisons. Mitigation: use a different model family as judge than as generator. If your agent uses Claude 3.5, evaluate with GPT-4o as judge.

Verbosity bias: Longer, more detailed responses are rated higher even when shorter responses are more accurate. LLM judges are not immune to being fooled by confident, detailed-sounding responses. Mitigation: add explicit rubric criteria that penalize unnecessary verbosity; include a "conciseness" dimension with equal weight.

Score inflation over time: When LLM-as-judge is used to generate training data for RLHF or DPO, the resulting model learns to generate outputs that LLM-judge scores highly — including surface features like polite phrasing that judges reward but users don't value. Mitigation: periodically recalibrate judge scores against fresh human annotations.
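
A sketch of the positional-bias mitigation from above: judge each pair in both orders and only count a win when the two verdicts agree. `call_judge` stands in for whatever client you use, ideally pointed at a different model family than the generator to reduce self-preference bias:

```python
import json

JUDGE_PROMPT = """You are grading two responses to the same user query.
Weight factual accuracy, instruction-following, and conciseness equally.
Return JSON only: {{"winner": "A" or "B" or "tie", "reason": "..."}}

Query: {query}

Response A:
{a}

Response B:
{b}
"""

def judge_pair(query: str, resp_1: str, resp_2: str, call_judge) -> str:
    """call_judge(prompt) -> str is assumed to hit your judge model and return the raw
    completion as a JSON string (add your own retry/parsing guards in practice)."""
    verdict_ab = json.loads(call_judge(JUDGE_PROMPT.format(query=query, a=resp_1, b=resp_2)))
    verdict_ba = json.loads(call_judge(JUDGE_PROMPT.format(query=query, a=resp_2, b=resp_1)))

    # Map the swapped-order verdict back to the original labels.
    swapped = {"A": "B", "B": "A", "tie": "tie"}[verdict_ba["winner"]]

    if verdict_ab["winner"] == swapped:
        return verdict_ab["winner"]   # consistent under order swap
    return "tie"                      # positional disagreement: treat as no signal
```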

Agent Evaluation Setup Checklist

01. Define task scope and success criteria

Before any evaluation infrastructure, specify: what does 'task complete' mean in binary terms? What is the acceptable error rate? What types of failures are catastrophic vs. acceptable? For a coding agent, 'no compilation errors' and 'all tests pass' are binary outcome criteria. 'Code is readable' requires trajectory/judge evaluation. Write these criteria down as a rubric before building any evaluator.

02. Instrument trajectory logging

Every tool call, its arguments, its return value, and the LLM's subsequent reasoning must be logged in a structured format. Use LangSmith, W&B Weave, or a custom trace store. Ensure logs include: timestamp per step, model used, token counts, tool call success/failure, and the full message history at each step. You cannot do trajectory evaluation without this data.

03. Build a curated regression suite

Manually collect 50–500 representative tasks that cover: easy cases (should always pass), medium cases (pass rate ~70–80%), hard cases (pass rate ~30–50%), and known past failures. Run this suite on every agent version change. A regression alert triggers if any category's pass rate drops > 5% from the baseline. This suite is more valuable than any external benchmark because it reflects your actual use case.

04. Set up outcome evaluation first

Implement automated outcome evaluation (binary pass/fail) before investing in trajectory evaluation. Outcome eval is cheap (often just checking if a file was created, a query returned the right answer, or a test passed), runs in seconds, and covers your full volume. Use it to establish a baseline pass rate. Only invest in trajectory evaluation for the failure cases that outcome eval catches.

05. Add LLM-as-judge for quality dimensions outcome can't measure

Outcome eval can't score 'did the agent ask the right clarifying questions?' or 'was the reasoning coherent?' Add LLM-as-judge for these dimensions with a written rubric. Use a different model family than your agent model to reduce self-preference bias. Calibrate the judge's scores against 50–100 human-annotated examples before trusting them in automated pipelines.

06. Implement shadow evaluation for agent version upgrades

When deploying a new agent version, run it in shadow mode: it processes real production traffic but its responses are not shown to users. Compare its trajectory logs and outcome scores against the production agent side-by-side. Promote the new version only if it shows a statistically significant improvement (p < 0.05 on a paired test) and no regressions on the curated suite. This prevents shipping agent upgrades that improve average performance but regress on specific important failure modes.
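
For the shadow-mode comparison in step 06, a paired test on per-task outcomes is enough. The sketch below uses a paired bootstrap on the difference in pass rates (SciPy's `ttest_rel` would also work for scalar scores); the promotion rule is an assumption to adapt to your own risk tolerance:

```python
import numpy as np

def paired_bootstrap_diff(prod_scores, shadow_scores, n_boot: int = 10_000, seed: int = 0):
    """prod_scores / shadow_scores: per-task outcome scores (0/1 or floats) for the SAME
    tasks, aligned by index. Returns the mean improvement and a 95% bootstrap CI."""
    prod = np.asarray(prod_scores, dtype=float)
    shadow = np.asarray(shadow_scores, dtype=float)
    assert prod.shape == shadow.shape
    diffs = shadow - prod

    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)   # resample tasks, keep pairing intact
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return diffs.mean(), (lo, hi)

mean_gain, (lo, hi) = paired_bootstrap_diff(
    prod_scores=[1, 0, 1, 1, 0, 1],
    shadow_scores=[1, 1, 1, 1, 0, 1],
)
promote = lo > 0   # promote only if the entire 95% CI shows improvement
```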

Benchmark Contamination and Inflated Results

SWE-bench (Jimenez et al., 2023) measures agents' ability to fix real GitHub issues from popular Python repositories. It became the de facto benchmark for coding agents. But by late 2024, several labs reported inflated performance because their training data included GitHub issue discussions, commit messages, and PR descriptions that referenced the exact issues in the benchmark.

The contamination mechanism: An agent trained on data that includes "Issue #1234: Fix the IndexError in list_comprehension.py — Solved by removing the off-by-one in line 42" effectively memorizes the solution. On evaluation, it appears to "reason" its way to the answer but is actually recalling training data. This isn't fabricated — it happens naturally when training corpora include GitHub at scale.

GAIA's contamination resistance: GAIA tasks involve finding specific obscure facts (e.g., "What is the highest mountain in the country that borders both X and Y?") that require combining retrieved information, not recalling memorized answers. Contamination is harder because the task can't be solved by memorizing a fixed answer.

What this means for your evaluation design: When building internal benchmarks, source tasks from your actual production data, not from public benchmarks. Public benchmarks are subject to contamination if your training data includes the internet. Your internal regression suite (curated from real failures) is contamination-immune because it was generated after your training cutoff.

Evaluation Methods Comparison

| Method | Cost per Query | Throughput | Reliability | Applicability | Best For |
|---|---|---|---|---|---|
| Human annotation | $1–5 | 50–200/day | Highest (gold standard) | Any task type | Ground truth calibration, complex creative tasks |
| Automated outcome eval | $0.001–0.01 | Thousands/min | High (if criteria are clear) | Binary success tasks | Regression testing, A/B evaluation at scale |
| LLM-as-judge | $0.01–0.05 | Hundreds/min | Medium (~0.85 agreement) | Rubric-based quality | Quality dimensions, comparative evaluation |
| GAIA benchmark | N/A (public) | N/A (fixed set) | High (human-labeled) | General multi-step tasks | Capability benchmarking, model selection |
| WebArena | N/A (public) | N/A (fixed set) | High (env-verified) | Web navigation tasks | Browser agent evaluation |
| Shadow evaluation | $0.10–0.50 | Real traffic rate | High (real distribution) | Any production agent | Pre-release version validation |
⚠ WARNING

The Most Dangerous Agent Evaluation Mistake

Action hallucination — where an agent claims to have taken an action it didn't — is the most dangerous failure mode and the hardest to catch with outcome evaluation. An agent that says 'I've sent the report to all stakeholders' without actually calling the email API will score a passing outcome if your evaluator only checks the agent's final message, not the actual email queue. Always verify agent-claimed actions against ground truth state: check the database, the API call logs, the file system, or the email queue. Never trust the agent's self-reported action list as proof of execution. This is the evaluation gap that causes real-world production incidents.

TIP

The Answer That Shows Production Experience

When asked 'how would you evaluate this agent?', structure your answer in three layers: (1) automated outcome eval on all queries (cheap, catches regressions), (2) trajectory eval + LLM-as-judge on failures (diagnosis), (3) human annotation on 5% sample (calibration). Then mention the non-obvious insight: the evaluation environment matters as much as the agent. Control for tool reliability — if your search API has 20% error rate, the agent's apparent quality will be 20% worse than it actually is. Separating tool-caused failures from agent-caused failures is the most underrated part of agent evaluation.

RAG-Specific Evaluation: RAGAS Metrics

For agents that retrieve and ground responses (Agentic RAG, retrieval-heavy assistants), evaluation must measure retrieval quality and generation grounding separately — a correct final answer can come from broken retrieval (model fell back on parametric memory) and a wrong final answer can come from good retrieval (model ignored the context).

The RAGAS framework (Es et al., 2023) defines four production metrics:

  • Faithfulness: fraction of claims in the response that are entailed by retrieved context. Detects hallucinated facts even when retrieval was good.
  • Answer relevance: how well the response addresses the user's actual question. Detects topic drift and verbose-but-off-target answers.
  • Context precision: fraction of retrieved chunks that are actually useful for answering. Low precision means noisy retrieval polluting context.
  • Context recall: whether all needed facts are present in the retrieved context. Low recall means key documents weren't surfaced.

Run RAGAS metrics alongside outcome and trajectory evaluators. Diagnose by metric: low faithfulness + high context recall → generation problem (model ignored context); low context recall → retrieval problem (chunks missing or ranked wrong).
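
A minimal sketch using the ragas library (metric names follow its 0.1-style API; exact column names and imports vary across versions, and `evaluate` calls a judge LLM and embedding model under the hood, OpenAI by default, so credentials and config are assumed):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_rows = {
    "question":     ["How do I cancel my subscription?"],
    "answer":       ["Go to Settings > Billing and click Cancel plan."],
    "contexts":     [["To cancel, open Settings, choose Billing, then Cancel plan."]],
    "ground_truth": ["Cancel from Settings > Billing > Cancel plan."],
}

result = evaluate(
    Dataset.from_dict(eval_rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# Route low-faithfulness rows to generation debugging (model ignored context),
# low context_recall rows to retrieval debugging (chunks missing or ranked wrong).
```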

BLEU/ROUGE caveat: classical NLP metrics reward string overlap with a reference. Modern LLM outputs can be correct with very different wording, so BLEU/ROUGE under-score good answers and sometimes over-score wrong-but-similar text. Reject them as primary metrics.

Non-Deterministic CI: Statistical Gates Instead of Single-Run Pass/Fail

01. Run multiple seeds per test case

Single-run gates are brittle. Run each evaluation case 3–10 times with varied seeds (or temperature > 0) to get a sample distribution, not a point estimate.

02. Gate on confidence intervals, not means

Block release when the 95% CI of the new version's score lies fully below the baseline CI. Single-point regressions within the noise band are not actionable.

03. Layered gates

Gate 1 (deterministic): schema validity, citation format, safety policy violations. Hard block. Gate 2 (quality): RAGAS, judge scores on golden + replay sets. Statistical block. Gate 3 (stability): repeated runs, distribution checks across slices.

04. Slice by intent and risk class

Aggregate metrics hide regressions in specific intents. Track per-slice (e.g., 'refund_request', 'policy_lookup', 'multi_step_research') and gate on the worst slice, not the global mean.

05. Targeted human sampling

Random uniform human review wastes budget on easy cases. Sample 5–10% from high-risk, high-disagreement (judge ↔ outcome split), and boundary cases. Use these as calibration anchors for judge models.
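
A sketch of the gate in step 02: run each suite several times, build confidence intervals for baseline and candidate, and block only when the candidate's CI sits entirely below the baseline's. The normal-approximation CI and the seed counts are illustrative choices:

```python
import math
import statistics

def mean_ci95(scores: list) -> tuple:
    """Normal-approximation 95% CI for a small sample of per-run suite scores."""
    m = statistics.mean(scores)
    if len(scores) < 2:
        return m, m, m
    half = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return m, m - half, m + half

def gate(baseline_runs: list, candidate_runs: list) -> str:
    """baseline_runs / candidate_runs: suite-level scores, one per seed / repeated run."""
    _, base_lo, _ = mean_ci95(baseline_runs)
    _, _, cand_hi = mean_ci95(candidate_runs)
    # Block only if the candidate's upper bound is below the baseline's lower bound,
    # i.e. the regression is clearly outside the joint noise band.
    return "block" if cand_hi < base_lo else "pass"

print(gate(baseline_runs=[0.82, 0.84, 0.83, 0.85, 0.81],
           candidate_runs=[0.78, 0.77, 0.79, 0.76, 0.78]))  # -> "block"
```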

Layered Hallucination Defense (for Generation-Heavy Agents)

Hallucination is a pipeline problem, not just a model problem: weak retrieval, noisy context, permissive prompts, and missing output checks all contribute. The interviewer trap is proposing a silver bullet ("just use lower temperature" or "just add RAG"). Production systems need layered defense where each layer catches a different failure class:

  1. Grounding layer — retrieval + citations + context pruning. Pulls the right evidence and removes noise.
  2. Generation layer — policy-aware prompt + structured output (JSON schema, claim IDs). Constrains the model from free-form drift.
  3. Verification layer — citation span check, NLI/entailment, claim-to-evidence alignment. Independent of the generator.
  4. Fallback layer — if any gate fails, return "I don't know" or HITL escalation rather than a speculative answer.

Each layer is independent; failure in one is caught by the next. Constitutional AI (Bai et al., 2022) belongs in layer 2 as a policy shaper — it improves behavior under self-critique and RLAIF, but it does not replace the verification layer. Treating Constitutional AI as your full reliability layer is a senior-level mistake.
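
A sketch of the layer-3 entailment check, assuming an off-the-shelf MNLI cross-encoder from Hugging Face; the checkpoint name, its label strings, and the threshold are assumptions to verify against the specific model's config:

```python
from transformers import pipeline

# Any MNLI-style cross-encoder can serve here; this checkpoint name is an assumption.
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def claim_supported(claim: str, evidence_chunks: list, threshold: float = 0.7) -> bool:
    """A claim passes only if at least one retrieved chunk entails it."""
    for chunk in evidence_chunks:
        # premise = retrieved chunk, hypothesis = the agent's claim
        result = nli([{"text": chunk, "text_pair": claim}])[0]
        if result["label"].upper().startswith("ENTAIL") and result["score"] >= threshold:
            return True
    return False

def verify_response(claims: list, evidence_chunks: list) -> list:
    # Layer 4 decides what to do with unsupported claims: strip, refuse, or escalate.
    return [c for c in claims if not claim_supported(c, evidence_chunks)]
```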

Mitigation by Hallucination Failure Type

| Failure Type | Detection | Mitigation | When It Fails |
|---|---|---|---|
| Unsupported factual claim | Faithfulness / NLI check | Require evidence-backed statements with claim IDs | Source itself is stale or wrong |
| Fabricated citation | Citation-to-source span alignment | Quote-span verification against retrieved chunks | Retriever returns near-duplicate docs |
| Tool/action hallucination | Compare claim vs execution log | Execution receipt requirement before final response | Missing observability on tool side |
| Unsafe content drift | Policy classifier on output | Constitutional constraints + refusal policy | Ambiguous policy boundaries |
⚠ WARNING

Self-Reported Confidence Is Not a Guard

A common production failure: using the model's own 'I'm 90% confident' string as the gate. Confidence text is not calibrated reliability — it correlates weakly with correctness and can be elicited by prompt phrasing. Always bind claims to retrieved evidence or external tool truth. Use external signals (citation coverage, NLI score, side-effect verification) for routing decisions, not the model's self-report.

Tiered Gating — Balance Hallucination Reduction Against Over-Refusal

01. Verified high confidence

All checks pass (citation coverage > threshold, faithfulness > threshold, no policy hit) → return answer with sources.

02. Medium confidence

Some checks pass with warnings (e.g., partial citation coverage, mild contradiction). Return answer with explicit uncertainty marker and source list. Don't refuse — partial helpfulness beats forced refusal.

03. Low confidence or high risk

Faithfulness fails or policy hits high-severity classifier → safe fallback ("I don't know — here's what I checked") or HITL escalation.

04. Tune on the utility-safety frontier

Optimize task completion + user satisfaction jointly with policy-violation rate. A single refusal-rate target produces over-cautious systems that users abandon. Measure both, plot the frontier, and pick the operating point your product can support.
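
The tiers above reduce to a small routing function over externally computed signals (citation coverage, faithfulness, policy flags) rather than the model's self-reported confidence. The signal names and thresholds below are placeholders to tune on your own utility-safety frontier:

```python
from dataclasses import dataclass

@dataclass
class GateSignals:
    citation_coverage: float   # fraction of claims backed by a cited span
    faithfulness: float        # e.g. RAGAS faithfulness or NLI pass rate
    policy_severity: str       # "none" | "low" | "high" from your policy classifier

def route(sig: GateSignals) -> str:
    # Tier 3: hard failures go to a safe fallback or a human.
    if sig.policy_severity == "high" or sig.faithfulness < 0.5:
        return "fallback_or_hitl"
    # Tier 1: everything clean, answer with sources.
    if sig.citation_coverage >= 0.9 and sig.faithfulness >= 0.8 and sig.policy_severity == "none":
        return "answer_with_sources"
    # Tier 2: partially supported, answer but mark the uncertainty.
    return "answer_with_uncertainty_marker"

print(route(GateSignals(citation_coverage=0.7, faithfulness=0.75, policy_severity="none")))
# -> "answer_with_uncertainty_marker"
```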
