Machine Learning·Intermediate

Anomaly Detection: Isolation Forest, LOF, ECOD, and Production

ML interview: unsupervised and semi-supervised anomaly detection for tabular, logs, and monitoring — Isolation Forest path length, LOF, autoencoder reconstruction, ECOD tail-scores, PyOD, contamination, concept drift, and precision@K when ground-truth labels are rare.

58 min read 13 sections 8 interview questions
Anomaly Detection · Isolation Forest · LOF · Autoencoder · ECOD · One-Class SVM · PyOD · Reconstruction Error · Unsupervised Learning · Contamination · Precision at K · Machine Learning Interview · Fraud · Monitoring · Concept Drift

Why 'Anomaly' Is Three Different Interview Problems

Interview prep often collapses anomaly detection, novelty detection, and outlier detection into one phrase. In production, they are different failure modes.

Outlier usually means: this point is far from the bulk of the training distribution. Novelty means: the model has never seen a pattern like this (often phrased as open-set recognition or low-density regions). Anomaly means: for this business object (transaction, host, user session), something is wrong relative to the expected behavior class — the norm may be multi-modal, not a single cluster.

Unsupervised anomaly training assumes either a large majority of normal points with a small contamination rate of abnormal points mixed in, or a completely clean normal training set. If your training set already contains 5% attack traffic, the unsupervised method learns that attacks are part of the bulk, and you get silent failure.

The metric trap: without true anomaly labels, you only have proxy scores. Even with a validation set, rare positives make precision/recall estimates high-variance. The operational metric is often precision@K (top-K alerts a human can review) or time-to-detect under a false-positive budget, not a single AUC from a shuffled test split.
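Precision@K is simple to compute directly from scores and labels once you have even a small labeled slice. A minimal sketch; the function name and toy data are illustrative:

```python
import numpy as np

def precision_at_k(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of true anomalies among the K highest-scoring alerts."""
    top_k = np.argsort(scores)[::-1][:k]   # indices of the K largest scores
    return float(labels[top_k].mean())

# Toy check: 3 of the top-5 scores are true positives.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3])
labels = np.array([1,   0,   1,   1,   0,   0,   1])
print(precision_at_k(scores, labels, k=5))  # → 0.6
```

In ops, K is the daily analyst review capacity, which is why this metric (and not AUC) tracks what the team actually experiences.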

IMPORTANT

What Interviewers Evaluate (DRIFT)

Define: anomaly vs outlier, supervised vs unsupervised, contamination vs clean normal training data.

Reason: path-length (Isolation Forest) vs local density (LOF) vs reconstruction (autoencoders) vs multivariate tail probability (ECOD).

Identify failure: unlabeled training data that is contaminated; scoring drift; treating anomaly scores as calibrated probabilities; evaluating with accuracy when the positive class is on the order of 0.1% (typical for severe-fraud regimes) and labels are missing.

Fix / alternative: one-class SVM for low-D geometry; VAEs or flows for generative scores; two-stage (filter + classifier); semi-supervised or weak labels; ensemble scores from PyOD.

Test / validate: stability under subsampling, alert budgets, per-segment false-positive rates, backtesting after concept drift, shadow scoring before promote.

Clarifying Questions — Before You Pick an Algorithm

01

Labels and base rate

Do we have any labeled anomalies (even a few hundred), only normal data, or a mix where anomalies may hide in 'training' rows? The answer changes everything — pure unsupervised, semi-supervised (PU learning), or straight classification with imbalance handling.

02

Operational capacity

Analysts can review K alerts per day. That fixes precision@K as the head metric, not AUC-ROC on a 99.9% negative validation set (TN dominates the ROC curve — same story as imbalanced classification).

03

Feature regime

Tabular mixed types vs logs vs images. Tree models scale on tabular; autoencoders need careful scaling; LOF and k-NN suffer from curse of dimensionality without dimensionality reduction or robust scaling.

04

Concept drift

Normal behavior drifts (product launches, seasonality, marketing campaigns). A static threshold on an anomaly score will either flood alerts or go blind. You need time-aware baselines or per-segment models.

05

Latency and cost

Batch nightly scoring vs <100ms online. Isolation Forest and lightweight AE heads fit online; large transformers over raw logs do not on CPU-only paths.
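On the concept-drift point above, one common mitigation is a time-aware threshold: alert on scores above a rolling quantile of recent history rather than a fixed cut. A minimal sketch, with function name and window settings chosen for illustration:

```python
import numpy as np

def rolling_threshold(scores: np.ndarray, window: int, q: float) -> np.ndarray:
    """Per-timestep alert threshold: the q-quantile of the trailing window.
    Early timesteps fall back to the expanding history seen so far."""
    thresholds = np.empty_like(scores, dtype=float)
    for t in range(len(scores)):
        lo = max(0, t - window + 1)
        thresholds[t] = np.quantile(scores[lo:t + 1], q)
    return thresholds

# Scores drift upward halfway through; a static threshold would flood alerts,
# while the rolling quantile tracks the new baseline.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.0, 1, 500), rng.normal(3.0, 1, 500)])
thr = rolling_threshold(scores, window=200, q=0.99)
```

Per-segment models follow the same pattern: maintain one rolling baseline per segment instead of one global threshold.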

Isolation Forest — Why Short Paths Flag Anomalies (Liu et al. 2008)

An isolation intuition: a 'normal' point in the interior of a dense region needs many random splits to be isolated. An anomaly is in a sparse region; random axis-aligned cuts isolate it in fewer steps. Isolation Forest builds an ensemble of random trees on subsamples, and uses path length from root to leaf as the anomaly score (short = more anomalous, after normalization).

Complexity: building trees is typically O(ψ n log n)-class work per tree with subsample size ψ (the paper and sklearn default ψ = 256 for scalability). The method scales to large tabular problems without O(n²) kernel matrices.
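The normalization mentioned above comes from the paper: the score is s(x) = 2^(−E[h(x)]/c(ψ)), where E[h(x)] is the average path length across trees and c(m) is the average path length of an unsuccessful BST search over m points. A small sketch of that formula (helper names are mine):

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def c(m: int) -> float:
    """Average unsuccessful-search path length in a BST of m points,
    the normalizer from Liu et al. 2008."""
    if m <= 1:
        return 0.0
    h = np.log(m - 1) + EULER_GAMMA        # harmonic number approximation
    return 2.0 * h - 2.0 * (m - 1) / m

def iforest_score(mean_path_length: float, subsample: int = 256) -> float:
    """s = 2^(-E[h(x)] / c(psi)); near 1 = anomalous, near 0.5 = ordinary."""
    return 2.0 ** (-mean_path_length / c(subsample))

print(round(iforest_score(4.0), 3))   # short path  -> score close to 1
print(round(iforest_score(12.0), 3))  # long path   -> score near or below 0.5
```

With ψ = 256, c(ψ) ≈ 10.2, so a point isolated in 4 splits on average scores well above one isolated in 12.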

Why it is not a silver bullet:

  • High-dimensional noise: in hundreds of random-looking dimensions, every point can look 'easy to isolate' without meaningful structure. Pair with feature selection, PCA, or domain aggregates.
  • Clustered anomalies: if a large botnet shares tight behavior, the forest may not assign short paths to each member (they are not isolated in feature space the same way as single outliers).
  • Contaminated training: with high contamination, dense malicious clusters look 'normal' to the model.

LOF — Global Distance vs Local Density (Breunig et al. 2000)

Local Outlier Factor (LOF) compares the local density of a point to the local densities of its k nearest neighbors. A point can be globally unremarkable by raw Euclidean distance yet still anomalous relative to its neighborhood: LOF flags points that are not far from the bulk but are locally sparse compared to their neighbors (common in multiscale data — a moderate transaction in a 'quiet' subpopulation).

Cost: O(n²) in naive form for all-pairs; practical implementations use k-d trees or approximate NN for O(n log n) per query in low-to-moderate dimension. In very high dimension, distance concentration erodes the meaning of nearest (the curse of dimensionality) — you often need UMAP/PCA, entity embeddings, or hand-built aggregates before LOF is meaningful.

When LOF beats Isolation Forest: when anomalies are relative to a local manifold (per-region credit-card behavior) rather than global extrema. When it loses: huge n, very high p without reduction, and when you need a fast fit across streaming updates without re-indexing the whole k-NN graph.
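A toy illustration of the local-density idea with scikit-learn's LocalOutlierFactor: a point a moderate distance from a tight cluster is globally unremarkable, but locally sparse relative to its neighbors. The synthetic data is illustrative:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# A tight cluster and a diffuse cluster: "normal" density differs by region.
tight = rng.normal(0.0, 0.1, size=(100, 2))
diffuse = rng.normal(5.0, 1.0, size=(100, 2))
# A point a moderate distance from the tight cluster: closer to the bulk than
# many diffuse points, yet locally sparse relative to its tight-cluster neighbors.
suspect = np.array([[0.8, 0.8]])

x = np.vstack([tight, diffuse, suspect])
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(x)                               # scores live in negative_outlier_factor_
scores = -lof.negative_outlier_factor_   # flip sign: higher = more anomalous
print(int(np.argmax(scores)) == len(x) - 1)  # the suspect tops the ranking
```

A global-distance detector would rank many diffuse-cluster points above the suspect; LOF inverts that because density is judged locally.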

One-Class SVM and Autoencoders — Two Different 'Boundary' Visions

One-Class SVM (Schölkopf et al.) maps data (often via a kernel) and finds a small-volume region in feature space that captures most of the data; points outside the boundary score as anomalies. It is powerful when a nonlinear boundary in moderate dimension is the right inductive bias. In production, scalability to millions of training points is the pain — RBF one-class SVMs do not trivially out-scale gradient-boosted trees on wide sparse tabular data; they remain an interview favorite for the geometry story.
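A minimal One-Class SVM sketch with scikit-learn's OneClassSVM; note nu, which upper-bounds the fraction of training points allowed outside the boundary. Data and parameter choices here are illustrative:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
train = rng.normal(0, 1, size=(500, 2))          # clean "normal" data
# nu upper-bounds the training-outlier fraction (and lower-bounds support vectors).
oc = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(train)

inlier = np.array([[0.1, -0.2]])
outlier = np.array([[6.0, 6.0]])
print(oc.predict(inlier), oc.predict(outlier))   # +1 = inside boundary, -1 = outside
```

The O(n²)-ish kernel training cost is exactly why this stays a geometry teaching tool rather than a default for millions of rows.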

Autoencoders (AE) / VAEs: train to reconstruct normal data. Reconstruction error (per-sample MSE, or ELBO in VAE) is the anomaly score. This works when anomalies are hard to represent in a bottleneck that learned the normal manifold (images, logs compressed to a semantic embedding). The failure mode is too-capacious networks that reconstruct everything well — then you need stricter bottlenecks, denoising objectives, or contrastive pre-training before the AE head.
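The reconstruction-error principle can be sketched without a neural net: PCA acts as a linear autoencoder, and points off the learned manifold reconstruct poorly. This is a stand-in for the AE story under that analogy, not a full autoencoder:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Normal data lives near a 2-D plane embedded in 10-D, plus small noise.
basis = rng.normal(size=(2, 10))
normal = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 10))

pca = PCA(n_components=2).fit(normal)            # the "bottleneck"

def recon_error(x: np.ndarray) -> np.ndarray:
    """Per-sample squared reconstruction error through the bottleneck."""
    x_hat = pca.inverse_transform(pca.transform(x))
    return ((x - x_hat) ** 2).sum(axis=1)

on_manifold = recon_error(normal).mean()
off_manifold = recon_error(rng.normal(size=(50, 10))).mean()  # isotropic = anomalous
print(off_manifold > 10 * on_manifold)  # off-manifold points reconstruct far worse
```

The over-capacity failure mode maps directly: with n_components=10 the "bottleneck" reconstructs everything perfectly and the score collapses, which is the linear analogue of an AE that is too big.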

ECOD — Tail Probabilities Without Hyperparameters (Li et al. 2022)

ECOD (Empirical CDF Outlier Detection) (Li, Zhao, Hu, Botta, Ionescu, Chen; IEEE TKDE 2022, arXiv:2201.00382) takes a different view: estimate the empirical CDF nonparametrically in each feature dimension, and for each point look at its left- and right-tail probabilities. Outliers are rare in the joint tails; ECOD aggregates per-dimension tail extremeness into a single score. The headline property for interviews: no hyperparameters in the default formulation, which is attractive when grid search is expensive in production. Across 30 tabular benchmark datasets, the paper reports that ECOD outranks the 11 baselines it compares against, with roughly 2% relative improvement in ROC and 5% in average precision over the second-best method in their tables. Treat those as paper-reported gains on benchmarks, not universal laws.
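The per-dimension tail idea can be sketched in a few lines of numpy. This is a simplified, ECOD-flavored score for intuition only; PyOD's actual ECOD aggregates left-tail and right-tail sums more carefully than the per-dimension minimum used here:

```python
import numpy as np

def ecod_like_scores(x_train: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Simplified ECOD-flavored score: per-dimension empirical tail
    probabilities, aggregated as negative log tail mass.
    Higher = more anomalous. A teaching sketch, not PyOD's implementation."""
    n = len(x_train)
    score = np.zeros(len(x))
    for j in range(x.shape[1]):
        col = np.sort(x_train[:, j])
        # Empirical left-tail probability F(x_j), kept away from exact 0/1.
        left = (np.searchsorted(col, x[:, j], side="right") + 1) / (n + 2)
        right = 1.0 - np.searchsorted(col, x[:, j], side="left") / (n + 2)
        # A point is extreme if either tail is thin in this dimension.
        score += -np.log(np.minimum(left, right))
    return score

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 3))
test = np.vstack([np.zeros((1, 3)), np.full((1, 3), 4.0)])  # center vs far tail
s = ecod_like_scores(train, test)
print(s[1] > s[0])  # the 4-sigma point scores much higher than the center
```

Note there is nothing to tune: the ECDF is determined by the training data, which is the "no hyperparameters" property in miniature.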

Caveat: independence across features is a modeling shortcut; the method can still be strong empirically on mixed-type tabular, but for structured dependency between features, copula ideas (COPOD — same PyOD family) or domain-specific feature engineering can matter.

Production scoring loop — from batch model to on-call


Method comparison — when each primitive wins in practice

Method | Core signal | Scales to large n | High-dimensional tabular | Typical failure
Isolation Forest | Path length in random trees · sparse points isolated early | Strong · subsamples per tree | OK with trees + some feature eng | Contaminated training · clustered anomalies not isolated
LOF | Local density vs neighbors | Weak · k-NN structure · approx methods help | Needs reduction or denoised features | Curse of dimensionality · unstable k choice
One-class RBF SVM | Tight boundary around bulk in feature space | Training cost O(n²–n³) kernel regime limits n | Moderate p only | Kernel/tuning cost · poor on ultra-wide sparse one-hot data
Autoencoder / VAE | Reconstruction or ELBO gap on normal manifold | Training cost of NN · inference cheap | Strong on images, embeddings, logs in latent space | Over-capacity net reconstructs anomalies · need bottleneck
ECOD | Tail area under per-dim ECDF | O(nd) for classic fit on matrix | Competitive on many benchmarks per paper (tabular) | Assumes per-feature story · copula/structure may still help
Ensemble + PyOD | Voting or averaging ranks across algorithms | Depends on max member cost | Often best practical default when uncertain | More ops surface · correlated errors if members redundant
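For the ensemble row, the simplest combination that avoids scale mismatch between detectors is rank averaging: convert each detector's scores to ranks, then average. A minimal sketch with hypothetical detector outputs:

```python
import numpy as np
from scipy.stats import rankdata

def average_rank_ensemble(score_lists):
    """Combine detectors on a shared scale: rank each detector's scores,
    then average the ranks (higher average rank = more anomalous overall)."""
    ranks = np.vstack([rankdata(s) for s in score_lists])
    return ranks.mean(axis=0)

# Two hypothetical detectors mostly agree on the top point;
# averaging ranks damps their disagreements elsewhere.
det_a = np.array([0.1, 0.9, 0.2, 0.8])
det_b = np.array([0.2, 0.7, 0.9, 0.6])
combined = average_rank_ensemble([det_a, det_b])
print(combined)  # highest combined rank at index 1
```

Rank averaging helps most when the members make uncorrelated errors; redundant members (the table's last failure mode) add ops cost without adding signal.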

Evaluation Without Ground Truth — What You Still Must Do

In many deployments, you never have a complete set of true anomalies. You can still:

  1. Inject synthetic point anomalies (if the domain allows) and measure detection rate.
  2. Backtest on historical incidents — even 50 labeled incidents beat zero.
  3. Compare models in A/B on the same K-slot human queue and measure actioned incident rate, not just offline AUC.
  4. Run stability tests — if subsampling 90% of training changes scores wildly, the model is not stable enough to deploy.

Point-biserial correlation or ranking lift of known labels in mixed sets is a second sanity check when you have a few hundred labeled events.
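The stability test under subsampling can be scripted directly: refit on subsamples and check rank agreement of scores on the full set. A sketch using Isolation Forest; the trial count, subsample fraction, and threshold are illustrative choices:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import IsolationForest

def subsample_stability(x: np.ndarray, n_trials: int = 5,
                        frac: float = 0.9, seed: int = 0) -> float:
    """Refit on random subsamples and measure Spearman rank agreement of
    scores on the full set; low correlation = unstable, not deploy-ready."""
    rng = np.random.default_rng(seed)
    score_sets = []
    for t in range(n_trials):
        idx = rng.choice(len(x), size=int(frac * len(x)), replace=False)
        model = IsolationForest(random_state=t).fit(x[idx])
        score_sets.append(-model.score_samples(x))  # higher = more anomalous
    corrs = [spearmanr(score_sets[0], s)[0] for s in score_sets[1:]]
    return float(np.mean(corrs))

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 4))
print(subsample_stability(x) > 0.5)  # stable rankings on well-behaved data
```

The same harness works for any scorer: swap the model constructor and keep the rank-correlation check.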

⚠ WARNING

Top Failure Modes in Production

  1. Contaminated 'normal' data with embedded attacks.
  2. Using an anomaly score as a calibrated p-value for automated punishment — it is a relative rank, not a frequency, until calibrated on labeled data.
  3. Global threshold across heterogeneous segments — travel vs domestic users need separate baselines.
  4. Precision measured on a random holdout when ops uses top-1000 scoring — the metrics diverge.
  5. No alert budget, so thresholds chase noise after every deploy.

sklearn + PyOD — Isolation Forest and ECOD side by side

anomaly_baseline.py (Python)
import numpy as np
from sklearn.ensemble import IsolationForest
from pyod.models.ecod import ECOD

def fit_iforest(
    x: np.ndarray,
    contamination: float = 0.01,
    random_state: int = 42,
) -> IsolationForest:
    # contamination: expected fraction in training (upper bound) — tune carefully
    return IsolationForest(
        n_estimators=200,
        max_samples=256,
        contamination=contamination,
        random_state=random_state,
    ).fit(x)

def fit_ecod(x: np.ndarray) -> ECOD:
    return ECOD().fit(x)  # no hyperparameters in default

def compare_ranks(
    iforest: IsolationForest,
    ecod: ECOD,
    x: np.ndarray,
) -> float:
    """Spearman example: how correlated are two methods' outlier orderings?"""
    from scipy.stats import spearmanr
    s_if = -iforest.score_samples(x)  # higher = more anomalous in sklearn
    s_ec = ecod.decision_function(x)  # higher = more anomalous in PyOD
    r, _ = spearmanr(s_if, s_ec)
    return float(r)
TIP

Interview One-Liners That Signal Staff-Level Judgment

• 'We optimize precision@K under a fixed analyst headcount, not AUC, because the tail of the score distribution is all that matters in ops.'

• 'Isolation Forest and ECOD are both unsupervised, but the inductive bias is different — path length vs per-feature tail area — I would offline compare rank correlation on a labeled slice.'

• 'If training data is not clean normal, I don't trust unsupervised scores until we run contamination sensitivity and compare to weak-label approaches.'

• 'For drift, the fix is not just retraining — it is per-segment baselines, shadow scoring, and sometimes two-stage: anomaly filter to supervised re-ranker.'

Interview Questions
