Design an end-to-end production fraud detection system of the kind run by Stripe, PayPal, and Visa. Covers three hard problems that are rarely taught together: extreme class imbalance (typically ~0.1% fraud rate), cost-sensitive learning where a false negative and a false positive carry different costs, and multi-stage inference under 100ms latency. Includes the optimal threshold formula, graph neural networks for fraud rings, adversarial drift, and the suppression bias feedback loop that silently degrades deployed fraud models.


Fraud Detection, Class Imbalance, Cost-Sensitive Learning, Cascade Inference, Graph Neural Networks, Velocity Features, PR-AUC, Threshold Calibration, Adversarial ML, Feedback Loops, LightGBM, Feature Store

Why Fraud Detection Is the Hardest Production ML Problem

Fraud detection sits at the intersection of three fundamental ML difficulties that almost never appear together in other problems.

Class imbalance so extreme it breaks standard metrics. Fraud is typically around 0.1% of all transactions, roughly 1 in 1,000 (see the widely-used Kaggle Credit Card Fraud Detection benchmark: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud, which has a ~0.172% fraud rate on 284K transactions). A model that predicts "not fraud" for every transaction achieves a trivial ~99.9% accuracy while catching zero fraud, so it is completely useless. Accuracy, F1 at a default threshold, and even ROC-AUC become misleading: the false-positive rate is computed over the enormous legitimate class, so a model can post a strong ROC-AUC while most of its alerts are false alarms. You need metrics designed for this imbalance regime, such as precision, recall, and PR-AUC.
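The accuracy trap can be checked with plain arithmetic. The sketch below uses synthetic counts at a 0.1% fraud rate; the second classifier (90% recall, 1% false-positive rate) is a hypothetical chosen for illustration, not a figure from any real system.

```python
def rates(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, false_positive_rate) from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, precision, recall, fpr

# Synthetic population: 100 fraud cases in 100,000 transactions (0.1%).
N_FRAUD, N_LEGIT = 100, 99_900

# Baseline that predicts "not fraud" for every transaction.
always_legit = rates(tp=0, fp=0, fn=N_FRAUD, tn=N_LEGIT)

# Hypothetical model: catches 90 of 100 frauds, flags 1% of legitimate traffic.
hypothetical = rates(tp=90, fp=999, fn=10, tn=N_LEGIT - 999)

for name, (acc, prec, rec, fpr) in [("always_legit", always_legit),
                                    ("hypothetical", hypothetical)]:
    print(f"{name}: accuracy={acc:.4f} precision={prec:.4f} "
          f"recall={rec:.2f} fpr={fpr:.4f}")
```

Under these assumed numbers the useless baseline has higher accuracy than the model that catches 90% of fraud, and a 1% false-positive rate still means roughly 11 false alarms for every true fraud caught. That is why precision and recall are the quantities to track.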

Asymmetric costs that standard loss functions ignore. Missing a $50,000 wire fraud is not the same mistake as blocking a $12 coffee purchase. A missed fraud causes direct financial loss plus chargeback fees (typically ~$25–$50; see Stripe's disputes documentation on chargeback mechanics at https://stripe.com/docs/disputes). A false positive blocks a legitimate customer, risks churn, and generates a support call. These costs are not equal, and any system that treats them as equal is making the wrong tradeoffs. The optimal model is not the one with the best F1; it is the one that minimizes expected financial loss.
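A small worked comparison makes the gap between F1 and financial loss concrete. This is a sketch with invented numbers: the per-error costs and both confusion matrices are hypothetical, and correct decisions are assumed to cost nothing.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from counts."""
    return 2 * tp / (2 * tp + fp + fn)

def expected_loss(fp, fn, c_fp, c_fn):
    """Total cost of errors; correct decisions are assumed to cost zero."""
    return fp * c_fp + fn * c_fn

# Hypothetical per-error costs in dollars.
C_FN = 150.0  # missed fraud: loss plus chargeback fee
C_FP = 5.0    # blocked legitimate customer: churn and support cost

# Two hypothetical models scored on the same 100 fraud cases.
model_a = {"tp": 80, "fp": 20, "fn": 20}   # conservative, high precision
model_b = {"tp": 95, "fp": 200, "fn": 5}   # aggressive, high recall

f1_a = f1_score(**model_a)
f1_b = f1_score(**model_b)
loss_a = expected_loss(model_a["fp"], model_a["fn"], C_FP, C_FN)
loss_b = expected_loss(model_b["fp"], model_b["fn"], C_FP, C_FN)

print(f"model A: F1={f1_a:.3f} loss=${loss_a:,.0f}")
print(f"model B: F1={f1_b:.3f} loss=${loss_b:,.0f}")
```

With these assumed costs, model B has a much lower F1 yet roughly halves the dollar loss, because each missed fraud is 30 times more expensive than each false decline. Change the cost ratio and the ranking can flip, which is why the costs have to be pinned down before models are compared.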

Sub-100ms latency with adversarial adaptation. A payment network must approve or decline in under 100ms, and sometimes in under 50ms. The most sophisticated models you would want to run cannot fit in that window. Crucially, unlike weather prediction or movie recommendations, your "class 1" (the fraudster) is actively studying your model's decisions, probing the boundary conditions, and evolving patterns to evade detection. The moment your model becomes effective, it becomes a target.

This problem is asked at Stripe, PayPal, Square, Visa, Mastercard, Amazon Payments, Robinhood, Coinbase, and most other companies that process money. The architecture generalizes to any adversarial, imbalanced, low-latency classification problem: bot detection, account takeover, insurance claims fraud. For production-grade examples, see Stripe Radar (https://stripe.com/radar and https://stripe.com/guides/radar-guide) and Uber's Mastermind fraud detection pipeline; both describe cascade-style inference with the same building blocks discussed here.

TIP

What Interviewers Are Evaluating

Mid-level: Do you know that accuracy is the wrong metric? Can you describe a multi-stage pipeline (rules → model)? Do you know velocity features are important? Can you describe class imbalance handling?

Senior-level: Do you use the cost-sensitive threshold formula instead of 0.5? Do you understand why calibration matters for the threshold calculation? Do you know velocity features must come from Redis (not batch)? Do you understand graph features for fraud rings? Can you build a latency budget?

Staff-level: Do you design the counterfactual logging system and identify suppression bias? Do you know GNN training pipelines for entity embeddings? Do you address regulatory requirements (SHAP explainability, audit trails)? Do you reason about who owns C_FN and C_FP in the org?

Clarifying Questions — Ask These First

What type of fraud are we detecting?

Card-present (POS terminal) vs card-not-present (online payment) vs account takeover vs first-party fraud (user claims fraud on their own transaction for chargeback). Each has different signal timing, features, and latency requirements. For this design: card-not-present online payment fraud — the hardest case.

What is the latency budget?

Payment authorization latency must be under 100ms at p99. Some networks require 50ms. This directly constrains which models can run in-line vs async. Establish this upfront — it's the key architectural constraint.
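One way to reason about which stages run in-line versus async is a simple budget table. The sketch below is illustrative only: every stage name and p99 latency is a hypothetical placeholder, and summing per-stage p99 values is a rough planning estimate, not an exact p99 for the whole pipeline.

```python
def plan_cascade(stages, budget_ms):
    """Greedily keep stages in-line until the latency budget is spent.

    stages: list of (name, p99_latency_ms) in priority order.
    Returns (inline_names, async_names, inline_total_ms).
    """
    inline, deferred, used = [], [], 0
    for name, p99_ms in stages:
        if used + p99_ms <= budget_ms:
            inline.append(name)
            used += p99_ms
        else:
            deferred.append(name)  # score asynchronously, after authorization
    return inline, deferred, used

# Hypothetical stages and latencies, highest priority first.
STAGES = [
    ("network_overhead", 20),
    ("feature_lookup", 15),
    ("rules_engine", 2),
    ("gradient_boosted_model", 10),
    ("graph_model", 80),
]

BUDGET_MS = 100
inline, deferred, used = plan_cascade(STAGES, BUDGET_MS)
print("in-line:", inline)
print("async:  ", deferred)
print(f"used {used}ms of {BUDGET_MS}ms, headroom {BUDGET_MS - used}ms")
```

Under these assumed numbers the heavy graph model does not fit and is pushed to the async path, where its output can still feed post-authorization review or later transactions. Tightening the budget to 50ms leaves very little headroom in this example, which shows how strongly the latency requirement shapes the design.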

What are the approximate costs of false negatives vs false positives?

C_FN = average fraud transaction value + chargeback fee. C_FP = customer lifetime value × churn probability due to declined transaction. These numbers set the optimal decision threshold. If the interviewer doesn't give them, propose ranges and show the formula.
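To show how these two costs turn into a threshold, here is a minimal sketch of the standard cost-based decision rule. It assumes the model outputs a calibrated fraud probability p and that correct decisions cost nothing: declining is cheaper in expectation when p × C_FN > (1 − p) × C_FP, which rearranges to p > C_FP / (C_FP + C_FN). All dollar figures are hypothetical, and the function names are introduced here for illustration.

```python
def cost_false_negative(avg_fraud_amount, chargeback_fee):
    """C_FN: average fraud transaction value plus chargeback fee."""
    return avg_fraud_amount + chargeback_fee

def cost_false_positive(customer_ltv, churn_prob):
    """C_FP: customer lifetime value times churn probability after a decline."""
    return customer_ltv * churn_prob

def optimal_threshold(c_fn, c_fp):
    """Decline when calibrated P(fraud) exceeds this value."""
    return c_fp / (c_fp + c_fn)

# Hypothetical inputs, in dollars.
c_fn = cost_false_negative(avg_fraud_amount=120.0, chargeback_fee=25.0)
c_fp = cost_false_positive(customer_ltv=500.0, churn_prob=0.01)

threshold = optimal_threshold(c_fn, c_fp)
print(f"C_FN=${c_fn:.2f} C_FP=${c_fp:.2f} threshold={threshold:.4f}")

def should_decline(p_fraud, threshold):
    """Apply the cost-based threshold to a calibrated probability."""
    return p_fraud > threshold
```

With these assumed numbers the threshold lands near 0.033, far below the default 0.5, because a missed fraud costs about 29 times more than a false decline. The rule only holds if the model's scores are calibrated probabilities; an uncalibrated score would need calibration before this threshold is applied.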

What scale?

Transactions per second: a typical e-commerce platform sees around 1K TPS, peaking at 5K TPS. Visa has cited a VisaNet capacity of more than 24,000 TPS globally. This drives infrastructure choices (horizontal scaling, async processing, Redis capacity).
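A quick back-of-envelope calculation turns the TPS figure into sizing numbers. The sketch below applies Little's law (average concurrency = arrival rate × time in system) using the 100ms budget as an upper bound on time in system; the number of feature reads per transaction is a hypothetical assumption.

```python
def in_flight_requests(tps, latency_ms):
    """Little's law: concurrent requests = arrival rate x time in system."""
    return tps * latency_ms / 1000

def feature_store_ops_per_sec(tps, reads_per_txn):
    """Read load on the online feature store (for example Redis)."""
    return tps * reads_per_txn

PEAK_TPS = 5_000     # peak e-commerce load from the text
LATENCY_MS = 100     # latency budget, used as an upper bound
READS_PER_TXN = 10   # hypothetical: velocity counters fetched per transaction

concurrency = in_flight_requests(PEAK_TPS, LATENCY_MS)
redis_ops = feature_store_ops_per_sec(PEAK_TPS, READS_PER_TXN)

print(f"in-flight requests at peak: {concurrency:.0f}")
print(f"feature store reads/sec at peak: {redis_ops:,}")
```

Under these assumptions the scoring tier must sustain about 500 concurrent requests and the feature store about 50,000 reads per second at peak, before adding headroom for retries and traffic spikes.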

What are the regulatory requirements?

Financial services are heavily regulated. Some jurisdictions require explainable decisions (SHAP/LIME values alongside every decline). PCI-DSS compliance affects data storage. These create hard constraints on model complexity and data retention.
