MLSD Case Study: Graph-Aware Spam Detection
Design a Gmail/LinkedIn-style spam detection system combining content models, graph-based abuse signals, and velocity features. Covers adversarial adaptation, streaming detection, and class-specific action policies.
Why Spam Detection Requires Graph + Time, Not Just Text
Content-only classifiers fail quickly in production spam systems because attackers mutate text faster than models can be retrained. A campaign that starts with "Buy cheap meds now" pivots to leetspeak ("Buy ch3ap m3ds n0w"), then to image-embedded text, then to URL obfuscation. Each mutation invalidates keyword rules and shifts the input distribution the content model was trained on. Relying on text alone puts you in a permanent arms race you cannot win.
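As a minimal illustration of the mutation problem described above, the sketch below applies a static keyword rule to the original message and to its leetspeak variant. The rule pattern and the function name `keyword_rule_hits` are hypothetical, chosen only for this example; real rule engines are more elaborate, but they fail in the same way.

```python
import re

# Hypothetical keyword rule; the pattern is illustrative only.
KEYWORD_RULE = re.compile(r"\bcheap meds\b", re.IGNORECASE)


def keyword_rule_hits(text: str) -> bool:
    """Return True if the static keyword rule matches the message text."""
    return KEYWORD_RULE.search(text) is not None


original = "Buy cheap meds now"
mutated = "Buy ch3ap m3ds n0w"

print(keyword_rule_hits(original))  # True: the rule catches the first wave
print(keyword_rule_hits(mutated))   # False: a trivial character swap evades it
```

Adding a leetspeak normalizer would recover this one variant, but the attacker's next step (image-embedded text, URL obfuscation) bypasses text matching altogether.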
Production systems win by making text mutation insufficient. They combine three orthogonal signal families that are much harder to simultaneously evade:
- Content semantics: embedding-based classifiers capture intent even through surface mutation. A phishing message typically contains urgency cues, credential request patterns, and authority impersonation signals that survive superficial text changes.
- Sender and entity reputation: abuse campaigns reuse infrastructure. Domains, IP blocks, device fingerprints, and account clusters accumulate abuse history that persists even when message text changes. A sender with a clean message but a domain registered 3 hours ago in a known bulletproof-hosting range is high-risk regardless of what they wrote.
- Velocity and burst patterns: legitimate senders behave within bounded volume patterns. A sender emitting 5,000 messages per hour to previously uncontacted recipients is exhibiting an abuse signal that is entirely independent of message content.
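The velocity signal in the last bullet can be sketched as a per-sender sliding-window counter of sends to first-time recipients. This is a rough sketch under stated assumptions, not a production design: the class name `SenderVelocity`, the one-hour default window, and the in-memory state are all hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SenderVelocity:
    """Sliding-window count of messages sent to first-time recipients.

    Hypothetical helper for illustration; one instance per sender.
    Assumes timestamps (in seconds) arrive in non-decreasing order.
    """

    def __init__(self, window_seconds: float = 3600.0):
        self.window_seconds = window_seconds
        self._new_contact_times = deque()  # timestamps of sends to new recipients
        self._known_recipients = set()     # recipients this sender has contacted

    def record(self, timestamp: float, recipient: str) -> int:
        """Record one send and return the new-recipient count in the window."""
        if recipient not in self._known_recipients:
            self._known_recipients.add(recipient)
            self._new_contact_times.append(timestamp)
        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._new_contact_times and self._new_contact_times[0] <= cutoff:
            self._new_contact_times.popleft()
        return len(self._new_contact_times)


# Usage: a sender contacting a new recipient every half second.
tracker = SenderVelocity()
count = 0
for i in range(5000):
    count = tracker.record(timestamp=i * 0.5, recipient=f"user{i}")
print(count)  # 5000 new recipients inside one hour
```

The count never looks at message text, which is why this feature stays informative while the content mutates. A real system would typically keep this state in a streaming store and compare the count against a per-sender baseline instead of a fixed threshold.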
The strongest interview answers articulate entity-level reputation and network effects explicitly: a sender account can look perfectly benign in isolation but belong to a coordinated abuse cluster that shares device fingerprints or recipient overlap with known spam campaigns. Graph-based risk propagation surfaces these cluster-level signals that per-entity heuristics miss entirely.
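To make the cluster-level idea concrete, here is a minimal sketch of risk propagation over an undirected graph that links accounts to shared device fingerprints. The function `propagate_risk`, the damping factor, the iteration count, and the node names are assumptions made for this example; production systems may use quite different propagation schemes.

```python
def propagate_risk(edges, seed_risk, damping=0.5, iterations=10):
    """Spread risk from known-bad nodes to their graph neighbors.

    edges: iterable of (node_a, node_b) pairs, treated as undirected.
    seed_risk: dict mapping known-bad nodes to a risk score in [0, 1].
    Returns a dict of risk scores for every node that appears in edges.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    risk = {node: seed_risk.get(node, 0.0) for node in neighbors}
    for _ in range(iterations):
        updated = {}
        for node, adjacent in neighbors.items():
            neighbor_mean = sum(risk[n] for n in adjacent) / len(adjacent)
            # Keep the node's own seed evidence; otherwise inherit damped neighbor risk.
            updated[node] = max(seed_risk.get(node, 0.0), damping * neighbor_mean)
        risk = updated
    return risk


# Hypothetical graph: acct_new shares a device fingerprint with a known spammer.
edges = [
    ("acct_spam", "device_1"),
    ("acct_new", "device_1"),
    ("acct_clean", "device_2"),
]
risk = propagate_risk(edges, seed_risk={"acct_spam": 1.0})
print(risk["acct_new"] > risk["acct_clean"])  # True
```

Here `acct_new` has no abuse history of its own, yet it picks up nonzero risk through the shared device, while `acct_clean` stays at zero. That is the signal a per-entity heuristic would miss.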