Data Quality Monitoring: Schema Drift, Null Rates, Freshness SLAs, and Anomaly Detection for Analytics
Bad data causes bad decisions faster than good models can fix them. This guide covers Great Expectations-style contracts, dbt tests, freshness SLAs on Airflow/Dagster, null and uniqueness monitors, schema evolution in Avro/Protobuf, volume anomaly detection (STL, EWMA), and how Airbnb-class analytics engineering teams triage P0 pipeline breaks vs. metric definition drift.
Data Quality Is a Product Surface
Analytics consumers (PMs, finance) assume tables are correct, complete, and timely. Data quality monitoring makes violations visible before they pollute dashboards and A/B reads.
Interviews expect a layered story: contract tests at ingest, transformation assertions in dbt, SLA monitors on job completion, statistical anomaly detection on row counts and key metrics.
The key insight is that data quality is not one check, it is a defense-in-depth system. Schema contracts prevent malformed payloads from entering the lake, transformation tests catch logic regressions, and freshness/volume monitors catch silent operational failures that still pass SQL syntax. Teams that rely on only one layer usually discover incidents from executives before monitors fire, which is exactly what high-quality analytics engineering is meant to prevent.
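As a minimal sketch of the first layer, here is what an ingest-time contract check can look like in plain Python. The FieldSpec class, the field names, and the quarantine convention are all illustrative assumptions, not the Great Expectations API.

```python
# Hypothetical ingest-time contract: reject malformed payloads before they
# enter the lake. FieldSpec and the field names are illustrative, not a real API.
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldSpec:
    name: str
    type_: type
    nullable: bool = False

CONTRACT = [
    FieldSpec("user_id", str),
    FieldSpec("event_ts", int),               # epoch millis
    FieldSpec("os_version", str, nullable=True),
]

def contract_violations(payload: dict) -> list[str]:
    """Return a list of violations; an empty list means the payload passes."""
    errors = []
    for spec in CONTRACT:
        value = payload.get(spec.name)
        if value is None:
            if not spec.nullable:
                errors.append(f"missing or null non-nullable field: {spec.name}")
        elif not isinstance(value, spec.type_):
            errors.append(f"bad type for {spec.name}: {type(value).__name__}")
    return errors

# Quarantine failures rather than dropping them, so incidents stay debuggable.
print(contract_violations({"user_id": None, "event_ts": 1_700_000_000_000}))
# -> ['missing or null non-nullable field: user_id']
```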
Interviewer Signals
Strong: Distinguish schema (columns/types), volume (row-count spikes/drops), distribution (null rate, categorical entropy), freshness (latency from event to table), and referential integrity — each has different detectors and severities.
Staff: Blame attribution — tracing lineage from a broken upstream Kafka topic to the stale dashboard it feeds (a toy sketch follows below); canary metrics for releases.
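To make the lineage point concrete, here is a toy sketch of blame attribution: walk upstream from the stale asset and blame the deepest unhealthy ancestor. The graph, asset names, and health states are all invented for illustration.

```python
# Toy blame attribution over a lineage graph: walk upstream from a stale
# asset and return the deepest unhealthy ancestor. All names are invented.
LINEAGE = {  # child -> upstream parents
    "dashboard.revenue":  ["gold.orders_daily"],
    "gold.orders_daily":  ["staging.orders"],
    "staging.orders":     ["kafka.orders_topic"],
    "kafka.orders_topic": [],
}
HEALTH = {
    "kafka.orders_topic": "broken",   # the true root cause
    "staging.orders":     "stale",
    "gold.orders_daily":  "stale",
    "dashboard.revenue":  "stale",
}

def blame(asset: str) -> str:
    """Recurse into the first unhealthy parent; stop at the deepest one."""
    for parent in LINEAGE.get(asset, []):
        if HEALTH.get(parent, "healthy") != "healthy":
            return blame(parent)
    return asset

print(blame("dashboard.revenue"))  # -> kafka.orders_topic
```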
Triage Order When a Metric Looks Wrong
1. Is the pipeline fresh? Check max event-timestamp lag first — often a P0 (sketched as SQL below).
2. Did row counts move? Dropped join keys cause silent fan-out loss.
3. Schema change? A renamed column reads as NULL in downstream SQL.
4. Definition drift? A PM changed the funnel — the data is correct, but the semantics moved.
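A sketch of the first three steps as Postgres-flavored SQL run in triage order. The table and column names (events.gold_events, event_ts, dt) are assumptions:

```python
# Triage queries in order; stop at the first breach. Table and column names
# (events.gold_events, event_ts, dt) are assumed for illustration.
TRIAGE_QUERIES = {
    # 1. Freshness: minutes since the newest event landed.
    "freshness_lag_minutes": """
        SELECT EXTRACT(EPOCH FROM now() - MAX(event_ts)) / 60 AS lag_min
        FROM events.gold_events
    """,
    # 2. Volume: day-over-day change in row counts; catches silent join loss.
    "row_count_dod_change": """
        SELECT COUNT(*) FILTER (WHERE dt = CURRENT_DATE - 1)::float
             / NULLIF(COUNT(*) FILTER (WHERE dt = CURRENT_DATE - 2), 0) - 1
               AS dod_change
        FROM events.gold_events
    """,
    # 3. Schema: diff this snapshot against yesterday's to spot renames.
    "schema_snapshot": """
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = 'events' AND table_name = 'gold_events'
        ORDER BY ordinal_position
    """,
}
# e.g. lag_min > 120 on a gold table is a freshness P0 before anything else.
```

Definition drift (step 4) has no query; it is a conversation with the metric owner, which is why it comes last.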
Concrete Checks That Catch Real Incidents
Freshness: alert when now() - max(event_ts) exceeds a per-tier threshold — gold tables get tighter SLAs than staging.
Volume: day-over-day row-count z-score or EWMA band — catches silent WHERE-clause bugs (sketched below).
Null rate: alert when null_fraction(user_id) exceeds its rolling baseline — the classic SDK logging bug.
Uniqueness: primary-key duplicates surviving dedupe — an early join-explosion signal.
Accepted values: enum domain shrinking or expanding — catches mobile OS version-mapping errors.
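For the volume monitor, a minimal EWMA band in Python. The smoothing factor alpha and the band width k are tuning assumptions, not standards:

```python
# Minimal EWMA band for daily row counts: alert when today's count falls
# outside mean ± k standard deviations. alpha and k are tuning assumptions.
import math

def ewma_band(counts: list[float], alpha: float = 0.3, k: float = 3.0):
    """Exponentially weighted mean/variance over the history (West's update)."""
    mean, var = counts[0], 0.0
    for x in counts[1:]:
        diff = x - mean
        incr = alpha * diff
        mean += incr
        var = (1 - alpha) * (var + diff * incr)
    sd = math.sqrt(var)
    return mean - k * sd, mean + k * sd

history = [1.02e6, 0.99e6, 1.05e6, 1.01e6, 0.98e6, 1.03e6]  # daily inserts
low, high = ewma_band(history)
today = 0.49e6                       # a silent WHERE-clause bug halves volume
if not (low <= today <= high):
    print(f"volume anomaly: {today:,.0f} outside [{low:,.0f}, {high:,.0f}]")
```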
In production, every check should include three fields: owner, severity, and runbook URL. Without these, alerting devolves into Slack noise with no clear resolution path. Another practical pattern is seasonality-aware baselines (weekday/weekend profiles) for volume and null rates; static thresholds create alert fatigue and are eventually ignored.
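A sketch of how that metadata and a weekday/weekend baseline could hang off a check definition; every field name and value here is illustrative:

```python
# Illustrative check definition: each monitor carries owner, severity, and a
# runbook URL, plus separate weekday/weekend baselines. Names are invented.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Check:
    name: str
    severity: str            # "P0" pages; lower severities route to Slack
    owner: str
    runbook_url: str
    weekday_baseline: float
    weekend_baseline: float

    def baseline_for(self, d: date) -> float:
        return self.weekend_baseline if d.weekday() >= 5 else self.weekday_baseline

null_rate_check = Check(
    name="null_fraction(user_id)",
    severity="P0",
    owner="mobile-de-oncall",
    runbook_url="https://wiki.example.com/runbooks/user-id-nulls",  # hypothetical
    weekday_baseline=0.002,
    weekend_baseline=0.004,  # weekend traffic skews toward older app builds
)
print(null_rate_check.baseline_for(date(2024, 1, 6)))  # Saturday -> 0.004
```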
Strong interview answers also include a post-incident loop: after each P0, add one preventive contract or monitor that would have caught it earlier. This closes the reliability feedback loop and demonstrates systems thinking beyond one-off debugging.
[Diagram: Data Quality Layers — From Ingest to BI]
Check Types — Severity Guide
| Signal | Typical P0? | Detector | Owner |
|---|---|---|---|
| Freshness breach | Yes — stale gold table | Airflow SLA sensor | DE on-call |
| Row count −50% | Yes — likely join bug | Anomaly detection on daily inserts | DE |
| Null spike on a key | Yes — logging bug | Column stats monitor | Mobile + DE |
| Slow drift in a mean | Maybe — real-world change | Bayesian change-point detection | DS + DE |
| Column added to schema | No — informational | Registry diff | Platform |
Alert Fatigue Kills Programs
Page only on consumer-facing SLA breaks. Route noisy checks to Slack, auto-opening a ticket if a failure persists for N consecutive runs, and tune thresholds from historical variance (routing sketch below).
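That policy maps to a small router; the paging, Slack, and ticket calls below are stubs standing in for PagerDuty, a Slack webhook, and Jira:

```python
# Route by severity and persistence: page only consumer-facing SLA breaks,
# send noisy checks to Slack, and open a ticket after N consecutive failures.
N_RUNS_BEFORE_TICKET = 3  # tune per check from historical variance

def page_oncall(check: str) -> None:
    print(f"PAGE: {check}")                   # stub for PagerDuty/Opsgenie

def post_to_slack(check: str) -> None:
    print(f"#data-quality: {check}")          # stub for a Slack webhook

def open_ticket(check: str) -> None:
    print(f"ticket opened: {check}")          # stub for Jira auto-ticketing

def route_alert(check: str, severity: str, consecutive_failures: int) -> None:
    if severity == "P0":                      # consumer-facing SLA break
        page_oncall(check)
    else:
        post_to_slack(check)
        if consecutive_failures >= N_RUNS_BEFORE_TICKET:
            open_ticket(check)

route_alert("row_count_ewma(gold.orders_daily)", "P2", consecutive_failures=3)
```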
Interview One-Liner
"I tier tables by business criticality, put dbt uniqueness and not-null on keys, freshness SLAs on orchestration, and row-count EWMA with seasonality for silent failure — then tie each alert to an owner runbook."