Product Analytics · Intermediate

Data Quality Monitoring: Schema Drift, Null Rates, Freshness SLAs, and Anomaly Detection for Analytics

Bad data causes bad decisions faster than good models fix them. This guide covers Great Expectations-style contracts, dbt tests, freshness SLAs on Airflow/Dagster, null and uniqueness monitors, schema evolution in Avro/Protobuf, volume anomaly detection (STL, EWMA), and how Airbnb-class analytics engineering triages P0 pipeline breaks vs metric definition drift.

35 min read · 8 sections · 6 interview questions
Data Quality · Schema Drift · Great Expectations · dbt Tests · Freshness SLA · Airflow · Anomaly Detection · Data Contracts · Null Rate Monitoring · Analytics Engineering · Volume Checks · Observability

Data Quality Is a Product Surface

Analytics consumers (PMs, finance) assume tables are correct, complete, and timely. Data quality monitoring makes violations visible before they pollute dashboards and A/B reads.

Interviews expect a layered story: contract tests at ingest, transformation assertions in dbt, SLA monitors on job completion, statistical anomaly detection on row counts and key metrics.

The key insight is that data quality is not one check, it is a defense-in-depth system. Schema contracts prevent malformed payloads from entering the lake, transformation tests catch logic regressions, and freshness/volume monitors catch silent operational failures that still pass SQL syntax. Teams that rely on only one layer usually discover incidents from executives before monitors fire, which is exactly what high-quality analytics engineering is meant to prevent.
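The first layer — schema contracts at ingest — can be sketched as a simple payload check. This is a minimal illustration in the spirit of Great Expectations, not its actual API; the contract contents and helper name are assumptions.

```python
# Minimal sketch of an ingest-time schema contract (hypothetical helper):
# reject payloads whose columns or types drift from the agreed contract
# before they reach the lake.
EVENTS_CONTRACT = {"user_id": str, "event_ts": int, "event_name": str}

def contract_violations(payload: dict, contract: dict) -> list[str]:
    """Return a list of violations; an empty list means the payload conforms."""
    problems = []
    for col, typ in contract.items():
        if col not in payload:
            problems.append(f"missing column: {col}")
        elif not isinstance(payload[col], typ):
            problems.append(f"type drift on {col}: got {type(payload[col]).__name__}")
    for col in payload:
        if col not in contract:
            # schema additions are usually informative, not P0 (see table below)
            problems.append(f"unexpected column: {col}")
    return problems
```

Real registries (Avro/Protobuf schema registries) enforce this with compatibility modes rather than hand-rolled checks, but the shape of the decision is the same: fail loudly at the boundary instead of reading NULLs downstream.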

IMPORTANT

Interviewer Signals

Strong: Distinguish the five signal classes — schema (columns/types), volume (row-count spikes/drops), distribution (null rate, categorical entropy), freshness (latency from event to table), and referential integrity. Each has different detectors and severities.

Staff: Blame attribution — lineage tracing from a broken upstream Kafka topic to the stale dashboard it feeds, plus canary metrics for releases.

Triage Order When a Metric Looks Wrong

1. Is the pipeline fresh? Check max event timestamp lag — often a P0.

2. Did row counts move? Dropped join keys cause silent fan-out loss.

3. Schema change? A renamed column reads as NULL in downstream SQL.

4. Definition drift? The PM changed the funnel — the data is correct, but the semantics moved.

Concrete Checks That Catch Real Incidents

Freshness: now() - max(event_ts) exceeds a per-tier threshold — gold tables get tighter SLAs than staging.

Volume: Day-over-day row count z-score or EWMA band — catches silent WHERE-clause bugs.

Null rate: Alert when null_fraction(user_id) exceeds its rolling baseline — the classic SDK logging bug.

Uniqueness: Primary-key duplicates after dedupe — an early signal of join explosion.

Accepted values: Enum domain shrinking or expanding — e.g., mobile OS version mapping errors.
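The volume check above can be sketched with a running EWMA band. This is an illustrative detector, not a production library; the smoothing factor `alpha` and band width `k` are tuning assumptions.

```python
# Hedged sketch of the EWMA volume band: flag a day's row count when it
# falls outside mean +/- k*sigma of an exponentially weighted baseline.
def ewma_anomalies(counts: list[float], alpha: float = 0.3, k: float = 3.0) -> list[int]:
    """Return indices of days whose row count breaks the EWMA band."""
    mean, var, flagged = counts[0], 0.0, []
    for i, x in enumerate(counts[1:], start=1):
        sigma = var ** 0.5
        if sigma > 0 and abs(x - mean) > k * sigma:
            flagged.append(i)  # e.g., a silent WHERE-clause bug halving inserts
        # update the exponentially weighted mean and variance
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return flagged
```

A seasonality-aware version would keep separate baselines per weekday profile, or use STL decomposition to remove the weekly component before banding — static thresholds on raw counts are the alert-fatigue trap discussed below.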

In production, every check should include three fields: owner, severity, and runbook URL. Without these, alerting devolves into Slack noise with no clear resolution path. Another practical pattern is seasonality-aware baselines (weekday/weekend profiles) for volume and null rates; static thresholds create alert fatigue and are eventually ignored.
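The three required fields can be made structural rather than tribal by encoding them in the check definition itself. The class name, field names, and URL below are illustrative assumptions, not a real framework.

```python
# Illustrative shape for the three fields every production check should
# carry: owner, severity, and a runbook link (all names are assumptions).
from dataclasses import dataclass

@dataclass(frozen=True)
class CheckSpec:
    name: str
    severity: str      # "P0" pages oncall; lower tiers route to Slack
    owner: str         # team or rotation accountable for triage
    runbook_url: str   # resolution path; without it, alerts are just noise

NULL_RATE_USER_ID = CheckSpec(
    name="null_fraction(user_id) vs rolling baseline",
    severity="P0",
    owner="mobile+de-oncall",
    runbook_url="https://wiki.example.com/runbooks/user-id-nulls",  # placeholder
)
```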

Strong interview answers also include a post-incident loop: after each P0, add one preventive contract or monitor that would have caught it earlier. This closes the reliability feedback loop and demonstrates systems thinking beyond one-off debugging.

Data Quality Layers — From Ingest to BI

[Diagram: data quality layers from ingest to BI]

Check Types — Severity Guide

| Signal | Typical P0? | Detector | Owner |
| --- | --- | --- | --- |
| Freshness breach | Yes — stale gold | Airflow SLA sensor | DE oncall |
| Row count −50% | Yes — join bug | Anomaly on daily inserts | DE |
| Null spike on key | Yes — logging | Column stats monitor | Mobile + DE |
| Slow drift in mean | Maybe — real world | Bayesian change point | DS + DE |
| Schema: column added | No — informative | Registry diff | Platform |
⚠ WARNING

Alert Fatigue Kills Programs

Page only on consumer-facing SLA breaks. Route noisy checks to Slack, auto-ticketing any failure that persists for N consecutive runs — and tune thresholds from historical variance.
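The routing rule above is simple enough to sketch directly. The channel names and the default N are assumptions for illustration.

```python
# Sketch of the alert-routing rule: page only on consumer-facing SLA
# breaks; everything else goes to Slack, escalating to a ticket once a
# check has failed N consecutive runs (N is an assumed default).
def route_alert(consumer_facing_sla_break: bool,
                consecutive_failures: int, n: int = 3) -> str:
    if consumer_facing_sla_break:
        return "page"    # wake someone up: dashboards are wrong right now
    if consecutive_failures >= n:
        return "ticket"  # persistent failure needs an owner, not a thread
    return "slack"       # noisy tier: visible, but not paging
```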

TIP

Interview One-Liner

"I tier tables by business criticality, put dbt uniqueness and not-null on keys, freshness SLAs on orchestration, and row-count EWMA with seasonality for silent failure — then tie each alert to an owner runbook."
