Product Analytics · Intermediate

Data Quality Monitoring: Schema Drift, Null Rates, Freshness SLAs, and Anomaly Detection for Analytics

Bad data causes bad decisions faster than good models fix them. This guide covers Great Expectations-style contracts, dbt tests, freshness SLAs on Airflow/Dagster, null and uniqueness monitors, schema evolution in Avro/Protobuf, volume anomaly detection (STL, EWMA), and how Airbnb-class analytics engineering triages P0 pipeline breaks vs metric definition drift.

35 min read · 8 sections · 6 interview questions
Data Quality · Schema Drift · Great Expectations · dbt Tests · Freshness SLA · Airflow · Anomaly Detection · Data Contracts · Null Rate Monitoring · Analytics Engineering · Volume Checks · Observability

Data Quality Is a Product Surface

Analytics consumers (PMs, finance) assume tables are correct, complete, and timely. Data quality monitoring makes violations visible before they pollute dashboards and A/B reads.

Interviews expect a layered story: contract tests at ingest, transformation assertions in dbt, SLA monitors on job completion, statistical anomaly detection on row counts and key metrics.

The key insight is that data quality is not one check, it is a defense-in-depth system. Schema contracts prevent malformed payloads from entering the lake, transformation tests catch logic regressions, and freshness/volume monitors catch silent operational failures that still pass SQL syntax. Teams that rely on only one layer usually discover incidents from executives before monitors fire, which is exactly what high-quality analytics engineering is meant to prevent.
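The first layer — schema contracts at ingest — can be sketched as a simple payload check. This is a minimal illustration in the spirit of Great Expectations, not its actual API; the contract contents and helper name are assumptions.

```python
# Minimal sketch of an ingest-time schema contract (hypothetical helper):
# reject payloads whose columns or types drift from the agreed contract
# before they reach the lake.
EVENTS_CONTRACT = {"user_id": str, "event_ts": int, "event_name": str}

def contract_violations(payload: dict, contract: dict) -> list[str]:
    """Return a list of violations; an empty list means the payload conforms."""
    problems = []
    for col, typ in contract.items():
        if col not in payload:
            problems.append(f"missing column: {col}")
        elif not isinstance(payload[col], typ):
            problems.append(f"type drift on {col}: got {type(payload[col]).__name__}")
    for col in payload:
        if col not in contract:
            # schema additions are usually informative, not P0 (see table below)
            problems.append(f"unexpected column: {col}")
    return problems
```

Real registries (Avro/Protobuf schema registries) enforce this with compatibility modes rather than hand-rolled checks, but the shape of the decision is the same: fail loudly at the boundary instead of reading NULLs downstream.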

IMPORTANT

Interviewer Signals

Strong: Distinguish the five signal classes — schema (columns/types), volume (row-count spikes/drops), distribution (null rate, categorical entropy), freshness (latency from event to table), and referential integrity. Each has different detectors and severities.

Staff: Blame attribution — lineage tracing from a broken upstream Kafka topic to the stale dashboard it feeds, plus canary metrics for releases.

Triage Order When a Metric Looks Wrong

1. Is the pipeline fresh? Check max event timestamp lag — often a P0.

2. Did row counts move? Dropped join keys cause silent fan-out loss.

3. Schema change? A renamed column reads as NULL in downstream SQL.

4. Definition drift? The PM changed the funnel — the data is correct, but the semantics moved.

Concrete Checks That Catch Real Incidents

Freshness: now() - max(event_ts) exceeds a per-tier threshold — gold tables get tighter SLAs than staging.

Volume: Day-over-day row count z-score or EWMA band — catches silent WHERE-clause bugs.

Null rate: Alert when null_fraction(user_id) exceeds its rolling baseline — the classic SDK logging bug.

Uniqueness: Primary-key duplicates after dedupe — an early signal of join explosion.

Accepted values: Enum domain shrinking or expanding — e.g., mobile OS version mapping errors.
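The volume check above can be sketched with a running EWMA band. This is an illustrative detector, not a production library; the smoothing factor `alpha` and band width `k` are tuning assumptions.

```python
# Hedged sketch of the EWMA volume band: flag a day's row count when it
# falls outside mean +/- k*sigma of an exponentially weighted baseline.
def ewma_anomalies(counts: list[float], alpha: float = 0.3, k: float = 3.0) -> list[int]:
    """Return indices of days whose row count breaks the EWMA band."""
    mean, var, flagged = counts[0], 0.0, []
    for i, x in enumerate(counts[1:], start=1):
        sigma = var ** 0.5
        if sigma > 0 and abs(x - mean) > k * sigma:
            flagged.append(i)  # e.g., a silent WHERE-clause bug halving inserts
        # update the exponentially weighted mean and variance
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return flagged
```

A seasonality-aware version would keep separate baselines per weekday profile, or use STL decomposition to remove the weekly component before banding — static thresholds on raw counts are the alert-fatigue trap discussed below.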

In production, every check should include three fields: owner, severity, and runbook URL. Without these, alerting devolves into Slack noise with no clear resolution path. Another practical pattern is seasonality-aware baselines (weekday/weekend profiles) for volume and null rates; static thresholds create alert fatigue and are eventually ignored.
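The three required fields can be made structural rather than tribal by encoding them in the check definition itself. The class name, field names, and URL below are illustrative assumptions, not a real framework.

```python
# Illustrative shape for the three fields every production check should
# carry: owner, severity, and a runbook link (all names are assumptions).
from dataclasses import dataclass

@dataclass(frozen=True)
class CheckSpec:
    name: str
    severity: str      # "P0" pages oncall; lower tiers route to Slack
    owner: str         # team or rotation accountable for triage
    runbook_url: str   # resolution path; without it, alerts are just noise

NULL_RATE_USER_ID = CheckSpec(
    name="null_fraction(user_id) vs rolling baseline",
    severity="P0",
    owner="mobile+de-oncall",
    runbook_url="https://wiki.example.com/runbooks/user-id-nulls",  # placeholder
)
```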

Strong interview answers also include a post-incident loop: after each P0, add one preventive contract or monitor that would have caught it earlier. This closes the reliability feedback loop and demonstrates systems thinking beyond one-off debugging.

Data Quality Layers — From Ingest to BI

[Diagram: data quality layers from ingest to BI]

Check Types — Severity Guide

| Signal | Typical P0? | Detector | Owner |
| --- | --- | --- | --- |
| Freshness breach | Yes — stale gold | Airflow SLA sensor | DE oncall |
| Row count −50% | Yes — join bug | Anomaly on daily inserts | DE |
| Null spike on key | Yes — logging | Column stats monitor | Mobile + DE |
| Slow drift in mean | Maybe — real world | Bayesian change point | DS + DE |
| Schema: column added | No — informative | Registry diff | Platform |
⚠ WARNING

Alert Fatigue Kills Programs

Page only on consumer-facing SLA breaks. Route noisy checks to Slack, auto-ticketing any failure that persists for N consecutive runs — and tune thresholds from historical variance.
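The routing rule above is simple enough to sketch directly. The channel names and the default N are assumptions for illustration.

```python
# Sketch of the alert-routing rule: page only on consumer-facing SLA
# breaks; everything else goes to Slack, escalating to a ticket once a
# check has failed N consecutive runs (N is an assumed default).
def route_alert(consumer_facing_sla_break: bool,
                consecutive_failures: int, n: int = 3) -> str:
    if consumer_facing_sla_break:
        return "page"    # wake someone up: dashboards are wrong right now
    if consecutive_failures >= n:
        return "ticket"  # persistent failure needs an owner, not a thread
    return "slack"       # noisy tier: visible, but not paging
```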

TIP

Interview One-Liner

"I tier tables by business criticality, put dbt uniqueness and not-null on keys, freshness SLAs on orchestration, and row-count EWMA with seasonality for silent failure — then tie each alert to an owner runbook."
