Scenario Walkthrough: Engagement vs Revenue — Guardrails & Horizon
The highest-signal PM/DS tradeoff: a surface, ranking, or growth lever lifts a leading engagement input while threatening RPM, ARPU, chargebacks, or long-horizon retention. Learn to express a *constrained* objective, pre-register guardrails, separate short-window novelty from LTV, and run readouts the way strong experimentation orgs do—not as a single p-value on one chart.
Why This Is Not a 'Balance the Two' Question
Engagement (time-on-site, sessions, DAU) is a manipulable input on short horizons. Revenue and profit are often lagging and carry variance that a 7-day A/B is underpowered to pin down. A feed team can maximize watch-time in ways that destroy ad inventory quality, push down RPM, and increase refunds—sometimes while raising a naive CTR.
Strong interview answers refuse a single-hero-metric story. They define an OEC (overall evaluation criterion) with explicit guardrails—a maximum acceptable regression on ARPU, refund rate, trust-and-safety incidents, and latency—and they talk about horizon: the experiment duration needed before you would even expect to see certain harms. Weak answers say "we will monitor revenue" without a stopping rule, a duration, and an owner.
Depth Ladder — Mid, Senior, Staff
- **Mid:** names primary and secondary metrics; "we should look at both engagement and money."
- **Senior:** pre-registered guardrails; segment-level ARPU; discusses novelty effects and underpowered revenue metrics; suggests CUPED or a longer run for precision.
- **Staff:** organizational incentives and cannibalization between surfaces; ecosystem effects (creator supply, ad market depth); defines governance—how exec exceptions work when the primary lifts but a guardrail wiggles; ties to holdouts and long-run learning-to-optimize failure modes.
Goodhart, Gaming, and the Metric Stack
Goodhart's law is not philosophy club—it is a release-review reality. If PMs or ML teams are bonused on a single engagement scalar, the system (human + model) will find degenerate equilibria: endless notifications, low-quality recency bias, or dark-pattern friction removal that increases short clicks and refunds.
Production orgs answer with a stack: a primary (often a growth or experience objective), 2–4 guardrails with redlines ("do not launch if..."), and diagnostic metrics to explain mechanism (ad CTR, ad depth, time-to-first-ad, return rate, chargeback rate, CS contacts). The interview wants you to name a mechanism when engagement and money diverge—composition (low-value time), auction pressure, or ad relevance—not vibes.
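As a concrete sketch, the metric stack can be as small as a config object that readout tooling consumes. Everything below is hypothetical: the metric names, redlines, and horizons are illustrative, not a standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Guardrail:
    name: str            # metric key in the readout pipeline
    redline: float       # max tolerated relative regression
    horizon_days: int    # days before this metric is even interpretable

# Hypothetical pre-registration for a feed-ranking experiment.
# All deltas oriented so that negative = worse (e.g. refund rate
# enters as its negation).
PRIMARY = "sessions_per_user"          # the one OEC, not a dashboard
GUARDRAILS = [
    Guardrail("arpu",             redline=-0.010, horizon_days=28),
    Guardrail("refund_rate",      redline=-0.005, horizon_days=28),
    Guardrail("p95_latency_ms",   redline=-0.050, horizon_days=7),
    Guardrail("ts_incident_rate", redline=-0.000, horizon_days=14),
]
# Diagnostics explain mechanism; they never gate the launch by themselves.
DIAGNOSTICS = ["ad_ctr", "ad_depth", "time_to_first_ad", "cs_contacts"]
```

The point of writing it down before launch is that the redlines and horizons cannot be renegotiated after the numbers come in.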
Cannibalization, Surfaces, and the Portfolio Problem
A single-surface win can be a portfolio loss: more time in one tab that steals from higher-RPM moments, a checkout experiment that steers demand toward low-margin SKUs, or a ranking change that overweights cheap UGC and hollows out the premium supply your ads target.
Staff candidates say cannibalization out loud and ask whether the A/B is isolated or portfolio-interfering. If the same user sees multiple experiments or multiple ranking objectives, the SUTVA (stable unit treatment value assumption) story breaks; you need stricter readouts, switchback-style or stratified designs for network-heavy products, and longer washout when treatment effects are sticky.
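A minimal sketch of switchback-style assignment, assuming a deterministic hash over (unit, time bucket); the unit here is a market or region rather than a user, and the salt and bucket size are illustrative.

```python
import hashlib
from datetime import datetime

def switchback_arm(unit_id: str, ts: datetime, salt: str = "exp_42",
                   bucket_hours: int = 4) -> str:
    """Assign treatment by (unit, time bucket) instead of by user, so
    interference within a time window hits one arm at a time."""
    bucket = int(ts.timestamp() // (bucket_hours * 3600))
    key = f"{salt}:{unit_id}:{bucket}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return "treatment" if h % 2 == 0 else "control"
```

Because assignment flips per time bucket, the readout aggregates at the (unit, bucket) level, not the user level, and standard errors must respect that clustering; sticky treatment effects also argue for a washout period between buckets.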
Guardrail-First Readout — Steps You Can Narrate in 8 Minutes
1. **Clarify the decision class.** Ship a new UX, reweight a ranker, change ad density, or reprice? Who is exposed—all users, new users only, a single country?
2. **Pre-specify OEC + guardrails + duration.** One primary. Guardrails: ARPU or RPM, refund or chargeback rate, D7/D30 if that is the pre-registered retention proxy, p95 latency. Pre-registration beats fishing in 20 metrics.
3. **Check experiment hygiene.** SRM on assignment (see the sketch after this list), novelty/learning effects (early lift may invert), and whether revenue metrics have enough events in the test window. Mention CUPED for precision when the baseline is stable.
4. **Decompose divergence.** If engagement is up and ARPU is down, split by inventory type, geography, and tenure (whale vs long tail). A flat ARPU in aggregate can hide harm in whales—often the revenue story finance cares about.
5. **Quantify the trade in business terms.** Convert to an expected revenue range per 1,000 users, not a false point estimate, and weigh it against rollout cost and reversibility.
6. **Decision rule.** Default: no broad ship if a pre-registered guardrail crosses a *redline*. Escalation path for an exec exception with written risk ownership. Follow-up: targeted redesign, constrained ranker, or a longer test.
7. **If you launch under uncertainty.** Staged percentage rollout, kill criteria on guardrails, and automatic reversion to the champion when thresholds trip.
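The SRM check in step 3 is a one-liner worth knowing cold. A minimal sketch using a chi-square goodness-of-fit test; the counts below are invented for illustration.

```python
from scipy.stats import chisquare

def srm_check(n_control: int, n_treatment: int,
              expected_split=(0.5, 0.5), alpha: float = 0.001):
    """Sample-ratio-mismatch check: compare observed assignment counts
    to the expected split. Use a strict alpha; SRM is a bug detector,
    not a hypothesis test you want to pass marginally."""
    total = n_control + n_treatment
    expected = [total * expected_split[0], total * expected_split[1]]
    stat, p = chisquare([n_control, n_treatment], f_exp=expected)
    return {"chi2": stat, "p_value": p, "srm_detected": p < alpha}

# A 50/50 experiment that drifted to 50.3% treatment:
print(srm_check(497_000, 503_000))
# p ~ 2e-9 -> fix the assignment bug before reading any metric.
```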
*Figure: Guardrail-first readout (conceptual—not a single p-value game).*
*Figure: Goodhart loop—incentive → shortcut → ecosystem harm.*
Worked Example — +5% Time on Site, -2% RPM, Both 'Significant' at 7d
- **Hygiene first:** SRM clean; exposure balanced; confirm the RPM drop is not merely an artifact of a few outlier CIDs.
- **Mechanism check:** not just "ads bad"—split by ad depth, inventory mix, and sponsored vs organic engagement. A common story: the redesign increases time in *low-yield inventory*; RPM falls because impression quality drops even as raw time rises.
- **Duration honesty:** 7d may be enough to see the direction of the RPM harm; 28d may be required for refunds and repeat purchase. Say it plainly: "We are confident in the direction of the short-term revenue harm; we are underpowered on 28d LTV—default is no broad ship unless an exec takes named risk and we stage the rollout with a kill switch on RPM."
- **What impresses:** a constrained second experiment that directly optimizes session revenue or ad relevance subject to a time-on-site floor—not a retread of the same UX.
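To make duration honesty quantitative, a back-of-envelope sample-size calculation shows why a 2% revenue change is hard to pin down in a week. A sketch using the standard two-sample normal approximation; the means and SDs are invented for illustration.

```python
from math import ceil
from scipy.stats import norm

def required_n_per_arm(rel_mde: float, mean: float, sd: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Sample size per arm (normal approximation) to detect a relative
    change rel_mde in a per-user metric with the given mean and sd."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    delta = rel_mde * mean
    return ceil(2 * (z * sd / delta) ** 2)

# Invented numbers: ARPU mean $1.20, sd $9.00 (revenue is heavy-tailed).
# Detecting a 2% ARPU drop needs ~2.2M users per arm; a 7d test on a few
# hundred thousand users is directionally suggestive at best.
print(required_n_per_arm(0.02, 1.20, 9.00))
```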
Primary Up / Guardrail Down — What to Inspect (Mechanism Map)
| Pattern | What likely broke (mechanism) | Next cuts for evidence |
|---|---|---|
| Engagement up, RPM down | Lower-quality inventory time; ad relevance | Sponsored CTR, eCPM, ad depth by page |
| Faster checkout, more GMV, lower profit | Margin mix, promo abuse | SKU mix, return rate, fraud signals |
| Feed time up, D30 ARPU flat or down | Time shifted to non-monetizable content | Revenue per minute of attention by category |
| Push/notification lift, later unsub spike | Fatigue, permission churn | Opt-out curve, 28d message volume |
| Aggressive rec for watch time, creator exodus 90d out | Supply ecosystem harm | Creator revenue share, upload trend by cohort |
| Latency regresses — engagement still up on slow networks | You may be *hiding* user harm in p50 only | p95, region slices, and guard on Core Web Vitals |
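For the "next cuts" column, the decomposition is mechanical once you have per-user revenue and arm labels. A pandas sketch, assuming a hypothetical frame with one row per user and columns `arm`, `revenue`, and the segment of interest.

```python
import pandas as pd

def arpu_by_segment(df: pd.DataFrame, segment_col: str) -> pd.DataFrame:
    """Split the ARPU delta by a segment: inventory type, geo, tenure,
    or spender decile. Assumes arm values 'control' / 'treatment'."""
    g = (df.groupby([segment_col, "arm"])["revenue"]
           .agg(["mean", "count"])
           .unstack("arm"))
    out = pd.DataFrame({
        "arpu_control":   g[("mean", "control")],
        "arpu_treatment": g[("mean", "treatment")],
        "n_control":      g[("count", "control")],
        "n_treatment":    g[("count", "treatment")],
    })
    out["rel_delta"] = out["arpu_treatment"] / out["arpu_control"] - 1
    return out.sort_values("rel_delta")

# A flat aggregate ARPU with rel_delta near -6% in the top spender decile
# is exactly the hidden-harm pattern the table above describes.
```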
CUPED, Power, and What You Can Honestly Claim in a Week
CUPED (Deng, Xu, et al. at Microsoft) reduces variance in A/B tests by using a pre-period covariate; it is standard vocabulary for senior DS. The interview point is not the formula—it is: "We should not pretend a noisy 7d ARPU with wide CIs is a license to ship when the guardrail is a safety constraint."
What impresses: you separate statistical significance on a short window from business sufficiency. If ARPU is a hard guardrail, you either extend the test, raise power with CUPED/covariates, or reframe the launch as staged with automatic rollback. You do not hand-wave "we will watch monthly ARPU" without an owner and a trigger.
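For reference, the adjustment itself is two lines. A sketch with theta estimated on pooled data, following the Deng et al. formulation.

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """CUPED: remove the part of the in-experiment metric y that is
    linearly predictable from a pre-period covariate x_pre.
    theta = cov(x, y) / var(x); the pooled mean is unchanged, while
    variance shrinks by roughly corr(x, y)**2."""
    theta = np.cov(x_pre, y, ddof=1)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())
```

Estimate theta on pooled arms, adjust, then run the usual two-sample test on the adjusted values. A pre-period correlation of 0.7 cuts variance roughly in half, which is often the difference between a four-week and a two-week test at the same power.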
No-Ship and Maybe-Ship Triggers (Language That Sounds Like a Grown-Up Org)
- **Hard no-ship (typical):** any pre-registered guardrail crosses a redline (ARPU, chargebacks, refunds, latency SLO) with enough power to trust the direction—not necessarily z = 3 at 7d, but a consistent story plus a mechanism.
- **Maybe-ship (with exec sign-off):** primary strong, guardrail borderline and noisy—staged launch to 5% for two weeks, a champion path that auto-reverts, and a named business owner for the downside.
- **Iterate-first:** primary up but the mechanism shows toxic engagement—redesign a variant that preserves the engagement lift in *high-RPM moments only*.
- **Portfolio conflict:** another team's metric moves because of your test—synchronize readouts; use clustered or switchback designs for interference-heavy surfaces.
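These triggers can be written as an explicit decision function so the readout is mechanical rather than negotiated in the meeting. A hypothetical sketch; the field names and sign convention are illustrative.

```python
def readout_decision(primary_lift: float, guardrails: dict) -> str:
    """Mechanical form of the triggers above. Each guardrail maps a name to
    {'delta': observed relative change (negative = regression),
     'redline': pre-registered limit (negative),
     'powered': True if there are enough events to trust the direction}."""
    breached = {k: g for k, g in guardrails.items()
                if g["delta"] < g["redline"]}
    if any(g["powered"] for g in breached.values()):
        return "no-ship: guardrail redline crossed with power to trust it"
    if primary_lift > 0 and breached:
        return "maybe-ship: staged 5% rollout, auto-revert, named risk owner"
    if primary_lift > 0:
        return "ship: primary up, all guardrails inside redlines"
    return "iterate: no primary win to justify any guardrail risk"

# Example: +5% primary, ARPU down 2% against a -1% redline, underpowered.
print(readout_decision(0.05, {
    "arpu":        {"delta": -0.020, "redline": -0.010, "powered": False},
    "p95_latency": {"delta": -0.001, "redline": -0.050, "powered": True},
}))  # -> maybe-ship
```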
Anti-Patterns in Real Readouts
- P-hacking across 20 metrics without FDR control or pre-registration—your "5% false positive" is a fiction.
- Novelty lift in week one treated as steady state—many UX wins invert when learning effects burn off.
- "We will monitor" without a kill switch and an SLO is how revenue harm lands in Q3.
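The FDR fix for the first anti-pattern is the Benjamini–Hochberg step-up procedure, which is short enough to implement inline. A sketch:

```python
import numpy as np

def benjamini_hochberg(p_values: list, q: float = 0.05) -> list:
    """BH step-up: controls the false discovery rate across a metric
    dashboard, instead of pretending 20 tests at alpha = 0.05 give a
    5% false-positive rate. Returns a reject/keep flag per metric."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    # Largest k such that p_(k) <= q*k/m; reject the k smallest p-values.
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject.tolist()

# 20 p-values from one readout: only those flagged True count as
# discoveries; everything else is dashboard noise.
```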
One Staff-Level Sentence
"I don't maximize engagement. I maximize a pre-registered business objective subject to explicit revenue, trust, and latency constraints, with a stated horizon for each—because the shortcut that wins the 7d chart is often the one that taxes LTV in week six."