
Data Scientist Interview Prep Path

An 8-week structured roadmap for Data Scientist interview preparation — SQL → Stats → Analytics → ML → Scenario questions. Covers what FAANG DS loops actually test, why most candidates fail on window functions and A/B testing math rather than ML theory, and how to decide whether to apply for DS, MLE, or Analytics Engineer roles.


Why the Ordering of This Path Matters

Most DS candidates study in the wrong order: they jump to ML because it feels most relevant to the job title, then scramble through SQL and stats in the last week before their loop. That strategy fails at FAANG — not because ML isn't tested, but because SQL and statistics are tested more frequently and show up in every single round, while deep ML theory appears in only one or two.

The correct ordering is SQL → Stats → Analytics → ML → Scenarios, and each stage builds on the last:

  • SQL first because it's the highest-leverage, fastest-to-improve skill. A candidate who does two focused weeks on SQL will outperform one who has spent six weeks on ML but can't write a window function. SQL also appears in take-home assessments, phone screens, and full loops — it has the highest return on prep time.
  • Stats before ML because ML without statistics is memorizing formulas without understanding them. If you don't know why a p-value is not the probability that the null hypothesis is true, you cannot meaningfully answer A/B test result questions — and those appear in every DS loop at Facebook, Google, and Uber.
  • Analytics before ML because DS interviews at most companies test product sense and metric interpretation at least as heavily as model building. Knowing how to diagnose a DAU drop or perform a funnel analysis is non-negotiable for DS roles, and these are consistently the most-failed sections among candidates with strong ML backgrounds.
  • ML before Scenarios because scenario questions assume you can reason about models, features, and evaluation — you cannot fake that reasoning without the ML foundation.
  • Scenarios last because they synthesize everything: a scenario question like "our 7-day retention dropped 4% after a product change — walk me through your investigation" requires SQL fluency, statistical intuition, product analytics reasoning, and light ML instinct all at once.

Total time budget: 8 weeks for a candidate prepping from a moderate baseline. Candidates with strong SQL or stats backgrounds can compress the relevant stages to 1 week each.

IMPORTANT

What DS Interviewers Are Actually Testing at Each Stage

The common misconception: DS interviews test ML depth. The reality at FAANG:

Phone screen (30–45 min): Almost always SQL + one statistics concept. A weak SQL answer here ends the loop immediately, regardless of ML knowledge.

Take-home (if applicable): SQL query + EDA + one modeling task. Graders evaluate code cleanliness, whether you explained your reasoning, and whether your model evaluation is correct — not whether you used the most sophisticated model.

Full loop — typical DS breakdown:

  • Round 1: SQL (always present)
  • Round 2: Statistics / A/B testing (always present)
  • Round 3: Product analytics / case study (always present)
  • Round 4: ML breadth (supervised learning, evaluation, feature engineering)
  • Round 5: Behavioral / leadership (L5+)

At L4/L5, interviewers are testing: Can you correctly answer a metric drop question under time pressure? Can you design an A/B test with the right sample size? Can you identify the confounds? At L5+, they additionally test: Can you go beyond what the interviewer asked and surface the non-obvious insight?

The 5-Stage Data Scientist Prep Path


Stage 1 — SQL (Week 1)

Master window functions (ROW_NUMBER, RANK, LAG/LEAD, running totals with ROWS BETWEEN), CTEs for multi-step logic, GROUP BY + HAVING, self-joins, and NULL traps (COUNT(*) vs COUNT(col), WHERE after LEFT JOIN converting to INNER JOIN semantics). The non-obvious failure point: most DS candidates fail on window functions + PARTITION BY + running totals, not on basic SELECTs. If you can write a 7-day rolling retention query using LAG() and DATEDIFF without looking anything up, you're ready.
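
As a concreteness check, here is a minimal sketch of that query, assuming a hypothetical logins(user_id, login_date) table. It uses LEAD(), the forward-looking mirror of LAG(), and SQL Server's DATEDIFF argument order (dialects differ):

WITH next_logins AS (
    SELECT user_id,
           login_date,
           LEAD(login_date) OVER (PARTITION BY user_id ORDER BY login_date) AS next_login
    FROM logins
)
SELECT login_date,
       -- share of users active that day who logged in again within 7 days
       AVG(CASE WHEN DATEDIFF(day, login_date, next_login) <= 7
                THEN 1.0 ELSE 0.0 END) AS retained_7d
FROM next_logins
GROUP BY login_date;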


Stage 2 — Statistics (Weeks 2–3)

Cover A/B testing end-to-end: null hypothesis, alternative hypothesis, Type I and Type II errors, p-value interpretation (it is NOT the probability the null is true — this misconception kills candidates), power, sample size calculation (n ≈ 16σ²/Δ² per variant, where σ² is the outcome variance and Δ is the minimum detectable effect, for a two-tailed test at α = 0.05 and 80% power), confidence intervals, and t-tests. Also cover: chi-squared vs t-test (chi-squared for proportions/categorical outcomes; t-test for continuous means), Bayesian vs frequentist framing, and the Gaussian / Poisson / Bernoulli distributions. You must be able to calculate sample size from scratch given the effect size and the current metric.
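
Where the 16 in that rule of thumb comes from: it is the two z-quantiles for α = 0.05 (two-tailed) and 80% power folded into a single constant:

$$n \;=\; \frac{2\,(z_{\alpha/2} + z_{\beta})^2\,\sigma^2}{\Delta^2} \;=\; \frac{2\,(1.96 + 0.84)^2\,\sigma^2}{\Delta^2} \;\approx\; \frac{15.7\,\sigma^2}{\Delta^2} \;\approx\; \frac{16\,\sigma^2}{\Delta^2}$$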


Stage 3 — Analytics (Week 4)

Cohort analysis (define cohort → compute activation → track retention at Day 1/7/28/90), funnel metrics (conversion rate per step, where users drop, statistical significance of funnel changes), DAU/MAU ratio as an engagement indicator, attribution models (last-touch, first-touch, linear, data-driven — and why last-touch overstates bottom-funnel channels). The critical skill: metric drop diagnosis. Given 'DAU dropped 8% yesterday,' walk through: seasonality check, segment breakdown (platform, region, cohort), funnel step isolation, data pipeline validation. Most candidates jump to hypotheses before checking if the data is even correct.
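
A minimal sketch of the cohort-to-retention pattern, assuming hypothetical users(user_id, signup_date) and events(user_id, event_date) tables (DATEDIFF again in SQL Server argument order):

SELECT u.signup_date AS cohort_day,
       COUNT(DISTINCT u.user_id) AS cohort_size,
       -- cohort members active exactly 7 days after signup
       COUNT(DISTINCT CASE WHEN DATEDIFF(day, u.signup_date, e.event_date) = 7
                           THEN e.user_id END) AS retained_day_7
FROM users u
LEFT JOIN events e ON e.user_id = u.user_id
GROUP BY u.signup_date;

Dividing retained_day_7 by cohort_size gives the Day 7 retention rate per cohort; the same shape extends to Day 1/28/90.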


Stage 4 — Machine Learning (Weeks 5–7)

DS interviews test breadth + intuition, not deep theory. Cover supervised learning (logistic regression as a strong baseline, decision trees, random forests, gradient boosting / XGBoost — know that XGBoost beats random forests on most tabular tasks because it learns from residuals iteratively, not independently). Model evaluation: ROC/AUC is threshold-agnostic and right for ranking; precision/recall is right when class imbalance matters; F1 when you need a single metric; log-loss when you care about calibration. Feature engineering: encoding categoricals (one-hot for low-cardinality, target encoding for high-cardinality with regularization), handling missing values (mean/mode imputation vs indicator + fill vs model-based), and leakage detection.
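
Of those evaluation metrics, log-loss is the one worth writing out, since the formula makes the calibration point concrete (a confident wrong prediction, with $\hat{p}_i$ near 1 when $y_i = 0$, is penalized without bound):

$$\text{log-loss} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\,\Big]$$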


Stage 5 — Scenario Questions (Week 8)

Scenario questions combine everything. The most common formats: (1) 'Our metric X dropped — walk me through your investigation,' (2) 'This A/B test shows conflicting results across metrics — what do you recommend?' (3) 'How would you build a model to predict X?' For metric drop questions, always start with data validation before business hypothesis — at least 30% of all 'metric drops' in production are data pipeline errors, not actual drops. For A/B test result questions, check for Sample Ratio Mismatch (SRM) first — if treatment and control sizes don't match the assigned split ratio, the experiment is invalid regardless of the p-value.
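
The SRM check itself is a one-degree-of-freedom chi-squared test. With observed variant counts $n_T, n_C$ and expected counts $E_T, E_C$ implied by the assigned split:

$$\chi^2 = \frac{(n_T - E_T)^2}{E_T} + \frac{(n_C - E_C)^2}{E_C}$$

A common convention, though not a universal one, is to flag SRM when this test's p-value falls below roughly 0.001, since under correct assignment such an imbalance should almost never occur.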

8-Week Time Allocation and Priority by Stage

| Stage | Topic | Duration | Focus Areas | Common Failure Mode |
|---|---|---|---|---|
| Stage 1 | SQL | Week 1 | Window functions, CTEs, self-joins, NULL traps | Candidates only practice basic SELECT/JOIN; fail on PARTITION BY + running totals |
| Stage 2 | Statistics | Weeks 2–3 | A/B testing, sample size, p-value, t-test vs chi-squared, distributions | Misdefine p-value; cannot calculate sample size from scratch |
| Stage 3 | Analytics | Week 4 | Cohort analysis, funnel metrics, DAU/MAU, metric drop diagnosis | Skip data validation step; jump straight to business hypotheses |
| Stage 4 | Machine Learning | Weeks 5–7 | Gradient boosting, ROC/AUC vs precision/recall, feature engineering, leakage | Study ML depth but skip evaluation nuance; confuse AUC with accuracy |
| Stage 5 | Scenarios | Week 8 | Metric drop, A/B result interpretation, model design end-to-end | Give technically correct but non-actionable answers; miss SRM check |

The SQL Specifics That Separate DS Candidates

DS SQL interviews are not testing whether you know SELECT. They're testing whether you can translate a business question into a correct, readable, efficient query — under time pressure and while explaining your reasoning out loud.

The window function gap: In a typical DS SQL interview cohort, roughly 60% of candidates struggle with window functions. The patterns you must be fluent in:

  • Running totals: SUM(revenue) OVER (PARTITION BY user_id ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
  • Row comparison: Using LAG(metric, 1) to compute day-over-day change without a self-join
  • Deduplication: ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) with WHERE rn = 1 to get the most recent record per user (full query shown after this list)
  • Top-N per group: DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) with WHERE dr <= 3
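
The deduplication pattern in full, assuming a hypothetical events(user_id, event_time, payload) table. The CTE is load-bearing: a window alias like rn cannot be referenced in the WHERE clause of the same SELECT:

WITH ranked AS (
    SELECT user_id, event_time, payload,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) AS rn
    FROM events
)
SELECT user_id, event_time, payload
FROM ranked
WHERE rn = 1;  -- exactly one row per user: the most recent event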

The NULL trap: After a LEFT JOIN, filtering on right-side table columns in the WHERE clause converts the join to an INNER JOIN: comparisons against NULL evaluate to UNKNOWN rather than TRUE, and WHERE keeps only rows that evaluate to TRUE, so the unmatched rows the LEFT JOIN preserved are discarded. Interviewers specifically test this. Move right-side filters into the ON clause.
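
A sketch of the trap and the fix, using hypothetical users and orders tables:

-- Broken: the WHERE filter on the right-side table silently drops users
-- with no completed orders, so this behaves like an INNER JOIN.
SELECT u.user_id, COUNT(o.order_id) AS completed_orders
FROM users u
LEFT JOIN orders o ON o.user_id = u.user_id
WHERE o.status = 'completed'
GROUP BY u.user_id;

-- Fixed: moving the filter into the ON clause keeps unmatched users,
-- who now appear with a count of zero.
SELECT u.user_id, COUNT(o.order_id) AS completed_orders
FROM users u
LEFT JOIN orders o ON o.user_id = u.user_id AND o.status = 'completed'
GROUP BY u.user_id;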

The GROUP BY + window hybrid: The question "find users who spent more than the average for their country" requires a subquery or window function — you cannot reference a window function in a WHERE clause directly. Use a CTE:

WITH avg_by_country AS (
    SELECT user_id, country,
           spend,
           AVG(spend) OVER (PARTITION BY country) AS country_avg
    FROM users
)
SELECT user_id FROM avg_by_country WHERE spend > country_avg;

The Statistics Specifics That Separate DS Candidates

A/B testing — what FAANG interviewers actually ask:

  1. Sample size calculation: Given a baseline conversion rate of 5%, a desired minimum detectable effect of 0.5% absolute (10% relative), 80% power, and α=0.05 (two-tailed), how many users per variant do you need? The formula: n = (z_α/2 + z_β)² × 2p(1-p) / Δ². For these values: z_α/2 = 1.96, z_β = 0.84, p = 0.05, Δ = 0.005. Result: ~29,800 users per variant (worked through after this list). If you can't derive this, that's a gap to close before your loop.

  2. The p-value misconception: p-value is the probability of observing this result (or more extreme) assuming the null is true. It is NOT the probability that the null hypothesis is true, and it is NOT the probability that your result is due to chance. Stating the correct definition clearly in an interview signals genuine statistical literacy.

  3. Chi-squared vs t-test: Use chi-squared when both variables are categorical (e.g., click vs no-click × treatment vs control). Use t-test when comparing means of continuous outcomes. Most DS interviews use binary metrics (clicked, converted, churned), which means chi-squared or z-test for proportions, not t-test.

  4. p-hacking / peeking: Checking results daily and stopping the experiment when p < 0.05 inflates the false positive rate far above 5%. Sequential testing methods (e.g., mSPRT) solve this. FAANG interviewers specifically probe whether candidates know this problem exists.
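
Plugging the values from question 1 into that formula, as a sanity check (normal approximation; exact power calculators land slightly higher, near 31,000):

$$n = \frac{(1.96 + 0.84)^2 \times 2 \times 0.05 \times 0.95}{0.005^2} = \frac{7.84 \times 0.095}{0.000025} \approx 29{,}800 \ \text{users per variant}$$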

DS Role vs MLE Role vs Analytics Engineer — What Each Interview Tests

| Dimension | Data Scientist | ML Engineer | Analytics Engineer |
|---|---|---|---|
| Primary skill emphasis | SQL + Stats + Product analytics + ML breadth | ML systems + model architecture + distributed training + serving | SQL + dbt + data modeling + dashboard design |
| SQL depth | High — medium-hard window functions, multi-step CTEs | Moderate — knows it but not the primary signal | Very high — production-grade query optimization, data pipeline SQL |
| Statistics depth | High — A/B testing, sample size, distributions | Moderate — knows fundamentals | Low to moderate |
| ML depth | Breadth + intuition over supervised learning | Deep — transformers, embeddings, training infrastructure | Low — knows when to use ML, doesn't build models |
| System design | Light — data pipeline awareness | Heavy — ML system design, feature stores, serving infra | Moderate — data warehouse design, ELT pipelines |
| Typical at FAANG | Meta DS, Google DS, Uber DS | Meta MLE, Google SWE-ML, Amazon ML Scientist | Airbnb AE, Stripe DE, Shopify AE |
| Biggest mistake | Applying for MLE roles without systems knowledge | Applying to DS roles and underselling analytics fluency | Applying to DS roles without ML interview prep |
⚠ WARNING

The Gaps That Cause DS Candidates to Fail at FAANG

Gap 1 — Window function fluency: Candidates who only practice basic SQL fail the phone screen. The specific gap is always window functions + PARTITION BY — practice until you can write LAG(), ROW_NUMBER(), and running totals from memory.

Gap 2 — A/B test math under pressure: Interviewers ask you to calculate sample size or explain confidence intervals on the spot. Most candidates have conceptual understanding but cannot compute. Practice deriving sample size formulas without a calculator.

Gap 3 — Metric drop diagnosis ordering: Most candidates jump to business hypotheses before checking data validity. In production, 30–40% of apparent metric drops trace to data pipeline errors. Always state: "First I'd check if this is a real drop or a logging/pipeline issue" — this signals production experience.

Gap 4 — Applying for the wrong role: DS and MLE job descriptions overlap significantly, but the interviews are very different. DS interviews are analytics + stats heavy; MLE interviews are systems + ML architecture heavy. Candidates optimize for the wrong prep based on the job title alone — read the job description carefully and calibrate.

Gap 5 — Confusing AUC with accuracy: "My model has 0.85 AUC" and "my model has 85% accuracy" are completely different claims, and conflating them in an interview signals shallow evaluation understanding. Know when each metric is appropriate and why.

TIP

Interview Readiness Signals: How to Know You're Prepared

You are ready for a FAANG DS phone screen when you can:

  • Write a 7-day rolling retention query from scratch using LAG() and window frames in under 5 minutes
  • Calculate A/B test sample size from the formula given baseline rate, MDE, power, and significance level
  • Correctly define p-value in one sentence without saying "probability that the result is due to chance"
  • Walk through metric drop diagnosis in order: data validation → segment breakdown → funnel → hypothesis

You are ready for the full DS loop when you can:

  • Explain why XGBoost outperforms random forests on most tabular tasks (sequential residual learning vs independent trees)
  • State when to use ROC/AUC vs precision/recall vs F1 vs log-loss with concrete examples of each
  • Design an attribution model from scratch and explain what's wrong with last-touch attribution at a company with long purchase consideration cycles
  • Give a non-hedged ship/no-ship recommendation on an A/B test where the primary metric and a guardrail metric conflict
