Skip to main content
SQL & Databases·Beginner

SQL Foundations for Data & ML Interviews: JOINs, Aggregations, and Window Functions

Master the SQL patterns that appear in every data and ML engineer interview — JOINs, GROUP BY, CTEs, and NULL semantics. This guide covers what most tutorials skip: why LEFT JOIN silently drops rows, when HAVING beats WHERE, and how NULL poisons aggregates in ways that hide bugs for months.

35 min read 10 sections 7 interview questions
SQLJOINsGROUP BYAggregationsCTEsSubqueriesNULL HandlingHAVINGINNER JOINLEFT JOINData EngineeringAnalytics

Why SQL Foundations Trip Up Strong Engineers

SQL looks easy until you hit a production bug traced back to a LEFT JOIN dropping rows you needed, a COUNT(*) silently including NULLs you meant to exclude, or a subquery running 10,000 times because it was correlated. These aren't academic gotchas — they're exactly what data and ML engineer interviews test.

The interview goal isn't to check if you know SELECT syntax. It's to see whether you think like a data engineer: can you predict what a query returns before running it? Can you spot the off-by-one in a retention cohort query? Can you explain why a GROUP BY query returned more rows than expected?

This guide focuses on the mechanics that produce correct results first, and fast results second.

IMPORTANT

What Separates a 9/10 Answer from a 6/10 Answer

A 6/10 candidate writes queries that work on the happy path. A 9/10 candidate asks: "What happens when this column has NULLs? What happens when there's no match on the right side of the JOIN? What happens when a user appears in multiple cohorts?"

In interviews, always narrate your NULL handling and your JOIN direction choice. Interviewers at Meta, Airbnb, and Stripe specifically probe whether you know that NULL = NULL is FALSE in SQL — and that this changes your LEFT JOIN behavior fundamentally.

SQL JOIN Types — Which Rows Survive

Rendering diagram...

JOIN Mechanics: What Actually Happens

A JOIN is a filter on a Cartesian product. SQL conceptually generates all row combinations from both tables, then applies the ON condition to keep only matching pairs.

INNER JOIN keeps rows where the ON condition is TRUE. LEFT JOIN keeps all left rows — even those with no right match — filling right columns with NULL. RIGHT JOIN is LEFT JOIN with tables swapped. Never use it — swap your table order instead. FULL OUTER JOIN keeps everything, NULLs on both sides for unmatches.

The anti-join pattern — "find all users who have never placed an order":

Wrong: uses INNER JOIN, returns users WITH orders: SELECT u.user_id FROM users u INNER JOIN orders o ON u.user_id = o.user_id

Correct: LEFT JOIN, then filter for no match: SELECT u.user_id FROM users u LEFT JOIN orders o ON u.user_id = o.user_id WHERE o.order_id IS NULL

The WHERE clause runs AFTER the JOIN, filtering for left-side rows that had no match. This anti-join pattern appears in cohort analysis, churn detection, and feature engineering.

JOINs in Practice: Anti-Join, Self-Join, and CTEs

sqljoins.sql
-- Anti-join: users who have NEVER placed an order
-- Use LEFT JOIN + WHERE right IS NULL, not NOT IN (breaks on NULLs)
SELECT
  u.user_id,
  u.email,
  u.created_at
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE o.order_id IS NULL;

-- Why NOT IN breaks with NULLs:
-- SELECT user_id FROM users WHERE user_id NOT IN (SELECT user_id FROM orders)
-- If orders.user_id contains even ONE NULL, this returns zero rows.
-- NULL compared to anything is UNKNOWN, so NOT IN with NULLs is always UNKNOWN.

-- CTE version of a self-join referral chain (prefer CTEs for readability)
WITH referral_pairs AS (
  SELECT
    referred_by AS referrer_id,
    user_id     AS referred_id
  FROM users
  WHERE referred_by IS NOT NULL
)
SELECT
  r1.referrer_id  AS level1,
  r1.referred_id  AS level2,
  r2.referred_id  AS level3
FROM referral_pairs r1
JOIN referral_pairs r2 ON r2.referrer_id = r1.referred_id;

-- Cohort retention query using CTEs (4 named steps, fully debuggable)
WITH user_cohorts AS (
  SELECT
    user_id,
    DATE_TRUNC('month', MIN(activity_date)) AS cohort_month
  FROM user_activity
  GROUP BY user_id
),
monthly_activity AS (
  SELECT
    a.user_id,
    uc.cohort_month,
    DATE_TRUNC('month', a.activity_date)          AS activity_month,
    DATEDIFF('month', uc.cohort_month,
      DATE_TRUNC('month', a.activity_date))        AS months_since_join
  FROM user_activity a
  JOIN user_cohorts uc ON a.user_id = uc.user_id
),
cohort_sizes AS (
  SELECT cohort_month, COUNT(*) AS cohort_size
  FROM user_cohorts
  GROUP BY cohort_month
)
SELECT
  ma.cohort_month,
  ma.months_since_join,
  COUNT(DISTINCT ma.user_id)::float / cs.cohort_size AS retention_rate
FROM monthly_activity ma
JOIN cohort_sizes cs USING (cohort_month)
GROUP BY ma.cohort_month, ma.months_since_join, cs.cohort_size
ORDER BY ma.cohort_month, ma.months_since_join;

GROUP BY and HAVING: The Aggregation Layer

GROUP BY collapses all rows with the same key into a single output row. Every column in SELECT must either be in GROUP BY or wrapped in an aggregate (COUNT, SUM, AVG, MAX, MIN).

The SQL execution order determines what is possible where: FROM > JOIN > WHERE > GROUP BY > HAVING > SELECT > ORDER BY > LIMIT

WHERE runs before grouping — it filters individual rows. HAVING runs after grouping — it filters aggregate results.

Use HAVING to filter on aggregate values: HAVING COUNT(*) > 5 (users with more than 5 orders) HAVING SUM(revenue) > 10000 (accounts with high total revenue)

You cannot use HAVING to filter on non-aggregated columns not in GROUP BY — use WHERE for that.

Common interview question: "Find products ordered more than 100 times, only by US users." WHERE user_country = 'US' goes before GROUP BY; HAVING COUNT(*) > 100 goes after. Mixing them up returns wrong counts.

WHERE vs HAVING — When to Use Each

ScenarioUse WHEREUse HAVING
Filter individual rows before groupingYes — runs before GROUP BYNo — would re-filter grouped results
Filter by aggregate result (count > 10)No — aggregates don't exist yetYes — runs after GROUP BY
Filter by a column not in SELECTYes — any column worksOnly if it's also in GROUP BY
Exclude NULL values from analysisYes — WHERE col IS NOT NULLPossible but less efficient
Find groups with no members (0 count)Not possible aloneHAVING COUNT(*) = 0 after LEFT JOIN

Subqueries vs CTEs: Always Use CTEs

A subquery is a query nested inside another query. A CTE (Common Table Expression) is a named subquery defined in a WITH block. Both produce the same results — CTEs should be your default.

CTEs win on every dimension:

  • Readability: named, modular, top-down flow
  • Debuggability: each CTE can be tested independently
  • Reusability: a CTE defined once can be referenced multiple times in the same query
  • Performance: modern optimizers treat CTEs and subqueries identically (PostgreSQL 12+, BigQuery, Snowflake)

The one case to avoid entirely: correlated subqueries. A correlated subquery references a column from the outer query, causing it to execute once per outer row — O(n squared) cost. Always rewrite correlated subqueries as JOINs, CTEs, or window functions.

⚠ WARNING

NULL Traps That Cause Silent Data Loss

NULL in SQL is not zero or empty string — it is the absence of a value. Its behavior is counterintuitive and causes production bugs constantly.

COUNT(*) counts all rows including NULLs. COUNT(column) counts only non-NULL values. These return different numbers.

AVG(column) ignores NULLs — the denominator is only non-NULL rows. SUM(column) ignores NULLs but returns NULL if ALL values are NULL.

NULL in a GROUP BY key: NULLs are grouped together (treated as equal for grouping only). NULL in a WHERE condition: NULL = 5 is UNKNOWN — the row is excluded.

Safe patterns: COALESCE(column, 0) replaces NULL with 0 before aggregating. NULLIF(value, 0) converts 0 to NULL — prevents division-by-zero. WHERE col IS NOT NULL before grouping to exclude NULL groups explicitly.

TIP

Interview Answer Template for SQL Questions

When asked a SQL question in an interview, verbalize this checklist before writing:

  1. What is the grain of the result? (one row per user? per order? per day?)
  2. What JOIN type do I need? (inner drops unmatched rows — is that what I want?)
  3. Are there NULLs in join keys or aggregated columns? (how do I handle them?)
  4. Does my WHERE filter go before or after grouping? (WHERE vs HAVING)
  5. Is there a correlated subquery I can replace with a window function or CTE?

Narrating this checklist out loud turns "got the right answer" into a "would hire immediately" signal. Systematically thinking through data shapes is what separates data engineers from people who write SQL.

Interview Questions

Click to reveal answers
Test your knowledge

Sign in to take the Quiz

This topic has 15 quiz questions with instant feedback and detailed explanations. Sign in to unlock quizzes.

Sign in to take quiz →