SQL Foundations for Data & ML Interviews: JOINs, Aggregations, and Window Functions
Master the SQL patterns that appear in every data and ML engineer interview — JOINs, GROUP BY, CTEs, and NULL semantics. This guide covers what most tutorials skip: why LEFT JOIN silently drops rows, when HAVING beats WHERE, and how NULL poisons aggregates in ways that hide bugs for months.
Why SQL Foundations Trip Up Strong Engineers
SQL looks easy until you hit a production bug traced back to a LEFT JOIN dropping rows you needed, a COUNT(*) silently including NULLs you meant to exclude, or a subquery running 10,000 times because it was correlated. These aren't academic gotchas — they're exactly what data and ML engineer interviews test.
The interview goal isn't to check if you know SELECT syntax. It's to see whether you think like a data engineer: can you predict what a query returns before running it? Can you spot the off-by-one in a retention cohort query? Can you explain why a GROUP BY query returned more rows than expected?
This guide focuses on the mechanics that produce correct results first, and fast results second.
What Separates a 9/10 Answer from a 6/10 Answer
A 6/10 candidate writes queries that work on the happy path. A 9/10 candidate asks: "What happens when this column has NULLs? What happens when there's no match on the right side of the JOIN? What happens when a user appears in multiple cohorts?"
In interviews, always narrate your NULL handling and your JOIN direction choice. Interviewers at Meta, Airbnb, and Stripe specifically probe whether you know that NULL = NULL is FALSE in SQL — and that this changes your LEFT JOIN behavior fundamentally.
SQL JOIN Types — Which Rows Survive
JOIN Mechanics: What Actually Happens
A JOIN is a filter on a Cartesian product. SQL conceptually generates all row combinations from both tables, then applies the ON condition to keep only matching pairs.
INNER JOIN keeps rows where the ON condition is TRUE. LEFT JOIN keeps all left rows — even those with no right match — filling right columns with NULL. RIGHT JOIN is LEFT JOIN with tables swapped. Never use it — swap your table order instead. FULL OUTER JOIN keeps everything, NULLs on both sides for unmatches.
The anti-join pattern — "find all users who have never placed an order":
Wrong: uses INNER JOIN, returns users WITH orders: SELECT u.user_id FROM users u INNER JOIN orders o ON u.user_id = o.user_id
Correct: LEFT JOIN, then filter for no match: SELECT u.user_id FROM users u LEFT JOIN orders o ON u.user_id = o.user_id WHERE o.order_id IS NULL
The WHERE clause runs AFTER the JOIN, filtering for left-side rows that had no match. This anti-join pattern appears in cohort analysis, churn detection, and feature engineering.
JOINs in Practice: Anti-Join, Self-Join, and CTEs
-- Anti-join: users who have NEVER placed an order
-- Use LEFT JOIN + WHERE right IS NULL, not NOT IN (breaks on NULLs)
SELECT
u.user_id,
u.email,
u.created_at
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE o.order_id IS NULL;
-- Why NOT IN breaks with NULLs:
-- SELECT user_id FROM users WHERE user_id NOT IN (SELECT user_id FROM orders)
-- If orders.user_id contains even ONE NULL, this returns zero rows.
-- NULL compared to anything is UNKNOWN, so NOT IN with NULLs is always UNKNOWN.
-- CTE version of a self-join referral chain (prefer CTEs for readability)
WITH referral_pairs AS (
SELECT
referred_by AS referrer_id,
user_id AS referred_id
FROM users
WHERE referred_by IS NOT NULL
)
SELECT
r1.referrer_id AS level1,
r1.referred_id AS level2,
r2.referred_id AS level3
FROM referral_pairs r1
JOIN referral_pairs r2 ON r2.referrer_id = r1.referred_id;
-- Cohort retention query using CTEs (4 named steps, fully debuggable)
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', MIN(activity_date)) AS cohort_month
FROM user_activity
GROUP BY user_id
),
monthly_activity AS (
SELECT
a.user_id,
uc.cohort_month,
DATE_TRUNC('month', a.activity_date) AS activity_month,
DATEDIFF('month', uc.cohort_month,
DATE_TRUNC('month', a.activity_date)) AS months_since_join
FROM user_activity a
JOIN user_cohorts uc ON a.user_id = uc.user_id
),
cohort_sizes AS (
SELECT cohort_month, COUNT(*) AS cohort_size
FROM user_cohorts
GROUP BY cohort_month
)
SELECT
ma.cohort_month,
ma.months_since_join,
COUNT(DISTINCT ma.user_id)::float / cs.cohort_size AS retention_rate
FROM monthly_activity ma
JOIN cohort_sizes cs USING (cohort_month)
GROUP BY ma.cohort_month, ma.months_since_join, cs.cohort_size
ORDER BY ma.cohort_month, ma.months_since_join;
GROUP BY and HAVING: The Aggregation Layer
GROUP BY collapses all rows with the same key into a single output row. Every column in SELECT must either be in GROUP BY or wrapped in an aggregate (COUNT, SUM, AVG, MAX, MIN).
The SQL execution order determines what is possible where: FROM > JOIN > WHERE > GROUP BY > HAVING > SELECT > ORDER BY > LIMIT
WHERE runs before grouping — it filters individual rows. HAVING runs after grouping — it filters aggregate results.
Use HAVING to filter on aggregate values: HAVING COUNT(*) > 5 (users with more than 5 orders) HAVING SUM(revenue) > 10000 (accounts with high total revenue)
You cannot use HAVING to filter on non-aggregated columns not in GROUP BY — use WHERE for that.
Common interview question: "Find products ordered more than 100 times, only by US users." WHERE user_country = 'US' goes before GROUP BY; HAVING COUNT(*) > 100 goes after. Mixing them up returns wrong counts.
WHERE vs HAVING — When to Use Each
| Scenario | Use WHERE | Use HAVING |
|---|---|---|
| Filter individual rows before grouping | Yes — runs before GROUP BY | No — would re-filter grouped results |
| Filter by aggregate result (count > 10) | No — aggregates don't exist yet | Yes — runs after GROUP BY |
| Filter by a column not in SELECT | Yes — any column works | Only if it's also in GROUP BY |
| Exclude NULL values from analysis | Yes — WHERE col IS NOT NULL | Possible but less efficient |
| Find groups with no members (0 count) | Not possible alone | HAVING COUNT(*) = 0 after LEFT JOIN |
Subqueries vs CTEs: Always Use CTEs
A subquery is a query nested inside another query. A CTE (Common Table Expression) is a named subquery defined in a WITH block. Both produce the same results — CTEs should be your default.
CTEs win on every dimension:
- Readability: named, modular, top-down flow
- Debuggability: each CTE can be tested independently
- Reusability: a CTE defined once can be referenced multiple times in the same query
- Performance: modern optimizers treat CTEs and subqueries identically (PostgreSQL 12+, BigQuery, Snowflake)
The one case to avoid entirely: correlated subqueries. A correlated subquery references a column from the outer query, causing it to execute once per outer row — O(n squared) cost. Always rewrite correlated subqueries as JOINs, CTEs, or window functions.
NULL Traps That Cause Silent Data Loss
NULL in SQL is not zero or empty string — it is the absence of a value. Its behavior is counterintuitive and causes production bugs constantly.
COUNT(*) counts all rows including NULLs. COUNT(column) counts only non-NULL values. These return different numbers.
AVG(column) ignores NULLs — the denominator is only non-NULL rows. SUM(column) ignores NULLs but returns NULL if ALL values are NULL.
NULL in a GROUP BY key: NULLs are grouped together (treated as equal for grouping only). NULL in a WHERE condition: NULL = 5 is UNKNOWN — the row is excluded.
Safe patterns: COALESCE(column, 0) replaces NULL with 0 before aggregating. NULLIF(value, 0) converts 0 to NULL — prevents division-by-zero. WHERE col IS NOT NULL before grouping to exclude NULL groups explicitly.
Interview Answer Template for SQL Questions
When asked a SQL question in an interview, verbalize this checklist before writing:
- What is the grain of the result? (one row per user? per order? per day?)
- What JOIN type do I need? (inner drops unmatched rows — is that what I want?)
- Are there NULLs in join keys or aggregated columns? (how do I handle them?)
- Does my WHERE filter go before or after grouping? (WHERE vs HAVING)
- Is there a correlated subquery I can replace with a window function or CTE?
Narrating this checklist out loud turns "got the right answer" into a "would hire immediately" signal. Systematically thinking through data shapes is what separates data engineers from people who write SQL.
Interview Questions
Click to reveal answersSign in to take the Quiz
This topic has 15 quiz questions with instant feedback and detailed explanations. Sign in to unlock quizzes.
Sign in to take quiz →