Sections
Related Guides
Window Functions for Analytics: ROW_NUMBER, RANK, LAG/LEAD, and Running Totals
SQL & Databases
SQL Query Optimization: Indexes, Query Plans, and Performance at Scale
SQL & Databases
Subqueries and CTEs: WITH Clauses, Correlated Subqueries, and Recursive Patterns
SQL & Databases
A/B Testing & Experimentation at Scale
Machine Learning
SQL Joins & Aggregations: INNER, LEFT, GROUP BY, and Aggregation Pitfalls
Joins and GROUP BY form the backbone of every SQL interview. This guide covers the four join types with precise NULL semantics, GROUP BY with HAVING, the COUNT(*) vs COUNT(col) distinction that fails most candidates, multi-table aggregation patterns, and the self-join technique for hierarchical and comparison queries asked at Meta, Stripe, and Airbnb.
Why Joins and Aggregations Are the SQL Interview Foundation
Every SQL interview question — from "find users who have never ordered" to "compute revenue by category" — reduces to two operations: joining tables to combine related data, and aggregating rows to compute summaries. Master these two, and ~80% of SQL interview questions become straightforward.
The interview is not testing your ability to remember syntax. It is testing:
- Join type selection: knowing which join to use when — specifically when to use LEFT JOIN and when its absence is a bug (not an oversight)
- NULL propagation: understanding that LEFT JOIN produces NULLs in the right-table columns for unmatched rows, and using this deliberately for "find records with no match" patterns
- Aggregation correctness: knowing that COUNT(*) and COUNT(col) return different results, that HAVING filters after aggregation while WHERE filters before, and that GROUP BY must include all non-aggregated SELECT columns
At Meta data interviews, ~60% of questions require a LEFT JOIN to find gaps (users without purchases, events without conversions). At Stripe, multi-table aggregations with implicit join order determine whether you understand query plans.
What the Interviewer Is Actually Testing
Interviewers test three join-related signals:
-
Do you reach for LEFT JOIN when the question involves "users who have not done X"? Most candidates use INNER JOIN and miss all the nulls. LEFT JOIN + WHERE right_table.id IS NULL is the pattern — it appears in ~30% of hard SQL interview questions.
-
Do you know that COUNT(col) ignores NULLs but COUNT() counts all rows? The difference matters in LEFT JOIN results: COUNT(order_id) counts only matched users; COUNT() would count all users including unmatched ones with NULL order_ids.
-
Do you know when HAVING is required vs WHERE? WHERE runs before GROUP BY. HAVING runs after. Filtering on an aggregate (e.g., "users with more than 3 orders") requires HAVING, not WHERE.
SQL Join Type Decision Tree
INNER JOIN vs LEFT JOIN — The NULL Semantics That Matter
INNER JOIN returns only rows where the join condition matches in both tables. Any row in the left table with no matching row in the right table is dropped silently.
LEFT JOIN (LEFT OUTER JOIN) returns all rows from the left table. For rows with no match in the right table, all right-table columns are NULL.
The silent drop problem: "List all users and their total purchase amounts" sounds like an INNER JOIN query but it's not. Users with no purchases should appear with amount = 0 or NULL. An INNER JOIN returns only users who have purchased — a plausible but incorrect result.
Correct pattern: SELECT u.user_id, COALESCE(SUM(o.amount), 0) AS total_spend FROM users u LEFT JOIN orders o ON u.user_id = o.user_id GROUP BY u.user_id
The anti-join pattern ("users who have never ordered"): SELECT u.user_id, u.email FROM users u LEFT JOIN orders o ON u.user_id = o.user_id WHERE o.order_id IS NULL -- NULLs only exist for users with no matching orders
This is more efficient than NOT IN (vulnerable to NULLs) and NOT EXISTS (correct but verbose). The LEFT JOIN + IS NULL approach is the standard interview answer.
Join Patterns: Left Join, Anti-Join, and Self-Join
-- Pattern 1: All users with their order count (LEFT JOIN for users with 0 orders)
SELECT
u.user_id,
u.email,
COUNT(o.order_id) AS order_count, -- COUNT(col) ignores NULLs → correct count
COALESCE(SUM(o.amount), 0) AS total_spend
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
GROUP BY u.user_id, u.email
ORDER BY total_spend DESC;
-- Pattern 2: Anti-join — users who have NEVER placed an order
SELECT u.user_id, u.email, u.created_at
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE o.order_id IS NULL; -- only unmatched users have NULL here
-- Alternative: NOT EXISTS (equivalent, sometimes faster with the right index)
SELECT u.user_id, u.email
FROM users u
WHERE NOT EXISTS (
SELECT 1 FROM orders o WHERE o.user_id = u.user_id
);
-- Pattern 3: Self-join — find employees who earn more than their manager
SELECT
e.employee_id,
e.name AS employee_name,
e.salary AS employee_salary,
m.name AS manager_name,
m.salary AS manager_salary
FROM employees e
JOIN employees m ON e.manager_id = m.employee_id -- same table, aliased twice
WHERE e.salary > m.salary;
-- Pattern 4: Multi-table join — orders with product and user details
SELECT
o.order_id,
u.email,
p.product_name,
oi.quantity,
oi.unit_price,
oi.quantity * oi.unit_price AS line_total
FROM orders o
JOIN users u ON o.user_id = u.user_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.status = 'completed'
AND o.created_at >= '2024-01-01'
ORDER BY o.order_id, oi.order_item_id;
GROUP BY and HAVING — Aggregation Mechanics
GROUP BY collapses all rows with the same grouping values into a single output row. Every column in the SELECT list must either be in GROUP BY or wrapped in an aggregate function (COUNT, SUM, AVG, MIN, MAX). Violating this rule is a syntax error in most databases.
Key rules:
- GROUP BY runs after WHERE but before SELECT
- HAVING filters groups after GROUP BY (equivalent to WHERE, but for aggregated results)
- You can GROUP BY an expression: GROUP BY DATE_TRUNC('month', created_at)
- You can GROUP BY multiple columns: GROUP BY user_id, product_category
COUNT() vs COUNT(col): COUNT() — counts all rows in the group, including those with NULLs in any column COUNT(col) — counts only rows where col is NOT NULL COUNT(DISTINCT col) — counts unique non-NULL values
This distinction matters in LEFT JOIN queries: COUNT(*) would count users with no orders (since there's still a row for them), while COUNT(order_id) correctly returns 0 for users with no orders because their order_id is NULL.
HAVING vs WHERE: WHERE event_type = 'purchase' → filters rows BEFORE grouping (only purchase rows enter GROUP BY) HAVING COUNT() > 3 → filters groups AFTER grouping (only groups with >3 rows kept) WHERE COUNT() > 3 → SYNTAX ERROR — aggregate functions are not allowed in WHERE
GROUP BY, HAVING, and Aggregation Patterns
-- Basic GROUP BY with HAVING (users with more than 3 purchases in last 90 days)
SELECT
user_id,
COUNT(*) AS purchase_count,
SUM(amount) AS total_spend,
AVG(amount) AS avg_order_value,
MAX(amount) AS largest_order,
MIN(created_at) AS first_purchase,
MAX(created_at) AS latest_purchase
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '90 days'
AND status = 'completed' -- filter BEFORE grouping
GROUP BY user_id
HAVING COUNT(*) > 3 -- filter AFTER grouping
ORDER BY total_spend DESC;
-- Multi-dimensional GROUP BY: revenue by month and category
SELECT
DATE_TRUNC('month', created_at) AS revenue_month,
category,
COUNT(DISTINCT user_id) AS unique_buyers,
COUNT(order_id) AS total_orders,
SUM(amount) AS total_revenue,
SUM(amount) / COUNT(DISTINCT user_id) AS revenue_per_buyer
FROM orders
WHERE created_at >= '2024-01-01'
GROUP BY DATE_TRUNC('month', created_at), category
ORDER BY revenue_month, total_revenue DESC;
-- COUNT(*) vs COUNT(col) in a LEFT JOIN context
-- Correct: COUNT(order_id) = 0 for users with no orders
SELECT
u.user_id,
COUNT(o.order_id) AS orders_placed, -- 0 for users with no orders
COUNT(*) AS row_count -- 1 for every user (wrong for this use case)
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
GROUP BY u.user_id;
-- GROUP BY with ROLLUP (subtotals + grand total — PostgreSQL, MySQL 8+, BigQuery)
SELECT
COALESCE(category, 'ALL CATEGORIES') AS category,
COALESCE(region, 'ALL REGIONS') AS region,
SUM(revenue) AS total_revenue
FROM sales
GROUP BY ROLLUP(category, region)
ORDER BY category NULLS LAST, region NULLS LAST;
Aggregation Function Reference
| Function | What It Counts | NULL Behavior | Common Use Case |
|---|---|---|---|
| COUNT(*) | All rows in group | Includes rows where any col is NULL | Row count, order count per user |
| COUNT(col) | Rows where col is NOT NULL | Ignores NULLs | Count non-null events after LEFT JOIN |
| COUNT(DISTINCT col) | Unique non-NULL values of col | Ignores NULLs | Unique users, unique products |
| SUM(col) | Sum of non-NULL values | Ignores NULLs | Total revenue, total spend |
| AVG(col) | Average of non-NULL values | Ignores NULLs — divides by non-NULL count | Average order value |
| MAX(col) / MIN(col) | Extreme value in group | Ignores NULLs | Latest event date, highest score |
Self-Joins: Hierarchical Queries and Comparison Patterns
A self-join joins a table to itself using two different aliases. This is required for:
- Hierarchical queries: employees and their managers (same employees table)
- Comparison queries: find rows where a value is greater than another row's value in the same table
- Consecutive pair detection: sessions where event N is immediately followed by event M
The pattern: alias the same table twice and join on the relationship column.
Employee hierarchy: FROM employees e JOIN employees m ON e.manager_id = m.employee_id
The e alias represents the employee, the m alias represents the manager. Both come from the same table. You can now compare e.salary to m.salary, or select m.name as manager_name.
Common interview variant: "Find all employees whose salary exceeds the average salary in their own department." This requires a self-join or correlated subquery — GROUP BY alone cannot compare a row's value to its own group's aggregate.
Self-join solution: SELECT e.employee_id, e.name, e.salary, dept_avg.avg_salary FROM employees e JOIN ( SELECT department_id, AVG(salary) AS avg_salary FROM employees GROUP BY department_id ) dept_avg ON e.department_id = dept_avg.department_id WHERE e.salary > dept_avg.avg_salary
Join Mistakes That Kill SQL Interviews
Mistake 1: INNER JOIN when LEFT JOIN is required. "List all users with their last login" — if some users have never logged in, INNER JOIN drops them. The result looks complete but is wrong. Develop the habit: any time the question says "all X, even those without Y", reach for LEFT JOIN.
Mistake 2: Forgetting WHERE right_table.id IS NULL in anti-joins. A LEFT JOIN alone returns all users including those with matches. You need the IS NULL filter to isolate unmatched rows.
Mistake 3: WHERE with an aggregate function. "Find users with more than 10 orders" written as WHERE COUNT(orders) > 10 is a syntax error. Move it to HAVING.
Mistake 4: Grouping without including all non-aggregated SELECT columns in GROUP BY. If you SELECT user_id, email, COUNT(*) FROM orders GROUP BY user_id — the email column is not in GROUP BY, which is a syntax error in strict-mode databases (PostgreSQL) and silently wrong in MySQL's default mode.
Mistake 5: Using NOT IN when the subquery can return NULLs. NOT IN with any NULL in the list returns no rows (due to three-valued SQL logic). Use NOT EXISTS or LEFT JOIN + IS NULL instead.
Join and Aggregation Interview Template
When building a join query:
- Identify your base table (the "all X" table goes on the LEFT)
- Choose join type: match required → INNER; match optional → LEFT; looking for no-match → LEFT + IS NULL
- Write the ON clause precisely: join on the foreign key, not a fuzzy condition
- Apply WHERE filters before GROUP BY (row-level filters)
- GROUP BY all non-aggregated SELECT columns
- Apply HAVING for aggregate filters (after grouping)
Quick sanity check:
- Does your result include users/entities with 0 events? → You need LEFT JOIN
- Are you looking for "X that never did Y"? → LEFT JOIN + WHERE right.id IS NULL
- Filtering on COUNT/SUM? → HAVING, not WHERE
- SELECT col not in GROUP BY? → wrap it in MIN(col) or move it to GROUP BY