Sections

0/11

Related Guides

Window Functions for Analytics: ROW_NUMBER, RANK, LAG/LEAD, and Running Totals

SQL & Databases

40m

SQL Query Optimization: Indexes, Query Plans, and Performance at Scale

SQL & Databases

35m

Subqueries and CTEs: WITH Clauses, Correlated Subqueries, and Recursive Patterns

SQL & Databases

38m

A/B Testing & Experimentation at Scale

Machine Learning

70m

← Back to Library

SQL & Databases·Intermediate

SQL Joins & Aggregations: INNER, LEFT, GROUP BY, and Aggregation Pitfalls

Joins and GROUP BY form the backbone of every SQL interview. This guide covers the four join types with precise NULL semantics, GROUP BY with HAVING, the COUNT(*) vs COUNT(col) distinction that fails most candidates, multi-table aggregation patterns, and the self-join technique for hierarchical and comparison queries asked at Meta, Stripe, and Airbnb.

35 min read 11 sections 7 interview questions

SQL JoinsINNER JOINLEFT JOINGROUP BYHAVINGAggregationsCOUNTSelf JoinNULL SemanticsMulti-table SQLInterview SQLData Analysis

Why Joins and Aggregations Are the SQL Interview Foundation

Every SQL interview question — from "find users who have never ordered" to "compute revenue by category" — reduces to two operations: joining tables to combine related data, and aggregating rows to compute summaries. Master these two, and ~80% of SQL interview questions become straightforward.

The interview is not testing your ability to remember syntax. It is testing:

Join type selection: knowing which join to use when — specifically when to use LEFT JOIN and when its absence is a bug (not an oversight)
NULL propagation: understanding that LEFT JOIN produces NULLs in the right-table columns for unmatched rows, and using this deliberately for "find records with no match" patterns
Aggregation correctness: knowing that COUNT(*) and COUNT(col) return different results, that HAVING filters after aggregation while WHERE filters before, and that GROUP BY must include all non-aggregated SELECT columns

At Meta data interviews, ~60% of questions require a LEFT JOIN to find gaps (users without purchases, events without conversions). At Stripe, multi-table aggregations with implicit join order determine whether you understand query plans.

IMPORTANT

What the Interviewer Is Actually Testing

Interviewers test three join-related signals:

Do you reach for LEFT JOIN when the question involves "users who have not done X"? Most candidates use INNER JOIN and miss all the nulls. LEFT JOIN + WHERE right_table.id IS NULL is the pattern — it appears in ~30% of hard SQL interview questions.
Do you know that COUNT(col) ignores NULLs but COUNT() counts all rows? The difference matters in LEFT JOIN results: COUNT(order_id) counts only matched users; COUNT() would count all users including unmatched ones with NULL order_ids.
Do you know when HAVING is required vs WHERE? WHERE runs before GROUP BY. HAVING runs after. Filtering on an aggregate (e.g., "users with more than 3 orders") requires HAVING, not WHERE.

SQL Join Type Decision Tree

Rendering diagram...

INNER JOIN vs LEFT JOIN — The NULL Semantics That Matter

INNER JOIN returns only rows where the join condition matches in both tables. Any row in the left table with no matching row in the right table is dropped silently.

LEFT JOIN (LEFT OUTER JOIN) returns all rows from the left table. For rows with no match in the right table, all right-table columns are NULL.

The silent drop problem: "List all users and their total purchase amounts" sounds like an INNER JOIN query but it's not. Users with no purchases should appear with amount = 0 or NULL. An INNER JOIN returns only users who have purchased — a plausible but incorrect result.

Correct pattern: SELECT u.user_id, COALESCE(SUM(o.amount), 0) AS total_spend FROM users u LEFT JOIN orders o ON u.user_id = o.user_id GROUP BY u.user_id

The anti-join pattern ("users who have never ordered"): SELECT u.user_id, u.email FROM users u LEFT JOIN orders o ON u.user_id = o.user_id WHERE o.order_id IS NULL -- NULLs only exist for users with no matching orders

This is more efficient than NOT IN (vulnerable to NULLs) and NOT EXISTS (correct but verbose). The LEFT JOIN + IS NULL approach is the standard interview answer.

Join Patterns: Left Join, Anti-Join, and Self-Join

sqljoin-patterns.sql

-- Pattern 1: All users with their order count (LEFT JOIN for users with 0 orders)
SELECT
  u.user_id,
  u.email,
  COUNT(o.order_id) AS order_count,   -- COUNT(col) ignores NULLs → correct count
  COALESCE(SUM(o.amount), 0) AS total_spend
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
GROUP BY u.user_id, u.email
ORDER BY total_spend DESC;

-- Pattern 2: Anti-join — users who have NEVER placed an order
SELECT u.user_id, u.email, u.created_at
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE o.order_id IS NULL;  -- only unmatched users have NULL here

-- Alternative: NOT EXISTS (equivalent, sometimes faster with the right index)
SELECT u.user_id, u.email
FROM users u
WHERE NOT EXISTS (
  SELECT 1 FROM orders o WHERE o.user_id = u.user_id
);

-- Pattern 3: Self-join — find employees who earn more than their manager
SELECT
  e.employee_id,
  e.name          AS employee_name,
  e.salary        AS employee_salary,
  m.name          AS manager_name,
  m.salary        AS manager_salary
FROM employees e
JOIN employees m ON e.manager_id = m.employee_id  -- same table, aliased twice
WHERE e.salary > m.salary;

-- Pattern 4: Multi-table join — orders with product and user details
SELECT
  o.order_id,
  u.email,
  p.product_name,
  oi.quantity,
  oi.unit_price,
  oi.quantity * oi.unit_price AS line_total
FROM orders o
JOIN users u        ON o.user_id     = u.user_id
JOIN order_items oi ON o.order_id    = oi.order_id
JOIN products p     ON oi.product_id = p.product_id
WHERE o.status = 'completed'
  AND o.created_at >= '2024-01-01'
ORDER BY o.order_id, oi.order_item_id;

GROUP BY and HAVING — Aggregation Mechanics

GROUP BY collapses all rows with the same grouping values into a single output row. Every column in the SELECT list must either be in GROUP BY or wrapped in an aggregate function (COUNT, SUM, AVG, MIN, MAX). Violating this rule is a syntax error in most databases.

Key rules:

GROUP BY runs after WHERE but before SELECT
HAVING filters groups after GROUP BY (equivalent to WHERE, but for aggregated results)
You can GROUP BY an expression: GROUP BY DATE_TRUNC('month', created_at)
You can GROUP BY multiple columns: GROUP BY user_id, product_category

COUNT() vs COUNT(col): COUNT() — counts all rows in the group, including those with NULLs in any column COUNT(col) — counts only rows where col is NOT NULL COUNT(DISTINCT col) — counts unique non-NULL values

This distinction matters in LEFT JOIN queries: COUNT(*) would count users with no orders (since there's still a row for them), while COUNT(order_id) correctly returns 0 for users with no orders because their order_id is NULL.

HAVING vs WHERE: WHERE event_type = 'purchase' → filters rows BEFORE grouping (only purchase rows enter GROUP BY) HAVING COUNT() > 3 → filters groups AFTER grouping (only groups with >3 rows kept) WHERE COUNT() > 3 → SYNTAX ERROR — aggregate functions are not allowed in WHERE

GROUP BY, HAVING, and Aggregation Patterns

sqlgroup-by-patterns.sql

-- Basic GROUP BY with HAVING (users with more than 3 purchases in last 90 days)
SELECT
  user_id,
  COUNT(*)                  AS purchase_count,
  SUM(amount)               AS total_spend,
  AVG(amount)               AS avg_order_value,
  MAX(amount)               AS largest_order,
  MIN(created_at)           AS first_purchase,
  MAX(created_at)           AS latest_purchase
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '90 days'
  AND status = 'completed'       -- filter BEFORE grouping
GROUP BY user_id
HAVING COUNT(*) > 3             -- filter AFTER grouping
ORDER BY total_spend DESC;

-- Multi-dimensional GROUP BY: revenue by month and category
SELECT
  DATE_TRUNC('month', created_at) AS revenue_month,
  category,
  COUNT(DISTINCT user_id)         AS unique_buyers,
  COUNT(order_id)                 AS total_orders,
  SUM(amount)                     AS total_revenue,
  SUM(amount) / COUNT(DISTINCT user_id) AS revenue_per_buyer
FROM orders
WHERE created_at >= '2024-01-01'
GROUP BY DATE_TRUNC('month', created_at), category
ORDER BY revenue_month, total_revenue DESC;

-- COUNT(*) vs COUNT(col) in a LEFT JOIN context
-- Correct: COUNT(order_id) = 0 for users with no orders
SELECT
  u.user_id,
  COUNT(o.order_id) AS orders_placed,   -- 0 for users with no orders
  COUNT(*)          AS row_count         -- 1 for every user (wrong for this use case)
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
GROUP BY u.user_id;

-- GROUP BY with ROLLUP (subtotals + grand total — PostgreSQL, MySQL 8+, BigQuery)
SELECT
  COALESCE(category, 'ALL CATEGORIES')  AS category,
  COALESCE(region, 'ALL REGIONS')        AS region,
  SUM(revenue)                           AS total_revenue
FROM sales
GROUP BY ROLLUP(category, region)
ORDER BY category NULLS LAST, region NULLS LAST;

Aggregation Function Reference

Function	What It Counts	NULL Behavior	Common Use Case
COUNT(*)	All rows in group	Includes rows where any col is NULL	Row count, order count per user
COUNT(col)	Rows where col is NOT NULL	Ignores NULLs	Count non-null events after LEFT JOIN
COUNT(DISTINCT col)	Unique non-NULL values of col	Ignores NULLs	Unique users, unique products
SUM(col)	Sum of non-NULL values	Ignores NULLs	Total revenue, total spend
AVG(col)	Average of non-NULL values	Ignores NULLs — divides by non-NULL count	Average order value
MAX(col) / MIN(col)	Extreme value in group	Ignores NULLs	Latest event date, highest score

Self-Joins: Hierarchical Queries and Comparison Patterns

A self-join joins a table to itself using two different aliases. This is required for:

Hierarchical queries: employees and their managers (same employees table)
Comparison queries: find rows where a value is greater than another row's value in the same table
Consecutive pair detection: sessions where event N is immediately followed by event M

The pattern: alias the same table twice and join on the relationship column.

Employee hierarchy: FROM employees e JOIN employees m ON e.manager_id = m.employee_id

The e alias represents the employee, the m alias represents the manager. Both come from the same table. You can now compare e.salary to m.salary, or select m.name as manager_name.

Common interview variant: "Find all employees whose salary exceeds the average salary in their own department." This requires a self-join or correlated subquery — GROUP BY alone cannot compare a row's value to its own group's aggregate.

Self-join solution: SELECT e.employee_id, e.name, e.salary, dept_avg.avg_salary FROM employees e JOIN ( SELECT department_id, AVG(salary) AS avg_salary FROM employees GROUP BY department_id ) dept_avg ON e.department_id = dept_avg.department_id WHERE e.salary > dept_avg.avg_salary

⚠ WARNING

Join Mistakes That Kill SQL Interviews

Mistake 1: INNER JOIN when LEFT JOIN is required. "List all users with their last login" — if some users have never logged in, INNER JOIN drops them. The result looks complete but is wrong. Develop the habit: any time the question says "all X, even those without Y", reach for LEFT JOIN.

Mistake 2: Forgetting WHERE right_table.id IS NULL in anti-joins. A LEFT JOIN alone returns all users including those with matches. You need the IS NULL filter to isolate unmatched rows.

Mistake 3: WHERE with an aggregate function. "Find users with more than 10 orders" written as WHERE COUNT(orders) > 10 is a syntax error. Move it to HAVING.

Mistake 4: Grouping without including all non-aggregated SELECT columns in GROUP BY. If you SELECT user_id, email, COUNT(*) FROM orders GROUP BY user_id — the email column is not in GROUP BY, which is a syntax error in strict-mode databases (PostgreSQL) and silently wrong in MySQL's default mode.

Mistake 5: Using NOT IN when the subquery can return NULLs. NOT IN with any NULL in the list returns no rows (due to three-valued SQL logic). Use NOT EXISTS or LEFT JOIN + IS NULL instead.

TIP

Join and Aggregation Interview Template

When building a join query:

Identify your base table (the "all X" table goes on the LEFT)
Choose join type: match required → INNER; match optional → LEFT; looking for no-match → LEFT + IS NULL
Write the ON clause precisely: join on the foreign key, not a fuzzy condition
Apply WHERE filters before GROUP BY (row-level filters)
GROUP BY all non-aggregated SELECT columns
Apply HAVING for aggregate filters (after grouping)

Quick sanity check:

Does your result include users/entities with 0 events? → You need LEFT JOIN
Are you looking for "X that never did Y"? → LEFT JOIN + WHERE right.id IS NULL
Filtering on COUNT/SUM? → HAVING, not WHERE
SELECT col not in GROUP BY? → wrap it in MIN(col) or move it to GROUP BY

Interview Questions

Click to reveal answers