
ML System Design: Music Recommendation System

Design Spotify/Apple Music-scale music recommendation end-to-end — from audio-aware candidate retrieval to sequential session modeling that captures the unique temporal dynamics of music consumption. Covers the production architectures behind Discover Weekly (batch exploration), Radio (session exploitation), and Release Radar (cold start for new tracks). Deep dives into GRU4Rec and SASRec for session modeling, ACARec for artist-catalog-based cold start, contextual bandits on the homepage, skip-rate debiasing, and the listen-skip paradox. Includes what each level (mid/senior/staff) must cover.

65 min read · 4 sections · 1 interview question
Tags: Music Recommendations, Sequential Models, Explore-Exploit, Cold Start, GRU4Rec, SASRec, BERT4Rec, Contextual Bandits, Two-Tower Networks, Audio Features, Collaborative Filtering, Skip Rate, Session Modeling, Discover Weekly, Playlist Generation

Why Music Recommendation Is Fundamentally Different

Music recommendation shares the retrieval-ranking skeleton with video and e-commerce, but the consumption dynamics are so different that the models, signals, and architectures must be redesigned from the ground up.

Consumption is sequential and session-driven by design. Users don't pick individual songs the way they pick videos. They listen to music in sessions — continuous streams where each track flows into the next. The quality of a music recommendation session is not "was each song good?" but "did the session flow?" A great session might start with upbeat pop, transition through indie folk, and settle into ambient. A bad session is one where the vibe breaks — an aggressive hip-hop track intrudes on a relaxing study playlist. This means the sequence of recommendations matters as much as any individual recommendation.
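
As a concrete illustration of what "session flow" can mean, here is a minimal heuristic sketch (not the guide's production approach): score a candidate next track by its similarity to an exponentially decayed average of the tracks just played, so a jarring genre jump scores low. The decay factor and the source of the track embeddings are assumptions.

```python
import numpy as np

def session_context_vector(track_embs: np.ndarray, decay: float = 0.8) -> np.ndarray:
    """Exponentially decayed mean of recent track embeddings (most recent row last)."""
    weights = decay ** np.arange(len(track_embs) - 1, -1, -1)  # older tracks weigh less
    ctx = (weights[:, None] * track_embs).sum(axis=0) / weights.sum()
    return ctx / np.linalg.norm(ctx)

def coherence_score(candidate_emb: np.ndarray, ctx: np.ndarray) -> float:
    """Cosine similarity between a candidate track and the session context."""
    return float(candidate_emb @ ctx / np.linalg.norm(candidate_emb))
```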

The skip signal is the richest feedback in the dataset. Unlike video (where leaving early is a negative but ambiguous signal) or e-commerce (where not buying is the norm), music has an extremely strong skip signal: the user explicitly pressed skip within the first 10-30 seconds. Spotify's research shows skip rate is one of the strongest predictors of long-term user satisfaction — far stronger than raw completion rate. But skip is biased by position (users skip more at the beginning and end of sessions), by context (users skip more while commuting than while relaxing), and by mood alignment.
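
One common way to handle the position component of that bias is inverse-propensity-style reweighting: estimate the baseline skip rate at each session position from logs, then down-weight skips that occur where skipping is common anyway. A minimal sketch, with the propensity estimate and clipping threshold as assumptions:

```python
import numpy as np

def position_skip_propensity(skips: np.ndarray, positions: np.ndarray, n_positions: int) -> np.ndarray:
    """Estimate the baseline skip rate at each session position from logged sessions."""
    props = np.ones(n_positions)
    for p in range(n_positions):
        mask = positions == p
        if mask.any():
            props[p] = max(skips[mask].mean(), 1e-3)  # clip to avoid exploding weights
    return props

def ipw_skip_weights(positions: np.ndarray, propensity: np.ndarray) -> np.ndarray:
    """Down-weight skip labels at positions where skipping is near-universal anyway."""
    return 1.0 / propensity[positions]
```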

The catalog is enormous and largely undifferentiated at the surface. 100 million tracks vs. 15K Netflix titles. The challenge is not "which of 15K should I show?" — it's "which of 100 million tracks will this user want to hear next, given the track they're currently listening to, the playlist they're in, and the context of their day?" Content-based features (audio analysis) are load-bearing here in a way they're not for video.
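
Because audio content features carry so much weight, retrieval is typically done over audio-derived track embeddings with an approximate nearest-neighbor index. A minimal sketch using FAISS (the embedding dimension, index type, and nlist/nprobe values are assumptions, and the random vectors stand in for real audio embeddings):

```python
import faiss
import numpy as np

dim = 128                                                      # assumed audio-embedding dimension
track_embs = np.random.rand(1_000_000, dim).astype("float32")  # stand-in for real audio embeddings
faiss.normalize_L2(track_embs)                                 # cosine similarity via inner product

quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, 4096, faiss.METRIC_INNER_PRODUCT)
index.train(track_embs)
index.add(track_embs)

index.nprobe = 32
query = track_embs[:1]                                         # e.g., the currently playing track
scores, track_ids = index.search(query, 500)                   # top-500 candidates for the ranker
```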

Cold start for music has a unique property: artist catalogs exist. When Drake releases a new song, that song has zero listens. But Drake has thousands of tracks with billions of listens. The artist catalog provides a proxy embedding for any new release. This artist-transfer cold start is unique to music and is not present in video or e-commerce in the same way.
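
A minimal sketch of artist-catalog transfer (the listen-count weighting is an assumption; a production system would also blend in audio features and metadata for the new track):

```python
import numpy as np

def cold_start_embedding(artist_track_embs: np.ndarray, listen_counts: np.ndarray) -> np.ndarray:
    """Proxy embedding for a brand-new track: listen-weighted mean of the artist's
    existing track embeddings."""
    w = listen_counts / listen_counts.sum()
    proxy = (w[:, None] * artist_track_embs).sum(axis=0)
    return proxy / np.linalg.norm(proxy)
```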

This problem is asked at Spotify, Apple Music, YouTube Music, Amazon Music, Tidal, Deezer, SoundCloud, Pandora, and every music streaming platform. Key surface variants include: "What to play next" (radio/autoplay), Discover Weekly (personalized weekly playlist), Release Radar (new music from followed artists), and home page modules.

TIP

What Interviewers Are Evaluating

Mid-level: Can you describe the retrieval-ranking two-stage architecture for music? Do you know collaborative filtering and content-based filtering and why music needs both? Can you explain what makes skip rate a special signal vs. video completion? Can you describe Discover Weekly at a high level (batch personalized playlist generation)?

Senior-level: Can you design a session-aware ranker that models the sequential coherence of a playlist? Do you understand GRU4Rec or SASRec and why RNNs/Transformers are the right architecture here? Can you design cold start for new tracks using artist-catalog transfer? Do you handle the skip bias (users skip more at the start/end of sessions)? Can you design a contextual bandit for homepage calibration across content types (music vs. podcasts vs. audiobooks)? Do you name real production tools: Redis, FAISS, Kafka, Triton?

Staff-level: Can you mitigate the listen-skip paradox (some skips are positive signals — the user heard enough to know they want something else; some are strongly negative — the track was jarring)? Can you reason about the long-tail discovery mission (helping new/indie artists reach relevant listeners) as a first-class constraint? Do you design the exploration pipeline so that music discovery is meaningful, not just random? Can you reason about how Discover Weekly is trained on implicit playlists ('people who listen to X also listen to Y') rather than explicit user-created playlists?
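
The GRU4Rec-style session model named in the senior-level expectation above can be sketched in a few lines of PyTorch. This is a minimal illustration, not a production architecture: at a 100M-track catalog, the full-softmax output layer would be replaced with sampled softmax or a retrieval-style dot product against track embeddings.

```python
import torch
import torch.nn as nn

class NextTrackGRU(nn.Module):
    """GRU4Rec-style session model: embed the tracks played so far, run a GRU,
    and score every track as the next-track distribution."""

    def __init__(self, n_tracks: int, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.track_emb = nn.Embedding(n_tracks, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_tracks)   # full softmax; sampled softmax at real scale

    def forward(self, session: torch.Tensor) -> torch.Tensor:
        # session: (batch, seq_len) of track ids played so far, in order
        h, _ = self.gru(self.track_emb(session))
        return self.out(h[:, -1])                # logits over the next track

# Training would minimize cross-entropy against the track actually played next,
# ideally weighting examples by whether that track was completed vs. skipped.
```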

Clarifying Questions — Ask These First

01

Which surface are we designing?

Radio / Autoplay (next track in continuous session — warm context, strong sequential dependency) vs. Discover Weekly (weekly personalized batch playlist — cold context, exploration-heavy) vs. Home page recommendation modules (mixed content types: music, podcasts, audiobooks — requires multi-content type calibration) vs. Search results ranking. For this problem: Radio/Autoplay + Home page, which together cover the majority of listening hours.

02

What scale?

Spotify-scale: ~600M MAU, ~250M DAU, ~100M tracks, ~5B streams/day. A user's listening session lasts an average of 26 minutes (~7 tracks). The system must recommend the next track in <500ms (audio preloading starts while the current track is finishing). This is a hard latency requirement — audio buffers must be ready before the track ends.
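
A quick back-of-envelope from these numbers, assuming every stream triggers roughly one next-track request and a ~2.5x peak-to-average factor (both assumptions):

```python
streams_per_day = 5e9                 # from the stated scale
avg_qps = streams_per_day / 86_400    # ≈ 58K next-track requests/sec on average
peak_qps = avg_qps * 2.5              # assumed peak-to-average factor
print(f"avg ≈ {avg_qps:,.0f} QPS, peak ≈ {peak_qps:,.0f} QPS (assuming a 2.5x peak factor)")
```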

03

What's the primary business objective?

Session length (minutes streamed per day, per user) or return rate (do users come back tomorrow) or artist discovery (are users finding new music)? These are in tension. Deep exploitation of a user's known preferences maximizes session length today but may lead to boredom-driven churn in 6 months. Spotify explicitly balances these — their Discover Weekly product is a deliberate sacrifice of short-term engagement metrics for long-term retention and satisfaction.

04

Content types in scope?

Music only, or also podcasts, audiobooks, audiodramas? Spotify's home page serves all four. Mixing content types requires calibration — a podcast's engagement patterns are completely different from music's. For this problem: music focus with note on multi-content-type calibration.
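
One way to make ranker scores comparable across content types is to calibrate each type separately against its own engagement labels. A sketch using per-type isotonic regression (the calibrator choice and the class interface are assumptions, not the guide's prescribed method):

```python
from collections import defaultdict
from sklearn.isotonic import IsotonicRegression

class PerContentTypeCalibrator:
    """Fit a separate isotonic calibrator per content type so that scores for
    music, podcasts, and audiobooks map to comparable engagement probabilities."""

    def __init__(self):
        self.calibrators = {}

    def fit(self, scores, labels, content_types):
        by_type = defaultdict(lambda: ([], []))
        for s, y, t in zip(scores, labels, content_types):
            by_type[t][0].append(s)
            by_type[t][1].append(y)
        for t, (s, y) in by_type.items():
            self.calibrators[t] = IsotonicRegression(out_of_bounds="clip").fit(s, y)

    def predict(self, score, content_type):
        return float(self.calibrators[content_type].predict([score])[0])
```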

05

Explicit or implicit playlists?

Does the user actively curate playlists, or are we designing purely algorithmic session flow? This matters because user-created playlists are gold-standard training data for sequence modeling. Spotify uses user-created playlists as an implicit 'this track belongs near this track' training signal.
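
A minimal sketch of turning user-created playlists into that co-occurrence signal: emit (target, context) pairs for tracks placed within a small window of each other, which can then train skip-gram-style track embeddings (the window size is an assumption):

```python
def cooccurrence_pairs(playlist: list[str], window: int = 3) -> list[tuple[str, str]]:
    """Emit (target, context) track-id pairs from one playlist: tracks placed near
    each other are treated as an implicit 'belongs near' signal."""
    pairs = []
    for i, target in enumerate(playlist):
        for j in range(max(0, i - window), min(len(playlist), i + window + 1)):
            if j != i:
                pairs.append((target, playlist[j]))
    return pairs
```

These pairs play the same role as (word, context) pairs in word2vec-style training of track embeddings.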
