Design a Multi-Channel Notification Service at Scale

End-to-end design of a notification platform delivering 1B notifications/day across push, email, and SMS. Covers the fan-out problem for broadcast sends, per-provider rate limiting, idempotency, retries, and compliance (GDPR/CAN-SPAM/TCPA) — the system design topics interviewers actually grill on.

Tags: Kafka, Redis, FCM, APNs, Twilio, SendGrid, Notification Service, Multi-Channel, Idempotency, Fan-Out, Rate Limiting, Circuit Breaker, Token Bucket, GDPR

The Problem: One API, Many Hostile Channels

A notification service sits between product teams (who want to send "your password was reset" or "Taylor Swift dropped a new album") and external providers (FCM for Android push, APNs for iOS, Twilio/Vonage for SMS, SendGrid/SES/Mailgun for email). Each provider has its own quirks: APNs expects long-lived HTTP/2 connections and caps concurrent streams per connection at a value the server advertises, so clients should not hard-code one; Twilio rate-limits per sender, from about 1 message/sec on a basic long-code number to ~100 messages/sec on a short code; SES enforces a per-account maximum send rate and daily quota, and can restrict sending when bounce or complaint rates climb. The service must absorb all of this complexity so producers can send a message in one call without knowing which rails it rides.

The defining challenge is the fan-out spread: a "you got a reply" notification targets 1 user; a marketing blast targets 100M. Both must flow through the same architecture with isolated failure domains, predictable latency for transactional sends, and strict compliance: a user who opted out of marketing must never receive a marketing message, regardless of which team or retry path generated it.
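
One common way to enforce per-provider limits like those above is a token bucket per provider, so that exhausting the SMS budget never touches the push budget. The sketch below is illustrative only: the class name, the `PROVIDER_LIMITS` numbers, and the injectable `clock` parameter are assumptions made for this example, not values taken from any provider's documentation.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Permits `rate` sends/sec on average, with bursts of up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity          # start with a full bucket
        self._last_refill = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; never blocks."""
        now = self._clock()
        elapsed = now - self._last_refill
        # Refill in proportion to elapsed time, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last_refill = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Hypothetical (rate per second, burst capacity) pairs, for illustration only.
PROVIDER_LIMITS: Dict[str, Tuple[float, float]] = {
    "sms": (100.0, 100.0),
    "email": (500.0, 1000.0),
    "push": (5000.0, 5000.0),
}


def build_limiters(clock: Callable[[], float] = time.monotonic) -> Dict[str, TokenBucket]:
    """One independent bucket per provider, so limits are isolated."""
    return {
        name: TokenBucket(rate, capacity, clock)
        for name, (rate, capacity) in PROVIDER_LIMITS.items()
    }
```

A worker would call `limiters["sms"].try_acquire()` before each send and requeue or delay the message when it returns `False`. In a multi-worker deployment the bucket state would have to live in shared storage rather than in process memory; this sketch keeps it local for clarity.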

What Interviewers Are Testing

This is a workhorse interview question: it looks simple ("just send messages") but exposes whether you understand producer/consumer decoupling, idempotency, back-pressure, and compliance-by-design. The senior signal comes from:

  • Fan-out strategy: Do you enqueue 100M messages at the producer, or a single broadcast job that expands inside the system? Enqueueing 100M individual messages up front floods the queue.
  • Preference check placement: Do you filter opt-outs before enqueueing (correct) or at the worker (wastes queue capacity and violates compliance SLAs)?
  • Idempotency: Can you describe a dedup key scheme that survives producer retries without storing unbounded state?
  • Per-provider back-pressure: Do you isolate provider rate limits so a slow SMS provider doesn't starve push notifications?
  • Transactional vs marketing lanes: Do you separate priority classes, or let a marketing blast delay a 2FA code?
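
The fan-out, preference-check, and lane questions above can be sketched together. In this minimal example the producer submits one `BroadcastJob`; the system splits it into shards, drops opted-out users before any per-user delivery is created, and tags each delivery with a priority lane. All names here (`BroadcastJob`, `Delivery`, `shard_ranges`, `expand_shard`, the `"high"`/`"bulk"` lanes) are hypothetical, and users are modeled as the integers `0..audience_size-1` purely to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Set


@dataclass(frozen=True)
class BroadcastJob:
    job_id: str
    category: str        # e.g. "marketing" or "transactional"
    audience_size: int   # users are 0..audience_size-1 in this sketch


@dataclass(frozen=True)
class Delivery:
    job_id: str
    user_id: int
    lane: str            # priority lane the delivery is queued on


def shard_ranges(job: BroadcastJob, shard_size: int) -> Iterator[range]:
    """Split one broadcast job into fixed-size shards of user IDs."""
    for start in range(0, job.audience_size, shard_size):
        yield range(start, min(start + shard_size, job.audience_size))


def expand_shard(job: BroadcastJob, user_ids: Iterable[int],
                 opted_out: Dict[str, Set[int]]) -> List[Delivery]:
    """Expand one shard into per-user deliveries, filtering opt-outs first."""
    blocked = opted_out.get(job.category, set())
    # Transactional sends get their own lane so a blast cannot delay them.
    lane = "high" if job.category == "transactional" else "bulk"
    return [
        Delivery(job.job_id, uid, lane)
        for uid in user_ids
        if uid not in blocked
    ]
```

With this shape the producer-facing queue holds one job record no matter how large the audience is, and each shard can be expanded by a separate worker. A real system would read opt-outs from a preference store instead of an in-memory dict.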

Candidates who talk only about "use Kafka and workers" fail the bar. Candidates who discuss provider-specific rate limits, circuit breakers, and idempotency keys pass.
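
As one possible shape for the idempotency keys mentioned above, the sketch below derives a dedup key from the producer, event, user, and channel, and records it with a time-to-live so the dedup state stays bounded. The field choice, the `TtlDedupStore` class, and the TTL value are assumptions for illustration. A shared deployment would more likely use an atomic set-if-absent with expiry in a store such as Redis than an in-process dict.

```python
import hashlib
import time
from typing import Callable, Dict


def dedup_key(producer_id: str, event_id: str, user_id: str, channel: str) -> str:
    """Stable key: the same logical send always hashes to the same value."""
    raw = "\x1f".join([producer_id, event_id, user_id, channel])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class TtlDedupStore:
    """In-memory stand-in for a shared store with set-if-absent plus expiry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def claim(self, key: str) -> bool:
        """Return True for the first claim within the TTL, False for a repeat."""
        now = self._clock()
        # Drop expired entries so state is bounded by traffic within one TTL.
        # (A linear scan per call is fine for a sketch, not for production.)
        for stale in [k for k, exp in self._expires_at.items() if exp <= now]:
            del self._expires_at[stale]
        if key in self._expires_at:
            return False
        self._expires_at[key] = now + self._ttl
        return True
```

A worker would call `store.claim(dedup_key(...))` before handing a message to a provider and skip the send when it returns `False`. The TTL only needs to exceed the longest window in which a producer might retry, which is what keeps the stored state from growing without bound.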