Authentication & Authorization: JWT, OAuth 2.0, OIDC & Zanzibar ReBAC
Deep-dive on modern auth systems: AuthN vs AuthZ, session vs JWT tradeoffs, OAuth 2.0 flows (Authorization Code + PKCE, Client Credentials, Device), OIDC identity tokens, RBAC vs ABAC vs Google Zanzibar ReBAC, JWT revocation, key rotation, and WebAuthn passkeys for FAANG system design interviews.
Why Authentication Systems Are the Hardest Part of System Design
Auth is the system that every other system trusts. Get it wrong and every endpoint is exposed; get it subtly wrong and you ship silent vulnerabilities that take years to discover. Real-world failures follow predictable patterns: algorithm confusion attacks on JWTs (a verifier that expects RS256 accepts an HS256 token forged with the public key as the HMAC secret, as in CVE-2016-10555), missing aud claim validation that lets a token minted for one service be replayed at another (the "confused deputy"), session fixation on shared workstations, and the OAuth Implicit flow leaking tokens from URL fragments into browser history.
The interview question "design authentication for a multi-tenant SaaS" tests whether you can navigate five orthogonal decisions without conflating them:
- AuthN vs AuthZ — who are you, vs what can you do. Separate systems, separate failure modes.
- Session vs stateless token — where you store state (server DB vs client JWT) changes revocation, scale, and latency tradeoffs.
- OAuth flow selection — Authorization Code + PKCE for web/mobile, Client Credentials for service-to-service, Device Code for TVs. Wrong flow = wrong security model.
- Authorization model — RBAC (simple, rigid), ABAC (flexible, attribute-driven), ReBAC (relationship graphs — Google Docs "can user X edit this file because they're in team Z").
- Key management — signing key rotation, refresh token rotation, JWK sets. The hardest operational problem in auth.
Staff-level answers distinguish these explicitly. Mid-level answers conflate AuthN with AuthZ, propose JWTs without a revocation plan, and pick RBAC without considering when it breaks (it breaks the moment you add "share a document with one user from another org").
What Interviewers Actually Evaluate
Auth system design questions probe five distinct competencies — cover all five to hit staff bar.
- Conceptual clarity: Can you say "authentication is who, authorization is what" without fumbling? Can you explain why OAuth is not an authentication protocol and why OIDC was built on top of it?
- Token mechanics: Do you know a JWT payload is base64, not encrypted (anyone can read it)? Do you know the difference between HS256 (shared secret — breaks at scale because every verifier holds the signing key) and RS256 (asymmetric — only the issuer signs, verifiers hold the public key)?
- OAuth flow fluency: Can you name four flows, match them to use cases (web app = Authorization Code with PKCE, SPA = same, mobile = same, CLI/TV = Device Code, service-to-service = Client Credentials), and explain why Implicit flow was deprecated?
- Revocation: The canonical trick question. JWTs are stateless — how do you revoke one? (Short TTL + refresh tokens + optional blocklist.)
- Authorization at scale: For "can user X edit document Y," when does RBAC break? What does Google Zanzibar / SpiceDB solve? The candidate who mentions Zanzibar without prompting stands out.
The anti-signal: "just use JWT," "store passwords hashed" (without saying bcrypt/Argon2 with specific cost parameter), or proposing OAuth Implicit flow in 2024.
Clarifying Questions Before You Design
Is this authentication or authorization — or both?
If the interviewer asks 'design login for a social app,' that's AuthN with SSO. 'Design permission checks for Google Docs' is AuthZ. Scope matters.
What clients — web, mobile, CLI, service-to-service?
Each maps to a different OAuth flow. Web SPA = Auth Code + PKCE. Mobile = same. Backend cron job = Client Credentials. TV app = Device Code.
Single-tenant or multi-tenant? Is SSO required?
Multi-tenant SaaS with enterprise customers needs SAML/OIDC federation with customer IdPs (Okta, Azure AD). That's a different architecture than a consumer app.
What's the scale — 10k users or 100M?
Session tables in Postgres work to ~1M users. Above that, JWT + refresh tokens or session data in Redis cluster. 100M users needs sharded session store or JWT.
What compliance / revocation requirements?
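SOC 2 and HIPAA audits expect audit logs of auth events and rapid session revocation. PCI DSS requires re-authentication after 15 minutes of idle time for access to cardholder data, which pushes toward short token lifetimes. These constrain the AuthN architecture.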
What authorization complexity — roles, attributes, or relationships?
Roles (admin/user) = RBAC. Attributes (dept + clearance + time-of-day) = ABAC. Relationships ('can edit document X because in team Y which owns folder Z') = ReBAC / Zanzibar. Pick the simplest that covers requirements.
What MFA / account-recovery requirements?
Consumer app: TOTP or WebAuthn (passkeys). Enterprise: hardware keys, SAML. Recovery flows are the #1 source of account takeovers — design them as first-class.
AuthN vs AuthZ — The Distinction That Interviewers Actually Test
Authentication (AuthN) answers "who are you?" It verifies identity — a password check, a hardware key signature, a SAML assertion from Okta. The output is an authenticated principal: a user ID, an email, plus claims about how the authentication happened (MFA used? single sign-on? session age?).
Authorization (AuthZ) answers "what can you do?" Given an authenticated principal and a requested action on a resource, decide allow/deny. The input is the principal, the action (docs.edit), the resource (document:abc123), and the context (IP, time, MFA recency). The output is a boolean plus a reason.
These are separate systems for separate reasons:
- Different failure modes: a broken AuthN means anyone can pretend to be anyone. A broken AuthZ means authenticated users do things they shouldn't.
- Different cadence: AuthN happens once per session (plus refresh). AuthZ happens on every request — must be fast (< 10ms).
- Different data stores: AuthN holds user records, passwords, MFA factors. AuthZ holds role assignments, ACLs, or relationship graphs.
- Different scaling concerns: AuthN handles login spikes (Monday morning). AuthZ handles every API call — needs caching, precomputed materialized views.
The conflation trap: candidates design "a login system" and include permission checks in the same service. In production this is wrong — AuthN and AuthZ are independent microservices. Auth0 and Okta are AuthN platforms. Google Zanzibar, AWS IAM, and SpiceDB are AuthZ platforms. The boundary is the access token — AuthN issues it, AuthZ consumes it.
Sessions vs Tokens — Where You Store State
Every request after login needs to carry identity. You have two choices: session-based (server stores state, client holds an opaque ID) or token-based (client holds self-contained signed token, server stores nothing).
Session-based (cookie holds session_id=abc123, server looks up abc123 in a session store):
- Pros: trivial revocation (delete the session row), small client payload (~32 bytes), full state lives server-side so you can attach arbitrary data.
- Cons: every request hits the session store, so it must be a fast cache (Redis). Sticky sessions or a shared session store are required for horizontal scaling. Cross-domain sessions need careful cookie configuration (SameSite, HttpOnly, Secure).
- Scale: Redis-backed sessions work to ~10M concurrent sessions per Redis cluster. Beyond that, shard by session ID.
Token-based (JWT or opaque bearer token):
- Pros: stateless verification — no server lookup needed if the token is signed (JWT). Scales to any number of concurrent sessions without a session store. Good for microservices — every service validates independently with the public key.
- Cons: revocation is hard (the whole point of statelessness is that you don't check a server). Solutions: short TTL (5-15 min access tokens) + refresh tokens + optional blocklist for high-value revocations. Payload size is larger (500-2000 bytes typical).
- Scale: JWT is the right choice when you have >100 services that all need to validate identity — the alternative is every service hitting a central session store on every request.
The decision tree is simple: monolith or small service mesh → sessions (simpler, trivial revocation). Large microservices architecture or multi-region → JWT (avoid the shared session store becoming a bottleneck or SPOF). Hybrid is common: JWT for cross-service identity propagation, but a central session record in the AuthN service to support global logout.
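The session-based side of this tradeoff can be sketched with an in-memory store. This is a minimal illustration under stated assumptions, not a production design: the `SessionStore` class and its method names are hypothetical, and a real deployment would back it with Redis or a database rather than a dict.

```python
import secrets
import time
from typing import Optional


class SessionStore:
    """Hypothetical in-memory session store; a dict stands in for Redis."""

    def __init__(self, ttl_sec: int = 3600):
        self._ttl = ttl_sec
        self._sessions: dict[str, dict] = {}

    def create(self, user_id: str) -> str:
        # The client only ever sees this opaque random ID (sent as a cookie).
        session_id = secrets.token_urlsafe(32)
        self._sessions[session_id] = {
            "user_id": user_id,
            "expires_at": time.time() + self._ttl,
        }
        return session_id

    def lookup(self, session_id: str) -> Optional[str]:
        # Runs on every request: this is the per-request store hit.
        record = self._sessions.get(session_id)
        if record is None or record["expires_at"] < time.time():
            self._sessions.pop(session_id, None)
            return None
        return record["user_id"]

    def revoke(self, session_id: str) -> None:
        # Revocation is a single delete; the next lookup fails immediately.
        self._sessions.pop(session_id, None)


store = SessionStore()
sid = store.create("user_123")
print(store.lookup(sid))   # user_123
store.revoke(sid)
print(store.lookup(sid))   # None
```

The `lookup` call is the cost that a stateless JWT avoids, and the `revoke` call is the capability a stateless JWT gives up.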
Session vs JWT Decision Matrix
| Dimension | Server Session | JWT (Stateless) | Winner |
|---|---|---|---|
| Revocation | Delete row — instant | Hard — needs blocklist or short TTL | Session |
| Latency per request | ~1-5ms (Redis lookup) | < 1ms (public-key verify, cached) | JWT |
| Payload size | ~32 bytes (opaque ID) | 500-2000 bytes (signed claims) | Session |
| Horizontal scale | Needs shared Redis / sticky sessions | Fully stateless — any replica | JWT |
| Cross-service use | Every service hits session store | Every service verifies independently | JWT |
| Data inspection by client | Opaque — client sees nothing | Base64 payload — client reads all claims | Session (privacy) |
| Audit / session listing | Trivial — SELECT * from sessions | Requires separate tracking table | Session |
| Best for | Monolith, trusted single domain | Microservices, multi-region, SSO federation | Depends |
JWT Deep Dive — Structure, Signing, and the Common Mistakes
A JWT is three base64url-encoded parts joined by dots: header.payload.signature. Example: eyJhbGciOiJSUzI1NiIsImtpZCI6ImtleTEifQ.eyJzdWIiOiJ1c2VyXzEyMyIsImV4cCI6MTcxNDE1NjgwMH0.signature_bytes_base64.
Header: {"alg": "RS256", "kid": "key1"} — algorithm and key ID (for rotation).
Payload (claims): {"iss": "auth.example.com", "sub": "user_123", "aud": "api.example.com", "exp": 1714156800, "iat": 1714155900} — issuer, subject (user), audience, expiry, issued-at.
Signature: the signing server signs base64(header).base64(payload) with its private key (RS256) or shared secret (HS256).
The #1 JWT mistake — writing sensitive data into the payload. The payload is base64-encoded, not encrypted. Anyone who gets the token (via any console.log, any sent header in a log line, any network capture) can decode it and read every claim. Email, role, internal user ID — all readable. Never put anything in a JWT you wouldn't print in a log.
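To make the point concrete, the sketch below reads the claims out of the example token shown earlier using only base64 and JSON decoding, with no key of any kind. The helper names `b64url_decode` and `peek_claims` are illustrative, not part of any library.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    # JWTs strip base64 padding; restore it before decoding.
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def peek_claims(token: str) -> dict:
    """Read the claims WITHOUT verifying anything. Never use this for auth."""
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))


example = ("eyJhbGciOiJSUzI1NiIsImtpZCI6ImtleTEifQ."
           "eyJzdWIiOiJ1c2VyXzEyMyIsImV4cCI6MTcxNDE1NjgwMH0."
           "signature_bytes_base64")
print(peek_claims(example))  # {'sub': 'user_123', 'exp': 1714156800}
```

Anything that can see the token string can run the equivalent of `peek_claims`, which is why the payload should hold identifiers and not secrets.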
HS256 vs RS256 — pick asymmetric at any scale beyond a single service:
- HS256 (HMAC-SHA256): symmetric — signer and verifier share a secret. If 10 services need to verify tokens, they all hold the signing key — one compromised service leaks the ability to mint tokens for the entire platform. Acceptable only in a single monolith or tightly-trusted environment.
- RS256 (RSA-SHA256): asymmetric. The auth service signs with the private key and everyone verifies with the public key. Standard for production. Publish a JWK set (/.well-known/jwks.json) so verifiers can fetch and cache public keys. Supports key rotation with a kid header: include multiple keys in the JWK set during rotation.
Key validation checks every verifier must do:
- Signature valid (RSA verification).
- exp not passed (token not expired).
- iss matches expected issuer (prevents tokens from random auth servers).
- aud matches this service's identifier (prevents confused deputy: a token minted for service A replayed at service B).
- nbf, if present, has passed ("not before": the token is not valid until that time).
Skipping aud validation is the second-most-common real-world JWT bug after base64-vs-encryption confusion.
OAuth 2.0 Flows — Pick the Right One for Your Client
OAuth 2.0 is a delegation framework: it lets a user grant an app access to a resource without sharing their password. There are five flows. Three are still recommended; the other two (ROPC and Implicit) are dropped from OAuth 2.1, with ROPC surviving only in legacy migrations.
Authorization Code + PKCE (the default — use this): user redirects to auth server, logs in, gets a one-time code, client exchanges code + PKCE code_verifier for tokens. PKCE (Proof Key for Code Exchange) binds the code exchange to the client that initiated the flow — even if the code leaks (in the URL, in browser history, in a logging system), an attacker can't exchange it without the verifier. Use for: web apps (with or without backend), SPAs, mobile apps. This is the only correct answer for user-facing clients in 2024.
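The verifier/challenge binding can be sketched in a few lines. This assumes the S256 challenge method from RFC 7636; the function names are illustrative, and the HTTP redirects and token endpoint are left out.

```python
import base64
import hashlib
import hmac
import secrets


def challenge_for(verifier: str) -> str:
    # S256: base64url(SHA-256(verifier)) with padding stripped.
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def make_pkce_pair() -> tuple[str, str]:
    """Client side: returns (code_verifier, code_challenge)."""
    verifier = secrets.token_urlsafe(64)  # 86 chars, inside the 43-128 range
    return verifier, challenge_for(verifier)


def server_accepts(stored_challenge: str, presented_verifier: str) -> bool:
    """Auth server side: run when the client redeems the authorization code."""
    return hmac.compare_digest(stored_challenge,
                               challenge_for(presented_verifier))


verifier, challenge = make_pkce_pair()
# The challenge travels with the authorization request (front channel).
# The verifier is sent only in the back-channel code exchange.
print(server_accepts(challenge, verifier))          # True
print(server_accepts(challenge, "attacker-guess"))  # False
```

An attacker who intercepts the authorization code sees at most the challenge, and recovering the verifier from it would mean inverting SHA-256.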
Client Credentials (service-to-service): backend service presents its own client_id + client_secret directly to the token endpoint, gets an access token scoped to itself. No user involved. Use for: cron jobs, backend-to-backend API calls, CI/CD pulling secrets, internal microservices. The client_secret lives in your secret store (Vault, AWS Secrets Manager) — never in source.
Device Code (input-constrained devices): user is shown a short code on their TV, logs in on their phone via a URL, enters the code. TV polls the token endpoint until the user approves. Use for: smart TVs (Netflix on Apple TV), CLI tools (gcloud auth login, aws sso login), IoT with limited input.
Resource Owner Password Credentials (ROPC): user gives their password to the client app directly, client exchanges password for token. Deprecated by OAuth 2.1. Only use if migrating a legacy system where the user explicitly trusts the client (first-party mobile apps pre-OAuth). Avoid if at all possible — it defeats the entire purpose of OAuth.
Implicit flow: token returned in the URL fragment. Deprecated due to token leakage via browser history, referer headers, and logs. Every resource previously using Implicit should migrate to Authorization Code + PKCE. If a candidate proposes Implicit in 2024, that's a strong negative signal.
OAuth vs OIDC — Why Both Exist
OAuth 2.0 is authorization (delegation): "can app X access resource Y on user's behalf?" It issues access tokens scoped to APIs. It does not tell the app who the user is — the access token's contents are opaque to the client by design.
OpenID Connect (OIDC) is a thin identity layer on top of OAuth 2.0. It adds an id_token (a JWT) alongside the access token, containing user claims (sub, email, name, picture). Now the client knows who the user is — OIDC is authentication built on OAuth's flow machinery.
The practical mapping:
- "Sign in with Google" → OIDC. You want to know the user's identity. The id_token gives you their email and Google user ID.
- "Access my Google Calendar" → OAuth. You want delegated access to the Calendar API. The access token has scope calendar.read.
- Usually both: when a user clicks "Sign in with Google" in your app, you use OIDC to get their identity and OAuth to get an access token for their Google data (if your app needs that).
Three OIDC concepts that matter in interviews:
- id_token vs access token: the id_token is meant for the client (signed JWT with user claims, verified by the client). The access token is meant for resource servers (opaque to client, validated by API). Don't use an id_token to call APIs: it has the wrong audience.
- UserInfo endpoint: OIDC defines /userinfo. The client can exchange the access token for user claims even without parsing the id_token. Useful when claims change mid-session.
- OIDC Discovery: /.well-known/openid-configuration returns the auth server's endpoints and supported flows. All major IdPs (Google, Okta, Azure AD, Auth0) implement it, so write your integration against discovery metadata, not hardcoded URLs.
RBAC vs ABAC vs ReBAC — When to Use Each
| Model | Data Shape | Query Cost | Best For | Breaks When |
|---|---|---|---|---|
| RBAC | (user, role) + (role, perm) tables | O(1) — single lookup | Simple apps: admin/user distinction, internal tools | Permissions depend on resource ownership or sharing |
| ABAC | Policies evaluated over attributes | O(policy size) — rule evaluation | Compliance-driven (SOC 2, HIPAA), AWS IAM-style APIs | Needs arbitrary graph relationships |
| ReBAC (Zanzibar) | Tuples (user, relation, object) in a graph | O(graph traversal) — cached ~5ms | Collaboration (Docs, Drive, Dropbox), social graphs | Small apps — massive overhead vs benefit |
| Hybrid (common) | RBAC for coarse, ABAC/ReBAC for fine | Two-tier check | Most production systems at scale | Requires careful policy boundary design |
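
The ReBAC row is the least familiar of these, so here is a minimal sketch of a relationship check over tuples. It is a toy under stated assumptions: the `write` and `check` functions and the `object#relation` userset notation are illustrative, and real Zanzibar-style systems add namespace schemas, rewrite rules, caching, and consistency tokens that are omitted here.

```python
from collections import defaultdict

# Relation tuples: (object, relation) -> set of subjects.
# A subject is either a user ("user:alice") or a userset ("team:z#member").
tuples: dict[tuple[str, str], set[str]] = defaultdict(set)


def write(obj: str, relation: str, subject: str) -> None:
    tuples[(obj, relation)].add(subject)


def check(obj: str, relation: str, user: str, _seen=None) -> bool:
    """True if `user` has `relation` on `obj`, directly or via a userset."""
    seen = _seen if _seen is not None else set()
    if (obj, relation) in seen:          # guard against cycles in the graph
        return False
    seen.add((obj, relation))
    for subject in tuples.get((obj, relation), ()):
        if subject == user:
            return True
        if "#" in subject:               # userset: expand "object#relation"
            sub_obj, sub_rel = subject.split("#", 1)
            if check(sub_obj, sub_rel, user, seen):
                return True
    return False


write("team:z", "member", "user:alice")
write("folder:q", "owner", "team:z#member")
write("doc:abc123", "editor", "folder:q#owner")

print(check("doc:abc123", "editor", "user:alice"))  # True
print(check("doc:abc123", "editor", "user:bob"))    # False
```

Alice can edit the document only because of the chain document → folder → team. An RBAC table of (user, role) pairs has no way to express that chain without inventing a role per document.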
Key Storage, Rotation, and JWK Sets
Signing keys are the crown jewels — anyone with the private key can mint valid tokens for your platform. Three operational practices are non-negotiable.
1. Store signing keys in HSM / KMS, never in config: AWS KMS, GCP Cloud KMS, or a hardware HSM holds the private key. The auth server signs by calling the KMS API, so the raw private key never leaves the HSM boundary. A leaked environment file or a compromised app server doesn't leak the signing key. Performance: KMS sign calls take ~5-20ms each. Signatures cannot be cached because every token is different, so if you're issuing >100 tokens/sec the usual workaround is to sign in-process with a key fetched at startup and refreshed periodically, which trades the HSM boundary for latency.
2. Rotate signing keys regularly (quarterly minimum): publish a JWK set at /.well-known/jwks.json containing multiple active keys. During rotation, add the new key to the set first, wait at least one verifier cache TTL (or rely on verifiers refetching the set when they see an unknown kid), then start signing with it (via the kid header) while continuing to verify tokens signed by the old key. After all old tokens have expired (wait max_token_lifetime + buffer), remove the old key. Verifiers cache the JWK set with a TTL (15 min to 1 hour); they pick up new keys automatically.
3. Rotate refresh tokens on every use: issue a new refresh token each time the client exchanges one for an access token; invalidate the previous. If an attacker steals a refresh token and uses it, the legitimate client's next refresh fails — detectable as a refresh token reuse attack. On detection, invalidate the entire token family (all tokens descended from the stolen one) and force re-authentication. This is the OAuth 2.1 recommendation.
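Rotation with reuse detection can be sketched as follows. The `RefreshTokenStore` class is hypothetical and keeps everything in memory; a real implementation would persist records, store token hashes rather than raw tokens, and expire old entries.

```python
import secrets


class RefreshTokenStore:
    """Hypothetical in-memory store illustrating token families."""

    def __init__(self):
        self._tokens: dict[str, dict] = {}      # token -> {"family", "used"}
        self._revoked_families: set[str] = set()

    def issue(self, family_id=None) -> str:
        family_id = family_id or secrets.token_hex(8)  # new family per login
        token = secrets.token_urlsafe(32)
        self._tokens[token] = {"family": family_id, "used": False}
        return token

    def rotate(self, token: str) -> str:
        """Exchange a refresh token for its successor, or raise PermissionError."""
        record = self._tokens.get(token)
        if record is None or record["family"] in self._revoked_families:
            raise PermissionError("unknown or revoked refresh token")
        if record["used"]:
            # A spent token was presented again: treat the family as stolen.
            self._revoked_families.add(record["family"])
            raise PermissionError("refresh token reuse detected; family revoked")
        record["used"] = True
        return self.issue(record["family"])


store = RefreshTokenStore()
t1 = store.issue()          # login
t2 = store.rotate(t1)       # first refresh succeeds
try:
    store.rotate(t1)        # t1 presented a second time
except PermissionError as err:
    print(err)              # refresh token reuse detected; family revoked
try:
    store.rotate(t2)        # every descendant of the family is now dead
except PermissionError as err:
    print(err)              # unknown or revoked refresh token
```

The server cannot tell which of the two presenters of `t1` was the attacker, which is why the whole family is revoked and the user is sent back through login.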
Passwords: store with Argon2id (cost parameters: m=64MB, t=3, p=4 minimum as of 2024) or bcrypt (cost factor 12+). Never MD5, SHA-1, SHA-256 — those are for data integrity, not passwords. Always use a unique per-user salt. At login, compare with constant-time comparison to prevent timing attacks revealing hash prefix.
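Argon2id and bcrypt both need third-party packages, so the sketch below swaps in scrypt from the standard library's hashlib to show the same shape: a memory-hard KDF, a per-user random salt, and a constant-time comparison. The scrypt parameters are illustrative assumptions, not a recommendation from this guide, and `hashlib.scrypt` requires a Python build linked against OpenSSL with scrypt support.

```python
import hashlib
import hmac
import secrets

# Illustrative scrypt parameters (assumptions; tune for your hardware).
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1, "dklen": 32}


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Returns (salt, derived_key); store both alongside the user record."""
    salt = secrets.token_bytes(16)  # unique per user, never reused
    key = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return salt, key


def verify_password(password: str, salt: bytes, stored_key: bytes) -> bool:
    candidate = hashlib.scrypt(password.encode("utf-8"), salt=salt,
                               **SCRYPT_PARAMS)
    # Constant-time comparison: runtime does not depend on where bytes differ.
    return hmac.compare_digest(candidate, stored_key)


salt, key = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, key))  # True
print(verify_password("wrong guess", salt, key))                   # False
```

With Argon2id or bcrypt the structure is the same; the library typically encodes the salt and cost parameters into a single stored string.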
JWT Validation with JWK Caching and Full Claim Checks
```python
from dataclasses import dataclass

import jwt as pyjwt
from jwt import InvalidTokenError, PyJWKClient


@dataclass
class AuthContext:
    user_id: str
    issuer: str
    audience: str
    scopes: list[str]
    issued_at: int
    expires_at: int


class JWTValidator:
    """Validates RS256 JWTs issued by a trusted auth server.

    - Fetches and caches the JWK set (1-hour TTL).
    - Validates signature, exp, iss, aud, nbf in one pass.
    - Raises InvalidTokenError on any failure; never silently accepts.
    """

    def __init__(self, jwks_uri: str, expected_issuer: str,
                 expected_audience: str, clock_skew_sec: int = 30):
        self._jwks_client = PyJWKClient(jwks_uri, cache_keys=True,
                                        lifespan=3600)  # 1h cache
        self._issuer = expected_issuer
        self._audience = expected_audience
        self._clock_skew = clock_skew_sec

    def validate(self, token: str) -> AuthContext:
        """Returns AuthContext on success, raises InvalidTokenError on failure."""
        try:
            # Picks the correct key via the `kid` header, which supports rotation.
            signing_key = self._jwks_client.get_signing_key_from_jwt(token)
        except Exception as e:
            # Fail closed: a key-fetch problem is reported as an invalid token.
            raise InvalidTokenError(f"Failed to fetch signing key: {e}") from e

        # Decode + verify: pyjwt checks exp, iss, aud, nbf, signature all at once.
        # Do NOT disable any of these; each is a real attack vector.
        payload = pyjwt.decode(
            token,
            signing_key.key,
            algorithms=["RS256"],  # explicit allow-list, prevents alg confusion
            audience=self._audience,
            issuer=self._issuer,
            leeway=self._clock_skew,
            options={
                "require": ["exp", "iat", "iss", "aud", "sub"],
                "verify_signature": True,
                "verify_exp": True,
                "verify_nbf": True,
                "verify_iat": True,
                "verify_aud": True,
                "verify_iss": True,
            },
        )
        return AuthContext(
            user_id=payload["sub"],
            issuer=payload["iss"],
            # `aud` may be a list; report the audience that was actually
            # verified rather than whichever entry happens to come first.
            audience=self._audience,
            scopes=payload.get("scope", "").split(),
            issued_at=payload["iat"],
            expires_at=payload["exp"],
        )


# Usage in an API middleware
validator = JWTValidator(
    jwks_uri="https://auth.example.com/.well-known/jwks.json",
    expected_issuer="https://auth.example.com",
    expected_audience="https://api.example.com",
)


def middleware(request) -> AuthContext:
    auth_header = request.headers.get("Authorization", "")
    if not auth_header.startswith("Bearer "):
        raise InvalidTokenError("Missing Bearer token")
    token = auth_header[len("Bearer "):]
    return validator.validate(token)  # raises on any failure
```
MFA and Passwordless — WebAuthn Is the Future
Password-only authentication is a broken model: reused passwords, phishing, credential-stuffing attacks from breach databases. Modern auth adds a second factor or removes passwords entirely.
TOTP (Time-based One-Time Password) — RFC 6238, the algorithm behind Google Authenticator, Authy, 1Password TOTP. User scans a QR code that encodes a shared secret; app generates a 6-digit code that rotates every 30 seconds. Strong against password-breach attacks. Weak against phishing — a fake login page can relay the code in real time to the real service.
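The algorithm is short enough to sketch with the standard library. This follows RFC 4226 (HOTP) and RFC 6238 (TOTP) with HMAC-SHA1; the function names are illustrative, and real authenticator apps receive the secret base32-encoded inside the QR code, which is skipped here.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """RFC 4226 HOTP: HMAC-SHA1 over an 8-byte big-endian counter."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                       # dynamic truncation
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10**digits).zfill(digits)


def totp(secret: bytes, at=None, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP: HOTP with counter = floor(unix_time / step)."""
    now = time.time() if at is None else at
    return hotp(secret, int(now // step), digits)


# RFC 6238 Appendix B test secret (ASCII), evaluated at T = 59 seconds.
print(totp(b"12345678901234567890", at=59, digits=8))  # 94287082
```

A server typically accepts the code for the current step plus one step either side to tolerate clock drift. Nothing in the code binds it to a domain, which is why a phishing page can relay it.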
SMS / email codes: common but compromised. SMS is vulnerable to SIM swap attacks (an attacker convinces the carrier to port the victim's number) and to SS7 protocol flaws that allow messages to be intercepted in transit. Email is only as secure as the email account. NIST SP 800-63B (2017) classifies SMS as a restricted authenticator; use it only as a fallback and never for high-value accounts (banking, enterprise admin).
WebAuthn / Passkeys — the current gold standard. Cryptographic challenge-response using a hardware-backed key pair (Touch ID, Face ID, YubiKey, Windows Hello). The private key never leaves the device. The server sends a challenge; the client signs with the private key; the server verifies with the public key registered during signup. Phishing-resistant by design — the browser binds the challenge to the domain (example.com), so a phishing site (example-login.com) cannot trigger a valid signature. Apple, Google, and Microsoft are pushing passkeys as password replacement — sync via iCloud Keychain, Google Password Manager, or Windows Hello. Use in 2024 onward for any new system.
Magic links — email a one-click login URL. Low friction, but email inbox security becomes auth security. Acceptable for low-stakes products; inappropriate for anything touching money or sensitive data.
The staff-level recommendation: TOTP + recovery codes as baseline, WebAuthn passkeys as preferred primary, SMS only as account-recovery fallback with explicit user opt-in.
Failure Modes — The Ways Auth Systems Get Broken
Real attacks exploit specific subsystems. Your design should explicitly address each.
JWT algorithm confusion (CVE-2016-10555): verifier accepts tokens signed with alg=HS256 when it should only accept RS256. Attacker takes the public key, uses it as the HMAC secret, and forges tokens. Mitigation: hardcode algorithms=["RS256"] in the verifier — never trust the alg header alone.
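A stdlib-only sketch of both sides of this attack is shown below. The forging function reproduces what the attacker does, and the allow-list check is the mitigation. The names are illustrative and the public key bytes are a placeholder. In practice the allow-list is passed to the JWT library's decode call rather than hand-rolled.

```python
import base64
import hashlib
import hmac
import json

ALLOWED_ALGS = {"RS256"}  # what this verifier is configured to accept


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def header_of(token: str) -> dict:
    segment = token.split(".")[0]
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def assert_alg_allowed(token: str) -> None:
    """Reject before touching any key if the header's alg is not allow-listed."""
    alg = header_of(token).get("alg")
    if alg not in ALLOWED_ALGS:
        raise ValueError(f"algorithm {alg!r} not allowed")


def forge_hs256(claims: dict, public_key_pem: bytes) -> str:
    """The attack: HMAC-sign a token using the PUBLIC key as the secret."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    signing_input = header + "." + b64url(json.dumps(claims).encode())
    sig = hmac.new(public_key_pem, signing_input.encode("ascii"),
                   hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)


forged = forge_hs256({"sub": "admin"}, b"-----BEGIN PUBLIC KEY-----placeholder")
try:
    assert_alg_allowed(forged)
except ValueError as err:
    print(err)  # algorithm 'HS256' not allowed
```

A vulnerable verifier would instead read `alg` from the header, select HMAC, and use whatever key material it holds for that issuer (the public key) as the secret, so the forged signature would verify.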
Missing aud claim validation (confused deputy): service A issues a JWT; attacker replays it at service B; service B accepts it because both trust the same issuer. Mitigation: every service validates aud matches its own identifier.
Stolen refresh token: attacker exfiltrates a long-lived refresh token via XSS or a compromised device. Mitigation: rotate refresh tokens on every use + detect reuse (if a used token is presented again, the family is stolen — revoke all). Bind tokens to device via DPoP or mTLS when available.
Token replay: attacker captures a valid JWT and replays it before expiry. Mitigation: short access-token TTL (5-15 min), jti claim + seen-set for high-value operations, TLS everywhere (no plaintext auth headers).
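The jti seen-set can be sketched as a small class. `ReplayGuard` is a hypothetical name and the dict is a stand-in; a shared deployment would more likely use something like a Redis key per jti with an expiry, set only if absent.

```python
import time


class ReplayGuard:
    """Remembers each jti until its token would have expired anyway."""

    def __init__(self):
        self._seen: dict[str, float] = {}  # jti -> token exp (unix seconds)

    def check_and_record(self, jti: str, exp: float, now=None) -> bool:
        """True the first time a jti is presented; False on replay or expiry."""
        now = time.time() if now is None else now
        # Drop entries for expired tokens; normal exp validation rejects those.
        # (Linear scan per call is fine for a sketch, not for production.)
        self._seen = {j: e for j, e in self._seen.items() if e > now}
        if exp <= now or jti in self._seen:
            return False
        self._seen[jti] = exp
        return True


guard = ReplayGuard()
print(guard.check_and_record("jti-1", exp=1000, now=900))  # True
print(guard.check_and_record("jti-1", exp=1000, now=901))  # False
```

Because entries only need to live until `exp`, short access-token TTLs keep the seen-set small, which is one reason the two mitigations are usually paired.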
Timing attacks on password comparison: naïve == comparison returns early on the first mismatched byte — an attacker measures response times to recover the hash. Mitigation: hmac.compare_digest() or equivalent constant-time comparison.
Session fixation: attacker sets the victim's session ID before login; after login the attacker knows the session. Mitigation: regenerate session ID on every privilege escalation (login, MFA, permission grant).
SSRF via OIDC discovery: attacker points your server at a malicious .well-known/openid-configuration URL, gets it to fetch internal resources. Mitigation: allowlist the set of valid issuer URLs.
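A minimal sketch of the allowlist check, assuming issuers are compared by exact string match after trimming a trailing slash. The `TRUSTED_ISSUERS` set and `discovery_url` function are hypothetical; in a multi-tenant system the set would come from per-tenant configuration.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; in practice this comes from tenant configuration.
TRUSTED_ISSUERS = {"https://auth.example.com", "https://accounts.google.com"}


def discovery_url(issuer: str) -> str:
    """Build the discovery URL only for exact, allowlisted https issuers."""
    normalized = issuer.rstrip("/")
    parts = urlsplit(normalized)
    if parts.scheme != "https" or normalized not in TRUSTED_ISSUERS:
        raise ValueError(f"untrusted issuer: {issuer!r}")
    return normalized + "/.well-known/openid-configuration"


print(discovery_url("https://auth.example.com"))
# https://auth.example.com/.well-known/openid-configuration
try:
    discovery_url("http://169.254.169.254/latest/meta-data")
except ValueError as err:
    print(err)  # untrusted issuer: 'http://169.254.169.254/latest/meta-data'
```

Exact matching is deliberate: prefix or substring checks are easy to bypass with hosts such as `auth.example.com.attacker.net`.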
JWKS poisoning: verifier fetches JWKS from a URL the attacker can influence. Mitigation: pin the JWKS URI to the auth server's known domain + TLS cert.
What to Say in an Auth System Design Interview
A high-signal opening for "design authentication for [large app]":
"I'd separate AuthN from AuthZ as distinct services. AuthN uses OIDC — OAuth 2.0 Authorization Code flow with PKCE for all user-facing clients (web, mobile, SPA) — issuing short-lived RS256 JWT access tokens (5-15 min TTL) plus longer refresh tokens with rotation on every use. Signing keys live in KMS, published via a JWK set with kid-based rotation. For AuthZ, I'd start with RBAC for coarse admin/user roles and use a Zanzibar-style ReBAC system (SpiceDB) for resource-level sharing permissions — the 'can user X edit document Y because they're in team Z' case. For MFA, WebAuthn passkeys as primary, TOTP as fallback, SMS only for recovery. Revocation is handled by short token TTL + a blocklist for high-value revocations (password reset, device loss). For scale, every resource service validates JWT locally with a cached public key — p99 under 1ms — rather than hitting a central session store. I'd fail-closed on all auth checks."
That covers the eight axes interviewers care about: AuthN/AuthZ split, OAuth flow, token format, key management, authorization model, MFA, revocation, and scale — in under two minutes. Everything else is drill-down.