
API Gateway Design: Auth, Rate Limiting, Routing, and BFF Patterns

Production API gateway architecture covering Kong, Envoy, AWS API Gateway, and BFF patterns. Where to terminate TLS, validate JWTs, enforce rate limits, aggregate requests — and how to avoid turning the gateway into a distributed monolith.


What an API Gateway Actually Is

An API gateway is the single entry point from clients into your backend. In a microservices architecture, it is the place where every external request lands before being routed to one of dozens or hundreds of internal services. Strip away the marketing, and it exists to solve one problem: cross-cutting concerns do not belong in every microservice.

Authentication, TLS termination, rate limiting, request logging, distributed tracing, CORS headers, request/response transformation, schema validation, routing by path or header, API versioning, quota enforcement — these are concerns every service would otherwise reimplement, inconsistently. The gateway centralizes them.

The client no longer needs to know that /api/products lives in the catalog-service and /api/orders lives in order-service and /api/auth/login lives in identity-service. It calls one endpoint, with one auth scheme, one rate limit regime, and one TLS certificate. Internally, the gateway routes to the right service.

This is deceptively simple until you see the failure mode: teams start putting business logic in the gateway because it is convenient. Before long, the gateway is a distributed monolith, slow to deploy, owned by no one, and blocking every feature team. The senior insight: the gateway is for infrastructure concerns that are genuinely universal. Anything product-specific belongs downstream, even if it means some duplication.

The alternative to one shared gateway is BFF (Backend for Frontend) — a gateway per client type (web, iOS, Android, partner API). Each BFF is owned by the team building that client, optimized for its specific needs, and avoids the 'gateway team becomes a bottleneck' pathology.

IMPORTANT

What Interviewers Are Actually Testing

API gateway questions probe whether you understand the boundary between infrastructure and application code. Weak answers treat the gateway as a magical box that 'does auth and routing.' Strong answers are opinionated about what belongs in the gateway vs. what belongs in services, can articulate the difference between gateway, load balancer, and service mesh, and know that rate limiting in a distributed environment is a distributed systems problem, not a config toggle. The highest-signal question is where do you validate JWTs, and how do downstream services trust the gateway? If you say 'every service validates its own JWT,' you have not thought about latency or key rotation. If you say 'gateway validates, passes claims as signed internal headers, downstream trusts a mutual-TLS peer,' you are senior.

Clarifying Questions Before Designing a Gateway

01

Who are the clients, and are their needs similar?

Public web, mobile apps, partner APIs, and internal services have different auth schemes, payload shapes, and rate limit profiles. If they diverge significantly, you want BFFs, not one gateway.

02

What auth model(s) must the gateway support?

Session cookies for web, OAuth2 bearer tokens for mobile, API keys for partners, mTLS for service-to-service. Gateway must multiplex these cleanly.

03

What is the QPS target per endpoint, and what are the latency SLOs?

Low-QPS admin endpoints tolerate gateway overhead; a 10K QPS hot path needs sub-millisecond routing overhead.

04

Which services are gRPC, REST, GraphQL, or WebSocket?

Protocol translation (REST ↔ gRPC, for example) is a common gateway responsibility and changes provider selection.

05

What is the rate limiting model — per-IP, per-user, per-API-key, per-tenant?

Distributed rate limiting at scale needs Redis or equivalent; local-only counters leak quota across gateway instances.

06

Do you need request aggregation, or does the client make multiple calls?

Aggregation (BFF or GraphQL) simplifies clients but concentrates latency. Parallel client calls distribute latency.

07

What is the deployment topology — single gateway, team-owned gateways, or edge+service layers?

One gateway is simple but a SPOF and political bottleneck. Layered gateways are more complex but scale organizationally.

Gateway vs Load Balancer vs Service Mesh

These three overlap enough that senior interviewers ask about them explicitly. The distinctions matter.

Load balancer (L4 or L7). Pure traffic distribution. An L4 LB (AWS NLB, HAProxy in TCP mode) routes by IP/port without understanding HTTP. An L7 LB (ALB, Nginx, HAProxy in HTTP mode) routes by host, path, header. It does health checks, connection pooling, TLS termination, maybe sticky sessions. It does not do business-aware auth, per-user rate limiting, or request transformation.

API gateway. A superset of L7 LB with application-aware features: JWT validation, OAuth2 flows, per-user rate limiting, request/response transformation, protocol translation, API key management, developer portal, quota enforcement. Kong, AWS API Gateway, Apigee, Tyk.

Service mesh. Handles service-to-service (east-west) traffic, typically via sidecar proxies (Envoy) injected next to every service. Concerns: mutual TLS between services, retries, circuit breaking, service discovery, observability, canary routing. Istio, Linkerd, Consul Connect. The mesh does not care about end-user auth or public API surfaces; it cares that order-service can safely and observably talk to payment-service.

The overlap. Envoy is the canonical example: it can be deployed as a standalone L7 LB, as an API gateway (Emissary/Ambassador, Contour), or as a service mesh sidecar (Istio). Same binary, different configs. This confuses candidates who treat the categories as products rather than as positions in the request flow.

Production pattern. Large systems run all three: a CDN/WAF at the absolute edge, an API gateway behind it for north-south (client-to-service) concerns, and a service mesh for east-west (service-to-service) concerns. Each layer handles what it is best at; trying to collapse all three into one tool creates tight coupling and unclear ownership.

Gateway in the Request Flow

(Diagram: client → CDN/WAF at the edge → API gateway for north-south concerns → backend services, with a service mesh handling east-west traffic.)

Core Responsibilities, Ranked by Importance

Not every gateway does everything. Prioritize by what every production system needs.

1. Routing. Map incoming path/host/header to a backend service: /api/products/* → catalog-service, /api/orders/* → order-service; api.mobile.example.com → mobile BFF, api.example.com → web BFF. Route rules must be declarative, versioned, and testable — never ad-hoc regex in a config file.
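
The rule set itself can be tiny. Here is a minimal sketch of a declarative, testable route table in Python; the paths and service names mirror the examples above, and longest-prefix matching is one reasonable tie-break policy, not the only one.

routes.py
# Declarative route table: data, not code, so it can be versioned
# and unit-tested. Longest matching prefix wins.
ROUTES = [
    ("/api/products/", "catalog-service"),
    ("/api/orders/", "order-service"),
    ("/api/auth/", "identity-service"),
]

def resolve(path: str) -> str | None:
    matches = [(prefix, svc) for prefix, svc in ROUTES
               if path.startswith(prefix)]
    if not matches:
        return None
    return max(matches, key=lambda m: len(m[0]))[1]

assert resolve("/api/orders/42/items") == "order-service"
assert resolve("/healthz") is None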

2. TLS termination. Terminate HTTPS at the gateway, re-encrypt to backends with internal certs (or plain HTTP inside a trusted VPC with a service mesh handling mTLS). Centralizing cert management at the gateway is one of the biggest operational wins.

3. Authentication. Validate credentials — JWT signature, OAuth2 token introspection, API key lookup, mTLS client cert. On success, attach verified claims as internal headers (X-User-Id, X-User-Roles, X-Tenant-Id). Downstream services trust these because the gateway stripped any client-provided versions and the network is trusted (mTLS mesh or private VPC).

4. Rate limiting. Enforce quotas per-IP, per-user, per-API-key, per-tenant, per-endpoint. Return 429 with a Retry-After header when a quota is exceeded.

5. Observability. Emit structured access logs, metrics (latency histograms per route, 5xx rate per upstream), and distributed tracing headers (traceparent, X-B3-TraceId) so downstream spans are linked to gateway spans.

6. Resilience — circuit breaking and retries. Open circuits when a backend's error rate crosses threshold; retry idempotent requests with exponential backoff and jitter; enforce timeouts so a slow backend cannot exhaust gateway capacity.

7. Transformation. Rewrite paths, inject/strip headers, convert between REST and gRPC, translate error envelopes. Use sparingly — transformation is where business logic sneaks in.

8. Request aggregation / GraphQL. Fetch from multiple backends in parallel, assemble response. Powerful but dangerous — aggregation latency is dominated by the slowest backend.

9. Response caching. Short-TTL caching of idempotent responses at the gateway. Rarely the right layer; put this at the CDN instead.

Authentication: Validate Once, Propagate Claims

Where to validate JWTs is the single most interview-critical gateway question. The options:

Option A: Every service validates. Each microservice independently verifies the JWT signature, checks expiry, and extracts claims. Sounds safe but has real problems: (1) every service needs access to the JWKS endpoint and must cache public keys; (2) key rotation must propagate to every service; (3) signature verification (RS256) costs ~100μs-1ms per request, so a request crossing 20 internal hops that each validate accrues up to 20ms of CPU work; (4) duplicated validation logic drifts into inconsistency (Service A rejects a token Service B accepts).

Option B: Gateway validates, claims propagate as headers. Gateway validates the JWT once, then injects verified claims as internal headers: X-User-Id: 12345, X-User-Roles: admin,billing, X-Tenant-Id: acme. Downstream services trust these headers because (1) the gateway stripped any client-provided versions of the same headers before injection, and (2) the network between gateway and services is authenticated via mTLS (service mesh) or private VPC. This is the production standard for most systems.

Option C: Gateway validates plus signed internal tokens. Gateway validates the client JWT, then mints an internal JWT signed with a key only services trust. Each service verifies the internal JWT on receipt. Adds one verification per hop but removes network trust assumption. Used in zero-trust architectures and cross-region setups where mTLS alone is insufficient.
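
A sketch of that token exchange at the gateway, assuming PyJWT; the key material, claim names, and 60-second lifetime are illustrative, not prescriptive.

internal_token.py
import time
import jwt  # PyJWT

IDP_PUBLIC_KEY = "..."  # in practice, fetched and cached from the IdP's JWKS
INTERNAL_KEY = "..."    # trusted only by the gateway and internal services

def exchange_token(client_token: str) -> str:
    # Validate the client-facing JWT once, at the gateway.
    claims = jwt.decode(
        client_token,
        IDP_PUBLIC_KEY,
        algorithms=["RS256"],
        audience="api.example.com",
    )
    # Mint a short-lived internal token carrying only what services need.
    # HS256 for brevity; an asymmetric key avoids sharing a secret with
    # every service.
    internal_claims = {
        "sub": claims["sub"],
        "tid": claims.get("tid"),
        "roles": claims.get("roles", []),
        "iss": "gateway.internal",
        "exp": int(time.time()) + 60,  # one request chain, not one session
    }
    return jwt.encode(internal_claims, INTERNAL_KEY, algorithm="HS256")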

Key rotation. Whichever option you pick, JWKS URIs and automated key rotation are non-negotiable. Public keys should be fetched dynamically and cached for minutes, not hardcoded. The gateway must handle the case where a token is signed with a kid it does not yet know: refetch the JWKS before rejecting.

Never trust client headers. If the gateway injects X-User-Id, it must strip any client-provided X-User-Id first. This is the most common security bug in gateway configs: forgetting to strip the header lets any client claim to be any user.

Envoy Config: JWT Validation and Claim Injection

envoy-jwt-filter.yaml
# Validate JWT at the gateway, forward verified claims as headers.
# Downstream services trust X-User-Id because the gateway strips any
# client-provided version before this filter runs.

http_filters:
  - name: envoy.filters.http.jwt_authn
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.jwt_authn.v3.JwtAuthentication
      providers:
        auth0_provider:
          issuer: "https://auth.example.com/"
          audiences:
            - "api.example.com"
          remote_jwks:
            http_uri:
              uri: "https://auth.example.com/.well-known/jwks.json"
              cluster: auth0_jwks_cluster
              timeout: 1s
            cache_duration:
              seconds: 300  # refresh keys every 5 min
          forward: false  # do not forward the original token
          payload_in_metadata: "jwt_payload"

      rules:
        - match:
            prefix: "/api/public"
          # no requires — public endpoints skip auth
        - match:
            prefix: "/api/"
          requires:
            provider_name: auth0_provider

  # After jwt_authn succeeds, lua filter promotes verified claims
  # into internal headers that downstream services consume.
  - name: envoy.filters.http.lua
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
      inline_code: |
        function envoy_on_request(handle)
          -- Strip any client-provided internal headers FIRST.
          handle:headers():remove("x-user-id")
          handle:headers():remove("x-user-roles")
          handle:headers():remove("x-tenant-id")

          local meta = handle:streamInfo():dynamicMetadata():get(
            "envoy.filters.http.jwt_authn"
          )
          if meta and meta["jwt_payload"] then
            local payload = meta["jwt_payload"]
            handle:headers():add("x-user-id", payload["sub"] or "")
            handle:headers():add("x-tenant-id", payload["tid"] or "")
            local roles = payload["roles"] or {}
            handle:headers():add("x-user-roles", table.concat(roles, ","))
          end
        end

Rate Limiting: Distributed Counters Done Right

Rate limiting looks trivial ('count requests, reject over quota') and is genuinely hard when you have 50 gateway instances handling 100K QPS and need consistent enforcement.

Five algorithms to know:

  • Fixed window. Count requests in discrete time buckets (per minute). Simple, fast, but allows 2x burst at window boundary — a user making 100 requests at 00:59 and 100 at 01:00 fits two 'per-minute' windows.
  • Sliding window log. Store timestamp of each request; count those in the trailing N seconds. Accurate but memory-heavy at scale (O(N) per user).
  • Sliding window counter. Interpolate between two fixed windows. Near-exact, O(1) memory, the production default.
  • Token bucket. Refill tokens at rate R; each request consumes one; allows bursts up to the bucket size. Best for APIs that want occasional bursts while enforcing a long-term rate; common for payment and AWS APIs (sketched in code after this list).
  • Leaky bucket. Constant outflow rate regardless of input; smooths bursts. Used for traffic shaping rather than quota enforcement.
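
Of these, the token bucket is the easiest to hold in your head as code. A single-process sketch for illustration; the distributed version keeps the same state (token count plus last-refill timestamp) in Redis. Names and the burst size are illustrative.

token_bucket.py
import time

class TokenBucket:
    """Refill at `rate` tokens/sec; allow bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full: allow an initial burst
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# 100 requests/minute long-term, bursts of up to 20:
bucket = TokenBucket(rate=100 / 60, capacity=20)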

The distributed counter problem. If every gateway pod counts locally, a user with a 100 RPM quota hitting 50 pods can do 50 × 100 = 5000 RPM. You need shared state.

Redis is the workhorse. Atomic INCR + EXPIRE on a key like ratelimit:user:12345:minute:2024-04-23T10:45 gives you exact counting across all gateway instances. Sub-millisecond latency on a nearby Redis cluster. Envoy's rate limit service, Kong's rate-limiting plugin, and AWS API Gateway's usage plans all reduce to variations on this pattern.

Cost optimization. At very high QPS, even Redis becomes a bottleneck. Large-scale systems (Stripe, AWS API Gateway) use approximate counting: each gateway pod tracks local counts and syncs to a central store every ~1 second, accepting bounded over-quota in exchange for massively reduced central-store load. Users do not notice 1% over-quota; the central store certainly notices the reduced write load.
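
A minimal sketch of that sync pattern, assuming redis-py; the one-second flush interval and key TTL are illustrative. Enforcement reads the shared total, so each pod can overshoot by at most its own traffic during one flush interval — that is the bounded over-quota.

approximate_counter.py
import threading
import time
from collections import Counter

import redis

class ApproximateCounter:
    def __init__(self, r: redis.Redis, flush_interval: float = 1.0):
        self.r = r
        self.local = Counter()
        self.lock = threading.Lock()
        self.flush_interval = flush_interval
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def incr(self, key: str) -> None:
        # Hot path: purely local, no network round trip.
        with self.lock:
            self.local[key] += 1

    def _flush_loop(self) -> None:
        while True:
            time.sleep(self.flush_interval)
            with self.lock:
                pending, self.local = self.local, Counter()
            pipe = self.r.pipeline()
            for key, count in pending.items():
                pipe.incrby(key, count)  # merge local deltas centrally
                pipe.expire(key, 120)    # per-window keys are short-lived
            pipe.execute()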

Response contract. On rate limit exceeded: HTTP 429, a Retry-After: <seconds> header, and a JSON body explaining which limit was hit and when it resets. X-RateLimit-Remaining and X-RateLimit-Limit headers on all successful responses let well-behaved clients self-throttle.
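
A framework-agnostic sketch of that contract; the JSON field names are illustrative, and RateLimitResult refers to the listing that follows.

rate_limit_response.py
import json
import time

def rate_limit_response(result, limit: int):
    """Build the 429 contract from a RateLimitResult (see next listing)."""
    retry_after = max(0, result.reset_at - int(time.time()))
    headers = {
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",  # quota exhausted by definition
    }
    body = json.dumps({
        "error": "rate_limited",
        "limit": limit,
        "reset_at": result.reset_at,  # epoch seconds
    })
    return 429, headers, body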

Distributed Rate Limit with Redis

rate_limit.py
import time
import redis
from dataclasses import dataclass

@dataclass
class RateLimitResult:
    allowed: bool
    remaining: int
    reset_at: int

# Sliding window counter — O(1) memory, near-exact accuracy.
# Runs server-side via Lua for atomicity across the three ops.
SLIDING_WINDOW_LUA = """
local key = KEYS[1]
local window_size = tonumber(ARGV[1])  -- seconds
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local current_window = math.floor(now / window_size)
local previous_window = current_window - 1
local elapsed_in_current = now - current_window * window_size
local weight_previous = 1.0 - elapsed_in_current / window_size

local current_count = tonumber(redis.call('GET', key .. ':' .. current_window) or '0')
local previous_count = tonumber(redis.call('GET', key .. ':' .. previous_window) or '0')

local weighted_total = current_count + previous_count * weight_previous
if weighted_total >= limit then
    return {0, 0, (current_window + 1) * window_size}
end

redis.call('INCR', key .. ':' .. current_window)
redis.call('EXPIRE', key .. ':' .. current_window, window_size * 2)
local remaining = limit - math.floor(weighted_total) - 1
return {1, remaining, (current_window + 1) * window_size}
"""

class RateLimiter:
    def __init__(self, redis_client: redis.Redis):
        self.r = redis_client
        self.script = self.r.register_script(SLIDING_WINDOW_LUA)

    def check(self, user_id: str, endpoint: str,
              limit: int, window_seconds: int) -> RateLimitResult:
        key = f"rl:{user_id}:{endpoint}"
        now = int(time.time())
        allowed, remaining, reset_at = self.script(
            keys=[key],
            args=[window_seconds, limit, now]
        )
        return RateLimitResult(
            allowed=bool(allowed),
            remaining=int(remaining),
            reset_at=int(reset_at)
        )
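
Usage is one round trip to Redis per request on the hot path; the values below are illustrative:

limiter_usage.py
r = redis.Redis(host="localhost", port=6379)
limiter = RateLimiter(r)

result = limiter.check(user_id="12345", endpoint="GET:/api/products",
                       limit=100, window_seconds=60)
if not result.allowed:
    pass  # return 429 with Retry-After: result.reset_at - now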

Request Aggregation and the BFF Pattern

A mobile app rendering a product detail screen needs: product info, inventory, reviews summary, recommendations, user's wishlist status. Five microservices. Three options:

Option A: Client makes five parallel calls. Simple backend, but the client juggles five loading states, five error cases, and five retries. On mobile, battery and radio usage suffer. Latency is dominated by the slowest call regardless.

Option B: Gateway aggregates. Client makes one call; the gateway fans out to five services in parallel, assembles one response. Client stays simple. But now the gateway owns product-screen business logic — which services to call, how to merge responses, how to handle partial failures.
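
What the fan-out and partial-failure handling look like concretely, as a sketch assuming httpx and asyncio; the endpoints, timeouts, and degradation policy are illustrative.

aggregate.py
import asyncio

import httpx

async def fetch_json(client: httpx.AsyncClient, url: str, timeout: float):
    # client is assumed to be constructed with base_url pointing at
    # internal services; paths here are illustrative.
    resp = await client.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.json()

async def product_screen(client: httpx.AsyncClient, product_id: str) -> dict:
    # Fan out in parallel: overall latency ≈ slowest call, not the sum.
    product, inventory, reviews, recs, wishlist = await asyncio.gather(
        fetch_json(client, f"/catalog/{product_id}", 0.5),
        fetch_json(client, f"/inventory/{product_id}", 0.3),
        fetch_json(client, f"/reviews/{product_id}/summary", 0.5),
        fetch_json(client, f"/recs/{product_id}", 0.8),
        fetch_json(client, f"/wishlist/contains/{product_id}", 0.3),
        return_exceptions=True,  # a partial failure must not fail the screen
    )
    if isinstance(product, Exception):
        raise product  # core data is required; everything else degrades
    return {
        "product": product,
        "inventory": None if isinstance(inventory, Exception) else inventory,
        "reviews": None if isinstance(reviews, Exception) else reviews,
        "recommendations": [] if isinstance(recs, Exception) else recs,
        "in_wishlist": False if isinstance(wishlist, Exception) else wishlist,
    }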

Option C: BFF (Backend for Frontend). A dedicated aggregation service per client type (web BFF, mobile BFF, partner BFF). The BFF is owned by the client team, not the platform team. It aggregates, transforms, and tailors responses for its specific client. Each BFF is a small service, not a monolith.

GraphQL is structurally a BFF. Apollo Gateway and similar stitch a single GraphQL schema over multiple backends; the client asks for exactly the fields it needs; the gateway resolves each field against the owning service. This is elegant when the field-per-service mapping is clean and a disaster when it is not (N+1 query problems, cross-service joins).

The senior principle. Aggregation belongs in a BFF, not in the shared gateway. When you let the shared gateway aggregate, every client team's feature becomes a gateway-team ticket. You have recreated the monolith you broke up — just with more YAML. BFFs scale organizationally because each client team controls its own aggregation.

When not to aggregate at all. If the client is sophisticated (desktop web) and latency is not dominant, parallel client calls are often better than aggregation: failures are isolated, caching per call is easier, no single service becomes a cross-domain dependency magnet.

Deployment Patterns: Single, Per-Team, or Layered

Single shared gateway. One Kong or AWS API Gateway cluster, all traffic flows through it. Simple, centralized policy enforcement, one place to monitor. Downsides: political bottleneck (gateway-team ticket for every new route), deploy risk (change breaks everyone), and scale ceiling. Works up to ~100 services, a few teams, low-to-moderate QPS.

Gateway per team or domain. Each team owns a gateway for its services. Decentralized, teams move faster, blast radius is smaller. Downsides: cross-cutting concerns (auth, rate limit, logging) are duplicated across many gateways; clients need to know which gateway hosts which API; inconsistent behavior between gateways. Works for large engineering orgs with strong platform support to share configs.

Layered: edge gateway plus service gateways. Thin edge gateway handles universal concerns (TLS, DDoS, coarse auth, global rate limit), then routes to domain gateways that handle domain-specific concerns (per-tenant rate limit, business auth, aggregation). This is the Netflix / Amazon pattern and the closest thing to an industry default for very large systems. Adds a hop (maybe 1-3ms) in exchange for clean ownership boundaries.

The BFF variant of layered. Edge gateway handles TLS and coarse security; behind it, one BFF per client type handles client-specific concerns (auth, aggregation, response shaping); BFFs call shared backend services. Netflix's Zuul and Meta's equivalents follow this shape. It is the production standard for consumer-facing systems at scale.

Mental model. The gateway is an organizational artifact as much as a technical one. Pick the topology that matches your org structure (Conway's Law) — one gateway for a small org, BFFs for many clients, layered for a many-client many-backend org at scale.

API Gateway Products: Strengths and Fit

| Product | Model | Edge Compute | Best For | Avoid When |
|---|---|---|---|---|
| Kong | Self-hosted or Konnect SaaS; Nginx + Lua or Go plugins | Plugins (Lua/Go/JS) | Self-hosted control, rich plugin ecosystem, on-prem and hybrid | You want fully managed with zero ops |
| AWS API Gateway | Fully managed; REST or HTTP API | Lambda integrations | AWS-native stacks, pay-per-request, quick start | Very high QPS (cost), strict latency (100-200ms typical overhead), non-AWS backends |
| Envoy (Emissary/Ambassador, Contour) | L7 proxy, K8s-native via CRDs | WASM filters, Lua | Kubernetes-first, performance-critical, reusing service mesh primitives | You want a batteries-included developer portal out of the box |
| Tyk | Self-hosted or cloud; Go-based | JSVM, Go plugins | On-prem API management with developer portal, analytics | Niche ecosystem vs Kong; smaller community |
| Apigee (Google) | Fully managed, enterprise-focused | JS/Java policies | Large enterprise API products, monetization, analytics, partner APIs | Startups (pricing), simple internal use cases |
| Apollo Router / Gateway | GraphQL federation | Rhai scripts, coprocessors | GraphQL-first architectures, federated schemas across teams | Non-GraphQL backends; GraphQL is the wrong tool for the job |
| AWS ALB + Cognito | L7 LB with basic auth | None (use Lambda@Edge ahead of it) | Simple auth needs, AWS-native, lowest-cost path | Rich transformations, per-user rate limiting, protocol translation |

Failure Modes and Production Pitfalls

Gateway as SPOF. One cluster, one failure domain, whole platform offline. Mitigation: deploy across AZs with auto-scaling; run active-active across regions behind GeoDNS; keep state (rate limit counters) in a separate HA store so restarting a gateway pod is non-disruptive.

Cascading timeouts. Client timeout is 30s. Gateway timeout to backend is 30s. Backend timeout to database is 30s. When the database slows, the entire stack holds connections for 30 seconds before failing — and queues back up. Fix: set decreasing timeouts going inward (client 10s, gateway 8s, service 5s, DB 2s) so inner layers fail fast and outer layers can retry or degrade. Match with connection pool limits to avoid exhausting the gateway.
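
One way to implement decreasing timeouts going inward is to propagate an absolute deadline and derive each hop's timeout from it. A sketch assuming httpx; the header name and safety margin are hypothetical, not a standard.

deadline.py
import time

import httpx

BUDGET_HEADER = "X-Deadline-Epoch-Ms"  # hypothetical internal header

def call_backend(incoming_headers: dict, url: str, margin_ms: int = 200):
    """Spend less of the budget at each inner hop; fail fast when it's gone."""
    deadline_ms = int(incoming_headers.get(BUDGET_HEADER, "0"))
    remaining_ms = deadline_ms - int(time.time() * 1000) - margin_ms
    if remaining_ms <= 0:
        raise TimeoutError("caller budget exhausted; fail fast, do not queue")
    return httpx.get(
        url,
        timeout=remaining_ms / 1000,
        headers={BUDGET_HEADER: str(deadline_ms)},  # pass the deadline inward
    )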

Cold starts on serverless gateways. AWS API Gateway + Lambda integrations suffer 200-500ms cold starts on unpopular endpoints. Mitigations: provisioned concurrency, warmer services, or running on persistent containers (ECS/Fargate) instead of Lambda.

Auth key rotation breaking traffic. JWKS endpoint returns a new kid, gateway has cached the old one, every request with a token signed by the new key gets 401. Fix: gateway must refetch JWKS on unknown kid (with a rate limit to prevent DoS); overlap rotation windows so old keys remain valid for a grace period.
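
A sketch of the refetch-on-unknown-kid logic with a cooldown, assuming PyJWT and httpx; the 10-second cooldown is illustrative.

jwks_cache.py
import time

import httpx
import jwt  # PyJWT

class JWKSCache:
    def __init__(self, jwks_url: str, cooldown: float = 10.0):
        self.jwks_url = jwks_url
        self.cooldown = cooldown  # rate-limits refetches (DoS protection)
        self.keys: dict = {}
        self.last_fetch = 0.0

    def _refresh(self) -> None:
        jwks = httpx.get(self.jwks_url, timeout=2.0).json()
        self.keys = {k["kid"]: jwt.PyJWK(k).key for k in jwks.get("keys", [])}
        self.last_fetch = time.monotonic()

    def key_for(self, token: str):
        kid = jwt.get_unverified_header(token).get("kid")
        if kid not in self.keys:
            # Unknown kid: refetch before rejecting, at most once per cooldown.
            if time.monotonic() - self.last_fetch > self.cooldown:
                self._refresh()
        key = self.keys.get(kid)
        if key is None:
            raise jwt.InvalidKeyError(f"unknown kid {kid!r}")
        return key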

Rate limit counter drift. Redis becomes unavailable; gateway fails open (allowing all traffic) or fails closed (blocking all). Neither is always right — pick per endpoint. Auth endpoints fail closed (better to block than DDoS origin); public read endpoints fail open (do not take the site down because Redis hiccupped).
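
Concretely, the fail open/closed decision can be a per-endpoint-class policy wrapped around the limiter call; the class names and default below are illustrative, and RateLimiter is the earlier listing.

fail_policy.py
import redis

# Which way to fail when the rate limit store is unreachable is a
# product decision per endpoint class, not a global default.
FAIL_POLICY = {
    "auth": "closed",       # better to block than to let brute force through
    "public_read": "open",  # do not take the site down for a Redis blip
}

def check_with_policy(limiter, user_id: str, endpoint: str,
                      endpoint_class: str, limit: int,
                      window_seconds: int) -> bool:
    try:
        return limiter.check(user_id, endpoint, limit, window_seconds).allowed
    except redis.RedisError:
        return FAIL_POLICY.get(endpoint_class, "open") == "open"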

Logic sprawl in gateway config. Someone adds a 'small' request transformation. Three years later the gateway has 2000 lines of Lua/WASM, owned by nobody, tested by nothing, deployed with trepidation. This is the gateway-as-monolith pathology. Enforce: no business logic in gateway config; if it cannot be expressed as routing/auth/rate-limit/transform of headers, it belongs in a service.

Bypass for 'internal' traffic. Teams route around the gateway for service-to-service calls, 'because it is internal anyway.' Now auth enforcement is inconsistent, observability is split, and a compromised service has unfettered access. Fix: all traffic — including internal — goes through the appropriate layer (service mesh for east-west, gateway for north-south); no bypasses.

TIP

Handling a Slow Downstream Service

When a backend slows, the gateway must fail fast, not propagate the slowness. The toolkit: (1) timeouts at every hop, tighter inward than outward; (2) circuit breakers that open after N consecutive failures or when the error rate crosses a threshold, short-circuiting calls for a cooldown window (30 seconds to a few minutes); (3) bulkheads — separate connection pools per backend so a slow service cannot exhaust gateway capacity used by healthy services; (4) retries with exponential backoff and jitter, only for idempotent requests, capped at 2-3 attempts to avoid amplifying load; (5) hedged requests for the tail: if a read has not returned by the P95 latency, send a second request in parallel and return whichever wins (Google's 'tail at scale' technique, useful above certain latency budgets). Report degraded state to clients with 503 plus Retry-After rather than hanging.
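
A deliberately minimal count-based circuit breaker to make the state machine concrete (closed, open, half-open); production versions track error rates over sliding windows and run one breaker per upstream pool. Thresholds here are illustrative.

circuit_breaker.py
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None  # None means circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True   # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.cooldown:
            return True   # half-open: let a probe request through
        return False      # open: short-circuit without touching the backend

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # probe succeeded: close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip open, start cooldown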

IMPORTANT

What to Say in the Interview

Four anchor points that signal seniority. First, validate JWTs at the gateway, propagate verified claims as internal headers after stripping client-provided versions. Second, rate limiting at scale needs a shared store (Redis); local-only counters leak quota across instances. Third, keep business logic out of the gateway — use BFFs when you need client-specific aggregation. Fourth, layered gateway + service mesh is the scale answer; one mega-gateway is where architectures go to die. When asked about circuit breakers, name the specific failure (slow backend exhausting gateway connections) and the specific fix (bulkheads + circuit breaker + decreasing timeouts inward). The interviewer is looking for someone who has debugged one of these incidents, not someone who has only read the docs.
