Feature Flags: Safe Rollouts, Kill Switches, and the Dark Launch Pattern
A concrete framework for using feature flags as a production engineering tool — not just a product experimentation mechanism. Covers flag taxonomy (release flags vs. experiment flags vs. permission flags vs. ops flags), gradual rollout mechanics, kill switch design, the dark launch pattern for validating system behavior before user-facing traffic, flag cleanup debt, and the technical architecture of a flag evaluation system that adds under 1ms of latency.
Feature Flags Are a Production Engineering Tool, Not Just an A/B Test Mechanism
Most engineers encounter feature flags through A/B testing frameworks and think of them primarily as experimentation tools. This undersells what flags actually provide: the ability to decouple code deployment from feature release.
Before feature flags, deploying code meant releasing the feature. The deploy was the release. This created enormous pressure on every deploy: if the feature had a bug, the only mitigation was a rollback of the entire deploy, which might include other changes, bug fixes, or infrastructure updates you did not want to roll back.
Feature flags break this coupling. You deploy the code with the feature behind a flag in the off state. The code is running in production, but users see no behavior change. You then enable the flag for 1% of users, watch the metrics, enable it for 10%, watch again, and gradually ramp to 100%. If anything goes wrong at any stage, you flip the flag to off and user behavior reverts instantly — no rollback, no redeploy.
This is why high-deployment-frequency organizations (GitHub deploys ~80 times per day, Netflix hundreds) are not reckless — they are disciplined about progressive delivery. Feature flags make every deploy reversible in seconds.
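In code, the decoupling is just a guard around the new path. A minimal sketch — the flag client, flag name, and render functions here are illustrative, not a specific vendor API:
def render_checkout(user):
    # The deployed code contains both paths; the flag decides which one runs.
    # With the flag off, this deploy changes nothing user-visible.
    if flag_client.is_enabled("enable_new_checkout_flow", user.id):
        return render_new_checkout(user)
    return render_legacy_checkout(user)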
The second major use case that most resources skip: the kill switch. A kill switch is a flag specifically designed to be flipped off in production when a dependency or subsystem is degrading. "If the recommendation service latency exceeds 500ms, disable personalized recommendations and serve the popular items fallback." This is circuit breaker logic implemented as a feature flag, and it makes systems dramatically more resilient to dependency failures.
What Interviewers Are Testing
L4/Mid signal: Knows what feature flags are and why they are useful. Has used a flag system (LaunchDarkly, Unleash, homegrown). Understands gradual rollout as a risk mitigation strategy.
L5/Senior signal: Can describe the four flag types and when to use each. Knows how to implement a kill switch correctly. Understands the dark launch pattern. Has opinions about flag evaluation latency and why in-memory flag state matters. Knows that flag debt is a real problem and has a process for cleanup.
Staff signal: Designs the flag system architecture: evaluation service, caching strategy, flag state propagation latency, and the operational contract for kill switches. Can argue for or against building a homegrown flag system vs. using a vendor. Measures flag evaluation latency impact on P99. Has a systematic flag lifecycle process that prevents accumulation of dead flags.
The Four Flag Types — Each Has a Different Lifecycle
Release Flags (aka Feature Toggles) — Short-lived
Used to hide an incomplete or untested feature from users while it is being built or validated. Lifecycle: created when development starts, ramped to 100% when validation is complete, deleted within 2–4 weeks of full rollout. These flags should not exist for more than 1–2 months. The longer they live, the more they complicate the code path and the more expensive they are to delete. Naming convention: enable_new_checkout_flow — the verb 'enable' signals that this flag has a clear on/off semantic and a planned deletion. Every release flag should have a ticket for its own deletion created at the same time as the flag.
Experiment Flags (A/B Test Flags) — Medium-lived
Used to control which users see which variant of a feature for experimentation. Lifecycle: created for the experiment, deleted when the experiment concludes (typically 2–6 weeks). Differ from release flags in that they may have multiple values (control/variant A/variant B) rather than binary on/off. The experiment system typically manages these flags — they are created and deleted by the experimentation platform, not manually. The flag ownership belongs to the team running the experiment; they are responsible for cleanup.
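A sketch of multi-valued assignment, assuming deterministic hashing (the same stable-bucketing idea covered under gradual rollout below); the names are illustrative, since a real experimentation platform manages this for you:
import hashlib

def assign_variant(user_id: str, flag_id: str,
                   variants: tuple[str, ...] = ("control", "variant_a", "variant_b")) -> str:
    # Deterministic: the same user always lands in the same variant,
    # which keeps treatment/control populations stable for analysis.
    # This gives an equal split; weighted allocations would map
    # bucket ranges to variants instead.
    digest = hashlib.sha256(f"{user_id}:{flag_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]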
Permission Flags (Entitlement Flags) — Long-lived
Used to enable features for specific users or user segments based on their plan, role, or entitlement. Example: 'show advanced analytics to enterprise tier users only.' Lifecycle: permanent — these flags do not have planned deletion dates because the entitlement they enforce is permanent. They require the most care in flag evaluation because they are checked on every request for every user. Their evaluation must be extremely fast (in-memory lookup from a pre-fetched user context) and their rollback story differs from other flags — disabling a permission flag affects paying customers.
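A sketch of what "extremely fast" means here: the entitlement is read from a user context fetched once at the start of the request, so the check itself is a field comparison with no network call (the names are illustrative):
def show_enterprise_analytics_tab(user_context) -> bool:
    # Permission flags evaluate against properties already attached to
    # the request's user context; nothing on the hot path does I/O.
    return user_context.plan == "enterprise"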
Ops Flags (Kill Switches / Circuit Breakers) — Long-lived
Used to enable or disable system capabilities in response to operational conditions — degraded dependencies, load shedding, maintenance windows. Example: 'if recommendation service latency > 500ms, serve popular items fallback instead.' Lifecycle: permanent — kill switches should never be deleted because you never know when you will need them again. They are the safety valve for operational resilience. Every integration with an external dependency should have an ops flag that allows serving a degraded-but-functional experience when the dependency is unavailable. Naming: ops_disable_personalized_recs — the ops_ prefix makes these discoverable and distinguishable from product flags.
Feature Flag System Architecture
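The core of the architecture is the evaluation path. To hit the sub-1ms evaluation budget, flag state lives in process memory and is updated by a background listener on a streaming connection, so a request never waits on the network to evaluate a flag. A simplified, vendor-agnostic sketch, assuming some stream (e.g. an SSE subscription) calls apply_update on every change; per-user targeting rules would be evaluated locally against the same cached state:
import threading

class InMemoryFlagStore:
    """Evaluation is a dict lookup. A background stream listener calls
    apply_update() when a flag changes, so propagation latency is the
    latency of the push channel, not a polling interval."""

    def __init__(self):
        self._flags: dict[str, bool] = {}
        self._lock = threading.Lock()

    def apply_update(self, flag_key: str, value: bool) -> None:
        # Called by the streaming listener thread on every flag change.
        with self._lock:
            self._flags[flag_key] = value

    def is_enabled(self, flag_key: str, default: bool = False) -> bool:
        # Sub-millisecond: no I/O, just a locked dict read.
        with self._lock:
            return self._flags.get(flag_key, default)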
Gradual Rollout Mechanics — How to Ramp Safely
Gradual rollout reduces risk by limiting the blast radius of a bug to the fraction of users in the treatment group. But the ramp strategy matters — a naive percentage ramp can produce inconsistent user experiences and invalid metrics.
The consistent hashing requirement. The most important property of a rollout is that the same user sees the same experience on repeated visits. This requires consistent hashing: hash(user_id + flag_id) % 100 < rollout_percentage. This ensures:
- A user in the 10% rollout sees the new feature on every page load, not randomly
- When you ramp from 10% to 20%, the original 10% retain the new feature (they are not randomly re-assigned)
- The control group in metrics analysis is stable — the same users are always in control
Never use random assignment per-request (random.random() < 0.1). This makes the same user see the new feature sometimes and the old feature other times, which produces confusing UX and invalid A/B test metrics (the same user appears in both treatment and control).
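A sketch of stable bucket assignment, using a cryptographic hash so buckets are identical across processes and deploys (Python's built-in hash() is salted per process, which would silently break stable assignment):
import hashlib

def in_rollout(user_id: str, flag_id: str, rollout_percentage: int) -> bool:
    # Same user + same flag always maps to the same bucket (0-99),
    # so ramping from 10% to 20% only adds users, never reshuffles them.
    digest = hashlib.sha256(f"{user_id}:{flag_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percentage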
The recommended ramp schedule for a typical feature:
- 1% (QA validation, 24 hours): Verify the feature works correctly in production. Check error rates, latency, and business metrics for the treatment group vs. control. Fix bugs before ramping.
- 10% (initial validation, 48–72 hours): Check that metrics are consistent across the 1–10% range. Watch for P99 latency impact. Verify the feature behaves correctly for all user segments (mobile vs. web, logged-in vs. guest).
- 50% (broad validation, 48–72 hours): Watch for metrics that only appear at scale — connection pool exhaustion, cache invalidation patterns, rate limiting from third-party APIs.
- 100% (full launch): Remove the flag from the code, deprecate the flag in the system, create the cleanup ticket.
Ring deployments (used by Microsoft, AWS) are an alternative to percentage rollouts: instead of a random 10% of users, you define rings — internal employees, beta users, specific geographic regions, small customers, large customers. Each ring has well-understood risk tolerance. This is more complex to implement but gives more precise control over who sees the feature and when.
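One way to express rings is an ordered list of membership predicates; ramping means advancing the active ring, and every earlier ring keeps the feature. The ring definitions below are illustrative — real systems usually drive these from config:
RINGS = [
    ("ring0_internal_employees", lambda user: user.is_employee),
    ("ring1_beta_users", lambda user: user.in_beta_program),
    ("ring2_small_customers", lambda user: user.account_tier == "small"),
    ("ring3_everyone", lambda user: True),
]

def sees_feature(user, active_ring: int) -> bool:
    # A user sees the feature if they match any ring up to the active one.
    return any(predicate(user) for _, predicate in RINGS[: active_ring + 1])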
The Kill Switch — Your Most Important Flag
Every service that depends on an external or slow dependency should have a kill switch. This is the pattern that makes the difference between a graceful degradation and a cascading failure.
Kill switch design principles:
(1) Define the fallback behavior before the flag. A kill switch is useless if the code does not have a defined fallback for the disabled case. Before you can disable personalized recommendations, you need to know: what do users see instead? (Popular items? Cached last recommendations? Empty state?) The fallback must be implemented and tested before the kill switch is meaningful.
(2) Kill switches should fail safe, with the feature defaulting to on. Name the flag "disable_feature" with a default of false rather than "enable_feature" with a default of true. That way, when the flag evaluation system itself has a failure, the feature remains enabled — the safer default for most features. Exception: features that have known scaling problems under certain conditions may default to disabled.
(3) They must propagate quickly. A kill switch that takes 5 minutes to propagate to all serving instances is not useful during a cascading failure. Kill switch propagation should be via streaming push (SSE or WebSocket), not polling. Target: flag change visible in all instances within 30 seconds. LaunchDarkly's streaming system achieves this in under 500ms.
(4) Test them regularly. Kill switches that are never exercised tend to be broken when you need them. Run a quarterly kill switch exercise: flip the switch in a staging environment, verify the fallback behavior is correct, flip it back. This is the equivalent of fire drills for production resilience.
Kill switch anatomy for a recommendation service:
def get_recommendations(user_id: str) -> list[Product]:
    if flag_client.is_enabled("ops_disable_personalized_recs", user_id):
        # Kill switch is active — return the fallback immediately,
        # do not call the recommendation service
        return get_popular_items(limit=10)
    try:
        return recommendation_service.get_for_user(user_id, timeout=0.5)
    except (TimeoutError, ServiceUnavailableError):
        # Also fall back if the service is slow/down.
        # The kill switch lets us do this preemptively without a code change.
        return get_popular_items(limit=10)
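One subtlety from principle (2): the is_enabled call above must itself fail safe. If the flag system is unreachable, a disable_* flag should evaluate to false so the feature stays up. A minimal sketch of that wrapper, using the same illustrative flag_client as above:
def kill_switch_active(flag_key: str, user_id: str) -> bool:
    # A broken flag system should not take the feature down with it:
    # treat evaluation failure as "switch not thrown".
    try:
        return flag_client.is_enabled(flag_key, user_id)
    except Exception:
        return False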
Dark Launch — Validating System Behavior Before User-Facing Traffic
The dark launch pattern: send real production traffic to new code or a new service, but do not show the results to users. Compare the new behavior against the old behavior in the background.
This is the safest possible way to validate that new code is correct under production traffic patterns before any user sees a different experience.
Dark launch use cases:
- Validating a new search ranking model without changing search results users see
- Testing a new database query that should return the same results as the old one
- Load testing a new service version under real traffic before cutover
- Validating that a new payment gateway produces the same charge amounts as the old one
Implementation pattern (shadow traffic):
import asyncio

async def handle_search_request(query: str, user_id: str) -> SearchResults:
    # Primary path — always runs, results returned to user
    primary_results = current_search_engine.search(query, user_id)
    # Dark launch path — runs in background if flag is enabled.
    # (The handler is async so create_task has a running event loop.)
    if flag_client.is_enabled("dark_launch_new_search", user_id):
        asyncio.create_task(
            shadow_search_and_compare(query, user_id, primary_results)
        )
    return primary_results

async def shadow_search_and_compare(query, user_id, primary_results):
    try:
        shadow_results = new_search_engine.search(query, user_id)
        # Log divergence for analysis — do not affect the user's response
        if shadow_results != primary_results:
            metrics.increment("search.shadow.divergence",
                              tags={"query_type": classify_query(query)})
    except Exception as e:
        # Shadow failures are non-fatal — never let them affect the primary response
        logger.warning("Shadow search failed", error=str(e))
The key constraint: the shadow path must be fire-and-forget. It must not block the primary response, must not add to the primary response latency, and must not propagate exceptions. Shadow failures are expected and acceptable — you are validating new code.
Flag Type Summary — Lifecycle and Architecture
| Flag Type | Lifecycle | Default State | Targeting | Cleanup | Example |
|---|---|---|---|---|---|
| Release flag | 2–8 weeks | Off (dark), ramps to 100% | % rollout by user_id hash | Delete after 100% + 2 weeks stable | enable_new_checkout_flow |
| Experiment flag | 2–6 weeks | 50/50 split or defined allocation | User segment, % split | Delete when experiment concludes | checkout_cta_button_variant |
| Permission flag | Permanent | Varies by user tier/role | User property (plan, role, country) | Never — update targeting rules instead | show_enterprise_analytics_tab |
| Ops flag / Kill switch | Permanent | On (feature enabled); flip to disable | Global or service-level | Never — the safety valve you will need again | ops_disable_personalized_recs |
Level Differentiation: Feature Flag Knowledge by Engineering Level
| Level | Flag Usage | Kill Switch Knowledge | Dark Launch | Flag Debt | What They Miss |
|---|---|---|---|---|---|
| L3 / Junior | Uses flags to hide unfinished features; knows percentage rollout exists | Has heard of kill switches; has not implemented one | Does not know the pattern | Creates flags; never deletes them | Consistent hashing for stable assignment; kill switch as an architecture pattern; flag debt as a systems problem |
| L4 / Mid | Gradual rollout with monitoring; understands A/B flag vs release flag | Has implemented a kill switch; may not have a fallback behavior defined | May know the concept but not the implementation constraints (fire-and-forget) | Creates deletion tickets; doesn't always follow through | Flag evaluation latency impact; ring deployment pattern; dark launch for validating new services |
| L5 / Senior | Knows all four flag types and uses the right one for each use case; consistent hashing | Kill switch design: fallback defined first, default-on, fast propagation, tested quarterly | Implements dark launch with fire-and-forget shadow calls; logs divergence for analysis | Systematic cleanup process; flags with owners and deletion dates | Flag system architecture: evaluation latency, streaming updates, the build-vs-buy decision for the flag service |
| Staff | Designs the flag system: evaluation architecture, caching strategy, propagation SLO | Kill switch coverage as a reliability requirement: every external dependency must have one | Dark launch as a validation gate before any major system migration | Automated flag debt tracking; stale flag count as an engineering health metric | This is the target — flags as a production engineering system, not just a product feature |