
Feature Flags: Safe Rollouts, Kill Switches, and the Dark Launch Pattern

A concrete framework for using feature flags as a production engineering tool — not just a product experimentation mechanism. Covers flag taxonomy (release flags vs. experiment flags vs. permission flags vs. ops flags), gradual rollout mechanics, kill switch design, the dark launch pattern for validating system behavior before user-facing traffic, flag cleanup debt, and the technical architecture of a flag evaluation system that adds under 1ms of latency.

45 min read · 10 sections · 7 interview questions

Tags: Feature Flags · Feature Toggles · Kill Switch · Dark Launch · Gradual Rollout · Progressive Delivery · LaunchDarkly · Canary Release · Experimentation · Production Engineering · Release Engineering · Risk Management · Deployment · Flag Debt

Feature Flags Are a Production Engineering Tool, Not Just an A/B Test Mechanism

Most engineers encounter feature flags through A/B testing frameworks and think of them primarily as experimentation tools. This undersells what flags actually provide: the ability to decouple code deployment from feature release.

Before feature flags, deploying code meant releasing the feature. The deploy was the release. This created enormous pressure on every deploy: if the feature had a bug, the only mitigation was a rollback of the entire deploy, which might include other changes, bug fixes, or infrastructure updates you did not want to roll back.

Feature flags break this coupling. You deploy the code with the feature behind a flag in the off state. The code is running in production, but users see no behavior change. You then enable the flag for 1% of users, watch the metrics, enable it for 10%, watch again, and gradually ramp to 100%. If anything goes wrong at any stage, you flip the flag to off and user behavior reverts instantly — no rollback, no redeploy.

This is why high-deployment-frequency organizations (GitHub deploys ~80 times per day, Netflix hundreds) are not reckless — they are disciplined about progressive delivery. Feature flags make every deploy reversible in seconds.

The second major use case that most resources skip: the kill switch. A kill switch is a flag specifically designed to be flipped off in production when a dependency or subsystem is degrading. "If the recommendation service latency exceeds 500ms, disable personalized recommendations and serve the popular items fallback." This is circuit breaker logic implemented as a feature flag, and it makes systems dramatically more resilient to dependency failures.

TIP

What Interviewers Are Testing

L4/Mid signal: Knows what feature flags are and why they are useful. Has used a flag system (LaunchDarkly, Unleash, homegrown). Understands gradual rollout as a risk mitigation strategy.

L5/Senior signal: Can describe the four flag types and when to use each. Knows how to implement a kill switch correctly. Understands the dark launch pattern. Has opinions about flag evaluation latency and why in-memory flag state matters. Knows that flag debt is a real problem and has a process for cleanup.

Staff signal: Designs the flag system architecture: evaluation service, caching strategy, flag state propagation latency, and the operational contract for kill switches. Can argue for or against building a homegrown flag system vs. using a vendor. Measures flag evaluation latency impact on P99. Has a systematic flag lifecycle process that prevents accumulation of dead flags.

The Four Flag Types — Each Has a Different Lifecycle

01

Release Flags (aka Feature Toggles) — Short-lived

Used to hide an incomplete or untested feature from users while it is being built or validated. Lifecycle: created when development starts, ramped to 100% when validation is complete, deleted within 2–4 weeks of full rollout. These flags should not exist for more than 1–2 months. The longer they live, the more they complicate the code path and the more expensive they are to delete. Naming convention: enable_new_checkout_flow — the verb 'enable' signals that this flag has a clear on/off semantic and a planned deletion. Every release flag should have a ticket for its own deletion created at the same time as the flag.

02

Experiment Flags (A/B Test Flags) — Medium-lived

Used to control which users see which variant of a feature for experimentation. Lifecycle: created for the experiment, deleted when the experiment concludes (typically 2–6 weeks). Differ from release flags in that they may have multiple values (control/variant A/variant B) rather than binary on/off. The experiment system typically manages these flags — they are created and deleted by the experimentation platform, not manually. The flag ownership belongs to the team running the experiment; they are responsible for cleanup.
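Because experiment flags are multi-valued, their assignment logic returns a variant name rather than a boolean. A minimal sketch of deterministic variant assignment (the function name get_variant and the variant labels are illustrative, not part of any specific experimentation platform; in practice the platform owns this logic):

```python
import hashlib

def get_variant(user_id: str, flag_id: str, variants: list[str]) -> str:
    """Deterministically map a user to one of N variants.

    Uses a stable hash (not Python's built-in hash(), which is salted
    per process) so the same user gets the same variant across requests
    and restarts.
    """
    digest = hashlib.sha256(f"{user_id}:{flag_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]
```

The stability property is what makes metrics analysis valid: a user assigned to variant A stays in variant A for the life of the experiment.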

03

Permission Flags (Entitlement Flags) — Long-lived

Used to enable features for specific users or user segments based on their plan, role, or entitlement. Example: 'show advanced analytics to enterprise tier users only.' Lifecycle: permanent — these flags do not have planned deletion dates because the entitlement they enforce is permanent. They require the most care in flag evaluation because they are checked on every request for every user. Their evaluation must be extremely fast (in-memory lookup from a pre-fetched user context) and their rollback story differs from other flags — disabling a permission flag affects paying customers.
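One way to make permission flag evaluation a pure in-memory lookup is to load the targeting rules at startup and evaluate them against a pre-fetched user context, so no network call happens per request. A sketch, assuming a hypothetical UserContext and rule table (the names PERMISSION_RULES and is_entitled are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UserContext:
    user_id: str
    plan: str   # e.g. "free", "pro", "enterprise"
    role: str

# Hypothetical rule table, loaded into process memory at startup and
# refreshed in the background — evaluation itself touches no network.
PERMISSION_RULES = {
    "show_enterprise_analytics_tab": lambda ctx: ctx.plan == "enterprise",
}

def is_entitled(flag_id: str, ctx: UserContext) -> bool:
    rule = PERMISSION_RULES.get(flag_id)
    # Unknown flags default to "not entitled" — the safe default for
    # entitlements, the opposite of the kill-switch default discussed below.
    return rule(ctx) if rule else False
```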

04

Ops Flags (Kill Switches / Circuit Breakers) — Long-lived

Used to enable or disable system capabilities in response to operational conditions — degraded dependencies, load shedding, maintenance windows. Example: 'if AI recommendations service latency > 500ms, serve popular items fallback instead.' Lifecycle: permanent — kill switches should never be deleted because you never know when you will need them again. They are the safety valve for operational resilience. Every integration with an external dependency should have an ops flag that allows serving a degraded-but-functional experience when the dependency is unavailable. Naming: ops_disable_personalized_recommendations — the ops_ prefix makes these discoverable and distinguishable from product flags.

Feature Flag System Architecture

[Diagram: feature flag system architecture]

Gradual Rollout Mechanics — How to Ramp Safely

Gradual rollout reduces risk by limiting the blast radius of a bug to the fraction of users in the treatment group. But the ramp strategy matters — a naive percentage ramp can produce inconsistent user experiences and invalid metrics.

The consistent hashing requirement. The most important property of a rollout is that the same user sees the same experience on repeated visits. This requires consistent hashing: hash(user_id + flag_id) % 100 < rollout_percentage. This ensures:

  • A user in the 10% rollout sees the new feature on every page load, not randomly
  • When you ramp from 10% to 20%, the original 10% retain the new feature (they are not randomly re-assigned)
  • The control group in metrics analysis is stable — the same users are always in control

Never use random assignment per-request (Math.random() < 0.1). This makes the same user see the new feature sometimes and the old feature other times, which produces confusing UX and invalid A/B test metrics (the same user appears in both treatment and control).
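A minimal sketch of the consistent-hashing check (the function name in_rollout is illustrative). Note the use of a stable hash rather than Python's built-in hash(), which is salted per process and would re-assign users on every restart:

```python
import hashlib

def in_rollout(user_id: str, flag_id: str, rollout_percentage: int) -> bool:
    """Stable bucket assignment: same user + flag -> same bucket, always."""
    digest = hashlib.sha256(f"{user_id}:{flag_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percentage

# The ramp is monotonic: any user inside the 10% ring is, by
# construction, still inside the 20% ring (bucket < 10 implies bucket < 20).
for uid in ("alice", "bob", "carol"):
    if in_rollout(uid, "enable_new_checkout_flow", 10):
        assert in_rollout(uid, "enable_new_checkout_flow", 20)
```

Hashing on user_id + flag_id (rather than user_id alone) also decorrelates flags: a user in the treatment group for one flag is not automatically in the treatment group for every flag.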

The recommended ramp schedule for a typical feature:

  • 1% (QA validation, 24 hours): Verify the feature works correctly in production. Check error rates, latency, and business metrics for the treatment group vs. control. Fix bugs before ramping.
  • 10% (initial validation, 48–72 hours): Check that metrics are consistent across the 1–10% range. Watch for P99 latency impact. Verify the feature behaves correctly for all user segments (mobile vs. web, logged-in vs. guest).
  • 50% (broad validation, 48–72 hours): Watch for metrics that only appear at scale — connection pool exhaustion, cache invalidation patterns, rate limiting from third-party APIs.
  • 100% (full launch): Remove the flag from the code, deprecate the flag in the system, create the cleanup ticket.

Ring deployments (used by Microsoft, AWS) are an alternative to percentage rollouts: instead of a random 10% of users, you define rings — internal employees, beta users, specific geographic regions, small customers, large customers. Each ring has well-understood risk tolerance. This is more complex to implement but gives more precise control over who sees the feature and when.
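The core of a ring model can be sketched as an ordered enum plus a single comparison: a flag opened up to ring N is visible to every earlier ring. The ring names below are illustrative, not a standard:

```python
from enum import IntEnum

class Ring(IntEnum):
    INTERNAL = 0         # employees: highest risk tolerance, first to see changes
    BETA = 1             # opted-in beta users
    SMALL_CUSTOMERS = 2
    LARGE_CUSTOMERS = 3  # lowest risk tolerance, last to see changes

def ring_enabled(user_ring: Ring, flag_max_ring: Ring) -> bool:
    """A flag opened up to flag_max_ring is visible to every ring <= it."""
    return user_ring <= flag_max_ring
```

Ramping a flag then means advancing flag_max_ring one ring at a time, with a validation soak at each step, rather than increasing a percentage.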

The Kill Switch — Your Most Important Flag

Every service that depends on an external or slow dependency should have a kill switch. This is the pattern that makes the difference between a graceful degradation and a cascading failure.

Kill switch design principles:

(1) Define the fallback behavior before the flag. A kill switch is useless if the code does not have a defined fallback for the disabled case. Before you can disable personalized recommendations, you need to know: what do users see instead? (Popular items? Cached last recommendations? Empty state?) The fallback must be implemented and tested before the kill switch is meaningful.

(2) Kill switches should default to on. The flag should be "disable_feature: false" rather than "enable_feature: true." This means when the flag evaluation system itself has a failure, the feature remains enabled — which is the safer default for most features. Exception: features that have known scaling problems under certain conditions may default to disabled.
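The default-on principle has a concrete consequence in the evaluation code: a failure to read flag state must be treated as "not disabled." A sketch, assuming a hypothetical KillSwitchClient wrapping whatever fetch mechanism the flag system provides:

```python
class KillSwitchClient:
    """Sketch of fail-safe kill switch evaluation.

    `fetch` is a hypothetical callable that reads the flag's state from
    the flag service and may raise on network failure or timeout.
    """

    def __init__(self, fetch):
        self._fetch = fetch

    def is_killed(self, flag_id: str) -> bool:
        try:
            return bool(self._fetch(flag_id))
        except Exception:
            # The flag system itself is unreachable: fail safe. The
            # feature stays enabled rather than being silently disabled
            # by an outage in the flag infrastructure.
            return False
```

This is the "disable_feature: false" semantic in code: the only way the feature turns off is an explicit, successful read of a true kill value.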

(3) They must propagate quickly. A kill switch that takes 5 minutes to propagate to all serving instances is not useful during a cascading failure. Kill switch propagation should be via streaming push (SSE or WebSocket), not polling. Target: flag change visible in all instances within 30 seconds. LaunchDarkly's streaming system achieves this in under 500ms.
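Fast propagation and fast evaluation fit together in one structure: an in-memory flag store that a streaming listener updates on push, so the hot path never leaves process memory. A sketch (the transport — an SSE or WebSocket listener thread calling apply_update per server-pushed change — is assumed, not shown):

```python
import threading

class FlagStore:
    """In-memory flag state updated by a push channel.

    A hypothetical streaming listener calls apply_update() for each
    change pushed by the flag service; request handlers only ever call
    is_enabled(), which is a lock-guarded dict lookup.
    """

    def __init__(self):
        self._flags: dict[str, bool] = {}
        self._lock = threading.Lock()

    def apply_update(self, flag_id: str, value: bool) -> None:
        with self._lock:
            self._flags[flag_id] = value

    def is_enabled(self, flag_id: str, default: bool = False) -> bool:
        with self._lock:
            return self._flags.get(flag_id, default)
```

With this shape, flag-change latency is the push-channel latency, and per-request evaluation cost is sub-microsecond, which is how a flag system stays well under the 1ms budget mentioned in the overview.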

(4) Test them regularly. Kill switches that are never exercised tend to be broken when you need them. Run a quarterly kill switch exercise: flip the switch in a staging environment, verify the fallback behavior is correct, flip it back. This is the equivalent of fire drills for production resilience.

Kill switch anatomy for a recommendation service:

```python
def get_recommendations(user_id: str) -> list[Product]:
    if flag_client.is_enabled("ops_disable_personalized_recs", user_id):
        # Kill switch is active — return the fallback immediately,
        # do not call the recommendation service
        return get_popular_items(limit=10)

    try:
        return recommendation_service.get_for_user(user_id, timeout=0.5)
    except (TimeoutError, ServiceUnavailableError):
        # Also fall back if the service is slow/down
        # Kill switch lets us do this preemptively without a code change
        return get_popular_items(limit=10)
```

Dark Launch — Validating System Behavior Before User-Facing Traffic

The dark launch pattern: send real production traffic to new code or a new service, but do not show the results to users. Compare the new behavior against the old behavior in the background.

This is the safest possible way to validate that new code is correct under production traffic patterns before any user sees a different experience.

Dark launch use cases:

  • Validating a new search ranking model without changing search results users see
  • Testing a new database query that should return the same results as the old one
  • Load testing a new service version under real traffic before cutover
  • Validating that a new payment gateway produces the same charge amounts as the old one

Implementation pattern (shadow traffic):

```python
import asyncio

async def handle_search_request(query: str, user_id: str) -> SearchResults:
    # Primary path — always runs, results returned to user
    primary_results = current_search_engine.search(query, user_id)

    # Dark launch path — runs in background if flag is enabled.
    # The handler must be async: asyncio.create_task requires a
    # running event loop and schedules the shadow call without
    # awaiting it, so the primary response is never delayed.
    if flag_client.is_enabled("dark_launch_new_search", user_id):
        asyncio.create_task(
            shadow_search_and_compare(query, user_id, primary_results)
        )

    return primary_results

async def shadow_search_and_compare(query, user_id, primary_results):
    try:
        shadow_results = new_search_engine.search(query, user_id)
        # Log divergence for analysis — do not affect the user's response
        if shadow_results != primary_results:
            metrics.increment("search.shadow.divergence",
                              tags={"query_type": classify_query(query)})
    except Exception as e:
        # Shadow failures are non-fatal — never let them affect the primary response
        logger.warning("Shadow search failed", error=str(e))
```

The key constraint: the shadow path must be fire-and-forget. It must not block the primary response, must not add to the primary response latency, and must not propagate exceptions. Shadow failures are expected and acceptable — you are validating new code.

⚠ WARNING

Flag Debt — The Hidden Cost of Feature Flags

Feature flags are technical debt with a lifecycle. A flag that is never cleaned up creates three compounding costs:

Code complexity. Every flag adds an if-branch that the code must maintain. A codebase with 50 active flags has 50 code paths that must be understood when debugging, 50 possible configurations in testing, and 50 potential sources of bugs when flags interact.

Dead code paths. A release flag that has been at 100% for 6 months still has the old code path — the else branch of the flag condition. That code is no longer tested, no longer updated when the surrounding code changes, and may be the source of confusion when a new engineer reads it and wonders "when does this execute?"

Flag evaluation overhead. Every flag in the system is evaluated on every request that checks it. A system with 500 flags where 400 are dead still evaluates 500 flags per request.

The flag hygiene system that works:

  1. Every flag is created with a planned deletion date (for release flags: within 4 weeks of 100% rollout)
  2. Every flag has an owner (not a team — a named engineer)
  3. A monthly audit identifies flags that have been at 0% or 100% for more than 30 days and creates cleanup tickets
  4. Flags older than 90 days that are not explicitly marked as permanent (kill switches, permission flags) are automatically disabled in staging and a Slack notification is sent to the owner
  5. The count of "stale flags" is a metric tracked by the engineering platform team

The principle: the system for managing flag debt must be automated. Engineers do not manually clean up flags consistently; automated reminders and tooling do.
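The monthly audit in step 3 is straightforward to automate. A sketch, assuming a hypothetical export from the flag system where each flag record carries its rollout percentage, last-changed timestamp, and a permanent marker (the function name find_stale_flags and the record shape are illustrative):

```python
from datetime import datetime, timedelta

def find_stale_flags(flags: list[dict], now: datetime) -> list[str]:
    """Return IDs of flags that have sat at 0% or 100% for over 30 days.

    Each record is assumed to look like:
    {"id": str, "rollout_pct": int, "last_changed": datetime, "permanent": bool}
    Permanent flags (kill switches, permission flags) are exempt.
    """
    cutoff = now - timedelta(days=30)
    return [
        f["id"]
        for f in flags
        if not f["permanent"]
        and f["rollout_pct"] in (0, 100)
        and f["last_changed"] < cutoff
    ]
```

Run on a schedule, the output feeds the cleanup-ticket automation and the stale-flag count metric from step 5.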

Flag Type Summary — Lifecycle and Architecture

| Flag Type | Lifecycle | Default State | Targeting | Cleanup | Example |
|---|---|---|---|---|---|
| Release flag | 2–8 weeks | Off (dark), ramps to 100% | % rollout by user_id hash | Delete after 100% + 2 weeks stable | enable_new_checkout_flow |
| Experiment flag | 2–6 weeks | 50/50 split or defined allocation | User segment, % split | Delete when experiment concludes | checkout_cta_button_variant |
| Permission flag | Permanent | Varies by user tier/role | User property (plan, role, country) | Never — update targeting rules instead | show_enterprise_analytics_tab |
| Ops flag / kill switch | Permanent | On (feature enabled); flip to disable | Global or service-level | Never — the safety valve you will need again | ops_disable_personalized_recs |

Level Differentiation: Feature Flag Knowledge by Engineering Level

| Level | Flag Usage | Kill Switch Knowledge | Dark Launch | Flag Debt | What They Miss |
|---|---|---|---|---|---|
| L3 / Junior | Uses flags to hide unfinished features; knows percentage rollout exists | Has heard of kill switches; has not implemented one | Does not know the pattern | Creates flags; never deletes them | Consistent hashing for stable assignment; kill switch as an architecture pattern; flag debt as a systems problem |
| L4 / Mid | Gradual rollout with monitoring; understands A/B flag vs release flag | Has implemented a kill switch; may not have a fallback behavior defined | May know the concept but not the implementation constraints (fire-and-forget) | Creates deletion tickets; doesn't always follow through | Flag evaluation latency impact; ring deployment pattern; dark launch for validating new services |
| L5 / Senior | Knows all four flag types and uses the right one for each use case; consistent hashing | Kill switch design: fallback defined first, default-on, fast propagation, tested quarterly | Implements dark launch with fire-and-forget shadow calls; logs divergence for analysis | Systematic cleanup process; flags with owners and deletion dates | Flag system architecture: evaluation latency, streaming updates, the build-vs-buy decision for the flag service |
| Staff | Designs the flag system: evaluation architecture, caching strategy, propagation SLO | Kill switch coverage as a reliability requirement: every external dependency must have one | Dark launch as a validation gate before any major system migration | Automated flag debt tracking; stale flag count as an engineering health metric | This is the target — flags as a production engineering system, not just a product feature |
