Cloud Cost Optimization: From Runaway Bills to Unit Economics
Senior engineer playbook for cloud cost optimization: EC2 spot vs. reserved instances, S3 lifecycle tiers, inter-AZ egress, Karpenter bin packing, and FinOps chargeback. Real numbers from Netflix and Lyft. Essential for FAANG infrastructure interviews.
Why Engineers Ignore Cost Until It's a Crisis
Cloud cost is the engineering problem that compounds invisibly. When you're building, the AWS bill is someone else's problem — finance deals with it. When you're scaling, costs grow linearly with usage and nobody flags the slope. The bill triples when a batch job that ran once a week starts running hourly, when a microservices refactor doubles inter-AZ traffic, or when a new ML training pipeline launches 200 GPUs and never terminates them because the job succeeded but the cleanup script had a bug.
The typical trajectory: a startup reaches $50K/month and the CTO sees it for the first time. By $200K/month, they hire a platform engineer to "optimize infrastructure." By $500K/month, the board is asking about unit economics and cost-per-user. Netflix famously moved aggressively to spot instances and saved over $60M/year (Netflix Tech Blog, 2015: "Netflix's Approach to the Cloud") — but they started that work only after cost became a board-level concern.
Cloud cost optimization is not a one-time project. It is a continuous engineering discipline with three distinct phases:
- Discovery: understand what you are spending and why (cost allocation, tagging)
- Rightsizing: eliminate wasted capacity before buying commitments
- Commitment purchasing: lock in discounts only after usage patterns stabilize
The biggest mistake engineers make is purchasing Reserved Instances before rightsizing. You lock in a discount on a waste pattern and then can't change it for a year.
What Interviewers Are Testing
Mid-level signal: Knows that reserved instances are cheaper than on-demand. Mentions autoscaling as a cost tool. May suggest spot instances for batch workloads.
Senior signal: Can name the specific discount tiers (40% for 1-yr RI, 60-70% for 3-yr, 70-90% for spot). Distinguishes between savings plans ($/hour commitment, flexible) and RIs (specific instance type). Knows that rightsizing must precede commitment purchasing. Understands that egress costs are often the biggest surprise at scale.
Staff signal: Frames cost as unit economics (cost per request, cost per user, cost per GB processed) not absolute spend. Designs the cost architecture for the whole system — which components run on spot, which on RIs, which on savings plans, and why. Knows that Karpenter's topology-aware scheduling can eliminate entire classes of inter-AZ egress cost. Can articulate the FinOps chargeback model and why it changes engineering behavior.
The Cloud Cost Optimization Sequence
Step 1 — Tag Everything (Cost Allocation Foundation)
Before any optimization, you need visibility. Enforce mandatory tags on every resource: team, service, environment (prod/staging/dev), cost-center. Without tagging, AWS Cost Explorer shows you a total bill with no attribution. With tagging, you can answer "which team owns the $80K Redshift cluster that nobody uses?" Use AWS Organizations SCPs or Terraform enforcement to block untagged resource creation.
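A minimal audit sketch, assuming boto3 with standard AWS credentials; the REQUIRED_TAGS set is an assumption to match to your own tag policy:

```python
import boto3

# Tags this sketch treats as mandatory; adapt to your own policy.
REQUIRED_TAGS = {"team", "service", "environment", "cost-center"}

ec2 = boto3.client("ec2")

# Page through every instance and report the ones missing mandatory tags.
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"] for t in instance.get("Tags", [])}
            missing = REQUIRED_TAGS - tags
            if missing:
                print(f"{instance['InstanceId']}: missing {sorted(missing)}")
```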
Step 2 — Rightsize Before Committing (Eliminate Waste)
Run AWS Compute Optimizer or Datadog's rightsizing recommendations across your fleet. Most EC2 instances run at 10-20% CPU utilization — meaning 80-90% of the compute you are paying for is idle. The key insight: size based on p50 utilization for batch workloads, not p99. For online services you need p99 headroom, but for batch jobs you control the schedule and can tolerate slower throughput. Lyft cut costs ~20% purely by rightsizing EC2 instances using 2 weeks of CloudWatch metrics.
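A sketch of pulling those flags from the Compute Optimizer API, assuming the account is opted in; the exact response field values may differ slightly from what is shown, and pagination via nextToken is omitted:

```python
import boto3

# Assumes Compute Optimizer is already opted in for the account/region.
co = boto3.client("compute-optimizer")

resp = co.get_ec2_instance_recommendations()
for rec in resp["instanceRecommendations"]:
    current = rec["currentInstanceType"]
    best = rec["recommendationOptions"][0]["instanceType"]  # options are ranked
    print(f"{rec['instanceArn']}: {rec['finding']} {current} -> {best}")
```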
Step 3 — Classify Workloads (Spot vs. Reserved vs. On-Demand)
Categorize every workload: (a) baseline load that runs 24/7 with no interruption tolerance → reserved instances or savings plans; (b) batch/ML training that can tolerate interruption and retry → spot instances; (c) spiky, unpredictable, short-lived workloads → on-demand with autoscaling. Never use on-demand for stable baseline load — it's paying retail when wholesale is available.
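As a sketch, the classification rule reduces to a small decision function (hypothetical names, deliberately simplified traits):

```python
def pricing_model(interruptible: bool, stable_24x7: bool) -> str:
    """Map workload traits to a pricing tier; a simplification of the rule above."""
    if interruptible:
        return "spot"          # batch, ML training, CI: retryable work
    if stable_24x7:
        return "savings-plan"  # steady baseline: never pay on-demand
    return "on-demand"         # spiky, short-lived, behind autoscaling
```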
Step 4 — Purchase Commitments (Lock In Discounts)
After rightsizing, buy 1-year Compute Savings Plans to cover your steady-state baseline. Savings Plans are more flexible than RIs — you commit to a $/hour spend rate, not a specific instance type, so you can change instance families as you rightsize further. Use convertible RIs for services where the instance type might change. Only buy 3-year commitments for infrastructure you are certain about (databases, core API servers, Kubernetes control plane nodes).
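One hedged way to size the $/hour commitment: commit near the trough of historical hourly on-demand spend so the plan stays fully utilized. A minimal sketch, where the 10th percentile is an assumption to tune against your own utilization target:

```python
def baseline_commitment(hourly_spend: list[float], pct: float = 0.10) -> float:
    """Pick a $/hour Savings Plan commitment from historical hourly spend.

    Committing near the trough (here the 10th percentile) keeps the plan
    ~100% utilized; spend above the line stays on on-demand or spot.
    """
    ranked = sorted(hourly_spend)
    return ranked[int(len(ranked) * pct)]

# e.g. 30 days x 24 hourly cost samples exported from Cost Explorer:
# commit_rate = baseline_commitment(hourly_costs)
```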
Step 5 — Implement Storage Lifecycle Policies (Automate Decay)
Set S3 lifecycle rules on every bucket. Log buckets: move to S3-IA at 30 days ($0.0125/GB vs. $0.023/GB), move to Glacier at 90 days ($0.004/GB), delete at 1 year. ML artifact buckets: move non-recent model checkpoints to S3-IA after 14 days. Enable S3 Intelligent-Tiering on buckets with unknown access patterns — it adds $0.0025/1000 objects monitoring cost but automatically moves objects between tiers based on access frequency.
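A sketch of the log-bucket rule above via boto3; the bucket name is a placeholder, and the same policy is a single lifecycle-configuration block in Terraform:

```python
import boto3

s3 = boto3.client("s3")

# Lifecycle policy matching the log-bucket pattern described above.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "log-decay",
                "Status": "Enabled",
                "Filter": {},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```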
Step 6 — Audit Egress and Restructure Traffic Patterns
Run a Cost Explorer egress breakdown. Egress costs are often 20-30% of total cloud spend for data-heavy workloads and are invisible until you look. Key interventions: co-locate producer and consumer microservices in the same AZ; use VPC endpoints for S3 and DynamoDB to eliminate NAT gateway charges; use CloudFront as a CDN to shift internet egress from EC2/S3 rates ($0.09/GB) to CloudFront rates ($0.085/GB at the first tier, falling to roughly $0.02/GB at the highest published volume tiers).
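Back-of-envelope inter-AZ math, assuming the commonly cited $0.01/GB charged in each direction (rates vary by region) and a hypothetical 5 TB/day of cross-AZ service-to-service chatter:

```python
# Inter-AZ traffic is billed in both directions: $0.01/GB out + $0.01/GB in.
RATE_PER_GB = 0.02   # round trip; confirm against your region's pricing
GB_PER_DAY = 5_000   # hypothetical cross-AZ volume for one service pair

print(f"${GB_PER_DAY * 30 * RATE_PER_GB:,.0f}/month")  # $3,000/month
```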
Step 7 — Implement Kubernetes Efficiency (Bin Packing)
Replace Cluster Autoscaler with Karpenter. Run VPA in recommendation mode to surface accurate resource requests based on observed usage (its restart caveat is covered in the Kubernetes section below). Review pod resource requests — over-provisioned requests (requesting 4 CPU when the pod uses 0.5 CPU) waste reserved node capacity. Set LimitRange defaults in Kubernetes namespaces to prevent unbounded resource requests.
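A sketch of the request-vs-usage audit; the pod numbers are illustrative, and in practice the requests come from the Kubernetes API and the usage from metrics-server or Prometheus:

```python
# Requested-vs-used slack across a fleet of pods (values in CPU cores).
pods = [
    {"name": "api",    "requested": 4.0, "used_p95": 0.6},
    {"name": "worker", "requested": 2.0, "used_p95": 1.4},
    {"name": "etl",    "requested": 8.0, "used_p95": 1.1},
]

for pod in pods:
    slack = pod["requested"] - pod["used_p95"]
    print(f"{pod['name']}: {slack:.1f} cores reserved but idle "
          f"({slack / pod['requested']:.0%} of the request)")
```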
Step 8 — Build Unit Economics Dashboards (FinOps Discipline)
Track cost per request, cost per active user, and cost per GB processed — not just absolute spend. Unit economics lets you know if your cost efficiency is improving even as absolute spend grows with scale. Set budget alerts at 20% over baseline; implement chargeback so teams see their own spend in sprint reviews. Chargeback changes engineering behavior more than any other intervention.
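A minimal sketch of the two core ratios plus the alert threshold; all inputs are illustrative, not pulled from any AWS API:

```python
def unit_economics(monthly_cost: float, requests: int, mau: int) -> dict:
    """The two core ratios to trend release over release."""
    return {
        "cost_per_1k_requests": 1000 * monthly_cost / requests,
        "cost_per_mau": monthly_cost / mau,
    }

print(unit_economics(500_000, 2_000_000_000, 10_000_000))

baseline, current = 480_000.0, 590_000.0
if current > baseline * 1.20:  # the 20%-over-baseline budget alert
    print(f"alert: spend {current / baseline - 1:.0%} over baseline")
```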
Compute Optimization: The Discount Ladder
AWS EC2 pricing is structured as a discount ladder from most expensive to cheapest. Understanding the mechanics — not just the names — is what lets you architect cost-efficiently.
On-demand: Full list price. No commitment. Appropriate only for truly unpredictable, short-lived workloads where you cannot forecast demand. For any stable workload, on-demand means paying roughly 1.7-3.5x what the same compute costs under a commitment.
Savings Plans (preferred over RIs for most workloads): You commit to a minimum $/hour spend across any EC2 instance in any region. Compute Savings Plans give up to 66% savings vs. on-demand; EC2 Instance Savings Plans give up to 72% savings but lock you to a specific instance family in a specific region. The key advantage over RIs: if you rightsize from m5.2xlarge to m5.xlarge mid-year, your savings plan still applies. With an RI, you are committed to the m5.2xlarge until expiry.
Reserved Instances (RIs): Commit to a specific instance type, AZ, and region. Standard 1-year RI: ~40% savings. Standard 3-year RI: ~60-70% savings. Convertible RI: ~45% savings but allows changing instance type mid-term. Use RIs only for infrastructure that will not change — RDS databases, Elasticsearch/OpenSearch clusters, core Kubernetes nodes.
Spot/preemptible instances: AWS Spot gives 70-90% savings but can reclaim with 2-minute notice. GCP preemptible instances give 60-80% savings with 30-second notice. Spot instances are not risky for the right workloads — they are the standard choice for ML training, batch ETL, CI/CD runners, and stateless web tier scaling. The non-obvious insight: spot interruption rates are usually 1-5% for most instance types in most AZs. For a job that runs 4 hours, the expected interruption probability is low — especially if you use Spot Fleet with diversification across multiple instance types and AZs.
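A quick worked version of that probability, assuming a constant per-hour interruption hazard (a simplification; published spot rates are frequency bands, not hourly hazards):

```python
hourly_interruption = 0.05  # pessimistic; many pools sit well below this
hours = 4

p_survive = (1 - hourly_interruption) ** hours
print(f"P(no interruption over {hours}h) = {p_survive:.1%}")  # 81.5%
```

Diversifying a Spot Fleet across instance types and AZs pushes the effective hazard down further, since interruptions in one pool rarely coincide with interruptions in another.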
Savings Plans + Spot is the production pattern: Use Savings Plans to cover your baseline 24/7 load at ~66% discount. Run all batch and training workloads on spot. Run scale-out traffic spikes on spot or on-demand (autoscaling handles the delta). Netflix runs ~70% of its compute on spot (Netflix Tech Blog, 2015); typical ML training clusters should target 70% spot + 30% on-demand for fault tolerance (AWS re:Invent 2022, "Cost Optimization at Scale").
EC2 Pricing Tier Comparison — When to Use Each
| Pricing Model | Discount vs. On-Demand | Commitment | Interruption Risk | Best For |
|---|---|---|---|---|
| On-Demand | 0% | None | None | Short-lived, unpredictable, compliance-sensitive workloads only |
| Savings Plan (Compute) | ~66% | 1 or 3 yr $/hr | None | Baseline 24/7 load across any instance family — most flexible RI alternative |
| Savings Plan (EC2 Instance) | ~72% | 1 or 3 yr $/hr | None | Stable load in a specific instance family you're committed to |
| Reserved Instance (Standard 1-yr) | ~40% | 1 yr, specific type | None | Stable DB or service on a fixed instance type for at least 12 months |
| Reserved Instance (Standard 3-yr) | ~60-70% | 3 yr, specific type | None | Core infrastructure (control plane, primary DB) with 3-yr horizon certainty |
| Spot Instance | 70-90% | None | 2-min notice (AWS) | ML training, batch ETL, CI/CD, stateless web tier scale-out |
Rightsizing: The 80% Idle Problem
AWS internal data and third-party studies consistently find that the average EC2 instance runs at 10-20% CPU utilization (Gartner, 2021: "Optimizing Cloud Infrastructure Costs"; AWS Compute Optimizer documentation). This means 80-90% of the compute you are paying for produces no value. Rightsizing is the single highest-ROI cost action — and it must happen before you purchase any commitments.
How to rightsize correctly:
Step 1: Gather real utilization data. Use AWS Compute Optimizer (free, analyzes 14 days of CloudWatch metrics) or your APM tool (Datadog, New Relic, Prometheus). Look at CPU, memory, network I/O, and disk I/O. The recommendation engine will flag instances where observed maximum usage is well below instance capacity.
Step 2: Use p50 for batch, p99 for online. This is the insight most engineers miss. For batch jobs — ML training, data pipelines, ETL — the job controls its own resource consumption and you can tune parallelism. Size the instance so that p50 job utilization fits comfortably. For online services, you need headroom for traffic spikes, so size for p99 with 20-30% buffer.
Step 3: Test before committing. Change one service at a time. Run the smaller instance for 2 weeks and monitor latency, error rates, and queue depth. A batch job that previously ran in 2 hours may run in 2.5 hours on a smaller instance — if that's acceptable, the savings outweigh the tradeoff.
Step 4: Memory is often the binding constraint, not CPU. Java services, ML inference servers, and in-memory databases frequently pin on memory while CPU is idle. A c5 instance (compute-optimized) costs 20% less than m5 but has no memory advantage. If your service is memory-bound, moving to r5 (memory-optimized) can let you run half as many instances.
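A sketch of Step 2's sizing rule, using synthetic utilization samples in place of real CloudWatch or Prometheus data; the 25% headroom figure is an assumption to tune:

```python
import numpy as np

# Synthetic CPU% samples standing in for 2 weeks of CloudWatch/Prometheus data.
cpu = np.random.default_rng(0).gamma(2.0, 6.0, size=20_000)

p50, p99 = np.percentile(cpu, [50, 99])
batch_target = p50           # batch: size to the median, tolerate queueing
online_target = p99 * 1.25   # online: p99 plus 25% headroom

print(f"batch target {batch_target:.0f}% CPU, "
      f"online target {online_target:.0f}% CPU")
```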
Lyft's ~20% cost reduction through rightsizing took 6 weeks of analysis and 3 months of careful instance size changes across ~50 services (Lyft Engineering Blog, 2019: "Optimizing Cloud Costs at Lyft"). The key: they used actual observed utilization from Prometheus, not instance-launch-time estimates.
Storage Optimization: The S3 Tier Hierarchy
S3 storage costs seem small in isolation — $0.023/GB/month is less than 3 cents. But production systems accumulate data at scale: 100TB of logs costs $2,300/month. With proper lifecycle policies, the same 100TB costs ~$400/month. The engineering effort is a one-time Terraform config.
The S3 storage class hierarchy (AWS pricing, us-east-1 — aws.amazon.com/s3/pricing):
- S3 Standard: $0.023/GB/month. No retrieval fee. Use for frequently accessed data (last 30 days of logs, model serving artifacts, active feature stores).
- S3 Intelligent-Tiering: $0.023/GB/month (frequent tier) + $0.0025/1,000 objects/month monitoring fee. Automatically moves objects between Standard and Infrequent Access based on access patterns. Best for buckets where you cannot predict access patterns — e.g., ML experiment artifacts where some experiments get revisited and others never do.
- S3 Infrequent Access (S3-IA): $0.0125/GB/month storage + $0.01/GB retrieval. 46% cheaper than Standard for storage, but adds retrieval cost. Break-even: if you retrieve the data less than once per month, S3-IA is cheaper. Use for data accessed monthly or less: compliance archives, old feature engineering outputs, historical datasets.
- S3 Glacier Instant Retrieval: $0.004/GB/month + $0.03/GB retrieval. 83% cheaper than Standard. Retrieval in milliseconds. Use for data accessed a few times per year: model training checkpoints from completed projects, old experiment logs, regulatory archives.
- S3 Glacier Deep Archive: $0.00099/GB/month. 95% cheaper than Standard. 12-hour retrieval time. Use for cold data that is kept for compliance only and may never need to be accessed: audit logs beyond 1 year, raw data for lineage purposes.
The production lifecycle policy pattern:
- 0-30 days: S3 Standard ($0.023/GB)
- 30-90 days: S3-IA ($0.0125/GB, save 46%)
- 90-365 days: Glacier Instant ($0.004/GB, save 83%)
- 365+ days: Glacier Deep Archive ($0.00099/GB, save 96%) or delete
For most log pipelines, deleting after 1 year is the right choice — storage of year-old logs rarely justifies the cost vs. the probability of access.
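The steady-state blended rate implied by the policy above, assuming constant ingest, deletion at one year (so data age is uniform over 0-365 days), and ignoring retrieval and transition-request fees:

```python
# Days spent in each tier under the policy above, with $/GB/month rates.
tiers = [(30, 0.023), (60, 0.0125), (275, 0.004)]

blended = sum(days / 365 * rate for days, rate in tiers)
print(f"${blended:.4f}/GB/mo vs $0.0230 flat "
      f"({1 - blended / 0.023:.0%} cheaper)")
# -> $0.0070/GB/mo, roughly 70% cheaper before retrieval fees
```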
S3 Storage Class Cost Comparison — Production Decision Matrix
| Storage Class | Storage Cost/GB/mo | Retrieval Cost/GB | Retrieval Time | Best Use Case |
|---|---|---|---|---|
| S3 Standard | $0.023 | Free | Milliseconds | Active data: last 30 days of logs, model serving artifacts |
| S3 Intelligent-Tiering | $0.023 + $0.0025/1K objects | Free (auto-managed) | Milliseconds | Unknown access patterns, ML experiment artifacts |
| S3 Infrequent Access | $0.0125 | $0.01 | Milliseconds | Monthly-or-less access: old feature stores, historical datasets |
| Glacier Instant Retrieval | $0.004 | $0.03 | Milliseconds | Quarterly access: completed project checkpoints, compliance archives |
| Glacier Flexible Retrieval | $0.0036 | $0.01 (standard), free (bulk) | 3-5 hrs (standard), 5-12 hrs (bulk) | Annual access: old training runs, regulatory archives |
| Glacier Deep Archive | $0.00099 | $0.02 (standard), $0.0025 (bulk) | 12 hrs (standard), 48 hrs (bulk) | Compliance-only cold storage: audit logs, data lineage |
[Figure: Egress Cost Flows — Where Your Data Budget Goes]
Kubernetes Cost Efficiency: HPA, VPA, and Karpenter
Kubernetes introduces a specific class of cost inefficiency: wasted reserved capacity from over-provisioned resource requests. When a pod requests 4 CPU but uses 0.5 CPU, the scheduler treats those 3.5 CPU as occupied. The node is at "full capacity" but physically underutilized. You pay for an extra node when existing nodes have plenty of available physical capacity.
The three autoscaling primitives:
HPA (Horizontal Pod Autoscaler): Scales the number of pod replicas based on observed metrics (CPU utilization, custom metrics via KEDA, Prometheus). HPA is the primary cost-control tool for online services — it scales down during off-peak hours. A web service that needs 20 pods at peak and 3 pods at night saves 85% of compute during the 14 hours it runs at low traffic.
VPA (Vertical Pod Autoscaler): Analyzes historical resource usage and updates pod resource requests to match actual consumption. VPA fixes the over-provisioning problem — it will lower a pod's CPU request from 4 CPU to 0.8 CPU based on observed usage. Warning: VPA requires pod restarts to apply changes and is incompatible with HPA on the same CPU metric. Production pattern: use VPA in recommendation-only mode to surface rightsizing opportunities, then apply them manually.
Karpenter (next-generation node provisioner): Karpenter replaces the Kubernetes Cluster Autoscaler. The key differences: Cluster Autoscaler can only add nodes from pre-configured node groups; Karpenter provisions any node type that fits pending pods. Karpenter chooses spot instances when workloads can tolerate interruption (controlled via node selectors and NodePool requirements, e.g. the karpenter.sh/capacity-type label), right-sizes nodes to the pod requirements (no wasted capacity on oversized nodes), and consolidates workloads by evicting pods onto fewer nodes and terminating underutilized nodes. In production deployments, Karpenter typically reduces node count by 20-40% compared to Cluster Autoscaler through better bin packing.
The bin packing insight: bin packing efficiency depends entirely on accurate pod resource requests. If every pod requests 2x more than it uses, bin packing is impossible — the scheduler sees nodes as full when they have physical headroom. The tooling stack for accurate requests: VPA recommendation mode → Goldilocks (open-source VPA dashboard) → manual review → apply and monitor.
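A toy first-fit-decreasing packer makes the dependence on accurate requests concrete (illustrative only; Karpenter's real provisioning also weighs price, AZ, and instance families):

```python
def nodes_needed(requests: list[float], node_cores: float = 8.0) -> int:
    """First-fit-decreasing packing of CPU requests onto fixed-size nodes."""
    free: list[float] = []                 # remaining capacity per node
    for req in sorted(requests, reverse=True):
        for i, cap in enumerate(free):
            if cap >= req:
                free[i] -= req
                break
        else:
            free.append(node_cores - req)  # open a new node
    return len(free)

actual = [0.5, 1.0, 2.0, 0.5, 1.5, 3.0, 0.5, 1.0]
print(nodes_needed(actual))                   # 2 nodes for the real usage
print(nodes_needed([r * 2 for r in actual]))  # 3 nodes once requests are 2x inflated
```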
[Figure: Kubernetes Cost Efficiency Stack — From Pod to Node]
FinOps: Making Cost Visible to the Teams That Own It
FinOps (Financial Operations) is the organizational discipline of making cloud costs visible, attributable, and actionable at the team level. The core insight: engineers don't optimize what they can't see. When the AWS bill is a single number reviewed by the CFO once a month, no individual team has a reason to care about cost. When each team sees their own weekly spend in their sprint review, cost optimization becomes part of engineering culture.
Cost allocation tags — the foundation:
Every cloud resource must be tagged with at minimum: team, service, environment (prod/staging/dev), cost-center. Use AWS Organizations Service Control Policies (SCPs) to block resource creation without mandatory tags. Without tagging enforcement, you will have 30-40% of your bill as "untagged" with no attribution within 6 months. AWS Cost Explorer and tools like Infracost, CloudHealth, or Kubecost can then generate per-team spend reports from tag data.
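A sketch of the per-team report via the Cost Explorer API, assuming `team` is activated as a cost-allocation tag; the dates are placeholders:

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer must be enabled on the account

# Last month's spend grouped by the `team` cost-allocation tag.
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-05-01", "End": "2024-06-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    team = group["Keys"][0]  # e.g. "team$payments"; empty value means untagged
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{team}: ${amount:,.2f}")
```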
Showback vs. Chargeback:
- Showback: teams receive visibility into their cloud spend but are not billed for it internally. Low friction, good starting point.
- Chargeback: teams are actually charged against their cost center or team budget for their cloud usage. Higher friction to implement (requires finance integration) but produces dramatically stronger cost optimization behavior. When an ML team's training runs come out of their headcount budget equivalent, they start thinking carefully about spot vs. on-demand and whether to keep 200 GPUs running overnight.
Unit economics — the maturity signal: Absolute spend is a vanity metric when the company is growing. A company with $500K/month AWS spend and 10M users is more cost-efficient than one with $100K/month AWS spend and 50K users. Track:
- Cost per request (total infra cost / request volume) — tracks infrastructure efficiency as load scales
- Cost per active user per month (total infra cost / MAU) — tracks product cost efficiency
- Cost per GB processed (for data pipelines) — tracks pipeline efficiency
Set budget alerts at 10-20% over the 30-day rolling baseline. Wire them to PagerDuty or Slack, not just email — an unexpected 40% spend spike at 2am from a runaway training job should wake someone up, not appear in Monday's report.
The Five Most Expensive Cost Mistakes in Production
1. Purchasing Reserved Instances before rightsizing. You lock in a 1-year discount on a waste pattern. The RI is non-transferable to a smaller instance type. Rightsize first, always.
2. No lifecycle policies on S3 buckets. S3 bills compound silently. A bucket with no lifecycle policy accumulates years of data at Standard rates. One Terraform block eliminates this permanently.
3. Ignoring inter-AZ egress in microservices. A 20-service architecture with no AZ affinity policy can pay $50K+/month for internal traffic. Architects don't think about this because the traffic diagram doesn't show dollar signs.
4. Over-provisioned Kubernetes resource requests. Requesting 4 CPU when the pod uses 0.5 CPU makes bin packing impossible. You pay for 8x more nodes than needed. VPA recommendation mode costs nothing to enable and reveals this immediately.
5. Spot instances without checkpointing. Running a 24-hour ML training job on spot without checkpointing means a 2-minute interruption at hour 22 restarts the job from scratch. Implement checkpoint-every-30-minutes and spot interruption handlers. Then spot is safe; without it, spot is a gamble that costs you when it loses.
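A minimal interruption-handler sketch for item 5, assuming IMDSv1 is reachable (IMDSv2 additionally requires a session token); save_checkpoint is a placeholder for your job's own checkpoint logic:

```python
import time
import urllib.error
import urllib.request

# Poll the EC2 instance metadata service for a spot interruption notice
# and checkpoint before the 2-minute reclaim.
URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def save_checkpoint() -> None:
    ...  # placeholder: e.g. flush model state to S3

while True:
    try:
        with urllib.request.urlopen(URL, timeout=1):
            save_checkpoint()  # a 200 response means reclaim is scheduled
            break
    except urllib.error.URLError:
        pass                   # 404 until AWS schedules the interruption
    time.sleep(5)
```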
Interview Summary — What to Say to Stand Out
Lead with the sequence: tag → rightsize → then commit. Engineers who say "buy RIs" first reveal they don't understand the optimization order.
Call out egress as the hidden killer. Most interviewers are impressed when candidates mention inter-AZ egress costs specifically — it's a concrete signal of production experience vs. tutorial knowledge.
Frame everything in unit economics: cost per request, cost per user. This signals staff-level thinking — you're optimizing a ratio, not chasing an absolute number down.
For Kubernetes: mention Karpenter by name and explain why it's better than Cluster Autoscaler (topology awareness, spot support, right-sizing, consolidation). Cluster Autoscaler is the known answer; Karpenter is the current production standard.
For FinOps: distinguish chargeback from showback and state that chargeback changes engineer behavior. This is the organizational insight that separates someone who has done FinOps from someone who has read about it.