Capacity Planning and Scaling

senior

advanced

SRE

Question

How do you approach capacity planning for a growing production system? What metrics and strategies do you use?

Answer

Capacity planning ensures systems can handle current and future load. Process:

1) Establish baselines: current CPU, memory, disk, and network utilization, plus request rates.
2) Understand growth patterns: historical trends, seasonality, planned campaigns.
3) Define headroom: typically a 30-40% buffer for unexpected spikes.
4) Model scenarios: what happens at 2x, 5x, 10x traffic?
5) Identify bottlenecks: database connections, API rate limits, stateful components.
6) Plan the scaling strategy: vertical vs. horizontal, auto-scaling policies.
7) Load test regularly.

Review capacity quarterly.
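
The headroom, growth, and scenario steps above come down to simple arithmetic. Here is a minimal sketch, assuming load and capacity are expressed in the same unit (for example requests per second) and that growth compounds monthly. The function names and the numbers used with them are hypothetical and not taken from any particular tool.

```bash
#!/usr/bin/env bash
# Illustrative capacity math; function names and inputs are hypothetical.

# Load the system may carry while keeping the headroom buffer free.
# usage: usable_capacity <max_capacity> <headroom_fraction>
usable_capacity() {
  awk -v cap="$1" -v h="$2" 'BEGIN { printf "%.0f\n", cap * (1 - h) }'
}

# Months until compounding growth pushes peak load past usable capacity.
# usage: months_until_exhausted <peak_load> <usable_capacity> <monthly_growth>
months_until_exhausted() {
  awk -v peak="$1" -v usable="$2" -v g="$3" \
    'BEGIN { printf "%.1f\n", log(usable / peak) / log(1 + g) }'
}

# Does a traffic multiple (2x, 5x, 10x) still fit within usable capacity?
# usage: fits_at <peak_load> <multiplier> <usable_capacity>
fits_at() {
  awk -v peak="$1" -v m="$2" -v usable="$3" \
    'BEGIN { print ((peak * m <= usable) ? "ok" : "over") }'
}
```

For example, with a measured peak of 400 req/s, a tested maximum of 1000 req/s, and 30% headroom, `usable_capacity 1000 0.3` prints 700. At 10% monthly growth, `months_until_exhausted 400 700 0.10` prints 5.9, which is roughly how long remains before the buffer is consumed. `fits_at 400 2 700` prints "over", so a 2x spike would already eat into the headroom.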

Why This Matters

Capacity planning is both art and science. Too much capacity wastes money; too little causes outages. Cloud auto-scaling helps but doesn't solve everything - databases, third-party APIs, and stateful services often can't scale horizontally. Senior engineers must think about bottlenecks that aren't obvious and plan for Black Friday scenarios before they happen.

Code Examples

Horizontal Pod Autoscaler

yaml
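
The original manifest is not available here, so what follows is a minimal sketch of a HorizontalPodAutoscaler using the `autoscaling/v2` API. The names `web-hpa` and `web`, the replica bounds, and the 65% CPU target are all assumptions chosen for illustration. Set the target so that it leaves the headroom you decided on.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                  # hypothetical Deployment to scale
  minReplicas: 3               # assumed floor; keeps capacity for sudden spikes
  maxReplicas: 20              # assumed ceiling; check downstream limits first
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65   # scale out before headroom is exhausted
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # avoid flapping after short bursts
```

Note that `maxReplicas` is only safe if the dependencies behind the service (database connections, third-party rate limits) can absorb that many pods.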

Capacity analysis queries

bash
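
The original queries are not available here, so the following is a sketch of the kind of commands that could feed a capacity review. It assumes a Kubernetes cluster with metrics-server installed, a Prometheus server reachable at `PROM_URL`, node_exporter metrics, and a request counter named `http_requests_total`. The URL default, the metric names, and the helper function names are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for a capacity review; adjust names to your environment.
PROM_URL="${PROM_URL:-http://localhost:9090}"   # assumed Prometheus address

# Run an instant PromQL query through the Prometheus HTTP API.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Peak request rate over the last 7 days, sampled every 5 minutes.
PEAK_RPS_QUERY='max_over_time(sum(rate(http_requests_total[5m]))[7d:5m])'

# Free bytes on the root filesystem predicted 30 days out from the 7-day trend.
DISK_FORECAST_QUERY='predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[7d], 30*24*3600)'

# Current node usage and the ten most CPU-hungry pods (needs metrics-server).
cluster_usage() {
  kubectl top nodes
  kubectl top pods --all-namespaces --sort-by=cpu | head -n 10
}

# Example invocations (not run automatically):
#   prom_query "$PEAK_RPS_QUERY"
#   prom_query "$DISK_FORECAST_QUERY"
#   cluster_usage
```

Querying the peak rather than the average matters because capacity has to cover the busiest window, not the typical one.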

Common Mistakes

Only planning for average load, not peak load
Forgetting about dependent services that may become bottlenecks
Not accounting for the time it takes to scale (cold start, provisioning)

Follow-up Questions

Interviewers often ask these as follow-up questions

How do you handle capacity planning for stateful services like databases?
What is the difference between scaling up and scaling out?
How do you account for third-party API rate limits in capacity planning?
