Skip to main content
2026-03-06
11 min read

The 3 Infrastructure Decisions That Determine Your Engineering Velocity

The 3 Infrastructure Decisions That Determine Your Engineering Velocity

TLDR

  • Three decisions matter most: Provisioning model (how you create infrastructure), environment strategy (dev/staging/prod topology), and deployment surface (where code runs)
  • Provisioning: Manual ops → 2-3 days per change. Scripted → 4-8 hours. Terraform → 30-60 minutes. Platform-abstracted → 5-10 minutes
  • Environments: Production-only → fast but risky. Dev/staging/prod → safe but slow (30-90 min deploys). Ephemeral per-PR → fast AND safe (5-15 min feedback loops)
  • Deployment surface: Managed platforms = fastest (minutes to production), VMs = moderate (hours to days), Kubernetes = slowest (weeks to months for first deploy)
  • The velocity tax: Each additional approval gate adds 15-45 minutes per deployment. At 40 deploys/day across a 20-person team, that's ~$31K/month in approval overhead alone
  • Decision principle: Choose the simplest option that meets your requirements. Complexity kills velocity faster than any other factor

The Infrastructure Decisions That Actually Matter

Engineering teams obsess over monitoring tools, service meshes, and database choices. They spend weeks evaluating container runtimes. They debate GitOps vs traditional CI/CD.

These decisions matter, but they're optimizations. They affect developer experience and operational efficiency, but they don't fundamentally change your team's ability to ship quickly.

Three infrastructure decisions have outsized impact on velocity:

  1. Provisioning Model: How you create, modify, and destroy infrastructure
  2. Environment Strategy: Your dev/staging/prod topology and how work flows through it
  3. Deployment Surface: Where your code actually runs

Get these wrong, and your team will struggle no matter how good your other choices are. Get these right, and you'll ship faster than teams with "better" infrastructure.

This guide examines each decision, quantifies the velocity impact, and provides frameworks for choosing correctly based on team size and maturity.


Decision 1: Provisioning Model

The question: When you need a new database, load balancer, or storage bucket, how long does it take from decision to usable resource?

The Four Provisioning Maturity Levels

Level 1: Manual Operations (2-3 days per change)

  • Someone clicks through cloud console UI
  • Takes screenshots for documentation
  • No repeatability or versioning
  • Common in early startups (1-5 engineers)

Velocity impact: Every infrastructure change requires dedicated focus time. Deploying a new service that needs a database, cache, and message queue? That's 3× 2-3 days = 6-9 days of infrastructure work before your first line of application code runs.

Level 2: Scripted Provisioning (4-8 hours per change)

  • Bash scripts or CLI commands
  • Some documentation, minimal versioning
  • Better than manual, but fragile
  • Common in growing startups (5-15 engineers)

Velocity impact: You've eliminated the "where's that button in the console?" tax, but scripts break when cloud APIs change. You'll spend 2-4 hours fixing brittle automation quarterly. At 10 infrastructure changes/month, that's 40-80 hours/month of provisioning work.

Level 3: Infrastructure as Code (Terraform, Pulumi, CloudFormation) (30-60 minutes per change)

  • Declarative configuration
  • Version controlled and reviewed
  • Plan before apply, state management
  • Standard for mid-stage companies (15-50 engineers)

Velocity impact: The goldilocks zone for most teams. Changes are fast enough that infrastructure isn't the bottleneck, but controlled enough that you don't accidentally destroy production. 30-60 minutes from PR to merged infrastructure.

Level 4: Platform-Abstracted (5-10 minutes per change)

  • Developers self-serve through internal platform or managed service
  • Infrastructure provisioned automatically based on application config
  • Examples: Heroku (provisions DB on heroku addons:create), internal IDP with service catalogs
  • Practical for larger companies (50+ engineers with platform team)

Velocity impact: Infrastructure becomes invisible. Developers declare "I need PostgreSQL" and get it without thinking about VPCs, security groups, or backup policies. 5-10 minutes from config change to usable resource.

The Provisioning Velocity Tax

Let's quantify this with a real scenario: Your team builds 2 new microservices per month. Each needs:

  • PostgreSQL database
  • Redis cache
  • S3 bucket for file storage
  • Application secrets
  • Load balancer/ingress

Time to provision per service:

Manual Operations:    3 days × 5 resources = 15 days (120 hours)
Scripted:             6 hours × 5 resources = 30 hours
Infrastructure as Code: 45 min × 5 resources = 3.75 hours
Platform-Abstracted:   7 min × 5 resources = 35 minutes

Monthly cost (2 services at $16,700/engineer-month):

Manual:    240 hours = 1.5 FTE = $25,000/month
Scripted:  60 hours = 0.375 FTE = $6,300/month
IaC:       7.5 hours = 0.047 FTE = $780/month
Platform:  1.2 hours = 0.007 FTE = $120/month

The difference between manual and IaC is $24,220/month in engineer time—nearly a senior engineer's salary.

Choosing Your Provisioning Model

Stay at Level 1 (Manual) if:

  • You're pre-product-market-fit (1-3 engineers)
  • You provision infrastructure less than once/week
  • Your total infrastructure is <10 resources

Move to Level 2 (Scripted) when:

  • You're provisioning infrastructure 2-3×/week
  • Multiple people need to provision similar resources
  • You hit 10-30 total infrastructure resources

Adopt Level 3 (IaC) when:

  • You have 5+ engineers touching infrastructure
  • You need multi-environment (dev/staging/prod) provisioning
  • You're spending >10 hours/week on infrastructure changes

Build Level 4 (Platform) when:

  • You have 50+ engineers
  • You can dedicate 2+ FTEs to platform engineering
  • Developer self-service is a bottleneck (>5 infrastructure requests/day)

Most teams should aim for Level 3 (IaC) and stop. Level 4 is only worth the investment at significant scale.


Decision 2: Environment Strategy

The question: How does code flow from development to production, and how many approval gates exist?

The Environment Spectrum

Pattern 1: Production Only (Deploy time: 5-15 minutes)

Developers push directly to production. No staging environment. Works for:

  • Very early startups (<5 engineers)
  • Teams with comprehensive automated testing
  • Low-risk applications (internal tools, content sites)

Velocity impact: Maximum speed. Feature branches merge to main, CI runs, production deploys. Total time from merge to production: 5-15 minutes.

Risk: No safety net. Bugs reach users immediately. Requires excellent testing culture and fast rollback capability.

Pattern 2: Staging + Production (Deploy time: 30-90 minutes)

The industry standard:

  1. Merge to main
  2. CI deploys to staging
  3. Manual QA/smoke tests
  4. Promote to production (manual or automated)

Velocity impact: Adds 15-75 minutes of waiting between merge and production deploy. At 10 deploys/day across a 10-person team, that's 2.5-12.5 engineer-hours daily waiting for staging validation.

The hidden cost: Staging environments drift from production. Database state differs. Traffic patterns don't match. You'll discover production-only bugs monthly, costing 4-8 hours of debugging each.

Pattern 3: Dev + Staging + Production (Deploy time: 60-180 minutes)

Common in regulated industries:

  1. Develop in local/dev environment
  2. Merge deploys to shared dev environment
  3. Promote to staging for QA
  4. Promote to production after approvals

Velocity impact: Each environment adds 15-45 minutes of wait time. Total time from commit to production: 60-180 minutes depending on automation.

Cost: Maintaining 3 environments costs $2,000-$5,000/month in infrastructure alone for typical web applications. Add 5-10 hours/week of "fixing dev environment" work.

Pattern 4: Ephemeral PR Environments (Feedback time: 5-15 minutes)

Modern approach:

  • Each pull request gets isolated environment
  • Automated tests + human review happen in PR environment
  • Merge to main deploys directly to production
  • PR environments destroyed after merge

Velocity impact: Fastest feedback loops. Reviewers see changes running in real environment within 5-15 minutes of pushing code. No waiting for shared staging environment.

Tools: Vercel/Netlify (frontend), Render Preview Environments, Railway PR Deploys, Kubernetes with Argo CD + preview namespaces

Cost: Variable based on PR volume. Roughly $500-$2,000/month for teams with 20-50 open PRs simultaneously.

The Approval Gate Tax

Every manual approval step adds latency. Let's quantify:

Scenario: 20-engineer team, each engineer deploys 2×/day average

  • Total deploys: 40/day
  • Each manual approval: 15-30 min average (including context switching for approver)
  • Daily cost: 40 × 22.5 min = 900 minutes = 15 hours/day of combined waiting + approval time
  • Monthly cost: 15 hours/day × 20 workdays = 300 hours/month × $104/hour = ~$31,000/month in approval overhead

And that's assuming approvals happen within 30 minutes. In practice, approvals often wait hours for the right person to be available, multiplying this cost.

Choosing Your Environment Strategy

Production-only if:

  • Pre-PMF startup (<5 engineers)
  • Excellent automated test coverage (>80%)
  • Low user impact from bugs (internal tools, dev tools)
  • Fast rollback capability (<5 minutes)

Ephemeral PR + Production if:

  • 5-50 engineers
  • Modern tooling (React, Next.js, containerized services)
  • Fast CI/CD (<10 min test suite)
  • Supported by your deployment platform

Staging + Production if:

  • Complex integrations requiring manual QA
  • Slow test suites (>15 minutes)
  • High cost of bugs in production
  • You're not ready for ephemeral environments

Dev + Staging + Production if:

  • Regulated industry with compliance requirements
  • Multiple external integrations that need isolated testing
  • Explicitly required by customers/contracts

Avoid this unless required. The velocity tax rarely justifies the additional safety.


Decision 3: Deployment Surface

The question: Where does your application code actually run?

This is the "hosting choice" decision: managed platform vs VMs vs containers/Kubernetes.

The Three Deployment Surface Tiers

Tier 1: Managed Platforms (Time to first production deploy: 1-3 days)

Examples: Heroku, Render, Railway, Fly.io, DigitalOcean App Platform, Vercel, Netlify

What they handle:

  • Runtime environment (Node, Python, Ruby, etc.)
  • SSL certificates
  • Load balancing
  • Log aggregation
  • Metrics and health checks
  • Deployment pipeline
  • Auto-scaling (some platforms)

What you handle:

  • Application code
  • Database schema migrations
  • Environment configuration

Velocity impact: Fastest time-to-production. Typical flow:

  1. Connect Git repository (5 minutes)
  2. Configure environment variables (10 minutes)
  3. Deploy (platform builds and deploys automatically)

From zero to production: 1-3 days for a new service including development time.

Cost: $50-$500/month per service for typical web applications.

Trade-off: Less control over infrastructure. Limited customization of networking, OS-level packages, or deployment strategies.

Best for: Stateless web apps, APIs, background workers, frontend applications. Teams under 30 engineers.

Tier 2: Virtual Machines (Time to first production deploy: 1-2 weeks)

Examples: EC2, DigitalOcean Droplets, Linode, Azure VMs

What you handle:

  • OS configuration and patching
  • Runtime installation (Node, Python, etc.)
  • Process management (systemd, supervisord)
  • SSL via Let's Encrypt/ACM
  • Application deployment scripts
  • Log shipping to external service
  • Monitoring agent installation

Velocity impact: Slower initial setup, but full control. Typical flow:

  1. Provision VM and configure networking (2-4 hours)
  2. Install runtime and dependencies (1-2 hours)
  3. Configure deployment automation (4-8 hours)
  4. Set up monitoring/logging (2-4 hours)
  5. Security hardening (2-4 hours)

From zero to production: 1-2 weeks including development.

Cost: $20-$200/month per VM depending on size. Additional monitoring/logging costs.

Trade-off: More flexibility, more maintenance. You're responsible for OS patches, security updates, and daemon management.

Best for: Stateful applications (databases, message queues), legacy applications with specific OS requirements, long-running processes. Teams 10-50 engineers.

Tier 3: Container Orchestration (Kubernetes) (Time to first production deploy: 1-3 months)

Examples: EKS, GKE, AKS, self-managed K8s

What you handle:

  • Cluster provisioning and upgrades
  • Node pool management
  • Networking (CNI, ingress controllers, service mesh)
  • Storage (CSI drivers, PVCs)
  • Deployment manifests (YAML, Helm charts)
  • GitOps tooling (Argo CD, Flux)
  • Observability (Prometheus, Grafana)
  • Security policies (NetworkPolicies, PodSecurityPolicies)

Velocity impact: Slowest initial setup, highest operational complexity. Typical flow for first service:

  1. Provision cluster (4-8 hours with managed K8s)
  2. Configure networking/ingress (8-16 hours)
  3. Set up CI/CD pipeline (16-32 hours)
  4. Create deployment manifests (4-8 hours)
  5. Configure monitoring/logging (8-16 hours)
  6. Security hardening (8-16 hours)
  7. Team training (40-80 hours)

From zero to production: 1-3 months for first service with team ramp-up.

Cost: $200-$1,000/month for managed control plane + nodes. Additional tooling costs (ingress controllers, monitoring, etc.) can add $200-$500/month.

Trade-off: Maximum flexibility and control. Can handle complex deployment strategies, multi-region, advanced networking. Requires dedicated platform engineering time.

Best for: Large organizations (50+ engineers), multi-region requirements, complex microservices architectures, teams with existing K8s expertise.

The Deployment Surface Velocity Matrix

Factor Managed Platform VMs Kubernetes
First deploy 1-3 days 1-2 weeks 1-3 months
Subsequent deploys 5-15 min 10-30 min 15-45 min
New service onboarding 1-4 hours 1-2 days 3-5 days
Team onboarding 1-2 days 3-5 days 2-4 weeks
Maintenance burden <5% FTE 10-15% FTE 20-40% FTE
Debugging complexity Low Medium High
Multi-region support Limited Manual Native
Scaling complexity Automatic Manual/scripts Complex but powerful

The Real Cost: Opportunity Cost

The question isn't "how much does Kubernetes cost?" It's "what could we build with the time we spend on Kubernetes?"

Example: 30-engineer team chooses Kubernetes

  • Initial setup: 120 engineer-hours = $10,000
  • Ongoing maintenance: 20% of 2 platform engineers = 0.4 FTE = $6,680/month
  • Annual cost: $10,000 + ($6,680 × 12) = $90,160/year

Alternative: Same team chooses managed platform

  • Initial setup: 8 engineer-hours = $670
  • Ongoing: 5% of 0.5 FTE = $420/month
  • Annual cost: $670 + ($420 × 12) = $5,710/year

Savings: $84,450/year in engineering time = ~0.5 FTE freed up for product development

Choosing Your Deployment Surface

Choose managed platforms if:

  • Team <30 engineers
  • Standard web applications/APIs
  • No multi-region requirements
  • Willing to trade control for velocity

Choose VMs if:

  • Specific OS-level dependencies
  • Long-running stateful processes
  • Predictable load (no need for rapid scaling)
  • Team 10-50 engineers

Choose Kubernetes if:

  • 50+ engineers with 2+ dedicated platform engineers
  • Multi-region deployment required
  • Complex networking requirements
  • Existing K8s expertise on team

Default to the simplest option. You can always migrate later, but you can't reclaim the engineering time spent on complex infrastructure.


Putting It All Together: Real-World Scenarios

Let's see how these three decisions compound in real teams:

Scenario A: Early Startup (8 Engineers)

Infrastructure choices:

  • Provisioning: Manual via cloud console (Level 1)
  • Environments: Production only
  • Deployment: Render (managed platform)

Velocity profile:

  • Deploy frequency: 15-20×/day across team
  • Average deploy time: 8 minutes
  • Infrastructure changes: 2-3×/week taking 2 hours each
  • Team time on infrastructure: ~5% of capacity

Result: Maximum product velocity. Team ships features fast. Infrastructure is barely a consideration.

When to evolve: At 15+ engineers or when infrastructure changes hit 5+/week.

Scenario B: Growth-Stage Startup (35 Engineers)

Infrastructure choices:

  • Provisioning: Terraform (Level 3 IaC)
  • Environments: Ephemeral PR environments + Production
  • Deployment: Mix of Railway (web services) and DigitalOcean VMs (databases, Redis)

Velocity profile:

  • Deploy frequency: 60-80×/day across team
  • Average deploy time: 12 minutes
  • Infrastructure changes: 10-15×/week taking 45 min each
  • Team time on infrastructure: ~10% of capacity
  • 1 dedicated platform engineer (not full-time)

Result: Balanced velocity and control. Ephemeral environments enable fast feedback. Terraform enables self-service infrastructure for developers. Still shipping quickly without operational burden.

When to evolve: At 60+ engineers or when single-region becomes a scaling bottleneck.

Scenario C: Mid-Stage Company (120 Engineers)

Infrastructure choices:

  • Provisioning: Internal developer platform built on Terraform (Level 4)
  • Environments: Ephemeral per-PR + Staging + Production
  • Deployment: Kubernetes (EKS) with Argo CD

Velocity profile:

  • Deploy frequency: 200-300×/day across team
  • Average deploy time: 18 minutes
  • Infrastructure changes: Self-service via platform (5 min each)
  • Team time on infrastructure: ~3% of capacity (concentrated in 5-person platform team)
  • Platform team: 5 dedicated engineers

Result: High velocity despite complexity. Platform team abstracts Kubernetes complexity. Developers self-serve infrastructure. Most teams don't interact with K8s directly.

Trade-off: 5 engineers maintain platform = ~$1M/year. Only justified at this scale.

Scenario D: Enterprise (400+ Engineers)

Infrastructure choices:

  • Provisioning: Multi-cloud IDP with approval workflows (Level 4)
  • Environments: Dev + Staging + Prod + per-PR ephemeral (for web tier)
  • Deployment: Multi-region Kubernetes across AWS + GCP

Velocity profile:

  • Deploy frequency: 800-1,200×/day across org
  • Average deploy time: 25 minutes (includes compliance checks)
  • Infrastructure changes: Fully self-service
  • Team time on infrastructure: ~2% (concentrated in 20-person platform org)
  • Platform org: 20+ engineers, product managers, SREs

Result: Velocity maintained at scale through automation. Extensive tooling and process required. Complex infrastructure is justified by organization size.

Cost: Platform team is ~$4M+/year. Only viable at enterprise scale with $50M+ engineering budget.


The Velocity Decision Framework

When making infrastructure decisions, use this framework:

Step 1: Define Your Velocity Target

What's acceptable for your stage?

Seed stage (<10 engineers): Deploy 10-20×/day, 5-10 min per deploy
Series A (10-30 engineers): Deploy 30-60×/day, 8-15 min per deploy
Series B+ (30-100 engineers): Deploy 100-200×/day, 10-20 min per deploy
Enterprise (100+ engineers): Deploy 200+/day, 15-30 min per deploy

Step 2: Calculate Current Velocity Tax

Measure these:

Step 3: Identify the Highest-Impact Change

What's your biggest bottleneck?

If provisioning takes >2 hours: Adopt infrastructure as code (Terraform)
If deployment takes >30 min: Reduce environment gates or adopt ephemeral environments
If maintenance >15% FTE: Simplify deployment surface (consider managed platform)

Step 4: Choose Simplicity Over Flexibility

When in doubt, choose the simpler option:

  • Manual provisioning → Scripted → Terraform → Platform (Stop at Terraform unless 50+ engineers)
  • Production-only → Staging+Prod → Ephemeral+Prod (Stop at Ephemeral unless regulated)
  • Managed Platform → VMs → Kubernetes (Stop at VMs unless 50+ engineers)

Complexity is expensive. Every additional layer costs 10-20% of an engineer's time in maintenance.

Step 5: Plan Your Evolution

Don't optimize for year 3 on day 1. Plan to evolve:

Today (5-15 engineers):

  • Manual provisioning or simple scripts
  • Production-only or Production+Staging
  • Managed platforms

Next 12 months (15-30 engineers):

  • Terraform for infrastructure
  • Ephemeral PR environments + Production
  • Still on managed platforms or considering VMs

Next 24 months (30-60 engineers):

  • Terraform + self-service patterns
  • Ephemeral + Production (maybe Staging for critical paths)
  • Evaluating Kubernetes (but probably don't need it yet)

Next 36+ months (60+ engineers):

  • Internal platform (if justified)
  • Kubernetes (if multi-region or complex requirements)
  • Dedicated platform team (2-5 engineers)

Key Takeaways

1. Three decisions determine velocity: Provisioning model, environment strategy, deployment surface. Everything else is secondary.

2. Manual provisioning costs $24K/month more than Terraform at typical growth-stage scale. Automate infrastructure as code by 10 engineers.

3. Each approval gate costs 15-45 minutes per deployment. At 40 deploys/day, that's $4,000+/month in pure waiting time.

4. Managed platforms are fastest for 90% of applications. Don't adopt Kubernetes before 50 engineers unless you have specific requirements.

5. Ephemeral PR environments provide the best balance of safety and speed for modern development workflows.

6. Complexity costs 10-20% engineer capacity per infrastructure layer. A team on Kubernetes + multi-cloud + 5 environments might spend 40%+ of time on infrastructure maintenance.

7. Optimize for iteration speed over theoretical scalability. You can migrate to more complex infrastructure later. You can't reclaim engineering time spent maintaining unnecessary complexity.


The Bottom Line

The best infrastructure for your team is the simplest one that meets your actual requirements.

Not the infrastructure your competitors use. Not the infrastructure in conference talks. Not the infrastructure that sounds impressive in blog posts.

The infrastructure that lets your team ship features quickly, safely, and without operational burden.

Most teams should:

  • Use Terraform for infrastructure provisioning (Level 3)
  • Adopt ephemeral PR environments + production (unless regulated)
  • Deploy to managed platforms (until 30+ engineers)

This combination provides:

  • ~15 minute time from merge to production
  • <10% engineer capacity spent on infrastructure
  • Minimal operational complexity
  • Easy onboarding for new team members

Everything else is optimization. Start simple, evolve as needed, and always measure the velocity tax of complexity before adopting it.

Published: 2026-03-06|Last updated: 2026-03-06T10:00:00Z

Found an issue?