Common Cloud Cost Challenges for DevOps Teams
Identify and understand the most frequent cloud cost problems that DevOps teams encounter and why they happen.
DevOps teams face unique cloud cost challenges that stem from the intersection of rapid deployment practices, complex cloud pricing models, and traditional financial processes. Understanding these challenges helps you identify cost optimization opportunities and prevent common pitfalls.
Over-Provisioning: The "Just to Be Safe" Problem
Over-provisioning happens when teams allocate more resources than actually needed, often as insurance against performance issues. This practice stems from on-premises thinking where adding capacity required lengthy procurement processes.
Why Over-Provisioning Happens
Fear of Performance Issues: Teams often choose larger instance sizes to avoid potential slowdowns, especially for customer-facing applications.
Lack of Right-Sizing Knowledge: Without clear usage data, teams make educated guesses that tend toward over-allocation.
One-Size-Fits-All Approaches: Using the same instance type for all environments, even when development and staging don't need production-level resources.
Here's a typical example of over-provisioning in Terraform:
# Common over-provisioning pattern
resource "aws_instance" "web" {
count = 3
ami = "ami-0abcdef1234567890"
instance_type = "m5.2xlarge" # Often oversized for actual needs
tags = {
Name = "web-server-${count.index}"
Environment = var.environment
}
}
# Data volume that's larger than needed
resource "aws_ebs_volume" "data" {
availability_zone = "us-west-2a"
size = 1000 # Often allocated based on worst-case estimates
type = "gp3"
tags = {
Name = "application-data"
}
}
The Real Cost Impact
Over-provisioning compounds quickly across environments and teams. A single over-sized production instance might cost an extra $200/month, but when multiplied across development, staging, and production environments for multiple teams, the waste becomes significant.
Example calculation:
- Production: 3 x m5.2xlarge at $0.384/hour = $829/month (assuming 720 hours/month)
- Right-sized: 3 x m5.large at $0.096/hour = $207/month
- Monthly waste: $622 per application
Identifying Over-Provisioning
You can identify over-provisioned resources by monitoring utilization metrics:
# AWS CLI command to get EC2 CPU utilization for the last 30 days
# (uses GNU date; on macOS, replace "date -d '30 days ago'" with "date -v-30d")
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--start-time $(date -d '30 days ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Average,Maximum
Resources consistently running below 40% CPU utilization are candidates for right-sizing.
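If you want to automate that check, here is a minimal Python sketch using boto3 that flags an instance when its CPU stays low over 30 days. The 40% average and 60% peak thresholds are illustrative assumptions; tune them for your workloads.
import boto3
from datetime import datetime, timedelta, timezone

def is_rightsizing_candidate(instance_id, avg_threshold=40, max_threshold=60):
    """Return True if CPU stayed below both thresholds over the last 30 days."""
    cloudwatch = boto3.client('cloudwatch')
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=datetime.now(timezone.utc) - timedelta(days=30),
        EndTime=datetime.now(timezone.utc),
        Period=3600,  # hourly datapoints
        Statistics=['Average', 'Maximum']
    )
    datapoints = response['Datapoints']
    if not datapoints:
        return False  # no data; don't flag
    avg_cpu = sum(dp['Average'] for dp in datapoints) / len(datapoints)
    peak_cpu = max(dp['Maximum'] for dp in datapoints)
    return avg_cpu < avg_threshold and peak_cpu < max_threshold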
Zombie Resources: The Hidden Cost Drain
Zombie resources are cloud resources that continue running and generating charges long after they're needed. These often result from development workflows, testing, or failed cleanup processes.
Common Types of Zombie Resources
Forgotten Development Environments: Temporary environments created for testing that never get cleaned up.
Orphaned Load Balancers: Load balancers that continue running after their target instances are terminated.
Unused Storage Volumes: EBS volumes, S3 buckets, or database snapshots that are no longer attached to active resources.
Idle Databases: Database instances that were created for testing and left running indefinitely.
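Some of these are easy to find programmatically. For example, here is a short boto3 sketch that lists EBS volumes in the "available" state, which are provisioned (and billed) but attached to nothing:
import boto3

def find_unattached_volumes():
    """List EBS volumes that are not attached to any instance."""
    ec2 = boto3.client('ec2')
    paginator = ec2.get_paginator('describe_volumes')
    unattached = []
    for page in paginator.paginate(
            Filters=[{'Name': 'status', 'Values': ['available']}]):
        for volume in page['Volumes']:
            unattached.append({
                'VolumeId': volume['VolumeId'],
                'SizeGiB': volume['Size'],
                'CreateTime': volume['CreateTime'],
            })
    return unattached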
Here's an example of how zombie resources accumulate in typical development workflows:
# Developer creates a feature branch environment
git checkout -b feature/new-payment-flow
terraform workspace new feature-payment-flow
terraform apply # Spins up full infrastructure

# Feature gets merged and deployed
git checkout main
git branch -d feature/new-payment-flow

# The cleanup step that never happens:
# terraform workspace select feature-payment-flow && terraform destroy
# Without it, the Terraform workspace and infrastructure remain running
# Monthly cost: $150-500 depending on services
The Zombie Detection Problem
Zombies are difficult to detect because:
- They often have minimal usage, avoiding basic monitoring alerts
- Resource tags may not clearly indicate their purpose or ownership
- Development teams may fear deleting resources they don't fully understand
Implementing Zombie Detection
The key to zombie detection is combining resource metadata with usage patterns. You want to identify resources that are running but not actually being used productively.
A simple approach focuses on two main indicators:
- Age: Resources that have been running for extended periods
- Utilization: Very low CPU, network, or disk activity
Here's a practical script that identifies potential zombie EC2 instances by checking these criteria:
import boto3
from datetime import datetime, timedelta, timezone

def find_zombie_instances():
    """Find EC2 instances that might be zombies based on age and low utilization."""
    ec2 = boto3.client('ec2')
    cloudwatch = boto3.client('cloudwatch')

    # Get all running instances
    instances = ec2.describe_instances(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )

    zombies = []
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']
            launch_time = instance['LaunchTime']  # timezone-aware UTC

            # Check if the instance is older than 30 days
            days_running = (datetime.now(timezone.utc) - launch_time).days
            if days_running > 30:
                # Check average CPU over the last week
                avg_cpu = get_average_cpu_utilization(cloudwatch, instance_id)

                # Flag as a potential zombie on very low CPU usage
                if avg_cpu is not None and avg_cpu < 5:
                    zombies.append({
                        'InstanceId': instance_id,
                        'DaysRunning': days_running,
                        'AvgCPU': avg_cpu
                    })
    return zombies

def get_average_cpu_utilization(cloudwatch, instance_id):
    """Get average CPU utilization for an instance over the last 7 days."""
    try:
        response = cloudwatch.get_metric_statistics(
            Namespace='AWS/EC2',
            MetricName='CPUUtilization',
            Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
            StartTime=datetime.now(timezone.utc) - timedelta(days=7),
            EndTime=datetime.now(timezone.utc),
            Period=3600,  # 1-hour periods
            Statistics=['Average']
        )
        if response['Datapoints']:
            return sum(dp['Average'] for dp in response['Datapoints']) / len(response['Datapoints'])
        return None
    except Exception as e:
        print(f"Could not get metrics for {instance_id}: {e}")
        return None

if __name__ == '__main__':
    for zombie in find_zombie_instances():
        print(f"{zombie['InstanceId']}: {zombie['DaysRunning']} days running, "
              f"{zombie['AvgCPU']:.1f}% average CPU")
This script identifies instances that have been running for more than 30 days with less than 5% average CPU utilization. You can customize these thresholds based on your environment.
To use this effectively:
- Run it weekly as part of your cost optimization routine
- Review flagged instances before taking action; some may have legitimate low usage
- Consider adding checks for network or disk utilization for more accuracy, as shown in the sketch after this list
- Always verify with the resource owner before terminating anything
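Here is a sketch of the network check suggested above. It reuses the imports and cloudwatch client from the zombie script; the "idle" threshold in the usage comment is an assumption, not an AWS guideline.
def get_average_network_out(cloudwatch, instance_id):
    """Average NetworkOut bytes per hour over the last 7 days, or None."""
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='NetworkOut',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=datetime.now(timezone.utc) - timedelta(days=7),
        EndTime=datetime.now(timezone.utc),
        Period=3600,
        Statistics=['Sum']  # bytes sent per hour
    )
    datapoints = response['Datapoints']
    if not datapoints:
        return None
    return sum(dp['Sum'] for dp in datapoints) / len(datapoints)

# Combine with the CPU check, e.g. flag only when both are low:
# net_out = get_average_network_out(cloudwatch, instance_id)
# if avg_cpu < 5 and (net_out is None or net_out < 1_000_000):  # < ~1 MB/hour
#     ...treat as a potential zombie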
Lack of Cost Attribution: The Shared Account Problem
Many organizations use shared cloud accounts across multiple teams, projects, or environments. Without proper cost attribution, it becomes impossible to understand which teams or projects are driving costs.
Why Cost Attribution Fails
Inconsistent Tagging: Teams use different tagging conventions or forget to tag resources entirely.
Shared Resources: Infrastructure components like networking or monitoring that serve multiple applications.
Resource Sprawl: As teams scale, keeping track of resource ownership becomes increasingly difficult.
The Business Impact
Without cost attribution:
- Teams can't make informed decisions about feature costs
- Finance can't allocate cloud spend to appropriate budgets
- Cost optimization efforts lack clear priorities
- Accountability for cloud spend becomes diffused
Implementing Effective Cost Attribution
A robust tagging strategy is essential for cost attribution:
# Terraform locals for consistent tagging
locals {
  common_tags = {
    Project     = var.project_name
    Environment = var.environment
    Team        = var.team_name
    Owner       = var.owner_email
    CostCenter  = var.cost_center
    ManagedBy   = "terraform"
    # Caution: timestamp() changes on every run, which causes perpetual
    # plan diffs; prefer passing a static creation date in via a variable
    CreatedDate = formatdate("YYYY-MM-DD", timestamp())
  }
}
# Apply tags consistently across all resources
resource "aws_instance" "web" {
ami = var.ami_id
instance_type = var.instance_type
tags = merge(local.common_tags, {
Name = "${var.project_name}-web-${var.environment}"
Component = "web-server"
Billable = "true"
})
}
resource "aws_s3_bucket" "assets" {
bucket = "${var.project_name}-assets-${var.environment}"
tags = merge(local.common_tags, {
Component = "static-assets"
Backup = "daily"
})
}
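Once these tags are in place (and activated as cost allocation tags in the Billing console), you can break spend down by tag. Here is a minimal sketch using boto3's Cost Explorer client; the dates are illustrative:
import boto3

ce = boto3.client('ce')  # Cost Explorer
response = ce.get_cost_and_usage(
    TimePeriod={'Start': '2024-01-01', 'End': '2024-02-01'},
    Granularity='MONTHLY',
    Metrics=['UnblendedCost'],
    GroupBy=[{'Type': 'TAG', 'Key': 'Team'}]
)
for group in response['ResultsByTime'][0]['Groups']:
    team = group['Keys'][0]  # e.g. "Team$platform"; "Team$" means untagged
    amount = float(group['Metrics']['UnblendedCost']['Amount'])
    print(f"{team}: ${amount:.2f}")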
Automated Tag Enforcement
Use cloud provider policies to enforce tagging requirements:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireProjectTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "Null": {"aws:RequestTag/Project": "true"}
      }
    },
    {
      "Sid": "RequireEnvironmentTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "Null": {"aws:RequestTag/Environment": "true"}
      }
    },
    {
      "Sid": "RequireTeamTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "Null": {"aws:RequestTag/Team": "true"}
      }
    }
  ]
}
This IAM policy denies ec2:RunInstances whenever the launch request is missing a required tag. Each tag gets its own statement because condition keys within a single statement are ANDed together; a single Null block listing all three keys would only deny launches missing all of them. Note that aws:RequestTag checks tags supplied in the launch request itself, so instances must be tagged at creation time.
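Policies only catch new launches. For resources that already exist, you can audit tag coverage with the Resource Groups Tagging API. Here is a sketch (REQUIRED_TAGS is an assumption matching the policy above; note this API returns resources that are or were previously tagged, so some never-tagged resources may not appear):
import boto3

REQUIRED_TAGS = {'Project', 'Environment', 'Team'}

def find_resources_missing_tags():
    """List resource ARNs that lack any of the required tag keys."""
    client = boto3.client('resourcegroupstaggingapi')
    paginator = client.get_paginator('get_resources')
    missing = []
    for page in paginator.paginate():
        for resource in page['ResourceTagMappingList']:
            tag_keys = {tag['Key'] for tag in resource.get('Tags', [])}
            absent = REQUIRED_TAGS - tag_keys
            if absent:
                missing.append({
                    'ResourceARN': resource['ResourceARN'],
                    'MissingTags': sorted(absent),
                })
    return missing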
Development and Production Parity Costs
Many teams mirror production environments for development and staging, leading to unnecessary costs for non-production workloads that don't require production-level resources.
The Mirroring Trap
Database Sizing: Running production-sized databases in development environments
Multi-AZ Deployments: Using high-availability configurations in single-developer environments
Storage Replication: Maintaining production backup and replication settings in test environments
Smart Environment Scaling
Implement environment-specific configurations:
# Variable-driven resource sizing based on environment
variable "environment_configs" {
  type = map(object({
    instance_type    = string
    min_capacity     = number
    max_capacity     = number
    multi_az         = bool
    backup_retention = number
    storage_size     = number
    max_storage_size = number
  }))
  default = {
    production = {
      instance_type    = "m5.xlarge"
      min_capacity     = 2
      max_capacity     = 10
      multi_az         = true
      backup_retention = 30
      storage_size     = 500  # GiB; storage values here are illustrative
      max_storage_size = 1000
    }
    staging = {
      instance_type    = "m5.large"
      min_capacity     = 1
      max_capacity     = 3
      multi_az         = false
      backup_retention = 7
      storage_size     = 100
      max_storage_size = 200
    }
    development = {
      instance_type    = "t3.medium"
      min_capacity     = 1
      max_capacity     = 2
      multi_az         = false
      backup_retention = 1
      storage_size     = 20
      max_storage_size = 50
    }
  }
}
# Use environment-specific configuration
locals {
config = var.environment_configs[var.environment]
}
resource "aws_db_instance" "main" {
identifier = "${var.project}-${var.environment}"
engine = "postgres"
engine_version = "13.7"
instance_class = "db.${local.config.instance_type}"
allocated_storage = local.config.storage_size
max_allocated_storage = local.config.max_storage_size
multi_az = local.config.multi_az
backup_retention_period = local.config.backup_retention
tags = local.common_tags
}
Scheduled Resource Management
Implement automated scheduling for non-production environments:
#!/bin/bash
# Stop/start development instances outside business hours.
# Intended to run hourly from cron. Compare hours as strings because
# date +%H is zero-padded and values like "08" break bash integer tests.
day=$(date +%u)   # day of week, 1 = Monday
hour=$(date +%H)  # 00-23

# Stop development environment instances at 6 PM weekdays
if [[ $day -le 5 && $hour == "18" ]]; then
  instance_ids=$(aws ec2 describe-instances \
    --filters "Name=tag:Environment,Values=development" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text)
  [[ -n $instance_ids ]] && aws ec2 stop-instances --instance-ids $instance_ids
fi

# Start development environment instances at 8 AM weekdays
if [[ $day -le 5 && $hour == "08" ]]; then
  instance_ids=$(aws ec2 describe-instances \
    --filters "Name=tag:Environment,Values=development" \
              "Name=instance-state-name,Values=stopped" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text)
  [[ -n $instance_ids ]] && aws ec2 start-instances --instance-ids $instance_ids
fi
Running development resources only during business hours (roughly 10 hours a day on weekdays, or about 50 of the 168 hours in a week) can reduce their compute costs by 60-70%.
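If you'd rather not maintain a host to run this script from cron, the same logic works as a small Lambda function triggered by two EventBridge schedules, for example cron(0 18 ? * MON-FRI *) to stop and cron(0 8 ? * MON-FRI *) to start (EventBridge cron runs in UTC). Here is a sketch, where the event's action field is an assumption you would set in each rule:
import boto3

ec2 = boto3.client('ec2')

def handler(event, context):
    """Stop or start tagged development instances based on event['action']."""
    action = event.get('action')  # 'stop' or 'start', set in each EventBridge rule
    state = 'running' if action == 'stop' else 'stopped'
    response = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:Environment', 'Values': ['development']},
            {'Name': 'instance-state-name', 'Values': [state]},
        ]
    )
    instance_ids = [
        instance['InstanceId']
        for reservation in response['Reservations']
        for instance in reservation['Instances']
    ]
    if not instance_ids:
        return
    if action == 'stop':
        ec2.stop_instances(InstanceIds=instance_ids)
    else:
        ec2.start_instances(InstanceIds=instance_ids)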