Common Cloud Cost Challenges for DevOps Teams
Identify and understand the most frequent cloud cost problems that DevOps teams encounter and why they happen.
DevOps teams face unique cloud cost challenges that stem from the intersection of rapid deployment practices, complex cloud pricing models, and traditional financial processes. Understanding these challenges helps you identify cost optimization opportunities and prevent common pitfalls.
Over-Provisioning: The "Just to Be Safe" Problem
Over-provisioning happens when teams allocate more resources than actually needed, often as insurance against performance issues. This practice stems from on-premises thinking where adding capacity required lengthy procurement processes.
Why Over-Provisioning Happens
Fear of Performance Issues: Teams often choose larger instance sizes to avoid potential slowdowns, especially for customer-facing applications.
Lack of Right-Sizing Knowledge: Without clear usage data, teams make educated guesses that tend toward over-allocation.
One-Size-Fits-All Approaches: Using the same instance type for all environments, even when development and staging don't need production-level resources.
Here's a typical example of over-provisioning in Terraform:
# Common over-provisioning pattern
resource "aws_instance" "web" {
count = 3
ami = "ami-0abcdef1234567890"
instance_type = "m5.2xlarge" # Often oversized for actual needs
tags = {
Name = "web-server-${count.index}"
Environment = var.environment
}
}
# Data volume that's larger than needed
resource "aws_ebs_volume" "data" {
availability_zone = "us-west-2a"
size = 1000 # Often allocated based on worst-case estimates
type = "gp3"
tags = {
Name = "application-data"
}
}
The Real Cost Impact
Over-provisioning compounds quickly across environments and teams. A single over-sized production instance might cost an extra $200/month, but when multiplied across development, staging, and production environments for multiple teams, the waste becomes significant.
Example calculation:
- Production: 3 x m5.2xlarge at $0.384/hour = $829/month (assuming 720 hours/month)
- Right-sized: 3 x m5.large at $0.096/hour = $207/month
- Monthly waste: $622 per application
Identifying Over-Provisioning
You can identify over-provisioned resources by monitoring utilization metrics:
# AWS CLI command to get EC2 CPU utilization for the last 30 days
# (uses GNU date; on macOS, replace "date -d '30 days ago'" with "date -v-30d")
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--start-time $(date -d '30 days ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Average,Maximum
Resources consistently running below 40% CPU utilization are candidates for right-sizing.
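If you want to automate that check, here is a minimal Python sketch using boto3 that flags an instance when its CPU stays low over 30 days. The 40% average and 60% peak thresholds are illustrative assumptions; tune them for your workloads.
import boto3
from datetime import datetime, timedelta, timezone

def is_rightsizing_candidate(instance_id, avg_threshold=40, max_threshold=60):
    """Return True if CPU stayed below both thresholds over the last 30 days."""
    cloudwatch = boto3.client('cloudwatch')
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=datetime.now(timezone.utc) - timedelta(days=30),
        EndTime=datetime.now(timezone.utc),
        Period=3600,  # hourly datapoints
        Statistics=['Average', 'Maximum']
    )
    datapoints = response['Datapoints']
    if not datapoints:
        return False  # no data; don't flag
    avg_cpu = sum(dp['Average'] for dp in datapoints) / len(datapoints)
    peak_cpu = max(dp['Maximum'] for dp in datapoints)
    return avg_cpu < avg_threshold and peak_cpu < max_threshold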
Zombie Resources: The Hidden Cost Drain
Zombie resources are cloud resources that continue running and generating charges long after they're needed. These often result from development workflows, testing, or failed cleanup processes.
Common Types of Zombie Resources
Forgotten Development Environments: Temporary environments created for testing that never get cleaned up.
Orphaned Load Balancers: Load balancers that continue running after their target instances are terminated.
Unused Storage Volumes: EBS volumes, S3 buckets, or database snapshots that are no longer attached to active resources.
Idle Databases: Database instances that were created for testing and left running indefinitely.
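Some of these are easy to find programmatically. For example, here is a short boto3 sketch that lists EBS volumes in the "available" state, which are provisioned (and billed) but attached to nothing:
import boto3

def find_unattached_volumes():
    """List EBS volumes that are not attached to any instance."""
    ec2 = boto3.client('ec2')
    paginator = ec2.get_paginator('describe_volumes')
    unattached = []
    for page in paginator.paginate(
            Filters=[{'Name': 'status', 'Values': ['available']}]):
        for volume in page['Volumes']:
            unattached.append({
                'VolumeId': volume['VolumeId'],
                'SizeGiB': volume['Size'],
                'CreateTime': volume['CreateTime'],
            })
    return unattached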
Here's an example of how zombie resources accumulate in typical development workflows:
# Developer creates a feature branch environment
git checkout -b feature/new-payment-flow
terraform workspace new feature-payment-flow
terraform apply # Spins up full infrastructure

# Feature gets merged and deployed
git checkout main
git branch -d feature/new-payment-flow

# The cleanup step that never happens:
# terraform workspace select feature-payment-flow && terraform destroy
# Without it, the Terraform workspace and infrastructure remain running
# Monthly cost: $150-500 depending on services
The Zombie Detection Problem
Zombies are difficult to detect because:
- They often have minimal usage, avoiding basic monitoring alerts
- Resource tags may not clearly indicate their purpose or ownership
- Development teams may fear deleting resources they don't fully understand
Implementing Zombie Detection
The key to zombie detection is combining resource metadata with usage patterns. You want to identify resources that are running but not actually being used productively.
A simple approach focuses on two main indicators:
- Age: Resources that have been running for extended periods
- Utilization: Very low CPU, network, or disk activity
Here's a practical script that identifies potential zombie EC2 instances by checking these criteria:
import boto3
from datetime import datetime, timedelta, timezone

def find_zombie_instances():
    """Find EC2 instances that might be zombies based on age and low utilization."""
    ec2 = boto3.client('ec2')
    cloudwatch = boto3.client('cloudwatch')

    # Get all running instances
    instances = ec2.describe_instances(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )

    zombies = []
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']
            launch_time = instance['LaunchTime']  # timezone-aware UTC

            # Check if the instance is older than 30 days
            days_running = (datetime.now(timezone.utc) - launch_time).days
            if days_running > 30:
                # Check average CPU over the last week
                avg_cpu = get_average_cpu_utilization(cloudwatch, instance_id)

                # Flag as a potential zombie on very low CPU usage
                if avg_cpu is not None and avg_cpu < 5:
                    zombies.append({
                        'InstanceId': instance_id,
                        'DaysRunning': days_running,
                        'AvgCPU': avg_cpu
                    })
    return zombies

def get_average_cpu_utilization(cloudwatch, instance_id):
    """Get average CPU utilization for an instance over the last 7 days."""
    try:
        response = cloudwatch.get_metric_statistics(
            Namespace='AWS/EC2',
            MetricName='CPUUtilization',
            Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
            StartTime=datetime.now(timezone.utc) - timedelta(days=7),
            EndTime=datetime.now(timezone.utc),
            Period=3600,  # 1-hour periods
            Statistics=['Average']
        )
        if response['Datapoints']:
            return sum(dp['Average'] for dp in response['Datapoints']) / len(response['Datapoints'])
        return None
    except Exception as e:
        print(f"Could not get metrics for {instance_id}: {e}")
        return None

if __name__ == '__main__':
    for zombie in find_zombie_instances():
        print(f"{zombie['InstanceId']}: {zombie['DaysRunning']} days running, "
              f"{zombie['AvgCPU']:.1f}% average CPU")
This script identifies instances that have been running for more than 30 days with less than 5% average CPU utilization. You can customize these thresholds based on your environment.
To use this effectively:
- Run it weekly as part of your cost optimization routine
- Review flagged instances before taking action; some may have legitimate low usage
- Consider adding checks for network or disk utilization for more accuracy, as shown in the sketch after this list
- Always verify with the resource owner before terminating anything
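Here is a sketch of the network check suggested above. It reuses the imports and cloudwatch client from the zombie script; the "idle" threshold in the usage comment is an assumption, not an AWS guideline.
def get_average_network_out(cloudwatch, instance_id):
    """Average NetworkOut bytes per hour over the last 7 days, or None."""
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='NetworkOut',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=datetime.now(timezone.utc) - timedelta(days=7),
        EndTime=datetime.now(timezone.utc),
        Period=3600,
        Statistics=['Sum']  # bytes sent per hour
    )
    datapoints = response['Datapoints']
    if not datapoints:
        return None
    return sum(dp['Sum'] for dp in datapoints) / len(datapoints)

# Combine with the CPU check, e.g. flag only when both are low:
# net_out = get_average_network_out(cloudwatch, instance_id)
# if avg_cpu < 5 and (net_out is None or net_out < 1_000_000):  # < ~1 MB/hour
#     ...treat as a potential zombie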
Lack of Cost Attribution: The Shared Account Problem
Many organizations use shared cloud accounts across multiple teams, projects, or environments. Without proper cost attribution, it becomes impossible to understand which teams or projects are driving costs.
Why Cost Attribution Fails
Inconsistent Tagging: Teams use different tagging conventions or forget to tag resources entirely.
Shared Resources: Infrastructure components like networking or monitoring that serve multiple applications.
Resource Sprawl: As teams scale, keeping track of resource ownership becomes increasingly difficult.
The Business Impact
Without cost attribution:
- Teams can't make informed decisions about feature costs
- Finance can't allocate cloud spend to appropriate budgets
- Cost optimization efforts lack clear priorities
- Accountability for cloud spend becomes diffused
Implementing Effective Cost Attribution
A robust tagging strategy is essential for cost attribution:
# Terraform locals for consistent tagging
locals {
  common_tags = {
    Project     = var.project_name
    Environment = var.environment
    Team        = var.team_name
    Owner       = var.owner_email
    CostCenter  = var.cost_center
    ManagedBy   = "terraform"
    # Caution: timestamp() changes on every run, which causes perpetual
    # plan diffs; prefer passing a static creation date in via a variable
    CreatedDate = formatdate("YYYY-MM-DD", timestamp())
  }
}
# Apply tags consistently across all resources
resource "aws_instance" "web" {
ami = var.ami_id
instance_type = var.instance_type
tags = merge(local.common_tags, {
Name = "${var.project_name}-web-${var.environment}"
Component = "web-server"
Billable = "true"
})
}
resource "aws_s3_bucket" "assets" {
bucket = "${var.project_name}-assets-${var.environment}"
tags = merge(local.common_tags, {
Component = "static-assets"
Backup = "daily"
})
}
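Once these tags are in place (and activated as cost allocation tags in the Billing console), you can break spend down by tag. Here is a minimal sketch using boto3's Cost Explorer client; the dates are illustrative:
import boto3

ce = boto3.client('ce')  # Cost Explorer
response = ce.get_cost_and_usage(
    TimePeriod={'Start': '2024-01-01', 'End': '2024-02-01'},
    Granularity='MONTHLY',
    Metrics=['UnblendedCost'],
    GroupBy=[{'Type': 'TAG', 'Key': 'Team'}]
)
for group in response['ResultsByTime'][0]['Groups']:
    team = group['Keys'][0]  # e.g. "Team$platform"; "Team$" means untagged
    amount = float(group['Metrics']['UnblendedCost']['Amount'])
    print(f"{team}: ${amount:.2f}")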
Automated Tag Enforcement
Use cloud provider policies to enforce tagging requirements:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireProjectTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "Null": {"aws:RequestTag/Project": "true"}
      }
    },
    {
      "Sid": "RequireEnvironmentTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "Null": {"aws:RequestTag/Environment": "true"}
      }
    },
    {
      "Sid": "RequireTeamTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "Null": {"aws:RequestTag/Team": "true"}
      }
    }
  ]
}
This IAM policy denies ec2:RunInstances whenever the launch request is missing a required tag. Each tag gets its own statement because condition keys within a single statement are ANDed together; a single Null block listing all three keys would only deny launches missing all of them. Note that aws:RequestTag checks tags supplied in the launch request itself, so instances must be tagged at creation time.
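Policies only catch new launches. For resources that already exist, you can audit tag coverage with the Resource Groups Tagging API. Here is a sketch (REQUIRED_TAGS is an assumption matching the policy above; note this API returns resources that are or were previously tagged, so some never-tagged resources may not appear):
import boto3

REQUIRED_TAGS = {'Project', 'Environment', 'Team'}

def find_resources_missing_tags():
    """List resource ARNs that lack any of the required tag keys."""
    client = boto3.client('resourcegroupstaggingapi')
    paginator = client.get_paginator('get_resources')
    missing = []
    for page in paginator.paginate():
        for resource in page['ResourceTagMappingList']:
            tag_keys = {tag['Key'] for tag in resource.get('Tags', [])}
            absent = REQUIRED_TAGS - tag_keys
            if absent:
                missing.append({
                    'ResourceARN': resource['ResourceARN'],
                    'MissingTags': sorted(absent),
                })
    return missing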
Development and Production Parity Costs
Many teams mirror production environments for development and staging, leading to unnecessary costs for non-production workloads that don't require production-level resources.
The Mirroring Trap
Database Sizing: Running production-sized databases in development environments
Multi-AZ Deployments: Using high-availability configurations in single-developer environments
Storage Replication: Maintaining production backup and replication settings in test environments
Smart Environment Scaling
Implement environment-specific configurations:
# Variable-driven resource sizing based on environment
variable "environment_configs" {
  type = map(object({
    instance_type    = string
    min_capacity     = number
    max_capacity     = number
    multi_az         = bool
    backup_retention = number
    storage_size     = number
    max_storage_size = number
  }))
  default = {
    production = {
      instance_type    = "m5.xlarge"
      min_capacity     = 2
      max_capacity     = 10
      multi_az         = true
      backup_retention = 30
      storage_size     = 500  # GiB; storage values here are illustrative
      max_storage_size = 1000
    }
    staging = {
      instance_type    = "m5.large"
      min_capacity     = 1
      max_capacity     = 3
      multi_az         = false
      backup_retention = 7
      storage_size     = 100
      max_storage_size = 200
    }
    development = {
      instance_type    = "t3.medium"
      min_capacity     = 1
      max_capacity     = 2
      multi_az         = false
      backup_retention = 1
      storage_size     = 20
      max_storage_size = 50
    }
  }
}
# Use environment-specific configuration
locals {
config = var.environment_configs[var.environment]
}
resource "aws_db_instance" "main" {
identifier = "${var.project}-${var.environment}"
engine = "postgres"
engine_version = "13.7"
instance_class = "db.${local.config.instance_type}"
allocated_storage = local.config.storage_size
max_allocated_storage = local.config.max_storage_size
multi_az = local.config.multi_az
backup_retention_period = local.config.backup_retention
tags = local.common_tags
}
Scheduled Resource Management
Implement automated scheduling for non-production environments:
#!/bin/bash
# Stop/start development instances outside business hours.
# Intended to run hourly from cron. Compare hours as strings because
# date +%H is zero-padded and values like "08" break bash integer tests.
day=$(date +%u)   # day of week, 1 = Monday
hour=$(date +%H)  # 00-23

# Stop development environment instances at 6 PM weekdays
if [[ $day -le 5 && $hour == "18" ]]; then
  instance_ids=$(aws ec2 describe-instances \
    --filters "Name=tag:Environment,Values=development" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text)
  [[ -n $instance_ids ]] && aws ec2 stop-instances --instance-ids $instance_ids
fi

# Start development environment instances at 8 AM weekdays
if [[ $day -le 5 && $hour == "08" ]]; then
  instance_ids=$(aws ec2 describe-instances \
    --filters "Name=tag:Environment,Values=development" \
              "Name=instance-state-name,Values=stopped" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text)
  [[ -n $instance_ids ]] && aws ec2 start-instances --instance-ids $instance_ids
fi
Running development resources only during business hours (roughly 10 hours a day on weekdays, or about 50 of the 168 hours in a week) can reduce their compute costs by 60-70%.
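If you'd rather not maintain a host to run this script from cron, the same logic works as a small Lambda function triggered by two EventBridge schedules, for example cron(0 18 ? * MON-FRI *) to stop and cron(0 8 ? * MON-FRI *) to start (EventBridge cron runs in UTC). Here is a sketch, where the event's action field is an assumption you would set in each rule:
import boto3

ec2 = boto3.client('ec2')

def handler(event, context):
    """Stop or start tagged development instances based on event['action']."""
    action = event.get('action')  # 'stop' or 'start', set in each EventBridge rule
    state = 'running' if action == 'stop' else 'stopped'
    response = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:Environment', 'Values': ['development']},
            {'Name': 'instance-state-name', 'Values': [state]},
        ]
    )
    instance_ids = [
        instance['InstanceId']
        for reservation in response['Reservations']
        for instance in reservation['Instances']
    ]
    if not instance_ids:
        return
    if action == 'stop':
        ec2.stop_instances(InstanceIds=instance_ids)
    else:
        ec2.start_instances(InstanceIds=instance_ids)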