Real-World Kubernetes Deployments
Kubernetes excels at managing containerized applications at scale, but transitioning from development to production involves numerous challenges not covered in most tutorials. After working with dozens of production Kubernetes deployments, we've compiled key lessons that will help you avoid common pitfalls and build more resilient systems.
Resource Management Is Non-Negotiable
In production, proper resource configuration isn't optional; it's essential.
# A properly configured deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api
          image: company/api-service:v1.2.3
          resources:
            requests:
              memory: '256Mi'
              cpu: '100m'
            limits:
              memory: '512Mi'
              cpu: '500m'
Always set resource requests and limits for every container. Without them, you risk:
- Resource starvation when pods compete for limited resources
- Nodes becoming overcommitted, leading to instability
- Unpredictable performance under load
We've seen production outages caused by a single pod without resource limits consuming all available CPU on a node, affecting dozens of other services.
For memory-intensive applications like Java services, be especially careful with limits. Set them based on actual observed usage patterns rather than guesswork. Too low, and your pods will face OOMKilled errors; too high, and you waste cluster resources.
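One way to ground those numbers in observed usage is the Vertical Pod Autoscaler running in recommendation-only mode. A minimal sketch, assuming the VPA components are installed in the cluster and targeting the deployment above:
# VPA in recommendation-only mode surfaces observed usage without changing pods
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  updatePolicy:
    updateMode: 'Off' # only recommend; never evict or resize pods
Its recommendations give you a starting point for requests and limits that you can then validate under real load.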
Implement Probes for Reliability
Kubernetes health probes determine when containers are ready and healthy. Skipping them has real consequences:
# Proper implementation of probes
spec:
  containers:
    - name: payment-processor
      image: company/payment-processor:v2.1.0
      ports:
        - containerPort: 8080
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
Configure probes with appropriate timeouts based on your application's behavior. We've seen teams set extremely short timeouts for Java applications, causing unnecessary restarts during garbage collection pauses.
Differentiate between readiness and liveness:
- Readiness probes control traffic routing - use them to prevent traffic to pods that aren't ready
- Liveness probes trigger pod restarts - use them only for detecting states where a restart is the remedy
One retail client kept experiencing cascading failures during peak traffic until we fixed their probe configuration. Their database connections took 8 seconds to establish, but the readiness probe allowed only a 1-second timeout, so new pods never passed readiness checks during busy periods.
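In cases like that, size the probe for how the application actually behaves under load. A minimal sketch with illustrative values:
# Readiness probe sized for slow dependency setup (values are illustrative)
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  timeoutSeconds: 10 # allow slow responses instead of immediately marking the pod unready
  periodSeconds: 10
  failureThreshold: 3 # tolerate transient slowness before removing the pod from endpoints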
Network Policies Are Not Optional
By default, Kubernetes allows any pod to communicate with any other pod, a serious security concern in production.
# Restrictive network policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-network-policy
  namespace: api
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: database
      ports:
        - protocol: TCP
          port: 5432
Start with a zero-trust approach and explicitly define which communications are permitted. This limits the blast radius of security breaches.
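A common way to apply that zero-trust baseline is a namespace-wide default-deny policy, on top of which you add explicit allow rules like the one above. A minimal sketch (the namespace name is illustrative):
# Default-deny for all pods in the namespace; explicit allow rules are layered on top
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: api
spec:
  podSelector: {} # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress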
A financial services client implemented network policy monitoring and discovered unauthorized data access between namespaces, connections that had been occurring silently for months.
Storage Requires Special Attention
Storage in Kubernetes introduces complexity, especially for stateful applications.
# Production-grade PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  # storageClassName replaces the long-deprecated volume.beta.kubernetes.io/storage-class annotation
  storageClassName: 'managed-premium'
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
For production workloads:
- Use storage classes appropriate for your applications' performance needs
- Understand the backup mechanisms for your persistent volumes
- Test storage failover scenarios before relying on them in production
- Be aware of storage IOPS limits in cloud environments
One e-commerce company lost 4 hours of customer orders when they moved their database to a Kubernetes-managed volume without understanding the implications of node failure on their chosen storage class.
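For volume backups specifically, the CSI snapshot API gives you point-in-time copies you can restore from. A minimal sketch, assuming your storage driver supports snapshots and a VolumeSnapshotClass exists (names are illustrative):
# Point-in-time snapshot of the PVC above (requires a CSI driver with snapshot support)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass # illustrative class name
  source:
    persistentVolumeClaimName: postgres-data
Snapshots are not a substitute for tested restores; schedule them, but also rehearse recovering from them.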
Secrets Management Requires a Strategy
Kubernetes Secrets are base64-encoded, not encrypted by default. Without additional measures, they're visible to anyone with API access.
# Using external secrets operator with AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
spec:
  refreshInterval: '15m'
  secretStoreRef:
    name: aws-secretsmanager
    kind: ClusterSecretStore
  target:
    name: database-credentials
  data:
    - secretKey: username
      remoteRef:
        key: production/database
        property: username
    - secretKey: password
      remoteRef:
        key: production/database
        property: password
Consider these approaches for production:
- Utilize a secret management tool like HashiCorp Vault or cloud provider secret stores
- Use solutions like Sealed Secrets or External Secrets Operator to securely store secrets in Git
- Implement RBAC to restrict which pods can access which secrets
Never store unencrypted secrets in your Git repositories, even private ones.
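For the RBAC point above, a minimal sketch of a Role that grants read access to a single named Secret (names are illustrative):
# Role restricting read access to one named secret
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-database-credentials
  namespace: api
rules:
  - apiGroups: ['']
    resources: ['secrets']
    resourceNames: ['database-credentials']
    verbs: ['get']
Bind it to the workload's ServiceAccount with a RoleBinding so only that workload can read the credential, rather than every pod in the namespace.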
Implement Proper Pod Disruption Budgets
When Kubernetes needs to drain nodes for maintenance, Pod Disruption Budgets (PDBs) prevent too many replicas of a service from going down simultaneously.
# Pod Disruption Budget for critical service
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api-pdb
spec:
  minAvailable: 2 # or use maxUnavailable
  selector:
    matchLabels:
      app: payment-api
Without PDBs, routine cluster upgrades can cause service outages. We've seen entire production services become unavailable during automated node upgrades simply because a team forgot to implement PDBs.
Define them for all critical services, especially stateful ones like databases or messaging systems.
Node Affinity for Performance-Sensitive Workloads
Not all workloads have the same resource needs. Use node affinity to place pods on appropriate hardware.
# Using node affinity for specialized workloads
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-type
                operator: In
                values:
                  - compute-optimized
For a media processing client, we created dedicated node pools with GPUs and used node affinity to ensure video processing jobs landed on these specialized nodes, while keeping their cost-sensitive workloads on standard instances.
Node anti-affinity is equally important for high-availability:
# Ensuring pods spread across nodes
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - database
          topologyKey: 'kubernetes.io/hostname'
This prevents multiple replicas of critical components from being scheduled on the same node, improving fault tolerance.
Monitoring Beyond Basic Metrics
While CPU and memory usage are important, production monitoring needs to go deeper. Implement:
- Golden Signals monitoring: Latency, traffic, errors, and saturation for all services
- Custom application metrics: Business-specific metrics tied to user experience
- Distributed tracing: For understanding service-to-service communication
# Prometheus ServiceMonitor example
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: api-service
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - production
A retail client was seeing timeouts during peak shopping hours. Standard monitoring showed normal CPU/memory usage, but custom metrics revealed database connection pool exhaustion, something not visible with default monitoring.
Auto-scaling Requires Careful Configuration
Horizontal Pod Autoscaling (HPA) is powerful but requires proper configuration based on real-world load patterns.
# Custom metric-based autoscaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-processor
  minReplicas: 5
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: kafka_consumer_lag_sum
        target:
          type: AverageValue
          averageValue: 100
For reliable autoscaling:
- Choose metrics that truly represent load (sometimes it's not CPU)
- Set appropriate cooldown periods to prevent thrashing
- Test scaling behavior with realistic load patterns before production
A fintech company implemented CPU-based autoscaling but found their services scaling too late during traffic spikes. Switching to request rate-based autoscaling improved responsiveness dramatically.
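Cooldown periods are configured through the HPA behavior field in autoscaling/v2. A minimal sketch with illustrative values, added under the spec of the HPA above:
# Scaling behavior to dampen thrashing (values are illustrative)
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0 # react to spikes immediately
    policies:
      - type: Percent
        value: 100
        periodSeconds: 60 # at most double the replica count per minute
  scaleDown:
    stabilizationWindowSeconds: 300 # require 5 minutes of sustained low load before scaling in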
Upgrade Strategies Matter
Kubernetes upgrades are inevitable, and how you handle them impacts production availability.
# Rolling update with zero downtime configuration
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
For critical services, consider these practices:
- Use rolling updates with low maxUnavailable values
- Implement readiness gates tied to application health
- For stateful services, plan for manual verification steps
We helped a healthcare client implement blue/green deployments for their patient-facing applications to ensure zero-downtime updates and easier rollbacks when needed.
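The mechanics of blue/green vary by team, but a common minimal pattern is two Deployments distinguished by a version label, with the Service selector repointed once the new color passes verification. A sketch with illustrative names:
# Service routing to the active color; switching 'version' to green cuts traffic over
apiVersion: v1
kind: Service
metadata:
  name: patient-portal
spec:
  selector:
    app: patient-portal
    version: blue # change to green after the new deployment is verified
  ports:
    - port: 80
      targetPort: 8080
Because the old Deployment keeps running until you delete it, rolling back is just pointing the selector back at blue.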
Resource Quotas Prevent Surprises
Namespace resource quotas protect your cluster from unexpected resource consumption.
# Namespace resource quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: team-a
spec:
  hard:
    requests.cpu: '10'
    requests.memory: 20Gi
    limits.cpu: '20'
    limits.memory: 40Gi
    pods: '30'
Without quotas, a single team or application can accidentally consume all available cluster resources. This creates a "noisy neighbor" problem that affects other applications.
For large organizations, implement quotas as a standard practice when creating namespaces. Combined with LimitRanges, they enforce resource discipline across teams.
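A LimitRange supplies per-container defaults for workloads that omit requests and limits, so the quota above can't be bypassed by unconfigured pods. A minimal sketch with illustrative values:
# Default requests and limits applied to containers that omit them
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 500m
        memory: 512Mi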
Stateful Applications Require Extra Care
Running stateful services like databases in Kubernetes remains challenging.
# StatefulSet with volume claim templates
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: 'postgres'
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:14
          env:
            # the postgres image refuses to start without credentials;
            # the secret name here is illustrative
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ['ReadWriteOnce']
        storageClassName: 'premium-ssd'
        resources:
          requests:
            storage: 100Gi
For production stateful services:
- Use operators designed for that specific technology when available (e.g., the PostgreSQL Operator)
- Implement proper backup and restore procedures
- Test failure scenarios thoroughly before production
While running databases in Kubernetes is increasingly common, many organizations still find it beneficial to use managed database services from cloud providers for critical production workloads.
Security Is a Continuous Process
Kubernetes security isn't a one-time setup; it requires ongoing attention.
- Scan container images: Implement vulnerability scanning in your CI/CD pipeline
- Update base images: Regularly rebuild containers with the latest base images
- Implement pod security standards: Use Pod Security Admission or OPA/Gatekeeper
# Pod Security Context
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
Always run containers as non-root users and apply the principle of least privilege. Containers should only have the permissions they actually need to function.
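Pod Security Admission, mentioned above, is enforced by labeling namespaces with the desired standard. A minimal sketch (the namespace name is illustrative):
# Enforce the 'restricted' Pod Security Standard for everything in this namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted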
Conclusion: Production Readiness Is a Journey
Moving to production Kubernetes is an ongoing process, not a destination. Even mature deployments require continuous refinement as applications evolve and usage patterns change.
Start with these production-focused practices:
- Implement comprehensive resource management
- Configure proper health probes
- Secure your cluster with network policies
- Plan your storage strategy carefully
- Implement proper monitoring and alerting
- Test failure modes before they happen in production
The most successful Kubernetes deployments we've seen share a common trait: they treat infrastructure as a product that evolves continuously rather than a project with a fixed endpoint.
By applying these production lessons, you'll avoid many of the painful experiences others have encountered on their Kubernetes journey.