The 5-Minute Kubernetes Cluster Health Check
TLDR
You can check your Kubernetes cluster's health in under 5 minutes using five key commands: checking node status, monitoring resource usage, reviewing pod health across namespaces, investigating problem pods, and examining cluster events. This quick routine helps catch issues before they escalate into critical problems.
Kubernetes is great until it's not. One bad node, a pod stuck in CrashLoopBackOff, or a resource spike can ruin your day. The good news? You don't need to spend an hour digging through dashboards to spot trouble early. With a few quick commands, you can get a solid read on your cluster's health in under 5 minutes.
Here's how to do it effectively.
Make Sure Your Nodes Are Happy
Start by checking the overall status of your cluster nodes. This gives you the foundation-level health of your infrastructure.
kubectl get nodes -o wide
This command displays all nodes in your cluster along with their detailed information. You'll see each node's status, roles, age, version, internal and external IPs, OS image, kernel version, and container runtime.
What you want to see:
- STATUS is Ready for all nodes
- No mystery nodes suddenly showing up in your cluster
- Roles, IPs, and ages that make sense for your environment
If you spot NotReady, that's your cue to dig deeper. A node in this state might be experiencing network issues, resource exhaustion, or kubelet problems.
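If you want to automate this check, a quick awk filter over the node list does the job. Here's a minimal sketch run against sample output (the node names are illustrative); against a live cluster you'd pipe kubectl get nodes --no-headers into the same awk program:

```shell
# Hypothetical output of `kubectl get nodes` (sample data for illustration):
nodes='NAME     STATUS     ROLES           AGE   VERSION
cp-1     Ready      control-plane   90d   v1.29.2
work-1   Ready      <none>          90d   v1.29.2
work-2   NotReady   <none>          90d   v1.29.2'

# Print only the nodes whose STATUS column is not Ready.
# Live cluster equivalent:
#   kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1 }'
bad=$(echo "$nodes" | awk 'NR > 1 && $2 != "Ready" { print $1 }')
echo "$bad"
```

For any node this flags, kubectl describe node <node-name> shows its Conditions (MemoryPressure, DiskPressure, PIDPressure, Ready) and usually explains why it dropped out.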
Check Resource Usage at a Glance
Next, get a quick overview of resource consumption across your nodes to identify potential bottlenecks.
kubectl top nodes
This command shows CPU and memory usage for each node in your cluster. It provides both absolute values and percentages, making it easy to spot resource pressure.
Keep an eye out for:
- CPU or memory regularly above 80% on any node
- One node doing all the heavy lifting while others are barely working
- Sudden spikes that don't match your expected workload patterns
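That 80% threshold is easy to script. Here's a small sketch over sample kubectl top nodes output (node names and the 80 cutoff are illustrative, not canonical); on a real cluster you'd pipe the live command into the same filter:

```shell
# Hypothetical output of `kubectl top nodes` (sample data for illustration):
top='NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-a   1640m        82%    5210Mi          65%
node-b   480m         24%    3890Mi          48%'

# Flag nodes whose CPU% or MEMORY% is above 80.
# Live cluster equivalent:
#   kubectl top nodes | awk 'NR > 1 { ... }'
hot=$(echo "$top" | awk 'NR > 1 {
  cpu = $3; mem = $5
  sub(/%/, "", cpu); sub(/%/, "", mem)   # strip the % signs before comparing
  if (cpu + 0 > 80 || mem + 0 > 80) print $1
}')
echo "$hot"
```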
No metrics-server running? Install it with this command:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
The metrics-server is essential for resource monitoring and is required for horizontal pod autoscaling to work properly.
Look at All Pods Across All Namespaces
Get a bird's-eye view of all pods running in your cluster to quickly identify any that are misbehaving.
kubectl get pods --all-namespaces
This command lists every pod across all namespaces, showing their current status, restart count, and age. It's like taking the pulse of your entire application ecosystem.
Healthy pods should be Running or Completed. If you see states like CrashLoopBackOff, ImagePullBackOff, Pending, or Error, note the namespace and pod name for further investigation.
Also watch the RESTARTS column closely. If a pod has restarted a dozen times in the last hour, something's definitely off. Frequent restarts often indicate:
- Application crashes due to bugs or configuration issues
- Failing health checks (readiness or liveness probes)
- Resource limits being exceeded
- Dependencies being unavailable
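You can narrow the list to just the suspects with a filter like the one below, shown against sample output (pod names and the restart threshold of 5 are illustrative). On a live cluster, kubectl also supports a server-side filter: kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded.

```shell
# Hypothetical output of `kubectl get pods --all-namespaces` (sample data):
pods='NAMESPACE   NAME    READY   STATUS             RESTARTS   AGE
app         web-1   1/1     Running            0          3d
app         web-2   0/1     CrashLoopBackOff   14         3d
batch       job-1   0/1     Completed          0          1d'

# Keep pods that are neither Running nor Completed, or that restart a lot.
# Live cluster equivalent:
#   kubectl get pods -A --no-headers | awk '...'
suspect=$(echo "$pods" | awk 'NR > 1 {
  if (($4 != "Running" && $4 != "Completed") || $5 + 0 > 5) print $1 "/" $2
}')
echo "$suspect"
```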
Zoom In on Problem Pods
When you spot problematic pods, dig deeper to understand what's causing the issues.
kubectl describe pod <pod-name> -n <namespace>
Replace <pod-name> and <namespace> with the actual values from your problem pods. This command provides detailed information about the pod's configuration, current state, and recent events.
Check for these common issues:
- Events at the bottom (often the smoking gun that reveals the root cause)
- Failing readiness or liveness probes that prevent the pod from receiving traffic
- Image pull errors indicating registry access problems or incorrect image names
- Resource limit issues where the pod exceeds its memory or CPU constraints
The events section is particularly valuable because it shows a chronological history of what happened to the pod, including scheduling decisions, volume mounts, and error conditions.
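Since the events live at the bottom of a long describe dump, a sed range can cut straight to them. A minimal sketch over a sample describe tail (the pod details are illustrative):

```shell
# Hypothetical tail of `kubectl describe pod` output (sample data):
desc='Name:         web-2
Namespace:    app
Status:       Running
Events:
  Type     Reason    Age   Message
  Warning  BackOff   2m    Back-off restarting failed container'

# Print only the Events section, often the fastest route to the root cause.
# Live cluster equivalent:
#   kubectl describe pod <pod-name> -n <namespace> | sed -n "/^Events:/,\$p"
events=$(echo "$desc" | sed -n '/^Events:/,$p')
echo "$events"
```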
Check the Cluster's Event Log
Get insight into what's been happening across your entire cluster by examining recent events.
kubectl get events --sort-by=.metadata.creationTimestamp
This command shows cluster-wide events sorted by when they occurred, giving you a timeline of recent activity. Events provide context about system-level operations and can reveal patterns or issues that affect multiple components.
Events will tell you what's been happening behind the scenes:
- Failed volume mounts that prevent pods from starting
- DNS resolution errors affecting service communication
- Scheduling issues when pods can't be placed on nodes
- Node pressure warnings indicating resource constraints
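Most of the signal sits in the Warning rows, so it helps to filter those first. Here's a sketch over sample output (the events are illustrative); on a live cluster kubectl can do the filtering server-side with kubectl get events -A --field-selector type=Warning:

```shell
# Hypothetical output of `kubectl get events` (sample data for illustration):
ev='LAST SEEN   TYPE      REASON             OBJECT      MESSAGE
5m          Normal    Scheduled          pod/web-1   Successfully assigned
2m          Warning   FailedMount        pod/web-2   MountVolume.SetUp failed
1m          Warning   FailedScheduling   pod/web-3   0/3 nodes are available'

# Keep only Warning events -- the ones worth reading first.
warn=$(echo "$ev" | awk 'NR > 1 && $2 == "Warning"')
echo "$warn"
```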
Try k9s for a Better View
If you want something more interactive than command-line tools, give k9s a try. It's a terminal-based UI for Kubernetes that provides real-time cluster information in an intuitive interface.
k9s lets you browse resources, view logs, and drill into problems without typing long commands. You can navigate between different resource types using simple keystrokes, filter resources, and even perform actions like scaling deployments or deleting pods.
Once you try k9s, it's hard to go back to plain kubectl for exploratory tasks. It's particularly useful when you need to quickly jump between different namespaces or resource types during troubleshooting.
Five minutes a day is all it takes to stay ahead of most cluster problems. Make this health check part of your daily routine and you'll catch issues before they blow up and before your pager goes off at 3 a.m. Regular monitoring helps you understand your cluster's normal behavior, making it easier to spot anomalies when they occur.
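If you'd rather not type the commands each morning, the routine above bundles naturally into a single function you might drop into your shell profile. This is a sketch, not a canonical tool: the function name is made up, and the describe-pod step is left out because it needs a specific pod name.

```shell
# A hypothetical one-shot wrapper for the daily health check routine.
# Runs the four cluster-wide checks from this article in order.
health_check() {
  kubectl get nodes -o wide
  kubectl top nodes
  kubectl get pods --all-namespaces
  kubectl get events --sort-by=.metadata.creationTimestamp | tail -n 20
}
```

Then a single health_check gives you the nodes, resource usage, pod states, and the twenty most recent events in one scroll.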