Understanding Kubernetes Operators: A Deep Dive with a Practical Example
If you've worked with Kubernetes for any length of time, you've probably heard the term "operator" thrown around. Maybe you've installed one—like the Prometheus Operator or cert-manager—without fully understanding what makes it different from a regular Deployment. This post aims to change that.
We'll start by understanding why operators exist and the fundamental patterns they implement. Then we'll build one from scratch, explaining each concept as we go. By the end, you'll not only have a working operator but a mental model for how all Kubernetes controllers work under the hood.
What Is a Kubernetes Operator?
Before diving into operators, let's step back and understand how Kubernetes itself works.
The Declarative Model: Kubernetes' Core Philosophy
Kubernetes is built on a declarative model. You don't tell Kubernetes "start 3 pods"—you tell it "I want 3 pods running." The difference is subtle but profound:
- Imperative: "Do this action" (create, delete, scale)
- Declarative: "Make it look like this" (desired state)
When you apply a Deployment manifest, you're declaring your desired state. Kubernetes then figures out what actions are needed to make reality match that declaration. If a pod crashes, Kubernetes doesn't need you to tell it to restart—it sees the discrepancy and acts.
This is powerful because it makes your infrastructure self-healing. You describe what you want, and Kubernetes continuously works to maintain that state.
The Control Loop Pattern: How Kubernetes Makes Decisions
This "observe and act" behavior is implemented through control loops (also called reconciliation loops). Every controller in Kubernetes follows the same pattern:
┌──────────────────────────────────────────────┐
│                 Control Loop                 │
│                                              │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐   │
│  │ OBSERVE │───▶│  DIFF   │───▶│   ACT   │   │
│  │         │    │         │    │         │   │
│  │ Current │    │ Current │    │ Create/ │   │
│  │  State  │    │   vs    │    │ Update/ │   │
│  │         │    │ Desired │    │ Delete  │   │
│  └─────────┘    └─────────┘    └─────────┘   │
│       ▲                             │        │
│       └─────────────────────────────┘        │
│               (repeat forever)               │
└──────────────────────────────────────────────┘
- Observe: Watch for changes to resources (via the Kubernetes API)
- Diff: Compare current state with desired state
- Act: Make changes to close the gap
- Repeat: Keep watching for more changes
The Deployment controller, for example, watches Deployment resources. When you create one asking for 3 replicas, it observes there are 0 pods, calculates a diff of -3, and creates a ReplicaSet scaled to 3 (which in turn creates the pods).
Why is this pattern so important? Because it's convergent. No matter how the system gets into a bad state—whether from a crash, network partition, or manual tampering—the controller will keep trying to fix it. This is fundamentally different from scripts that run once and hope nothing changes.
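To make the pattern tangible, here's a minimal, self-contained sketch in Go—with hypothetical `observe`/`desired` stand-ins, not real Kubernetes API calls:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-ins for the three phases. A real controller would
// query the API server (observe), read the spec (desired), and issue
// create/delete calls (act).
func observe() int { return 0 } // current replica count
func desired() int { return 3 } // declared replica count

func main() {
	for {
		// The loop keeps no memory of past events; it only compares state.
		// That is what makes it convergent after crashes or tampering.
		if diff := desired() - observe(); diff != 0 {
			fmt.Printf("acting on gap of %d replicas\n", diff)
		}
		time.Sleep(5 * time.Second) // real controllers use watches, not polling
	}
}
```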
So What Makes an Operator Special?
An operator is simply a custom controller that manages custom resources. That's it.
The built-in controllers (Deployment, Service, etc.) manage built-in resources. When you need to manage something Kubernetes doesn't understand natively—like a PostgreSQL cluster, a machine learning pipeline, or a complex application—you create:
- A Custom Resource Definition (CRD): Teaches Kubernetes about your new resource type
- A Controller: Watches for those resources and takes action
Together, these form an operator. The term "operator" comes from the idea that you're encoding the knowledge of a human operator (the person who knows how to run your application) into software.
Why Not Just Use Helm or Scripts?
You might wonder: "Can't I just use Helm charts or shell scripts?"
The key difference is continuous reconciliation:
| Approach | When It Runs | What Happens If State Drifts |
|---|---|---|
| Shell script | Once, when you run it | Nothing—drift accumulates |
| Helm install | Once, at install time | Nothing—you must re-run |
| Operator | Continuously | Automatically corrects drift |
An operator is always watching and always correcting. If someone manually deletes a resource your application needs, the operator recreates it. If a config drifts, the operator fixes it. This is called level-triggered behavior (reacting to state) vs edge-triggered (reacting to events).
Think of it this way: A Helm chart is like a recipe. An operator is like a chef who keeps checking on the dish and adjusting as needed.
Real-World Operator Examples
To make this concrete, here's what some popular operators do:
- Prometheus Operator: You create a `Prometheus` CR specifying retention, replicas, and alerting rules. The operator creates the StatefulSet, ConfigMaps, and Services, and wires up service discovery—tasks that would otherwise require deep Prometheus expertise.
- cert-manager: You create a `Certificate` CR specifying the domain. The operator handles ACME challenges, creates secrets with the cert, and renews before expiration—no cron jobs needed.
- PostgreSQL Operator (Zalando): You create a `postgresql` CR. The operator provisions the primary and replicas, and handles failover, backups, and connection pooling—encoding years of DBA knowledge.
In each case, you declare what you want, and the operator handles how to achieve and maintain it.
Prerequisites
Before we build our operator, ensure you have these tools installed:
- Go 1.22+: The operator will be written in Go
- Docker: For building container images
- kubectl: For interacting with your cluster
- kind or minikube: For local Kubernetes testing
- Kubebuilder 3.15+: The scaffolding tool we'll use
Why Go?
While operators can be written in any language (Python, Java, Rust, etc.), Go is the dominant choice because:
- Kubernetes itself is written in Go
- The official client libraries (`client-go`, `controller-runtime`) are Go-native and battle-tested
- The tooling (Kubebuilder, Operator SDK) generates Go code with best practices baked in
- Go's concurrency model fits well with the watch/reconcile pattern
Why Kubebuilder?
Writing a controller from scratch requires significant boilerplate: setting up informers (to watch resources efficiently), work queues (to deduplicate reconciliation requests), leader election (so only one replica reconciles at a time), metrics, health checks, etc.
Kubebuilder generates all of this, letting you focus on your business logic. It's maintained by the Kubernetes SIG (Special Interest Group) and represents community best practices.
Install Kubebuilder:
curl -L -o kubebuilder "https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH)"
chmod +x kubebuilder
sudo mv kubebuilder /usr/local/bin/
kubebuilder version
Project Overview: Building a Website Operator
We'll build a "Website" operator—simple enough to understand fully, but complex enough to demonstrate real patterns.
The User Experience We're Creating
A developer creates a Website resource:
apiVersion: webapp.example.com/v1
kind: Website
metadata:
name: my-blog
spec:
replicas: 2
html: "<h1>Welcome to my blog</h1>"
What the Operator Does Behind the Scenes
- Creates a ConfigMap with the HTML content
- Creates a Deployment with nginx containers that mount the ConfigMap
- Creates a Service to expose the website
- Keeps everything in sync if anything changes or gets deleted
The developer doesn't need to understand Deployments, Services, or ConfigMaps. They just declare "I want a website" and the operator handles the rest.
Step 1: Initialize the Project
Let's scaffold our project:
mkdir website-operator && cd website-operator
kubebuilder init --domain example.com --repo github.com/yourorg/website-operator
Understanding the Flags
- `--domain`: Your organization's domain. This becomes part of your API group (e.g., `webapp.example.com`). Choose something unique to avoid conflicts with other operators.
- `--repo`: The Go module path. This must match your actual repo if you plan to push it. Go uses this for imports.
What Gets Generated
Kubebuilder creates a significant project structure. Let's understand what matters:
website-operator/
├── cmd/main.go # Entry point—sets up the manager
├── config/ # Kubernetes manifests for deployment
│ ├── default/ # Kustomize base for deploying
│ ├── manager/ # Deployment for the operator itself
│ └── rbac/ # Generated RBAC rules
├── internal/controller/ # Where your reconciliation logic lives
├── Dockerfile # Multi-stage build for the operator
└── Makefile # Common tasks (build, test, deploy)
The Manager: Your Operator's Brain
The cmd/main.go file sets up what Kubebuilder calls a "Manager." This is crucial to understand:
// Simplified version of what's in cmd/main.go
// (metricsserver = "sigs.k8s.io/controller-runtime/pkg/metrics/server")
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme:           scheme,
	Metrics:          metricsserver.Options{BindAddress: metricsAddr},
	LeaderElection:   enableLeaderElection,
	LeaderElectionID: "website-operator.example.com",
})
The Manager:
- Connects to the Kubernetes API using in-cluster config or your kubeconfig
- Runs all your controllers in a coordinated way
- Handles leader election so only one replica reconciles at a time (critical for consistency)
- Exposes Prometheus metrics at `/metrics`
- Manages graceful shutdown when receiving SIGTERM
You rarely need to modify this file—Kubebuilder sets it up correctly.
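For context, the controller you'll write in Step 2 gets registered with this Manager in `cmd/main.go`; the generated wiring looks roughly like this:

```go
// Generated wiring (approximate): the reconciler receives the manager's
// shared client and scheme, then registers its watches with the manager.
if err := (&controller.WebsiteReconciler{
	Client: mgr.GetClient(),
	Scheme: mgr.GetScheme(),
}).SetupWithManager(mgr); err != nil {
	setupLog.Error(err, "unable to create controller", "controller", "Website")
	os.Exit(1)
}
```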
Step 2: Create the API and Controller
Now we create our custom resource type and its controller:
kubebuilder create api --group webapp --version v1 --kind Website
Answer y to both prompts (create resource and controller).
Understanding the Naming Convention
Kubernetes API resources follow a strict naming convention:
- Group: Like a package name, groups related resources (e.g., `apps`, `networking.k8s.io`). Ours is `webapp`.
- Version: API version (`v1`, `v1beta1`, `v1alpha1`)—allows your API to evolve over time
- Kind: The resource type name (capitalized, singular)
The full API group becomes webapp.example.com (group + domain from init).
When users interact with our resource, they'll write:
apiVersion: webapp.example.com/v1 # group/version
kind: Website # kind
What Gets Generated
This command generates two critical files:
| File | Purpose |
|---|---|
| `api/v1/website_types.go` | Go structs defining your CRD schema |
| `internal/controller/website_controller.go` | Reconciliation logic |
Let's examine each.
Step 3: Define the Custom Resource
The generated api/v1/website_types.go has placeholder fields. Before writing code, let's think about API design.
Thinking About Your API
A CRD has two main sections:
- Spec: What the user wants (input)
- Status: What currently exists (output, read-only for users)
For our Website:
- Spec: replicas, image, HTML content
- Status: ready replicas, available URL, conditions
Good API design principle: The spec should be simple and declarative. Users shouldn't need to understand implementation details.
Edit api/v1/website_types.go:
package v1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// WebsiteSpec defines the desired state of Website
// This is what users configure
type WebsiteSpec struct {
// Replicas is the number of nginx pods to run
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=10
// +kubebuilder:default=1
Replicas int32 `json:"replicas,omitempty"`
// Image is the container image to use
// +kubebuilder:default="nginx:1.27-alpine"
Image string `json:"image,omitempty"`
// HTML is the content to serve
// +kubebuilder:validation:MinLength=1
HTML string `json:"html"`
}
// WebsiteStatus defines the observed state of Website
// This is what the operator reports back
type WebsiteStatus struct {
// ReadyReplicas is how many pods are actually ready
ReadyReplicas int32 `json:"readyReplicas,omitempty"`
// URL is where the website can be accessed
URL string `json:"url,omitempty"`
// Conditions represent the latest observations
Conditions []metav1.Condition `json:"conditions,omitempty"`
}
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=`.spec.replicas`
// +kubebuilder:printcolumn:name="Ready",type=integer,JSONPath=`.status.readyReplicas`
// +kubebuilder:printcolumn:name="URL",type=string,JSONPath=`.status.url`
// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
// Website is the Schema for the websites API
type Website struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec WebsiteSpec `json:"spec,omitempty"`
Status WebsiteStatus `json:"status,omitempty"`
}
// +kubebuilder:object:root=true
// WebsiteList contains a list of Website
type WebsiteList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []Website `json:"items"`
}
func init() {
SchemeBuilder.Register(&Website{}, &WebsiteList{})
}
Understanding the Marker Comments
Those // +kubebuilder: comments aren't just documentation—they're markers that Kubebuilder's code generator reads:
| Marker | What It Does |
|---|---|
| `+kubebuilder:validation:Minimum=1` | Adds OpenAPI validation to the CRD |
| `+kubebuilder:default=1` | Sets a default value if the user doesn't specify one |
| `+kubebuilder:subresource:status` | Enables the `/status` subresource (important for RBAC separation) |
| `+kubebuilder:printcolumn:...` | Adds columns to `kubectl get websites` output |
Why use markers instead of Go code? Because CRDs are defined in YAML and served by the Kubernetes API server. The markers generate that YAML from your Go types, keeping everything in sync.
After editing, regenerate the manifests:
make manifests
This updates config/crd/bases/webapp.example.com_websites.yaml with your schema.
Step 4: Implement the Reconciliation Logic
Now for the heart of the operator: the reconciliation function. This is where you implement the control loop.
Understanding the Reconcile Function
Open internal/controller/website_controller.go. The generated code looks like:
func (r *WebsiteReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
_ = log.FromContext(ctx)
// TODO: your logic here
return ctrl.Result{}, nil
}
This function gets called whenever:
- A Website resource is created, updated, or deleted
- A resource the Website "owns" changes (we'll set this up)
- A periodic resync happens (configurable, default 10 hours)
- You explicitly request a requeue
Important: The function receives a Request containing just the namespace/name of the resource. You must fetch the actual resource yourself. This is intentional—it prevents stale data issues.
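For reference, the `ctrl.Result`/error pair you return steers the work queue. A hypothetical helper (assuming the file's existing imports plus `time`) showing the three common shapes:

```go
// Sketch: the three common return shapes from Reconcile and their effects.
func (r *WebsiteReconciler) exampleResult(done bool, err error) (ctrl.Result, error) {
	switch {
	case err != nil:
		// Non-nil error: controller-runtime requeues the request with
		// exponential backoff and records the error in logs/metrics.
		return ctrl.Result{}, err
	case !done:
		// No error, but not finished (e.g. waiting on an external system):
		// ask to be called again after a fixed delay.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	default:
		// Success: sleep until the next watch event or periodic resync.
		return ctrl.Result{}, nil
	}
}
```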
The Reconciliation Pattern
Here's the mental model for writing reconciliation logic:
1. Fetch the primary resource (Website)
- If not found → it was deleted, nothing to do
2. For each dependent resource (ConfigMap, Deployment, Service):
a. Define what it SHOULD look like
b. Check if it EXISTS
c. If not exists → CREATE it
d. If exists but different → UPDATE it
3. Update the status of the primary resource
4. Return success (or requeue if needed)
Let's implement this:
package controller
import (
"context"
"fmt"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/types"
"k8s.io/apimachinery/pkg/util/intstr"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/log"
webappv1 "github.com/yourorg/website-operator/api/v1"
)
// WebsiteReconciler reconciles a Website object
type WebsiteReconciler struct {
client.Client
Scheme *runtime.Scheme
}
// +kubebuilder:rbac:groups=webapp.example.com,resources=websites,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=webapp.example.com,resources=websites/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=webapp.example.com,resources=websites/finalizers,verbs=update
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=services,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=configmaps,verbs=get;list;watch;create;update;patch;delete
func (r *WebsiteReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
logger := log.FromContext(ctx)
// ============================================================
// STEP 1: Fetch the Website resource
// ============================================================
// We always start by fetching the primary resource. If it's gone,
// Kubernetes garbage collection handles cleanup (via OwnerReferences).
website := &webappv1.Website{}
if err := r.Get(ctx, req.NamespacedName, website); err != nil {
if errors.IsNotFound(err) {
// Resource was deleted - nothing to do
// Owned resources get cleaned up automatically
logger.Info("Website resource not found, likely deleted")
return ctrl.Result{}, nil
}
// Error fetching - requeue
return ctrl.Result{}, err
}
logger.Info("Reconciling Website", "name", website.Name)
// ============================================================
// STEP 2: Reconcile the ConfigMap (holds HTML content)
// ============================================================
// ConfigMap stores our HTML. We create it first because the
// Deployment needs to mount it.
configMap := r.configMapForWebsite(website)
if err := r.reconcileConfigMap(ctx, website, configMap); err != nil {
return ctrl.Result{}, err
}
// ============================================================
// STEP 3: Reconcile the Deployment (runs nginx pods)
// ============================================================
deployment := r.deploymentForWebsite(website)
if err := r.reconcileDeployment(ctx, website, deployment); err != nil {
return ctrl.Result{}, err
}
// ============================================================
// STEP 4: Reconcile the Service (exposes the pods)
// ============================================================
service := r.serviceForWebsite(website)
if err := r.reconcileService(ctx, website, service); err != nil {
return ctrl.Result{}, err
}
// ============================================================
// STEP 5: Update the Website status
// ============================================================
// Fetch the current deployment to get ready replica count
currentDeployment := &appsv1.Deployment{}
if err := r.Get(ctx, types.NamespacedName{
Name: website.Name,
Namespace: website.Namespace,
}, currentDeployment); err == nil {
website.Status.ReadyReplicas = currentDeployment.Status.ReadyReplicas
}
website.Status.URL = fmt.Sprintf("http://%s.%s.svc.cluster.local",
website.Name, website.Namespace)
if err := r.Status().Update(ctx, website); err != nil {
logger.Error(err, "Failed to update Website status")
return ctrl.Result{}, err
}
logger.Info("Successfully reconciled Website")
return ctrl.Result{}, nil
}
The Helper Functions: Building Desired State
Now let's implement the helper functions. Each one defines what a resource SHOULD look like:
// configMapForWebsite creates the desired ConfigMap spec
func (r *WebsiteReconciler) configMapForWebsite(website *webappv1.Website) *corev1.ConfigMap {
return &corev1.ConfigMap{
ObjectMeta: metav1.ObjectMeta{
Name: website.Name,
Namespace: website.Namespace,
},
Data: map[string]string{
"index.html": website.Spec.HTML,
},
}
}
// deploymentForWebsite creates the desired Deployment spec
func (r *WebsiteReconciler) deploymentForWebsite(website *webappv1.Website) *appsv1.Deployment {
labels := map[string]string{
"app": "website",
"website": website.Name,
}
replicas := website.Spec.Replicas
// Determine the image to use
image := website.Spec.Image
if image == "" {
image = "nginx:1.27-alpine"
}
return &appsv1.Deployment{
ObjectMeta: metav1.ObjectMeta{
Name: website.Name,
Namespace: website.Namespace,
},
Spec: appsv1.DeploymentSpec{
Replicas: &replicas,
Selector: &metav1.LabelSelector{
MatchLabels: labels,
},
Template: corev1.PodTemplateSpec{
ObjectMeta: metav1.ObjectMeta{
Labels: labels,
},
Spec: corev1.PodSpec{
Containers: []corev1.Container{{
Name: "nginx",
Image: image,
Ports: []corev1.ContainerPort{{
ContainerPort: 80,
}},
VolumeMounts: []corev1.VolumeMount{{
Name: "html",
MountPath: "/usr/share/nginx/html",
}},
}},
Volumes: []corev1.Volume{{
Name: "html",
VolumeSource: corev1.VolumeSource{
ConfigMap: &corev1.ConfigMapVolumeSource{
LocalObjectReference: corev1.LocalObjectReference{
Name: website.Name,
},
},
},
}},
},
},
},
}
}
// serviceForWebsite creates the desired Service spec
func (r *WebsiteReconciler) serviceForWebsite(website *webappv1.Website) *corev1.Service {
return &corev1.Service{
ObjectMeta: metav1.ObjectMeta{
Name: website.Name,
Namespace: website.Namespace,
},
Spec: corev1.ServiceSpec{
Selector: map[string]string{
"app": "website",
"website": website.Name,
},
Ports: []corev1.ServicePort{{
Port: 80,
TargetPort: intstr.FromInt(80),
}},
Type: corev1.ServiceTypeClusterIP,
},
}
}
The Reconcile Helpers: Create or Update Pattern
Now the functions that actually create or update resources:
func (r *WebsiteReconciler) reconcileConfigMap(ctx context.Context, website *webappv1.Website, desired *corev1.ConfigMap) error {
logger := log.FromContext(ctx)
// Set owner reference - this is crucial for garbage collection!
// When the Website is deleted, this ConfigMap will be automatically deleted too
if err := ctrl.SetControllerReference(website, desired, r.Scheme); err != nil {
return err
}
// Check if ConfigMap already exists
existing := &corev1.ConfigMap{}
err := r.Get(ctx, types.NamespacedName{Name: desired.Name, Namespace: desired.Namespace}, existing)
if errors.IsNotFound(err) {
// Doesn't exist - create it
logger.Info("Creating ConfigMap", "name", desired.Name)
return r.Create(ctx, desired)
} else if err != nil {
return err
}
// Exists - check if it needs updating
if existing.Data["index.html"] != desired.Data["index.html"] {
logger.Info("Updating ConfigMap", "name", desired.Name)
existing.Data = desired.Data
return r.Update(ctx, existing)
}
return nil
}
func (r *WebsiteReconciler) reconcileDeployment(ctx context.Context, website *webappv1.Website, desired *appsv1.Deployment) error {
logger := log.FromContext(ctx)
if err := ctrl.SetControllerReference(website, desired, r.Scheme); err != nil {
return err
}
existing := &appsv1.Deployment{}
err := r.Get(ctx, types.NamespacedName{Name: desired.Name, Namespace: desired.Namespace}, existing)
if errors.IsNotFound(err) {
logger.Info("Creating Deployment", "name", desired.Name)
return r.Create(ctx, desired)
} else if err != nil {
return err
}
// Check if spec changed (replicas or image)
needsUpdate := false
if *existing.Spec.Replicas != *desired.Spec.Replicas {
needsUpdate = true
}
if existing.Spec.Template.Spec.Containers[0].Image != desired.Spec.Template.Spec.Containers[0].Image {
needsUpdate = true
}
if needsUpdate {
logger.Info("Updating Deployment", "name", desired.Name)
existing.Spec.Replicas = desired.Spec.Replicas
existing.Spec.Template.Spec.Containers[0].Image = desired.Spec.Template.Spec.Containers[0].Image
return r.Update(ctx, existing)
}
return nil
}
func (r *WebsiteReconciler) reconcileService(ctx context.Context, website *webappv1.Website, desired *corev1.Service) error {
logger := log.FromContext(ctx)
if err := ctrl.SetControllerReference(website, desired, r.Scheme); err != nil {
return err
}
existing := &corev1.Service{}
err := r.Get(ctx, types.NamespacedName{Name: desired.Name, Namespace: desired.Namespace}, existing)
if errors.IsNotFound(err) {
logger.Info("Creating Service", "name", desired.Name)
return r.Create(ctx, desired)
} else if err != nil {
return err
}
// Services are mostly immutable after creation, skip update
return nil
}
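As an aside, controller-runtime ships a helper that collapses this get/create/update dance into one call. A sketch (not the tutorial's code path) of the ConfigMap reconcile rewritten with `controllerutil.CreateOrUpdate`, assuming the `sigs.k8s.io/controller-runtime/pkg/controller/controllerutil` import:

```go
// Sketch: same behavior as reconcileConfigMap, using the helper.
func (r *WebsiteReconciler) reconcileConfigMapAlt(ctx context.Context, website *webappv1.Website) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: website.Name, Namespace: website.Namespace},
	}
	// The mutate closure declares desired state; the helper fetches the
	// object, applies the closure, and only calls the API when it changed.
	_, err := controllerutil.CreateOrUpdate(ctx, r.Client, cm, func() error {
		cm.Data = map[string]string{"index.html": website.Spec.HTML}
		return ctrl.SetControllerReference(website, cm, r.Scheme)
	})
	return err
}
```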
Understanding Owner References
Notice the ctrl.SetControllerReference() calls. This is critical:
if err := ctrl.SetControllerReference(website, desired, r.Scheme); err != nil {
return err
}
This sets an OwnerReference on the child resource pointing to the Website. When Kubernetes sees this:
- If the Website is deleted, all owned resources are automatically deleted (garbage collection)
- Changes to owned resources trigger reconciliation of the owner
- Running `kubectl get configmap my-site -o yaml` shows the owner reference on the child
This is why we don't need cleanup code—Kubernetes handles it automatically.
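If you want to check the link programmatically, apimachinery can read it back for you. A small sketch using `metav1.GetControllerOf`, which returns the owner reference whose `controller` flag is set, or nil:

```go
// Sketch: inspecting the controller owner of any object (e.g. our ConfigMap).
// Assumes the usual fmt and metav1 imports.
func describeOwner(obj metav1.Object) string {
	if owner := metav1.GetControllerOf(obj); owner != nil {
		return fmt.Sprintf("controlled by %s %q", owner.Kind, owner.Name)
	}
	return "no controller owner"
}
```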
Setting Up the Controller Watches
Finally, we need to tell the controller what to watch. Add this at the bottom of the file:
func (r *WebsiteReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&webappv1.Website{}).
Owns(&appsv1.Deployment{}).
Owns(&corev1.Service{}).
Owns(&corev1.ConfigMap{}).
Complete(r)
}
What does this mean?
- `For(&webappv1.Website{})`: Primary resource to watch. Any create, update, or delete triggers reconcile.
- `Owns(&appsv1.Deployment{})`: Also watch Deployments that have our Website as owner. If someone deletes the Deployment, we'll recreate it.
This is what makes operators self-healing!
Step 5: Build and Deploy the Operator
Install the CRD
First, install your Custom Resource Definition:
make install
This runs kubectl apply on the generated CRD in config/crd/bases/.
Run Locally for Development
During development, run the operator outside the cluster:
make run
This is faster than building images for every change. The operator uses your kubeconfig to connect.
Build and Deploy to Cluster
For production, build and push the image:
# Build the image
make docker-build IMG=yourregistry/website-operator:v1
# Push to registry
make docker-push IMG=yourregistry/website-operator:v1
# Deploy to cluster
make deploy IMG=yourregistry/website-operator:v1
Step 6: Test the Operator
Create a Website Resource
Create sample-website.yaml:
apiVersion: webapp.example.com/v1
kind: Website
metadata:
name: hello-world
namespace: default
spec:
replicas: 2
html: |
<!DOCTYPE html>
<html>
<head><title>Hello from Operator</title></head>
<body>
<h1>Hello, Kubernetes Operator!</h1>
<p>This website is managed by a custom operator.</p>
</body>
</html>
Apply it:
kubectl apply -f sample-website.yaml
Verify the Resources
# Check the Website resource
kubectl get websites
NAME REPLICAS READY URL AGE
hello-world 2 2 http://hello-world.default.svc.cluster.local 30s
# Check created resources
kubectl get deployment,service,configmap -l website=hello-world
Test Self-Healing
Delete the deployment and watch it get recreated:
kubectl delete deployment hello-world
kubectl get deployment hello-world -w # Watch it come back
Access the Website
kubectl port-forward svc/hello-world 8080:80
# Open http://localhost:8080
Test Updates
Change the HTML and apply again:
kubectl patch website hello-world --type=merge -p '{"spec":{"html":"<h1>Updated!</h1>"}}'
Watch the ConfigMap update. The kubelet periodically syncs mounted ConfigMap volumes, so the pods pick up the new content after a short delay—no restart needed.
Step 7: Add Unit Tests
Kubebuilder generates a test suite using Ginkgo and envtest (an in-memory Kubernetes API server).
Edit internal/controller/website_controller_test.go:
package controller
import (
"context"
"time"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/types"
webappv1 "github.com/yourorg/website-operator/api/v1"
)
var _ = Describe("Website Controller", func() {
const (
timeout = time.Second * 10
interval = time.Millisecond * 250
)
Context("When creating a Website", func() {
It("Should create a Deployment with correct replicas", func() {
ctx := context.Background()
// Create a Website
website := &webappv1.Website{
ObjectMeta: metav1.ObjectMeta{
Name: "test-website",
Namespace: "default",
},
Spec: webappv1.WebsiteSpec{
Replicas: 3,
HTML: "<h1>Test</h1>",
},
}
Expect(k8sClient.Create(ctx, website)).Should(Succeed())
// Verify Deployment is created
deploymentKey := types.NamespacedName{Name: "test-website", Namespace: "default"}
deployment := &appsv1.Deployment{}
Eventually(func() error {
return k8sClient.Get(ctx, deploymentKey, deployment)
}, timeout, interval).Should(Succeed())
Expect(*deployment.Spec.Replicas).Should(Equal(int32(3)))
})
It("Should create a ConfigMap with the HTML content", func() {
ctx := context.Background()
configMapKey := types.NamespacedName{Name: "test-website", Namespace: "default"}
configMap := &corev1.ConfigMap{}
Eventually(func() error {
return k8sClient.Get(ctx, configMapKey, configMap)
}, timeout, interval).Should(Succeed())
Expect(configMap.Data["index.html"]).Should(Equal("<h1>Test</h1>"))
})
It("Should create a Service", func() {
ctx := context.Background()
serviceKey := types.NamespacedName{Name: "test-website", Namespace: "default"}
service := &corev1.Service{}
Eventually(func() error {
return k8sClient.Get(ctx, serviceKey, service)
}, timeout, interval).Should(Succeed())
Expect(service.Spec.Ports[0].Port).Should(Equal(int32(80)))
})
})
})
Run the tests:
make test
Best Practices for Production Operators
Handle Finalizers for External Resource Cleanup
Owner references handle Kubernetes resources, but what about external resources (cloud infrastructure, DNS records, external databases)?
Use finalizers—they block deletion until you've cleaned up:
import "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
const websiteFinalizer = "webapp.example.com/finalizer"
func (r *WebsiteReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
website := &webappv1.Website{}
if err := r.Get(ctx, req.NamespacedName, website); err != nil {
return ctrl.Result{}, client.IgnoreNotFound(err)
}
// Check if being deleted
if !website.DeletionTimestamp.IsZero() {
if controllerutil.ContainsFinalizer(website, websiteFinalizer) {
// Perform cleanup of external resources
if err := r.cleanupExternalResources(website); err != nil {
return ctrl.Result{}, err
}
// Remove finalizer to allow deletion to proceed
controllerutil.RemoveFinalizer(website, websiteFinalizer)
return ctrl.Result{}, r.Update(ctx, website)
}
return ctrl.Result{}, nil
}
// Add finalizer if not present
if !controllerutil.ContainsFinalizer(website, websiteFinalizer) {
controllerutil.AddFinalizer(website, websiteFinalizer)
return ctrl.Result{}, r.Update(ctx, website)
}
// Normal reconciliation...
return ctrl.Result{}, nil
}
How it works: When you kubectl delete a resource with a finalizer, Kubernetes sets deletionTimestamp but doesn't actually delete until all finalizers are removed.
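Note that `cleanupExternalResources` above is a placeholder; a minimal stub so the snippet compiles, with your real teardown swapped in:

```go
// Hypothetical hook referenced above. Replace the body with your real
// external teardown (DNS records, cloud resources, monitoring registrations).
// It must be idempotent: the reconciler may retry it several times.
func (r *WebsiteReconciler) cleanupExternalResources(website *webappv1.Website) error {
	// Nothing external to clean up in this tutorial's operator.
	return nil
}
```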
Implement Proper Error Handling and Requeuing
Not all errors are equal:
func (r *WebsiteReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
// ...
// Transient error (API rate limit, network blip) - retry soon
if isTransientError(err) {
return ctrl.Result{RequeueAfter: time.Second * 30}, nil
}
// Permanent error (invalid config) - don't retry, update status
if isPermanentError(err) {
website.Status.Conditions = append(website.Status.Conditions, metav1.Condition{
Type: "Ready",
Status: metav1.ConditionFalse,
Reason: "ConfigurationError",
Message: err.Error(),
})
r.Status().Update(ctx, website)
return ctrl.Result{}, nil // Don't return error, don't requeue
}
return ctrl.Result{}, nil
}
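The `isTransientError`/`isPermanentError` helpers above are placeholders too. One plausible classification built on apimachinery's error predicates, assuming an `apierrors "k8s.io/apimachinery/pkg/api/errors"` import:

```go
// Plausible classifiers; adapt the predicates to your own failure modes.
func isTransientError(err error) bool {
	return apierrors.IsServerTimeout(err) ||
		apierrors.IsTooManyRequests(err) ||
		apierrors.IsConflict(err)
}

func isPermanentError(err error) bool {
	return apierrors.IsInvalid(err) || apierrors.IsForbidden(err)
}
```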
Use Conditions for Status Reporting
Conditions are the standard way to communicate resource state:
import "k8s.io/apimachinery/pkg/api/meta"
// Set a condition
meta.SetStatusCondition(&website.Status.Conditions, metav1.Condition{
Type: "Ready",
Status: metav1.ConditionTrue,
Reason: "ReconcileSuccess",
Message: "All resources created successfully",
LastTransitionTime: metav1.Now(),
})
// Check a condition
if meta.IsStatusConditionTrue(website.Status.Conditions, "Ready") {
// Website is ready
}
Add Metrics for Observability
Kubebuilder includes Prometheus metrics. Add custom metrics:
import (
"github.com/prometheus/client_golang/prometheus"
"sigs.k8s.io/controller-runtime/pkg/metrics"
)
var (
websiteReconcileTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "website_reconcile_total",
Help: "Total number of reconciliations per website",
},
[]string{"website", "namespace"},
)
websiteReconcileErrors = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "website_reconcile_errors_total",
Help: "Total number of reconciliation errors",
},
[]string{"website", "namespace"},
)
)
func init() {
metrics.Registry.MustRegister(websiteReconcileTotal, websiteReconcileErrors)
}
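Then bump the counters from your Reconcile function. A sketch of where the increments go (`doReconcile` is a hypothetical wrapper around the Step 4 logic):

```go
func (r *WebsiteReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Count every reconciliation attempt for this object.
	websiteReconcileTotal.WithLabelValues(req.Name, req.Namespace).Inc()

	if err := r.doReconcile(ctx, req); err != nil { // hypothetical wrapper
		// Count failures separately so you can alert on the error rate.
		websiteReconcileErrors.WithLabelValues(req.Name, req.Namespace).Inc()
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```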
Conclusion
Building a Kubernetes operator is about encoding operational knowledge into software. The Website operator we built demonstrates patterns that apply to any operator:
- CRDs define your API: Users interact with simple, declarative resources
- Reconciliation loops converge to desired state: Always comparing and fixing
- Owner references enable garbage collection: No manual cleanup needed
- Watches enable self-healing: Changes to owned resources trigger reconciliation
- Status provides observability: Users can see what's happening
The operator pattern is powerful because it lets you build autonomous systems. Instead of scripts that run once and hope, operators continuously ensure your infrastructure matches what you declared.
From here, you can extend your operator with:
- Webhooks for validation (reject invalid configs) and mutation (set defaults)
- Multiple CRDs with relationships between them
- Integration with external services (DNS, cloud providers, databases)
- Leader election for high availability (already built into the Manager)
The Kubebuilder book and Operator SDK documentation provide deeper dives into these topics. Start simple, solve a real problem, and iterate based on actual needs.