Skip to main content

Observability

Browse all articles, tutorials, and guides about Observability

5posts

Posts

DevOps
|12 min read

OpenTelemetry Just Graduated: What to Retire from Your Stack This Quarter

On May 21, 2026, CNCF graduated OpenTelemetry. All three core signals (traces, metrics, logs) are now production-ready, the project is the second-most-active in CNCF after Kubernetes itself, and Anthropic, Bloomberg, Capital One, eBay, and Heroku run it at scale. Here is the decision framework for what proprietary agents you can stop running, what is still risky, and the 90-day adoption checklist.

DevOps
|11 min read

How to Build an Effective On-Call Rotation and Escalation Policy

Your phone buzzed at 3:14 AM for a disk warning that auto-resolved by 3:16. Nobody fixes the alert. The next person on rotation hates their life. Here is how to build on-call schedules, escalation policies, and alert rules that respect your engineers.

DevOps
|11 min read

Distributed Tracing with OpenTelemetry: From Instrumentation to Visualization

A walkthrough of instrumenting a real service with OpenTelemetry, running the Collector, and finding the slow span in Jaeger when a request hops across five microservices.

DevOps
|10 min read

SLOs, SLIs, and Error Budgets: A Practical Implementation Guide

Your service went down at 2 AM and nobody could agree on whether it was "bad enough" to page someone. SLOs, SLIs, and error budgets fix that. Here is how to define, measure, and act on them with real Prometheus queries and alerting rules.

DevOps
|6 min read

What is P99 Latency?

P99 latency measures the response time at the 99th percentile, showing how fast your slowest 1% of requests are. Learn why P99 is more important than average latency for understanding real user experience.