observability

Browse all articles, tutorials, and guides about observability

7 posts

Posts


2026-06-27|7 min read

Splunk Shipped an Unauthenticated Database Sidecar: CVE-2026-20253

You did not install a PostgreSQL server, but Splunk Enterprise 10 did, and in affected versions its sidecar endpoint had no authentication. The result is a pre-auth, CVSS 9.8 path to writing files on the host as the Splunk user, now on CISA's actively-exploited list. The bug is patched; the broader lesson is about every helper service your tools quietly bundle.

DevOps

2026-06-19|10 min read

AI SRE Agents: What They Actually Fix, and What They Will Happily Break

AI SRE is now its own category, with every incident vendor shipping an agent that investigates and remediates on its own. Here is the honest split: where these agents genuinely earn their keep, where they are oversold, and the one risk nobody puts on the marketing page.

DevOps

2026-05-26|12 min read

OpenTelemetry Just Graduated: What to Retire from Your Stack This Quarter

On May 21, 2026, CNCF graduated OpenTelemetry. All three core signals (traces, metrics, logs) are now production-ready, the project is the second-most-active in CNCF after Kubernetes itself, and Anthropic, Bloomberg, Capital One, eBay, and Heroku run it at scale. Here is the decision framework for what proprietary agents you can stop running, what is still risky, and the 90-day adoption checklist.

DevOps

2026-05-25|11 min read

How to Build an Effective On-Call Rotation and Escalation Policy

Your phone buzzed at 3:14 AM for a disk warning that auto-resolved by 3:16. Nobody fixes the alert. The next person on rotation hates their life. Here is how to build on-call schedules, escalation policies, and alert rules that respect your engineers.

DevOps

2026-05-11|11 min read

Distributed Tracing with OpenTelemetry: From Instrumentation to Visualization

A walkthrough of instrumenting a real service with OpenTelemetry, running the Collector, and finding the slow span in Jaeger when a request hops across five microservices.

DevOps

2026-04-13|10 min read

SLOs, SLIs, and Error Budgets: A Practical Implementation Guide

Your service went down at 2 AM and nobody could agree on whether it was "bad enough" to page someone. SLOs, SLIs, and error budgets fix that. Here is how to define, measure, and act on them with real Prometheus queries and alerting rules.

DevOps

2025-03-12|6 min read

What is P99 Latency?

P99 latency is the response time at the 99th percentile: 99% of requests complete at least this fast, and only the slowest 1% take longer. Learn why P99 matters more than average latency for understanding real user experience.
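
As a rough illustration of why the two numbers diverge, here is a minimal sketch that computes the mean and the P99 of a small latency sample. It uses the nearest-rank percentile definition, which is one common convention among several; the function name `percentile_nearest_rank` and the sample values are hypothetical and not taken from the linked article.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # 1-based rank of the percentile within the sorted samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [20] * 98 + [900, 1500]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile_nearest_rank(latencies_ms, 99)

print(f"mean = {mean_ms} ms, p99 = {p99_ms} ms")
```

In this made-up sample the mean stays close to the typical request while the P99 lands on one of the slow outliers, which is the gap the post is about. Monitoring systems that estimate percentiles from histograms or use interpolation may report somewhat different values for the same data.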