Why Run an OpenTelemetry Collector
Why run an OpenTelemetry Collector at all instead of having every application export directly to your tracing backend? And how would you deploy it in Kubernetes?
Direct export couples every application to your backend. Change vendors, rotate an API key, or add span redaction, and you are redeploying 40 services. With a Collector in the middle, apps export plain OTLP to a local endpoint and everything else becomes pipeline config: receivers take data in, processors transform it, exporters fan it out.

The processors are where the real value is. The batch processor cuts export overhead, memory_limiter keeps the Collector from being OOM-killed under a span flood by refusing new data once memory crosses a configured limit, the attributes and redaction processors strip PII before it leaves your cluster, and tail sampling can only happen here because no single app sees the whole trace. The Collector's exporters also queue and retry when the backend is down, so a short backend outage does not have to mean dropped telemetry or back-pressure into your apps. The queue is bounded, so a long outage still drops data once it fills.

In Kubernetes the standard layout is two tiers. A DaemonSet agent on every node gives apps a node-local endpoint, adds Kubernetes metadata such as pod and namespace, and does cheap work like batching. A central gateway Deployment behind a Service handles the expensive, stateful work: tail sampling, filtering, authentication to external backends, and being the single egress point your firewall rules allow. Small clusters can skip the gateway and run just the DaemonSet pointed straight at the backend. You add the gateway when you need tail sampling or want one place to control egress and credentials.
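As an illustration of the PII-stripping step, here is a minimal sketch of an attributes processor. The attribute keys and the processor name suffix `strip-pii` are assumptions chosen for the example; which keys actually carry PII depends on your instrumentation.

```yaml
processors:
  # Hypothetical processor instance; the keys below are example attribute names.
  attributes/strip-pii:
    actions:
      # Remove the attribute entirely before the span is exported.
      - key: user.email
        action: delete
      - key: http.request.header.authorization
        action: delete
      # Replace the value with a hash so spans can still be correlated per user.
      - key: enduser.id
        action: hash
```

A processor defined this way only takes effect once it is listed in the `processors` array of a pipeline under `service.pipelines`.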
This separates people who have operated telemetry pipelines from people who have only instrumented an app. The core insight you want to hear is decoupling: telemetry policy lives in pipeline config, not application code. Then probe the deployment topology. Strong candidates know the agent-versus-gateway split and, more importantly, why the split exists: tail sampling and credential management belong in a central tier, host metadata enrichment belongs on the node. If they mention memory_limiter or what happens when the backend goes down, they have run this in production.
Collector pipeline: receive OTLP, protect memory, batch, fan out
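A minimal sketch of such a pipeline, assuming a traces-only setup. The backend hostnames, the `BACKEND_API_KEY` environment variable, and every numeric limit are illustrative placeholders rather than recommended values.

```yaml
# Sketch only: endpoints, the env var name, and all limits are assumptions.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Listed first in the pipeline so data is refused before later stages allocate more memory.
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  # Groups spans into larger export requests to cut per-request overhead.
  batch:
    send_batch_size: 8192
    timeout: 5s

exporters:
  # Hypothetical external vendor backend, authenticated with a key from the environment.
  otlp/vendor:
    endpoint: ingest.vendor.example.com:4317
    headers:
      authorization: "Bearer ${env:BACKEND_API_KEY}"
    # Bounded in-memory queue plus retries to ride out short backend outages.
    sending_queue:
      enabled: true
      queue_size: 5000
    retry_on_failure:
      enabled: true
      max_elapsed_time: 300s
  # Hypothetical in-cluster Jaeger instance receiving the same spans over OTLP.
  otlp/jaeger:
    endpoint: jaeger-collector.observability.svc.cluster.local:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      # Two exporters on one pipeline: every batch is sent to both.
      exporters: [otlp/vendor, otlp/jaeger]
```

Processor order matters here: the pipeline runs them in the order listed, which is why `memory_limiter` comes before `batch`.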
Two-tier deployment with the OpenTelemetry Operator
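A sketch of the two tiers as OpenTelemetryCollector resources, assuming the Operator's `v1beta1` API is installed and the contrib image is used (the `k8sattributes` and `tail_sampling` processors live in the contrib distribution). The `observability` namespace, the `backend-credentials` Secret, the vendor endpoint, and the sampling policies are hypothetical. The RBAC that `k8sattributes` needs and the way apps reach the node-local agent are left out for brevity.

```yaml
# Tier 1: node agent, one Collector pod per node.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: agent
  namespace: observability
spec:
  mode: daemonset
  image: otel/opentelemetry-collector-contrib:latest  # pin a specific version in practice
  env:
    - name: KUBE_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 80
        spike_limit_percentage: 25
      # Only watch pods on this agent's own node.
      k8sattributes:
        filter:
          node_from_env_var: KUBE_NODE_NAME
      batch: {}
    exporters:
      # The Operator exposes a Collector named "gateway" as the Service "gateway-collector".
      otlp:
        endpoint: gateway-collector.observability.svc.cluster.local:4317
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp]
---
# Tier 2: central gateway that samples whole traces and holds backend credentials.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: gateway
  namespace: observability
spec:
  mode: deployment
  # One replica keeps all spans of a trace in one place; scaling out needs
  # trace-ID-aware routing from the agents (for example the loadbalancing exporter).
  replicas: 1
  image: otel/opentelemetry-collector-contrib:latest  # pin a specific version in practice
  env:
    - name: BACKEND_API_KEY
      valueFrom:
        secretKeyRef:
          name: backend-credentials  # hypothetical Secret
          key: api-key
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 80
        spike_limit_percentage: 25
      tail_sampling:
        decision_wait: 10s
        policies:
          - name: keep-errors
            type: status_code
            status_code:
              status_codes: [ERROR]
          - name: sample-the-rest
            type: probabilistic
            probabilistic:
              sampling_percentage: 10
      batch: {}
    exporters:
      otlp:
        endpoint: ingest.vendor.example.com:4317  # hypothetical backend
        headers:
          authorization: "Bearer ${env:BACKEND_API_KEY}"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, tail_sampling, batch]
          exporters: [otlp]
```

Only the gateway carries the backend credential, so rotating the key touches one Secret and one Deployment rather than every node agent or application.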
Common mistakes
- Describing the Collector as just a proxy and missing the processing layer, which is the actual reason to run one
- Running a single-replica gateway with no memory_limiter, so one chatty service OOMs the whole telemetry pipeline
- Putting tail sampling on the DaemonSet agents, where no node ever sees all spans of a trace, so sampling decisions are made on partial data
Follow-up questions
- The Collector itself becomes a single point of failure for your telemetry. How do you make the pipeline survive a Collector crash or a bad config rollout?
- How would you monitor the Collector itself, and which signals tell you it is dropping spans?
- When would you choose a sidecar Collector per pod over a node DaemonSet?