OpenTelemetry Collector¶
The OpenTelemetry Collector is a vendor-agnostic telemetry pipeline that receives, processes, and exports observability signals (traces, metrics, and logs) through a configurable receiver → processor → exporter architecture. Unlike application-embedded exporters that couple services to specific backends, the Collector acts as a protocol-normalizing intermediary: applications emit OTLP once, and the Collector handles fan-out, enrichment, and format translation.
This deployment uses the contrib distribution, which bundles the community receivers and exporters (Jaeger, Zipkin, Prometheus, Loki, and others) in a single binary. It is heavier than a custom-built distribution, but it removes the need to maintain a custom builder pipeline, and binary size is not a constraint in a homelab.
Three things distinguish the OTel Collector from similar pipeline tools (Grafana Agent/Alloy, Fluentd, Vector): it is the reference implementation of OTLP, it handles all three signal types in a single process with a common processor chain, and it offers first-class Kubernetes metadata enrichment via the k8sattributes processor, which decorates every span, metric, and log line with pod, namespace, and deployment context without application-level instrumentation changes.
Overview¶
| Property | Value |
|---|---|
| Namespace | opentelemetry |
| Type | HelmRelease (chart: opentelemetry-collector v0.108.0) |
| Layer | Distributed tracing services |
| Chart | opentelemetry-collector v0.108.0 |
| Status | Enabled |
| Source | apps/base/opentelemetry-collector/ |
Dependencies¶
Upstream — required before OpenTelemetry Collector starts¶
| Service | Reason | Status |
|---|---|---|
| jaeger | Flux dependsOn | Active |
| loki | Flux dependsOn | Active |
| kube-prometheus-stack | Flux dependsOn | Active |
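As a rough sketch, this upstream ordering could be declared on the Collector's Flux Kustomization as follows. The `dependsOn` names come from the table above and the `cluster-vars` ConfigMap from the runbook commands in Operations; the source name, interval, and exact path prefix are assumptions, not values read from the repository.

```yaml
# Hypothetical sketch of base/services/opentelemetry-collector.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: opentelemetry-collector
  namespace: flux-system
spec:
  interval: 10m                # assumed reconcile interval
  path: ./apps/base/opentelemetry-collector
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system          # assumed source name
  dependsOn:                   # reconciled only after these report Ready
    - name: jaeger
    - name: loki
    - name: kube-prometheus-stack
  postBuild:
    substituteFrom:
      - kind: ConfigMap
        name: cluster-vars     # supplies the OTEL_COLLECTOR_* variables
```

Note that `dependsOn` only orders reconciliation. It does not restart or pause the Collector if a backend becomes unavailable later, which is why the exporter queue and retry settings still matter.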
Downstream — services that depend on OpenTelemetry Collector¶
| Service | Dependency type | Reason |
|---|---|---|
| agentgateway-config | Flux dependsOn | Requires OpenTelemetry Collector |
Purpose¶
The OpenTelemetry Collector serves as the platform's single telemetry ingestion point. All application workloads — including the kagent multi-agent system and supporting services — emit telemetry to one endpoint (opentelemetry-collector:4317) using OTLP. The Collector then routes traces to Jaeger, metrics to Prometheus (via a scrape-compatible exporter endpoint), and logs to Loki using native OTLP HTTP ingestion.
This decouples every application from backend knowledge: services never import Jaeger or Loki client libraries, never hardcode backend endpoints, and remain unaffected when backends are swapped or reconfigured. The Collector also enriches all signals with Kubernetes metadata (pod name, namespace, deployment, node) through the k8sattributes processor, providing consistent correlation dimensions across traces, metrics, and logs without per-service instrumentation effort.
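To illustrate the decoupling, a workload typically only needs the standard OpenTelemetry SDK environment variables pointing at the Collector. The fragment below is a minimal sketch: the container name and image are hypothetical, and the fully qualified Service name assumes the Service is called `opentelemetry-collector` in the `opentelemetry` namespace.

```yaml
# Fragment of an application Deployment's pod spec (hypothetical workload)
containers:
  - name: example-service            # hypothetical container name
    image: example/service:latest    # hypothetical image
    env:
      # Single ingestion endpoint; no Jaeger, Loki, or Prometheus settings needed
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://opentelemetry-collector.opentelemetry.svc.cluster.local:4317
      - name: OTEL_EXPORTER_OTLP_PROTOCOL
        value: grpc
      - name: OTEL_SERVICE_NAME
        value: example-service
```

Swapping a backend then means changing the Collector's exporters, not these variables.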
Why OTel Collector over direct-to-backend or Grafana Agent: Direct emission (apps → Jaeger for traces, Promtail → Loki for logs) was the initial approach but created tight coupling — every service needed backend-specific client libraries and endpoint configuration. Migrating a single backend meant touching every service.
Grafana Agent (now Alloy) was considered as a unified pipeline. It integrates well with the Grafana stack already deployed here, but it uses its own project-specific configuration language (River), and its OTLP support trails the reference implementation. The OTel Collector speaks OTLP natively, has broader community momentum, and avoids deeper lock-in to the Grafana ecosystem. That matters here because trace and metric backends may change independently of the visualization layer.
Features¶
| Feature | Detail |
|---|---|
| Multi-protocol ingestion | Accepts OTLP (gRPC :4317, HTTP :4318), Jaeger (gRPC :14250, thrift_http :14268, thrift_compact :6831, thrift_binary :6832), Zipkin (:9411), and Prometheus self-scrape — enabling incremental migration from legacy instrumentation without application changes. |
| Kubernetes metadata enrichment | The k8sattributes processor uses serviceAccount auth to extract pod, deployment, statefulset, daemonset, cronjob, job, node, and namespace metadata plus app.kubernetes.io labels, attaching them to all telemetry signals automatically. |
| Memory-bounded processing | The memory_limiter processor (80% limit, 25% spike allowance, 5s check interval) reduces the risk of OOM kills under burst load: once memory use crosses the soft limit it refuses incoming data, which pushes backpressure onto receivers before the process exceeds its cgroup memory allocation. |
| Batched export with retry | The batch processor (1024 batch size, 2048 max, 10s timeout) amortizes export overhead; the otlp/jaeger exporter adds a 1000-item sending queue and exponential-backoff retry (1s initial interval, 10s maximum interval, 60s maximum elapsed time) to absorb transient backend unavailability. |
| Three-signal pipeline routing | Traces are exported to Jaeger over OTLP gRPC, metrics are exposed on a Prometheus-scrapable endpoint (:8889) with resource-to-telemetry conversion, and logs are pushed to Loki over native OTLP HTTP. Every pipeline runs the same processor chain: memory_limiter, then k8sattributes, then batch. |
| Kubernetes presets enabled | kubernetesAttributes, clusterMetrics, and kubeletMetrics presets are active, providing cluster-level resource metrics and kubelet stats without manual receiver configuration. |
| ServiceMonitor integration | A ServiceMonitor resource exposes the Prometheus exporter port (:8889) with 30s scrape interval, enabling kube-prometheus-stack to discover and scrape collector-processed metrics automatically. |
| Diagnostic extensions | health_check (:13133), pprof (:1777), and zpages (:55679) extensions are enabled for liveness probing, CPU/memory profiling, and pipeline-level debug visualization respectively. |
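The processor, exporter, and extension settings in the table map onto Collector configuration roughly as sketched below. The numeric values and endpoints are taken from this page; the `0.0.0.0` bind addresses and the plaintext TLS setting are assumptions, and label extraction for `app.kubernetes.io/*` is omitted for brevity.

```yaml
processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80          # hard limit: about 410Mi of a 512Mi container limit
    spike_limit_percentage: 25    # soft limit = hard limit - spike, about 282Mi
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
        - k8s.statefulset.name
        - k8s.daemonset.name
        - k8s.cronjob.name
        - k8s.job.name
        - k8s.node.name
  batch:
    send_batch_size: 1024
    send_batch_max_size: 2048
    timeout: 10s

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector.jaeger.svc.cluster.local:4317
    tls:
      insecure: true              # assumption: plaintext in-cluster gRPC
    sending_queue:
      enabled: true
      queue_size: 1000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 10s
      max_elapsed_time: 60s
  prometheus:
    endpoint: 0.0.0.0:8889
    resource_to_telemetry_conversion:
      enabled: true
  otlphttp/loki:
    # The exporter appends /v1/logs to this base URL
    endpoint: http://monitoring-loki.monitoring.svc.cluster.local:3100/otlp
  debug:
    verbosity: basic

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679
```

With a 512Mi container limit, the percentages work out to a soft limit of roughly 282Mi, so the Collector starts refusing data well before the kernel would kill the process.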
Architecture¶
Telemetry Pipeline Topology¶
```mermaid
graph TD
    subgraph opentelemetry["Namespace: opentelemetry"]
        OTEL["opentelemetry-collector<br/>(Deployment)"]
    end
    subgraph jaeger-ns["Namespace: jaeger"]
        JAEGER["jaeger-collector"]
    end
    subgraph monitoring["Namespace: monitoring"]
        LOKI["monitoring-loki"]
        PROM["kube-prometheus-stack"]
    end
    APP["Application Workloads<br/>(OTLP clients)"]
    AGW["agentgateway-config<br/>(downstream)"]
    APP -->|"OTLP gRPC :4317"| OTEL
    APP -->|"OTLP HTTP :4318"| OTEL
    APP -->|"Jaeger thrift_http :14268"| OTEL
    APP -->|"Zipkin :9411"| OTEL
    OTEL -->|"OTLP gRPC :4317<br/>traces"| JAEGER
    OTEL -->|"OTLP HTTP :3100/otlp<br/>logs"| LOKI
    PROM -->|"scrape :8889<br/>metrics"| OTEL
    OTEL -.->|"config dependency"| AGW
```
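The inbound edges of this topology correspond to a receivers block along the following lines. Ports are the ones listed under Features; the `0.0.0.0` bind addresses, the scrape job name, and the scrape interval are assumptions.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
      thrift_compact:
        endpoint: 0.0.0.0:6831    # UDP
      thrift_binary:
        endpoint: 0.0.0.0:6832    # UDP
  zipkin:
    endpoint: 0.0.0.0:9411
  prometheus:
    config:
      scrape_configs:
        - job_name: opentelemetry-collector   # assumed name; scrapes the Collector's own metrics
          scrape_interval: 30s                # assumed interval
          static_configs:
            - targets: ["localhost:8888"]
```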
Signal Processing Flow¶
```mermaid
graph LR
    subgraph receivers["Receivers"]
        R_OTLP["otlp<br/>gRPC + HTTP"]
        R_JAEGER["jaeger<br/>gRPC + thrift"]
        R_ZIPKIN["zipkin"]
        R_PROM["prometheus<br/>self-scrape :8888"]
    end
    subgraph processors["Processors (ordered)"]
        P_MEM["memory_limiter<br/>80% limit"]
        P_K8S["k8sattributes<br/>pod/ns/deploy metadata"]
        P_BATCH["batch<br/>1024 size, 10s timeout"]
    end
    subgraph exporters["Exporters"]
        E_JAEGER["otlp/jaeger<br/>→ jaeger-collector:4317"]
        E_PROM["prometheus<br/>:8889 scrape endpoint"]
        E_LOKI["otlphttp/loki<br/>→ loki:3100/otlp"]
        E_DEBUG["debug<br/>basic verbosity"]
    end
    R_OTLP --> P_MEM
    R_JAEGER --> P_MEM
    R_ZIPKIN --> P_MEM
    R_PROM --> P_MEM
    P_MEM --> P_K8S
    P_K8S --> P_BATCH
    P_BATCH -->|"traces"| E_JAEGER
    P_BATCH -->|"metrics"| E_PROM
    P_BATCH -->|"metrics"| E_DEBUG
    P_BATCH -->|"logs"| E_LOKI
    P_BATCH -->|"logs"| E_DEBUG
```
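In Collector configuration, this flow is wired together in the `service` section. The sketch below follows the diagram's processor order and exporter fan-out. The assignment of receivers to pipelines is inferred from which signals each receiver can produce, and any receivers added by the chart's Kubernetes presets are left out.

```yaml
service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp, jaeger, zipkin]
      # Order matters: memory_limiter first, batch last
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [prometheus, debug]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlphttp/loki, debug]
```

Placing memory_limiter first lets it refuse data before any other processor allocates memory for it, and placing batch last means enrichment happens before items are grouped for export.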
Configuration¶
All values are sourced from base/services/environment.env (base); per-environment overrides live in clusters/stages/dev/.../environment.env.
| Parameter | Dev | Prod |
|---|---|---|
| OTEL_COLLECTOR_CHART_VERSION | 0.108.0 | 0.108.0 |
| OTEL_COLLECTOR_CPU_LIMIT | 500m | 500m |
| OTEL_COLLECTOR_CPU_REQUEST | 100m | 100m |
| OTEL_COLLECTOR_MEMORY_LIMIT | 512Mi | 512Mi |
| OTEL_COLLECTOR_MEMORY_REQUEST | 256Mi | 256Mi |
| OTEL_COLLECTOR_REPLICAS | 1 | 1 |
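One plausible way these variables reach the chart is Flux post-build substitution into the HelmRelease values, sketched below. The variable names are the ones in the table, and the value keys are those of the opentelemetry-collector chart; the HelmRepository name and the overall layout of the manifest are assumptions.

```yaml
# Hypothetical fragment of the HelmRelease under apps/base/opentelemetry-collector/
spec:
  chart:
    spec:
      chart: opentelemetry-collector
      version: "${OTEL_COLLECTOR_CHART_VERSION}"
      sourceRef:
        kind: HelmRepository
        name: open-telemetry            # assumed repository name
  values:
    mode: deployment
    replicaCount: ${OTEL_COLLECTOR_REPLICAS}
    image:
      repository: otel/opentelemetry-collector-contrib
    presets:
      kubernetesAttributes:
        enabled: true
      clusterMetrics:
        enabled: true
      kubeletMetrics:
        enabled: true
    resources:
      requests:
        cpu: ${OTEL_COLLECTOR_CPU_REQUEST}
        memory: ${OTEL_COLLECTOR_MEMORY_REQUEST}
      limits:
        cpu: ${OTEL_COLLECTOR_CPU_LIMIT}
        memory: ${OTEL_COLLECTOR_MEMORY_LIMIT}
```

Because memory_limiter is configured in percentages, changing OTEL_COLLECTOR_MEMORY_LIMIT shifts its hard and soft limits without further configuration changes.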
Operations¶
Collector OOMKilled under burst traffic¶
Symptoms: Pod restarts with OOMKilled reason. kubectl describe pod shows last termination reason as OOMKilled. Alert: KubePodCrashLooping or KubeContainerOOMKilled firing for opentelemetry-collector in namespace opentelemetry. Collector internal metric otelcol_processor_refused_spans or otelcol_processor_refused_metric_points spiking before the kill.
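```shell
# Confirm the last termination reason and exit code
kubectl -n opentelemetry describe pod -l app.kubernetes.io/name=opentelemetry-collector | grep -A5 'Last State'
# Compare current memory usage against the 512Mi limit
kubectl -n opentelemetry top pod -l app.kubernetes.io/name=opentelemetry-collector
# Look for memory_limiter activity in the logs of the previous (killed) container
kubectl -n opentelemetry logs -l app.kubernetes.io/name=opentelemetry-collector --previous | grep -i 'memory_limiter'
# Show the memory_limiter settings defined in the HelmRelease values
kubectl -n flux-system get helmrelease opentelemetry-collector -o jsonpath='{.spec.values.config.processors.memory_limiter}'
# Inspect recent spans via zpages (stop the background port-forward afterwards with: kill %1)
kubectl -n opentelemetry port-forward svc/opentelemetry-collector 55679:55679 & sleep 2 && curl -s http://localhost:55679/debug/tracez | head -50
```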
Traces not reaching Jaeger — exporter queue saturation¶
Symptoms: No new traces visible in Jaeger UI. Collector logs show sending queue is full or context deadline exceeded for the otlp/jaeger exporter. Metric otelcol_exporter_queue_size{exporter="otlp/jaeger"} at 1000 (queue capacity). Metric otelcol_exporter_send_failed_spans{exporter="otlp/jaeger"} incrementing.
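```shell
# Filter recent collector logs for exporter queue and deadline errors
kubectl -n opentelemetry logs -l app.kubernetes.io/name=opentelemetry-collector --tail=100 | grep -E 'otlp/jaeger|queue|deadline'
# Check that the Jaeger collector pods are running and scheduled
kubectl -n jaeger get pods -l app.kubernetes.io/component=collector -o wide
# Read the exporter's own metrics (requires wget in the collector image)
kubectl -n opentelemetry exec deploy/opentelemetry-collector -- wget -qO- http://localhost:8888/metrics | grep 'otelcol_exporter.*jaeger'
# Probe the Jaeger OTLP port; any reply other than "connection refused" or a timeout means it is reachable
kubectl -n opentelemetry exec deploy/opentelemetry-collector -- wget -qO- --spider http://jaeger-collector.jaeger.svc.cluster.local:4317 2>&1 | head -5
# Check the Jaeger side for ingestion errors
kubectl -n jaeger logs -l app.kubernetes.io/component=collector --tail=50 | grep -i error
```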
Logs not reaching Loki — OTLP HTTP export failures¶
Symptoms: No new logs in Grafana/Loki. Collector logs show connection refused or 5xx errors for the otlphttp/loki exporter. Metric otelcol_exporter_send_failed_log_records{exporter="otlphttp/loki"} incrementing.
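```shell
# Filter recent collector logs for Loki exporter errors
kubectl -n opentelemetry logs -l app.kubernetes.io/name=opentelemetry-collector --tail=100 | grep -E 'otlphttp/loki|loki|3100'
# Check that the Loki pods are running and scheduled
kubectl -n monitoring get pods -l app.kubernetes.io/name=loki -o wide
# Probe Loki's readiness endpoint from the collector pod (requires wget in the collector image)
kubectl -n opentelemetry exec deploy/opentelemetry-collector -- wget -qO- --spider http://monitoring-loki.monitoring.svc.cluster.local:3100/ready 2>&1
# Check Loki for OTLP ingestion errors or rejected writes
kubectl -n monitoring logs -l app.kubernetes.io/name=loki --tail=50 | grep -iE 'otlp|error|reject'
# Read the Loki exporter's own counters
kubectl -n opentelemetry exec deploy/opentelemetry-collector -- wget -qO- http://localhost:8888/metrics | grep 'otelcol_exporter.*loki'
```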
Collector pod CrashLoopBackOff after config change¶
Symptoms: Pod enters CrashLoopBackOff immediately after HelmRelease reconciliation. Logs show cannot unmarshal or failed to resolve config errors during startup. kubectl -n flux-system get helmrelease opentelemetry-collector shows upgrade remediation exhausted.
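```shell
# Read the startup error from the crashing container
kubectl -n opentelemetry logs -l app.kubernetes.io/name=opentelemetry-collector --tail=30
# Check the HelmRelease status conditions for the failed upgrade
kubectl -n flux-system get helmrelease opentelemetry-collector -o jsonpath='{.status.conditions[*].message}'
# Check whether the Flux Kustomization itself reconciled
kubectl -n flux-system get kustomization opentelemetry-collector -o jsonpath='{.status.conditions[*].message}'
# List recent namespace events in chronological order
kubectl -n opentelemetry get events --sort-by='.lastTimestamp' | tail -20
# Verify the substituted OTEL_COLLECTOR_* values
kubectl -n flux-system get configmap cluster-vars -o yaml | grep OTEL_COLLECTOR
```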
ServiceMonitor not scraping — metrics missing from Prometheus¶
Symptoms: No otelcol_* metrics in Prometheus. ServiceMonitor exists but Prometheus targets page shows the collector endpoint as down or missing entirely.
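```shell
# Inspect the ServiceMonitor's selector, port name, and labels
kubectl -n opentelemetry get servicemonitor -l app.kubernetes.io/name=opentelemetry-collector -o yaml
# Confirm the Service endpoints expose the port the ServiceMonitor references
kubectl -n opentelemetry get endpoints -l app.kubernetes.io/name=opentelemetry-collector -o yaml | grep -A5 'ports'
# Check that the exporter endpoint serves metrics (requires wget in the collector image)
kubectl -n opentelemetry exec deploy/opentelemetry-collector -- wget -qO- http://localhost:8889/metrics | head -20
# Query target health; the operator runs Prometheus as a StatefulSet (name assumes default kube-prometheus-stack naming)
kubectl -n monitoring exec sts/prometheus-kube-prometheus-stack-prometheus -- promtool query instant http://localhost:9090 'up{job=~".*opentelemetry.*"}'
# Look for scrape errors in the Prometheus logs
kubectl -n monitoring logs -l app.kubernetes.io/name=prometheus --tail=50 | grep -i 'opentelemetry\|scrape.*error'
```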
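When reading the ServiceMonitor output, it helps to know what a working resource would roughly look like. The manifest below is a sketch for comparison, not the chart's rendered output: the 30s interval and port 8889 come from this page, while the port name and the `release` label are assumptions that depend on the chart values and on the Prometheus `serviceMonitorSelector`.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: opentelemetry-collector
  namespace: opentelemetry
  labels:
    app.kubernetes.io/name: opentelemetry-collector
    release: kube-prometheus-stack     # assumption: label matched by the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: opentelemetry-collector
  namespaceSelector:
    matchNames:
      - opentelemetry
  endpoints:
    - port: prom-exporter              # hypothetical name of the Service port mapped to 8889
      interval: 30s
      path: /metrics
```

A target that is missing entirely usually points at the labels or selectors. A target that is present but down usually points at the port name or the exporter endpoint itself.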
k8sattributes processor failing — missing Kubernetes metadata on signals¶
Symptoms: Traces/metrics/logs arrive at backends but lack k8s.namespace.name, k8s.pod.name, and other Kubernetes resource attributes. Collector logs may show kube API errors or RBAC denied messages.
```shell
# Search collector logs for Kubernetes API or RBAC errors
kubectl -n opentelemetry logs -l app.kubernetes.io/name=opentelemetry-collector | grep -iE 'k8sattributes|kube.*api|forbidden|unauthorized'
# Confirm the ServiceAccount exists
kubectl -n opentelemetry get serviceaccount -l app.kubernetes.io/name=opentelemetry-collector -o name
# Inspect which ClusterRole is bound to the collector
kubectl get clusterrolebinding -l app.kubernetes.io/name=opentelemetry-collector -o yaml | grep -A10 'roleRef'
# Verify the ServiceAccount may list pods cluster-wide
kubectl auth can-i list pods --as=system:serviceaccount:opentelemetry:opentelemetry-collector --all-namespaces
# Verify the ServiceAccount may read nodes
kubectl auth can-i get nodes --as=system:serviceaccount:opentelemetry:opentelemetry-collector
```
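If the `can-i` checks return `no`, the permissions the k8sattributes processor generally needs look like the sketch below. The chart's kubernetesAttributes preset normally creates an equivalent role, so this is for comparison only; the resource names are assumptions, while the ServiceAccount and namespace match the commands above.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: opentelemetry-collector          # assumed name
rules:
  # Pod, namespace, and node metadata
  - apiGroups: [""]
    resources: ["pods", "namespaces", "nodes"]
    verbs: ["get", "list", "watch"]
  # ReplicaSets are used to resolve a pod's owning Deployment
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: opentelemetry-collector          # assumed name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: opentelemetry-collector
subjects:
  - kind: ServiceAccount
    name: opentelemetry-collector
    namespace: opentelemetry
```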
Related¶
- apps/base/opentelemetry-collector/: Kubernetes manifests
- base/services/opentelemetry-collector.yaml: Flux Kustomization
- base/services/environment.env: environment variables
Generated from service-catalog.json at commit 165b485 · catalog sha 4d088b0b3a67b4c4