OpenTelemetry Collector¶

OpenTelemetry Collector is a vendor-agnostic telemetry pipeline that receives, processes, and exports observability signals (traces, metrics, and logs) through a configurable receiver → processor → exporter architecture. Unlike application-embedded exporters that couple services to specific backends, the Collector acts as a protocol-normalizing intermediary: applications emit OTLP once, and the Collector handles fan-out, enrichment, and format translation.

This deployment uses the contrib distribution, which bundles the full ecosystem of community receivers and exporters (Jaeger, Zipkin, Prometheus, Loki OTLP) in a single binary. It is heavier than a custom-built distribution but removes the need to maintain a custom builder pipeline, an acceptable trade-off for a homelab where binary size is not a constraint.

Three properties distinguish the OTel Collector from similar pipeline tools (Grafana Agent/Alloy, Fluentd, Vector): it implements the OTLP standard natively, it supports all three signal types in a single process with shared processor chains, and it offers first-class Kubernetes metadata enrichment via the k8sattributes processor, which decorates every span, metric, and log line with pod, namespace, and deployment context without application-level instrumentation changes.

Overview¶

Property	Value
Namespace	`opentelemetry`
Type	HelmRelease (chart: `opentelemetry-collector` v0.108.0)
Layer	Distributed tracing services
Chart	`opentelemetry-collector` v0.108.0
Status	Enabled
Source	`apps/base/opentelemetry-collector/`

Dependencies¶

Upstream — required before OpenTelemetry Collector starts¶

Service	Reason	Status
`jaeger`	Flux `dependsOn`	Active
`loki`	Flux `dependsOn`	Active
`kube-prometheus-stack`	Flux `dependsOn`	Active

Downstream — services that depend on OpenTelemetry Collector¶

Service	Dependency type	Reason
`agentgateway-config`	Flux `dependsOn`	Requires OpenTelemetry Collector

Purpose¶

The OpenTelemetry Collector serves as the platform's single telemetry ingestion point. All application workloads — including the kagent multi-agent system and supporting services — emit telemetry to one endpoint (opentelemetry-collector:4317) using OTLP. The Collector then routes traces to Jaeger, metrics to Prometheus (via a scrape-compatible exporter endpoint), and logs to Loki using native OTLP HTTP ingestion.

This decouples every application from backend knowledge: services never import Jaeger or Loki client libraries, never hardcode backend endpoints, and remain unaffected when backends are swapped or reconfigured. The Collector also enriches all signals with Kubernetes metadata (pod name, namespace, deployment, node) through the k8sattributes processor, providing consistent correlation dimensions across traces, metrics, and logs without per-service instrumentation effort.
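
As an illustration of what a single ingestion endpoint means for a workload, the sketch below sets the standard OpenTelemetry SDK environment variables so that an application exports over OTLP gRPC to the Collector. The fully qualified in-cluster DNS name and the service name `example-agent` are assumptions made for this example; only `opentelemetry-collector:4317` and the `opentelemetry` namespace come from this page.

```shell
# Standard OpenTelemetry SDK environment variables (defined by the OTel specification).
# Assumption: the Collector Service resolves at this in-cluster DNS name.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://opentelemetry-collector.opentelemetry.svc.cluster.local:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
# Hypothetical service name, used only for this example.
export OTEL_SERVICE_NAME="example-agent"

# Print what an SDK started from this shell would pick up.
env | grep '^OTEL_' | sort
```

In a workload manifest the same three variables would sit in the container's `env` list. No Jaeger, Prometheus, or Loki settings appear anywhere in the application's configuration.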

Why OTel Collector over direct-to-backend or Grafana Agent: Direct emission (apps → Jaeger for traces, Promtail → Loki for logs) was the initial approach but created tight coupling — every service needed backend-specific client libraries and endpoint configuration. Migrating a single backend meant touching every service.

Grafana Agent (now Alloy) was considered as a unified pipeline. It integrates well with the Grafana stack already deployed here, but it uses its own configuration language (River) instead of the Collector's YAML format, and its OTLP support trails the reference implementation. The OTel Collector implements the OTLP standard natively, has broader community momentum, and avoids deeper lock-in to the Grafana ecosystem, which matters because trace and metric backends may change independently of the visualization layer.

Features¶

Feature	Detail
Multi-protocol ingestion	Accepts OTLP (gRPC :4317, HTTP :4318), Jaeger (gRPC :14250, thrift_http :14268, thrift_compact :6831, thrift_binary :6832), Zipkin (:9411), and Prometheus self-scrape — enabling incremental migration from legacy instrumentation without application changes.
Kubernetes metadata enrichment	The k8sattributes processor uses serviceAccount auth to extract pod, deployment, statefulset, daemonset, cronjob, job, node, and namespace metadata plus app.kubernetes.io labels, attaching them to all telemetry signals automatically.
Memory-bounded processing	The memory_limiter processor (80% limit, 25% spike allowance, 5s check interval) reduces the risk of OOM kills under burst load by refusing incoming data, which pushes backpressure onto receivers, before the process exceeds its cgroup memory allocation.
Batched export with retry	The batch processor (batch size 1024, maximum 2048, 10s timeout) amortizes export overhead; the otlp/jaeger exporter adds a sending queue of 1000 entries and exponential-backoff retry (1s initial interval, 10s maximum interval, 60s maximum elapsed time) to absorb transient backend unavailability.
Three-signal pipeline routing	Traces are exported to Jaeger over OTLP gRPC, metrics are exposed on a Prometheus-scrapable endpoint at :8889 with resource-to-telemetry conversion, and logs are pushed to Loki over native OTLP HTTP. All three pipelines share the same processor chain.
Kubernetes presets enabled	kubernetesAttributes, clusterMetrics, and kubeletMetrics presets are active, providing cluster-level resource metrics and kubelet stats without manual receiver configuration.
ServiceMonitor integration	A ServiceMonitor resource exposes the Prometheus exporter port (:8889) with 30s scrape interval, enabling kube-prometheus-stack to discover and scrape collector-processed metrics automatically.
Diagnostic extensions	health_check (:13133), pprof (:1777), and zpages (:55679) extensions are enabled for liveness probing, CPU/memory profiling, and pipeline-level debug visualization respectively.
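
To make the memory_limiter percentages concrete, the following sketch works out approximate thresholds against the 512Mi container limit listed under Configuration. It assumes the processor is configured through the percentage-based keys of the upstream processor (`limit_percentage`, `spike_limit_percentage`), that both percentages apply to total available memory, and that the soft limit is the hard limit minus the spike allowance. Integer division rounds the results down.

```shell
# Inputs: container memory limit (MiB) and the percentages from the Features table.
total_mib=512
limit_pct=80
spike_pct=25

# Hard limit: above this the processor also forces garbage collection.
hard_mib=$(( total_mib * limit_pct / 100 ))
# Spike allowance: growth expected between two consecutive 5s checks.
spike_mib=$(( total_mib * spike_pct / 100 ))
# Soft limit: above this the processor starts refusing incoming data.
soft_mib=$(( hard_mib - spike_mib ))

echo "hard=${hard_mib}MiB spike=${spike_mib}MiB soft=${soft_mib}MiB"
```

Under these assumptions the Collector begins refusing data at roughly 281 MiB, well below the 512Mi cgroup limit, which leaves headroom for the Go runtime and in-flight batches.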

Architecture¶

Telemetry Pipeline Topology¶

```mermaid
graph TD
    subgraph opentelemetry["Namespace: opentelemetry"]
        OTEL["opentelemetry-collector<br/>(Deployment)"]
    end

    subgraph jaeger-ns["Namespace: jaeger"]
        JAEGER["jaeger-collector"]
    end

    subgraph monitoring["Namespace: monitoring"]
        LOKI["monitoring-loki"]
        PROM["kube-prometheus-stack"]
    end

    APP["Application Workloads<br/>(OTLP clients)"]
    AGW["agentgateway-config<br/>(downstream)"]

    APP -->|"OTLP gRPC :4317"| OTEL
    APP -->|"OTLP HTTP :4318"| OTEL
    APP -->|"Jaeger thrift_http :14268"| OTEL
    APP -->|"Zipkin :9411"| OTEL

    OTEL -->|"OTLP gRPC :4317<br/>traces"| JAEGER
    OTEL -->|"OTLP HTTP :3100/otlp<br/>logs"| LOKI
    PROM -->|"scrape :8889<br/>metrics"| OTEL

    OTEL -.->|"config dependency"| AGW
```

Signal Processing Flow¶

```mermaid
graph LR
    subgraph receivers["Receivers"]
        R_OTLP["otlp<br/>gRPC + HTTP"]
        R_JAEGER["jaeger<br/>gRPC + thrift"]
        R_ZIPKIN["zipkin"]
        R_PROM["prometheus<br/>self-scrape :8888"]
    end

    subgraph processors["Processors (ordered)"]
        P_MEM["memory_limiter<br/>80% limit"]
        P_K8S["k8sattributes<br/>pod/ns/deploy metadata"]
        P_BATCH["batch<br/>1024 size, 10s timeout"]
    end

    subgraph exporters["Exporters"]
        E_JAEGER["otlp/jaeger<br/>→ jaeger-collector:4317"]
        E_PROM["prometheus<br/>:8889 scrape endpoint"]
        E_LOKI["otlphttp/loki<br/>→ loki:3100/otlp"]
        E_DEBUG["debug<br/>basic verbosity"]
    end

    R_OTLP --> P_MEM
    R_JAEGER --> P_MEM
    R_ZIPKIN --> P_MEM
    R_PROM --> P_MEM

    P_MEM --> P_K8S
    P_K8S --> P_BATCH

    P_BATCH -->|"traces"| E_JAEGER
    P_BATCH -->|"metrics"| E_PROM
    P_BATCH -->|"metrics"| E_DEBUG
    P_BATCH -->|"logs"| E_LOKI
    P_BATCH -->|"logs"| E_DEBUG
```
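
The flow above corresponds to the `service.pipelines` section of a Collector configuration. The sketch below writes a hypothetical version of that section to a scratch file and checks that all three pipelines declare the same ordered processor chain. Component names are taken from the diagram; which receivers feed which signal is an assumption, since the diagram does not split receivers by pipeline.

```shell
# Scratch file for the sketch; nothing here touches the cluster.
sketch=$(mktemp)

# Hypothetical service.pipelines section mirroring the diagram above.
# Assumption: the receiver-to-signal assignment is inferred, not confirmed.
cat > "$sketch" <<'EOF'
service:
  pipelines:
    traces:
      receivers: [otlp, jaeger, zipkin]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [prometheus, debug]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlphttp/loki, debug]
EOF

# Count the pipelines that declare the shared, ordered processor chain.
shared_chains=$(grep -c 'processors: \[memory_limiter, k8sattributes, batch\]' "$sketch")
echo "pipelines sharing the processor chain: ${shared_chains}"
rm -f "$sketch"
```

Order matters in each `processors` list: memory_limiter comes first so that data is refused before any enrichment or batching work is spent on it.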

Configuration¶

All values are sourced from base/services/environment.env (base); per-environment overrides live in clusters/stages/dev/.../environment.env.

Parameter	Dev	Prod
`OTEL_COLLECTOR_CHART_VERSION`	`0.108.0`	`0.108.0`
`OTEL_COLLECTOR_CPU_LIMIT`	`500m`	`500m`
`OTEL_COLLECTOR_CPU_REQUEST`	`100m`	`100m`
`OTEL_COLLECTOR_MEMORY_LIMIT`	`512Mi`	`512Mi`
`OTEL_COLLECTOR_MEMORY_REQUEST`	`256Mi`	`256Mi`
`OTEL_COLLECTOR_REPLICAS`	`1`	`1`
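
The parameters above reach the chart as Helm values. The sketch below imitates that substitution with a shell heredoc: it assigns the Dev column values and renders a values fragment. The mapping of each variable onto `replicaCount` and `resources` is an assumption about how the HelmRelease is templated and is not confirmed by this page.

```shell
# Dev column values from the Configuration table.
OTEL_COLLECTOR_REPLICAS=1
OTEL_COLLECTOR_CPU_REQUEST=100m
OTEL_COLLECTOR_CPU_LIMIT=500m
OTEL_COLLECTOR_MEMORY_REQUEST=256Mi
OTEL_COLLECTOR_MEMORY_LIMIT=512Mi

# Render a hypothetical Helm values fragment. The unquoted heredoc expands
# the ${VAR} placeholders, similar in spirit to Flux variable substitution.
rendered=$(cat <<EOF
replicaCount: ${OTEL_COLLECTOR_REPLICAS}
resources:
  requests:
    cpu: ${OTEL_COLLECTOR_CPU_REQUEST}
    memory: ${OTEL_COLLECTOR_MEMORY_REQUEST}
  limits:
    cpu: ${OTEL_COLLECTOR_CPU_LIMIT}
    memory: ${OTEL_COLLECTOR_MEMORY_LIMIT}
EOF
)
echo "$rendered"
```

Because Dev and Prod currently carry identical values, the rendered fragment would be the same for both environments.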

Operations¶

Collector OOMKilled under burst traffic¶

Symptoms: Pod restarts with OOMKilled reason. kubectl describe pod shows last termination reason as OOMKilled. Alert: KubePodCrashLooping or KubeContainerOOMKilled firing for opentelemetry-collector in namespace opentelemetry. Collector internal metric otelcol_processor_refused_spans or otelcol_processor_refused_metric_points spiking before the kill.

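```shell
# Confirm the last termination reason and compare live usage against the limit.
kubectl -n opentelemetry describe pod -l app.kubernetes.io/name=opentelemetry-collector | grep -A5 'Last State'
kubectl -n opentelemetry top pod -l app.kubernetes.io/name=opentelemetry-collector
# Look for memory_limiter activity in the logs of the killed container.
kubectl -n opentelemetry logs -l app.kubernetes.io/name=opentelemetry-collector --previous | grep -i 'memory_limiter'
# Inspect the limiter settings configured on the HelmRelease.
kubectl -n flux-system get helmrelease opentelemetry-collector -o jsonpath='{.spec.values.config.processors.memory_limiter}'
# Browse zpages through a temporary port-forward, then stop the forward.
kubectl -n opentelemetry port-forward svc/opentelemetry-collector 55679:55679 & PF_PID=$!
sleep 2 && curl -s http://localhost:55679/debug/tracez | head -50; kill "$PF_PID"
```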

Traces not reaching Jaeger — exporter queue saturation¶

Symptoms: No new traces visible in Jaeger UI. Collector logs show sending queue is full or context deadline exceeded for the otlp/jaeger exporter. Metric otelcol_exporter_queue_size{exporter="otlp/jaeger"} at 1000 (queue capacity). Metric otelcol_exporter_send_failed_spans{exporter="otlp/jaeger"} incrementing.

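```shell
# Collector-side errors for the otlp/jaeger exporter, then Jaeger collector pod health.
kubectl -n opentelemetry logs -l app.kubernetes.io/name=opentelemetry-collector --tail=100 | grep -E 'otlp/jaeger|queue|deadline'
kubectl -n jaeger get pods -l app.kubernetes.io/component=collector -o wide
# Exporter metrics via port-forward (the upstream contrib image is built from scratch, so exec-based wget checks are likely to fail).
kubectl -n opentelemetry port-forward deploy/opentelemetry-collector 8888:8888 & PF_PID=$!
sleep 2 && curl -s http://localhost:8888/metrics | grep 'otelcol_exporter.*jaeger'; kill "$PF_PID"
# TCP reachability of the Jaeger OTLP gRPC port from a throwaway pod (net-debug is an arbitrary name; an HTTP probe cannot speak gRPC).
kubectl -n opentelemetry run net-debug --rm -i --restart=Never --image=busybox:1.36 -- nc -zv -w 3 jaeger-collector.jaeger.svc.cluster.local 4317
kubectl -n jaeger logs -l app.kubernetes.io/component=collector --tail=50 | grep -i error
```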
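
How long the exporter keeps retrying a failed batch follows from the retry settings listed under Features (1s initial interval, 10s maximum interval, 60s maximum elapsed time). The sketch below computes an approximate schedule. It assumes the upstream default backoff multiplier of 1.5 and ignores the random jitter the real implementation applies, so actual timings will differ.

```shell
# Retry settings from the Features table. The 1.5 multiplier is the upstream
# default and an assumption here; jitter is ignored to keep this deterministic.
schedule=$(awk 'BEGIN {
  delay = 1; max_delay = 10; budget = 60; mult = 1.5
  elapsed = 0; retries = 0
  # Keep retrying while the next delay still fits inside the elapsed-time budget.
  while (elapsed + delay <= budget) {
    elapsed += delay; retries++
    printf "retry %d after %.2fs (elapsed %.2fs)\n", retries, delay, elapsed
    delay = delay * mult
    if (delay > max_delay) delay = max_delay
  }
  printf "retries=%d total_wait=%.2fs\n", retries, elapsed
}')
echo "$schedule"
```

Under these assumptions a batch is abandoned after about a minute of failed retries. A Jaeger outage longer than that depends on the sending queue, and spans are dropped once the queue reaches its capacity of 1000.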

Logs not reaching Loki — OTLP HTTP export failures¶

Symptoms: No new logs in Grafana/Loki. Collector logs show connection refused or 5xx errors for the otlphttp/loki exporter. Metric otelcol_exporter_send_failed_log_records{exporter="otlphttp/loki"} incrementing.

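```shell
# Collector-side errors for the otlphttp/loki exporter, then Loki pod health.
kubectl -n opentelemetry logs -l app.kubernetes.io/name=opentelemetry-collector --tail=100 | grep -E 'otlphttp/loki|loki|3100'
kubectl -n monitoring get pods -l app.kubernetes.io/name=loki -o wide
# Loki readiness from a throwaway pod in the collector's namespace (net-debug is an arbitrary name; the upstream contrib image has no wget).
kubectl -n opentelemetry run net-debug --rm -i --restart=Never --image=busybox:1.36 -- wget -qO- http://monitoring-loki.monitoring.svc.cluster.local:3100/ready
kubectl -n monitoring logs -l app.kubernetes.io/name=loki --tail=50 | grep -iE 'otlp|error|reject'
# Exporter metrics through a temporary port-forward.
kubectl -n opentelemetry port-forward deploy/opentelemetry-collector 8888:8888 & PF_PID=$!
sleep 2 && curl -s http://localhost:8888/metrics | grep 'otelcol_exporter.*loki'; kill "$PF_PID"
```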

Collector pod CrashLoopBackOff after config change¶

Symptoms: Pod enters CrashLoopBackOff immediately after HelmRelease reconciliation. Logs show cannot unmarshal or failed to resolve config errors during startup. kubectl -n flux-system get helmrelease opentelemetry-collector shows upgrade remediation exhausted.

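```shell
# Startup errors from the current container (add --previous to read the crashed one).
kubectl -n opentelemetry logs -l app.kubernetes.io/name=opentelemetry-collector --tail=30
# Flux reconciliation status for the HelmRelease and its Kustomization.
kubectl -n flux-system get helmrelease opentelemetry-collector -o jsonpath='{.status.conditions[*].message}'
kubectl -n flux-system get kustomization opentelemetry-collector -o jsonpath='{.status.conditions[*].message}'
# Recent namespace events, then the substituted OTEL_COLLECTOR_* variables.
kubectl -n opentelemetry get events --sort-by='.lastTimestamp' | tail -20
kubectl -n flux-system get configmap cluster-vars -o yaml | grep OTEL_COLLECTOR
```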

See also: docs/adr/010-opentelemetry-collector.md

ServiceMonitor not scraping — metrics missing from Prometheus¶

Symptoms: No otelcol_* metrics in Prometheus. ServiceMonitor exists but Prometheus targets page shows the collector endpoint as down or missing entirely.

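```shell
# ServiceMonitor definition and the ports actually published on the Service endpoints.
kubectl -n opentelemetry get servicemonitor -l app.kubernetes.io/name=opentelemetry-collector -o yaml
kubectl -n opentelemetry get endpoints -l app.kubernetes.io/name=opentelemetry-collector -o yaml | grep -A5 'ports'
# Exporter endpoint via port-forward (the upstream contrib image is built from scratch, so exec-based wget checks are likely to fail).
kubectl -n opentelemetry port-forward deploy/opentelemetry-collector 8889:8889 & PF_PID=$!
sleep 2 && curl -s http://localhost:8889/metrics | head -20; kill "$PF_PID"
# Query Prometheus itself. It runs as a StatefulSet; the name below assumes a Helm release called kube-prometheus-stack.
kubectl -n monitoring exec sts/prometheus-kube-prometheus-stack-prometheus -c prometheus -- promtool query instant http://localhost:9090 'up{job=~".*opentelemetry.*"}'
kubectl -n monitoring logs -l app.kubernetes.io/name=prometheus --tail=50 | grep -iE 'opentelemetry|scrape.*error'
```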

k8sattributes processor failing — missing Kubernetes metadata on signals¶

Symptoms: Traces/metrics/logs arrive at backends but lack k8s.namespace.name, k8s.pod.name, and other Kubernetes resource attributes. Collector logs may show kube API errors or RBAC denied messages.

```shell
# Collector-side errors from the processor or the Kubernetes API client.
kubectl -n opentelemetry logs -l app.kubernetes.io/name=opentelemetry-collector | grep -iE 'k8sattributes|kube.*api|forbidden|unauthorized'
# ServiceAccount and the ClusterRole bound to it.
kubectl -n opentelemetry get serviceaccount -l app.kubernetes.io/name=opentelemetry-collector -o name
kubectl get clusterrolebinding -l app.kubernetes.io/name=opentelemetry-collector -o yaml | grep -A10 'roleRef'
# Impersonate the ServiceAccount to verify the permissions the processor needs.
kubectl auth can-i list pods --as=system:serviceaccount:opentelemetry:opentelemetry-collector --all-namespaces
kubectl auth can-i get nodes --as=system:serviceaccount:opentelemetry:opentelemetry-collector
```

apps/base/opentelemetry-collector/ — Kubernetes manifests
base/services/opentelemetry-collector.yaml — Flux Kustomization
base/services/environment.env — environment variables

Generated from service-catalog.json at commit afaa467 · catalog sha c57d485ae12bc679