Observability
Metrics, logging, and tracing for distributed systems.
Three Pillars
graph TB
    subgraph "Observability"
        M[Metrics]
        L[Logs]
        T[Traces]
    end
    M -->|Prometheus| G[Grafana]
    L -->|Loki| G
    T -->|Tempo| G
Metrics Strategy
USE Method (Resources)
| Metric | Description | Example |
|---|---|---|
| Utilization | % of time the resource is busy | CPU usage |
| Saturation | Queued work the resource cannot yet serve | Run-queue length, request backlog |
| Errors | Count of error events | Disk I/O errors, dropped packets |
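As a rough illustration of the table above, the three USE signals for a single resource can be derived from raw counters collected over one sampling interval. The `ResourceSample` type, its field names, and `use_summary` are hypothetical and not taken from any monitoring library.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """Raw counters for one resource over one sampling interval (hypothetical shape)."""
    busy_seconds: float      # time the resource spent doing work
    interval_seconds: float  # length of the sampling interval
    queued_items: int        # work waiting because the resource was busy
    error_count: int         # error events reported by the resource


def use_summary(sample: ResourceSample) -> dict:
    """Return utilization (percent), saturation (queue depth), and errors."""
    if sample.interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    utilization = 100.0 * sample.busy_seconds / sample.interval_seconds
    return {
        "utilization_pct": round(utilization, 1),
        "saturation": sample.queued_items,
        "errors": sample.error_count,
    }


# A resource busy for 45 s of a 60 s interval, with 3 items queued and no errors.
summary = use_summary(ResourceSample(45.0, 60.0, 3, 0))
print(summary)
```

In practice these counters would come from the operating system or an exporter; the sketch only shows how the three columns relate to raw measurements.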
RED Method (Services)
| Metric | Description | Example |
|---|---|---|
| Rate | Requests per second | HTTP RPS |
| Errors | Failed requests, as a count or a share of the total | HTTP 5xx rate |
| Duration | Time taken to serve a request | p99 response time |
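The RED signals can be sketched as a small computation over a window of request records. This is a minimal illustration, assuming a request counts as failed when its status code is 500 or higher and using a nearest-rank percentile; the `Request` type and the function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status_code: int
    duration_ms: float


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests, window_seconds):
    """Compute Rate, Errors, and Duration for one observation window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(requests)
    failed = sum(1 for r in requests if r.status_code >= 500)
    return {
        "rate_rps": total / window_seconds,
        "error_ratio": failed / total if total else 0.0,
        "p99_ms": percentile([r.duration_ms for r in requests], 99) if total else None,
    }


# 100 requests in a 10 s window: 98 fast successes, one slow failure, one slower success.
sample = [Request(200, 20.0)] * 98 + [Request(500, 900.0), Request(200, 45.0)]
summary = red_summary(sample, window_seconds=10)
print(summary)
```

A metrics backend would normally compute percentiles from histogram buckets rather than raw samples; the raw-sample version is used here only because it is easy to follow.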
Pattern: SLI/SLO Definition
# Example SLO
slo:
  name: api-availability
  target: 99.9%
  window: 30d
  sli:
    type: availability
    good: status_code < 500
    total: all_requests
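To make the SLO above concrete, the sketch below evaluates the availability SLI (good requests are those with `status_code < 500`) against the 99.9% target. The error-budget calculation is an addition for illustration rather than part of the definition above, and both function names are hypothetical.

```python
def availability_sli(status_codes):
    """Fraction of requests counted as good (status_code < 500)."""
    total = len(status_codes)
    if total == 0:
        return 1.0  # assumption: a window with no traffic counts as meeting the SLO
    good = sum(1 for code in status_codes if code < 500)
    return good / total


def error_budget_remaining(sli, target):
    """Share of the error budget still unspent; negative when overspent."""
    allowed = 1.0 - target  # failure ratio the SLO tolerates
    spent = 1.0 - sli       # failure ratio actually observed
    return 1.0 - spent / allowed


# 10,000 requests in the window, 5 of them server errors.
codes = [200] * 9995 + [503] * 5
sli = availability_sli(codes)
remaining = error_budget_remaining(sli, target=0.999)
print(f"SLI={sli:.4%} meets_slo={sli >= 0.999} budget_remaining={remaining:.0%}")
```

With a 99.9% target, 10 failures in 10,000 requests would exhaust the budget, so 5 failures leave about half of it.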
Alerting Strategy
Alert on Symptoms, Not Causes
Alert when users are impacted (for example, by elevated error rates or latency), not merely when a single component is unhealthy.
Alert Severity
| Severity | Response | Example |
|---|---|---|
| Critical | Page immediately | Service down |
| Warning | Review next day | Degraded performance |
| Info | Weekly review | Capacity trending |
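One way to encode the severity table is a lookup from severity to response, fed by a classifier that looks only at user-facing symptoms. This is a sketch: the `classify` function and its thresholds are illustrative assumptions, not recommended values, and info-level signals such as capacity trends would come from a separate check.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Response per severity, mirroring the table above.
RESPONSE = {
    Severity.CRITICAL: "page immediately",
    Severity.WARNING: "review next day",
    Severity.INFO: "weekly review",
}


def classify(error_ratio: float, p99_ms: float) -> Optional[Severity]:
    """Map user-facing symptoms to a severity, or None when nothing needs attention.

    Thresholds are hypothetical and would be tuned per service.
    """
    if error_ratio >= 0.5:  # most requests failing: treat as "service down"
        return Severity.CRITICAL
    if error_ratio >= 0.01 or p99_ms >= 1000:  # degraded but still serving
        return Severity.WARNING
    return None


severity = classify(error_ratio=0.02, p99_ms=300)
action = RESPONSE[severity] if severity else "no action"
print(severity, action)
```

Keeping the response mapping separate from the classification makes it easy to change who is notified without touching the alert conditions.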
Anti-Patterns
Avoid These
- Alerting on every individual error
- Alerts without runbooks
- Metrics without context
- Logging everything (expensive to store and search)
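Sampling is one possible way to avoid logging everything while still keeping the records that matter. The sketch below uses the standard `logging` module to keep every record at WARNING and above and only a fraction of lower-level records; the `SampleBelowWarning` class, the 10% rate, and the `checkout` logger name are assumptions for illustration.

```python
import logging
import random
from typing import Optional


class SampleBelowWarning(logging.Filter):
    """Keep every WARNING-and-above record; keep only a fraction of the rest."""

    def __init__(self, sample_rate: float, rng: Optional[random.Random] = None):
        super().__init__()
        if not 0.0 <= sample_rate <= 1.0:
            raise ValueError("sample_rate must be between 0 and 1")
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings or errors
        return self.rng.random() < self.sample_rate


logger = logging.getLogger("checkout")  # hypothetical service name
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.addFilter(SampleBelowWarning(sample_rate=0.1))
logger.addHandler(handler)

logger.info("cart updated")     # emitted for roughly 1 in 10 calls
logger.error("payment failed")  # always emitted
```

Attaching the filter to the handler rather than the logger means other handlers, such as a local debug file, can still receive the full stream.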