# Dashboards
Detailed documentation for all Grafana dashboards managed in this repository.
## LLM Metrics Dashboard
A comprehensive monitoring dashboard for AI/LLM workloads, tracking request performance, token throughput, error rates, resource utilization, and GPU metrics across the agentic-ai platform.
| Property | Value |
|---|---|
| UID | llm-metrics |
| Title | LLM Metrics Dashboard |
| Tags | llm, ai, performance |
| Datasource | Prometheus |
| Default Time Range | Last 1 hour |
| Auto-Refresh | 30 seconds |
| Namespace | CRD deployed in `monitoring`; queries scoped to `agentic-ai` |
### Dashboard Layout
```mermaid
flowchart TB
    subgraph "LLM Metrics Dashboard"
        subgraph "Section 1: LLM Metrics (7 panels)"
            direction LR
            P1[Request Rate]
            P2[Latency P50/P95/P99]
            P3[Token Throughput]
            P4[Time to First Token]
            P5[Error Rate]
            P6[Active Requests]
            P7[Estimated Cost]
        end
        subgraph "Section 2: Resource Utilization (4 panels)"
            direction LR
            P8[CPU % of Limit]
            P9[Memory % of Limit]
            P10[CPU Cores Used]
            P11[Memory Used]
        end
        subgraph "Section 3: Mac GPU Metrics (4 panels)"
            direction LR
            P12[GPU Utilization Gauge]
            P13[GPU Utilization Over Time]
            P14[GPU Frequency]
            P15[DVFM State Distribution]
        end
    end
```
### LLM Metrics
Core performance metrics for LLM inference operations.
| Panel | Visualization | Query Summary | Unit |
|---|---|---|---|
| LLM Request Rate | Timeseries | `rate(llm_requests_total[5m])` | req/s |
| LLM Latency P50/P95/P99 | Timeseries | `histogram_quantile(0.5/0.95/0.99, rate(llm_request_duration_seconds_bucket[5m]))` | seconds |
| Token Throughput | Timeseries | `rate(llm_tokens_input_total[5m]) + rate(llm_tokens_output_total[5m])` | tokens/s |
| Time to First Token | Timeseries | `histogram_quantile(0.5/0.95/0.99, rate(llm_time_to_first_token_seconds_bucket[5m]))` | seconds |
| LLM Error Rate | Timeseries | `rate(llm_errors_total[5m])` | errors/s |
| Active Requests | Gauge | `llm_active_requests` | count (0-100) |
| Estimated Cost | Stat | `rate(llm_cost_dollars_total[1h]) * 3600` | $/hr |
**Latency Percentiles**

The latency panel displays three series (P50, P95, P99) on the same chart, providing a complete view of the request duration distribution. P99 is the most important series for SLA monitoring.
**Time to First Token (TTFT)**

TTFT measures the delay between sending a request and receiving the first token back, which is critical for perceived responsiveness in streaming chat applications.
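When the P99 series is also used for alerting or SLO reporting, it can be cheaper to precompute it with a recording rule than to evaluate `histogram_quantile` in every panel. The following is a minimal sketch, assuming a Prometheus Operator / kube-prometheus-stack setup that discovers `PrometheusRule` resources in the `monitoring` namespace; the resource, group, and record names are placeholders, not values from this repository:

```yaml
# Sketch only: precompute the P99 latency series used by the
# "LLM Latency P50/P95/P99" panel. Names are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-latency-recording-rules   # hypothetical name
  namespace: monitoring
  # Depending on the operator's ruleSelector, a matching label
  # (e.g. release: kube-prometheus-stack) may also be required here.
spec:
  groups:
    - name: llm.latency
      rules:
        - record: llm:request_duration_seconds:p99_5m
          expr: histogram_quantile(0.99, rate(llm_request_duration_seconds_bucket[5m]))
```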
### Resource Utilization
Container-level resource metrics for the agentic-ai namespace, showing both percentage-of-limit and absolute values.
| Panel | Visualization | Query Summary | Unit |
|---|---|---|---|
| CPU Usage (% of Limit) | Timeseries | `rate(container_cpu_usage_seconds_total{namespace="agentic-ai"}[5m]) / container_spec_cpu_quota * container_spec_cpu_period * 100` | percent |
| Memory Usage (% of Limit) | Timeseries | `container_memory_working_set_bytes{namespace="agentic-ai"} / container_spec_memory_limit_bytes * 100` | percent |
| CPU Cores Used | Timeseries | `rate(container_cpu_usage_seconds_total{namespace="agentic-ai"}[5m])` | cores |
| Memory Used | Timeseries | `container_memory_working_set_bytes{namespace="agentic-ai"}` | bytes |
**Resource Limits Required**

The percentage panels require that resource limits are set on the agentic-ai pods. Without limits, the denominator is zero and the panels will show no data.
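For context, the denominators come from the `container_spec_*` metrics that cAdvisor derives from container limits. A minimal sketch of a container spec with limits set; the workload name, image, and values are placeholders, not the actual agentic-ai configuration:

```yaml
# Sketch only: resource limits are what populate the container_spec_*
# metrics the percentage panels divide by. Names and values are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-gateway                  # hypothetical workload name
  namespace: agentic-ai
spec:
  selector:
    matchLabels:
      app: llm-gateway
  template:
    metadata:
      labels:
        app: llm-gateway
    spec:
      containers:
        - name: llm-gateway
          image: example.invalid/llm-gateway:latest   # placeholder image
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: "2"       # feeds container_spec_cpu_quota / container_spec_cpu_period
              memory: 4Gi    # feeds container_spec_memory_limit_bytes
```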
### Mac GPU Metrics
Hardware GPU metrics for local Mac development, monitoring Apple Silicon GPU performance via custom exporters.
| Panel | Visualization | Query Summary | Unit | Notes |
|---|---|---|---|---|
| GPU Utilization | Gauge | `mac_gpu_utilization_percent` | percent | Thresholds: green < 60%, yellow < 80%, red >= 80% |
| GPU Utilization Over Time | Timeseries | `mac_gpu_utilization_percent` | percent | Historical trend view |
| GPU Frequency | Timeseries | `mac_gpu_frequency_mhz` | MHz | Current GPU clock speed |
| DVFM State Distribution | Pie Chart | `mac_gpu_dvfm_states` by state | — | Dynamic Voltage and Frequency Management states |
**Mac GPU Metrics**

These panels require a custom Prometheus exporter that collects Apple Silicon GPU metrics. They are only relevant when running Ollama on Mac hardware.
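How the exporter reaches Prometheus depends on the local setup. One possibility, assuming the exporter exposes plain HTTP metrics on the Mac host and kube-prometheus-stack is installed via Helm, is an additional scrape job along these lines; the job name, host alias, and port are assumptions, not values from this repository:

```yaml
# Sketch only: Helm values wiring a Mac GPU exporter into kube-prometheus-stack
# via additionalScrapeConfigs. The address and port are placeholders.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: mac-gpu-exporter            # hypothetical job name
        scrape_interval: 15s
        static_configs:
          - targets:
              - "host.docker.internal:9101"   # hypothetical exporter endpoint
```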
## Metrics Reference
Complete reference for all custom Prometheus metrics used by the LLM Metrics Dashboard.
### LLM Metrics
| Metric | Type | Description |
|---|---|---|
| `llm_requests_total` | Counter | Total number of LLM inference requests |
| `llm_request_duration_seconds_bucket` | Histogram | LLM request duration distribution (for percentile calculations) |
| `llm_tokens_input_total` | Counter | Total input tokens processed |
| `llm_tokens_output_total` | Counter | Total output tokens generated |
| `llm_time_to_first_token_seconds_bucket` | Histogram | Time to first token distribution |
| `llm_errors_total` | Counter | Total LLM inference errors |
| `llm_active_requests` | Gauge | Currently active/in-flight LLM requests |
| `llm_cost_dollars_total` | Counter | Estimated cumulative cost in dollars |
### Mac GPU Metrics
| Metric | Type | Description |
|---|---|---|
| `mac_gpu_utilization_percent` | Gauge | Current GPU utilization percentage (0-100) |
| `mac_gpu_frequency_mhz` | Gauge | Current GPU clock frequency in MHz |
| `mac_gpu_dvfm_states` | Gauge | DVFM state distribution (labeled by state) |
### Standard Container Metrics
| Metric | Type | Source |
|---|---|---|
| `container_cpu_usage_seconds_total` | Counter | cAdvisor (kube-prometheus-stack) |
| `container_memory_working_set_bytes` | Gauge | cAdvisor (kube-prometheus-stack) |
| `container_spec_cpu_quota` | Gauge | cAdvisor (container CPU limit) |
| `container_spec_cpu_period` | Gauge | cAdvisor (CPU scheduling period) |
| `container_spec_memory_limit_bytes` | Gauge | cAdvisor (container memory limit) |
## Adding New Dashboards
### Step-by-Step Guide
1. Design the dashboard in the Grafana UI (use the fleet-infra Grafana instance).
2. Export the JSON from Grafana: Dashboard Settings → JSON Model → Copy.
3. Create the dashboard directory under the appropriate project folder (steps 3-7 are sketched together after this list).
4. Save the JSON as a reference file.
5. Create the GrafanaDashboard CRD manifest.
6. Create a GrafanaFolder (if needed for a new project).
7. Add the new manifests to kustomization.yaml.
8. Commit and push; Flux CD will sync the changes automatically.
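The sketch below ties steps 3-7 together. It assumes the grafana-operator `v1beta1` API for `GrafanaDashboard` and `GrafanaFolder`, the `monitoring` namespace used by the LLM Metrics Dashboard above, and illustrative paths, names, and labels; the real directory layout, dashboard JSON, and `instanceSelector` labels come from this repository and its Grafana instance, not from this example.

```yaml
# Sketch only: one possible layout for steps 3-7. Paths, names, labels, and
# the dashboard JSON are illustrative, not values taken from this repository.
#
#   apps/agentic-ai/dashboards/        # step 3 (hypothetical path)
#   ├── my-dashboard.json              # step 4: reference copy of the exported JSON
#   ├── my-dashboard.yaml              # step 5: GrafanaDashboard CRD
#   ├── folder.yaml                    # step 6: GrafanaFolder CRD
#   └── kustomization.yaml             # step 7
---
# my-dashboard.yaml (step 5)
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: my-dashboard                   # hypothetical name
  namespace: monitoring
spec:
  folderRef: agentic-ai                # assumed to reference the GrafanaFolder below
  instanceSelector:
    matchLabels:
      dashboards: grafana              # assumed label on the Grafana instance CR
  json: |
    {
      "uid": "my-dashboard",
      "title": "My Dashboard",
      "tags": ["example"],
      "panels": []
    }
---
# folder.yaml (step 6)
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaFolder
metadata:
  name: agentic-ai
  namespace: monitoring
spec:
  title: agentic-ai
  instanceSelector:
    matchLabels:
      dashboards: grafana
---
# kustomization.yaml (step 7)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - my-dashboard.yaml
  - folder.yaml
```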
### Dashboard Best Practices
**Use Consistent Datasource References**

Always use `"datasource": {"type": "prometheus", "uid": "prometheus"}` to reference the Prometheus datasource. Avoid hardcoding datasource names that may differ across environments.
**Meaningful UIDs**

Assign readable UIDs to dashboards (e.g., `llm-metrics` instead of auto-generated UUIDs). This makes URLs predictable and bookmarkable.
**Folder Organization**

Group dashboards by project using GrafanaFolder CRDs. Each project gets its own folder in Grafana (e.g., `agentic-ai/`), keeping the dashboard list organized.
**JSON Size Limits**

Large dashboard JSON embedded in CRDs can hit etcd size limits. For very complex dashboards (50+ panels), consider using `url` or `configMapRef` instead of inline `json`.
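A minimal sketch of the ConfigMap-backed variant, assuming the grafana-operator `v1beta1` `configMapRef` field; all names are placeholders, and the exact field shape should be checked against the installed operator version:

```yaml
# Sketch only: serving a large dashboard from a ConfigMap instead of inline
# json. The dashboard and ConfigMap names are placeholders.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: big-dashboard                # hypothetical name
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana            # assumed label on the Grafana instance CR
  configMapRef:
    name: big-dashboard-json         # ConfigMap holding the exported JSON
    key: dashboard.json
```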