Grafana SA Setup¶
Grafana exposes a Service Accounts API that issues scoped API tokens independent of any human user. Unlike basic auth or personal API keys tied to a user session, service account tokens survive password rotations, can be individually revoked, and carry explicit role bindings (Viewer, Editor, Admin). They are the recommended machine-to-machine authentication path for any automation that needs to query dashboards, manage alerts, or provision data sources programmatically.
However, Grafana's service account database is internal and ephemeral in default Helm deployments — tokens do not survive a pod restart unless backed by persistent storage. This creates a bootstrap problem: downstream services that depend on a valid Grafana token cannot simply mount a static secret, because the token becomes invalid every time Grafana's state resets.
grafana-sa-setup solves this with a self-healing Kubernetes Job that validates, creates, and externalizes a Grafana SA token into a secrets manager where other services can consume it through the standard ExternalSecrets flow.
Overview¶
| Property | Value |
|---|---|
| Namespace | grafana-sa-setup |
| Type | Job |
| Layer | Foundation services |
| Status | Enabled |
| Source | apps/base/grafana-sa-setup/ |
Dependencies¶
Upstream — required before Grafana SA Setup starts¶
| Service | Reason | Status |
|---|---|---|
kube-prometheus-stack |
Flux dependsOn |
Active |
Downstream — services that depend on Grafana SA Setup¶
| Service | Dependency type | Reason |
|---|---|---|
kagent |
Flux dependsOn |
Requires Grafana SA Setup |
Purpose¶
This Job exists to bridge Grafana's ephemeral service account store with the platform's secrets distribution layer. Specifically, it ensures that kagent always has a valid Grafana API token available — even after cluster restarts wipe Grafana's internal SA database.
The Job runs after kube-prometheus-stack brings Grafana online, creates (or validates) a service account named kagent with Admin role, generates a token, and writes it to LocalStack Secrets Manager at kagent/grafana/api-key. From there, ExternalSecrets syncs the value into a Kubernetes Secret that kagent mounts at runtime. The entire chain is automatic and requires no manual intervention after initial deployment.
Why a Job instead of Grafana's declarative provisioning (provisioning/access-control/): Grafana's file-based provisioning supports service accounts since v9.1, but it does not expose the generated token — it only ensures the account exists. There is no declarative way to extract the token value and push it elsewhere. A Job with API calls is the only path that both creates the account and captures the token for external storage.
Why not a persistent Grafana database: Persisting Grafana's SQLite/PostgreSQL state would make tokens survive restarts, but adds operational complexity (PVCs, backup/restore, migration) for a component whose dashboards and data sources are already provisioned declaratively. Treating Grafana as stateless with a token-bootstrap Job is simpler and aligns with the GitOps philosophy of reproducible state.
Features¶
| Feature | Detail |
|---|---|
| Idempotent token validation | Fetches the current token from LocalStack and validates it against Grafana's API before attempting creation, avoiding unnecessary token churn on routine reconciliations |
| Self-healing via TTL and Flux reconciliation | ttlSecondsAfterFinished deletes the completed Job after 300s; Flux recreates it on next interval, guaranteeing the token is refreshed after any cluster restart that invalidates Grafana's SA database |
| Init container / main container handoff | Token generation (curl against Grafana API) runs in an initContainer, passing the result to the main container (aws-cli for LocalStack storage) via a shared emptyDir volume at /shared |
| Least-privilege RBAC | A dedicated ServiceAccount with a Role scoped to a single verb (get) on a single Secret (grafana-admin-credentials) in the monitoring namespace — no cluster-wide permissions |
| Flux health gating | The Flux Kustomization declares a healthCheck on the Job resource, blocking downstream dependents (kagent) from reconciling until the token is confirmed stored |
Architecture¶
Job Execution Topology¶
graph TD
subgraph monitoring namespace
SECRET[Secret: grafana-admin-credentials]
SA[ServiceAccount: grafana-sa-setup]
INIT["initContainer: create-token<br/>(curlimages/curl)"]
MAIN["container: store-secret<br/>(amazon/aws-cli)"]
GRAFANA[Grafana API]
SHARED[emptyDir: /shared]
end
subgraph localstack namespace
LOCALSTACK[LocalStack Secrets Manager]
end
subgraph flux-system namespace
KUST[Kustomization: grafana-sa-setup]
end
KUST -->|dependsOn| KPS[Kustomization: kube-prometheus-stack]
KUST -->|healthCheck batch/v1 Job| INIT
SECRET -->|volumeMount /var/secrets/grafana| INIT
SA -->|serviceAccountName| INIT
INIT -->|"GET /api/serviceaccounts/search"| GRAFANA
INIT -->|"POST /api/serviceaccounts/{id}/tokens"| GRAFANA
INIT -->|writes grafana-token| SHARED
SHARED -->|reads grafana-token| MAIN
INIT -->|"POST secretsmanager.GetSecretValue"| LOCALSTACK
MAIN -->|"put-secret-value kagent/grafana/api-key"| LOCALSTACK
Token Bootstrap Flow¶
sequenceDiagram
participant Flux
participant Job as create-grafana-sa-token
participant LS as LocalStack
participant GF as Grafana API
Flux->>Job: Reconcile (after kube-prometheus-stack ready)
Job->>LS: GetSecretValue(kagent/grafana/api-key)
alt Token exists and non-placeholder
Job->>GF: GET /api/org (Bearer token)
alt HTTP 200
Job->>Job: Write SKIP to /shared/action
Job-->>Flux: Job Complete (no-op)
else HTTP != 200
Job->>GF: POST /api/serviceaccounts (name=kagent, role=Admin)
Job->>GF: POST /api/serviceaccounts/{id}/tokens
Job->>Job: Write token to /shared/grafana-token
Job->>LS: put-secret-value(kagent/grafana/api-key, token)
Job-->>Flux: Job Complete (refreshed)
end
else No token or placeholder
Job->>GF: POST /api/serviceaccounts (name=kagent, role=Admin)
Job->>GF: POST /api/serviceaccounts/{id}/tokens
Job->>Job: Write token to /shared/grafana-token
Job->>LS: put-secret-value(kagent/grafana/api-key, token)
Job-->>Flux: Job Complete (created)
end
Flux->>Flux: Unblock downstream (kagent)
Configuration¶
All values sourced from base/services/environment.env
(base); per-environment overrides in clusters/stages/dev/.../environment.env.
No environment-specific configuration variables for this service.
Operations¶
Job fails because Grafana is not yet healthy¶
Symptoms: Job pod in Init:Error or Init:CrashLoopBackOff. Init container logs show curl: (7) Failed to connect to kube-prometheus-stack-grafana.monitoring.svc.cluster.local port 80 or HTTP 503 responses. Flux Kustomization grafana-sa-setup shows Health check failed.
kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana
kubectl logs -n monitoring job/create-grafana-sa-token -c create-token --previous
kubectl get kustomization kube-prometheus-stack -n flux-system -o jsonpath='{.status.conditions[*].message}'
kubectl rollout status deployment/kube-prometheus-stack-grafana -n monitoring --timeout=120s
kubectl delete job create-grafana-sa-token -n monitoring && flux reconcile kustomization grafana-sa-setup
Token created but LocalStack PutSecretValue fails¶
Symptoms: Init container exits 0 but main container (store-secret) fails with AWS CLI errors such as Could not connect to the endpoint URL or An error occurred (ResourceNotFoundException). Job shows 1/1 init containers completed but pod status is Error.
kubectl logs -n monitoring job/create-grafana-sa-token -c store-secret
kubectl get pods -n localstack -l app.kubernetes.io/name=localstack
kubectl exec -n localstack deploy/localstack -- awslocal secretsmanager describe-secret --secret-id kagent/grafana/api-key --region us-east-1
kubectl exec -n localstack deploy/localstack -- awslocal secretsmanager create-secret --name kagent/grafana/api-key --secret-string placeholder-run-grafana-sa-setup --region us-east-1
kubectl delete job create-grafana-sa-token -n monitoring && flux reconcile kustomization grafana-sa-setup
grafana-admin-credentials Secret missing¶
Symptoms: Pod stuck in Pending or ContainerCreating with event MountVolume.SetUp failed for volume "grafana-admin-creds". The Secret grafana-admin-credentials does not exist in the monitoring namespace.
kubectl get secret grafana-admin-credentials -n monitoring
kubectl get helmrelease kube-prometheus-stack -n flux-system -o jsonpath='{.status.conditions[*].message}'
kubectl get pods -n monitoring -l app.kubernetes.io/instance=kube-prometheus-stack
flux reconcile helmrelease kube-prometheus-stack -n flux-system
Job completed but kagent still cannot authenticate to Grafana¶
Symptoms: kagent logs show 401 Unauthorized when calling Grafana API. The Job shows Completed status. ExternalSecret in kagent's namespace may be out of sync or holding a stale token.
kubectl get externalsecret -A | grep grafana
kubectl exec -n localstack deploy/localstack -- awslocal secretsmanager get-secret-value --secret-id kagent/grafana/api-key --region us-east-1 --query SecretString --output text
curl -sf -H "Authorization: Bearer $(kubectl exec -n localstack deploy/localstack -- awslocal secretsmanager get-secret-value --secret-id kagent/grafana/api-key --region us-east-1 --query SecretString --output text)" http://localhost:$(kubectl get svc kube-prometheus-stack-grafana -n monitoring -o jsonpath='{.spec.ports[0].port}')/api/org
kubectl delete job create-grafana-sa-token -n monitoring && flux reconcile kustomization grafana-sa-setup
kubectl annotate externalsecret -n flux-system kagent-grafana-sa-token force-sync=$(date +%s) --overwrite
Job exceeds backoffLimit with intermittent Grafana API errors¶
Symptoms: Job status shows Failed with reason BackoffLimitExceeded. Multiple pod attempts visible with kubectl get pods -n monitoring -l job-name=create-grafana-sa-token. Logs show mixed success/failure on Grafana API calls (HTTP 500, connection resets).
kubectl get pods -n monitoring -l job-name=create-grafana-sa-token --sort-by=.metadata.creationTimestamp
kubectl logs -n monitoring -l job-name=create-grafana-sa-token -c create-token --tail=50
kubectl get events -n monitoring --field-selector involvedObject.name=create-grafana-sa-token --sort-by=.lastTimestamp
kubectl top pods -n monitoring -l app.kubernetes.io/name=grafana
kubectl delete job create-grafana-sa-token -n monitoring && flux reconcile kustomization grafana-sa-setup
Related¶
apps/base/grafana-sa-setup/— Kubernetes manifestsbase/services/grafana-sa-setup.yaml— Flux Kustomizationbase/services/environment.env— environment variables
Generated from service-catalog.json at commit 165b485 · catalog sha 4d088b0b3a67b4c4