Grafana Operator¶
Grafana Operator (GitHub) is a Kubernetes operator that manages Grafana instances, dashboards, datasources, and folders through Custom Resource Definitions. Unlike file-based provisioning or API-driven CI/CD pipelines, the operator introduces a reconciliation loop that continuously ensures the declared state in Git matches the live state in Grafana — correcting drift from manual UI edits without requiring pod restarts.
The operator defines CRDs (GrafanaDashboard, GrafanaDatasource, GrafanaFolder) that are standard Kubernetes resources, meaning they participate in the same GitOps lifecycle as every other manifest in the cluster. Flux deploys them, kubectl inspects them, and RBAC controls who can create them — no separate Grafana API tokens or provisioning sidecars required.
Overview¶
| Property | Value |
|---|---|
| Namespace | grafana-operator |
| Type | HelmRelease (chart: grafana-operator vv5.15.1) |
| Layer | Grafana Operator |
| Chart | grafana-operator vv5.15.1 |
| Status | Enabled |
| Source | apps/base/grafana-operator/ |
Dependencies¶
Upstream — required before Grafana Operator starts¶
| Service | Reason | Status |
|---|---|---|
kube-prometheus-stack |
Flux dependsOn |
Active |
Downstream — services that depend on Grafana Operator¶
| Service | Dependency type | Reason |
|---|---|---|
grafana-config |
Flux dependsOn |
Requires Grafana Operator |
Purpose¶
Grafana Operator is the bridge between this platform's Git-managed observability definitions and the Grafana instance running inside kube-prometheus-stack. It watches for dashboard and datasource CRDs committed to the repository (delivered via the downstream grafana-config service), reconciles them into the running Grafana, and reverts any manual drift — completing the GitOps loop for the entire observability layer.
Features¶
| Feature | Detail |
|---|---|
| Cluster-wide CRD watching | Configured with namespaceScope: false, allowing GrafanaDashboard and GrafanaDatasource resources to be placed in any namespace alongside the services they monitor, rather than requiring all definitions to live in the operator's namespace. |
| Install and upgrade remediation | Both install and upgrade phases are configured with 3 automatic retries, providing resilience against transient failures during Helm chart deployment without requiring manual intervention. |
| Bounded resource allocation | Explicit resource requests and limits are set to prevent the operator from consuming unbounded memory during large reconciliation sweeps across many namespaces. |
| OCI-based chart distribution | The Helm chart is sourced from an OCI registry rather than a traditional HTTP Helm repository, leveraging container registry infrastructure for chart storage and distribution. |
Architecture¶
Grafana Operator Reconciliation Topology¶
graph TD
subgraph flux-system["flux-system namespace"]
FK["Flux Kustomization<br/>grafana-operator"]
HR["HelmRelease<br/>grafana-operator"]
REPO["HelmRepository OCI<br/>ghcr.io/grafana/helm-charts"]
end
subgraph grafana-operator-ns["grafana-operator namespace"]
OP["Grafana Operator Pod"]
end
subgraph kube-prometheus["kube-prometheus-stack"]
GRAFANA["Grafana Instance"]
end
subgraph any-namespace["any namespace"]
CRDs["GrafanaDashboard<br/>GrafanaDatasource<br/>GrafanaFolder CRDs"]
end
FK -->|"deploys"| HR
HR -->|"pulls chart"| REPO
HR -->|"installs into"| OP
FK -.->|"dependsOn"| kube-prometheus
CRDs -->|"watched by"| OP
OP -->|"provisions via API"| GRAFANA
Dashboard-as-Code GitOps Flow¶
sequenceDiagram
participant Git as Git Repository
participant Flux as Flux CD
participant GC as grafana-config Kustomization
participant OP as Grafana Operator
participant G as Grafana
Git->>Flux: Push dashboard CRD manifest
Flux->>GC: Reconcile grafana-config path
GC->>OP: CRD appears in cluster (any namespace)
OP->>G: Provision dashboard via internal API
Note over OP,G: Continuous reconciliation reverts UI drift
Configuration¶
All values sourced from base/services/environment.env
(base); per-environment overrides in clusters/stages/dev/.../environment.env.
No environment-specific configuration variables for this service.
Operations¶
HelmRelease stuck in "install retries exhausted"¶
Symptoms: flux get helmreleases -n flux-system grafana-operator shows False ready status with message "install retries exhausted". Events show HelmChart reconciliation failing or image pull errors.
kubectl get helmrelease grafana-operator -n flux-system -o yaml | grep -A5 'status:'
kubectl get events -n flux-system --field-selector involvedObject.name=grafana-operator --sort-by=.lastTimestamp
kubectl get helmchart -n flux-system | grep grafana-operator
# Check if OCI registry is reachable from cluster
kubectl run oci-test --rm -it --image=curlimages/curl -- curl -s https://ghcr.io/v2/grafana/helm-charts/grafana-operator/tags/list
# Force retry after fixing the root cause
flux suspend helmrelease grafana-operator -n flux-system
flux resume helmrelease grafana-operator -n flux-system
Operator pod OOMKilled under CRD load¶
Symptoms: Pod restarts with reason OOMKilled visible in kubectl describe pod. Increasing restart count. Dashboards stop reconciling during restart cycles. kubectl top pod -n grafana-operator shows memory approaching the 256Mi limit.
kubectl get pods -n grafana-operator -o wide
kubectl describe pod -n grafana-operator -l app.kubernetes.io/name=grafana-operator | grep -A3 "Last State"
kubectl top pod -n grafana-operator
# Check how many CRDs the operator is watching
kubectl get grafanadashboards -A --no-headers | wc -l
kubectl get grafanadatasources -A --no-headers | wc -l
# If CRD count is high, memory limit needs increasing in HelmRelease values
kubectl logs -n grafana-operator -l app.kubernetes.io/name=grafana-operator --previous --tail=50
Dashboards not appearing in Grafana after CRD creation¶
Symptoms: GrafanaDashboard CRs exist and show no error in their status, but dashboards are missing from the Grafana UI. Operator logs show "no matching Grafana instances found" or instanceSelector mismatch warnings.
kubectl get grafanadashboards -A -o custom-columns=NAME:.metadata.name,SYNCED:.status.conditions[0].status,MSG:.status.conditions[0].message
# Check operator logs for selector mismatch
kubectl logs -n grafana-operator -l app.kubernetes.io/name=grafana-operator --tail=100 | grep -i "instance"
# Verify Grafana instance labels match the instanceSelector in dashboards
kubectl get grafana -A --show-labels
# Compare with dashboard instanceSelector
kubectl get grafanadashboards -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.instanceSelector}{"\n"}{end}'
# Verify operator has RBAC to read CRDs in target namespace
kubectl auth can-i get grafanadashboards --as=system:serviceaccount:grafana-operator:grafana-operator -A
CRD version conflict after operator upgrade¶
Symptoms: After upgrading the operator chart, existing GrafanaDashboard CRs show validation errors. kubectl apply on dashboard manifests fails with "unknown field" or schema validation errors. Operator logs show "failed to reconcile" with conversion webhook errors.
kubectl get crd grafanadashboards.grafana.integreatly.org -o jsonpath='{.spec.versions[*].name}'
kubectl get crd grafanadashboards.grafana.integreatly.org -o jsonpath='{.status.storedVersions[*]}'
# Check if operator registered its webhook
kubectl get validatingwebhookconfigurations | grep grafana
kubectl get mutatingwebhookconfigurations | grep grafana
# Inspect a failing CRD for schema issues
kubectl get grafanadashboards -A -o yaml | head -80
# If CRDs are stale, force Flux to reapply the chart (which bundles updated CRDs)
flux reconcile helmrelease grafana-operator -n flux-system --force
Operator cannot reach Grafana API endpoint¶
Symptoms: Operator logs show repeated "connection refused" or "context deadline exceeded" errors when attempting to sync dashboards. CRD status shows "failed to get Grafana instance" or HTTP 503. Grafana pod itself is healthy but operator cannot connect.
kubectl logs -n grafana-operator -l app.kubernetes.io/name=grafana-operator --tail=50 | grep -iE "refused|timeout|503"
# Verify Grafana service is resolvable from operator namespace
kubectl run dns-test --rm -it --image=busybox -n grafana-operator -- nslookup kube-prometheus-stack-grafana.kube-prometheus-stack.svc.cluster.local
# Check network policies that might block cross-namespace traffic
kubectl get networkpolicies -n kube-prometheus-stack
kubectl get ciliumnetworkpolicies -n kube-prometheus-stack
# Verify Grafana admin credentials secret exists and is referenced correctly
kubectl get grafana -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.external.adminPassword}{"\n"}{end}'
Related¶
apps/base/grafana-operator/— Kubernetes manifestsbase/services/grafana-operator.yaml— Flux Kustomizationbase/services/environment.env— environment variables
Generated from service-catalog.json at commit 165b485 · catalog sha 4d088b0b3a67b4c4