Redis Sentinel¶

Redis (GitHub) is an open-source, in-memory data structure store. Unlike traditional databases that persist to disk first, Redis operates entirely in memory with optional persistence, delivering sub-millisecond latency at any scale. What distinguishes it from other caching solutions (Memcached, Varnish, Hazelcast): it supports rich data structures (strings, hashes, lists, sets, sorted sets, streams), native pub/sub, Lua scripting, and atomic operations — making it useful well beyond simple key-value caching.

Redis Sentinel is Redis's built-in high availability layer. It runs as a separate process alongside Redis nodes, continuously monitors master/replica health, and performs automatic leader election and failover without any external orchestrator. This is deployed in master-replica mode with Sentinel-managed automatic failover.

Overview¶

Property	Value
Namespace	`redis-sentinel`
Type	HelmRelease (chart: `redis` v20.7.0)
Layer	Database services
Chart	`redis` v20.7.0
Status	Enabled
Source	`apps/base/redis-sentinel/`

Dependencies¶

Upstream — required before Redis Sentinel starts¶

Service	Reason	Status
`external-secrets-config`	Flux `dependsOn`	Effectively inactive — `auth.enabled: false` in the HelmRelease and `externalsecret.yaml` is excluded from `apps/base/redis-sentinel/kustomization.yaml`, so no password is synced or used. The `redis-password` health check in the Flux Kustomization is a stale leftover and should be removed if auth remains disabled.

Downstream — services that depend on Redis Sentinel¶

Service	Dependency type	Reason
`argocd`	Flux `dependsOn`	Requires Redis Sentinel
`redisinsight`	Flux dependsOn	UI has no value without a running Redis instance

kagent and Loki are not Flux-wired to redis-sentinel because they degrade gracefully without it (kagent falls back to direct A2A dispatch; Loki skips cache). Hard dependsOn would block their entire deployment on a Redis outage.

Purpose¶

Redis Sentinel is the platform's shared in-memory data layer. It runs in replication mode with co-located Sentinel sidecars that continuously monitor master health and trigger automatic promotion of a replica when the master fails — keeping write-path downtime in the 15–30 second range rather than requiring human intervention.

Why in-cluster Sentinel over a managed service (e.g. AWS ElastiCache): Managed Redis is operationally simpler, but it hides the Sentinel protocol behind a proprietary failover mechanism. Running Sentinel in-cluster gives full control over quorum configuration, per-database selection, eviction policy, persistence tuning, and — critically — lets clients use the Sentinel discovery protocol directly, which is required for correct failover behavior in most Redis client libraries. The trade-off is self-managed operations and disaster recovery.

Consumers¶

Each service uses a dedicated database index to avoid key collisions. New consumers should claim the next unused index.

Service	DB	Usage pattern	Connection method	Notes
kagent alertmanager-hook	4	Writes `incident:events` stream when Alertmanager fires a critical alert	Sentinel :26379	Producer side of incident queue
kagent stream-dispatcher	4	Consumes `incident:events` stream, consumer group `dispatch-group`; KEDA scales replicas on pending entry count	Sentinel :26379	KEDA uses `redis-sentinel-streams` scaler, not `redis-streams`
Loki	—	Multi-layer cache (query results, chunks, index, write deduplication)	Direct master :6379	Cache currently disabled in this cluster config. When enabled, must bypass Sentinel due to missing `SentinelPassword` in Loki's go-redis wrapper — see ADR-002

Features¶

Feature	Detail
Replication topology	1 master + N read replicas; reads can be distributed across replicas to reduce master load
Sentinel quorum	Co-located Sentinel sidecars ping every node on a short interval; failover only fires when a quorum independently declares the master down (two-phase: subjectively down → objectively down), preventing split-brain false positives
Automatic master reconfiguration	After failover, the old master rejoins as a replica of the newly elected master; no manual intervention needed
Two-layer HA strategy	Sentinel covers instance-level failures (pod crash, OOMKill); RDB snapshots cover catastrophic failures (all pods lost, node failure)
RDB snapshots	Periodic persistence (`save 10800 1`, every 3 h if ≥1 key changed) with gzip compression and checksum validation; binary format is significantly more compact than AOF
AOF disabled	Redis is a cache, not a source of truth; data regenerates from the underlying databases on miss. AOF's per-write disk I/O (10 000× slower than memory) is an unnecessary cost for this workload
LRU eviction	`allkeys-lru` policy with configurable `maxmemory` ceiling; cache self-manages under memory pressure without requiring TTL on every key
Authentication	Disabled (`auth.enabled: false`). To re-enable, add `externalsecret.yaml` back to `apps/base/redis-sentinel/kustomization.yaml` and set `auth.enabled: true`
Prometheus metrics	Optional `redis_exporter` sidecar + ServiceMonitor; toggled via `REDIS_METRICS_ENABLED`
Soft pod anti-affinity	Replicas prefer different nodes without hard-failing on single-node dev clusters

Architecture¶

Steady-state topology¶

graph TD
    subgraph redis-sentinel namespace
        direction TB
        M[redis-sentinel-master-0\nStatefulSet]
        R0[redis-sentinel-replicas-0\nStatefulSet]
        R1[redis-sentinel-replicas-1\nStatefulSet]

        S0[Sentinel sidecar\nmaster-0\n:26379]
        S1[Sentinel sidecar\nreplicas-0\n:26379]
        S2[Sentinel sidecar\nreplicas-1\n:26379]

        M -- replication stream --> R0
        M -- replication stream --> R1

        S0 -. heartbeat .-> M
        S0 -. heartbeat .-> R0
        S0 -. heartbeat .-> R1
        S1 -. heartbeat .-> M
        S2 -. heartbeat .-> M
    end

    subgraph Clients
        Loki[Loki\ncache layers]
        App[Application\nservices]
        RI[RedisInsight UI]
    end

    Loki -- redis-sentinel-master svc :6379\ndirect — bypasses Sentinel protocol --> M
    App -- redis-sentinel svc :26379\nSentinel discovery → ask for current master --> S0
    RI -- redis-sentinel-master svc :6379 --> M

Failover sequence¶

sequenceDiagram
    participant S0 as Sentinel-0 (master-0)
    participant S1 as Sentinel-1 (replicas-0)
    participant S2 as Sentinel-2 (replicas-1)
    participant R0 as replicas-0 (Redis)
    participant App as App client

    Note over S0,S2: Master stops responding to PING

    S0->>S0: Mark master as S-DOWN\n(subjectively down — my opinion only)
    S1->>S1: Mark master as S-DOWN
    S0->>S1: Ask: "Do you also see master as down?"
    S1-->>S0: "Yes, S-DOWN confirmed"
    S0->>S0: Quorum reached (2/3)\nMark master as O-DOWN\n(objectively down — consensus)

    Note over S0,S2: Leader election among Sentinels
    S0->>S0: Elected failover leader

    S0->>R0: SLAVEOF NO ONE — promote to master
    R0-->>S0: OK, now master
    S2->>R0: SLAVEOF replicas-0 — start replicating from new master

    Note over S0: Old master-0 pod recovers
    S0->>M: SLAVEOF replicas-0 — you are now a replica

    App->>S1: "Where is the master?"
    S1-->>App: "replicas-0:6379"
    App->>R0: Read/write operations continue

Configuration¶

All values sourced from base/services/environment.env (base); per-environment overrides in clusters/stages/dev/.../environment.env.

Parameter	Dev	Prod
`REDIS_CHART_VERSION`	`20.7.0`	`20.7.0`
`REDIS_MASTER_CPU_LIMIT`	`250m`	`1000m`
`REDIS_MASTER_CPU_REQUEST`	`250m`	`200m`
`REDIS_MASTER_MEMORY_LIMIT`	`2Gi`	`1Gi`
`REDIS_MASTER_MEMORY_REQUEST`	`2Gi`	`512Mi`
`REDIS_MAXMEMORY`	`1800mb`	`900mb`
`REDIS_METRICS_ENABLED`	`false`	`true`
`REDIS_REPLICA_COUNT`	`1`	`2`
`REDIS_REPLICA_CPU_LIMIT`	`250m`	`1000m`
`REDIS_REPLICA_CPU_REQUEST`	`250m`	`200m`
`REDIS_REPLICA_MEMORY_LIMIT`	`2Gi`	`1Gi`
`REDIS_REPLICA_MEMORY_REQUEST`	`2Gi`	`512Mi`
`REDIS_SENTINEL_CPU_LIM`	`200m`	`500m`
`REDIS_SENTINEL_CPU_REQ`	`50m`	`100m`
`REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS`	`30000`	`30000`
`REDIS_SENTINEL_FAILOVER_TIMEOUT`	`180000`	`180000`
`REDIS_SENTINEL_MEM_LIM`	`128Mi`	`256Mi`
`REDIS_SENTINEL_MEM_REQ`	`64Mi`	`128Mi`
`REDIS_SENTINEL_PARALLEL_SYNCS`	`1`	`1`
`REDIS_STORAGE_SIZE`	`8Gi`	`16Gi`

Access¶

In-cluster connection strings¶

# Via Sentinel (recommended for app clients — handles failover automatically)
redis-sentinel.redis-sentinel.svc.cluster.local:26379
masterSet: mymaster

# Via master service (direct — used by Loki; bypasses Sentinel protocol)
redis-sentinel-master.redis-sentinel.svc.cluster.local:6379

Use the :26379 Sentinel port for any client that speaks the Redis Sentinel protocol. Use :6379 direct only when the client cannot authenticate to the Sentinel port (e.g. Loki — see ADR-002).

RedisInsight UI¶

Browse and query Redis from a browser at http://redis.local (requires make setup-dns). Connect RedisInsight to redis-sentinel-master.redis-sentinel.svc.cluster.local:6379.

Port-forward (alternative)¶

kubectl port-forward -n redis-sentinel svc/redis-sentinel-master 6379:6379
# Then connect any Redis client to localhost:6379

Operations¶

Verification¶

# Flux Kustomization health
flux get kustomization redis-sentinel

# StatefulSet readiness
kubectl get statefulset -n redis-sentinel
# Expected:
#   redis-sentinel-master    1/1
#   redis-sentinel-replicas  2/2

# Pod status
kubectl get pods -n redis-sentinel

# Confirm Sentinel knows the master
kubectl exec -n redis-sentinel redis-sentinel-master-0 -c redis-sentinel \
  -- redis-cli -p 26379 sentinel masters
# Expected: name=mymaster, status=ok, num-slaves=2

# Ping the master
kubectl exec -n redis-sentinel redis-sentinel-master-0 -c redis \
  -- redis-cli ping
# Expected: PONG

# ExternalSecret synced
kubectl get externalsecret -n redis-sentinel redis-password
# Expected: READY   True

Redis CLI¶

Open an interactive shell on the master

kubectl exec -it -n redis-sentinel redis-sentinel-master-0 -c redis -- redis-cli

Common commands

# List all keys (use SCAN in prod — KEYS blocks on large datasets)
SCAN 0 MATCH * COUNT 100

# Inspect a key
TYPE <key>
TTL <key>
GET <key>          # string
HGETALL <key>      # hash
XLEN <key>         # stream length

# Check server info
INFO server
INFO replication   # shows role, connected_slaves, master_replid
INFO memory        # used_memory_human, maxmemory

# Monitor live commands (exit with Ctrl+C)
MONITOR

# Count all keys across all DBs
INFO keyspace

# Select a specific database
SELECT 4           # kagent stream DB

# Delete a single key
DEL <key>

# Check stream contents (incident queue)
XRANGE incident:events - + COUNT 10
XINFO GROUPS incident:events
XINFO CONSUMERS incident:events dispatch-group

Common Operations¶

Force Flux reconciliation

flux reconcile kustomization redis-sentinel --with-source

Restart all Redis pods (rolling)

kubectl rollout restart statefulset -n redis-sentinel

Trigger manual Sentinel failover

Test HA without killing master:

kubectl exec -n redis-sentinel redis-sentinel-master-0 -c redis-sentinel \
  -- redis-cli -p 26379 sentinel failover mymaster

Check memory usage

kubectl exec -n redis-sentinel redis-sentinel-master-0 -c redis \
  -- redis-cli info memory | grep used_memory_human

Check replication lag on a replica

kubectl exec -n redis-sentinel redis-sentinel-replicas-0 -c redis \
  -- redis-cli info replication | grep -E "master_sync|master_last_io"

Flush all keys (⚠ non-reversible — dev only)

kubectl exec -n redis-sentinel redis-sentinel-master-0 -c redis \
  -- redis-cli flushall

Re-sync ExternalSecret (if password secret drifts)

kubectl annotate externalsecret redis-password -n redis-sentinel \
  force-sync=$(date +%s) --overwrite

Troubleshooting¶

Sentinel reports no master / `+sdown master`¶

Cause: Master pod restarted and Sentinel has not yet completed leader election, or quorum is not met because fewer than 2 Sentinel sidecars are running.

Resolution:

Check master pod: kubectl get pods -n redis-sentinel -l app.kubernetes.io/component=master
Check events: kubectl describe pod -n redis-sentinel redis-sentinel-master-0
If pod is Pending, check PVC: kubectl get pvc -n redis-sentinel
Election completes after downAfterMilliseconds (30 s) + failoverTimeout (up to 3 min). Watch Sentinel logs:
```
kubectl logs -n redis-sentinel redis-sentinel-master-0 -c redis-sentinel -f
```
Look for +odown (objective down declared), then +elected-leader, then +promoted-slave.

Loki cache errors: `NOAUTH Authentication required`¶

Cause: Loki's go-redis wrapper omits the SentinelPassword field, so it cannot authenticate to the Sentinel port (26379).

Resolution:

Loki must target the master service directly on :6379, not via Sentinel discovery:

# Loki HelmRelease values — correct connection config
storage_config:
  redis:
    endpoint: redis-sentinel-master.redis-sentinel.svc.cluster.local:6379

Do not use :26379 in Loki config.

See also: ADR-002

Stale or missing cache data after Redis was down for an extended period¶

Cause: Redis persistence is RDB-only with a 3-hour snapshot interval (save 10800 1). When pods restart, Redis replays the last dump.rdb from the PVC. Any data written in the window between the last snapshot and the crash is lost. For a cache this is expected, but callers need to handle it correctly.

Resolution:

Verify applications treat every cache miss as a valid code path and re-populate from the source of truth (database, upstream API) on miss. The cache should never be the last copy of authoritative data.
If Redis was down for longer than the snapshot interval (3 h), assume the cache is fully cold and expect a period of elevated backend load while it warms up.
If a downstream service appears to serve stale data after recovery, flush its specific key namespace rather than the entire cache:

# Example: flush all keys matching a pattern (SCAN-based, avoids KEYS on large datasets)
kubectl exec -n redis-sentinel redis-sentinel-master-0 -c redis -- \
  redis-cli --scan --pattern "your:key:prefix:*" | \
  xargs kubectl exec -n redis-sentinel redis-sentinel-master-0 -c redis -- redis-cli del

`WRONGTYPE` or stale key errors immediately after failover¶

Cause: Brief replication lag — a replica promoted to master may be missing the last few writes from the old master if it died mid-replication cycle (parallel-syncs: 1 limits concurrent resync, reducing but not eliminating this window).

Resolution:

This is expected behavior for an eventually-consistent cache tier. Confirm the application handles cache misses gracefully (populate from source of truth on miss). If the data absolutely cannot be stale, it belongs in PostgreSQL, not the cache.

RedisInsight — browser UI for Redis (http://redis.local)
Loki — uses Redis as multi-layer cache; see direct-connection workaround
ADR-002: Loki Redis Direct Connection
apps/base/redis-sentinel/ — Kubernetes manifests
base/services/redis-sentinel.yaml — Flux Kustomization with dependsOn
base/services/environment.env — all REDIS_* tuning variables
apps/base/redis-sentinel/ — Kubernetes manifests
base/services/redis-sentinel.yaml — Flux Kustomization
base/services/environment.env — environment variables

Generated from service-catalog.json at commit afaa467 · catalog sha c57d485ae12bc679

Redis Sentinel¶

Overview¶

Dependencies¶

Upstream — required before Redis Sentinel starts¶

Downstream — services that depend on Redis Sentinel¶

Purpose¶

Consumers¶

Features¶

Architecture¶

Steady-state topology¶

Failover sequence¶

Configuration¶

Access¶

In-cluster connection strings¶

RedisInsight UI¶

Port-forward (alternative)¶

Operations¶

Verification¶

Redis CLI¶

Common Operations¶

Troubleshooting¶

Sentinel reports no master / +sdown master¶

Loki cache errors: NOAUTH Authentication required¶

Stale or missing cache data after Redis was down for an extended period¶

WRONGTYPE or stale key errors immediately after failover¶

Related¶

Sentinel reports no master / `+sdown master`¶

Loki cache errors: `NOAUTH Authentication required`¶

`WRONGTYPE` or stale key errors immediately after failover¶