Temporal
Temporal (GitHub) is a durable execution platform designed so that workflows run to completion despite worker crashes and infrastructure failures. Where a plain job queue or hand-rolled state machine can lose in-flight progress on a crash, Temporal persists every workflow state transition to a database, which enables automatic retries, resumption from the point of failure, and workflow lifetimes ranging from minutes to months. What distinguishes it from alternatives like Airflow, Step Functions, or Celery is that workflows are written in general-purpose code (Go, Java, TypeScript, Python, among other SDKs) rather than as DAGs or JSON state machines. The runtime handles retries and timeouts, while compensation and saga-style distributed transactions are expressed as ordinary code inside the workflow.
The architecture separates concerns across four server roles: Frontend (API gateway, rate limiting, routing), History (workflow state machine execution, event sourcing), Matching (task queue dispatch, worker polling), and Worker (internal system workflows — archival, replication, batch operations). Each role scales independently and communicates via gRPC. Persistence is pluggable — Cassandra, MySQL, or PostgreSQL for the execution store, with a separate visibility store enabling complex workflow queries.
Temporal's durability model is built on event sourcing: every workflow decision and activity result is appended to an immutable history log in the persistence layer. On recovery, a worker replays this history through the workflow code to reconstruct the exact workflow state. Activities that already completed are not re-executed, because their recorded results are read back from the history. This makes Temporal suitable for orchestrating multi-service sagas, long-running human-in-the-loop processes, and scheduled batch pipelines where partial progress must not be lost.
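The replay idea can be illustrated with a minimal, self-contained sketch. It does not use the Temporal SDK: `run_workflow`, the event shape, and the `charge`/`ship` steps are hypothetical names chosen for illustration. It only shows how an append-only history lets a restarted run skip work whose result was already recorded.

```python
from typing import Any, Callable

# Hypothetical event shape: {"type": ..., "name": ..., "result": ...}
Event = dict[str, Any]


def run_workflow(history: list[Event],
                 steps: list[tuple[str, Callable[[], Any]]]) -> list[Any]:
    """Run steps in order, reusing results already recorded in the history."""
    recorded = {e["name"]: e["result"]
                for e in history if e["type"] == "ActivityCompleted"}
    results = []
    for name, activity in steps:
        if name in recorded:
            # Replay: take the stored result, do not repeat the side effect.
            results.append(recorded[name])
            continue
        result = activity()
        # Append-only: completed work is recorded before moving on.
        history.append({"type": "ActivityCompleted", "name": name, "result": result})
        results.append(result)
    return results


calls = {"charge": 0, "ship": 0}


def charge() -> str:
    calls["charge"] += 1
    return "charged"


def ship() -> str:
    calls["ship"] += 1
    return "shipped"


history: list[Event] = []
# First attempt "crashes" after the first step has been recorded.
run_workflow(history, [("charge", charge)])
# Recovery: the full workflow is replayed against the same history.
results = run_workflow(history, [("charge", charge), ("ship", ship)])
```

After recovery, `charge` has still run only once, which is the property the history log is there to provide.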
Overview
| Property | Value |
|---|---|
| Namespace | temporal |
| Type | HelmRelease (chart: temporal v0.51.0) |
| Layer | Application services |
| Chart | temporal v0.51.0 |
| Status | Enabled |
| Source | apps/base/temporal/ |
Dependencies
Upstream: required before Temporal starts
| Service | Reason | Status |
|---|---|---|
| external-secrets-config | Flux dependsOn | Active |
| postgresql-cluster | Flux dependsOn | Active |
Downstream: services that depend on Temporal
No known downstream Flux dependencies.
Purpose
Temporal serves as the platform's durable workflow orchestration engine — the coordination layer that manages multi-step, failure-prone processes across other services. Rather than building ad-hoc retry logic, state tracking, and failure recovery into each application, services delegate complex orchestration to Temporal and implement only the business logic as stateless activity handlers.
Concrete workloads include multi-service data pipelines that must either complete in full or be compensated, scheduled batch processing with checkpoint/resume semantics, and long-running automation workflows that survive pod restarts and node failures. Temporal replaces what would otherwise be fragile cron jobs with state scattered across Redis keys and database flags.
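The all-or-compensated behaviour expected of a multi-service pipeline is commonly implemented as a saga: each step carries an undo action, and a failure triggers the undo actions of the completed steps in reverse order. The sketch below is plain Python rather than Temporal SDK code, and `run_saga` and the step names are hypothetical.

```python
from typing import Callable

# (name, action, compensation); all three are illustrative placeholders.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step], log: list[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            return False
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True


def ok() -> None:
    pass


def boom() -> None:
    raise RuntimeError("load target unavailable")


log: list[str] = []
succeeded = run_saga(
    [("extract", ok, ok), ("transform", ok, ok), ("load", boom, ok)],
    log,
)
```

In a real workflow the durable history would additionally ensure that compensation itself resumes after a crash, which this in-memory sketch does not attempt to model.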
Why Temporal over simpler alternatives: Celery/BullMQ handle task queues well but lack workflow-level state, compensation logic, and cross-service saga support. Airflow is DAG-oriented and Python-only, poorly suited to general-purpose orchestration with dynamic branching. AWS Step Functions are vendor-locked and JSON-defined. Temporal uniquely combines: code-native workflow definitions, automatic retry with exponential backoff at both activity and workflow level, built-in versioning for zero-downtime workflow evolution, and a visibility layer for operational querying of running workflows.
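As a rough illustration of retry with exponential backoff, the sketch below computes the wait before each retry from an initial interval, a backoff coefficient, a maximum interval, and a maximum number of attempts. These four parameters correspond to the fields of a Temporal retry policy, but the function name and the default values used here are assumptions for the example, not settings read from this deployment.

```python
def retry_intervals(initial: float = 1.0,
                    coefficient: float = 2.0,
                    maximum: float = 100.0,
                    attempts: int = 5) -> list[float]:
    """Seconds to wait before each retry: initial * coefficient**n, capped at maximum.

    `attempts` counts the first try, so there are attempts - 1 retries.
    """
    return [min(initial * coefficient ** n, maximum) for n in range(attempts - 1)]


default_schedule = retry_intervals()
capped_schedule = retry_intervals(maximum=5.0, attempts=6)
```

With the illustrative defaults the waits double on each retry; the capped variant shows the maximum interval flattening the curve once it is reached.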
Why self-hosted over Temporal Cloud: Local-first development on Colima/Kind requires the full server stack accessible without internet. Self-hosting also enables tight integration with the shared PostgreSQL cluster (ADR-004) and future KEDA-based worker autoscaling (ADR-011) using task queue depth as the scaling signal.
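Using task queue depth as a scaling signal usually reduces to the ceiling-division rule that Kubernetes-style autoscalers apply to a metric and a per-replica target. The sketch below shows that arithmetic only. The function name, the target of 50 tasks per worker, and the replica bounds are hypothetical, since the source does not describe the ADR-011 configuration.

```python
import math


def desired_workers(backlog: int,
                    target_per_worker: int,
                    min_replicas: int = 1,
                    max_replicas: int = 10) -> int:
    """Replica count for a backlog: ceil(backlog / target), clamped to the bounds."""
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    wanted = math.ceil(backlog / target_per_worker)
    return max(min_replicas, min(wanted, max_replicas))


idle = desired_workers(0, 50)
moderate = desired_workers(120, 50)
saturated = desired_workers(10_000, 50)
```

The clamp matters in both directions: an empty queue still keeps the minimum number of workers polling, and a very deep queue cannot scale past the configured ceiling.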
Features
| Feature | Detail |
|---|---|
| PostgreSQL-backed dual-store persistence | Both the execution store (temporal database) and visibility store (temporal_visibility database) use the postgres12_pgx driver against the shared CNPG cluster, with Cassandra and Elasticsearch explicitly disabled. |
| Four-role server topology | Frontend, History, Matching, and Worker roles each run as separate replica sets within the same HelmRelease, enabling independent scaling per role. |
| Automatic schema migration | Schema setupDatabase and updateDatabase are enabled while createDatabase is disabled — Temporal auto-migrates its schema on upgrade but expects databases to be pre-provisioned by CNPG. |
| ExternalSecret credential injection | PostgreSQL credentials are pulled from LocalStack via ClusterSecretStore into a Kubernetes Secret with cnpg.io/reload label, enabling automatic credential rotation propagation. |
| Web UI with Traefik ingress | Temporal Web UI is exposed at temporal.local via Traefik IngressRoute on port 8080, providing workflow visibility, namespace management, and execution inspection. |
| Blob retention policy | Server runs with --blobsRetentionDays=7, limiting internal blob storage growth for completed workflow data. |
| Default namespace with 3-day retention | A default Temporal namespace is auto-created with 3-day closed workflow retention, balancing query visibility against storage growth. |
| Monitoring stack disabled | All bundled Prometheus, Grafana, and ServiceMonitor resources are disabled in favor of the platform-wide kube-prometheus-stack deployment. |
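The 3-day retention on the default namespace in the table above can be read as a simple age check on a workflow's close time. The sketch below shows only that check: the function name and workflow IDs are hypothetical, and the server's actual cleanup runs in the background, so real deletion may lag behind the moment this check first returns true.

```python
from datetime import datetime, timedelta, timezone


def retention_expired(close_time: datetime,
                      now: datetime,
                      retention_days: int = 3) -> bool:
    """True once a closed workflow is at least `retention_days` old."""
    return now - close_time >= timedelta(days=retention_days)


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
closed = [
    ("wf-a", datetime(2024, 1, 9, tzinfo=timezone.utc)),  # closed 1 day ago
    ("wf-b", datetime(2024, 1, 6, tzinfo=timezone.utc)),  # closed 4 days ago
]
expired = [workflow_id for workflow_id, closed_at in closed
           if retention_expired(closed_at, now)]
```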
Architecture
Temporal Deployment Topology
graph TD
subgraph flux-system["flux-system namespace"]
HR[HelmRelease: temporal-server]
REPO[HelmRepository: temporal]
KS[Kustomization: temporal]
end
subgraph temporal-ns["temporal namespace"]
FE[Frontend x2]
HI[History x2]
MA[Matching x2]
WK[Worker x2]
WEB[temporal-server-web]
ES[ExternalSecret: postgresql-cluster-app]
SEC[Secret: postgresql-cluster-app]
end
subgraph cnpg["cnpg-system namespace"]
PG[postgresql-cluster-rw]
DB1[(temporal)]
DB2[(temporal_visibility)]
end
subgraph infra["Infrastructure"]
LS[LocalStack SecretStore]
TR[Traefik]
end
KS -->|"dependsOn"| ESC[external-secrets-config]
KS -->|"dependsOn"| PGC[postgresql-cluster]
HR -->|"sourceRef"| REPO
ES -->|"ClusterSecretStore"| LS
ES -->|"creates"| SEC
FE -->|"sql :5432"| PG
HI -->|"sql :5432"| PG
MA -->|"sql :5432"| PG
WK -->|"sql :5432"| PG
PG --- DB1
PG --- DB2
FE -.->|"credentials"| SEC
HI -.->|"credentials"| SEC
MA -.->|"credentials"| SEC
WK -.->|"credentials"| SEC
TR -->|"Host: temporal.local :8080"| WEB
Temporal Request Flow
sequenceDiagram
participant Client as Temporal Client
participant FE as Frontend Service
participant MA as Matching Service
participant HI as History Service
participant WK as External Worker
participant PG as PostgreSQL
Client->>FE: StartWorkflowExecution (gRPC)
FE->>HI: Create workflow state
HI->>PG: Persist workflow event (temporal db)
HI->>MA: Schedule workflow task
MA->>WK: Dispatch task (long poll)
WK->>FE: RespondWorkflowTaskCompleted
FE->>HI: Record decision
HI->>PG: Append history events
HI->>PG: Update visibility (temporal_visibility db)
Configuration
All values sourced from base/services/environment.env (base); per-environment overrides in clusters/stages/dev/.../environment.env.
| Parameter | Dev | Prod |
|---|---|---|
| TEMPORAL_CHART_VERSION | 0.51.0 | 0.51.0 |
| TEMPORAL_DB_NAME | temporal | temporal |
| TEMPORAL_SERVER_CPU_LIMIT | 1000m | 4000m |
| TEMPORAL_SERVER_CPU_REQUEST | 1000m | 1000m |
| TEMPORAL_SERVER_MEMORY_LIMIT | 1Gi | 4Gi |
| TEMPORAL_SERVER_MEMORY_REQUEST | 1Gi | 2Gi |
| TEMPORAL_STORAGE_SIZE | 5Gi | 20Gi |
| TEMPORAL_VISIBILITY_DB_NAME | temporal_visibility | temporal_visibility |
| TEMPORAL_WEB_CPU_LIMIT | 250m | 1000m |
| TEMPORAL_WEB_CPU_REQUEST | 250m | 200m |
| TEMPORAL_WEB_MEMORY_LIMIT | 256Mi | 1Gi |
| TEMPORAL_WEB_MEMORY_REQUEST | 256Mi | 512Mi |
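The layering of base values with per-environment overrides can be sketched as a last-writer-wins merge of two env files. This is an illustration only: `parse_env` and `effective_config` are hypothetical helpers, the real merge is performed by the repository's tooling, and the assumption that the base file carries the Prod column while the dev file overrides it is not confirmed by the source. The sample values are taken from the table above.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def effective_config(base: str, override: str) -> dict[str, str]:
    """Overlay per-environment values on top of the base values."""
    return {**parse_env(base), **parse_env(override)}


BASE_ENV = """
TEMPORAL_CHART_VERSION=0.51.0
TEMPORAL_SERVER_MEMORY_LIMIT=4Gi
"""

DEV_ENV = """
# dev override (assumed layout)
TEMPORAL_SERVER_MEMORY_LIMIT=1Gi
"""

config = effective_config(BASE_ENV, DEV_ENV)
```

Keys that the override file does not mention fall through from the base file unchanged, which is why parameters such as the chart version show the same value in both columns.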
Operations
Schema migration job fails on upgrade
Symptoms: HelmRelease stuck in upgrade retries exhausted state. Schema init-container or Job shows Error: failed to execute statement or pq: relation already exists. Temporal server pods never start because Helm upgrade never completes.
kubectl get helmrelease temporal-server -n flux-system -o jsonpath='{.status.conditions[*].message}'
kubectl get jobs -n temporal -l app.kubernetes.io/component=schema --sort-by=.metadata.creationTimestamp
kubectl logs job/$(kubectl get jobs -n temporal -l app.kubernetes.io/component=schema -o jsonpath='{.items[-1].metadata.name}') -n temporal
# schema_version records the current schema version for each database:
kubectl exec -it postgresql-cluster-1 -n cnpg-system -- psql -U app -d temporal -c "SELECT * FROM schema_version;"
kubectl exec -it postgresql-cluster-1 -n cnpg-system -- psql -U app -d temporal_visibility -c "SELECT * FROM schema_version;"
# If a failed or stale schema Job is blocking the upgrade, delete it and let Flux retry:
flux suspend helmrelease temporal-server -n flux-system
kubectl delete jobs -n temporal -l app.kubernetes.io/component=schema
flux resume helmrelease temporal-server -n flux-system
Frontend pods unable to connect to PostgreSQL
Symptoms: Temporal frontend pods in CrashLoopBackOff with logs showing failed to initialize system namespace or unable to establish connection to SQL database. ExternalSecret may show SecretSyncedError condition.
kubectl get externalsecret postgresql-cluster-app -n temporal -o jsonpath='{.status.conditions[*]}' | jq .
kubectl get secret postgresql-cluster-app -n temporal -o jsonpath='{.data.host}' | base64 -d
kubectl get secret postgresql-cluster-app -n temporal -o jsonpath='{.data.password}' | base64 -d | head -c5; echo '...'
kubectl run pg-check --rm -it --image=postgres:16 -n temporal -- psql postgresql://app@postgresql-cluster-rw.cnpg-system.svc.cluster.local:5432/temporal -c 'SELECT 1;'
kubectl logs deployment/temporal-server-frontend -n temporal --tail=50 | grep -i 'persistence\|connection\|sql'
kubectl get cluster postgresql-cluster -n cnpg-system -o jsonpath='{.status.phase}'
History service OOMKilled under workflow load
Symptoms: History pods restarting with OOMKilled exit reason. kubectl top pods -n temporal shows history pods approaching memory limits. Workflow tasks timing out or returning RESOURCE_EXHAUSTED errors to workers.
kubectl get pods -n temporal -l app.kubernetes.io/component=history -o wide
kubectl describe pod -n temporal -l app.kubernetes.io/component=history | grep -A5 'Last State\|Limits\|Requests'
kubectl top pods -n temporal -l app.kubernetes.io/component=history
kubectl logs -n temporal -l app.kubernetes.io/component=history --previous --tail=100 | grep -i 'memory\|oom\|cache'
# List workflow IDs with the most runs. The executions table holds one row per run, so this counts runs rather than history events; per-execution history length is visible in the Web UI:
kubectl exec -it postgresql-cluster-1 -n cnpg-system -- psql -U app -d temporal -c "SELECT workflow_id, COUNT(*) AS run_count FROM executions GROUP BY workflow_id ORDER BY run_count DESC LIMIT 10;"
ExternalSecret not syncing credentials
Symptoms: Secret postgresql-cluster-app missing or stale in the temporal namespace. ExternalSecret status shows SecretSyncedError or ready: false. Temporal pods fail to authenticate to PostgreSQL.
kubectl get externalsecret postgresql-cluster-app -n temporal
kubectl describe externalsecret postgresql-cluster-app -n temporal | grep -A10 'Status:'
kubectl get clustersecretstore localstack-secretstore -o jsonpath='{.status.conditions[*]}' | jq .
kubectl logs -n external-secrets deployment/external-secrets --tail=50 | grep -i 'temporal\|postgresql-cluster-app'
# Verify secret exists in LocalStack:
kubectl exec -n localstack deployment/localstack -- awslocal secretsmanager get-secret-value --secret-id cnpg/postgresql-cluster-app/username --query SecretString --output text
# Force resync:
kubectl annotate externalsecret postgresql-cluster-app -n temporal force-sync=$(date +%s) --overwrite
Temporal Web UI unreachable via IngressRoute
Symptoms: Browsing http://temporal.local returns 404 or connection refused. Other IngressRoutes on the same Traefik instance work correctly. Temporal server pods are running and healthy.
kubectl get ingressroute temporal-web -n temporal -o yaml | grep -A10 'routes:'
kubectl get svc temporal-server-web -n temporal
kubectl get endpoints temporal-server-web -n temporal
kubectl port-forward svc/temporal-server-web -n temporal 8080:8080 &
curl -s -o /dev/null -w '%{http_code}' http://localhost:8080
# If port-forward works but IngressRoute doesn't, check Traefik routing:
kubectl logs -n traefik deployment/traefik --tail=100 | grep -i 'temporal'
Task queue backlog growing with no workers processing
Symptoms: Temporal Web UI shows increasing pending task count on task queues. Worker pods are running but not polling. Application workers report context deadline exceeded or server is not accepting new requests.
kubectl get pods -n temporal -l app.kubernetes.io/component=matching
kubectl logs -n temporal -l app.kubernetes.io/component=matching --tail=50 | grep -i 'error\|queue\|dispatch'
kubectl logs -n temporal -l app.kubernetes.io/component=frontend --tail=50 | grep -i 'rate\|limit\|reject'
# Check if frontend is reachable from within the cluster:
kubectl run grpc-check --rm -it --image=fullstorydev/grpcurl -n temporal -- -plaintext temporal-server-frontend.temporal.svc.cluster.local:7233 temporal.api.workflowservice.v1.WorkflowService/GetSystemInfo
# Verify matching service can reach history:
kubectl logs -n temporal -l app.kubernetes.io/component=matching --tail=100 | grep -i 'history\|unavailable\|connection'
Related
- apps/base/temporal/: Kubernetes manifests
- base/services/temporal.yaml: Flux Kustomization
- base/services/environment.env: environment variables
Generated from service-catalog.json at commit 165b485 · catalog sha 4d088b0b3a67b4c4