Ollama¶
Ollama (GitHub) is a local LLM inference runtime that packages model weights, quantization configs, and a serving layer into a single deployable unit. Unlike cloud-hosted inference APIs (OpenAI, Anthropic, Google), Ollama runs entirely within the cluster — no data leaves the network boundary, latency is bounded by local compute rather than internet round-trips, and there are no per-token costs after initial deployment.
What distinguishes Ollama from other self-hosted inference runtimes (vLLM, llama.cpp server, TGI): it provides an OpenAI-compatible API out of the box, handles model lifecycle (pull, cache, swap) automatically, supports quantized model formats (GGUF) for running large models on constrained hardware, and exposes a simple REST interface on a single port. The trade-off is less fine-grained control over batching, scheduling, and GPU memory management compared to production-grade serving frameworks — acceptable for a platform where inference volume is moderate and operational simplicity is valued over throughput optimization.
Overview¶
| Property | Value |
|---|---|
| Namespace | ollama |
| Type | HelmRelease (chart: ollama v1.53.0) |
| Layer | AI agent platform |
| Chart | ollama v1.53.0 |
| Status | Enabled |
| Source | apps/base/ollama/ |
Dependencies¶
Upstream — required before Ollama starts¶
No upstream Flux dependencies — starts immediately.
Downstream — services that depend on Ollama¶
| Service | Dependency type | Reason |
|---|---|---|
kagent |
Flux dependsOn |
Requires Ollama |
Purpose¶
Ollama is the platform's local LLM inference backend, serving as the model execution layer for the kagent multi-agent system. The coordinator-agent and its seven specialized subagents (k8s-agent, observability-agent, gitops-agent, flux-agent, helm-agent, security-agent, finops-agent) route all inference requests through Ollama to run planning, tool-use, and response generation workloads against locally-hosted models.
Traffic from kagent reaches Ollama through the Agent Gateway proxy rather than direct service-to-service calls, allowing centralized request routing, observability, and potential model-level traffic shaping without modifying individual agent configurations.
Why local Ollama over cloud inference APIs: The kagent orchestrator-worker architecture generates high volumes of inter-agent inference calls — the coordinator decomposes tasks and spawns multiple subagents in parallel, each making its own LLM calls. At cloud API pricing, this multi-hop pattern would produce unpredictable and potentially significant per-query costs. Running inference locally caps cost at fixed compute resources regardless of call volume. Additionally, the platform manages Kubernetes clusters containing potentially sensitive workload metadata — keeping all inference data within the cluster boundary eliminates data exfiltration concerns entirely.
Why Ollama over vLLM or TGI: This platform runs on CPU-only nodes without GPU acceleration. Ollama's GGUF quantization support and single-binary deployment model make it the simplest path to running large models (72B parameter) on CPU hardware with acceptable latency for an operations-focused agent system where response time tolerances are seconds, not milliseconds.
Features¶
| Feature | Detail |
|---|---|
| Automatic model pull at startup | Container entrypoint pulls the configured model on first boot, ensuring the pod is ready to serve inference immediately after scheduling without manual operator intervention or init containers. |
| Persistent model storage | A PersistentVolumeClaim retains downloaded model weights across pod restarts and rescheduling, avoiding multi-gigabyte re-downloads that would otherwise block readiness for several minutes on each pod lifecycle event. |
| ClusterIP-only exposure | The inference endpoint is accessible only within the cluster via ClusterIP on port 11434 — no ingress, no external exposure — limiting the attack surface to intra-cluster traffic routed through the agent gateway. |
| Install and upgrade remediation | HelmRelease configures 3 retries for both install and upgrade operations with extended timeouts, accommodating the slow startup inherent in pulling and loading multi-gigabyte model weights into memory. |
Architecture¶
Ollama Deployment Topology¶
graph TD
subgraph ollama-ns["Namespace: ollama"]
OPod["ollama pod<br/>(LLM inference runtime)"]
PVC["PersistentVolumeClaim<br/>(model weight storage)"]
SVC["Service: ollama<br/>ClusterIP :11434"]
end
subgraph agentgateway-system["Namespace: agentgateway-system"]
AGW["agentgateway-proxy<br/>:9080"]
end
subgraph flux-system["Namespace: flux-system"]
HR["HelmRelease: ollama"]
REPO["HelmRepository<br/>helm.otwld.com"]
CM["ConfigMap: cluster-vars"]
end
OPod -->|"mounts"| PVC
SVC -->|"routes to :11434"| OPod
AGW -->|":11434"| SVC
HR -->|"deploys"| OPod
HR -->|"pulls chart"| REPO
CM -->|"substitutes env vars"| HR
Inference Request Flow¶
sequenceDiagram
participant KAgent as kagent agents
participant AGW as agentgateway-proxy :9080
participant Ollama as ollama :11434
participant PVC as Model Storage (PVC)
Note over Ollama,PVC: Startup: model pull
Ollama->>PVC: Load model weights into memory
KAgent->>AGW: POST /api/generate (inference request)
AGW->>Ollama: Forward to :11434
Ollama->>Ollama: Run inference (CPU)
Ollama-->>AGW: Streaming response
AGW-->>KAgent: Forward tokens
Configuration¶
All values sourced from base/services/environment.env
(base); per-environment overrides in clusters/stages/dev/.../environment.env.
| Parameter | Dev | Prod |
|---|---|---|
OLLAMA_CHART_VERSION |
1.53.0 |
1.53.0 |
OLLAMA_CPU_LIMIT |
1000m |
2000m |
OLLAMA_CPU_REQUEST |
500m |
500m |
OLLAMA_HOST |
http://agentgateway-proxy.agentgateway-system.svc.cluster.local:9080 |
http://agentgateway-proxy.agentgateway-system.svc.cluster.local:9080 |
OLLAMA_MEMORY_LIMIT |
3Gi |
4Gi |
OLLAMA_MEMORY_REQUEST |
2Gi |
2Gi |
OLLAMA_MODEL |
qwen2.5:72b |
qwen2.5:72b |
OLLAMA_STORAGE_SIZE |
5Gi |
10Gi |
Operations¶
Model pull stalls during pod startup¶
Symptoms: Pod stays in Init or Running but never becomes Ready. Container logs show repeated pulling manifest or downloading lines with no progress. HelmRelease shows install timeout after 10 minutes, retries exhausting.
kubectl logs -n ollama deploy/ollama --follow | grep -i "pull\|download\|error"
kubectl describe pod -n ollama -l app.kubernetes.io/name=ollama | grep -A5 "Events"
kubectl exec -n ollama deploy/ollama -- wget -q --spider https://registry.ollama.ai && echo "Registry reachable" || echo "Registry blocked"
kubectl get events -n ollama --sort-by='.lastTimestamp' | tail -20
kubectl get pvc -n ollama -o wide
kubectl describe helmrelease ollama -n flux-system | grep -A10 "Status"
OOMKilled during model loading¶
Symptoms: Pod restarts with OOMKilled reason. kubectl describe pod shows last termination reason as OOMKilled. Model partially loads then process is killed by the kernel OOM killer. Repeated CrashLoopBackOff with increasing backoff intervals.
kubectl get pod -n ollama -l app.kubernetes.io/name=ollama -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.reason}'
kubectl top pod -n ollama
kubectl describe pod -n ollama -l app.kubernetes.io/name=ollama | grep -A3 "Last State"
kubectl logs -n ollama deploy/ollama --previous | tail -50
kubectl get configmap cluster-vars -n flux-system -o yaml | grep -i "OLLAMA_MEMORY"
PersistentVolumeClaim stuck in Pending¶
Symptoms: Pod cannot schedule — stuck in Pending state. Events show waiting for a volume to be created or no persistent volumes available. New deployment or cluster rebuild triggers this when the storage provisioner is not yet ready.
kubectl get pvc -n ollama
kubectl describe pvc -n ollama
kubectl get storageclass
kubectl get events -n ollama --field-selector reason=ProvisioningFailed
kubectl get pod -n ollama -o wide
Inference requests timing out through Agent Gateway¶
Symptoms: kagent agents report timeout errors or empty responses. Direct curl to ollama service succeeds but requests through agentgateway-proxy.agentgateway-system.svc.cluster.local:9080 fail. Agent Gateway logs show upstream timeout or connection refused.
kubectl exec -n ollama deploy/ollama -- curl -s http://localhost:11434/api/tags
kubectl run curl-test --rm -i --restart=Never --image=curlimages/curl -- curl -s -m 10 http://ollama.ollama.svc.cluster.local:11434/api/tags
kubectl run curl-test2 --rm -i --restart=Never --image=curlimages/curl -- curl -s -m 10 http://agentgateway-proxy.agentgateway-system.svc.cluster.local:9080/api/tags
kubectl logs -n agentgateway-system deploy/agentgateway-proxy --tail=50 | grep -i "ollama\|timeout\|error"
kubectl get endpoints ollama -n ollama
Model not available after pod reschedule¶
Symptoms: Pod restarts successfully but inference requests return model not found errors. The model pull did not trigger on restart, or the PVC data is corrupted/empty. Ollama API returns empty model list from /api/tags.
kubectl exec -n ollama deploy/ollama -- curl -s http://localhost:11434/api/tags
kubectl exec -n ollama deploy/ollama -- ls -la /root/.ollama/models/
kubectl get pvc -n ollama -o jsonpath='{.items[0].status.phase}'
kubectl logs -n ollama deploy/ollama | grep -i "pull\|model\|error"
kubectl rollout restart deployment/ollama -n ollama
Related¶
apps/base/ollama/— Kubernetes manifestsbase/services/ollama.yaml— Flux Kustomizationbase/services/environment.env— environment variables
Generated from service-catalog.json at commit 165b485 · catalog sha 4d088b0b3a67b4c4