kagent¶
kagent (GitHub) is a CNCF Sandbox project that brings AI agent lifecycle management to Kubernetes as a first-class concern. Rather than treating agents as application-layer abstractions managed by Python frameworks (LangGraph, CrewAI, AutoGen), kagent defines agents, model configurations, and tool servers as Custom Resources — reconciled by a controller that handles instantiation, health, context compaction, and inter-agent communication natively within the Kubernetes control plane.
The platform's core CRDs are Agent (declarative agent specification including model, tools, system prompt, and A2A skills), ModelConfig (LLM provider settings — model name, context window, temperature, provider endpoint), and RemoteMCPServer (tool endpoint registration using the Model Context Protocol). The kagent-controller watches these CRDs and manages agent runtime lifecycle, while the optional kagent-ui provides a conversation interface for human operators.
What distinguishes kagent from general-purpose agent frameworks: agents are Kubernetes-native objects subject to standard GitOps workflows (Flux/ArgoCD), RBAC, network policies, and observability pipelines. There is no out-of-band agent registry or framework-specific state store — the cluster API server is the source of truth. This enables infrastructure teams to manage AI agents with the same tooling they use for any other workload: kubectl, Helm, Kustomize, and policy engines.
kagent supports the A2A (Agent-to-Agent) protocol for inter-agent delegation and MCP (Model Context Protocol) for tool integration, making it composable with ecosystem tooling like AgentGateway for traffic management, policy enforcement, and observability across agent-to-agent and agent-to-tool boundaries.
Overview¶
| Property | Value |
|---|---|
| Namespace | kagent |
| Type | Kustomization |
| Layer | AI agent platform |
| Status | Enabled |
| Source | apps/base/kagent/ |
Dependencies¶
Upstream — required before kagent starts¶
| Service | Reason | Status |
|---|---|---|
ollama |
Flux dependsOn |
Active |
cnpg-operator |
Flux dependsOn |
Active |
external-secrets-config |
Flux dependsOn |
Active |
traefik |
Flux dependsOn |
Active |
grafana-sa-setup |
Flux dependsOn |
Active |
Downstream — services that depend on kagent¶
| Service | Dependency type | Reason |
|---|---|---|
agentgateway |
Flux dependsOn |
Requires kagent |
Purpose¶
kagent is the platform's AI operations layer — a multi-agent system that provides natural-language access to cluster state, observability data, cost analysis, and GitOps-driven infrastructure changes. It replaces ad-hoc kubectl/Helm/Grafana context-switching with a single conversational interface backed by specialized agents, each scoped to a narrow operational domain with least-privilege tool access.
Concretely, kagent powers: cluster diagnostics (pod status, events, logs, Flux sync state), fleet-wide observability (cross-cluster Prometheus/Loki aggregation), cost analysis and right-sizing (OpenCost integration), automated incident response (Alertmanager → triage → investigation → diagnosis → reporting pipeline), and GitOps change proposals (draft PRs for infrastructure modifications). All write operations flow through Git — no agent applies changes directly to the cluster.
Why kagent over LangGraph/CrewAI/AutoGen: Those frameworks run agents as application processes with framework-specific state, requiring custom deployment, scaling, and monitoring infrastructure. kagent's CRD model means agents are reconciled by the same Flux pipeline that manages every other workload — no separate agent orchestration layer to maintain. Agent definitions live in Git alongside the infrastructure they operate on, enabling atomic rollback of both the agent configuration and the infrastructure it manages.
Why not a single monolithic agent: A single agent with all tools loaded simultaneously causes context window bloat (30+ tools confuse routing), creates security risk (one prompt injection accesses all tools), and prevents model-size optimization (planning tasks need larger models than simple tool-calling). The orchestrator-worker split allows the coordinator to use a high-quality reasoning model while workers use faster, smaller models optimized for tool execution.
Features¶
| Feature | Detail |
|---|---|
| Orchestrator-worker multi-agent architecture | coordinator-agent routes queries to 5 specialist workers (cluster-agent, observability-agent, git-agent, finops-agent, code-agent) using keyword-deterministic rules — not LLM intent classification. A separate incident-orchestrator sequences a 3-stage pipeline (investigation → diagnosis → reporting) for automated incident response. All agents are declared as Agent CRs with explicit tool allowlists and A2A skill advertisements. |
| Go runtime with model warming | All agents use runtime: go for ~2s cold-start (vs ~15s Python default). A CronJob (ollama-model-warmer) pings Ollama every 10 minutes with keep_alive=15m to keep both qwen2.5:72b and qwen2.5:14b-kagent resident in VRAM, reducing warm-start TTFR from ~90s to ~10-15s for the large model. |
| Tool federation via AgentGateway | RemoteMCPServer CRs point at AgentGateway proxy endpoints (:9090/mcp/k8s, :9090/mcp/helm) which fan out to per-cluster backend MCP servers. Tools are returned with cluster-prefixed names (e.g. k8s-tools-services-amer_k8s_get_resources), and each agent's toolNames allowlist filters to only its permitted subset. This enables multi-cluster tool routing without per-agent endpoint configuration. |
| Multi-tier input guardrail chain | Requests pass through 4 defense layers before reaching agents: (1) Traefik forwardAuth to input-guardrail — a stateless Python regex engine matching credential leaks, prompt injection, and jailbreak patterns in <5ms p99; (2) basicAuth identity gate via ExternalSecret-managed htpasswd; (3) coordinator-agent's prompt-level refusal backstop; (4) AgentGateway A2A policy for inter-agent injection on the A2A path. |
| Event-driven incident response pipeline | alertmanager-hook receives Alertmanager webhooks and writes critical alerts to a Redis Sentinel stream (incident:events, DB 4). KEDA scales stream-dispatcher replicas (1–5) based on pending entry count in consumer group dispatch-group. The dispatcher invokes triage-agent via A2A, which classifies/deduplicates alerts and escalates confirmed incidents to incident-orchestrator for the full investigation → diagnosis → reporting pipeline. |
| Context compaction with per-agent token budgets | Each agent declares a tokenThreshold (coordinator: 12K, cluster-agent: 24K, fleet agents: 16K) that triggers automatic conversation compaction. A summarizer using fast-model-config compresses older context while retaining the most recent N events (eventRetentionSize). This prevents context window exhaustion during long diagnostic sessions without losing critical state. |
| GitOps-only write enforcement with approval gates | git-agent is the sole mutation path — it reads current state, creates a feature branch (agent/*), commits changes, and opens a draft PR. Write tools (create_branch, create_or_update_file, push_files, create_pull_request) require explicit human approval via requireApproval. The merge_pull_request tool is intentionally excluded from the agent's toolset — humans always merge. |
| Tiered model configuration | Two ModelConfig CRs provide model-appropriate sizing: default-model-config (qwen2.5:72b, 16K context, used by coordinator and git-agent for complex reasoning) and fast-model-config (qwen2.5:14b-kagent, 8K context, used by all tool-calling workers and as the compaction summarizer). Temperature is pinned at 0.1 for deterministic tool-calling behavior. |
| Prometheus operational alerting | PrometheusRule kagent-operational defines 5 alerts: KagentOllamaRequestSurge (>10 invocations/min sustained 5m), KagentAgentLoopDetected (>5 invocations in 60s — critical), KagentEdgeRejectionSurge (>5 forwardAuth 403s/min), KagentAgentToolCountExceeded (>18 tools per agent — god-agent anti-pattern), and KagentLLMTokenBudgetExceeded (>1M tokens/hour through agentgateway). |
| OpenTelemetry distributed tracing | kagent-controller exports traces via OTLP/gRPC to the cluster's OpenTelemetry Collector (opentelemetry-collector.opentelemetry.svc.cluster.local:4317). Environment variables OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES are set per-component (controller, stream-dispatcher, alertmanager-hook) for service-level trace attribution in Jaeger. |
| NetworkPolicy micro-segmentation | Seven NetworkPolicy resources enforce least-privilege network access: intra-namespace communication, ingress from agentgateway-system (ports 8083/8084/80), traefik (UI access), and monitoring (Prometheus scraping); egress to agentgateway-system (MCP federation), infrastructure namespaces (cnpg-system, redis-sentinel, monitoring, opentelemetry), and kube-system DNS (UDP/TCP 53). |
Architecture¶
kagent Namespace Deployment Topology¶
graph TD
subgraph traefik-ns["traefik namespace"]
traefik[Traefik Ingress]
end
subgraph kagent-ns["kagent namespace"]
ingressroute["IngressRoute<br/>kagent.local:80"]
guardrail["input-guardrail<br/>:9292 forwardAuth"]
auth["kagent-auth<br/>basicAuth middleware"]
ui["kagent-ui<br/>:8080"]
controller["kagent-controller<br/>:8083 A2A / :8084 MCP"]
github-mcp["github-mcp-server<br/>:8082 Streamable HTTP"]
hook["alertmanager-hook<br/>:8080"]
dispatcher["stream-dispatcher<br/>KEDA 1-5 replicas"]
warmer["ollama-model-warmer<br/>CronJob */10"]
end
subgraph agentgateway-ns["agentgateway-system namespace"]
agw["agentgateway-proxy<br/>:8080 A2A / :9090 MCP"]
end
subgraph ollama-ns["ollama namespace"]
ollama["Ollama<br/>:11434"]
end
subgraph cnpg-ns["cnpg-system namespace"]
pg["postgresql-cluster<br/>kagent DB + pgvector"]
end
subgraph redis-ns["redis-sentinel namespace"]
redis["Redis Sentinel<br/>:26379"]
end
subgraph monitoring-ns["monitoring namespace"]
prom["Prometheus<br/>PodMonitor scrape"]
otel["OTel Collector<br/>:4317 OTLP/gRPC"]
end
traefik -->|"Host: kagent.local"| ingressroute
ingressroute -->|"forwardAuth"| guardrail
ingressroute -->|"basicAuth"| auth
ingressroute -->|":8080"| ui
controller -->|"OTLP/gRPC :4317"| otel
controller -->|"Ollama API :11434"| ollama
controller -->|"PostgreSQL :5432"| pg
controller -->|"appProtocol: kgateway.dev/a2a :8083"| agw
agw -->|"A2A :8083"| controller
agw -->|"MCP :8082"| github-mcp
dispatcher -->|"A2A :8080"| agw
dispatcher -->|"Sentinel :26379"| redis
hook -->|"XADD incident:events"| redis
warmer -->|"POST /api/generate"| ollama
prom -->|"scrape /metrics :8083"| controller
Multi-Agent Request Flow¶
sequenceDiagram
participant User
participant Traefik
participant Guard as input-guardrail
participant UI as kagent-ui
participant Ctrl as kagent-controller
participant AGW as agentgateway-proxy
participant Coord as coordinator-agent
participant Worker as specialist worker
User->>Traefik: POST kagent.local
Traefik->>Guard: forwardAuth (regex check)
Guard-->>Traefik: 200 OK (pass)
Traefik->>UI: proxy :8080
UI->>Ctrl: A2A invoke coordinator-agent
Ctrl->>AGW: A2A request :8080
AGW->>Coord: route to coordinator
Note over Coord: Keyword-deterministic routing
Coord->>AGW: delegate to worker (A2A)
AGW->>Worker: route to specialist
Worker->>AGW: MCP tool call :9090
AGW-->>Worker: tool response
Worker-->>AGW: structured envelope
AGW-->>Coord: worker response
Note over Coord: Render envelope to markdown
Coord-->>AGW: final response
AGW-->>Ctrl: A2A response
Ctrl-->>UI: display to user
UI-->>User: rendered answer
Configuration¶
All values sourced from base/services/environment.env
(base); per-environment overrides in clusters/stages/dev/.../environment.env.
| Parameter | Dev | Prod |
|---|---|---|
KAGENT_CHART_VERSION |
0.9.5 |
0.9.5 |
KAGENT_CONTROLLER_CPU_LIMIT |
500m |
1000m |
KAGENT_CONTROLLER_CPU_REQUEST |
100m |
100m |
KAGENT_CONTROLLER_MEMORY_LIMIT |
2Gi |
512Mi |
KAGENT_CONTROLLER_MEMORY_REQUEST |
256Mi |
128Mi |
KAGENT_CONTROLLER_REPLICAS |
1 |
1 |
KAGENT_UI_CPU_LIMIT |
250m |
500m |
KAGENT_UI_CPU_REQUEST |
100m |
100m |
KAGENT_UI_MEMORY_LIMIT |
512Mi |
1Gi |
KAGENT_UI_MEMORY_REQUEST |
256Mi |
256Mi |
KAGENT_UI_REPLICAS |
1 |
1 |
Operations¶
Related¶
apps/base/kagent/— Kubernetes manifestsbase/services/kagent.yaml— Flux Kustomizationbase/services/environment.env— environment variables
Generated from service-catalog.json at commit 165b485 · catalog sha 4d088b0b3a67b4c4