# Migrating EKS Nodes to Bottlerocket: Architecture, Operations, and Lessons Learned
After migrating multiple EKS clusters from Amazon Linux 2 to Bottlerocket, I want to share the architecture differences, what changed operationally, the EBS snapshot workflow we built for image caching, and the production challenges we hit along the way.
## Why Bottlerocket?
Amazon Linux 2 (AL2) is a general-purpose Linux distribution — it ships with a package manager, SSH, and a full userland. That flexibility comes at a cost: larger attack surface, mutable state, and operational patterns (SSH + bash scripts) that don't align well with Kubernetes-native workflows.
Bottlerocket is AWS's container-optimized OS, purpose-built to run containers and nothing else. The migration was driven by security posture, operational simplicity, and node provisioning speed.
## Architecture Deep Dive
Understanding Bottlerocket's architecture explains why the migration was worth the effort.
### Container-Optimized Minimal OS

Bottlerocket ships only what it takes to run containers: containerd, kubelet, and a minimal set of system services. That's it.
- No package manager (no `yum`, no `apt`)
- No SSH by default
- No user-installed software
The result is roughly an 80% smaller attack surface than a general-purpose Linux distribution.
### Immutable Root Filesystem
System components are read-only at runtime. You cannot modify them. Configuration happens through declarative TOML files, not imperative shell scripts. Any changes require a full image update — not in-place package installs.
This is a fundamental shift from how most teams operate with AL2.
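For example, a runtime setting is inspected and changed through Bottlerocket's API rather than by editing files on disk. A minimal sketch using the stock `apiclient` tool from the host namespace (the `motd` setting here is just an illustration):

```bash
# Settings live behind Bottlerocket's API; the root filesystem stays read-only
apiclient get settings.kubernetes          # inspect current Kubernetes-related settings
apiclient set motd="managed by platform"   # declarative change, no shell edits
```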
### Dual-Disk Architecture
Bottlerocket separates the OS and data onto two independent disks:
```mermaid
graph LR
    subgraph "Disk 1: OS Volume (/dev/xvda)"
        direction TB
        A["Read-only root filesystem<br/>~20 GB"]
    end
    subgraph "Disk 2: Data Volume (/dev/xvdb)"
        direction TB
        B["/var/lib/containerd<br/>/var/log<br/>50–100 GB"]
    end
```
The key benefit: container storage is completely isolated from the OS and survives OS updates. Your running containers and their data are unaffected when the OS partition is updated.
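From an admin container you can see the separation directly. A quick check, assuming the default `/dev/xvda` and `/dev/xvdb` device names from the diagram:

```bash
# After `sheltie` (host namespace): two independent disks
lsblk                       # xvda holds the A/B OS partitions, xvdb is the data volume
df -h /var/lib/containerd   # container images and state live on the data volume
```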
### A/B Partition Updates
The OS disk has two partitions — Partition A and Partition B. One is active, the other is passive.
```mermaid
sequenceDiagram
    participant Active as Partition A (Active)
    participant Passive as Partition B (Passive)
    participant Boot as Bootloader
    Note over Active: Running current OS version
    Active->>Passive: Write update to passive partition
    Boot->>Passive: Reboot → swap to Partition B
    alt Boot succeeds
        Note over Passive: Partition B is now Active
    else Boot fails
        Boot->>Active: Automatic rollback to Partition A
    end
```
When an update arrives:
- It writes to the passive partition
- On reboot, an atomic swap makes the passive partition active
- If the boot fails, it automatically rolls back to the previous partition
Zero-downtime updates with a built-in safety net.
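In production these updates are typically orchestrated by the Bottlerocket update operator, but the underlying flow can be exercised by hand with `apiclient`. A sketch from the host namespace:

```bash
apiclient update check   # ask the update repository whether a newer version exists
apiclient update apply   # download the image and write it to the passive partition
apiclient reboot         # reboot; the bootloader swaps the active partition
```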
### Security Enforcement
- SELinux in enforcing mode by default (AL2 typically runs permissive)
- IMDSv2 required — blocks SSRF attacks on the metadata service (see the check after this list)
- Fewer binaries overall → fewer CVEs to track and easier compliance audits
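To make the IMDSv2 point concrete: every metadata request must carry a session token, which is exactly what blocks the classic single-GET SSRF pattern. A quick check from an admin container or any host shell:

```bash
# IMDSv1-style request without a token: rejected with HTTP 401
curl -s -o /dev/null -w "%{http_code}\n" \
  http://169.254.169.254/latest/meta-data/instance-id

# IMDSv2: fetch a session token first, then present it on every request
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -sH "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id
```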
### Kubernetes-Native Operation Model
There's no more "SSH into the node and fix things." Everything flows through Kubernetes:
- Configuration → Kubernetes user data (TOML)
- Debugging → ephemeral admin containers via `kubectl`
- Monitoring → metrics and logs, not shell access
## The Admin Container: Debugging Without SSH
Since there's no SSH, how do you debug node-level issues? Bottlerocket provides an admin container — a special privileged container you temporarily deploy on a node for low-level access.
### How It Works
```bash
# Launch an admin container on a specific node
kubectl debug node/<node-name> -it \
  --image=public.ecr.aws/bottlerocket/bottlerocket-admin:latest \
  --profile=sysadmin

# Inside the admin container, useful commands:
sheltie                  # Enter the host namespace
journalctl -u kubelet    # Check kubelet logs
crictl ps                # List running containers
mount | grep xvd         # Inspect disk mounts
apiclient get settings   # View node configuration
```
Key characteristics:
- Runs with elevated privileges (`--profile=sysadmin`)
- Mounts the host filesystem for full node inspection
- Ephemeral — you delete it when done
For most day-to-day operations (checking logs, viewing metrics, inspecting resources), standard Kubernetes tooling like kubectl top node or your observability stack is still the recommended approach. The admin container is for when you need to go deeper.
## Operational Impact: What Changed
Here's a concrete comparison of how daily operations shifted:
| Area | Amazon Linux 2 | Bottlerocket |
|---|---|---|
| Node debugging | SSH + shell commands | Ephemeral admin container via `kubectl debug node` |
| Config changes | Bash scripts applied via SSH | TOML config via Kubernetes user data |
| OS patching | Scheduled maintenance window (`yum update` + reboot) | Atomic A/B partition update, zero downtime |
| AMI management | Custom AMIs with pre-pulled images (Packer builds) | AWS base AMI, no customization needed |
| Image caching | Bake container images into custom AMIs | Attach EBS snapshots with cached images |
| Multi-region | Copy custom AMIs across regions | Replicate EBS snapshots across regions |
| GPU drivers | Manual install and update | Built into AWS Bottlerocket NVIDIA AMI variants |
| Node provisioning | ~4 minutes to join cluster | ~30 seconds (87% reduction) |
The provisioning speed improvement was immediately noticeable during migration. For Karpenter-based autoscaling, this means much faster response to workload spikes and less time waiting for capacity.
## EBS Snapshot Workflow for Image Caching
This was a significant part of the migration effort. The problem: when a pod gets scheduled on a fresh node without cached images, pulling large container images can take 40+ minutes — unacceptable for user-facing workloads.
### The Old Approach (AL2)
With AL2, we maintained custom AMIs with pre-pulled images:
- A CI pipeline used HashiCorp Packer to build custom AMIs
- Separate AMIs for CPU and GPU across regions
- Every time new container images were released, rebuild all AMIs and replicate them across regions
- AMI IDs were hardcoded in infrastructure config and manually updated
This was a lot of maintenance overhead.
### The New Approach (Bottlerocket)
With Bottlerocket, we use AWS's base AMI (no customization) and manage EBS snapshots for image caching:
```mermaid
flowchart TD
    A["New container image released"] --> B["CI workflow triggered"]
    B --> C["Provision temporary<br/>CPU + GPU instances"]
    C --> D["Pull container images<br/>onto data volumes"]
    D --> E["Create EBS snapshots<br/>of data volumes"]
    E --> F["Store snapshot IDs<br/>in SSM Parameter Store"]
    F --> G["Karpenter EC2NodeClass<br/>references latest snapshot ID"]
    G --> H["New nodes boot with<br/>cached images attached"]
```
How it works:
- When new container images are released, automation provisions temporary CPU and GPU instances
- Images are pulled onto those instances' data volumes
- EBS snapshots are created from those data volumes
- Snapshot IDs are saved to SSM Parameter Store, so the node provisioner (Karpenter) always references the latest snapshot automatically
- In the EC2NodeClass `blockDeviceMappings`, we reference those snapshot IDs
- When Karpenter provisions a node, it uses the base Bottlerocket AMI and attaches the snapshot as the data volume — the node boots with all images already cached (a shell sketch of this pipeline follows below)
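A rough shell sketch of that pipeline. The instance ID, SSM parameter name, and region are placeholders, and the step that pulls images onto the temporary instance's data volume is omitted:

```bash
#!/usr/bin/env bash
set -euo pipefail

REGION="us-east-1"
INSTANCE_ID="i-0123456789abcdef0"   # temporary instance that pulled the images

# 1. Find the data volume (/dev/xvdb) attached to the temporary instance
VOLUME_ID=$(aws ec2 describe-volumes --region "$REGION" \
  --filters "Name=attachment.instance-id,Values=$INSTANCE_ID" \
            "Name=attachment.device,Values=/dev/xvdb" \
  --query 'Volumes[0].VolumeId' --output text)

# 2. Snapshot the data volume that now holds the cached container images
SNAPSHOT_ID=$(aws ec2 create-snapshot --region "$REGION" \
  --volume-id "$VOLUME_ID" \
  --description "Bottlerocket image cache $(date +%F)" \
  --query 'SnapshotId' --output text)
aws ec2 wait snapshot-completed --region "$REGION" --snapshot-ids "$SNAPSHOT_ID"

# 3. Publish the snapshot ID so node provisioning always sees the latest one
aws ssm put-parameter --region "$REGION" \
  --name "/bottlerocket/image-cache/cpu/snapshot-id" \
  --value "$SNAPSHOT_ID" --type String --overwrite
```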
This is AWS's officially recommended method. Packer doesn't even support Bottlerocket AMI customization. It's a much cleaner architecture with significantly better maintainability.
## EKS Cluster Configuration (Terraform)
Before configuring Karpenter, the base EKS cluster itself needs to run Bottlerocket. The EKS managed node group that runs cluster-critical components (CoreDNS, Karpenter itself, etc.) must also be migrated — these nodes aren't managed by Karpenter, they're managed by the EKS control plane via the Terraform EKS module.
If you're using the popular terraform-aws-modules/eks module, the key change is setting ami_type to BOTTLEROCKET_x86_64 (or BOTTLEROCKET_ARM_64) in your managed node group:
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 20.0"
cluster_name = "my-cluster"
cluster_version = "1.31"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
cluster_endpoint_public_access = true
eks_managed_node_groups = {
# System node group — runs Karpenter, CoreDNS, etc.
system = {
# Bottlerocket AMI type instead of AL2
ami_type = "BOTTLEROCKET_x86_64"
instance_types = ["m6i.xlarge"]
min_size = 2
max_size = 4
desired_size = 2
labels = {
"node-role" = "system"
}
taints = {
system = {
key = "CriticalAddonsOnly"
value = "true"
effect = "NO_SCHEDULE"
}
}
# IMDSv2 settings — hop limit must be 2 for containerized workloads
# (containers add an extra network hop; default of 1 blocks pod IMDS access)
metadata_options = {
http_endpoint = "enabled"
http_tokens = "required"
http_put_response_hop_limit = 2
}
# Bottlerocket uses TOML for node configuration (not bash scripts)
# This replaces the AL2 pattern of pre/post bootstrap user data
bootstrap_extra_args = <<-EOT
[settings.pki.custom-ca]
trusted = true
data = "-----BEGIN CERTIFICATE-----\nMIIBxTCCAWugAwIBAgIUZ3M...your-base64-cert-data...==\n-----END CERTIFICATE-----"
EOT
}
}
}
Two separate layers
Don't confuse the Terraform EKS module's managed node groups with Karpenter's NodePools. They serve different purposes:
- Terraform EKS managed node groups → bootstrap nodes that run cluster infrastructure (Karpenter, CoreDNS, kube-proxy). These exist before Karpenter is even installed.
- Karpenter NodePools + EC2NodeClasses → dynamically provisioned nodes for application workloads. Karpenter manages their lifecycle.
Both need to be configured for Bottlerocket independently.
## Karpenter Configuration
With the base cluster running Bottlerocket, now configure Karpenter to provision application workload nodes on Bottlerocket as well. This is where the EBS snapshot caching comes in.
EC2NodeClass — defines the node template, including the Bottlerocket AMI family and EBS snapshot-backed data volume:
```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: "KarpenterNodeRole-my-cluster"
  amiFamily: Bottlerocket
  # Karpenter v1 requires amiSelectorTerms; the alias tracks the latest Bottlerocket AMI
  amiSelectorTerms:
    - alias: bottlerocket@latest
  # IMDSv2 — hop limit must be 2 for containerized workloads
  metadataOptions:
    httpEndpoint: enabled
    httpTokens: required
    httpPutResponseHopLimit: 2
  # Bottlerocket user data is TOML, not bash
  userData: |
    [settings.pki.custom-ca]
    trusted = true
    data = "-----BEGIN CERTIFICATE-----\nMIIBxTCCAWugAwIBAgIUZ3M...your-base64-cert-data...==\n-----END CERTIFICATE-----"

    [settings.kubernetes]
    max-pods = 110
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  blockDeviceMappings:
    # Disk 1: OS volume
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 20Gi
        volumeType: gp3
        encrypted: true
    # Disk 2: Data volume — attach EBS snapshot with cached images
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true
        # Reference the snapshot containing pre-pulled container images
        snapshotID: "snap-0123456789abcdef0"
  tags:
    environment: production
    managed-by: karpenter
```
NodePool — defines scheduling constraints, instance types, and scaling behavior:
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        workload-type: general
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]
      # Karpenter auto-expires nodes for rolling updates
      expireAfter: 720h
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
  # Cluster-wide resource limits
  limits:
    cpu: "1000"
    memory: 1000Gi
```
For GPU workloads, use a separate NodePool and EC2NodeClass with amiFamily: Bottlerocket and NVIDIA-variant instance types:
```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu
spec:
  role: "KarpenterNodeRole-my-cluster"
  # Bottlerocket NVIDIA variants have built-in GPU support; Karpenter
  # resolves the NVIDIA variant automatically for GPU instance types
  amiFamily: Bottlerocket
  amiSelectorTerms:
    - alias: bottlerocket@latest
  # IMDSv2 — hop limit must be 2 for containerized workloads
  metadataOptions:
    httpEndpoint: enabled
    httpTokens: required
    httpPutResponseHopLimit: 2
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 20Gi
        volumeType: gp3
        encrypted: true
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
        encrypted: true
        # GPU workloads typically have larger images
        snapshotID: "snap-0abcdef1234567890"
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    metadata:
      labels:
        workload-type: gpu
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g", "p"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
      expireAfter: 720h
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
  limits:
    cpu: "200"
    memory: 800Gi
```
Tip
Store snapshot IDs in SSM Parameter Store and reference them dynamically in your infrastructure-as-code (e.g., Terraform or Helm values) rather than hardcoding them. This way, when the CI pipeline creates new snapshots, Karpenter picks up the latest cached images automatically.
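One way that dynamic reference can look in practice: resolve the parameter at deploy time and pass it into whatever renders the EC2NodeClass manifest. The chart path, release name, and value key below are hypothetical:

```bash
# Resolve the latest snapshot ID published by the CI pipeline
SNAPSHOT_ID=$(aws ssm get-parameter \
  --name "/bottlerocket/image-cache/cpu/snapshot-id" \
  --query 'Parameter.Value' --output text)

# Template it into the EC2NodeClass (hypothetical chart and value name)
helm upgrade --install karpenter-resources ./charts/karpenter-resources \
  --set nodeClass.dataVolumeSnapshotID="$SNAPSHOT_ID"
```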
Note
This snapshot-based approach only applies to clusters that require large image caching. Standard service clusters don't need it.
## Production Challenges and Lessons
### Mass Image Pulls During Migration
During one cluster migration, all deployments tried to pull Docker images simultaneously when new Bottlerocket nodes came online. This created significant memory pressure on the nodes.
Resolution: We leaned on Karpenter's self-healing. For stateless service deployments (which keep running unless manually stopped), we brought down all AL2 nodes at once and let the pods be recreated on fresh Bottlerocket nodes. For user-facing workloads (which restart naturally on termination), we used a gradual rollout instead. Everything settled in about 15 minutes as the image pulls spread out over time.
What I'd do differently:
- Over-provision capacity by 30% during the migration window
- Automate batch draining with health checks between batches, as sketched below
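A sketch of that batch drain, assuming the old nodes carry the managed node group label `eks.amazonaws.com/nodegroup=al2-workers` and the workloads live in a `production` namespace (both names are placeholders):

```bash
#!/usr/bin/env bash
# Drain AL2 nodes a few at a time, with a health check between batches
BATCH=3
DRAINED=0

NODES=$(kubectl get nodes -l eks.amazonaws.com/nodegroup=al2-workers \
  -o jsonpath='{.items[*].metadata.name}')

for NODE in $NODES; do
  kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=10m
  DRAINED=$((DRAINED + 1))
  if (( DRAINED % BATCH == 0 )); then
    # Wait for every deployment to report Available before the next batch
    kubectl wait --for=condition=Available deployment --all \
      -n production --timeout=10m
  fi
done
```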
### cgroup Version Mismatch
AL2 uses cgroup v1; Bottlerocket uses cgroup v2. Applications that directly interacted with cgroup APIs broke after migration.
We had two options: downgrade Bottlerocket to cgroup v1 for backward compatibility, or keep v2 and update the application code.
Decision: Keep cgroup v2. Since v1 is being phased out across the Linux ecosystem, downgrading would create technical debt. We worked with the affected team to find a temporary workaround for their production issue while they implemented a permanent fix for v2 compatibility.
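If you're auditing workloads ahead of a similar migration, the cgroup version a node runs is a one-liner to check from any pod or admin container:

```bash
# "cgroup2fs" means cgroup v2 (Bottlerocket); "tmpfs" means cgroup v1 (AL2)
stat -fc %T /sys/fs/cgroup
```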
### NVIDIA Device Plugin Conflict
On AL2, the NVIDIA device plugin for Kubernetes has to be installed manually to expose GPUs as schedulable resources. Bottlerocket's NVIDIA AMI variants have built-in GPU support.
When migrating GPU nodes, both the device plugin and Bottlerocket's native GPU support tried to expose GPUs — causing resource conflicts.
Temporary fix: Add a node affinity rule to keep the NVIDIA device plugin DaemonSet off Bottlerocket nodes:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # cluster-specific node label distinguishing the OS variant
            - key: kubernetes.io/os-variant
              operator: NotIn
              values:
                - bottlerocket
```
Permanent fix: Once all GPU workloads are fully migrated to Bottlerocket, remove the NVIDIA device plugin entirely.
## Key Takeaways
- Immutability changes your operational model — no SSH, no imperative scripts. Everything flows through Kubernetes APIs and declarative config. This is a paradigm shift that pays off in consistency and auditability.
- Dual-disk + A/B partitions = safe updates — container data survives OS updates, and failed updates automatically roll back. This eliminates an entire class of "patching gone wrong" incidents.
- EBS snapshots > custom AMIs for image caching — cleaner, more maintainable, and AWS's recommended approach. The snapshot workflow integrates naturally with Karpenter and SSM Parameter Store.
- Production migrations are about operational maturity — you can't plan for every edge case. What matters is the ability to triage quickly, mitigate impact, and circle back to do things properly. The cgroup mismatch and GPU plugin conflict were both discovered in production, and both were resolved without extended downtime.
- Node provisioning speed matters more than you think — going from 4 minutes to 30 seconds fundamentally changes how autoscaling feels. Karpenter can respond to workload spikes almost immediately.