AKS Node Image Retirement: Migration Strategy and Lifecycle Management

The Unpleasant Surprise

I finished a deployment at 4:47 PM on a Thursday a while back. The cluster was stable, the workloads were healthy, I wasn't thinking about my node image. No one ever is. Then my security team pinged me about a published CVE, and I checked my node status and realized the OS kernel version was no longer going to be supported. The image itself still booted and my pods still ran, but boy, I was now in a state of managed crisis that I didn't sign up for.

In my experience, platform teams almost always experience image retirement as a forced event rather than a predictable operating model. The retirement announcement arrives, the urgency is real, and the uncertainty is immediate: which pools do you migrate? When? What breaks? How long can you run two images side by side? Who owns the decision? Whatever, we'll figure it out as we go, right? :)

I've spent enough time rebuilding nodes and troubleshooting post-migration failures to know this problem isn't about the image itself. It's about the absence of an intentional operating practice around node image decisions and transitions. Every team I've worked with started by treating image retirement as a surprise, and a few graduated to treating it as a compliance checkbox. The ones that actually got good at it started treating it as a structural part of cluster governance, where dependencies, test requirements, and version decisions become explicit choices with measurable consequences.

If you're running AKS right now, there's a concrete version of this staring at you: Azure Linux 2.0 hit end of life in November 2025, security patches stopped, and node images get removed entirely on March 31, 2026. The migration path is Azure Linux 3.0 (osSku AzureLinux3; verify the exact value against the current migration guide, as it may just be AzureLinux with the version selected automatically), and if you haven't started testing yet, the window is closing fast.

The goal of this article is to help you avoid that panic. Not to make image transitions easy (they aren't easy), but to make them deliberate, testable, observable, and grounded in real architectural consequences rather than panic scheduling.

What Image Retirement Actually Changes

An Azure Linux image retirement isn't a cluster unavailability event and it's not a Kubernetes breaking change. Your workloads don't suddenly stop running when an image reaches end of life, but a lot of quiet assumptions in your operational model are about to get expensive.

When Azure announces the retirement of a specific image version, here's what changes:

No new security patches. Fixes for CVEs discovered after retirement won't be backported. Kernel and container runtime patches stop arriving, and you accumulate unpatched surface area every single day you stay on that image.
No Kubernetes upgrade path. New AKS versions may assume a minimum image vintage, which locks you out of those upgrades entirely.
CSI driver incompatibility. Storage drivers and platform services assume specific kernel features, so an aged image may not support new CSI versions.
CNI constraints. Container networking layers depend on specific kernel capabilities, which is why Cilium, Azure CNI, and kubenet all have minimum OS kernel versions that you need to track.
Addon incompatibility. Metrics agents, security agents, and other daemonsets require specific libraries or kernel features that may not exist on an older image.
No new hardware optimizations. Newer images ship with kernel modules and firmware optimizations for newer VM SKUs, so an old image may not fully utilize a modern generation machine.

What doesn't change:

Existing workloads don't break instantly.
Your Kubernetes API version is unaffected.
Pods continue running until explicitly evicted.
The cluster stays operational as long as you maintain at least one supported node pool.
Pod specs, service definitions, and application logic need no changes.

The deception is that you can run a retired image for months and see no obvious problems. Cluster metrics stay stable, deployments roll normally, and it feels fine, right? But you're running in a constrained state where each week increases the surface area of unpatched CVEs, and you're increasingly locked out of platform upgrades that require newer images. Sure, it still works today, but this false stability later becomes a real urgency when migration begins and you discover dependencies you didn't know existed.

The Hidden Dependencies in the Stack

Before building a migration strategy, you need precision about what's independent and what's coupled. I've skipped this step before and regretted it. Don't make the same mistake.

A Kubernetes cluster is basically a stack of versions: API version, node OS kernel, node image version, container runtime, OS-level libraries, CSI driver versions, CNI plugin versions, daemonsets and addons, and workload assumptions. Not all of these move together, and that's where migrations get tricky.

The key decoupling: Kubernetes version is independent of node image version (new versions support the last 3-4 image versions, while old versions may not support brand new images). The container runtime, kernel version, and OS-level libraries, on the other hand, are all pinned to the image, so you accept that complete stack when you pick an image, while the kubelet follows the node pool's Kubernetes version. CSI drivers and Azure CNI are loosely coupled but depend on specific kernel capabilities and systemd updates. These dependencies are documented but often discovered during testing (ask me how I know). Daemonsets are coupled to Kubernetes version but may require libraries that only exist in newer images. Most containerized workloads are portable, though some depend on kernel features like seccomp behaviors, cgroup v2, or NUMA awareness.

The practical consequence is clear: when you retire an image, you're not just picking a new SKU. You're verifying that your complete stack of Kubernetes version, image version, CSI drivers, CNI, daemonsets, and applications forms a supported combination.

Inventory and Dependency Mapping

Before scheduling any node drains, inventory what's actually running and what it assumes. Alright, let's set it up.

Cluster inventory:

# Node pools and image versions
az aks nodepool list --resource-group <rg> --cluster-name <cluster>

# DaemonSets, CSI drivers, and custom agents
kubectl get daemonsets --all-namespaces
kubectl get csidriver
kubectl get pods --all-namespaces --field-selector=spec.hostNetwork=true

# Network (CNI) and storage (CSI driver) configuration
az aks show --resource-group <rg> --name <cluster> --query networkProfile
az aks show --resource-group <rg> --name <cluster> --query storageProfile

# Kubernetes version
kubectl version

This tells you what's touching the kernel or host network directly, and everything else is likely portable.
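
If you want the per-node view in one pass, here's a minimal sketch. It assumes the default agentpool label that AKS puts on nodes, and the two function names are mine, not anything standard. The pattern you filter on depends on how your nodes report their OS image; I believe Azure Linux 2.0 nodes show up as CBL-Mariner, but check your own output before trusting that.

```shell
#!/usr/bin/env bash
# Print one tab-separated line per node: name, pool, OS image, kernel, runtime.
# Assumes the default AKS "agentpool" node label has not been overridden.
node_inventory() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.agentpool}{"\t"}{.status.nodeInfo.osImage}{"\t"}{.status.nodeInfo.kernelVersion}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
}

# Keep only inventory lines whose OS image column (field 3) matches a pattern,
# and print the node name and pool. Hypothetical helper for illustration.
nodes_on_image() {
  local pattern="$1"
  awk -F'\t' -v pat="$pattern" '$3 ~ pat { print $1 "\t" $2 }'
}

# Usage against a live cluster (the pattern is an assumption, verify it):
#   node_inventory | nodes_on_image 'CBL-Mariner'
```

Keeping the filter separate from the kubectl call means you can save the inventory to a file and re-run the filter later without touching the cluster.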

Custom agents and dependencies:

Custom kernel modules (uncommon but critical if you have them)
Custom seccomp profiles that depend on specific kernel syscalls
Observability agents that depend on kernel features, things like Cilium Tetragon, Falco, or custom eBPF programs
Compliance or security scanning tools that assume specific libraries

Application workload profile:

What languages and runtimes are you running? (Go, Node.js, Java, Python, .NET)
Are there legacy applications that depend on specific glibc versions or kernel behaviors?
Do you use custom container images with pinned OS baselines?
How many data-intensive or compute-intensive workloads depend on kernel optimizations?

Target image validation:

Review the Azure Linux 3.0 release notes and actually read what changed in the kernel, libraries, and container runtime. Don't just skim the changelog.
Check the AKS supported version matrix, which is the canonical reference. If your combination isn't listed, it's not supported.
Test CSI and CNI against the new image on a test pool, because you should never assume compatibility.
Test addon compatibility on the new image before full migration.

The output is a dependency map:

Cluster X, pool A: Azure Linux 2.0 (retiring March 2026), Kubernetes 1.28, Azure CNI 1.5.2, Azure Disk CSI 1.28.1, Cilium Tetragon, custom seccomp policies.
Cluster X, pool B: Ubuntu 22.04 (supported through 2027), Kubernetes 1.29, Azure CNI 1.5.4, Azure Files CSI 1.27.0, Fluent Bit.
Cluster Y, pool C: Azure Linux 2.0 (retiring March 2026), Kubernetes 1.29, Azure CNI 1.5.6, Azure Disk CSI 1.29.2.

From this map, you identify which migrations are simple (no custom agents, compatible Kubernetes versions) and which are complex (kernel module dependencies, CNI constraints, incompatibilities).
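
The simple-versus-complex split can be sketched mechanically once the map lives in a file. This assumes a tab-separated layout I made up for illustration (cluster, pool, image, kernel-coupled agents), where an empty fourth column means nothing on that pool touches the kernel directly. Real classification needs more judgment than this, so treat it as a first pass.

```shell
#!/usr/bin/env bash
# Classify each pool in a tab-separated dependency map as simple or complex.
# Hypothetical columns: cluster, pool, image, kernel-coupled agents (may be empty).
classify_pools() {
  awk -F'\t' '{
    verdict = ($4 == "") ? "simple" : "complex"
    print $1 "/" $2 "\t" verdict
  }'
}

# Usage: classify_pools < dependency-map.tsv
```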

Migration Patterns

Alright, you've got two approaches here, which are in-place pool replacement and cluster migration. Here's the quick comparison before we dive in:

[Figure: AKS node pools showing system, spot, and azlinux3 pools side by side for migration]

Criterion                  In-Place Pool Replacement            Cluster Migration
Downtime risk              Zero control plane downtime          Zero if traffic-switched properly
Cost overhead              2x node pool cost during migration   2x full cluster cost during migration
Complexity                 Moderate (drain sequencing, taints)  High (two clusters, traffic routing)
Rollback speed             Fast (untaint old pool)              Slow (switch traffic back)
Best for                   Single-cluster, tight availability   Multi-cluster, strict compliance
Kubernetes version change  Same version, new image only         Can change both simultaneously
Observability              Before/after on same cluster         Separate dashboards per cluster

Pattern 1: In-Place Node Pool Replacement (Blue-Green)

Create a new node pool with the updated image, drain the old pool, and delete the old pool, which is the most common pattern for a single cluster.

Advantages:

Zero control plane downtime, which is the big one for most teams
Workloads stay on the same cluster so you don't have to deal with kubeconfig juggling or DNS changes
Clear before/after observability since both pools share the same monitoring stack
Fast human decision points, since each go/no-go call covers a single pool on a single cluster
Works with autoscaling (though you'll want to tune the autoscaler behavior during migration, more on that later)

Disadvantages:

Temporary cost increase because you're running two pools in parallel
Requires careful node scheduling with affinity rules, taints, and tolerations
Workloads may not be evictable if not designed for rolling updates
Drain sequencing matters a lot. Get it wrong and you end up with stuck pods

Best suited for:

Single-cluster deployments
Clusters with regional failure domain constraints
Workloads with strict availability requirements

Pattern 2: Cluster Migration (Lift and Shift)

Create a new cluster with the updated image, migrate workloads, and then delete or orphan the old cluster. This is the approach for large migrations or when you need more control.

Advantages:

Complete separation of old and new stacks, which eliminates any chance of accidental scheduling onto stale nodes
Easier for complex migrations with offline testing since you can take your time validating the new cluster
Works well for multi-region deployments
Simpler observability during transition because each cluster has its own clean dashboard

Disadvantages:

Temporary 2x infrastructure cost, which adds up fast at scale
Must manage traffic routing between old and new clusters
More moving pieces (two clusters, two kubeconfigs, potentially two sets of secrets)
Longer recovery path if something breaks
Requires external traffic switching through a load balancer or DNS

Best suited for:

Multi-cluster deployments where you can afford sequential migration
Environments with strict compliance or audit requirements
Cases where you want to test Kubernetes version changes in parallel

Hybrid: Canary Cluster with Traffic Mirroring

For conservative large environments, run old and new clusters in parallel, mirror traffic to new, and gradually shift weight. This trades temporary cost for maximum safety and is common in retail, fintech, and healthcare where downtime is unacceptable.

What Should You Actually Test?

You can't test everything, but you can test what matters, which is the kernel, the container runtime, persistent storage, networking, and daemonsets.

Kernel compatibility:

Run the new image on a small test pool and verify:

# Kernel syscall behavior
strace -e trace=<syscall> <app>

# Cgroup version (some old workloads expect cgroup v1); a single line starting with "0::" means v2
cat /proc/self/cgroup

# Custom kernel modules load cleanly
modprobe <module>

# Network policies work with Cilium or Azure CNI
kubectl apply -f <test-policy>.yaml
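
For the cgroup check specifically, the filesystem type mounted at /sys/fs/cgroup is a less ambiguous signal than reading /proc/self/cgroup by eye. A minimal sketch, assuming GNU stat is available on the node; the function name is hypothetical:

```shell
#!/usr/bin/env bash
# Map the filesystem type mounted at /sys/fs/cgroup to a cgroup version.
# cgroup2fs indicates the unified v2 hierarchy; tmpfs indicates v1.
cgroup_version() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown" ;;
  esac
}

# Usage on a node (GNU stat assumed):
#   cgroup_version "$(stat -fc %T /sys/fs/cgroup/)"
```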

Container runtime:

# SSH into test node and verify images work on the new OS
crictl pull <image>
# Run a test pod via kubectl to verify container behavior:
kubectl run img-test --image=<image> --restart=Never -- /bin/sh -c "echo ok"
# Verify stability, memory footprint, startup time

CSI and persistent volumes:

Create a test PVC and verify it provisions, mounts, and unmounts cleanly on the new image. Check node logs for mount errors. This is one of those things that usually works fine, but when it doesn't, the failure mode is ugly: pods stuck in ContainerCreating with cryptic mount errors that don't tell you much.
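
Here's roughly what that smoke test could look like. It assumes the built-in managed-csi storage class and the default agentpool node label; the csi-smoke-* names and the function itself are placeholders I picked. If your test pool carries taints, the pod will also need matching tolerations.

```shell
#!/usr/bin/env bash
# Print a PVC plus a pod that mounts it, pinned to the given test pool.
# Assumptions: managed-csi storage class, default agentpool label,
# hypothetical resource names (csi-smoke-pvc, csi-smoke-pod).
render_csi_smoke_test() {
  local pool="$1"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-smoke-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: managed-csi
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: csi-smoke-pod
spec:
  nodeSelector:
    agentpool: ${pool}
  restartPolicy: Never
  containers:
    - name: writer
      image: busybox
      command: ["sh", "-c", "echo ok > /data/probe && cat /data/probe"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: csi-smoke-pvc
EOF
}

# Usage: render_csi_smoke_test <test-pool> | kubectl apply -f -
# Then watch: kubectl get pod csi-smoke-pod -w
```

Delete the pod first and the PVC second afterward, so you also exercise the unmount and detach path.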

CNI and network policy:

Deploy a test NetworkPolicy and verify traffic is allowed or denied as expected.

DaemonSet compatibility:

Deploy your actual daemonsets (metrics agent, security agent, Cilium, etc.) on the test pool and verify they start cleanly:

kubectl get daemonsets --all-namespaces
kubectl describe node <test-node-name>
# Verify no pods in crash loops, no NotReady conditions

Workload testing:

Now, run actual production applications on the test pool for at least one full traffic cycle (24 hours). Observe:

CPU and memory usage against baseline
Startup time
Error rates (expect no increase over baseline)
Disk I/O and network latency patterns

Rollback test:

Drain the test pool and verify:

Pods evict cleanly and reschedule on old image nodes
No stuck or unreschedulable pods
Traffic routes smoothly back to old pool
Cluster recovers to baseline

This is a dress rehearsal for production rollback. If rollback is slow in test, it'll be slower in production, so don't skip this step.

Rollout Sequencing and Disruption Management

You now have a tested new image, and the next phase is actual replacement, which is where most migrations become chaos.

The core problem is that Kubernetes nodes have no atomic replace operation. Draining is a graceful eviction that depends on workload design: pods without disruption budgets are evicted as soon as the drain reaches them, pods protected by PDBs (Pod Disruption Budgets) wait until the budget allows it, and stateful pods may not drain at all.

Step 1: Taint the old pool.

Prevent new workloads from scheduling to the old pool:

kubectl taint nodes --selector=agentpool=<old-pool-name> lifecycle=retiring:NoSchedule

Existing workloads stay on that pool until you explicitly evict them.

Step 2: Pre-position replicas on the new pool.

For deployments with multiple replicas, scale up slightly so some replicas are already running on the new pool:

kubectl scale deployment <name> --replicas=<current + 1 or 2>

This reduces traffic loss during draining.

Step 3: Drain sequentially.

Don't drain all nodes in parallel. Drain them one at a time, starting with non-critical nodes:

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=300

The --grace-period=300 gives containers 5 minutes to shut down cleanly, which you should increase for workloads with long shutdown phases.

Watch the drain output carefully because it'll list pods that can't be evicted, which usually means a PDB would be violated or the pod isn't managed by a controller. Investigate these. If they're truly singleton or stateful, you may need to manually evict or accept brief downtime.

Step 4: Monitor the drain in real time.

As pods evict, they reschedule on the new pool, and you need to watch whether the new pool has capacity, whether pods are stuck in Pending state due to insufficient resources, and whether daemonsets are crashing on new nodes due to incompatibility.

If any of these happens, stop the drain, fix the issue, and resume. Autoscaling should add new nodes if capacity is the bottleneck.

Step 5: Delete the drained node.

Once fully drained:

kubectl delete node <node-name>

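Or let autoscaling remove it. Note that kubectl delete node only removes the Kubernetes node object; the underlying VM keeps running (and billing) until the pool is scaled down or deleted, so verify the instance is actually gone in the Azure portal.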

Step 6: Repeat for each node.

Once all old-pool nodes are drained and deleted, delete the pool:

az aks nodepool delete --resource-group <rg> --cluster-name <cluster> --name <old-pool-name>
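
The one-node-at-a-time loop from Steps 3 through 6 can be sketched like this. It assumes the default agentpool label, reuses the drain flags from Step 3, and adds a --timeout so a stuck eviction fails the loop instead of hanging forever (15 minutes is an arbitrary pick). It stops at the first failure on purpose, because that's the point where a human should be looking.

```shell
#!/usr/bin/env bash
# Drain every node in a pool one at a time and stop at the first failure.
# Assumes the default AKS "agentpool" node label; function name is hypothetical.
drain_pool_sequentially() {
  local pool="$1" node
  for node in $(kubectl get nodes -l "agentpool=${pool}" \
                  -o jsonpath='{.items[*].metadata.name}'); do
    echo "draining ${node}"
    if ! kubectl drain "${node}" --ignore-daemonsets --delete-emptydir-data \
         --grace-period=300 --timeout=15m; then
      echo "drain failed on ${node}; stopping so a human can investigate" >&2
      return 1
    fi
  done
}

# Usage: drain_pool_sequentially <old-pool-name>
```

This doesn't replace watching the output. It just guarantees the next node doesn't start draining while the previous one is still in trouble.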

Pod Disruption Budgets (PDBs):

PDBs slow rotation but prevent surprise downtime during drains:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app

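Audit all deployments and add PDBs before large migrations. Use maxUnavailable: 1 for most services and minAvailable: N for stateful services. A PDB can't rescue a single-replica deployment, though: maxUnavailable: 1 lets the only pod be evicted, and minAvailable: 1 blocks the drain entirely. Either way, single-replica deployments are a gap where eviction means downtime, so scale them up first.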
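
A quick way to run that audit, as a sketch. The two function names are mine; the jsonpath fields are standard Deployment fields.

```shell
#!/usr/bin/env bash
# List namespace, name, and desired replica count for every deployment
# as tab-separated lines.
deployment_replicas() {
  kubectl get deployments --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.replicas}{"\n"}{end}'
}

# Keep only deployments that run fewer than two replicas.
single_replica_only() {
  awk -F'\t' '$3 < 2 { print $1 "/" $2 }'
}

# Usage: deployment_replicas | single_replica_only
```

Anything this prints either needs a second replica before the drain or an explicit decision that brief downtime is acceptable.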

Autoscaler interactions:

If you've got autoscaling enabled, the autoscaler gets confused during migration because as pods evict from the old pool it sees underutilized nodes and scales down, even though you're running two pools intentionally. Give the autoscaler guidance:

# Lower scale-down delay so empty old-pool nodes get removed faster
az aks update --resource-group <rg> --name <cluster> \
  --cluster-autoscaler-profile scale-down-delay-after-add=5m

Or just disable autoscaling during migration and manage pool size manually. I for one have found this is actually less stressful than fighting the autoscaler's opinions about what "underutilized" means during a migration window. Whatever the autoscaler thinks is optimal, it's probably wrong during a migration.

Maintenance windows:

Schedule the migration during low-traffic hours (midnight to 6 AM in your primary region) and outside release windows, and have the team monitor for the first 24 hours after completion.

How Do You Know When to Roll Back?

After you start draining you can't be blind. You need signals that tell you immediately if something's wrong.

Metric signals:

# Pod restart count spike
kubectl get pods --all-namespaces -o json | jq '.items[].status.containerStatuses[]?.restartCount' | sort -n | uniq -c

# Stuck pending pods
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Node health
kubectl get nodes

# DaemonSet pod failures
kubectl get daemonsets --all-namespaces -o wide

Log signals:

Kernel logs (dmesg or journalctl -k; /var/log/kern.log where the distro writes one): kernel panics or OOM events mean you stop the migration immediately
Container runtime logs (journalctl -u containerd): mount failures, image pull failures, and runtime errors which indicate image-level incompatibilities
Kubelet logs (journalctl -u kubelet): API server connectivity and resource allocation failures
Application logs: sample representative pods for errors that correlate with the migration timing

Alerting thresholds:

Any node transitions to NotReady state
Pod restart rate > 5 per node per hour
Pending pods count > 0 for more than 2 minutes
DaemonSet pod failures > 0
Application error rate increase > 10% compared to baseline

Rollback decision criteria:

Stop the migration and roll back if any alerting condition persists for more than 5 minutes, if a workload can't be drained because it's stuck in Terminating for longer than the grace period, or if manual testing confirms a hard incompatibility.
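
The "persists for more than 5 minutes" rule is easy to get fuzzy about at 2 AM, so it helps to make it mechanical. A minimal sketch, where the function name and the idea of recording when an alert first fired are my assumptions about how you track alerts:

```shell
#!/usr/bin/env bash
# Succeed (exit 0) when an alert has persisted past the rollback limit.
# Arguments: epoch second the alert first fired, current epoch second.
# The 300-second limit mirrors the 5-minute rule above.
should_roll_back() {
  local first_fired="$1" now="$2" limit=300
  [ $(( now - first_fired )) -gt "$limit" ]
}

# Usage: should_roll_back "$alert_started_at" "$(date +%s)" && echo "roll back"
```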

To roll back:

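# Stop any running drain (Ctrl-C), then make the old pool schedulable again
kubectl taint nodes --selector=agentpool=<old-pool-name> lifecycle-
kubectl uncordon --selector=agentpool=<old-pool-name>

Pods that already moved stay on the new pool until something evicts them, so cordon and drain the affected new-pool nodes to push workloads back to the old pool. Then investigate the root cause, and either choose a different image or fix the incompatibility and retrigger the migration.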

The goal is simple: if you're going to roll back, do it within the first hour, because the longer you wait the more complex rollback becomes.

Who Owns What?

Image retirement isn't just a technical problem. It's a coordination problem across multiple teams, and in my experience the coordination is usually harder than the technical migration itself, which is easy to underestimate until you're in the middle of it.

Responsibility model:

Platform team: Owns the decision to retire an image, testing on updated runtimes, and rollout sequencing, and communicates timeline to application teams.
Application team: Owns testing of their workload on the new image, which means identifying incompatibilities early and working with platform team to resolve them.
Security team: Owns the policy that triggers retirements (usually: no patch support means no deployment), reviews the migration plan, and signs off on timeline.
Compliance team: Owns the audit trail and verification that all clusters migrated on schedule.

Communication cadence:

Start pinging people early. Like 4-6 weeks before the migration, send the announcement with the target window and the dependency list. Ask app teams to actually test their workloads on a test pool (they won't do it immediately, but at least you have a paper trail :)). Around 2-4 weeks out, share what you've found in testing, publish the runbook, and start having the uncomfortable conversations about incompatibilities. A week or two before, you lock the window and get signoffs, and boy, getting those signoffs is sometimes harder than the actual migration. On the day itself, run it with full observability and do a retro after.

Escalation and approval:

Keep this simple: if alerts fire but you fix them within 30 minutes, just keep going. If you need app code changes, get the app team to approve. If you need to delay, loop in security. And if you need to roll back completely, tell management, which is awareness not approval, because at that point you're already rolling back whether they like it or not.

Living runbook:

Maintain a runbook that evolves:

A checklist of tests to run before migration, including the kernel, CSI, CNI, and daemonset checks from earlier
Step-by-step drain procedure with the exact commands and flags
Monitoring thresholds and the specific responses for each alert condition
Rollback procedure and decision criteria, written so someone who wasn't in the planning meetings can execute it
Post-migration validation steps covering both infrastructure health and application behavior
Lessons learned from previous migrations (this section grows the most over time)

Update after each migration. The first migration teaches you how to run the second.

The Real Cost of Migration

Now for the most commonly avoided topic, which is the cost of running two node pools in parallel.

Direct infrastructure cost:

Running two pools at 100% capacity doubles the daily cost. So, a cluster costing $2,000/day costs $4,000/day during migration, which means a 7-day migration adds $14,000 in incremental cost per cluster and at scale (10 clusters) that's $140,000.
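
The arithmetic is simple enough to keep as a one-liner you can rerun with your own numbers. A sketch, with a hypothetical function name and whole-dollar inputs:

```shell
#!/usr/bin/env bash
# Incremental cost of running a second pool in parallel during migration.
# Arguments: normal daily cluster cost (dollars), migration length (days),
# number of clusters. Assumes the second pool runs at full size throughout.
migration_overhead() {
  local daily="$1" days="$2" clusters="$3"
  echo $(( daily * days * clusters ))
}

migration_overhead 2000 7 1    # one cluster for a week: prints 14000
migration_overhead 2000 7 10   # ten clusters: prints 140000
```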

You can't avoid this cost entirely, but you can shrink it.

Ways to reduce cost:

Stagger migrations across clusters. Instead of migrating all clusters at once, migrate one every week or two, which spreads cost over time and also lets you apply lessons from the first migration to the next one.
Reduce old-pool sizing during migration. As you drain it, its utilization drops, so you can auto-scale it down or manually delete mostly-empty nodes to stop paying for idle capacity.
Use burstable VMs for test pools. When first testing the new image, use burstable or low-priority SKUs since you don't need production-grade compute for compatibility testing (though don't use them for the actual performance validation step, which needs real hardware).
Consolidate workloads before migration. Merge underutilized clusters before image lifecycle forces migration.
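Run migrations off-hours. If clusters are elastic, off-hours pools are already at their smallest, so the parallel capacity you pay for is smaller too. Scale the old pool down to zero as soon as the migration completes.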

Real cost calculation:

Don't think about it as "extra cost for a week." Think about it as a percentage of annual infrastructure spend, because if annual cluster cost is $1M and migrations cost $200K/year (spread across 4-5 migrations), that's a 20% tax on infrastructure. It's not an anomaly but a line item that belongs in budget planning from day one.
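
Same idea for the percentage view. A sketch using integer math, so it rounds down; the function name is hypothetical:

```shell
#!/usr/bin/env bash
# Migration spend as a whole-number percentage of annual infrastructure cost.
# Arguments: annual cluster cost, annual migration cost (same currency unit).
migration_tax_percent() {
  local annual="$1" migration="$2"
  echo $(( migration * 100 / annual ))
}

migration_tax_percent 1000000 200000   # prints 20
```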

Better: build this into image lifecycle planning. If each image lasts roughly three years and you run a fleet of, say, 45-60 node pools, that works out to roughly 15-20 pool migrations per year, which is predictable cost that you should account for.

Anti-Patterns That Make Migrations Painful

Single-replica deployments without PDBs.

A single-replica deployment has no fault tolerance. When you drain the node the pod is evicted and there's a gap before a new instance starts on the new pool, which for customers is downtime, full stop.

Fix: before migration, audit your deployments, scale all critical services to at least 2 replicas, and add a PDB to maintain at least one available.

Testing only on a tiny test pool.

Some teams create a single test node, verify it works, and assume readiness. But a single node doesn't reveal scheduling constraints, resource contention, or autoscaler interactions, which is why the test is misleading.

Fix: test the new image on a pool that uses your production VM size and a meaningful fraction of the production node count. If you run 50-node pools in production, test on a 5-10 node pool.

Running drain and walking away.

Drain can get stuck if pods aren't evictable, and I've seen migrations stall for hours because no one was watching and no one realized a StatefulSet pod was stuck in Terminating state.

Fix: humans must observe the drain in real time. Run the drain in a terminal, watch the output, and intervene if something's stuck.

Draining all nodes in parallel.

If you drain every node at once, or delete the node pool before all nodes drain, every evicted pod lands on the new pool at the same time. The new pool may not have capacity, which means pods get stuck Pending and the migration stalls.

Fix: drain sequentially. One node, wait for it to fully empty, then move to the next, which takes longer in clock time but is more reliable and shows progress.

No monitoring, discovery during application incidents.

This is running the migration late at night with no observability and then finding out the next morning that something's broken. That morning is now a production incident involving a node image incompatibility.

Fix: monitor the migration in real time for the first 24 hours, and have someone on call who understands the migration plan and can make quick decisions.

Ignoring daemonset compatibility.

Daemonsets like the metrics agent, network plugin, or security agent can fail if they depend on specific libraries or kernel features. The migration proceeds until suddenly every pod on new nodes can't reach the API server.

Fix: include daemonsets in pre-migration testing by deploying them on the test pool and verifying they start cleanly.

No rollback plan or executing rollback too late.

By the time teams realize the migration's broken, they've already deleted half the old pool, which means rolling back now requires restoring from backups or acknowledging the old infrastructure is gone.

Fix: define explicit rollback criteria before migration starts. If an alert condition persists past the threshold you agreed on, stop and roll back, and keep the old pool intact for at least 4 hours after the last drain so there's still something to roll back to.

Treating the new image as the only outcome.

Some teams commit publicly to a migration date and push through even when compatibility issues emerge, and the pressure to meet the announced timeline overrides the decision to delay and fix the incompatibility.

Fix: tie the migration date to successful testing, not the calendar. A delayed migration that works is better than a broken migration that met the deadline, so communicate early: "migration date is dependent on test results."

Image Lifecycle as Governance, Not Crisis Response

What I've learned is that the real value isn't in running one migration smoothly. It's in turning image lifecycles into a predictable, repeatable practice that your team actually knows how to do.

This requires:

A published image retirement calendar so teams can plan months ahead, not weeks
Inventory automation so you always know what's running and when it retires, without someone manually checking az aks nodepool list every Monday morning (because they'll forget, and the one Monday they skip is the one that matters)
Pre-built test pools that teams self-serve without waiting for platform approval
Standardized playbooks so the second migration is easier than the first, and the third one is almost boring
Cost tracking so you understand the real annual cost of image rotation and can budget for it properly instead of treating every migration bill as a surprise
Post-migration retrospectives so each migration teaches the next, with specific action items that actually get followed up on
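
The inventory automation piece doesn't need to be fancy to beat a Monday-morning manual check. Here's a minimal sketch of the date half of it, assuming GNU date (BSD and macOS date take different flags) and a 90-day warning window that I picked arbitrarily. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Whole days between two ISO dates (GNU date assumed).
days_between() {
  local from="$1" to="$2"
  echo $(( ( $(date -u -d "$to" +%s) - $(date -u -d "$from" +%s) ) / 86400 ))
}

# Print a warning when a retirement date is within 90 days of "today".
# Passing today explicitly keeps the check reproducible in tests and reports.
retirement_warning() {
  local today="$1" retires="$2" left
  left="$(days_between "$today" "$retires")"
  if [ "$left" -le 90 ]; then
    echo "WARN: ${left} days until retirement"
  else
    echo "OK: ${left} days until retirement"
  fi
}

# Usage: retirement_warning "$(date -u +%F)" 2026-03-31
```

Wire the output into whatever already pages or posts to chat, so the reminder shows up without anyone having to remember to look.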

When image lifecycle becomes governance instead of crisis response, a lot of operational friction just disappears. Teams plan, platform knows what to prepare, security knows what compliance gates apply, and application teams know they have to test on the new image early and sign off, which means by the actual migration day everyone's ready.

I won't pretend this is simple. It takes two or three painful migrations before the muscle memory kicks in and your org stops treating image retirement like an emergency. But that's pretty much true of every operational practice worth having. The first time hurts, the second time is awkward, and by the third time you've got a playbook that actually works. I for one have seen teams go from "this is a crisis" to "this is a Tuesday" after three rounds, and that transformation is worth the initial pain.

That being said, have a good one!