Kubernetes cluster administrators sometimes perform maintenance operations on nodes, such as hardware swaps and kernel or Kubernetes version upgrades. Application developers, on the other hand, might be interested in ensuring that their applications remain available during those maintenance operations. This is usually achieved through Disruption Budgets — here’s what the official docs have to say about them:
> [They] limit the number of concurrent disruptions that your application experiences, allowing for higher availability while permitting the cluster administrator to manage the cluster’s nodes.
To demonstrate how they’re utilized, let’s start with a simple nginx Deployment of two replicas:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
```
By default, old pods will be replaced by new ones following the `RollingUpdate` strategy, which swaps them out gradually. Through the `maxSurge` and `maxUnavailable` options, an application developer can control how those pods are replaced. In the example below, the number of available replicas should never drop below the deployment size itself, and at most one extra pod can be created:
```yaml
...
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
...
```
Finally, for the sake of the example, let’s ensure that two `nginx` pods never run on the same node through a `podAntiAffinity` rule:
```yaml
...
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - nginx
      topologyKey: kubernetes.io/hostname
...
```
At this point, the application developer can roll out new versions of their deployment safely. However, the application’s availability is still at risk of being ‘disrupted’ by maintenance operations that affect its pods. The cluster administrator can drain all nodes in the cluster, leaving all pods of the deployment in the Pending state:
```console
$ kubectl get deployment
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/2     2            0           110m

$ kubectl get pods
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-855f7d96b9-8d48g   0/1     Pending   0          3m22s
nginx-deployment-855f7d96b9-bgvzb   0/1     Pending   0          3m18s
```
This is where Disruption Budgets are useful. The application developer might decide to create the following `PodDisruptionBudget`, which has semantics similar to the `RollingUpdate` strategy of the Deployment above:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: nginx
```
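Once the budget is applied, its status can be inspected with `kubectl get poddisruptionbudgets`; with two healthy replicas and `maxUnavailable: 0`, the number of allowed disruptions is zero (a sketch of the expected output — the age will differ in your cluster):

```console
$ kubectl get poddisruptionbudgets
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nginx-pdb   N/A             0                 0                     10s
```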
However, setting a Disruption Budget with `maxUnavailable: 0` has important implications:

> If you set `maxUnavailable` to 0% or 0, or you set `minAvailable` to 100% or the number of replicas, you are requiring zero voluntary evictions. When you set zero voluntary evictions for a workload object such as ReplicaSet, then you cannot successfully drain a Node running one of those Pods. If you try to drain a Node where an unevictable Pod is running, the drain never completes. This is permitted as per the semantics of `PodDisruptionBudget`.
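As the quote notes, the same zero-voluntary-eviction requirement for this two-replica Deployment could alternatively be expressed with `minAvailable` (a sketch — either form triggers the behavior described below):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 2  # equal to the replica count, i.e. zero voluntary evictions
  selector:
    matchLabels:
      app: nginx
```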
## Impossible node drains
As suggested by the PDB docs, if we try to drain a node running a pod protected by a PDB with such characteristics, the drain will never complete. Take a cluster with four worker nodes as an example:
```console
$ kubectl get nodes
NAME                 STATUS   ROLES           AGE   VERSION
kind-control-plane   Ready    control-plane   58s   v1.25.3
kind-worker          Ready    <none>          34s   v1.25.3
kind-worker2         Ready    <none>          34s   v1.25.3
kind-worker3         Ready    <none>          34s   v1.25.3
kind-worker4         Ready    <none>          33s   v1.25.3

$ kubectl get pods -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
nginx-deployment-bdccd6b79-84zhj   1/1     Running   0          41s   10.244.3.2   kind-worker    <none>           <none>
nginx-deployment-bdccd6b79-hqbgx   1/1     Running   0          41s   10.244.1.2   kind-worker3   <none>           <none>
```
If we try to drain one of the nodes, the process will never complete:
```console
$ kubectl drain --ignore-daemonsets kind-worker
node/kind-worker cordoned
Warning: ignoring DaemonSet-managed Pods: kube-system/kindnet-dkp4b, kube-system/kube-proxy-c66b6
evicting pod default/nginx-deployment-bdccd6b79-84zhj
error when evicting pods/"nginx-deployment-bdccd6b79-84zhj" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/nginx-deployment-bdccd6b79-84zhj
error when evicting pods/"nginx-deployment-bdccd6b79-84zhj" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/nginx-deployment-bdccd6b79-84zhj
...
```
However, in some situations, such as the one above, a cluster administrator can successfully drain the node by issuing a restart of the application blocking the drain:
```console
$ kubectl rollout restart deployment nginx-deployment
deployment.apps/nginx-deployment restarted
```
Here’s a full example:
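Putting the two steps together, the sequence looks roughly like this (condensed from the outputs above — pod names and timings will differ in your cluster):

```console
# Terminal 1: the drain blocks, retrying the eviction every 5s
$ kubectl drain --ignore-daemonsets kind-worker
evicting pod default/nginx-deployment-bdccd6b79-84zhj
error when evicting pods/"nginx-deployment-bdccd6b79-84zhj" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

# Terminal 2: restart the blocking deployment; because of maxSurge: 1 /
# maxUnavailable: 0, a replacement pod is created (on another node, due to
# the anti-affinity rule) before the old one is removed
$ kubectl rollout restart deployment nginx-deployment
deployment.apps/nginx-deployment restarted

# Terminal 1: with the old pod gone, the drain completes
node/kind-worker drained
```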
This creates a dilemma: Disruption Budgets with those characteristics explicitly communicate the application’s “desire” to never be drained. On the other hand, the application’s `RollingUpdate` strategy allows it to be restarted while still respecting its availability requirements.
In practical terms, the cluster administrator can perform the necessary maintenance operations while respecting the availability characteristics of the application. But should cluster administrators restart applications in the cluster without the application developer’s consent?
In some circumstances, for example in companies where SRE teams maintain production clusters and product teams develop the applications, such operations could be performed, as there might not be strict terms of service in place.
I discussed this situation on the Kubernetes sig-cli Slack channel, and a few folks were receptive to the idea of giving Kubernetes users a way to automatically work around impossible drains.
The drain logic lives only in `kubectl`, which calls the Eviction API for each pod running on the node. My first idea was to introduce a new flag to `kubectl drain` that would trigger a rollout restart of the blocking controllers (`ReplicaSet`) when the Eviction API returned a 429 response.
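As a sketch of that idea — the flag name below is my invention and does not exist in kubectl — the invocation might have looked like:

```console
# Hypothetical: on a 429 from the Eviction API, drain would restart the
# blocking controller and retry the eviction once the replacement pod is ready.
$ kubectl drain --ignore-daemonsets --restart-blocking-controllers kind-worker
```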
When I proposed this in the sig-cli fortnightly meeting, we concluded that the current drain behavior meets the existing semantics and that impossible drain situations are opportunities to educate application developers about the implications of those restrictive PDBs. For example, OpenShift’s Kubernetes Controller Manager Operator has alerts configured for those restrictive PDBs:
> Standard workloads should have at least one pod more than is desired to support API-initiated eviction. Workloads that are at the minimum disruption allowed level violate this and could block node drain. This is important for node maintenance and cluster upgrades.
Nonetheless, the group suggested writing this up (this blog post) and presenting it to the sig-apps and sig-api-machinery groups. A potential proposal could be to introduce new functionality to the Eviction API, alongside a new field in the `PodDisruptionBudget` spec that would trigger the update of the blocking controller. For example:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  rolloutRestartAllowed: true
  maxUnavailable: 0
  selector:
    matchLabels:
      app: nginx
```
Here, the application developer would be explicitly granting the cluster administrator permission to perform updates to their deployment or application.