Kubernetes cluster administrators sometimes perform maintenance operations on nodes, such as hardware swaps and kernel or Kubernetes version upgrades. Application developers, on the other hand, might be interested in ensuring that their applications remain available during those maintenance operations. This is usually achieved through Disruption Budgets — here’s what the official docs have to say about them:
> [They] limit the number of concurrent disruptions that your application experiences, allowing for higher availability while permitting the cluster administrator to manage the cluster’s nodes.
To demonstrate how they’re utilized, let’s start with a simple nginx Deployment of two replicas:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
```
By default, old pods will be replaced by new ones following the `RollingUpdate` strategy, which swaps them out gradually. Through the `maxSurge` and `maxUnavailable` options, an application developer can control how those pods are replaced. In the example below, the number of available replicas should never drop below the deployment size itself, and at most one extra pod can be created:
```yaml
...
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
...
```
Finally, for the sake of the example, let’s ensure that two `nginx` pods never run on the same node through a `podAntiAffinity` rule:
```yaml
...
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - nginx
      topologyKey: kubernetes.io/hostname
...
```
At this point, the application developer can roll out new versions of their deployment safely. However, the application’s availability is still at risk of being ‘disrupted’ by maintenance operations that affect its pods. The cluster administrator can drain all nodes in the cluster, leaving all pods of the deployment in the Pending state:
```console
$ kubectl get deployment
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/2     2            0           110m

$ kubectl get pods
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-855f7d96b9-8d48g   0/1     Pending   0          3m22s
nginx-deployment-855f7d96b9-bgvzb   0/1     Pending   0          3m18s
```
This is where Disruption Budgets are useful. The application developer might decide to create the following `PodDisruptionBudget`, which has semantics similar to the `RollingUpdate` strategy of the Deployment above:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: nginx
```
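Once the budget is applied, its status can be inspected with `kubectl get poddisruptionbudgets`; with two healthy replicas and `maxUnavailable: 0`, the number of allowed disruptions is zero (a sketch of the expected output — the age will differ in your cluster):

```console
$ kubectl get poddisruptionbudgets
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nginx-pdb   N/A             0                 0                     10s
```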
However, setting a Disruption Budget with `maxUnavailable: 0` has important implications:

> If you set `maxUnavailable` to 0% or 0, or you set `minAvailable` to 100% or the number of replicas, you are requiring zero voluntary evictions. When you set zero voluntary evictions for a workload object such as ReplicaSet, then you cannot successfully drain a Node running one of those Pods. If you try to drain a Node where an unevictable Pod is running, the drain never completes. This is permitted as per the semantics of `PodDisruptionBudget`.
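As the quote notes, the same zero-voluntary-eviction requirement for this two-replica Deployment could alternatively be expressed with `minAvailable` (a sketch — either form triggers the behavior described below):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 2  # equal to the replica count, i.e. zero voluntary evictions
  selector:
    matchLabels:
      app: nginx
```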
## Impossible node drains
As suggested by the PDB docs, if we try to drain a node running a pod protected by a PDB with such characteristics, the drain will never complete. Take a cluster with four worker nodes as an example:
```console
$ kubectl get nodes
NAME                 STATUS   ROLES           AGE   VERSION
kind-control-plane   Ready    control-plane   58s   v1.25.3
kind-worker          Ready    <none>          34s   v1.25.3
kind-worker2         Ready    <none>          34s   v1.25.3
kind-worker3         Ready    <none>          34s   v1.25.3
kind-worker4         Ready    <none>          33s   v1.25.3

$ kubectl get pods -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
nginx-deployment-bdccd6b79-84zhj   1/1     Running   0          41s   10.244.3.2   kind-worker    <none>           <none>
nginx-deployment-bdccd6b79-hqbgx   1/1     Running   0          41s   10.244.1.2   kind-worker3   <none>           <none>
```
If we try to drain one of the nodes, the process will never complete:
```console
$ kubectl drain --ignore-daemonsets kind-worker
node/kind-worker cordoned
Warning: ignoring DaemonSet-managed Pods: kube-system/kindnet-dkp4b, kube-system/kube-proxy-c66b6
evicting pod default/nginx-deployment-bdccd6b79-84zhj
error when evicting pods/"nginx-deployment-bdccd6b79-84zhj" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/nginx-deployment-bdccd6b79-84zhj
error when evicting pods/"nginx-deployment-bdccd6b79-84zhj" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/nginx-deployment-bdccd6b79-84zhj
...
```
However, in some situations, such as the one above, a cluster administrator can successfully drain the node by issuing a restart of the application blocking the drain:
```console
$ kubectl rollout restart deployment nginx-deployment
deployment.apps/nginx-deployment restarted
```
Here’s a full example:
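Putting the two steps together, the sequence looks roughly like this (condensed from the outputs above — pod names and timings will differ in your cluster):

```console
# Terminal 1: the drain blocks, retrying the eviction every 5s
$ kubectl drain --ignore-daemonsets kind-worker
evicting pod default/nginx-deployment-bdccd6b79-84zhj
error when evicting pods/"nginx-deployment-bdccd6b79-84zhj" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

# Terminal 2: restart the blocking deployment; because of maxSurge: 1 /
# maxUnavailable: 0, a replacement pod is created (on another node, due to
# the anti-affinity rule) before the old one is removed
$ kubectl rollout restart deployment nginx-deployment
deployment.apps/nginx-deployment restarted

# Terminal 1: with the old pod gone, the drain completes
node/kind-worker drained
```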
This creates a dilemma: Disruption Budgets with those characteristics explicitly communicate the application’s “desire” to never be drained. On the other hand, the application’s `RollingUpdate` strategy allows it to be restarted while still respecting its availability requirements.
In practical terms, the cluster administrator can perform the necessary maintenance operations while respecting the availability characteristics of the application. But should cluster administrators restart applications in the cluster without the application developer’s consent?
In some circumstances, for example in companies where SRE teams maintain production clusters and product teams develop the applications, such operations could be performed, as there might not be strict terms of service in place.
I discussed this situation on the Kubernetes sig-cli Slack channel, and a few folks were receptive to the idea of giving Kubernetes users a way to automatically work around impossible drains.
The drain logic lives only in `kubectl`, which calls the Eviction API for each pod running on the node. My first idea was to introduce a new flag to `kubectl drain` that would trigger a rollout restart of the blocking controllers (`ReplicaSet`) when the Eviction API returned a 429 response.
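As a sketch of that idea — the flag name below is my invention and does not exist in kubectl — the invocation might have looked like:

```console
# Hypothetical: on a 429 from the Eviction API, drain would restart the
# blocking controller and retry the eviction once the replacement pod is ready.
$ kubectl drain --ignore-daemonsets --restart-blocking-controllers kind-worker
```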
When I proposed this in the sig-cli fortnightly meeting, we concluded that the current drain behavior meets the existing semantics and that impossible drain situations are opportunities to educate application developers about the implications of those restrictive PDBs. For example, OpenShift’s Kubernetes Controller Manager Operator has alerts configured for those restrictive PDBs:
> Standard workloads should have at least one pod more than is desired to support API-initiated eviction. Workloads that are at the minimum disruption allowed level violate this and could block node drain. This is important for node maintenance and cluster upgrades.
Nonetheless, the group suggested writing this up (this blog post) and presenting it to the sig-apps and sig-api-machinery groups. A potential proposal could be to introduce new functionality to the Eviction API, alongside a new field in the `PodDisruptionBudget` spec that would trigger the update of the blocking controller. For example:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  rolloutRestartAllowed: true
  maxUnavailable: 0
  selector:
    matchLabels:
      app: nginx
```
Here, the application developer would be explicitly granting the cluster administrator permission to perform updates to their deployment or application.