artur-rodrigues.com

Kube Scheduler Metrics in Kind Clusters

2024-04-07T12:00:00+00:00

Context

While experimenting with kube-scheduler on a local Kind cluster, I was interested in its metrics. Unfortunately, they were not readily available. There were two issues:

Kind (and the underlying Kubeadm) default configuration binds the scheduler metrics server only to the loopback interface. Furthermore, it does not configure a Service for accessing the metrics.
RBAC is enabled by default, therefore we need to configure a ClusterRole that allows our workloads to access the control plane metrics.

Both issues can be observed by the fact that, out of the box, we are forced to port-forward to kube-scheduler in order to make HTTP requests, but even then, we still fail to fetch the metrics:

$ kubectl -n kube-system port-forward pod/kube-scheduler 10259:10259
Forwarding from 127.0.0.1:10259 -> 10259
Forwarding from [::1]:10259 -> 10259

$ curl -k https://localhost:10259/metrics
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

Solution

First, we need to set the --bind-address command line argument for kube-scheduler to 0.0.0.0. This can be done by creating a custom Kind config:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: my-cluster
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: ClusterConfiguration
        scheduler:
          extraArgs:
            bind-address: "0.0.0.0"
  - role: worker

We can launch a new Kind cluster with kind create cluster --config /path/to/kind-cluster.yaml. We can verify that it worked by checking the Pod spec for kube-scheduler:

$ kubectl -n kube-system get pod kube-scheduler-my-cluster-control-plane -o yaml | grep command -A5
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=0.0.0.0
    - --kubeconfig=/etc/kubernetes/scheduler.conf

Then we will need to configure a Service for the kube-scheduler metrics, as well as a ClusterRole and ClusterRoleBinding to access them.

To make use of the metrics in a productive manner, it is desirable to have an observability stack deployed in the cluster that automatically scrapes the metrics endpoint. Luckily, VictoriaMetrics has a handy Helm chart called victoria-metrics-k8s-stack that ships with both the Service and the RBAC configuration, as well as the scraping rules for the metrics and Grafana.

Since this is a test Kind cluster, we can opt for the vmsingle flavour - the default for the chart. After installing it, we can enter the vmagent pod and see what is available on the /metrics endpoint of kube-scheduler, using the service account credentials and passing the --insecure option (since the certificate bundle for Kind clusters is self-signed):

$ kubectl -n vm exec -it vmagent-vm-victoria-metrics-k8s-stack-68898f7ff5-npkwn -c vmagent -- sh
/ # curl -s -k --header "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://vm-victoria-metrics-k8s-stack-kube-scheduler.kube-system:10259/metrics |
grep 'queue_incoming_pods_total'
# HELP scheduler_queue_incoming_pods_total [STABLE] Number of pods added to scheduling queues by event and queue type.
# TYPE scheduler_queue_incoming_pods_total counter
scheduler_queue_incoming_pods_total{event="NodeTaintChange",queue="active"} 3
scheduler_queue_incoming_pods_total{event="PodAdd",queue="active"} 16
scheduler_queue_incoming_pods_total{event="ScheduleAttemptFailure",queue="unschedulable"} 3
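The same request can be made programmatically. Below is a minimal Go sketch of what the curl invocation does, assuming it runs inside a pod with the service account token mounted at the standard path. The function names (newMetricsRequest, fetchMetrics) are illustrative, and skipping certificate verification mirrors the -k flag, which is only reasonable on a throwaway cluster.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// Standard in-pod location of the service account token.
const tokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"

// newMetricsRequest builds a GET request carrying the token as a Bearer
// credential, like the --header argument passed to curl.
func newMetricsRequest(url, token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

// fetchMetrics sends the request and returns the response body.
// Certificate verification is skipped, mirroring curl's -k flag.
func fetchMetrics(url, token string) (string, error) {
	client := &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	req, err := newMetricsRequest(url, token)
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d: %s", resp.StatusCode, body)
	}
	return string(body), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: fetchmetrics <metrics-url>")
		return
	}
	raw, err := os.ReadFile(tokenPath)
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading token:", err)
		os.Exit(1)
	}
	out, err := fetchMetrics(os.Args[1], strings.TrimSpace(string(raw)))
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetching metrics:", err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

Without the Authorization header, the request is treated as system:anonymous and is rejected with the 403 shown earlier.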

Similarly, vmagent must be configured to skip certificate verification when scraping kube-scheduler while also overriding the server’s name. This can be done through Helm Values files. Here is my final overridden values file:

kubeScheduler:
  enabled: true
  endpoints: []
  service:
    enabled: true
    port: 10259
    targetPort: 10259
  spec:
    jobLabel: jobLabel
    endpoints:
      - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
        port: http-metrics
        scheme: https
        tlsConfig:
          caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecureSkipVerify: true
          serverName: "127.0.0.1"

With this configuration in place, we can verify that metrics are being scraped by vmagent:

$ kubectl -n vm port-forward svc/vmsingle-vm-victoria-metrics-k8s-stack 8429:8429
Forwarding from 127.0.0.1:8429 -> 8429
Forwarding from [::1]:8429 -> 8429

$ curl -s localhost:8429/prometheus/api/v1/query \
  -d 'query=scheduler_queue_incoming_pods_total' |\
  jq '.data.result[] | .metric.__name__, .metric.event, .value'
"scheduler_queue_incoming_pods_total"
"NodeTaintChange"
[
  1712516523,
  "3"
]
"scheduler_queue_incoming_pods_total"
"PodAdd"
[
  1712516523,
  "16"
]
"scheduler_queue_incoming_pods_total"
"ScheduleAttemptFailure"
[
  1712516523,
  "3"
]
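For anything beyond a quick check, the same response can be consumed from code. The following Go sketch parses the Prometheus-compatible query response shown above and extracts the same three fields as the jq filter. The type and function names are illustrative, and the sample body in main is a shortened copy of the output above rather than a live query.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queryResponse models the parts of an instant-query response we need.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
			// Value is [unix timestamp, "sample value as a string"].
			Value [2]interface{} `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

// sample is the flattened view printed by the jq filter.
type sample struct {
	Name, Event, Value string
}

// parseSamples extracts name, event label and value from a response body.
func parseSamples(body []byte) ([]sample, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("query failed with status %q", resp.Status)
	}
	out := make([]sample, 0, len(resp.Data.Result))
	for _, r := range resp.Data.Result {
		value, _ := r.Value[1].(string)
		out = append(out, sample{
			Name:  r.Metric["__name__"],
			Event: r.Metric["event"],
			Value: value,
		})
	}
	return out, nil
}

func main() {
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[
	  {"metric":{"__name__":"scheduler_queue_incoming_pods_total","event":"PodAdd","queue":"active"},
	   "value":[1712516523,"16"]}
	]}}`)
	samples, err := parseSamples(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range samples {
		fmt.Printf("%s{event=%q} %s\n", s.Name, s.Event, s.Value)
	}
}
```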

From there, we can start building dashboards for our experiments.

Cross-Cloud Access: A Native Kubernetes OIDC Approach

2024-03-19T12:00:00+00:00

Written in collaboration with Chloe Blain

Goal

Set up a minimal working example where pods in GKE and EKS can access two buckets, one in S3 and another in GCS, with no static credentials or runtime configuration. In other words, running the commands below should just work from within a pod in both clusters:

$ aws s3 ls s3://oidc-exp-s3-bucket

$ gcloud storage ls gs://oidc-exp-gcs-bucket

Context

Communicating with Cloud APIs without the use of static, long-lived credentials from within Kubernetes requires some work even when using the CSP’s managed Kubernetes versions. In AWS, this is done through EKS Pod Identities, or IAM Roles for Service Accounts (IRSA), while in GCP this is achieved through Workload Identity Federation (WIF) for GKE.

Both IRSA and Workload Identity Federation leverage OpenID Connect (OIDC) with Kubernetes configured as an Identity Provider on both clouds’ IAM to assume roles (AWS) and impersonate service accounts (GCP). The two processes are well documented online.

However, when a Kubernetes workload running in one CSP needs to access services in another CSP, the configuration might not be as straightforward, and there is potentially more than one way of achieving the desired result.

In particular, the path from GKE to AWS APIs is not as well documented when trying to use Kubernetes itself as the Identity Provider to AWS IAM. Both AWS’s recent blog post on the topic and the doitintl/gtoken project rely on Google (not Kubernetes) being the Identity Provider and the execution of some pre-steps or configuration of Mutating Webhooks to get workloads to “just work”.

However, it is possible to achieve keyless cross-cloud access using Kubernetes OIDC as the Identity Provider for both EKS and GKE. While there’s a good amount of pre-configuration involved, the result is very flexible and fully native. This post will demonstrate how to do so.

If you’re unfamiliar with the OIDC Authentication flow, here’s one way to think about it in simple terms:

The Cluster Administrator configures the Kubernetes Cluster and its access is secure and controlled.
Kubernetes provides ServiceAccounts with an identity token that has been signed with a private key.
The Cluster Administrator reviews the Pod and ServiceAccount creations and modifications or trusts others to do so.
CSPs can be configured to accept IAM requests that come from clusters, by verifying their signature and identity - this is possible because they’ve been previously configured with the public key of the cluster.
Pods use the ServiceAccount token to authenticate with the CSP IAM and exchange it for short-lived credentials that have access to other APIs.

Real World Example

A fully working IaC example is available on https://github.com/arturhoo/oidc-exp/ - below the main points are demonstrated. The examples start from the in-cloud access and then move to cross-cloud access.

For this exercise, two buckets will be created with a text file, one on each CSP.

resource "aws_s3_bucket" "s3_bucket" {
  bucket = var.s3_bucket
}

resource "aws_s3_object" "s3_object" {
  bucket  = aws_s3_bucket.s3_bucket.id
  key     = "test.txt"
  content = "Hello, from S3!"
}

resource "google_storage_bucket" "gcs_bucket" {
  name     = var.gcs_bucket
  location = var.gcp_region
}

resource "google_storage_bucket_object" "gcs_object" {
  bucket  = google_storage_bucket.gcs_bucket.name
  name    = "test.txt"
  content = "Hello, from GCS!"
}

EKS to AWS

To access AWS APIs from workloads in EKS there are primarily two options: EKS Pod Identities, or IAM Roles for Service Accounts (IRSA). Here the focus is on IRSA since the exercise focuses on Kubernetes OIDC.

All EKS clusters (including those with only private subnets and private endpoints) have a publicly available OIDC discovery endpoint. It allows other parties to verify the signature of JWTs that claim to have been signed by the cluster, using the public keys published at the URL listed under jwks_uri.

$ xh https://oidc.eks.eu-west-2.amazonaws.com/id/4E604436464FFCC52F8B96807F5BD5BC/.well-known/openid-configuration
{
    "issuer": "https://oidc.eks.eu-west-2.amazonaws.com/id/4E604436464FFCC52F8B96807F5BD5BC",
    "jwks_uri": "https://oidc.eks.eu-west-2.amazonaws.com/id/4E604436464FFCC52F8B96807F5BD5BC/keys",
    "authorization_endpoint": "urn:kubernetes:programmatic_authorization",
    "response_types_supported": [
        "id_token"
    ],
    "subject_types_supported": [
        "public"
    ],
    "claims_supported": [
        "sub",
        "iss"
    ],
    "id_token_signing_alg_values_supported": [
        "RS256"
    ]
}
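A verifier starts from exactly this document. The Go sketch below fetches the discovery document for an issuer, checks that the advertised issuer matches the one requested, and returns the jwks_uri. The type and function names are illustrative, and the issuer URL must be supplied on the command line.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// discoveryDocument holds the two fields a verifier needs first.
type discoveryDocument struct {
	Issuer  string `json:"issuer"`
	JWKSURI string `json:"jwks_uri"`
}

// discoveryURL appends the well-known path to the issuer URL.
func discoveryURL(issuer string) string {
	return strings.TrimSuffix(issuer, "/") + "/.well-known/openid-configuration"
}

// parseDiscovery decodes the document and rejects it when the advertised
// issuer differs from the one we asked for.
func parseDiscovery(body []byte, issuer string) (discoveryDocument, error) {
	var doc discoveryDocument
	if err := json.Unmarshal(body, &doc); err != nil {
		return doc, err
	}
	if doc.Issuer != issuer {
		return doc, fmt.Errorf("issuer mismatch: got %q, want %q", doc.Issuer, issuer)
	}
	if doc.JWKSURI == "" {
		return doc, fmt.Errorf("document has no jwks_uri")
	}
	return doc, nil
}

// fetchDiscovery downloads and validates the discovery document.
func fetchDiscovery(issuer string) (discoveryDocument, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(discoveryURL(issuer))
	if err != nil {
		return discoveryDocument{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return discoveryDocument{}, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return discoveryDocument{}, err
	}
	return parseDiscovery(body, issuer)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: discovery <issuer-url>")
		return
	}
	doc, err := fetchDiscovery(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("jwks_uri:", doc.JWKSURI)
}
```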

The first step is configuring the EKS cluster to be an Identity Provider in AWS IAM:

data "tls_certificate" "cert" {
  url = aws_eks_cluster.primary.identity[0].oidc[0].issuer
}

resource "aws_iam_openid_connect_provider" "oidc_provider" {
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.cert.certificates[0].sha1_fingerprint]
  url             = aws_eks_cluster.primary.identity[0].oidc[0].issuer
}

Then, a role that can read from S3 must be created. This role will have an AssumeRole policy that uses the previously configured EKS cluster as a federated identity provider. To make it more restrictive, we define a condition on the sub claim of the JWT token signed by the cluster to match the namespace and service account the workload itself will use.

resource "aws_iam_role" "federated_role" {
  name = "oidc_exp_federated_role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        "Effect" : "Allow",
        "Principal" : {
          "Federated" : aws_iam_openid_connect_provider.oidc_provider.arn
        },
        "Action" : "sts:AssumeRoleWithWebIdentity",
        "Condition" : {
          "StringEquals" : {
            "${local.eks_issuer}:aud" : "sts.amazonaws.com",
            "${local.eks_issuer}:sub" : "system:serviceaccount:default:oidc-exp-service-account"
          }
        }
      }
    ]
  })
}

resource "aws_iam_policy" "s3_read_policy" {
  name = "s3_read_policy"

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect = "Allow",
        Action = ["s3:GetObject", "s3:GetObjectVersion", "s3:ListBucket"],
        Resource = [
          "arn:aws:s3:::${var.s3_bucket}",
          "arn:aws:s3:::${var.s3_bucket}/*",
        ],
      },
    ],
  })
}

resource "aws_iam_role_policy_attachment" "s3_read_policy_attachment" {
  role       = aws_iam_role.federated_role.name
  policy_arn = aws_iam_policy.s3_read_policy.arn
}
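To make the trust condition concrete, here is a deliberately simplified Go model of the check expressed by the StringEquals block: the token must come from the expected issuer, carry the expected audience, and have the exact sub of the service account. It is an illustration of the logic only, not how AWS IAM evaluates policies internally, and all type and function names are made up.

```go
package main

import "fmt"

// tokenClaims holds the claims of interest from a service account token.
type tokenClaims struct {
	Issuer   string
	Audience []string
	Subject  string
}

// trustCondition mirrors the federated principal plus the aud and sub
// conditions of the assume-role policy.
type trustCondition struct {
	Issuer, Audience, Subject string
}

// serviceAccountSubject builds the sub claim Kubernetes uses for a
// service account.
func serviceAccountSubject(namespace, name string) string {
	return "system:serviceaccount:" + namespace + ":" + name
}

// allows reports whether the claims satisfy every part of the condition.
func (t trustCondition) allows(c tokenClaims) bool {
	if c.Issuer != t.Issuer || c.Subject != t.Subject {
		return false
	}
	for _, aud := range c.Audience {
		if aud == t.Audience {
			return true
		}
	}
	return false
}

func main() {
	cond := trustCondition{
		Issuer:   "https://issuer.example/id/EXAMPLE", // placeholder issuer
		Audience: "sts.amazonaws.com",
		Subject:  serviceAccountSubject("default", "oidc-exp-service-account"),
	}
	ok := tokenClaims{
		Issuer:   cond.Issuer,
		Audience: []string{"sts.amazonaws.com"},
		Subject:  "system:serviceaccount:default:oidc-exp-service-account",
	}
	other := ok
	other.Subject = serviceAccountSubject("default", "some-other-account")

	fmt.Println("expected service account allowed:", cond.allows(ok))
	fmt.Println("other service account allowed:", cond.allows(other))
}
```

Without the sub condition, any service account in the cluster could assume the role, which is why the restriction matters.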

Finally, on EKS, a service account with a specific annotation is needed:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: oidc-exp-service-account
  namespace: default
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::$AWS_ACCOUNT_ID:role/oidc_exp_federated_role

apiVersion: v1
kind: Pod
metadata:
  name: aws-cli
  namespace: default
spec:
  containers:
    - name: aws-cli
      image: amazon/aws-cli
      command:
        - /bin/bash
        - -c
        - "sleep 1800"
  serviceAccountName: oidc-exp-service-account

Behind the scenes, EKS uses a Mutating Webhook Controller to mount an OIDC token signed by the cluster into the pod through a volume projection and to set the AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN environment variables, which the AWS SDK in turn uses for auto-configuration. We can see the modifications made to the pod that weren’t originally present in the pod definition above:

$ kubectl --context aws get pod aws-cli -o yaml
apiVersion: v1
kind: Pod
metadata:
  name: aws-cli
  namespace: default
  ...
spec:
    ...
    env:
    - name: AWS_STS_REGIONAL_ENDPOINTS
      value: regional
    - name: AWS_DEFAULT_REGION
      value: eu-west-2
    - name: AWS_REGION
      value: eu-west-2
    - name: AWS_ROLE_ARN
      value: arn:aws:iam:::role/oidc_exp_federated_role
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    image: amazon/aws-cli
    ...
    name: aws-cli
    volumeMounts:
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
      ...
  ...
  volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token
    ...
...
status:
  ...
  phase: Running
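With those two variables set, the SDK reads the projected token and calls the STS AssumeRoleWithWebIdentity action named in the role's trust policy. The Go sketch below only assembles the parameters of that call from the environment and never sends anything. The helper names and the session name are assumptions, and a real workload should let the AWS SDK handle this.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"strings"
)

// stsEndpoint returns the regional STS endpoint for a region.
func stsEndpoint(region string) string {
	return "https://sts." + region + ".amazonaws.com/"
}

// assumeRoleParams builds the query parameters of an
// AssumeRoleWithWebIdentity call.
func assumeRoleParams(roleARN, sessionName, token string) url.Values {
	v := url.Values{}
	v.Set("Action", "AssumeRoleWithWebIdentity")
	v.Set("Version", "2011-06-15")
	v.Set("RoleArn", roleARN)
	v.Set("RoleSessionName", sessionName)
	v.Set("WebIdentityToken", token)
	return v
}

func main() {
	roleARN := os.Getenv("AWS_ROLE_ARN")
	tokenFile := os.Getenv("AWS_WEB_IDENTITY_TOKEN_FILE")
	region := os.Getenv("AWS_REGION")
	if roleARN == "" || tokenFile == "" || region == "" {
		fmt.Println("AWS_ROLE_ARN, AWS_WEB_IDENTITY_TOKEN_FILE and AWS_REGION must be set")
		return
	}
	token, err := os.ReadFile(tokenFile)
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading token:", err)
		os.Exit(1)
	}
	// "oidc-exp-session" is an arbitrary session name chosen for this sketch.
	params := assumeRoleParams(roleARN, "oidc-exp-session", strings.TrimSpace(string(token)))

	// Print the request shape without leaking the token itself.
	fmt.Println("endpoint:", stsEndpoint(region))
	fmt.Println("action:  ", params.Get("Action"))
	fmt.Println("role:    ", params.Get("RoleArn"))
}
```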

This allows reads from S3 without any other changes from within the pod:

$ kubectl --context aws exec -it aws-cli -- bash
bash-4.2# aws s3 ls s3://oidc-exp-s3-bucket
2024-03-17 18:29:42         15 test.txt

Success! 1/4 complete.

GKE to GCP

As previously mentioned, the golden path is through Workload Identity Federation (WIF) for GKE. When Workload Identity is enabled for a GKE cluster, an implicit Workload Identity Pool is created with the format PROJECT_ID.svc.id.goog, and the GKE Issuer URL is configured behind the scenes.

resource "google_container_cluster" "primary" {
  ...
  workload_identity_config {
    workload_pool = "${data.google_project.project.project_id}.svc.id.goog"
  }
}

We will also need a GCP IAM Service account with the correct permissions to read from the bucket:

resource "google_service_account" "default" {
  account_id   = "oidc-exp-service-account"
  display_name = "OIDC Exp Service Account"
}

resource "google_storage_bucket_iam_binding" "viewer" {
  bucket  = var.gcs_bucket
  role    = "roles/storage.objectViewer"
  members = ["serviceAccount:${google_service_account.default.email}"]
}

In Kubernetes, a Service Account must be created with a special annotation that will allow the GCP SDK to perform a multi-step process that intercepts calls to GCP APIs and exchanges a service account token generated on-demand by the cluster for a GCP access token, which is then used to access the APIs. For this reason, contrary to EKS, no service account volume projection takes place.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: oidc-exp-service-account
  namespace: default
  annotations:
    iam.gke.io/gcp-service-account: oidc-exp-service-account@$GCP_PROJECT_ID.iam.gserviceaccount.com

For the previously mentioned token exchange to take place, the GCP IAM Service Account must have the federated K8s service account configured to assume it:

resource "google_service_account_iam_binding" "service_account_iam_binding" {
  service_account_id = google_service_account.default.name
  role               = "roles/iam.workloadIdentityUser"
  members = [
    "serviceAccount:${var.gcp_project_id}.svc.id.goog[default/oidc-exp-service-account]",
  ]
}

The pod simply uses the service account:

apiVersion: v1
kind: Pod
metadata:
  name: gcloud-cli
  namespace: default
spec:
  containers:
    - name: gcloud-cli
      image: gcr.io/google.com/cloudsdktool/google-cloud-cli:alpine
      command:
        - /bin/bash
        - -c
        - "sleep 1800"
  serviceAccountName: oidc-exp-service-account

With the pod online, we can test our GCS access:

$ kubectl --context gke exec -it gcloud-cli -- bash
gcloud-cli:/# gcloud storage ls gs://oidc-exp-gcs-bucket
gs://oidc-exp-gcs-bucket/test.txt

Success! 2/4 complete.

GKE to AWS

Things become interesting now! As previously mentioned, most of the documentation available online is about Google, not Kubernetes (GKE), being the Identity Provider. However, the GKE cluster itself can be used as the Identity Provider, like how EKS was used in the EKS to AWS section.

The first step is to configure the GKE cluster as an Identity Provider in AWS IAM:

locals {
  gke_issuer_url = "container.googleapis.com/v1/projects/${var.gcp_project_id}/locations/${var.gcp_zone}/clusters/oidc-exp-cluster"
}

resource "aws_iam_openid_connect_provider" "trusted_gke_cluster" {
  url             = "https://${local.gke_issuer_url}"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["08745487e891c19e3078c1f2a07e452950ef36f6"]
}

Similarly to EKS, all GKE clusters also have a publicly available OIDC discovery endpoint:

$ xh https://container.googleapis.com/v1/projects/$GCP_PROJECT_ID/locations/$GCP_ZONE/clusters/oidc-exp-cluster/.well-known/openid-configuration

{
    "issuer": "https://container.googleapis.com/v1/projects/$GCP_PROJECT_ID/locations/$GCP_ZONE/clusters/oidc-exp-cluster",
    "jwks_uri": "https://container.googleapis.com/v1/projects/$GCP_PROJECT_ID/locations/$GCP_ZONE/clusters/oidc-exp-cluster/jwks",
    "response_types_supported": [
        "id_token"
    ],
    "subject_types_supported": [
        "public"
    ],
    "id_token_signing_alg_values_supported": [
        "RS256"
    ],
    "claims_supported": [
        "iss",
        "sub",
        "kubernetes.io"
    ],
    "grant_types": [
        "urn:kubernetes:grant_type:programmatic_authorization"
    ]
}

We will want to assume the same role that the pod in EKS assumed, therefore, we just need to update the AssumeRole policy to include the following statement:

{
  "Effect" : "Allow",
  "Principal" : {
    "Federated" : aws_iam_openid_connect_provider.trusted_gke_cluster.arn
  },
  "Action" : "sts:AssumeRoleWithWebIdentity",
  "Condition" : {
    "StringEquals" : {
      "${local.gke_issuer_url}:sub" : "system:serviceaccount:default:oidc-exp-service-account",
    }
  }
},

At this point, IAM has been configured and all that is left is to configure the Pod appropriately. While we could install the Mutating Webhook Controller that AWS uses, it is also trivial to set up the service account volume projection and define the environment variables that the AWS SDK expects for auto-configuration:

apiVersion: v1
kind: Pod
metadata:
  name: aws-cli
  namespace: default
spec:
  containers:
    - name: aws-cli
      image: amazon/aws-cli
      command:
        - /bin/bash
        - -c
        - "sleep 1800"
      volumeMounts:
        - mountPath: /var/run/secrets/tokens
          name: oidc-exp-service-account-token
      env:
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: "/var/run/secrets/tokens/oidc-exp-service-account-token"
        - name: AWS_ROLE_ARN
          value: "arn:aws:iam::$AWS_ACCOUNT_ID:role/oidc_exp_federated_role"
  serviceAccountName: oidc-exp-service-account
  volumes:
    - name: oidc-exp-service-account-token
      projected:
        sources:
          - serviceAccountToken:
              path: oidc-exp-service-account-token
              expirationSeconds: 86400
              audience: sts.amazonaws.com

Here’s a sample decoded JWT token that is mounted on the pod and sent to AWS IAM, which will verify the signature and claims previously configured:

{
  "aud": [
    "sts.amazonaws.com"
  ],
  "exp": 1710979065,
  "iat": 1710892665,
  "iss": "https://container.googleapis.com/v1/projects/$GCP_PROJECT_ID/locations/$GCP_ZONE/clusters/oidc-exp-cluster",
  "kubernetes.io": {
    "namespace": "default",
    "pod": {
      "name": "aws-cli",
      "uid": "bcf6d914-7ce5-4332-a417-510b3cbc144a"
    },
    "serviceaccount": {
      "name": "oidc-exp-service-account",
      "uid": "c56d2a4c-2622-41e1-8c7e-e3ab6eba39b5"
    }
  },
  "nbf": 1710892665,
  "sub": "system:serviceaccount:default:oidc-exp-service-account"
}

At this point the pod is ready to be launched and the S3 bucket can be listed without any further configuration:

$ kubectl --context gke exec -it aws-cli -- bash
bash-4.2# aws s3 ls s3://oidc-exp-s3-bucket
2024-03-17 18:29:42         15 test.txt

Success! 3/4 complete.

EKS to GCP

The final configuration is from EKS to GCP. While GKE clusters are configured as OIDC providers in the project-default Workload Identity Pool, we can’t add custom providers there. Therefore, we need to create a new pool:

locals {
  workload_identity_pool_id = "oidc-exp-workload-identity-pool"
}

resource "google_iam_workload_identity_pool" "pool" {
  workload_identity_pool_id = local.workload_identity_pool_id
}

Then, we need to add the EKS cluster as a provider. Note that we’re using the same OIDC issuer URL as we did in the EKS to AWS section.

resource "google_iam_workload_identity_pool_provider" "trusted_eks_cluster" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.pool.workload_identity_pool_id
  workload_identity_pool_provider_id = "trusted-eks-cluster"

  attribute_mapping = {
    "google.subject" = "assertion.sub"
  }

  oidc {
    issuer_uri = aws_eks_cluster.primary.identity[0].oidc[0].issuer
  }
}

Finally, we want the pods in EKS to be able to impersonate the GCP IAM Service Account we previously created for the GKE to GCP path. Therefore, we add a new member to the existing policy binding:

resource "google_service_account_iam_binding" "binding" {
  service_account_id = google_service_account.default.name
  role               = "roles/iam.workloadIdentityUser"

  members = [
    "principal://iam.googleapis.com/projects/${data.google_project.project.number}/locations/global/workloadIdentityPools/${local.workload_identity_pool_id}/subject/system:serviceaccount:default:oidc-exp-service-account",
    "serviceAccount:${var.gcp_project_id}.svc.id.goog[default/oidc-exp-service-account]",
  ]
}

Different from the GKE to GCP path, there’s no magic interception of requests. The Kubernetes crafted JWT token will be used to authenticate with the GCP APIs. Therefore, the pod must be configured to both mount the K8s Service Account token and set the CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE environment variable to a JSON file that informs the GCP SDK how to use it and what service account to impersonate. Normally, this JSON can be constructed using the gcloud iam workload-identity-pools create-cred-config command. However, since the structure is static, we can simply define it ahead of time as a ConfigMap:

apiVersion: v1
data:
  credential-configuration.json: |-
    {
      "type": "external_account",
      "audience": "//iam.googleapis.com/projects/$GCP_PROJECT_NUMBER/locations/global/workloadIdentityPools/oidc-exp-workload-identity-pool/providers/trusted-eks-cluster",
      "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
      "token_url": "https://sts.googleapis.com/v1/token",
      "credential_source": {
        "file": "/var/run/service-account/token",
        "format": {
          "type": "text"
        }
      },
      "service_account_impersonation_url": "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/oidc-exp-service-account@$GCP_PROJECT_ID.iam.gserviceaccount.com:generateAccessToken"
    }
kind: ConfigMap
metadata:
  name: oidc-exp-config-map
  namespace: default

And the Pod:

apiVersion: v1
kind: Pod
metadata:
  name: gcloud-cli
  namespace: default
spec:
  containers:
    - name: gcloud-cli
      image: gcr.io/google.com/cloudsdktool/google-cloud-cli:alpine
      command:
        - /bin/bash
        - -c
        - "sleep 1800"
      volumeMounts:
        - name: token
          mountPath: "/var/run/service-account"
          readOnly: true
        - name: workload-identity-credential-configuration
          mountPath: "/var/run/secrets/tokens/gcp-ksa"
          readOnly: true
      env:
        - name: CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE
          value: "/var/run/secrets/tokens/gcp-ksa/credential-configuration.json"
  serviceAccountName: oidc-exp-service-account
  volumes:
    - name: token
      projected:
        sources:
          - serviceAccountToken:
              audience: https://iam.googleapis.com/projects/$GCP_PROJECT_NUMBER/locations/global/workloadIdentityPools/oidc-exp-workload-identity-pool/providers/trusted-eks-cluster
              expirationSeconds: 3600
              path: token
    - name: workload-identity-credential-configuration
      configMap:
        name: oidc-exp-config-map

And without any further configuration, the pod can access the GCS bucket:

$ kubectl --context aws exec -it gcloud-cli -- bash
gcloud-cli:/# gcloud storage ls gs://oidc-exp-gcs-bucket
gs://oidc-exp-gcs-bucket/test.txt

Success! 4/4 complete. All of the scenarios have been successful!

Appendix 1 - GKE to GCP as a vanilla OIDC Provider

While the above example for GKE to GCP is the recommended way to access GCP resources from Kubernetes, after seeing how the EKS to GCP access is done, one is left wondering if we can bypass the magic interception of requests altogether! In fact, that is definitely possible and actually results in an implementation that is even more consistent across the two clouds.

The first step is to remove the workload_identity_config and workload_metadata_config configurations from the GKE Cluster and Node Pool configurations in Terraform. Then, a new google_iam_workload_identity_pool_provider resource for the GKE cluster must be created:

resource "google_iam_workload_identity_pool_provider" "trusted_gke_cluster" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.pool.workload_identity_pool_id
  workload_identity_pool_provider_id = "trusted-gke-cluster"

  attribute_mapping = {
    "google.subject" = "assertion.sub"
  }

  oidc {
    issuer_uri = "https://${local.gke_issuer_url}"
  }
}

Since we aren’t relying on GCP’s magic, we can also remove the GKE annotation from the K8s service account:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: oidc-exp-service-account
  namespace: default

Finally, the Pod spec for gcloud-cli becomes identical to the EKS one, which requires the creation of the ConfigMap.

Automating multi architecture workloads in Kubernetes

2024-02-19T08:00:00+00:00

When scheduling workloads, a vanilla Kubernetes installation is unaware of the compatibility of the container images that compose a Pod and the target node architecture. In the best case scenario, the workload is composed of container images that support multiple architectures (arm64 and amd64), allowing any node to be selected for housing that workload and letting the container runtime itself running in that node fetch the correct image.

However, there might be certain cases where a particular Pod contains container images built for a single architecture (for example, amd64) - in such cases, Kubernetes won’t prevent that workload from being scheduled on an arm64 node, unless the cluster administrator and/or the application owner have made extra configurations. What are those configurations?

The first possibility is for the cluster administrator to have decided to only run amd64 nodes. That is likely to be largely compatible with all open source tools and images, as amd64 remains the default architecture in the cloud, and most build pipelines will target it. In other words, simply by running only the amd64 architecture, a cluster administrator will probably never face issues around container image compatibility.

The other possibility is for the cluster administrator to put a taint on all arm64 nodes in the cluster. In fact, on GKE this is done by default:

By default, GKE schedules workloads only to x86-based nodes—Compute Engine machine series with Intel or AMD processors—by placing a taint (kubernetes.io/arch=arm64:NoSchedule) on all Arm nodes. This taint prevents x86-compatible workloads from being inadvertently scheduled to your Arm nodes

Outside of GKE, a cluster administrator might do the same to the node groups/pools that include ARM instances. In those cases, the pods will need a toleration to be able to run on such nodes.

Finally, application owners also have the possibility to use node selectors to ensure their workloads are only scheduled on amd64 nodes, by targeting the default label kubernetes.io/arch with value amd64.

Automating multi architecture clusters

As long as the cluster administrator taints nodes that are part of a node group that might include arm64 instances, we can automate the inclusion of tolerations through a Mutating Admission Controller. This controller will intercept all pod creation events, check the supported architectures for all containers specified in the spec, and include a toleration if all images are multiarch (arm64 and amd64 compatible):

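// DoesPodSupportArm64 reports whether every container image in the pod
// supports arm64.
func DoesPodSupportArm64(cache Cache, pod *corev1.Pod) bool {
	for _, container := range pod.Spec.Containers {
		if !DoesImageSupportArm64(cache, container.Image) {
			// A single unsupported image rules the whole pod out.
			return false
		}
	}
	return true
}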

A proof-of-concept can be found on arturhoo/k8smultiarcher. In particular, regclient/regclient is used to fetch all the supported platforms/architectures through the Manifests V2 API, which does not require downloading the full image. The project also incorporates a cache and a fail-open mechanism (in case of failures or timeout, no toleration is added).

One important implementation detail is that the toleration added is for a multi-architecture node group, not an arm64-only one. This is particularly important when running a cluster autoscaling strategy (e.g. Karpenter) that taps into several instance types and spot offerings, i.e. at certain times it might be cheaper to run amd64 nodes.

A full end-to-end example can be found on the test suite for the project.

Rate limiting Kubernetes pod creation with dynamic admission control

2023-10-22T08:00:00+00:00

Resource Quotas and Limit Ranges are common ways to limit the number of pods (or resources used by pods) in Kubernetes clusters. However, when using Jobs for big-data or machine-learning pipelines, it might be desirable to also consider the rate at which pods are created, especially if jobs are short-lived and there’s a concern that the control plane might be overwhelmed.

The first line of defence should be configuring the API server flags --max-requests-inflight and --max-mutating-requests-inflight, followed by configuring API Priority and Fairness, which allows fine-grained classes of requests to be deprioritised (and ultimately rate limited) relative to other requests. Finally, the alpha EventRateLimit admission controller can put a ceiling on the number of Event requests per second sent to the API server on a given namespace, for example.

Thinking about a final line of defence, I decided to explore implementing an admission webhook that would be configured (through a ValidatingWebhookConfiguration) to intercept all pod creation requests and enforce a rate limit.

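import (
	"net/http"
	"time"

	"github.com/gin-gonic/gin"
	"golang.org/x/time/rate"
	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// limiter allows one pod creation every 10 seconds, with a burst of one.
var limiter = rate.NewLimiter(rate.Every(10*time.Second), 1)

func validatingHandler(c *gin.Context) {
	var review admissionv1.AdmissionReview
	// Bind already responds with a 400 when the body cannot be parsed.
	if err := c.Bind(&review); err != nil {
		return
	}
	// Guard against reviews without a request, which would panic below.
	if review.Request == nil {
		c.AbortWithStatus(http.StatusBadRequest)
		return
	}

	allowed := limiter.Allow()
	var status, msg string
	if allowed {
		status = metav1.StatusSuccess
	} else {
		status = metav1.StatusFailure
		msg = "rate limit exceeded"
	}

	// The response must echo the UID of the request it answers.
	review.Response = &admissionv1.AdmissionResponse{
		UID:     review.Request.UID,
		Allowed: allowed,
		Result: &metav1.Status{
			Status:  status,
			Message: msg,
		},
	}
	c.JSON(http.StatusOK, review)
}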

Using golang.org/x/time/rate, we keep a limiter that allows one request every 10 seconds. If the request is allowed, we return StatusSuccess, otherwise we return a StatusFailure which will prevent the pod from being created.

The configuration itself defines a rule that narrows the scope to pod creation only, with a ‘fail open’ failure policy:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: k8slimiter-pod-creation
  annotations:
    cert-manager.io/inject-ca-from: k8slimiter/k8slimiter-certificate
webhooks:
  - name: k8slimiter-pod-creation.k8slimiter.svc
    admissionReviewVersions:
      - v1
    clientConfig:
      service:
        name: k8slimiter-service
        namespace: k8slimiter
        path: "/validate"
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    failurePolicy: Ignore
    sideEffects: None

With those in place, creating pods in quick succession leads to the expected rate limiting behaviour:

$ kubectl run "tmp-pod-$(date +%s)" --restart Never --image debian:12-slim -- sleep 1
pod/tmp-pod-1698005111 created
$ kubectl run "tmp-pod-$(date +%s)" --restart Never --image debian:12-slim -- sleep 1
Error from server: admission webhook "k8slimiter-pod-creation.k8slimiter.svc" denied the request: rate limit exceeded

A full working example can be found on arturhoo/k8slimiter, which leverages Gin and cert-manager to achieve a minimal and straightforward admission webhook setup.

Impossible Kubernetes node drains

2023-03-30T08:00:00+00:00

Context

Kubernetes cluster administrators sometimes perform maintenance operations on nodes, such as hardware swaps and kernel or Kubernetes version upgrades. Application developers, on the other hand, might be interested in ensuring that their applications remain available during those maintenance operations. This is usually achieved through Disruption Budgets - here’s what the official docs have to say about them:

[They] limit the number of concurrent disruptions that your application experiences, allowing for higher availability while permitting the cluster administrator to manage the clusters nodes.

Practical example

To demonstrate how they’re utilized, let’s start with a simple Deployment of two nginx replicas:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx

By default, Deployments follow the RollingUpdate strategy, which gradually replaces old pods with new ones. Through the maxSurge and maxUnavailable options, an application developer can control how those pods are replaced. In the example below, the number of available replicas should never be lower than the deployment size itself and at most one extra pod can be created:

...
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
...
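As a quick sanity check of what those two options imply, the sketch below computes the bounds a rolling update has to stay within. The rollout_bounds helper is hypothetical and only handles integer values, not percentages.

```ruby
# Bounds implied by a RollingUpdate strategy (integer values only).
def rollout_bounds(replicas:, max_surge:, max_unavailable:)
  {
    min_available: replicas - max_unavailable, # pods that must stay available
    max_total: replicas + max_surge            # pods that may exist at once
  }
end

BOUNDS = rollout_bounds(replicas: 2, max_surge: 1, max_unavailable: 0)
puts "available pods never drop below #{BOUNDS[:min_available]}"
puts "total pods never exceed #{BOUNDS[:max_total]}"
```

For the two-replica deployment above, this gives a floor of two available pods and a ceiling of three pods in total during a rollout.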

Finally, for the sake of the example, let’s ensure that two nginx pods never run on the same node through a podAntiAffinity rule:

...
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nginx
              topologyKey: kubernetes.io/hostname
...

At this point, the application developer can roll out new versions of their deployment safely. However, the availability of the application is still at risk of being ‘disrupted’ by maintenance operations that affect the pods. The cluster administrator can drain all nodes in the cluster, leaving all pods of the deployment in the pending state:

$ kubectl get deployment
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/2     2            0           110m
$ kubectl get pods
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-855f7d96b9-8d48g   0/1     Pending   0          3m22s
nginx-deployment-855f7d96b9-bgvzb   0/1     Pending   0          3m18s

This is where Disruption Budgets are useful. The application developer might decide to create the following PodDisruptionBudget, which has similar semantics to the RollingUpdate strategy of the Deployment:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: nginx

However, setting a Disruption Budget with maxUnavailable: 0 has important implications:

If you set maxUnavailable to 0% or 0, or you set minAvailable to 100% or the number of replicas, you are requiring zero voluntary evictions. When you set zero voluntary evictions for a workload object such as ReplicaSet, then you cannot successfully drain a Node running one of those Pods. If you try to drain a Node where an unevictable Pod is running, the drain never completes. This is permitted as per the semantics of PodDisruptionBudget.
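The arithmetic behind "zero voluntary evictions" can be sketched as follows. This is a simplified model of how a budget translates into a number of allowed disruptions, restricted to integer values (percentages are left out); the disruptions_allowed helper and its parameter names are hypothetical.

```ruby
# Number of voluntary evictions a budget would currently permit.
# Exactly one of max_unavailable / min_available is expected to be given.
def disruptions_allowed(expected_pods:, healthy_pods:, max_unavailable: nil, min_available: nil)
  desired_healthy =
    if max_unavailable
      expected_pods - max_unavailable
    else
      min_available
    end
  # Never negative: unhealthy pods simply leave no room for evictions.
  [healthy_pods - desired_healthy, 0].max
end

ZERO_BUDGET = disruptions_allowed(expected_pods: 2, healthy_pods: 2, max_unavailable: 0)
ONE_BUDGET = disruptions_allowed(expected_pods: 2, healthy_pods: 2, max_unavailable: 1)
puts "maxUnavailable: 0 -> #{ZERO_BUDGET} evictions allowed"
puts "maxUnavailable: 1 -> #{ONE_BUDGET} eviction allowed"
```

With two healthy replicas and maxUnavailable: 0, the result is always zero, which is why every eviction request for those pods is rejected.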

Impossible node drains

As suggested by the PDB docs, if we try to drain a node where a pod protected by a PDB with such characteristics exists, the drain will never complete. Take a cluster with four worker nodes as an example:

$ kubectl get nodes
NAME                 STATUS   ROLES           AGE   VERSION
kind-control-plane   Ready    control-plane   58s   v1.25.3
kind-worker          Ready    <none>          34s   v1.25.3
kind-worker2         Ready    <none>          34s   v1.25.3
kind-worker3         Ready    <none>          34s   v1.25.3
kind-worker4         Ready    <none>          33s   v1.25.3
$ kubectl get pods -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
nginx-deployment-bdccd6b79-84zhj   1/1     Running   0          41s   10.244.3.2   kind-worker    <none>           <none>
nginx-deployment-bdccd6b79-hqbgx   1/1     Running   0          41s   10.244.1.2   kind-worker3   <none>           <none>

If we try to drain one of the nodes, the process will never complete:

$ kubectl drain --ignore-daemonsets kind-worker
node/kind-worker cordoned
Warning: ignoring DaemonSet-managed Pods: kube-system/kindnet-dkp4b, kube-system/kube-proxy-c66b6
evicting pod default/nginx-deployment-bdccd6b79-84zhj
error when evicting pods/"nginx-deployment-bdccd6b79-84zhj" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/nginx-deployment-bdccd6b79-84zhj
error when evicting pods/"nginx-deployment-bdccd6b79-84zhj" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/nginx-deployment-bdccd6b79-84zhj
error when evicting pods/"nginx-deployment-bdccd6b79-84zhj" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/nginx-deployment-bdccd6b79-84zhj
...

A workaround

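However, in some situations, such as the one above, a cluster administrator can still drain the node by restarting the application that is blocking the drain. Pods replaced during a rollout are deleted by the Deployment's controller rather than through the Eviction API, so the PDB does not limit them, and because the node is already cordoned, the replacement pods are scheduled elsewhere: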

$ kubectl rollout restart deployment nginx-deployment
deployment.apps/nginx-deployment restarted

This creates a dilemma: Disruption Budgets with those characteristics are explicitly communicating their application’s “desire” to never be drained. On the other hand, the application’s strategy allows the application to be restarted while still respecting its RollingUpdate spec.

In practical terms, the cluster administrator can perform the necessary maintenance while respecting the availability characteristics of the application. But should cluster administrators restart applications in the cluster without the application developer's consent?

In some circumstances, for example companies with SRE teams maintaining production clusters and Product teams developing applications, such operations could be performed, as there might not be strict terms of service in place.

Possible solutions

I discussed this situation on the Kubernetes sig-cli Slack channel, and a few folks were receptive to the idea of giving Kubernetes users a way to automatically work around impossible drains.

The drain logic lives only in kubectl, which calls the Eviction API for each pod running on the node. My first idea was to introduce a new flag to kubectl drain that would trigger a rollout restart of the blocking controllers (Deployment, StatefulSet, ReplicaSet) when the Eviction API returned a 429 response.

When I proposed this in the sig-cli fortnightly meeting, we concluded that the drain behavior sufficiently meets the existing semantics and that impossible drain situations are opportunities to educate application developers about the implications of those restrictive PDBs. For example, OpenShift’s Kubernetes Controller Manager Operator has alerts configured for those restrictive PDBs:

Standard workloads should have at least one pod more than is desired to support API-initiated eviction. Workloads that are at the minimum disruption allowed level violate this and could block node drain. This is important for node maintenance and cluster upgrades.

Nonetheless, the group suggested writing this up (this blog post) and presenting it to the sig-apps and sig-api-machinery groups. A potential proposal could be to introduce new functionality to the Eviction API, alongside a new field to the PodDisruptionBudget spec that would trigger the update of the blocking controller. For example:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  rolloutRestartAllowed: true
  maxUnavailable: 0
  selector:
    matchLabels:
      app: nginx

Here, the application developer would be explicitly granting permission to the cluster administrator to perform updates to their deployment or applications.

Experiments with Kafka’s head-of-line blocking

2023-03-21T12:00:00+00:00

Context

Kafka is a distributed messaging system that excels in high-throughput architectures with many listeners. However, Kafka is also often used as a job queue solution and, in this context, its head-of-line blocking characteristics can lead to increased latency. Let’s build an experiment to explore it in practice.

Kafka Architecture

Messages are sent to topics in Kafka, and each topic has one or more partitions: a message's key is hashed to decide which partition it is assigned to. Multiple consumers can read from a topic by forming a Consumer Group, with each one being automatically assigned a subset of the partitions for a given topic.

No two consumers from the same Consumer Group can read from the same partition. Therefore, to avoid idle consumers, a topic must have at least as many partitions as there are consumers.
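The two mappings described above (key to partition, partition to consumer) can be sketched in a few lines of Ruby. This is only an illustration: CRC32 stands in for whatever hash the client's partitioner actually uses, the contiguous-range split is one possible assignment strategy, and the partition_for and assign helpers are hypothetical.

```ruby
require 'zlib'

# Map a message key to a partition. CRC32 is a stand-in hash for illustration.
def partition_for(key, partition_count)
  Zlib.crc32(key) % partition_count
end

# Split partitions into contiguous ranges, one range per consumer.
# Consumers beyond the partition count end up with an empty list (idle).
def assign(partition_count, consumers)
  base, extra = partition_count.divmod(consumers.size)
  next_partition = 0
  consumers.each_with_index.map do |consumer, idx|
    size = base + (idx < extra ? 1 : 0)
    owned = (next_partition...(next_partition + size)).to_a
    next_partition += size
    [consumer, owned]
  end.to_h
end

ASSIGNMENT = assign(10, %w[c0 c1 c2 c3 c4])
ASSIGNMENT.each { |consumer, owned| puts "#{consumer}: #{owned.inspect}" }
puts "key-42 -> partition #{partition_for('key-42', 10)}"
```

With 10 partitions and five consumers, every consumer owns two partitions; with more consumers than partitions, the surplus consumers receive an empty list and sit idle.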

At this point, head-of-line blocking might be starting to make sense. If Consumer 0 takes a long time to perform the work associated with a message (either because the work is expensive or because it is under resource pressure), all other pending messages in the partitions it is responsible for will remain pending.

Side note: where Kafka's message streaming capabilities really shine is when you have many subscribers. A new consumer group can be formed and process the same messages as the original group, at its own pace. At this point, it is no longer a worker queue in the traditional sense.

Beanstalkd Architecture

This is in contrast to other solutions like RabbitMQ or beanstalkd where, regardless of the number of consumers, pending jobs will be served to the first consumer that asks for one on a given queue.

Let’s take a look at beanstalkd, which I have introduced in a previous blog post:

With beanstalkd, jobs are sent to tubes. Consumers simply connect to the server and reserve jobs from a given tube. For a given beanstalkd server, jobs are given out in the same order they were enqueued.

Here, head-of-line blocking is no longer a concern, as jobs will continue to be served from the queue to available consumers even if a particular consumer is slow. Contrary to Kafka with multiple consumer groups, a job in a tube cannot be served to two consumers in the happy path. When a reservation times out, beanstalkd will requeue that job. These are traditional work queue primitives.

Experiment

In this experiment, each job represents a unit of work: a synchronous sleep. The sleep duration is determined by the producer that creates 100 jobs in total. Every job has a sleep value of 0, except for 4 of them which have a sleep value of 10s.

beanstalkd_tube = beanstalkd.tubes[BEANSTALKD_MAIN_TUBE]
100.times do |i|
  msg = (i % 25).zero? ? 10 : 0

  beanstalkd_tube.put(msg.to_s)

  kafka_producer.produce(
    topic: KAFKA_MAIN_TOPIC,
    payload: msg.to_s,
    key: "key-#{i}"
  )
end

If we only had a single consumer, the total time to complete all jobs would be at least 40s, as that consumer would sleep for 10s four times. If we had an unlimited number of consumers, the minimum total time would be 10s, as at least four consumers would have to sleep for 10s in parallel.

Back to the experiment, both Kafka and beanstalkd are set up, each with five consumers. The Kafka topic is configured with 10 partitions, therefore, each Kafka consumer is responsible for two partitions, in a single consumer group configuration. Below are the implementations for each consumer type:

consumer.subscribe(KAFKA_MAIN_TOPIC)
consumer.each do |msg|
  # The payload carries the number of seconds to sleep.
  duration = msg.payload.to_i
  log.info 'Going to sleep' if duration.positive?
  sleep(duration)
  # Signal completion to the watcher through the counter topic.
  producer.produce(
    topic: KAFKA_COUNTER_TOPIC,
    payload: 'dummy'
  )
end

main_tube = beanstalkd.tubes[BEANSTALKD_MAIN_TUBE]
counter_tube = beanstalkd.tubes[BEANSTALKD_COUNTER_TUBE]
loop do
  job = main_tube.reserve
  duration = job.body.to_i
  log.info 'Going to sleep' if duration.positive?
  sleep(duration)
  counter_tube.put('dummy')
  job.delete
end

After sleeping, consumers produce a dummy message to a different topic/tube, which is used by an out-of-band watcher process that keeps track of global progress. Each watcher process starts the clock when the first dummy message is received and stops it when the 100th message is received.
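The watcher logic is simple enough to sketch independently of Kafka or beanstalkd. The ProgressWatcher class below is hypothetical (the real watchers live in the experiment's repository); it takes timestamps as arguments, records one per dummy message, and reports the time between the first and the last expected message.

```ruby
# Tracks completions and measures the time between the first and the last one.
class ProgressWatcher
  attr_reader :completions, :elapsed

  def initialize(expected)
    @expected = expected
    @completions = [] # timestamp of each completed job, in arrival order
    @elapsed = nil
  end

  # Call once per dummy message; `timestamp` is in seconds.
  def record(timestamp)
    return if done?

    @completions << timestamp
    @elapsed = timestamp - @completions.first if @completions.size == @expected
  end

  def done?
    !@elapsed.nil?
  end
end

WATCHER = ProgressWatcher.new(100)
# Hypothetical arrival times: 96 messages immediately, 4 more after 10 seconds.
(Array.new(96, 0) + Array.new(4, 10)).each { |t| WATCHER.record(t) }
puts "took #{WATCHER.elapsed}s to complete"
```

Keeping every timestamp, rather than only the first and last, is what makes it possible to plot progress over time afterwards.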

To kickstart the experiment, we start both Kafka and beanstalkd, five consumers for each and the two watcher processes:

$ docker-compose up
queue-beanstalkd-watcher-1   | I, [2023-03-19T22:03:59] Started beanstalkd watcher
queue-beanstalkd-consumer-1  | I, [2023-03-19T22:04:00] Connected to beanstalkd
queue-beanstalkd-consumer-3  | I, [2023-03-19T22:04:01] Connected to beanstalkd
queue-beanstalkd-consumer-4  | I, [2023-03-19T22:04:01] Connected to beanstalkd
queue-beanstalkd-consumer-5  | I, [2023-03-19T22:04:02] Connected to beanstalkd
queue-beanstalkd-consumer-2  | I, [2023-03-19T22:04:02] Connected to beanstalkd
queue-kafka-define-topic-1   | I, [2023-03-19T22:04:11] Topics created!
queue-kafka-define-topic-1 exited with code 0
queue-kafka-watcher-1        | I, [2023-03-19T22:04:12] Started Kafka watcher
queue-kafka-consumer-2       | I, [2023-03-19T22:04:13] Subscribed to kafka topic
queue-kafka-consumer-1       | I, [2023-03-19T22:04:14] Subscribed to kafka topic
queue-kafka-consumer-4       | I, [2023-03-19T22:04:14] Subscribed to kafka topic
queue-kafka-consumer-5       | I, [2023-03-19T22:04:14] Subscribed to kafka topic
queue-kafka-consumer-3       | I, [2023-03-19T22:04:15] Subscribed to kafka topic

At this point, with no messages having been produced, we can inspect the topology of Kafka partitions and consumers:

$ kafka-consumer-groups.sh --describe --group main-group --bootstrap-server localhost:9092
GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
main-group      main            8          -               0               -               rdkafka-c12c408c-3da7-48b8-922e-17053059b828 /172.19.0.12    rdkafka
main-group      main            9          -               0               -               rdkafka-c12c408c-3da7-48b8-922e-17053059b828 /172.19.0.12    rdkafka
main-group      main            0          -               0               -               rdkafka-57fb04b5-4c10-4403-894c-587bb95a285e /172.19.0.15    rdkafka
main-group      main            1          -               0               -               rdkafka-57fb04b5-4c10-4403-894c-587bb95a285e /172.19.0.15    rdkafka
main-group      main            2          -               0               -               rdkafka-686169bc-eef9-498b-a7ca-a243c401f4bd /172.19.0.13    rdkafka
main-group      main            3          -               0               -               rdkafka-686169bc-eef9-498b-a7ca-a243c401f4bd /172.19.0.13    rdkafka
main-group      main            6          -               0               -               rdkafka-98349f3c-f097-450c-a1a1-82c3adef1fd3 /172.19.0.14    rdkafka
main-group      main            7          -               0               -               rdkafka-98349f3c-f097-450c-a1a1-82c3adef1fd3 /172.19.0.14    rdkafka
main-group      main            4          -               0               -               rdkafka-87de172e-6759-46d5-b788-e27e5fb52e02 /172.19.0.11    rdkafka
main-group      main            5          -               0               -               rdkafka-87de172e-6759-46d5-b788-e27e5fb52e02 /172.19.0.11    rdkafka
main-group      counter         0          -               0               -               rdkafka-b6c8a89e-cb22-4872-85c5-57cf5da68756 /172.19.0.10    rdkafka

As seen above, each consumer has been assigned two partitions, and all 10 are empty. Time to produce the 100 messages:

$ ruby producer.rb

And wait for the results:

queue-beanstalkd-consumer-1  | I, [2023-03-19T22:04:28] Going to sleep
queue-beanstalkd-watcher-1   | I, [2023-03-19T22:04:28] Started beanstalkd clock!
queue-beanstalkd-consumer-3  | I, [2023-03-19T22:04:28] Going to sleep
queue-kafka-consumer-1       | I, [2023-03-19T22:04:28] Going to sleep
queue-beanstalkd-consumer-5  | I, [2023-03-19T22:04:28] Going to sleep
queue-beanstalkd-consumer-4  | I, [2023-03-19T22:04:28] Going to sleep
queue-kafka-consumer-2       | I, [2023-03-19T22:04:28] Going to sleep
queue-kafka-consumer-5       | I, [2023-03-19T22:04:28] Going to sleep
queue-kafka-watcher-1        | I, [2023-03-19T22:04:28] Started Kafka clock!
queue-beanstalkd-watcher-1   | I, [2023-03-19T22:04:38] beanstalkd took 10s to complete!
queue-kafka-consumer-2       | I, [2023-03-19T22:04:38] Going to sleep
queue-kafka-watcher-1        | I, [2023-03-19T22:04:48] Kafka took 20s to complete!

The full experiment is available on github.com/arturhoo/kafka-experiment.

Results

From the watcher times above, we can clearly see a difference between the two setups: Kafka took double the amount of time to process all 100 messages. The head-of-line blocking behavior, however, has further implications. By capturing the timestamp at which each nth job is completed (as measured by the watcher), we can plot the global progress for both setups:

As seen above, the beanstalkd setup was able to process 96 out of the 100 messages in less than one second. The Kafka setup, however, had two long 10s periods during which no messages were processed - that is because one consumer (queue-kafka-consumer-2) was assigned two messages with a sleep duration of 10s.

This is in contrast with the beanstalkd setup, where four consumers slept in parallel while the fifth consumer (beanstalkd-consumer-2) was able to empty the queue, effectively working more than its peers.
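The difference between the two setups can be reproduced with a small model that needs neither Kafka nor beanstalkd. The sketch below is an idealised simulation, not the experiment itself: the job durations match the producer, but the placement of jobs onto Kafka consumers (SLOW_OWNER and OWNERS) is hypothetical and chosen so that one consumer receives two of the slow jobs, as happened in the run above.

```ruby
# Shared queue (beanstalkd-style): each job goes to whichever consumer frees up first.
def shared_queue_makespan(durations, consumer_count)
  free_at = Array.new(consumer_count, 0)
  durations.each do |duration|
    idx = free_at.index(free_at.min)
    free_at[idx] += duration
  end
  free_at.max
end

# Partitioned log (Kafka-style): a job can only be handled by the consumer
# that owns its partition, so each consumer works through its own backlog.
def partitioned_makespan(durations, owners, consumer_count)
  busy = Array.new(consumer_count, 0)
  durations.each_with_index { |duration, i| busy[owners[i]] += duration }
  busy.max
end

# Same shape as the producer: 100 jobs, four of which sleep for 10s.
DURATIONS = Array.new(100) { |i| (i % 25).zero? ? 10 : 0 }
# Hypothetical placement: slow jobs 0 and 25 both belong to consumer 0.
SLOW_OWNER = { 0 => 0, 25 => 0, 50 => 1, 75 => 2 }.freeze
OWNERS = Array.new(100) { |i| SLOW_OWNER.fetch(i, i % 5) }

SHARED_TOTAL = shared_queue_makespan(DURATIONS, 5)
PARTITIONED_TOTAL = partitioned_makespan(DURATIONS, OWNERS, 5)
puts "shared queue: #{SHARED_TOTAL}s, partitioned: #{PARTITIONED_TOTAL}s"
```

Under these assumptions the shared queue finishes in 10s and the partitioned setup in 20s. The same shared-queue function also gives the bounds mentioned earlier: 40s with a single consumer and 10s with an unlimited number of them.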

–

Thanks @javierhonduco for reviewing this post.

Reverse proxy with dynamic backend selection

2023-03-12T12:00:00+00:00

Context

Traditionally, reverse proxies are configured with a static set of rules which determines the correct upstream/backend. When put in front of a sharded architecture, they might route traffic to the appropriate backend based on a subdomain (e.g., us-east-1.example.com) or a path (e.g., example.com/europe-west-2).

This can be particularly common if you have the same application deployed in two different jurisdictions (data and control plane). Most times it is enough to have customers use the unambiguous URL for interacting with an application - in those cases a global reverse proxy (or API Gateway) might not even exist.

However, sometimes it might be desirable (or necessary) to have a unique hostname that serves all customers. For example, you might want POST requests to be sent to a short URL, using JSON Web Tokens for authorization. Or you might be creating a GitHub App that can only configure a single webhook URL to receive events.

In such situations, for every request, we need to look up the correct backend for that request based on its contents (headers, body, query parameters) before dispatching it. The static rules from traditional reverse proxies aren’t enough in this case.

Proposed Solution

This can be solved quite easily with Caddy. Here are the components in our proof of concept:

Two customers
- waitrose served by the European backend
- walmart served by the American backend
Redis for storing the mapping between customers and backends
Ruby and OpenSSL for generating a JWT
Caddy as a reverse proxy layer
Backend servers are simple Gin applications

First, we will populate our shard look-up table in Redis:

> SET walmart 'us-east-1:8080'
> SET waitrose 'europe-west-2:8080'

In this example, a request will be sent on behalf of customer waitrose. Since the customer information will be embedded in the JWT, we need a way to generate a token. First, we will generate asymmetric keys (symmetric would also have worked):

$ openssl genrsa -out cert/id_rsa 2048
$ openssl rsa -in cert/id_rsa -pubout > cert/id_rsa.pub

Next, we leverage Ruby’s conciseness to generate the JWT:

require 'openssl'
require 'jwt'

# Load the RSA private key generated above and sign the token with RS256.
priv = OpenSSL::PKey::RSA.new(File.read('cert/id_rsa'))
puts JWT.encode({customer: 'waitrose'}, priv, 'RS256')

Brilliant, we have everything we need to send a request. Next, we implement our own Caddy module that allows for the dynamic selection of a backend. Here’s a brief description of its behaviour:

Intercept the request
Decode the token in the Authorization header, which uses the Bearer scheme
Look up the correct shard from Redis
Save the shard information in a variable called shard.upstream - this variable will be exposed in the Caddyfile
Enrich the request with an extra header X-Customer (more on it later)

And the code:

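func (m JWTShardRouter) ServeHTTP(w http.ResponseWriter, r *http.Request, next caddyhttp.Handler) error {
    // Extract the token from "Authorization: Bearer <token>".
    authHeader := r.Header.Get("Authorization")
    tokenStr := strings.TrimPrefix(authHeader, "Bearer ")

    // ParseJWT is the module's own helper and returns the token claims.
    claims := ParseJWT(tokenStr)
    customer, ok := claims["customer"].(string)
    if !ok || customer == "" {
        return caddyhttp.Error(http.StatusUnauthorized, errors.New("missing customer claim"))
    }
    r.Header.Set("X-Customer", customer)

    // Look up the customer's shard; rdb is the module's Redis client.
    shard, err := rdb.Get(r.Context(), customer).Result()
    if err != nil {
        return caddyhttp.Error(http.StatusBadGateway, err)
    }
    caddyhttp.SetVar(r.Context(), "shard.upstream", shard)

    return next.ServeHTTP(w, r)
}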
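The handler relies on a ParseJWT helper that is not shown here. To illustrate the decode-and-look-up steps on their own, here is a sketch in Ruby using only the standard library. It is not the module's implementation: the token is built and verified by hand instead of using the jwt gem, a freshly generated key replaces the one in cert/, the SHARDS hash stands in for Redis, and every helper name is hypothetical.

```ruby
require 'json'
require 'openssl'

# URL-safe base64 without padding, as used by JWTs.
def b64url_encode(bytes)
  [bytes].pack('m0').tr('+/', '-_').delete('=')
end

def b64url_decode(text)
  padded = text + '=' * ((4 - text.length % 4) % 4)
  padded.tr('-_', '+/').unpack1('m')
end

# Build an RS256 token by hand: header.payload.signature
def encode_token(claims, private_key)
  parts = [{ alg: 'RS256', typ: 'JWT' }, claims]
  signing_input = parts.map { |part| b64url_encode(JSON.generate(part)) }.join('.')
  signature = private_key.sign(OpenSSL::Digest.new('SHA256'), signing_input)
  "#{signing_input}.#{b64url_encode(signature)}"
end

# Verify the signature and return the claims, or nil if verification fails.
def verified_claims(token, public_key)
  header, payload, signature = token.split('.')
  return nil unless header && payload && signature

  digest = OpenSSL::Digest.new('SHA256')
  valid = public_key.verify(digest, b64url_decode(signature), "#{header}.#{payload}")
  valid ? JSON.parse(b64url_decode(payload)) : nil
end

# Hash standing in for the Redis look-up table.
SHARDS = { 'walmart' => 'us-east-1:8080', 'waitrose' => 'europe-west-2:8080' }.freeze

# Returns the upstream for the customer named in the token, or nil.
def upstream_for(authorization_header, public_key)
  token = authorization_header.to_s.delete_prefix('Bearer ')
  claims = verified_claims(token, public_key)
  claims && SHARDS[claims['customer']]
end

KEY = OpenSSL::PKey::RSA.generate(2048)
TOKEN = encode_token({ customer: 'waitrose' }, KEY)
puts upstream_for("Bearer #{TOKEN}", KEY.public_key)
```

Verifying the signature before trusting the customer claim matters here, since the claim decides which backend receives the request.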

Finally, we use the registered shard.upstream variable in our Caddyfile:

{
    order jwt_shard_router before method
}

http://localhost:5000 {
    jwt_shard_router
    reverse_proxy {
        to {http.vars.shard.upstream}
    }
}

Only the backend server is left now. Since this is just a proof of concept, it doesn’t do much. It replies to requests coming to / and leverages the fact that Caddy has already decoded the customer from the JWT and put that information in the X-Customer header. Knowing the customer, it greets them in the response while including the shard name (provided through an environment variable) in the X-Shard header. This response from the backend server demonstrates that the process works end-to-end.

func main() {
    r := gin.Default()
    r.GET("/", func(c *gin.Context) {
        customer := c.Request.Header.Get("X-Customer")
        c.Header("X-Shard", os.Getenv("SHARD"))
        c.JSON(http.StatusOK, gin.H{
            "message": fmt.Sprintf("Hello %s!", customer),
        })
    })

    r.Run()
}

Time to test our POC. We spin up our patched Caddy server, Redis and the two backend servers:

$ docker-compose up
...
$ docker-compose ps
SERVICE             COMMAND                     PORTS
caddy               "/caddy run"                0.0.0.0:5000->5000/tcp
europe-west-2       "/upstream"
redis               "docker-entrypoint.s…"      6379/tcp
us-east-1           "/upstream"

And issue the request:

$ http localhost:5000 -A bearer -a $WAITROSE_TOKEN
HTTP/1.1 200 OK
Content-Length: 29
Content-Type: application/json; charset=utf-8
Date: Sun, 12 Mar 2023 12:00:00 GMT
Server: Caddy
X-Shard: europe-west-2

{
    "message": "Hello waitrose!"
}

Success! A full example is available on github.com/arturhoo/caddyshardrouter.

Why Caddy and Alternatives

I’ve chosen Caddy as it has been on my radar for a while for its focus on developer experience - as seen above, the dynamic selection of upstream servers was made possible in less than 80 lines of code. It has also had the opportunity to mature with the v2 rewrite.

Being written in Go allows us to generate a self-contained binary that can easily be placed in a distroless image. To further exemplify Caddy’s focus on devx, the xcaddy utility allows us to build a patched Caddy server with our module through a single command.

Here are some potential alternatives:

OpenResty: powered by Nginx, allows custom Lua modules to be written.
HAProxy: offers HAProxy Maps which coupled with the possibility of extending it with Lua might offer a compelling alternative.
Kong: takes OpenResty one step further by facilitating the development of new Lua plugins. Is considered an API Gateway.
Apache APISIX: also an API Gateway written in Lua. However, plugins can be written in Go and Python.
Envoy Proxy: proxy powering Istio. Allows for dynamic configuration with custom control planes.

Sausage Risotto with Caramelised Onion

2020-05-16T20:00:00+00:00

Serves two.

Ingredients

120g carnaroli or arborio rice
250g seasoned sausage
1 large onion
100ml white wine
2 garlic cloves
spring onion
parmesan cheese to taste

Method

Cooking time: 50 minutes.

Instructions:

Fry the sausages in the pan until well browned on all sides - they need to fry thoroughly, leaving a dark brown layer on the bottom of the pan. Set aside.
Cut the onion in half lengthwise, then slice each half crosswise into half-rings.
Boil one litre of water. Set aside.
Without washing the pan or removing the fat left by the sausages, add the onions and a good pinch of salt. Medium heat - once they start to fry and wilt, add enough water to cover the bottom and, using a spoon, scrape the bottom of the pan to release the residue left from frying the sausages (deglazing). Cover.
After two or three minutes the liquid will have reduced completely and the onions will be forming residue on the bottom of the pan. Add a little water, scrape, cover. Repeat for about 25 minutes, until the onions are caramelised. Set aside.
Wash the pan. Add olive oil and the garlic, thinly sliced. Medium heat. If you like, add dried chilli flakes (calabresa, for example).
Before the garlic browns, add the rice. Toast for one minute.
Add the white wine and stir until evenly combined.
Before the wine has evaporated completely, add 75ml of water and stir. Repeat this process until the rice starts to cook through but is not yet al dente.
Chop the sausage into small 1cm pieces and add to the risotto. Add a little more water and stir until it is al dente.
Add the caramelised onion, parmesan cheese and chopped spring onion. Stir one last time until it reaches the desired consistency. For a little more richness, add a spoonful of butter.

Plate with grated parmesan cheese and spring onion.

Dinner at OCD

2019-06-09T10:00:00+00:00

Close to the buzzy streets of Jaffa is OCD, a single room where chef Raz Rahav and his team meticulously prepare the seasonally changing tasting menu for an audience of 20 guests. Guests sit around a bar-style counter, in a concept similar to London’s Kitchen Table.

We were the first guests to arrive and were seated in the leftmost corner of the bar - this allowed us to be very close to the action and be served by Raz himself for the main courses. We were immediately served a refreshing gin and tonic granita with peanuts. We opted for the Yarden Blanc de Blancs, which I hadn’t tried yet on my trip. It paired well with the snacks that were served while the remainder of the guests were still arriving: beef tartare, carrot crisps, asparagus tartlet, chickpea panisse and mini sandwiches made from dehydrated carrot juice.

Service was outstanding throughout the evening, with every member of staff greeting us at some point and food restrictions being acknowledged when needed. The restaurant also serves Acqua Panna mineral still and sparkling free of charge - a big plus in my books.

Moving on, we had yellowtail sashimi served with raspberry, tomato and edible flower. The texture and depth of flavor of the fish were impressive and my friend described it as the best raw fish he had ever eaten.

Following the fish were the outstanding Parker House rolls served with whipped tomato cream - I tried saving some for the courses to come, but they were far too good. Over the next hour, we observed the rigorous precision, from presentation to technique and timing, with which Raz and his team prepared the next three main courses. A Chateau Golan red seemed appropriate.

After the artichoke gnudi and the grilled grouper with asparagus, the main stars of the night - local ducks - left the oven and were carved up into two dishes: the magret, rare with a crispy skin, sliced and served with walnuts; and the confit prepared as a rillette. Both were fantastic, but the subtlety of the rillette might have been the highlight of the night.

In the interlude before the dessert courses, aged goat cheese was served. The restaurant uses local ingredients only - which reminded me of Domestic in Aarhus.

For the sweet courses, those local ingredients were combined into one of the best dessert courses I’ve had in a restaurant: artichoke, lemon and cardamom. They were followed by a cloud-dense fennel parfait.

As my friend and I recollected the past two and a half hours, we acknowledged the perfection of both timing and quantity in the restaurant’s tasting menu. OCD’s simple and modern decor extends to the toilets, which are equipped with locally produced amenities.

Sunday Lunch at Hide

2019-05-26T12:00:00+00:00

A few steps from Green Park Station is HIDE, a restaurant born from a collaboration between Dabbous and Hedonism Wines.

We sat at a corner table facing peaceful Green Park. We opted for the £48 set lunch menu, starting with finger-food style vegetables, a strawberry gazpacho, bread and cold meats.

For starters, Laura went with the beef tartare and I opted for gouda custard with wild garlic - both were gentle on the palate.

We continued with the cod and the chicken with spätzle - they were served in a nice covered bowl that complemented the restaurant décor.

Service was friendly, attentive and not overbearing, which is nice for a Sunday afternoon. The oak staircase connects the three floors of the restaurant. For the last course, both of us went for the almond and apricot soufflé with osmanthus ice cream.

The toilets have Le Labo amenities and the brand’s signature Santal scent infused the room.

For petits fours, jasmine marshmallows and pastéis de nata - both lovely.