Set up a minimal working example where pods in GKE and EKS can access two buckets, one in S3 and another in GCS, with no static credentials or runtime configuration. In other words, the commands below should just work from within a pod in both clusters:
$ aws s3 ls s3://oidc-exp-s3-bucket
$ gcloud storage ls gs://oidc-exp-gcs-bucket
Communicating with Cloud APIs without the use of static, long-lived credentials from within Kubernetes requires some work even when using the CSP’s managed Kubernetes versions. In AWS, this is done through EKS Pod Identities, or IAM Roles for Service Accounts (IRSA), while in GCP this is achieved through Workload Identity Federation (WIF) for GKE.
Both IRSA and Workload Identity Federation leverage OpenID Connect (OIDC), with the Kubernetes cluster configured as an Identity Provider in each cloud's IAM, allowing workloads to assume roles (AWS) and impersonate service accounts (GCP). The two processes are well documented online.
However, when a Kubernetes workload running in one CSP needs to access services in another CSP, the configuration is not as straightforward, and there is potentially more than one way of achieving the desired result.
In particular, the path from GKE to AWS APIs is not as well documented when trying to use Kubernetes itself as the Identity Provider to AWS IAM. Both AWS’s recent blog post on the topic and the doitintl/gtoken project rely on Google (not Kubernetes) being the Identity Provider and the execution of some pre-steps or configuration of Mutating Webhooks to get workloads to “just work”.
However, it is possible to achieve keyless cross-cloud access using Kubernetes OIDC as the Identity Provider for both EKS and GKE. While there’s a good amount of pre-configuration involved, the result is very flexible and fully native. This post will demonstrate how to do so.
If you’re unfamiliar with the OIDC authentication flow, here’s one way to think about it in simple terms: the workload presents a short-lived JWT signed by its own cluster to the other party’s token service; that service fetches the cluster’s public signing keys from the publicly available OIDC discovery endpoint, verifies the token’s signature and claims, and exchanges it for temporary credentials.
A fully working IaC example is available at https://github.com/arturhoo/oidc-exp/ - the main points are demonstrated below, starting with in-cloud access and then moving to cross-cloud access.
For this exercise, two buckets will be created, one on each CSP, each containing a text file.
resource "aws_s3_bucket" "s3_bucket" {
bucket = var.s3_bucket
}
resource "aws_s3_object" "s3_object" {
bucket = aws_s3_bucket.s3_bucket.id
key = "test.txt"
content = "Hello, from S3!"
}
resource "google_storage_bucket" "gcs_bucket" {
name = var.gcs_bucket
location = var.gcp_region
}
resource "google_storage_bucket_object" "gcs_object" {
bucket = google_storage_bucket.gcs_bucket.name
name = "test.txt"
content = "Hello, from GCS!"
}
To access AWS APIs from workloads in EKS there are primarily two options: EKS Pod Identities, or IAM Roles for Service Accounts (IRSA). Here the focus is on IRSA, since this exercise is built around Kubernetes OIDC.
All EKS clusters (including those with only private subnets and private endpoints) have a publicly available OIDC discovery endpoint, which allows other parties to verify the signature of JWT tokens allegedly signed by the cluster (the signing keys are exposed at the URL listed under jwks_uri).
$ xh https://oidc.eks.eu-west-2.amazonaws.com/id/4E604436464FFCC52F8B96807F5BD5BC/.well-known/openid-configuration
{
  "issuer": "https://oidc.eks.eu-west-2.amazonaws.com/id/4E604436464FFCC52F8B96807F5BD5BC",
  "jwks_uri": "https://oidc.eks.eu-west-2.amazonaws.com/id/4E604436464FFCC52F8B96807F5BD5BC/keys",
  "authorization_endpoint": "urn:kubernetes:programmatic_authorization",
  "response_types_supported": [
    "id_token"
  ],
  "subject_types_supported": [
    "public"
  ],
  "claims_supported": [
    "sub",
    "iss"
  ],
  "id_token_signing_alg_values_supported": [
    "RS256"
  ]
}
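To see what a relying party such as AWS IAM consumes from that endpoint, here's a minimal Go sketch (just a sanity check, using the illustrative cluster ID above) that fetches the discovery document and then the JWKS it points to:

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "net/http"
)

// discovery mirrors the fields of the .well-known/openid-configuration document we care about.
type discovery struct {
    Issuer  string `json:"issuer"`
    JWKSURI string `json:"jwks_uri"`
}

func main() {
    issuer := "https://oidc.eks.eu-west-2.amazonaws.com/id/4E604436464FFCC52F8B96807F5BD5BC"

    // Fetch the OIDC discovery document published by the EKS cluster.
    resp, err := http.Get(issuer + "/.well-known/openid-configuration")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var d discovery
    if err := json.NewDecoder(resp.Body).Decode(&d); err != nil {
        panic(err)
    }

    // Fetch the public signing keys; these are what allow AWS STS to verify
    // tokens signed by the cluster without any shared secret.
    keys, err := http.Get(d.JWKSURI)
    if err != nil {
        panic(err)
    }
    defer keys.Body.Close()

    body, _ := io.ReadAll(keys.Body)
    fmt.Println(d.Issuer)
    fmt.Println(string(body))
}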
The first step is configuring the EKS cluster to be an Identity Provider in AWS IAM:
data "tls_certificate" "cert" {
url = aws_eks_cluster.primary.identity[0].oidc[0].issuer
}
resource "aws_iam_openid_connect_provider" "oidc_provider" {
client_id_list = ["sts.amazonaws.com"]
thumbprint_list = [data.tls_certificate.cert.certificates[0].sha1_fingerprint]
url = aws_eks_cluster.primary.identity[0].oidc[0].issuer
}
Then, a role that can read from S3 must be created. This role will have an AssumeRole policy that uses the previously configured EKS cluster as a federated identity provider. To make it more restrictive, we define a condition on the sub claim of the JWT token signed by the cluster to match the namespace and service account the workload itself will use.
resource "aws_iam_role" "federated_role" {
name = "oidc_exp_federated_role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
"Effect" : "Allow",
"Principal" : {
"Federated" : aws_iam_openid_connect_provider.oidc_provider.arn
},
"Action" : "sts:AssumeRoleWithWebIdentity",
"Condition" : {
"StringEquals" : {
"${local.eks_issuer}:aud" : "sts.amazonaws.com",
"${local.eks_issuer}:sub" : "system:serviceaccount:default:oidc-exp-service-account"
}
}
}
]
})
}
resource "aws_iam_policy" "s3_read_policy" {
name = "s3_read_policy"
policy = jsonencode({
Version = "2012-10-17",
Statement = [
{
Effect = "Allow",
Action = ["s3:GetObject", "s3:GetObjectVersion", "s3:ListBucket"],
Resource = [
"arn:aws:s3:::${var.s3_bucket}",
"arn:aws:s3:::${var.s3_bucket}/*",
],
},
],
})
}
resource "aws_iam_role_policy_attachment" "s3_read_policy_attachment" {
role = aws_iam_role.federated_role.name
policy_arn = aws_iam_policy.s3_read_policy.arn
}
Finally, on EKS, a service account with a specific annotation is needed:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: oidc-exp-service-account
  namespace: default
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::$AWS_ACCOUNT_ID:role/oidc_exp_federated_role
apiVersion: v1
kind: Pod
metadata:
  name: aws-cli
  namespace: default
spec:
  containers:
    - name: aws-cli
      image: amazon/aws-cli
      command:
        - /bin/bash
        - -c
        - "sleep 1800"
  serviceAccountName: oidc-exp-service-account
Behind the scenes, EKS uses a Mutating Webhook Controller to mount an OIDC token signed by the cluster into the pod through a volume projection, and to set the AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN environment variables, which in turn are picked up by the AWS SDK's auto-configuration. We can see the modifications made to the pod that weren't originally present in the pod definition above:
$ kubectl --context aws get pod aws-cli -o yaml
apiVersion: v1
kind: Pod
metadata:
  name: aws-cli
  namespace: default
  ...
spec:
  ...
    env:
    - name: AWS_STS_REGIONAL_ENDPOINTS
      value: regional
    - name: AWS_DEFAULT_REGION
      value: eu-west-2
    - name: AWS_REGION
      value: eu-west-2
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::<REDACTED>:role/oidc_exp_federated_role
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    image: amazon/aws-cli
    ...
    name: aws-cli
    volumeMounts:
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
    ...
  ...
  volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token
  ...
...
status:
  ...
  phase: Running
This allows reads from S3 without any other changes from within the pod:
$ kubectl --context aws exec -it aws-cli -- bash
bash-4.2# aws s3 ls s3://oidc-exp-s3-bucket
2024-03-17 18:29:42 15 test.txt
Success! 1/4 complete.
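The same auto-configuration applies to SDK-based workloads, not just the CLI: the default credential chain picks up AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE on its own. A minimal Go sketch (assuming aws-sdk-go-v2, with the bucket name from above):

package main

import (
    "context"
    "fmt"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
    ctx := context.Background()

    // LoadDefaultConfig walks the default credential chain; inside the pod it finds
    // AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE and performs the STS exchange for us.
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        panic(err)
    }

    out, err := s3.NewFromConfig(cfg).ListObjectsV2(ctx, &s3.ListObjectsV2Input{
        Bucket: aws.String("oidc-exp-s3-bucket"),
    })
    if err != nil {
        panic(err)
    }

    for _, obj := range out.Contents {
        fmt.Println(*obj.Key)
    }
}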
As previously mentioned, the golden path is through Workload Identity Federation (WIF) for GKE. When Workload Identity is enabled for a GKE cluster, an implicit Workload Identity Pool is created with the format PROJECT_ID.svc.id.goog, and the GKE issuer URL is configured behind the scenes.
resource "google_container_cluster" "primary" {
...
workload_identity_config {
workload_pool = "${data.google_project.project.project_id}.svc.id.goog"
}
}
We will also need a GCP IAM Service account with the correct permissions to read from the bucket:
resource "google_service_account" "default" {
account_id = "oidc-exp-service-account"
display_name = "OIDC Exp Service Account"
}
resource "google_storage_bucket_iam_binding" "viewer" {
bucket = var.gcs_bucket
role = "roles/storage.objectViewer"
members = ["serviceAccount:${google_service_account.default.email}"]
}
In Kubernetes, a Service Account must be created with a special annotation that allows the GCP SDK to perform a multi-step process: calls to GCP APIs are intercepted, and a service account token generated on demand by the cluster is exchanged for a GCP access token, which is then used to access the APIs. For this reason, contrary to EKS, no service account volume projection takes place.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: oidc-exp-service-account
  namespace: default
  annotations:
    iam.gke.io/gcp-service-account: oidc-exp-service-account@$GCP_PROJECT_ID.iam.gserviceaccount.com
For the previously mentioned token exchange to take place, the GCP IAM Service Account must allow the federated K8s service account to impersonate it:
resource "google_service_account_iam_binding" "service_account_iam_binding" {
service_account_id = google_service_account.default.name
role = "roles/iam.workloadIdentityUser"
members = [
"serviceAccount:${var.gcp_project_id}.svc.id.goog[default/oidc-exp-service-account]",
]
}
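To make the "interception" a little less magical: with Workload Identity enabled, a GKE metadata server runs on every node and answers the usual GCE metadata endpoint from inside the pod, returning an access token for the impersonated IAM Service Account. A rough Go sketch of what the SDK effectively does under the hood:

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Same endpoint a GCE VM would use; on GKE with Workload Identity it is served
    // by the gke-metadata-server and scoped to the pod's Kubernetes service account.
    url := "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"

    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        panic(err)
    }
    req.Header.Set("Metadata-Flavor", "Google")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // The response is a JSON document with access_token, expires_in and token_type,
    // minted for the annotated IAM Service Account (oidc-exp-service-account@...).
    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body))
}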
The pod simply uses the service account:
apiVersion: v1
kind: Pod
metadata:
  name: gcloud-cli
  namespace: default
spec:
  containers:
    - name: gcloud-cli
      image: gcr.io/google.com/cloudsdktool/google-cloud-cli:alpine
      command:
        - /bin/bash
        - -c
        - "sleep 1800"
  serviceAccountName: oidc-exp-service-account
With the pod online, we can test our GCS access:
$ kubectl --context gke exec -it gcloud-cli -- bash
gcloud-cli:/# gcloud storage ls gs://oidc-exp-gcs-bucket
gs://oidc-exp-gcs-bucket/test.txt
Success! 2/4 complete.
Things become interesting now! As previously mentioned, most of the documentation available online is about Google, not Kubernetes (GKE), being the Identity Provider. However, the GKE cluster itself can be used as the Identity Provider, just as EKS was used in the EKS to AWS section.
The first step is to configure the GKE cluster as an Identity Provider in AWS IAM:
locals {
  gke_issuer_url = "container.googleapis.com/v1/projects/${var.gcp_project_id}/locations/${var.gcp_zone}/clusters/oidc-exp-cluster"
}

resource "aws_iam_openid_connect_provider" "trusted_gke_cluster" {
  url             = "https://${local.gke_issuer_url}"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["08745487e891c19e3078c1f2a07e452950ef36f6"]
}
Similar to EKS, all GKE clusters also have a publicly available OIDC discovery endpoint:
$ xh https://container.googleapis.com/v1/projects/$GCP_PROJECT_ID/locations/$GCP_ZONE/clusters/oidc-exp-cluster/.well-known/openid-configuration
{
  "issuer": "https://container.googleapis.com/v1/projects/$GCP_PROJECT_ID/locations/$GCP_ZONE/clusters/oidc-exp-cluster",
  "jwks_uri": "https://container.googleapis.com/v1/projects/$GCP_PROJECT_ID/locations/$GCP_ZONE/clusters/oidc-exp-cluster/jwks",
  "response_types_supported": [
    "id_token"
  ],
  "subject_types_supported": [
    "public"
  ],
  "id_token_signing_alg_values_supported": [
    "RS256"
  ],
  "claims_supported": [
    "iss",
    "sub",
    "kubernetes.io"
  ],
  "grant_types": [
    "urn:kubernetes:grant_type:programmatic_authorization"
  ]
}
We want to assume the same role that the pod in EKS assumed; therefore, we just need to update the AssumeRole policy to include the following statement:
{
  "Effect" : "Allow",
  "Principal" : {
    "Federated" : aws_iam_openid_connect_provider.trusted_gke_cluster.arn
  },
  "Action" : "sts:AssumeRoleWithWebIdentity",
  "Condition" : {
    "StringEquals" : {
      "${local.gke_issuer_url}:sub" : "system:serviceaccount:default:oidc-exp-service-account",
    }
  }
},
At this point, IAM has been configured and all that is left is to configure the Pod appropriately. While we could install the Mutating Webhook Controller that AWS uses, it is also trivial to set up the service account volume projection ourselves and define the environment variables the AWS SDK expects for auto-configuration:
apiVersion: v1
kind: Pod
metadata:
  name: aws-cli
  namespace: default
spec:
  containers:
    - name: aws-cli
      image: amazon/aws-cli
      command:
        - /bin/bash
        - -c
        - "sleep 1800"
      volumeMounts:
        - mountPath: /var/run/secrets/tokens
          name: oidc-exp-service-account-token
      env:
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: "/var/run/secrets/tokens/oidc-exp-service-account-token"
        - name: AWS_ROLE_ARN
          value: "arn:aws:iam::$AWS_ACCOUNT_ID:role/oidc_exp_federated_role"
  serviceAccountName: oidc-exp-service-account
  volumes:
    - name: oidc-exp-service-account-token
      projected:
        sources:
          - serviceAccountToken:
              path: oidc-exp-service-account-token
              expirationSeconds: 86400
              audience: sts.amazonaws.com
Here’s a sample decoded JWT token that is mounted on the pod and presented to AWS STS, which verifies the signature and the claims against the conditions previously configured:
{
  "aud": [
    "sts.amazonaws.com"
  ],
  "exp": 1710979065,
  "iat": 1710892665,
  "iss": "https://container.googleapis.com/v1/projects/$GCP_PROJECT_ID/locations/$GCP_ZONE/clusters/oidc-exp-cluster",
  "kubernetes.io": {
    "namespace": "default",
    "pod": {
      "name": "aws-cli",
      "uid": "bcf6d914-7ce5-4332-a417-510b3cbc144a"
    },
    "serviceaccount": {
      "name": "oidc-exp-service-account",
      "uid": "c56d2a4c-2622-41e1-8c7e-e3ab6eba39b5"
    }
  },
  "nbf": 1710892665,
  "sub": "system:serviceaccount:default:oidc-exp-service-account"
}
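When a cross-cloud AssumeRoleWithWebIdentity call fails, it helps to inspect the projected token before blaming IAM. A small Go sketch (no signature verification; the mount path matches the pod spec above) that prints the claims:

package main

import (
    "encoding/base64"
    "encoding/json"
    "fmt"
    "os"
    "strings"
)

func main() {
    raw, err := os.ReadFile("/var/run/secrets/tokens/oidc-exp-service-account-token")
    if err != nil {
        panic(err)
    }

    // A JWT is three base64url-encoded segments separated by dots; the payload is the second one.
    parts := strings.Split(strings.TrimSpace(string(raw)), ".")
    if len(parts) != 3 {
        panic("not a JWT")
    }
    payload, err := base64.RawURLEncoding.DecodeString(parts[1])
    if err != nil {
        panic(err)
    }

    // Pretty-print the claims so iss, aud and sub can be compared with the IAM trust policy.
    var claims map[string]any
    if err := json.Unmarshal(payload, &claims); err != nil {
        panic(err)
    }
    out, _ := json.MarshalIndent(claims, "", "  ")
    fmt.Println(string(out))
}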
At this point the pod is ready to be launched and the S3 bucket can be listed without any further configuration:
$ kubectl --context gke exec -it aws-cli -- bash
bash-4.2# aws s3 ls s3://oidc-exp-s3-bucket
2024-03-17 18:29:42 15 test.txt
Success! 3/4 complete.
The final configuration is from EKS to GCP. While GKE clusters are configured as OIDC providers in the project-default Workload Identity Pool, we can’t add custom providers there. Therefore, we need to create a new pool:
locals {
  workload_identity_pool_id = "oidc-exp-workload-identity-pool"
}

resource "google_iam_workload_identity_pool" "pool" {
  workload_identity_pool_id = local.workload_identity_pool_id
}
Then, we need to add the EKS cluster as a provider. Note that we’re using the same OIDC issuer URL as we did in the EKS to AWS section.
resource "google_iam_workload_identity_pool_provider" "trusted_eks_cluster" {
workload_identity_pool_id = google_iam_workload_identity_pool.pool.workload_identity_pool_id
workload_identity_pool_provider_id = "trusted-eks-cluster"
attribute_mapping = {
"google.subject" = "assertion.sub"
}
oidc {
issuer_uri = aws_eks_cluster.primary.identity[0].oidc[0].issuer
}
}
Finally, we want the pods in EKS to be able to impersonate the GCP IAM Service Account we previously created for the GKE to GCP path. Therefore, we add a new member to the existing policy binding:
resource "google_service_account_iam_binding" "binding" {
service_account_id = google_service_account.default.name
role = "roles/iam.workloadIdentityUser"
members = [
"principal://iam.googleapis.com/projects/${data.google_project.project.number}/locations/global/workloadIdentityPools/${local.workload_identity_pool_id}/subject/system:serviceaccount:default:oidc-exp-service-account",
"serviceAccount:${var.gcp_project_id}.svc.id.goog[default/oidc-exp-service-account]",
]
}
Unlike the GKE to GCP path, there’s no magic interception of requests: the Kubernetes-crafted JWT token will be used to authenticate with the GCP APIs. Therefore, the pod must be configured to both mount the K8s Service Account token and set the CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE environment variable to a JSON file that tells the GCP SDK how to use the token and which service account to impersonate. Normally, this JSON is constructed with the gcloud iam workload-identity-pools create-cred-config command. However, since the structure is static, we can simply define it ahead of time as a ConfigMap:
apiVersion: v1
data:
  credential-configuration.json: |-
    {
      "type": "external_account",
      "audience": "//iam.googleapis.com/projects/$GCP_PROJECT_NUMBER/locations/global/workloadIdentityPools/oidc-exp-workload-identity-pool/providers/trusted-eks-cluster",
      "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
      "token_url": "https://sts.googleapis.com/v1/token",
      "credential_source": {
        "file": "/var/run/service-account/token",
        "format": {
          "type": "text"
        }
      },
      "service_account_impersonation_url": "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/oidc-exp-service-account@$GCP_PROJECT_ID.iam.gserviceaccount.com:generateAccessToken"
    }
kind: ConfigMap
metadata:
  name: oidc-exp-config-map
  namespace: default
And the Pod:
apiVersion: v1
kind: Pod
metadata:
  name: gcloud-cli
  namespace: default
spec:
  containers:
    - name: gcloud-cli
      image: gcr.io/google.com/cloudsdktool/google-cloud-cli:alpine
      command:
        - /bin/bash
        - -c
        - "sleep 1800"
      volumeMounts:
        - name: token
          mountPath: "/var/run/service-account"
          readOnly: true
        - name: workload-identity-credential-configuration
          mountPath: "/var/run/secrets/tokens/gcp-ksa"
          readOnly: true
      env:
        - name: CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE
          value: "/var/run/secrets/tokens/gcp-ksa/credential-configuration.json"
  serviceAccountName: oidc-exp-service-account
  volumes:
    - name: token
      projected:
        sources:
          - serviceAccountToken:
              audience: https://iam.googleapis.com/projects/$GCP_PROJECT_NUMBER/locations/global/workloadIdentityPools/oidc-exp-workload-identity-pool/providers/trusted-eks-cluster
              expirationSeconds: 3600
              path: token
    - name: workload-identity-credential-configuration
      configMap:
        name: oidc-exp-config-map
And without any further configuration, the pod can access the GCS bucket:
$ kubectl --context aws exec -it gcloud-cli -- bash
gcloud-cli:/# gcloud storage ls gs://oidc-exp-gcs-bucket
gs://oidc-exp-gcs-bucket/test.txt
Success! 4/4 complete - all four scenarios work.
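Note that CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE is specific to the gcloud CLI; for SDK-based workloads, the equivalent would be pointing GOOGLE_APPLICATION_CREDENTIALS at the same external_account file. A minimal Go sketch, assuming cloud.google.com/go/storage and that env var set to the mounted credential-configuration.json:

package main

import (
    "context"
    "fmt"

    "cloud.google.com/go/storage"
    "google.golang.org/api/iterator"
)

func main() {
    ctx := context.Background()

    // Application Default Credentials reads GOOGLE_APPLICATION_CREDENTIALS, detects the
    // external_account type and performs the STS exchange + impersonation described above.
    client, err := storage.NewClient(ctx)
    if err != nil {
        panic(err)
    }
    defer client.Close()

    it := client.Bucket("oidc-exp-gcs-bucket").Objects(ctx, nil)
    for {
        obj, err := it.Next()
        if err == iterator.Done {
            break
        }
        if err != nil {
            panic(err)
        }
        fmt.Println(obj.Name)
    }
}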
While the above example for GKE to GCP is the recommended way to access GCP resources from Kubernetes, after seeing how the EKS to GCP access is done, one is left wondering whether the magic interception of requests can be bypassed altogether. It can, and doing so results in an implementation that is even more consistent across the two clouds.
The first step is to remove the workload_identity_config and workload_metadata_config blocks from the GKE Cluster and Node Pool configurations in Terraform. Then, a new google_iam_workload_identity_pool_provider resource for the GKE cluster must be created:
resource "google_iam_workload_identity_pool_provider" "trusted_gke_cluster" {
workload_identity_pool_id = google_iam_workload_identity_pool.pool.workload_identity_pool_id
workload_identity_pool_provider_id = "trusted-gke-cluster"
attribute_mapping = {
"google.subject" = "assertion.sub"
}
oidc {
issuer_uri = local.gke_issuer_url
}
}
Since we aren’t relying on GCP’s magic, we can also remove the GKE annotation from the K8s service account:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: oidc-exp-service-account
  namespace: default
Finally, the Pod spec for gcloud-cli becomes identical to the EKS one, which requires the creation of the ConfigMap.
Multi-architecture container images include manifests for multiple architectures (typically arm64 and amd64), allowing any node to be selected for housing that workload and letting the container runtime running on that node fetch the correct image.
However, there might be certain cases where a particular Pod contains container images built for a single architecture (for example, amd64) - in such cases, Kubernetes won’t prevent that workload from being scheduled on an arm64 node, unless the cluster administrator and/or the application owner have made extra configurations. What are those configurations?
The first possibility is for the cluster administrator to have decided to only run amd64 nodes. That is likely to be largely compatible with all open source tools and images, as amd64 remains the default architecture in the cloud, and most build pipelines will target it. In other words, simply by running only the amd64 architecture, a cluster administrator will probably never face issues around container image compatibility.
The other possibility is for the cluster administrator to put a taint on all arm64 nodes in the cluster. In fact, on GKE this is done by default:
By default, GKE schedules workloads only to x86-based nodes—Compute Engine machine series with Intel or AMD processors—by placing a taint (kubernetes.io/arch=arm64:NoSchedule) on all Arm nodes. This taint prevents x86-compatible workloads from being inadvertently scheduled to your Arm nodes
Outside of GKE, a cluster administrator might do the same to the node groups/pools that include ARM instances. In those cases, the pods will need a toleration to be able to run on such nodes.
Finally, application owners also have the possibility to use node selectors to ensure their workloads are only scheduled on amd64 nodes, by targeting the default label kubernetes.io/arch with the value amd64.
As long as the cluster administrator taints nodes that are part of a node group that might include arm64 instances, we can automate the inclusion of tolerations through a Mutating Admission Controller. This controller will intercept all pod creation events, check the supported architectures for all containers specified in the spec, and include a toleration if all images are multiarch (arm64 and amd64 compatible):
func DoesPodSupportArm64(cache Cache, pod *corev1.Pod) bool {
    supported := true
    for _, container := range pod.Spec.Containers {
        if !DoesImageSupportArm64(cache, container.Image) {
            supported = false
        }
    }
    return supported
}
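When all containers pass the check, the webhook can append a toleration for the Arm taint mentioned earlier. A sketch of what that toleration looks like with the usual corev1 types (the helper name is illustrative; the actual patching logic lives in the repo):

// appendArm64Toleration adds a toleration for the kubernetes.io/arch=arm64:NoSchedule
// taint, allowing the pod to land on tainted Arm nodes as well as amd64 ones.
func appendArm64Toleration(pod *corev1.Pod) {
    pod.Spec.Tolerations = append(pod.Spec.Tolerations, corev1.Toleration{
        Key:      "kubernetes.io/arch",
        Operator: corev1.TolerationOpEqual,
        Value:    "arm64",
        Effect:   corev1.TaintEffectNoSchedule,
    })
}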
A proof-of-concept can be found on arturhoo/k8smultiarcher. In particular, regclient/regclient is used to fetch all the supported platforms/architectures through the Manifests V2 API, which does not require downloading the full image. The project also incorporates a cache and a fail-open mechanism (in case of failures or timeout, no toleration is added).
One important implementation detail is that the toleration added is for a multi-architecture node group, not an arm64-only one. This is particularly important when running a cluster autoscaling strategy (e.g. Karpenter) that taps into several instance types and spot offerings, i.e. at certain times it might be cheaper to run amd64 nodes.
A full end-to-end example can be found on the test suite for the project.
]]>The first line of defence should be configuring the API server flags --max-requests-inflight
and --max-mutating-requests-inflight
, followed by configuring API Priority and Fairness, which allows for fine grained requests to be deprioritised (and ultimately rate limited) relative to other requests. Finally, the alpha Event Rate Limit can put a ceiling on the number of requests per second sent to the API server on a given namespace, for example.
Thinking about a final line of defence, I decided to explore implementing an admission webhook that would be configured (through a ValidatingWebhookConfiguration) to intercept all pod creation requests and enforce a rate limit.
var limiter = rate.NewLimiter(rate.Every(10*time.Second), 1)

func validatingHandler(c *gin.Context) {
    var review admissionv1.AdmissionReview
    if err := c.Bind(&review); err != nil {
        return
    }
    allowed := limiter.Allow()
    var status, msg string
    if allowed {
        status = metav1.StatusSuccess
    } else {
        status = metav1.StatusFailure
        msg = "rate limit exceeded"
    }
    review.Response = &admissionv1.AdmissionResponse{
        UID:     review.Request.UID,
        Allowed: allowed,
        Result: &metav1.Status{
            Status:  status,
            Message: msg,
        },
    }
    c.JSON(200, review)
}
Using golang.org/x/time/rate, we keep a limiter that allows one request every 10 seconds. If the request is allowed, we return StatusSuccess; otherwise we return a StatusFailure, which will prevent the pod from being created.
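A single global limiter is a blunt instrument; if fairness between tenants matters, one possible variation (a sketch, not part of the original project) is to key limiters by namespace, reusing the same imports as above plus sync:

// namespaceLimiters hands out one rate.Limiter per namespace, so a noisy tenant
// only throttles its own pod creations.
type namespaceLimiters struct {
    mu       sync.Mutex
    limiters map[string]*rate.Limiter
}

func (n *namespaceLimiters) allow(namespace string) bool {
    n.mu.Lock()
    defer n.mu.Unlock()
    if n.limiters == nil {
        n.limiters = map[string]*rate.Limiter{}
    }
    l, ok := n.limiters[namespace]
    if !ok {
        l = rate.NewLimiter(rate.Every(10*time.Second), 1)
        n.limiters[namespace] = l
    }
    return l.Allow()
}

Inside validatingHandler, the namespace for the keyed lookup is available as review.Request.Namespace.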
The configuration itself defines a rule that narrows the scope to pod creation only, with a ‘fail open’ failure policy:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: k8slimiter-pod-creation
  annotations:
    cert-manager.io/inject-ca-from: k8slimiter/k8slimiter-certificate
webhooks:
  - name: k8slimiter-pod-creation.k8slimiter.svc
    admissionReviewVersions:
      - v1
    clientConfig:
      service:
        name: k8slimiter-service
        namespace: k8slimiter
        path: "/validate"
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    failurePolicy: Ignore
    sideEffects: None
With those in place, creating pods in quick succession leads to the expected rate limiting behaviour:
$ kubectl run "tmp-pod-$(date +%s)" --restart Never --image debian:12-slim -- sleep 1
pod/tmp-pod-1698005111 created
$ kubectl run "tmp-pod-$(date +%s)" --restart Never --image debian:12-slim -- sleep 1
Error from server: admission webhook "k8slimiter-pod-creation.k8slimiter.svc" denied the request: rate limit exceeded
A full working example can be found on arturhoo/k8slimiter, which leverages Gin and cert-manager to achieve a minimal and straightforward admission webhook setup.
Kubernetes cluster administrators sometimes perform maintenance operations on nodes, such as hardware swaps and kernel or Kubernetes version upgrades. Application developers, on the other hand, might be interested in ensuring that their applications remain available during those maintenance operations. This is usually achieved through Disruption Budgets - here’s what the official docs have to say about them:
[They] limit the number of concurrent disruptions that your application experiences, allowing for higher availability while permitting the cluster administrator to manage the clusters nodes.
To demonstrate how they’re utilized, let’s start with a simple Deployment of two nginx replicas:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
By default, pods are replaced following the RollingUpdate strategy, which gradually replaces old pods with new ones. Through the maxSurge and maxUnavailable options, an application developer can control how those pods are replaced. In the example below, the number of available replicas should never be lower than the deployment size itself and at most one extra pod can be created:
...
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
...
Finally, for the sake of the example, let’s ensure that two nginx pods never run on the same node through a podAntiAffinity rule:
...
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - nginx
          topologyKey: kubernetes.io/hostname
...
At this point, the application developer can roll out new versions of their deployment safely. However, the availability of the application is still at risk of being ‘disrupted’ by maintenances that affect the pods. The cluster administrator can drain all nodes in the cluster, leaving all pods of the deployment in the pending state:
$ kubectl get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
nginx-deployment 0/2 2 0 110m
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-deployment-855f7d96b9-8d48g 0/1 Pending 0 3m22s
nginx-deployment-855f7d96b9-bgvzb 0/1 Pending 0 3m18s
This is where Disruption Budgets are useful. The application developer might decide to create the following PodDisruptionBudget, which has similar semantics to the RollingUpdate strategy of the Deployment:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: nginx
However, setting a Disruption Budget with maxUnavailable: 0 has important implications:

If you set maxUnavailable to 0% or 0, or you set minAvailable to 100% or the number of replicas, you are requiring zero voluntary evictions. When you set zero voluntary evictions for a workload object such as ReplicaSet, then you cannot successfully drain a Node running one of those Pods. If you try to drain a Node where an unevictable Pod is running, the drain never completes. This is permitted as per the semantics of PodDisruptionBudget.
As suggested by the PDB docs, if we try to drain a node where a pod protected by a PDB with such characteristics exists, it will never complete. Take a cluster with four nodes as an example:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
kind-control-plane Ready control-plane 58s v1.25.3
kind-worker Ready <none> 34s v1.25.3
kind-worker2 Ready <none> 34s v1.25.3
kind-worker3 Ready <none> 34s v1.25.3
kind-worker4 Ready <none> 33s v1.25.3
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-deployment-bdccd6b79-84zhj 1/1 Running 0 41s 10.244.3.2 kind-worker <none> <none>
nginx-deployment-bdccd6b79-hqbgx 1/1 Running 0 41s 10.244.1.2 kind-worker3 <none> <none>
If we try to drain one of the nodes, the process will never complete:
$ kubectl drain --ignore-daemonsets kind-worker
node/kind-worker cordoned
Warning: ignoring DaemonSet-managed Pods: kube-system/kindnet-dkp4b, kube-system/kube-proxy-c66b6
evicting pod default/nginx-deployment-bdccd6b79-84zhj
error when evicting pods/"nginx-deployment-bdccd6b79-84zhj" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/nginx-deployment-bdccd6b79-84zhj
error when evicting pods/"nginx-deployment-bdccd6b79-84zhj" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/nginx-deployment-bdccd6b79-84zhj
error when evicting pods/"nginx-deployment-bdccd6b79-84zhj" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/nginx-deployment-bdccd6b79-84zhj
...
However, in some situations, such as the one above, a cluster administrator can successfully drain the node by issuing a restart of the application blocking the drain:
$ kubectl rollout restart deployment nginx-deployment
deployment.apps/nginx-deployment restarted
Here’s a full example:
This creates a dilemma: Disruption Budgets with those characteristics are explicitly communicating their application’s “desire” to never be drained. On the other hand, the application’s strategy allows the application to be restarted while still respecting its RollingUpdate spec.
In practical terms, the cluster administrator can perform the necessary maintenance while respecting the application’s availability characteristics. But should cluster administrators restart applications in the cluster without the application developer’s consent?
In some circumstances, for example companies with SRE teams maintaining production clusters and Product teams developing applications, such operations could be performed, as there might not be strict terms of service in place.
I discussed this situation on the Kubernetes sig-cli Slack channel, and a few folks were receptive to the idea of giving Kubernetes users a way to automatically work around impossible drains.
The drain logic only lives in kubectl, which calls the Eviction API for each pod running on the node. My first idea was to introduce a new flag to kubectl drain that would trigger a rollout restart of the blocking controllers (Deployment, StatefulSet, ReplicaSet) when the Eviction API returned a 429 response.
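To make the idea concrete, here’s a rough client-go sketch of what such a flag could do - purely illustrative, assuming the usual client-go imports, with the helper name and the deployment argument made up for the example:

// evictOrRestart tries to evict a pod; if the eviction is blocked by a PDB (HTTP 429),
// it falls back to triggering a rollout restart of the owning Deployment.
func evictOrRestart(ctx context.Context, cs *kubernetes.Clientset, pod corev1.Pod, deployment string) error {
    eviction := &policyv1.Eviction{
        ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
    }
    err := cs.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction)
    if err == nil || !apierrors.IsTooManyRequests(err) {
        return err
    }

    // Same strategic-merge patch that `kubectl rollout restart` applies.
    patch := fmt.Sprintf(
        `{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":%q}}}}}`,
        time.Now().Format(time.RFC3339),
    )
    _, err = cs.AppsV1().Deployments(pod.Namespace).Patch(
        ctx, deployment, types.StrategicMergePatchType, []byte(patch), metav1.PatchOptions{},
    )
    return err
}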
When proposing this in the sig-cli fortnightly meeting, we concluded that the drain behavior sufficiently meets the existing semantics and that impossible drain situations are opportunities to educate application developers about the implications of those restrictive PDBs. For example, OpenShift’s Kubernetes Controller Manager Operator has alerts configured for those restrictive PDBs:
Standard workloads should have at least one pod more than is desired to support API-initiated eviction. Workloads that are at the minimum disruption allowed level violate this and could block node drain. This is important for node maintenance and cluster upgrades.
Nonetheless, the group suggested writing this up (this blog post) and presenting it to the sig-apps and sig-api-machinery groups. A potential proposal could be to introduce new functionality to the Eviction API, alongside a new field in the PodDisruptionBudget spec that would trigger the update of the blocking controller. For example:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  rolloutRestartAllowed: true
  maxUnavailable: 0
  selector:
    matchLabels:
      app: nginx
Here, the application developer would be explicitly granting permission to the cluster administrator to perform updates to their deployment or applications.
]]>Kafka is a distributed message system that excels in high throughput architectures with many listeners. However, Kafka is also often used as job queue solution and, in this context, its head-of-line blocking characteristics can lead to increased latency. Let’s build an experiment to explore it in practice.
Messages are sent to topics in Kafka, where they are hashed (typically by key) and assigned to partitions - one topic has one or more partitions. Multiple consumers can read from a topic by forming a Consumer Group, with each one being automatically assigned a subset of the partitions for a given topic.
No two consumers from the same Consumer Group can read from the same partition. Therefore, to avoid idle consumers, a topic must have at least as many partitions as there are consumers.
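To make the assignment concrete, here’s a toy Go sketch of key-to-partition mapping (real clients such as librdkafka ship their own partitioners, e.g. CRC32- or murmur2-based, so this is only illustrative):

package main

import (
    "fmt"
    "hash/fnv"
)

// partitionFor maps a message key to one of numPartitions partitions.
// All messages with the same key land on the same partition, and therefore
// on the same consumer within a consumer group.
func partitionFor(key string, numPartitions uint32) uint32 {
    h := fnv.New32a()
    h.Write([]byte(key))
    return h.Sum32() % numPartitions
}

func main() {
    for i := 0; i < 5; i++ {
        key := fmt.Sprintf("key-%d", i)
        fmt.Println(key, "->", partitionFor(key, 10))
    }
}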
At this point, head-of-line blocking might be starting to make sense. If Consumer 0 takes a long time to perform the work associated with a message (either because the work is expensive or because it is under resource pressure), all other pending messages in the partitions it is responsible for will remain pending.
Side note: where Kafka’s message streaming capabilities really shine is when you have many subscribers. A new consumer group can be formed and process the same messages as the original group, at its own pace. At this point, it is no longer a worker queue in the traditional sense.
This is in contrast to other solutions like RabbitMQ or beanstalkd where, regardless of the number of consumers, pending jobs will be served to the first consumer that asks for one on a given queue.
Let’s take a look at beanstalkd, which I have introduced in a previous blog post:
With beanstalkd, jobs are sent to tubes. Consumers simply connect to the server and reserve jobs from a given tube. For a given beanstalkd server, jobs are given out in the same order they were enqueued.
Here, head-of-line blocking is no longer a concern, as jobs will continue to be served from the queue to available consumers even if a particular consumer is slow. Contrary to Kafka with multiple consumer groups, a job in a tube cannot be served to two consumers in the happy path. When a reservation times out, beanstalkd will requeue that job. These are traditional work queue primitives.
In this experiment, each job represents a unit of work: a synchronous sleep. The sleep duration is determined by the producer, which creates 100 jobs in total. Every job has a sleep value of 0, except for four of them, which have a sleep value of 10s.
beanstalkd_tube = beanstalkd.tubes[BEANSTALKD_MAIN_TUBE]
100.times do |i|
  msg = (i % 25).zero? ? 10 : 0
  beanstalkd_tube.put(msg.to_s)
  kafka_producer.produce(
    topic: KAFKA_MAIN_TOPIC,
    payload: msg.to_s,
    key: "key-#{i}"
  )
end
If we only had a single consumer, the total time to complete all jobs would be at least 40s, as that consumer would sleep for 10s four times. If we had an unlimited number of consumers, the minimum total time would be 10s, as at least four consumers would have to sleep for 10s in parallel.
Back to the experiment, both Kafka and beanstalkd are set up, each with five consumers. The Kafka topic is configured with 10 partitions; therefore, each Kafka consumer is responsible for two partitions in a single consumer group configuration. Below are the implementations for each consumer type:
consumer.subscribe(KAFKA_MAIN_TOPIC)
consumer.each do |msg|
  duration = msg.payload.to_i
  log.info 'Going to sleep' if duration.positive?
  sleep(msg.payload.to_i)
  producer.produce(
    topic: KAFKA_COUNTER_TOPIC,
    payload: 'dummy'
  )
end
main_tube = beanstalkd.tubes[BEANSTALKD_MAIN_TUBE]
counter_tube = beanstalkd.tubes[BEANSTALKD_COUNTER_TUBE]
loop do
  job = main_tube.reserve
  duration = job.body.to_i
  log.info 'Going to sleep' if duration.positive?
  sleep(duration)
  counter_tube.put('dummy')
  job.delete
end
After sleeping, consumers produce a dummy message to a different topic/tube, which is used by an out-of-band watcher process that keeps track of global progress. Each watcher process starts the clock when the first dummy message is received and stops it when the 100th message is received.
To kickstart the experiment, we start both Kafka and beanstalkd, five consumers for each and the two watcher processes:
$ docker-compose up
queue-beanstalkd-watcher-1 | I, [2023-03-19T22:03:59] Started beanstalkd watcher
queue-beanstalkd-consumer-1 | I, [2023-03-19T22:04:00] Connected to beanstalkd
queue-beanstalkd-consumer-3 | I, [2023-03-19T22:04:01] Connected to beanstalkd
queue-beanstalkd-consumer-4 | I, [2023-03-19T22:04:01] Connected to beanstalkd
queue-beanstalkd-consumer-5 | I, [2023-03-19T22:04:02] Connected to beanstalkd
queue-beanstalkd-consumer-2 | I, [2023-03-19T22:04:02] Connected to beanstalkd
queue-kafka-define-topic-1 | I, [2023-03-19T22:04:11] Topics created!
queue-kafka-define-topic-1 exited with code 0
queue-kafka-watcher-1 | I, [2023-03-19T22:04:12] Started Kafka watcher
queue-kafka-consumer-2 | I, [2023-03-19T22:04:13] Subscribed to kafka topic
queue-kafka-consumer-1 | I, [2023-03-19T22:04:14] Subscribed to kafka topic
queue-kafka-consumer-4 | I, [2023-03-19T22:04:14] Subscribed to kafka topic
queue-kafka-consumer-5 | I, [2023-03-19T22:04:14] Subscribed to kafka topic
queue-kafka-consumer-3 | I, [2023-03-19T22:04:15] Subscribed to kafka topic
At this point, with no messages having been produced, we can inspect the topology of Kafka partitions and consumers:
$ kafka-consumer-groups.sh --describe --group main-group --bootstrap-server localhost:9092
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
main-group main 8 - 0 - rdkafka-c12c408c-3da7-48b8-922e-17053059b828 /172.19.0.12 rdkafka
main-group main 9 - 0 - rdkafka-c12c408c-3da7-48b8-922e-17053059b828 /172.19.0.12 rdkafka
main-group main 0 - 0 - rdkafka-57fb04b5-4c10-4403-894c-587bb95a285e /172.19.0.15 rdkafka
main-group main 1 - 0 - rdkafka-57fb04b5-4c10-4403-894c-587bb95a285e /172.19.0.15 rdkafka
main-group main 2 - 0 - rdkafka-686169bc-eef9-498b-a7ca-a243c401f4bd /172.19.0.13 rdkafka
main-group main 3 - 0 - rdkafka-686169bc-eef9-498b-a7ca-a243c401f4bd /172.19.0.13 rdkafka
main-group main 6 - 0 - rdkafka-98349f3c-f097-450c-a1a1-82c3adef1fd3 /172.19.0.14 rdkafka
main-group main 7 - 0 - rdkafka-98349f3c-f097-450c-a1a1-82c3adef1fd3 /172.19.0.14 rdkafka
main-group main 4 - 0 - rdkafka-87de172e-6759-46d5-b788-e27e5fb52e02 /172.19.0.11 rdkafka
main-group main 5 - 0 - rdkafka-87de172e-6759-46d5-b788-e27e5fb52e02 /172.19.0.11 rdkafka
main-group counter 0 - 0 - rdkafka-b6c8a89e-cb22-4872-85c5-57cf5da68756 /172.19.0.10 rdkafka
As seen above, each consumer has been assigned two partitions, and all 10 are empty. Time to produce the 100 messages:
$ ruby producer.rb
And wait for the results:
queue-beanstalkd-consumer-1 | I, [2023-03-19T22:04:28] Going to sleep
queue-beanstalkd-watcher-1 | I, [2023-03-19T22:04:28] Started beanstalkd clock!
queue-beanstalkd-consumer-3 | I, [2023-03-19T22:04:28] Going to sleep
queue-kafka-consumer-1 | I, [2023-03-19T22:04:28] Going to sleep
queue-beanstalkd-consumer-5 | I, [2023-03-19T22:04:28] Going to sleep
queue-beanstalkd-consumer-4 | I, [2023-03-19T22:04:28] Going to sleep
queue-kafka-consumer-2 | I, [2023-03-19T22:04:28] Going to sleep
queue-kafka-consumer-5 | I, [2023-03-19T22:04:28] Going to sleep
queue-kafka-watcher-1 | I, [2023-03-19T22:04:28] Started Kafka clock!
queue-beanstalkd-watcher-1 | I, [2023-03-19T22:04:38] beanstalkd took 10s to complete!
queue-kafka-consumer-2 | I, [2023-03-19T22:04:38] Going to sleep
queue-kafka-watcher-1 | I, [2023-03-19T22:04:48] Kafka took 20s to complete!
The full experiment is available on github.com/arturhoo/kafka-experiment.
From the watcher times above, we can clearly see a difference between the two setups: Kafka’s took double the amount of time to process all 100 messages. The head-of-line blocking behavior, however, has further implications. By capturing the timestamp at which each nth job is completed (as measured by the watcher), we can plot the global progress for both setups:
As seen above, the beanstalkd setup was able to process 96 out of the 100 messages in less than one second. The Kafka setup, however, had two long 10s periods where no messages were processed - that is because there was one consumer (queue-kafka-consumer-2) that was assigned two messages with a sleep duration of 10s.
This is in contrast with the beanstalkd setup, where four consumers slept in parallel while the fifth consumer (beanstalkd-consumer-2) was able to empty the queue, effectively working more than its peers.
–
Thanks @javierhonduco for reviewing this post.
]]>Traditionally, reverse proxies are configured with a static set of rules which determines the correct upstream/backend. When put in front of a sharded architecture, they might route traffic to the appropriate backend based on a subdomain (e.g., us-east-1.example.com
) or a path (e.g., example.com/europe-west-2
).
This can be particularly common if you have the same application deployed in two different jurisdictions (data and control plane). Most times it is enough to have customers use the unambiguous URL for interacting with an application - in those cases a global reverse proxy (or API Gateway) might not even exist.
However, sometimes it might be desirable (or necessary) to have a unique hostname that serves all customers. For example, you might want POST requests to be sent to a short URL, using JSON Web Tokens for authorization. Or you might be creating a Github App that can only configure a single webhook URL to receive events.
In such situations, for every request, we need to look up the correct backend for that request based on its contents (headers, body, query parameters) before dispatching it. The static rules from traditional reverse proxies aren’t enough in this case.
This can be solved quite easily with Caddy. Here are the components in our proof of concept:
- customer waitrose, served by the European backend (europe-west-2)
- customer walmart, served by the American backend (us-east-1)

First, we will populate our shard look-up table in Redis:
> SET walmart 'us-east-1:8080'
> SET waitrose 'europe-west-2:8080'
In this example, a request will be sent on behalf of customer waitrose. Since the customer information will be embedded in the JWT, we need a way to generate a token. First, we will generate asymmetric keys (symmetric would also have worked):
$ openssl genrsa -out cert/id_rsa 2048
$ openssl rsa -in cert/id_rsa -pubout > cert/id_rsa.pub
Next, we leverage Ruby’s conciseness to generate the JWT:
require 'openssl'
require 'jwt'
priv = OpenSSL::PKey::RSA.new(File.open('cert/id_rsa'))
JWT.encode({customer: 'waitrose'}, priv, 'RS256')
Brilliant, we have everything we need to send a request. Next, we implement our own Caddy module that allows for the dynamic selection of a backend. Here’s a brief description of its behaviour:
- extracts the JWT from the Authorization header using the Bearer scheme
- parses the customer claim and looks up the corresponding shard in Redis, storing the result in the shard.upstream variable - this variable will be exposed in the Caddyfile
- sets the customer name in the X-Customer request header (more on it later)

And the code:
func (m JWTShardRouter) ServeHTTP(w http.ResponseWriter, r *http.Request, next caddyhttp.Handler) error {
    authHeader := r.Header.Get("Authorization")
    tokenStr := strings.TrimPrefix(authHeader, "Bearer ")
    claims := ParseJWT(tokenStr)
    customer, _ := claims["customer"].(string)
    r.Header.Set("X-Customer", customer)
    shard, _ := rdb.Get(ctx, customer).Result()
    caddyhttp.SetVar(r.Context(), "shard.upstream", shard)
    return next.ServeHTTP(w, r)
}
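The ParseJWT helper isn’t shown above; one way it could look, using github.com/golang-jwt/jwt and the public key generated earlier (a sketch, not necessarily the repo’s exact implementation):

// ParseJWT verifies the token signature against the RSA public key and returns its claims.
// Error handling is kept minimal for brevity.
func ParseJWT(tokenStr string) jwt.MapClaims {
    pubPEM, _ := os.ReadFile("cert/id_rsa.pub")
    pubKey, _ := jwt.ParseRSAPublicKeyFromPEM(pubPEM)

    token, err := jwt.Parse(tokenStr, func(t *jwt.Token) (interface{}, error) {
        return pubKey, nil
    })
    if err != nil || !token.Valid {
        return jwt.MapClaims{}
    }
    return token.Claims.(jwt.MapClaims)
}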
Finally, we use the registered shard.upstream variable in our Caddyfile:
{
    order jwt_shard_router before method
}

http://localhost:5000 {
    jwt_shard_router
    reverse_proxy {
        to {http.vars.shard.upstream}
    }
}
Only the backend server is left now. Since this is just a proof of concept, it doesn’t do much. It replies to requests coming to / and leverages the fact that Caddy has already decoded the customer from the JWT and put that information in the X-Customer header. Knowing the customer, it greets them in the response, while including the shard name (provided through an environment variable) in the X-Shard header. This response from the backend server demonstrates that the process works end-to-end.
func main() {
    r := gin.Default()
    r.GET("/", func(c *gin.Context) {
        customer := c.Request.Header.Get("X-Customer")
        c.Header("X-Shard", os.Getenv("SHARD"))
        c.JSON(http.StatusOK, gin.H{
            "message": fmt.Sprintf("Hello %s!", customer),
        })
    })
    r.Run()
}
Time to test our POC. We spin up our patched Caddy server, Redis and the two backend servers:
$ docker-compose up
...
$ docker-compose ps
SERVICE COMMAND PORTS
caddy "/caddy run" 0.0.0.0:5000->5000/tcp
europe-west-2 "/upstream"
redis "docker-entrypoint.s…" 6379/tcp
us-east-1 "/upstream"
And issue the request:
$ http localhost:5000 -A bearer -a $WAITROSE_TOKEN
HTTP/1.1 200 OK
Content-Length: 29
Content-Type: application/json; charset=utf-8
Date: Sun, 12 Mar 2023 12:00:00 GMT
Server: Caddy
X-Shard: europe-west-2
{
"message": "Hello waitrose!"
}
Success! A full example is available on github.com/arturhoo/caddyshardrouter.
I’ve chosen Caddy as it has been on my radar for a while for its focus on developer experience - as seen above, the dynamic selection of upstream servers was made possible in less than 80 lines of code. It has also had the opportunity to mature with the v2 rewrite.
Being written in Go allows us to generate a self-contained binary that can easily be placed in a distroless image. To further exemplify Caddy’s focus on devx, the xcaddy utility allows us to build a patched Caddy server with our module through a single command.
Here are some potential alternatives:
Cooking time: 50 minutes.
Instructions:
Plate and top with grated Parmesan cheese and chopped chives.
]]>We were the first guests to arrive and were sat in the leftmost corner of the bar - this allowed us to be very close to the action and be served by Raz himself for the main courses. We were immediately served a refreshing gin and tonic granita with peanuts. We opted for the Yarden Blanc de Blancs, which I hadn’t tried yet on my trip. It paired well with the snacks that were served while the remainder of the guests were still arriving: beef tartare, carrot crisps, asparagus tartlet, chickpea panisse and mini sandwiches made from dehydrated carrot juice
Service was outstanding throughout the evening, with every member of staff greeting us at some point and food restrictions being acknowledged when needed. The restaurant also serves Acqua Panna mineral still and sparkling free of charge - a big plus in my books.
Moving on, we had yellowtail sashimi served with raspberry, tomato and edible flowers. The texture and depth of flavor of the fish were impressive, and my friend remarked that it was the best raw fish he had ever eaten
Following the fish were the outstanding Parker House rolls served with whipped tomato cream - I tried saving some for the courses to come, but it was far too good. Over the next hour, we would observe the rigorous precision, from presentation to technique and timing, with which Raz and his team prepared the next three main courses. A Chateau Golan red seemed appropriate
After the artichoke gnudi and the grilled grouper with asparagus, the main stars of the night - local ducks - left the oven and were carved up in two dishes: the magret, rare with a crispy skin, sliced and served with walnuts; and the confit prepared as a rillette. Both were fantastic, but the subtleness of the rillette might have been the highlight of the night
In the interlude to the dessert courses, aged goat cheese was served. The restaurant uses local ingredients only - which reminded me of Domestic in Aarhus
For the sweet courses, those local ingredients were combined into one of the best dessert courses I’ve had in a restaurant: artichoke, lemon and cardamom. They were followed by a cloud dense fennel parfait
As my friend and I recollected the past two and half hours, we acknowledged the perfection of both timing and quantity in the restaurant’s tasting menu. OCD’s simple and modern decor extends to the toilets, which are equipped with locally produced amenities
]]>We sat in a corner table facing peaceful Green Park. We opted for the £48 set lunch menu, starting with finger-food style vegetables, a strawberry gazpacho, bread and cold meats
For starters, Laura went with the beef tartare and I opted for gouda custard with wild garlic - both were gentle on the palate
We continued with the cod and the chicken with spätzle - they were served in a nice covered bowl that complemented the restaurant décor
Service was friendly, attentive and not overindulging, which is nice for a Sunday afternoon. The oak staircase connects the three floors of the restaurant. For the last course, both of us decided for the almond and apricot soufflé with osmanthus ice cream
The toilets have Le Labo amenities and the brand’s signature Santal scent infused the room
For petit fours, jasmine marshmallows and pastéis de nata - both lovely
]]>Take a book, give a book - or both
Around the corner, let the smell of incenses guide you
They even have a stylish shuffleboard
Laura got an Unveiled Spritz (gin-based) and I got the Our Whisky Sour
For dinner, we headed to Santo Remedio. We started with guacamole (which was exceptional) and tortilla chips
We sat at the second floor, which had nice views of the street. The restaurant also has a unique personality to it
To continue, we went with Ox Tongue Gorditas and the Octopus Tikin Xik - both were delicious and with a nice kick. The extra pico de gallo on the side was very hot
]]>