After watching our monthly AWS bill climb steadily as our Kubernetes workloads grew, I knew we needed a smarter approach to node provisioning. Traditional cluster autoscaling was burning through our budget with over-provisioned nodes sitting idle, while still taking forever to scale up when I needed capacity.
I’d heard of Karpenter but hadn’t tried it; a middle-of-the-night scaling incident (more on that below) finally pushed me to give it a shot. In this guide, I’ll share how I built a production-ready, multi-architecture EKS cluster that cut our compute costs by 70% while making scaling nearly instantaneous.
The complete source code is available on GitHub.
- Switched from Cluster Autoscaler to Karpenter
- Cut compute costs by 70% using spot + Graviton
- Reduced pod scheduling latency from 3 mins to 20 seconds
- Built multi-arch EKS cluster with AMD64 + ARM64
- Full Terraform + CI/CD pipeline on GitHub
Ditching Traditional Autoscaling
Here’s what I did not like about the standard Kubernetes Cluster Autoscaler: I’d have pods sitting in “Pending” state for 2-3 minutes waiting for new nodes, while paying for a bunch of over-provisioned m5.large instances that were barely hitting 20% CPU utilization most of the time.
I remember one particularly frustrating incident where a traffic spike hit our app at 2 AM. The cluster autoscaler took over 4 minutes to provision new nodes, and by then our users were getting timeouts. That’s when I started seriously looking at Karpenter.
What makes Karpenter different is that it doesn’t think in terms of fixed node groups. Instead, it looks at your pending pods and says, “OK, you need 2 vCPUs and 4GB RAM? Let me find the cheapest spot instance that fits those requirements.” It’s like having a really smart provisioning assistant that actually understands your workload patterns.
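To make that concrete, here’s a hedged example (the pod name and image are just stand-ins): a pod like this sits in Pending, and Karpenter reads its resource requests to decide what to launch.
apiVersion: v1
kind: Pod
metadata:
  name: demo-api                # hypothetical workload
spec:
  containers:
    - name: app
      image: nginx:stable       # stand-in image
      resources:
        requests:
          cpu: "2"              # Karpenter aggregates pending pods' requests...
          memory: 4Gi           # ...and launches the cheapest instance type that satisfies them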
After implementing it, I saw our node provisioning times drop from 3+ minutes to under 20 seconds. And the cost savings — about a 70% reduction in our compute bill, which freed up budget for other infrastructure improvements.
Our Multi-Architecture Setup
One thing that impressed me about Karpenter was how easily it handled our mixed workload requirements. I had some legacy PHP applications that needed x86 instances, but I also wanted to experiment with ARM64 Graviton instances for our newer microservices.
The architecture I ended up with looks like this:
┌───────────────────────────────────────────────────────────┐
│                        EKS Cluster                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │    AMD64    │  │    ARM64    │  │  Karpenter  │        │
│  │  NodePool   │  │  NodePool   │  │ Controller  │        │
│  └─────────────┘  └─────────────┘  └─────────────┘        │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐  │
│  │               Spot Instance Handling                │  │
│  │            SQS Queue + EventBridge Rules            │  │
│  └─────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────┘
The architecture consists of several key components:
- VPC Infrastructure: Getting the networking right was crucial. I spent some time getting the subnet tagging perfect: those karpenter.sh/discovery tags aren’t just nice-to-have, they’re essential for Karpenter to find your subnets. I forgot the karpenter.sh/discovery tag once, and it took an hour to debug why nodes wouldn’t launch.
- EKS Cluster: The OIDC provider integration was probably the trickiest part to get right initially. It’s what allows Karpenter to securely talk to AWS APIs without hardcoded credentials.
- Karpenter Controller: This is where the magic happens. Once deployed, it constantly watches for unscheduled pods and intelligently provisions the right instances.
- Multi-Architecture Support: Having separate node pools for AMD64 and ARM64 lets us run different workloads on the most cost-effective hardware.
- Spot Instance Management: The interruption handling was something I was initially worried about, but AWS’s SQS integration makes it surprisingly robust.
The Terraform Structure That Actually Works
My first attempt at structuring this was a mess. I had everything in one giant main.tf file, and making changes was terrifying. After some refactoring (and a few late nights), I landed on this modular approach:
├── eks-module/ # EKS cluster creation
├── karpenter/ # Karpenter autoscaler setup
├── vpc/ # VPC infrastructure
├── root/ # Main Terraform execution
└── .github/workflows/ # CI/CD pipeline
VPC: Getting the Networking Foundation Right
The VPC setup took me a few iterations to get right. The key insight was that the tagging strategy isn’t just documentation — it’s functional.
Here’s what I learned works:
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr_block
enable_dns_support = true
enable_dns_hostnames = true
tags = {
Name = "${var.project_name}-vpc"
"kubernetes.io/cluster/${var.project_name}" = "shared"
}
}
resource "aws_subnet" "private_subnets" {
# ... configuration
tags = {
Name = "private-subnet-${count.index}"
"karpenter.sh/discovery" = var.project_name
"kubernetes.io/role/internal-elb" = "1"
}
}
EKS and Karpenter: The Heart of the System
The EKS setup was straightforward, but getting the OIDC provider right was critical. This is what allows Karpenter to assume IAM roles securely:
resource "aws_eks_cluster" "project" {
name = var.cluster_name
role_arn = aws_iam_role.cluster_role.arn
vpc_config {
subnet_ids = var.private_subnet_ids
endpoint_private_access = true
endpoint_public_access = true
}
access_config {
authentication_mode = "API"
}
}
resource "aws_iam_openid_connect_provider" "eks" {
client_id_list = ["sts.amazonaws.com"]
thumbprint_list = [data.tls_certificate.eks.certificates[0].sha1_fingerprint]
url = aws_eks_cluster.project.identity[0].oidc[0].issuer
}
For the Karpenter deployment, I went with the Helm chart approach. One thing I learned is that you need to install the CRDs first, then the main chart. The interruption queue setup was crucial — it’s what prevents your workloads from getting abruptly terminated when spot instances are reclaimed:
# Install CRDs first
resource "helm_release" "karpenter_crd" {
  name      = "karpenter-crd"
  chart     = "oci://public.ecr.aws/karpenter/karpenter-crd"
  version   = var.karpenter_version
  namespace = var.namespace
}

# Then install the main Karpenter chart
resource "helm_release" "karpenter" {
  depends_on = [helm_release.karpenter_crd]

  name      = "karpenter"
  chart     = "oci://public.ecr.aws/karpenter/karpenter"
  version   = var.karpenter_version
  namespace = var.namespace

  set {
    name  = "settings.clusterName"
    value = var.cluster_name
  }

  # Karpenter v1.x charts use settings.interruptionQueue
  # (older charts called this settings.aws.interruptionQueueName)
  set {
    name  = "settings.interruptionQueue"
    value = aws_sqs_queue.karpenter_interruption_queue.name
  }

  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = var.controller_role_arn
  }
}
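One piece that ties the earlier subnet tagging back in is the EC2NodeClass, which the NodePools in the next section reference by name. Here’s a minimal sketch of one that would pair with this setup; the default name, the node IAM role, and the discovery tag value are placeholders rather than the exact values from my repo:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default                           # placeholder name referenced by the NodePools below
spec:
  role: KarpenterNodeRole-placeholder     # IAM role for the nodes themselves (not the controller role)
  amiSelectorTerms:
    - alias: al2023@latest                # latest Amazon Linux 2023 AMI for the node's architecture
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-project    # matches the subnet tags from the VPC module
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-project    # the node security group needs the same tag
If those selector tags don’t match anything, Karpenter simply won’t launch nodes, which is exactly the failure mode I hit when I forgot the subnet tag.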
Multi-Architecture NodePools
Initially, I was skeptical about ARM64. “Will our applications even work?” But after some testing, I found that our Node.js and Python services ran perfectly on Graviton instances — often with better performance per dollar.
Here’s how I set up the node pools:
AMD64 NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: amd64-nodepool
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                # assumes an EC2NodeClass named "default" (see sketch above)
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.large", "m5.xlarge", "c5.large", "c5.xlarge"]
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 1m
ARM64 NodePool: The Graviton Experiment
The ARM64 setup requires a bit more thought. I added taints to prevent incompatible workloads from accidentally landing on ARM64 nodes:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: arm64-nodepool
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                # reuses the same EC2NodeClass
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["t4g.medium", "c6g.large", "m6g.large"]
      taints:
        - key: arm64
          value: "true"
          effect: NoSchedule
I learned that specifying exact AMI IDs gives you more control over what gets deployed, especially when testing different ARM64 configurations. You can also use nodeSelector requirements to target specific architectures without needing taints if your applications are architecture-aware.
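For the services we did move to Graviton, the scheduling side boils down to a nodeSelector plus a toleration for that taint. Here’s a rough sketch (the Deployment name and image are placeholders, and the image has to be built for arm64 or published as a multi-arch manifest):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: image-processor              # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: image-processor
  template:
    metadata:
      labels:
        app: image-processor
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64    # land on the Graviton NodePool
      tolerations:
        - key: arm64                 # tolerate the NodePool taint above
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: app
          image: example.com/image-processor:latest   # must be an arm64 or multi-arch image
          resources:
            requests:
              cpu: 500m
              memory: 512Mi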
Cost Savings with this Approach
Let me share the real impact from our production environment. Before Karpenter, we were running a mix of on-demand instances across multiple node groups — about 15-20 m5.large and c5.xlarge instances running 24/7.
After switching to Karpenter with about 70% spot instance usage, our monthly compute costs dropped by 70%. That’s a significant reduction that freed up substantial budget for new features and infrastructure improvements.
The ARM64 Benefits
The ARM64 instances provided an additional bonus. Beyond the ~20% cost savings compared to equivalent x86 instances, I saw better performance on some of our CPU-intensive workloads. Our image processing service, for example, ran about 15% faster on Graviton instances.
Here’s what I typically see across our clusters:
- 70% reduction in compute costs
- Sub-20-second scaling instead of minutes
- Better resource utilization — no more paying for idle CPU cycles
The performance monitoring backed this up: average CPU utilization went from about 25% on our fixed nodes to around 70% with Karpenter’s right-sizing.
Production Lessons Learned
Security: IRSA Is Your Friend
Getting the security model right was crucial. I use IAM Roles for Service Accounts (IRSA) everywhere, which means no more hardcoded AWS credentials floating around in our cluster:
resource "aws_iam_role" "karpenter_controller" {
name = "KarpenterController-${var.cluster_name}"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRoleWithWebIdentity"
Effect = "Allow"
Principal = {
Federated = var.oidc_provider_arn
}
Condition = {
StringEquals = {
"${replace(var.oidc_provider_url, "https://", "")}:sub" = "system:serviceaccount:karpenter:karpenter"
}
}
}
]
})
}
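The Kubernetes half of that trust relationship is just an annotation on the service account. The Helm chart sets it via the serviceAccount.annotations value shown earlier, but spelled out as a manifest it looks roughly like this (the account ID and role name are placeholders):
apiVersion: v1
kind: ServiceAccount
metadata:
  name: karpenter            # matches the system:serviceaccount:karpenter:karpenter subject above
  namespace: karpenter
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/KarpenterController-my-cluster   # placeholder account and role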
Monitoring
After some trial and error, I settled on monitoring these key metrics. Too many dashboards become noise, but these four tell me everything I need to know about our cluster performance:
- karpenter_nodes_created_total: Are nodes scaling up when they should be?
- karpenter_nodes_terminated_total: Are nodes scaling down appropriately?
- karpenter_pods_state: Are pods getting scheduled quickly?
- karpenter_node_utilization: Are instances being right-sized?
The node utilization metric was particularly eye-opening. We went from ~25% average utilization up to 70% after Karpenter started right-sizing our instances.
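If you’re running the Prometheus Operator, a PodMonitor is one way to collect those metrics. This is only a sketch: the label selector and port name are assumptions based on the Helm chart defaults, so verify them against your deployed pods:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: karpenter
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter   # label applied by the Karpenter Helm chart (check your pods)
  podMetricsEndpoints:
    - port: http-metrics                  # assumed metrics port name; confirm for your chart version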
CI/CD
My deployment pipeline is pretty straightforward. I use GitHub Actions with OIDC authentication to AWS (no more access keys in secrets!). The workflow supports feature branches for development, staging for testing, and main for production:
name: Terraform Deploy

on:
  push:
    branches:
      - 'feature/**'
      - 'staging'
      - 'main'
  workflow_dispatch:

permissions:
  id-token: write   # Required for OIDC JWT
  contents: read    # Required for checkout

jobs:
  deploy:
    environment: ${{ (github.ref == 'refs/heads/main' && 'prod') || (github.ref == 'refs/heads/staging' && 'staging') || 'dev' }}
    runs-on: ubuntu-22.04
    defaults:
      run:
        working-directory: root/
    steps:
      - name: Clone repo
        uses: actions/checkout@v3

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-region: us-east-1
          role-to-assume: ${{ vars.IAM_Role }}
          role-session-name: gha-assignment-session

      - name: Initialize Terraform
        run: terraform init -backend-config="bucket=${{ vars.TF_STATE_BUCKET }}"

      - name: Terraform Format Check
        run: terraform fmt -check -recursive

      - name: Terraform Plan
        run: terraform plan -input=false -var-file=${{ vars.STAGE }}.tfvars

      - name: Terraform Apply
        run: terraform apply -auto-approve -input=false -var-file=${{ vars.STAGE }}.tfvars
What I like about this setup is the dynamic environment selection and the use of GitHub variables for different stages. The terraform fmt -check step ensures code consistency across our team.
Summary
Look, I’m not going to pretend this was all smooth sailing. I had some initial hiccups with spot instance interruptions (protip: make sure your applications handle SIGTERM gracefully), and getting the ARM64 workloads right took some iteration.
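For reference, the SIGTERM piece mostly comes down to the pod spec. A rough sketch, with a placeholder image and timings you’d tune to your app:
apiVersion: v1
kind: Pod
metadata:
  name: graceful-shutdown-demo        # placeholder
spec:
  terminationGracePeriodSeconds: 60   # time allowed between SIGTERM and SIGKILL
  containers:
    - name: app
      image: nginx:stable             # stand-in image
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]   # small buffer so in-flight requests drain first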
But the results speak for themselves. We’re running a more resilient, cost-effective infrastructure that scales intelligently. The substantial cost savings alone paid for the engineering time I spent on this migration within the first month.
If you’re dealing with unpredictable workloads, rising AWS bills, or just want to try something that feels like “the future of Kubernetes,” I’d definitely recommend giving Karpenter a shot. Start with a development cluster, get comfortable with the concepts, and then gradually migrate your production workloads.
The complete implementation is available on GitHub. I’ve included all the Terraform modules, testing scripts, and a troubleshooting guide based on the issues I ran into. Feel free to use it as a starting point for your own implementation — and let me know how it goes!