How I Cut AWS Compute Costs by 70% with a Multi-Arch EKS Cluster and Karpenter | HackerNoon

News Room · Published 15 August 2025

After watching our monthly AWS bill climb steadily as our Kubernetes workloads grew, I knew we needed a smarter approach to node provisioning. Traditional cluster autoscaling was burning through our budget with over-provisioned nodes sitting idle, while still taking forever to scale up when we needed capacity.

I’d heard of Karpenter, but I hadn’t tried it. A particularly painful scaling incident (more on that below) finally pushed me to give it a shot. In this guide, I’ll share how I built a production-ready, multi-architecture EKS cluster that cut our compute costs by 70% while making scaling nearly instantaneous.

The complete source code is available on GitHub.

  • Switched from Cluster Autoscaler to Karpenter
  • Cut compute costs by 70% using spot + Graviton
  • Reduced pod scheduling latency from 3 mins to 20 seconds
  • Built multi-arch EKS cluster with AMD64 + ARM64
  • Full Terraform + CI/CD pipeline on GitHub

Ditching Traditional Autoscaling

Here’s what I did not like about the standard Kubernetes Cluster Autoscaler: I’d have pods sitting in “Pending” state for 2-3 minutes waiting for new nodes, while paying for a bunch of over-provisioned m5.large instances that were barely hitting 20% CPU utilization most of the time.

I remember one particularly frustrating incident where a traffic spike hit our app at 2 AM. The cluster autoscaler took over 4 minutes to provision new nodes, and by then our users were getting timeouts. That’s when I started seriously looking at Karpenter.

What makes Karpenter different is that it doesn’t think in terms of fixed node groups. Instead, it looks at your pending pods and says, “OK, you need 2 vCPUs and 4GB RAM? Let me find the cheapest spot instance that fits those requirements.” It’s like having a really smart provisioning assistant that actually understands your workload patterns.
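
To make that concrete, here’s a minimal, hypothetical Deployment; the resources.requests block is exactly what Karpenter sums across pending pods when deciding which instance type to launch (the image and names are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: my-registry/api:latest   # placeholder image
        resources:
          requests:
            cpu: "2"        # Karpenter aggregates these requests across pending pods...
            memory: 4Gi     # ...and launches the cheapest instance type that satisfies them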

After implementing it, I saw our node provisioning times drop from 3+ minutes to under 20 seconds. And the cost savings — about a 70% reduction in our compute bill, which freed up budget for other infrastructure improvements.

Our Multi-Architecture Setup

One thing that impressed me about Karpenter was how easily it handled our mixed workload requirements. I had some legacy PHP applications that needed x86 instances, but I also wanted to experiment with ARM64 Graviton instances for our newer microservices.

The architecture I ended up with looks like this:

┌─────────────────────────────────────────────────────────────┐
│                     EKS Cluster                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   AMD64     │  │    ARM64    │  │  Karpenter  │          │
│  │  NodePool   │  │  NodePool   │  │ Controller  │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Spot Instance Handling                 │    │
│  │    SQS Queue + EventBridge Rules                    │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘

The architecture consists of several key components:

  • VPC Infrastructure: Getting the networking right was crucial. I spent some time getting the subnet tagging perfect — those karpenter.sh/discovery tags aren’t just nice-to-have, they’re essential for Karpenter to find your subnets. I forgot the karpenter.sh/discovery tag once — took an hour to debug why it wouldn’t launch nodes.
  • EKS Cluster: The OIDC provider integration was probably the trickiest part to get right initially. It’s what allows Karpenter to securely talk to AWS APIs without hardcoded credentials.
  • Karpenter Controller: This is where the magic happens. Once deployed, it constantly watches for unscheduled pods and intelligently provisions the right instances.
  • Multi-Architecture Support: Having separate node pools for AMD64 and ARM64 lets us run different workloads on the most cost-effective hardware.
  • Spot Instance Management: The interruption handling was something I was initially worried about, but AWS’s SQS integration makes it surprisingly robust.

The Terraform Structure That Actually Works

My first attempt at structuring this was a mess. I had everything in one giant main.tf file, and making changes was terrifying. After some refactoring (and a few late nights), I landed on this modular approach:

├── eks-module/            # EKS cluster creation
├── karpenter/             # Karpenter autoscaler setup
├── vpc/                   # VPC infrastructure
├── root/                  # Main Terraform execution
└── .github/workflows/     # CI/CD pipeline
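
The root module just wires the others together. A rough sketch of what that composition looks like (the module outputs and variable names here are assumptions, not the exact ones in the repo):

# root/main.tf -- illustrative wiring only
module "vpc" {
  source         = "../vpc"
  project_name   = var.project_name
  vpc_cidr_block = var.vpc_cidr_block
}

module "eks" {
  source             = "../eks-module"
  cluster_name       = var.project_name
  private_subnet_ids = module.vpc.private_subnet_ids
}

module "karpenter" {
  source            = "../karpenter"
  cluster_name      = module.eks.cluster_name
  namespace         = "karpenter"
  oidc_provider_arn = module.eks.oidc_provider_arn
}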

VPC: Getting the Networking Foundation Right

The VPC setup took me a few iterations to get right. The key insight was that the tagging strategy isn’t just documentation — it’s functional.

Here’s what I learned works:

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr_block
  enable_dns_support   = true
  enable_dns_hostnames = true
  
  tags = {
    Name                                        = "${var.project_name}-vpc"
    "kubernetes.io/cluster/${var.project_name}" = "shared"
  }
}

resource "aws_subnet" "private_subnets" {
  # ... configuration
  tags = {
    Name                              = "private-subnet-${count.index}"
    "karpenter.sh/discovery"          = var.project_name
    "kubernetes.io/role/internal-elb" = "1"
  }
}

EKS and Karpenter: The Heart of the System

The EKS setup was straightforward, but getting the OIDC provider right was critical. This is what allows Karpenter to assume IAM roles securely:

resource "aws_eks_cluster" "project" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster_role.arn

  vpc_config {
    subnet_ids              = var.private_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = true
  }

  access_config {
    authentication_mode = "API"
  }
}

resource "aws_iam_openid_connect_provider" "eks" {
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.eks.certificates[0].sha1_fingerprint]
  url             = aws_eks_cluster.project.identity[0].oidc[0].issuer
}
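
The thumbprint above comes from a tls_certificate data source that isn’t shown in the snippet; it looks roughly like this:

# Fetch the OIDC issuer's certificate so its thumbprint can be pinned on the provider
data "tls_certificate" "eks" {
  url = aws_eks_cluster.project.identity[0].oidc[0].issuer
}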

For the Karpenter deployment, I went with the Helm chart approach. One thing I learned is that you need to install the CRDs first, then the main chart. The interruption queue setup was crucial — it’s what prevents your workloads from getting abruptly terminated when spot instances are reclaimed:

# Install CRDs first
resource "helm_release" "karpenter_crd" {
  name       = "karpenter-crd"
  chart      = "oci://public.ecr.aws/karpenter/karpenter-crd"
  version    = var.karpenter_version
  namespace  = var.namespace
}

# Then install the main Karpenter chart
resource "helm_release" "karpenter" {
  depends_on = [helm_release.karpenter_crd]
  
  name       = "karpenter"
  chart      = "oci://public.ecr.aws/karpenter/karpenter"
  version    = var.karpenter_version
  namespace  = var.namespace

  set {
    name  = "settings.clusterName"
    value = var.cluster_name
  }
  
  set {
    # on charts older than v0.33 this key is settings.aws.interruptionQueueName
    name  = "settings.interruptionQueue"
    value = aws_sqs_queue.karpenter_interruption_queue.name
  }
  
  set {
    name  = "serviceAccount.annotations.eks\.amazonaws\.com/role-arn"
    value = var.controller_role_arn
  }
}
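
The interruption queue the chart points at is just an SQS queue fed by EventBridge rules for the relevant EC2 events. A trimmed sketch (the queue policy and the rules for rebalance recommendations and instance state changes are omitted, but they follow the same pattern):

resource "aws_sqs_queue" "karpenter_interruption_queue" {
  name                      = var.cluster_name
  message_retention_seconds = 300
}

# Route spot interruption warnings into the queue so Karpenter can drain ahead of reclaim
resource "aws_cloudwatch_event_rule" "spot_interruption" {
  name = "${var.cluster_name}-spot-interruption"
  event_pattern = jsonencode({
    source        = ["aws.ec2"]
    "detail-type" = ["EC2 Spot Instance Interruption Warning"]
  })
}

resource "aws_cloudwatch_event_target" "spot_interruption" {
  rule = aws_cloudwatch_event_rule.spot_interruption.name
  arn  = aws_sqs_queue.karpenter_interruption_queue.arn
}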

Multi-Architecture NodePools

Initially, I was skeptical about ARM64. “Will our applications even work?” But after some testing, I found that our Node.js and Python services ran perfectly on Graviton instances — often with better performance per dollar.

Here’s how I set up the node pools:

AMD64 NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: amd64-nodepool
spec:
  template:
    spec:
      nodeClassRef:                 # required in karpenter.sh/v1; name assumed
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: amd64-nodeclass
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["m5.large", "m5.xlarge", "c5.large", "c5.xlarge"]
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 1m

ARM64 NodePool: The Graviton Experiment

The ARM64 setup requires a bit more thought. I added taints to prevent incompatible workloads from accidentally landing on ARM64 nodes:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: arm64-nodepool
spec:
  template:
    spec:
      nodeClassRef:                 # required in karpenter.sh/v1; name assumed
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: arm64-nodeclass
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["arm64"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["t4g.medium", "c6g.large", "m6g.large"]
      taints:
      - key: arm64
        value: "true"
        effect: NoSchedule
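
Workloads that are ready for Graviton then opt in explicitly. A minimal, hypothetical pod template fragment (the image must be multi-arch or arm64):

spec:
  nodeSelector:
    kubernetes.io/arch: arm64       # steer the pod to the ARM64 NodePool
  tolerations:
  - key: arm64                      # tolerate the taint defined on the NodePool above
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: my-registry/app:latest   # placeholder image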

I learned that specifying exact AMI IDs gives you more control over what gets deployed, especially when testing different ARM64 configurations. You can also use nodeSelector requirements to target specific architectures without needing taints if your applications are architecture-aware.
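
That AMI pinning lives in the EC2NodeClass that each NodePool references, alongside the subnet and security group discovery tags set up in the VPC module. A sketch with placeholder values (the AMI ID, tag value, and role name are illustrative):

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: arm64-nodeclass
spec:
  amiSelectorTerms:
  - id: ami-0123456789abcdef0            # placeholder: pin an exact ARM64 AMI
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-project # matches the subnet tags from the VPC module
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-project
  role: KarpenterNodeRole-my-project     # placeholder node IAM role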

Cost Savings with this Approach

Let me share the real impact from our production environment. Before Karpenter, we were running a mix of on-demand instances across multiple node groups — about 15-20 m5.large and c5.xlarge instances running 24/7.

After switching to Karpenter with about 70% spot instance usage, our monthly compute costs dropped by 70%. That’s a significant reduction that freed up substantial budget for new features and infrastructure improvements.

The ARM64 Benefits

The ARM64 instances provided an additional bonus. Beyond the ~20% cost savings compared to equivalent x86 instances, I saw better performance on some of our CPU-intensive workloads. Our image processing service, for example, ran about 15% faster on Graviton instances.

Here’s what I typically see across our clusters:

  • 70% reduction in compute costs
  • Sub-20-second scaling instead of minutes
  • Better resource utilization — no more paying for idle CPU cycles

The performance monitoring showed we went from average 25% CPU utilization on our fixed nodes up to 70% utilization with Karpenter’s right-sizing.

Production Lessons Learned

Security: IRSA Is Your Friend

Getting the security model right was crucial. I use IAM Roles for Service Accounts (IRSA) everywhere, which means no more hardcoded AWS credentials floating around in our cluster:

resource "aws_iam_role" "karpenter_controller" {
  name = "KarpenterController-${var.cluster_name}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRoleWithWebIdentity"
        Effect = "Allow"
        Principal = {
          Federated = var.oidc_provider_arn
        }
        Condition = {
          StringEquals = {
            "${replace(var.oidc_provider_url, "https://", "")}:sub" = "system:serviceaccount:karpenter:karpenter"
          }
        }
      }
    ]
  })
}

Monitoring

After some trial and error, I settled on monitoring these key metrics. Too many dashboards become noise, but these four tell me everything I need to know about our cluster performance:

  • karpenter_nodes_created_total: Are nodes scaling up when they should be?
  • karpenter_nodes_terminated_total: Are nodes scaling down appropriately?
  • karpenter_pods_state: Are pods getting scheduled quickly?
  • karpenter_node_utilization: Are instances being right-sized?

The node utilization metric was particularly eye-opening. We went from ~25% average utilization up to 70% after Karpenter started right-sizing our instances.
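
A simple alert on pods stuck in Pending catches the worst failure mode. A minimal sketch, assuming the Prometheus Operator and kube-state-metrics are installed in the cluster (thresholds are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-scaling-alerts
spec:
  groups:
  - name: karpenter
    rules:
    - alert: PodsStuckPending
      expr: sum(kube_pod_status_phase{phase="Pending"}) > 0
      for: 5m                       # well beyond the ~20-second provisioning we normally see
      labels:
        severity: warning
      annotations:
        summary: "Pods pending for 5m -- check Karpenter NodePool limits and events"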

CI/CD

My deployment pipeline is pretty straightforward. I use GitHub Actions with OIDC authentication to AWS (no more access keys in secrets!). The workflow supports feature branches for development, staging for testing, and main for production:

name: Terraform Deploy

on: 
  push:
    branches:
    - 'feature/**'
    - 'staging'
    - 'main'
  workflow_dispatch:

permissions:
  id-token: write # Required for OIDC JWT
  contents: read  # Required for checkout

jobs: 
  deploy:
    environment: ${{ (github.ref == 'refs/heads/main' && 'prod') || (github.ref == 'refs/heads/staging' && 'staging') || 'dev' }}
    runs-on: ubuntu-22.04
    
    defaults:
      run:
        working-directory: root/
        
    steps:
    - name: Clone repo
      uses: actions/checkout@v3
    
    - name: Configure AWS Credentials
      uses: aws-actions/configure-aws-credentials@v4
      with: 
        aws-region: us-east-1
        role-to-assume: ${{ vars.IAM_Role }}
        role-session-name: gha-assignment-session

    - name: Initialize Terraform
      run: terraform init -backend-config="bucket=${{ vars.TF_STATE_BUCKET }}"

    - name: Terraform Format Check
      run: terraform fmt -check -recursive

    - name: Terraform Plan
      run: terraform plan -input=false -var-file=${{ vars.STAGE }}.tfvars
    
    - name: Terraform Apply
      run: terraform apply -auto-approve -input=false -var-file=${{ vars.STAGE }}.tfvars 

What I like about this setup is the dynamic environment selection and the use of GitHub variables for different stages. The terraform fmt -check step ensures code consistency across our team.

Summary

Look, I’m not going to pretend this was all smooth sailing. I had some initial hiccups with spot instance interruptions (protip: make sure your applications handle SIGTERM gracefully), and getting the ARM64 workloads right took some iteration.
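
On the SIGTERM point, most of the work is in the application itself, but the pod spec needs to give it room to drain. A minimal sketch (the numbers and image are illustrative):

spec:
  terminationGracePeriodSeconds: 60   # must finish well inside the 120-second spot notice
  containers:
  - name: app
    image: my-registry/app:latest     # placeholder; the app must catch SIGTERM and drain
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "10"]    # let endpoint removal propagate before SIGTERM arrives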

But the results speak for themselves. We’re running a more resilient, cost-effective infrastructure that scales intelligently. The substantial cost savings alone paid for the engineering time I spent on this migration within the first month.

If you’re dealing with unpredictable workloads, rising AWS bills, or just want to try something that feels like “the future of Kubernetes,” I’d definitely recommend giving Karpenter a shot. Start with a development cluster, get comfortable with the concepts, and then gradually migrate your production workloads.

The complete implementation is available on GitHub. I’ve included all the Terraform modules, testing scripts, and a troubleshooting guide based on the issues I ran into. Feel free to use it as a starting point for your own implementation — and let me know how it goes!
