Amazon EKS Enables Ultra-Scale AI/ML Workloads with Support for 100K Nodes per Cluster

News Room | Published 4 September 2025 | Last updated 4 September 2025, 3:33 AM

Amazon Web Services has announced a significant breakthrough in container orchestration with Amazon Elastic Kubernetes Service (EKS) now supporting clusters with up to 100,000 nodes, a 10x increase from previous limits. This enhancement enables unprecedented scale for artificial intelligence and machine learning workloads, potentially supporting up to 1.6 million AWS Trainium chips or 800,000 NVIDIA GPUs in a single Kubernetes cluster.
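
Those headline figures follow directly from per-node accelerator counts; a quick back-of-the-envelope check, assuming AWS's densest accelerated instance types (neither is named explicitly in the paragraph above):

```python
# Back-of-the-envelope check of the announced accelerator totals.
# The instance types are assumptions for illustration only.
MAX_NODES = 100_000

TRAINIUM_CHIPS_PER_NODE = 16   # e.g. a Trn2 instance (trn2.48xlarge) carries 16 Trainium2 chips
GPUS_PER_NODE = 8              # e.g. a P5 instance (p5.48xlarge) carries 8 NVIDIA H100 GPUs

print(f"Trainium chips: {MAX_NODES * TRAINIUM_CHIPS_PER_NODE:,}")  # 1,600,000
print(f"NVIDIA GPUs:    {MAX_NODES * GPUS_PER_NODE:,}")            # 800,000
```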

The most advanced AI models, with trillions of parameters, demonstrate significantly superior capabilities in context understanding, reasoning, and solving complex tasks. However, developing and managing these increasingly powerful models requires access to massive numbers of compute accelerators within a single cluster. Partitioning training jobs across separate clusters can lower utilization through capacity fragmentation and remapping delays, making single large-scale clusters essential for optimal performance.

Running these jobs within a single cluster offers several key benefits. First, it lowers compute costs by driving up utilization through a shared capacity pool that serves heterogeneous jobs, from large pre-training runs to fine-tuning experiments and batch inference. Additionally, centralized operations such as scheduling, discovery, and repair are significantly simpler than in split-cluster deployments.

AWS achieved this 100K-node capability through several architectural breakthroughs, fundamentally re-engineering the core components of Kubernetes clusters while maintaining full Kubernetes conformance.

The most significant innovation lies in the complete overhaul of etcd, Kubernetes’ core data store. Through a foundational change, Amazon EKS has offloaded etcd’s consensus backend from its Raft-based implementation to journal, an internal component AWS has been building for more than a decade. This journal system provides ultra-fast, ordered data replication with multi-Availability Zone durability.

AWS also moved etcd’s backend database completely to in-memory storage using tmpfs, providing order-of-magnitude performance wins in the form of higher read/write throughput, predictable latencies, and faster maintenance operations. The maximum supported database size has been doubled to 20 GB while maintaining low mean-time-to-recovery during failures.

The engineering team implemented extensive tuning of API servers and critical webhooks, carefully optimizing configurations such as request timeouts, retry strategies, work parallelism, and throttling rules. Kubernetes v1.31 introduced strongly consistent reads from cache, which allowed a large portion of read traffic to be offloaded from etcd to the API server, cutting server-side CPU usage by 30% and making list requests up to three times faster.
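
The practical effect is that list-heavy clients are served from the API server's watch cache rather than etcd. A minimal sketch of the two list paths from a client's perspective, assuming kubeconfig access to a v1.31+ cluster (this is not AWS's benchmarking code):

```python
# Sketch: comparing list-request paths against the API server. With consistent
# reads from cache (Kubernetes v1.31+), even the default "consistent" list can
# be served from the API server's watch cache instead of a quorum read on etcd.
import time
from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

def timed_list(**kwargs):
    start = time.monotonic()
    nodes = v1.list_node(**kwargs)
    return len(nodes.items), time.monotonic() - start

# Default list: strongly consistent; before v1.31 this always required an etcd quorum read.
count, latency = timed_list()
print(f"consistent list: {count} nodes in {latency:.3f}s")

# resourceVersion="0": explicitly allows serving from the watch cache (may be slightly stale).
count, latency = timed_list(resource_version="0")
print(f"cache-ok list:   {count} nodes in {latency:.3f}s")
```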

Controllers operating at cluster scope received significant improvements to minimize lock contention and enable batch processing of events. The Kubernetes scheduler consistently delivered throughput of up to 500 pods/second even at 100K-node scale; this was achieved by tailoring scheduler plugins to the workload and optimizing node filtering and scoring parameters.
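
Scheduling throughput of this kind can be estimated from outside the scheduler by looking at when pods gained their PodScheduled condition; a rough sketch, assuming kubeconfig access, with one-second timestamp resolution limiting its precision (not the methodology AWS used):

```python
# Sketch: estimating scheduler throughput from PodScheduled condition timestamps.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

scheduled_per_second = Counter()
# limit=5000 keeps the sketch to one page of results; a real survey would paginate.
for pod in v1.list_pod_for_all_namespaces(limit=5000).items:
    for cond in pod.status.conditions or []:
        if cond.type == "PodScheduled" and cond.status == "True":
            scheduled_per_second[cond.last_transition_time] += 1

if scheduled_per_second:
    peak_time, peak = scheduled_per_second.most_common(1)[0]
    print(f"peak observed scheduling rate: {peak} pods/s at {peak_time}")
```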

For networking, AWS configured the Amazon VPC CNI with prefix mode for address management, allowing streamlined network operations with a single VPC for 100K nodes while achieving up to a three-fold improvement in node launch rates. For accelerated workloads requiring high bandwidth, they enabled pod ENIs on additional network cards, enhancing the pod’s network bandwidth capacity (above 100 GB/s) and packet rate performance.
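
Prefix mode corresponds to the documented ENABLE_PREFIX_DELEGATION setting on the VPC CNI's aws-node DaemonSet; a minimal sketch of toggling it programmatically (applying it via a strategic merge patch from Python is just one way to do it, not necessarily how AWS configured its test clusters):

```python
# Sketch: enabling VPC CNI prefix delegation by patching the aws-node DaemonSet.
# ENABLE_PREFIX_DELEGATION is the documented CNI setting; the patch mechanism here
# is equivalent to `kubectl set env daemonset/aws-node -n kube-system ...`.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "aws-node",   # container name used by the VPC CNI DaemonSet
                    "env": [{"name": "ENABLE_PREFIX_DELEGATION", "value": "true"}],
                }]
            }
        }
    }
}

apps.patch_namespaced_daemon_set(name="aws-node", namespace="kube-system", body=patch)
print("prefix delegation enabled on aws-node")
```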

Container image management was addressed with Seekable OCI (SOCI) fast pull technology, which enables concurrent download and unpacking of large AI/ML container images, often exceeding 5 GB. Combined with parallel unpacking, testing demonstrated up to a 2x reduction in overall image download and unpack time compared to the default.

AWS conducted extensive testing simulating real-world ultra-scale AI/ML scenarios. The testing covered massive pre-training jobs running on all 100K nodes, 10 parallel fine-tuning jobs each using 10K nodes, and mixed-mode workloads combining fine-tuning and inference tasks.

[Image: AI/ML testing scenarios running on 100K nodes]

Node lifecycle testing showed Karpenter could launch 100K Amazon EC2 instances in 50 minutes, with 2,000 ready nodes joining the cluster per minute. Cluster drift operations to update all nodes to new AMIs completed in approximately 4 hours while respecting node disruption budgets.
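
The node disruption budgets mentioned here map to Karpenter's NodePool disruption budgets, which cap how many nodes may be disrupted at once during operations such as AMI drift. A minimal NodePool sketch with illustrative names and values (not the configuration AWS used in its testing):

```python
# Sketch: a Karpenter NodePool with a disruption budget, created via the
# CustomObjects API. All names, requirements, and the 10% budget are illustrative.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "training"},
    "spec": {
        "template": {
            "spec": {
                "requirements": [
                    {"key": "karpenter.sh/capacity-type", "operator": "In", "values": ["on-demand"]},
                ],
                "nodeClassRef": {"group": "karpenter.k8s.aws", "kind": "EC2NodeClass", "name": "default"},
            }
        },
        "disruption": {
            "consolidationPolicy": "WhenEmptyOrUnderutilized",
            "consolidateAfter": "30s",
            # Allow at most 10% of nodes to be disrupted at a time, e.g. during AMI drift.
            "budgets": [{"nodes": "10%"}],
        },
    },
}

custom.create_cluster_custom_object(
    group="karpenter.sh", version="v1", plural="nodepools", body=node_pool
)
```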

Performance metrics during testing were impressive: the cluster contained more than 10 million Kubernetes objects, including 100K nodes and 900K pods, and the aggregate etcd database size across partitions reached 32 GB. API latencies remained well within Kubernetes SLO targets throughout all testing scenarios.
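
Object counts at this scale can be sampled without listing every item, because paginated lists report how many items remain; a small sketch, assuming kubeconfig access:

```python
# Sketch: sampling object counts without retrieving full lists, using the
# remainingItemCount metadata the API server returns on paginated lists.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def approx_count(list_fn):
    page = list_fn(limit=1)  # one item plus metadata about the rest
    remaining = page.metadata.remaining_item_count or 0
    return len(page.items) + remaining

print(f"nodes: {approx_count(v1.list_node):,}")
print(f"pods:  {approx_count(v1.list_pod_for_all_namespaces):,}")
```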

This advancement particularly benefits organizations working on cutting-edge AI research and large-scale machine learning operations. Beyond customers consuming Amazon EKS directly, these improvements also extend to other AI/ML services, such as Amazon SageMaker HyperPod with EKS, that use EKS as their compute layer, advancing AWS’s overall ultra-scale computing capabilities.
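
For teams consuming EKS through SageMaker HyperPod, the integration surfaces at cluster-creation time, when the HyperPod cluster is pointed at an existing EKS cluster as its orchestrator. A hedged boto3 sketch, with all ARNs, names, sizes, and S3 locations as placeholders:

```python
# Sketch: creating a SageMaker HyperPod cluster orchestrated by an existing EKS
# cluster via boto3. Every identifier below is a placeholder, not a real resource.
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-west-2")

sagemaker.create_cluster(
    ClusterName="hyperpod-training",
    Orchestrator={"Eks": {"ClusterArn": "arn:aws:eks:us-west-2:123456789012:cluster/ultra-scale"}},
    VpcConfig={
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0"],
    },
    InstanceGroups=[{
        "InstanceGroupName": "accelerated-workers",
        "InstanceType": "ml.p5.48xlarge",
        "InstanceCount": 4,
        "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://my-bucket/lifecycle/",
            "OnCreate": "on_create.sh",
        },
    }],
)
```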

This announcement positions AWS significantly ahead of its major cloud competitors in terms of Kubernetes cluster scale. Google Kubernetes Engine (GKE) currently supports a maximum of 15,000 nodes per Standard cluster, with higher limits requiring special approval and specific configurations such as regional clusters with Private Service Connect. Microsoft Azure Kubernetes Service (AKS) supports up to 5,000 nodes per cluster with Virtual Machine Scale Sets, though this limit may require contacting support for clusters approaching the upper boundary.

AWS’s 100,000-node capability represents a 6.7x improvement over GKE’s standard limits and a 20x improvement over AKS’s maximum, establishing a substantial competitive advantage for organizations requiring massive-scale AI/ML infrastructure. This gap becomes even more pronounced when considering that competitors’ higher limits often come with additional restrictions or require special approval processes, while AWS’s ultra-scale clusters are designed as a standard offering with full Kubernetes conformance.
