Amazon Web Services has announced a significant breakthrough in container orchestration with Amazon Elastic Kubernetes Service (EKS) now supporting clusters with up to 100,000 nodes, a 10x increase from previous limits. This enhancement enables unprecedented scale for artificial intelligence and machine learning workloads, potentially supporting up to 1.6 million AWS Trainium chips or 800,000 NVIDIA GPUs in a single Kubernetes cluster.
The most advanced AI models, with trillions of parameters, demonstrate significantly superior capabilities in context understanding, reasoning, and solving complex tasks. However, developing and managing these increasingly powerful models requires access to massive numbers of compute accelerators within a single cluster. Partitioning training jobs across separate clusters can lower utilization through capacity fragmentation and remapping delays, making single large-scale clusters essential for optimal performance.
Running these workloads within a single cluster offers key benefits. First, it lowers compute costs by driving up utilization through a shared capacity pool serving heterogeneous jobs, from large pre-training runs to fine-tuning experiments and batch inference. Second, centralized operations such as scheduling, discovery, and repair are significantly simpler than managing split-cluster deployments.
AWS achieved this 100K-node capability through several architectural breakthroughs, fundamentally re-engineering core components of the Kubernetes cluster while maintaining full Kubernetes conformance.
The most significant innovation lies in the complete overhaul of etcd, Kubernetes' core data store. Through a foundational change, Amazon EKS has offloaded etcd's consensus backend from a Raft-based implementation to journal, an internal component AWS has been building for more than a decade. The journal system provides ultra-fast, ordered data replication with multi-Availability Zone durability.
AWS also moved etcd's backend database entirely to in-memory storage using tmpfs, yielding order-of-magnitude performance wins: higher read/write throughput, predictable latencies, and faster maintenance operations. The maximum supported database size has been doubled to 20 GB while maintaining a low mean time to recovery during failures.
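The win from tmpfs is easy to see with a micro-benchmark, since etcd's write path is dominated by appending to its write-ahead log and calling fsync. The sketch below is illustrative only, not AWS's implementation; the directories are assumptions (on many distros /tmp is itself tmpfs, so a disk-backed path such as /var/tmp is used for contrast, and /dev/shm stands in for a tmpfs mount).

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"
)

// fsyncLatency measures the average time to append a 4 KiB block and
// fsync it — the pattern that dominates etcd's WAL write path.
func fsyncLatency(dir string, writes int) (time.Duration, error) {
	f, err := os.CreateTemp(dir, "wal-*")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	buf := make([]byte, 4096)
	start := time.Now()
	for i := 0; i < writes; i++ {
		if _, err := f.Write(buf); err != nil {
			return 0, err
		}
		// fsync forces the data to the backing store; on tmpfs this
		// never touches a disk, which is where the speedup comes from.
		if err := f.Sync(); err != nil {
			return 0, err
		}
	}
	return time.Since(start) / time.Duration(writes), nil
}

func main() {
	// Assumed mounts: /var/tmp is disk-backed, /dev/shm is tmpfs.
	// Adjust for your system before drawing conclusions.
	for _, dir := range []string{"/var/tmp", "/dev/shm"} {
		avg, err := fsyncLatency(dir, 1000)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%-10s avg fsync latency: %v\n", dir, avg)
	}
}
```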
The engineering team extensively tuned the API servers and critical webhooks, carefully optimizing configurations such as request timeouts, retry strategies, work parallelism, and throttling rules. Kubernetes v1.31 introduced strongly consistent reads from cache, which allowed a large portion of read traffic to be offloaded from etcd to the API server, cutting server-side CPU usage by 30% and speeding up list requests threefold.
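From a client's point of view nothing changes: a strongly consistent list is still requested by leaving resourceVersion unset, and the API server (with the ConsistentListFromCache feature, beta in v1.31) decides whether it can serve that read from its watch cache instead of etcd. A minimal client-go sketch of the two read modes, assuming a standard kubeconfig:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// Strongly consistent list: resourceVersion unset. Before v1.31 this
	// always triggered a quorum read against etcd; with consistent reads
	// from cache, the API server can answer from its watch cache.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx,
		metav1.ListOptions{ResourceVersion: ""})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("consistent list: %d pods\n", len(pods.Items))

	// Eventually consistent list: resourceVersion "0" has always been
	// served from the watch cache, but may return slightly stale data.
	stale, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx,
		metav1.ListOptions{ResourceVersion: "0"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("cached list: %d pods\n", len(stale.Items))
}
```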
Controllers operating at cluster scope received significant improvements to minimize lock contention and enable batch processing of events. By tailoring scheduler plugins to the workload and tuning node filtering and scoring parameters, the Kubernetes scheduler sustained throughput of up to 500 pods per second even at 100K-node scale.
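AWS hasn't published its exact plugin tuning, but the shape of such a change can be sketched with the upstream scheduler configuration API: raise scheduling parallelism, score only a small sample of the 100K feasible nodes, and disable score plugins the workload doesn't need. The profile name, parallelism, sampling percentage, and disabled plugins below are illustrative assumptions, not AWS's settings.

```go
package main

import (
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	schedv1 "k8s.io/kube-scheduler/config/v1"
	"k8s.io/utils/ptr"
	"sigs.k8s.io/yaml"
)

func main() {
	cfg := schedv1.KubeSchedulerConfiguration{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "kubescheduler.config.k8s.io/v1",
			Kind:       "KubeSchedulerConfiguration",
		},
		// More goroutines for filtering/scoring (the default is 16).
		Parallelism: ptr.To[int32](32),
		Profiles: []schedv1.KubeSchedulerProfile{{
			SchedulerName: ptr.To("high-throughput"),
			// Score only 5% of feasible nodes instead of all of them.
			PercentageOfNodesToScore: ptr.To[int32](5),
			Plugins: &schedv1.Plugins{
				Score: schedv1.PluginSet{
					// Skip scoring work a homogeneous batch workload
					// doesn't benefit from (illustrative choice).
					Disabled: []schedv1.Plugin{
						{Name: "InterPodAffinity"},
						{Name: "TaintToleration"},
					},
				},
			},
		}},
	}

	out, err := yaml.Marshal(cfg)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(string(out)) // feed this to kube-scheduler via --config
}
```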
For networking, AWS configured the Amazon VPC CNI in prefix mode for address management, allowing streamlined network operations with a single VPC for 100K nodes while achieving up to a threefold improvement in node launch rates. For accelerated workloads requiring high bandwidth, they enabled pod ENIs on additional network cards, raising per-pod network bandwidth (above 100 Gbps) and packet rate performance.
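Prefix mode is a documented VPC CNI setting: with the ENABLE_PREFIX_DELEGATION environment variable set on the aws-node DaemonSet, the CNI attaches /28 IPv4 prefixes to ENIs instead of individual addresses, so far fewer EC2 API calls are needed per node. A client-go sketch of flipping that switch (in practice this is usually done via kubectl or the EKS add-on configuration):

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Strategic merge patch: env entries merge by name, so this only
	// adds/updates ENABLE_PREFIX_DELEGATION on the aws-node container.
	patch := []byte(`{"spec":{"template":{"spec":{"containers":[
	  {"name":"aws-node","env":[
	    {"name":"ENABLE_PREFIX_DELEGATION","value":"true"}
	  ]}
	]}}}}`)

	_, err = client.AppsV1().DaemonSets("kube-system").Patch(
		context.Background(), "aws-node",
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("prefix delegation enabled on aws-node")
}
```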
Container image management received attention through Seekable OCI (SOCI) fast pull technology, which downloads and unpacks large AI/ML container images, often exceeding 5 GB, concurrently. Combined with parallel unpacking, testing demonstrated up to a 2x reduction in overall image download and unpack time compared to the default pull behavior.
AWS conducted extensive testing simulating real-world ultra-scale AI/ML scenarios. The testing covered massive pre-training jobs running on all 100K nodes, 10 parallel fine-tuning jobs each using 10K nodes, and mixed-mode workloads combining fine-tuning and inference tasks.
Figure: AI/ML testing scenarios running on 100K nodes
Node lifecycle testing showed Karpenter launching 100K Amazon EC2 instances in 50 minutes, with 2,000 ready nodes joining the cluster per minute. Cluster drift operations to roll all nodes onto new AMIs completed in approximately 4 hours while respecting node disruption budgets.
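Those disruption budgets are part of Karpenter's NodePool API (spec.disruption.budgets): a budget of nodes: "10%", for example, caps how many nodes Karpenter may drift-replace at once. A sketch using the dynamic client to set such a budget; the NodePool name "default" and the 10% figure are assumptions for illustration:

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// NodePools are cluster-scoped custom resources installed by Karpenter.
	nodePools := schema.GroupVersionResource{
		Group: "karpenter.sh", Version: "v1", Resource: "nodepools",
	}

	// Allow at most 10% of this pool's nodes to be disrupted at any one
	// time, e.g. during an AMI drift rollout.
	patch := []byte(`{"spec":{"disruption":{"budgets":[{"nodes":"10%"}]}}}`)

	_, err = dyn.Resource(nodePools).Patch(context.Background(), "default",
		types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("disruption budget set on NodePool default")
}
```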
Performance metrics during testing were impressive: the cluster contained more than 10 million Kubernetes objects, including 100K nodes and 900K pods, and the aggregate etcd database size across partitions reached 32 GB. API latencies remained well within Kubernetes SLO targets throughout all testing scenarios.
This advancement particularly benefits organizations working on cutting-edge AI research and large-scale machine learning operations. Beyond customers who use Amazon EKS directly, these improvements also extend to AI/ML services such as Amazon SageMaker HyperPod with EKS, which rely on EKS as their compute layer, advancing AWS's overall ultra-scale computing capabilities.
This announcement positions AWS significantly ahead of its major cloud competitors in Kubernetes cluster scale. Google Kubernetes Engine (GKE) currently supports a maximum of 15,000 nodes per Standard cluster, with higher limits requiring special approval and specific configurations such as regional clusters with Private Service Connect. Microsoft Azure Kubernetes Service (AKS) supports up to 5,000 nodes per cluster with Virtual Machine Scale Sets, and clusters approaching that limit may require engaging support.
AWS’s 100,000-node capability represents a 6.7x improvement over GKE’s standard limits and a 20x improvement over AKS’s maximum, establishing a substantial competitive advantage for organizations requiring massive-scale AI/ML infrastructure. This gap becomes even more pronounced when considering that competitors’ higher limits often come with additional restrictions or require special approval processes, while AWS’s ultra-scale clusters are designed as a standard offering with full Kubernetes conformance.