Running Ray At Scale On AKS

The Azure Kubernetes Service (AKS) team at Microsoft has shared guidance for running Anyscale’s managed Ray service at scale. They focus on three key issues: GPU capacity limits, scattered ML storage, and problems with credential expiry.

This post builds on a previous overview of open-source KubeRay on AKS and turns to Anyscale’s improved runtime, previously known as RayTurbo. The runtime offers smart autoscaling, improved monitoring, and fault-tolerant training features, all built on the open-source Ray framework.

Ray is a Python-native distributed compute framework designed to scale AI and ML workloads from a single laptop to clusters spanning thousands of nodes. Anyscale’s managed platform enhances Ray with features for production use. The new guidance shows a partnership between Microsoft and Anyscale to improve Azure integration.

GPU scarcity is one of the most significant operational challenges in large-scale ML. High-demand accelerators, such as NVIDIA GPUs, often have quota and availability issues in Azure regions. This can delay cluster setup and job scheduling.

Microsoft’s proposed solution uses a multi-cluster, multi-region setup. Distributing Ray clusters across different AKS instances in various Azure regions allows teams to aggregate GPU quota beyond regional limits, automatically reroute workloads during outages or capacity issues, and extend the compute pool to on-premises systems or other cloud providers using Azure Arc with AKS.
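The idea of pooling quota and rerouting on failure can be sketched in a few lines of Python. This is an illustration only, not Anyscale’s scheduler: the Cluster record, the cluster and region names, and the GPU counts are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    """Hypothetical record for one registered AKS cluster."""
    name: str
    region: str
    free_gpus: int
    healthy: bool = True


def total_free_gpus(clusters: List[Cluster]) -> int:
    """Aggregate free GPU capacity across every healthy cluster."""
    return sum(c.free_gpus for c in clusters if c.healthy)


def pick_cluster(clusters: List[Cluster], gpus_needed: int) -> Optional[Cluster]:
    """Return the healthy cluster with the most free GPUs that fits the job."""
    candidates = [c for c in clusters if c.healthy and c.free_gpus >= gpus_needed]
    return max(candidates, key=lambda c: c.free_gpus, default=None)


# Made-up inventory spanning three regions.
clusters = [
    Cluster("aks-a", "eastus", free_gpus=4),
    Cluster("aks-b", "westeurope", free_gpus=16),
    Cluster("aks-c", "japaneast", free_gpus=8),
]

first_choice = pick_cluster(clusters, gpus_needed=8)

# Simulate a regional outage: the same request is rerouted elsewhere.
first_choice.healthy = False
fallback = pick_cluster(clusters, gpus_needed=8)

print(first_choice.name, "->", fallback.name)
```

A real placement policy would also weigh data locality, cost, and quota that is granted but not yet provisioned; the sketch only shows why a pool of regions can serve a request that a single region cannot.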

The Anyscale console shows these registered clusters in one view. Anyscale Workspaces schedules workloads onto available capacity, either manually or automatically. To add a new region, create a cloud_resource.yaml manifest and apply it using the Anyscale CLI. This configuration-first approach makes multi-region expansion easy to manage.

A common issue in ML operations is transferring training data, model checkpoints, and artifacts between pipeline stages. This includes moving them from pre-training to fine-tuning and then to inference. The guidance addresses this with Azure BlobFuse2, which mounts Azure Blob Storage into Ray worker pods as a POSIX-compatible filesystem.

From Ray’s perspective, the mount point is just a local directory. Tasks and actors read datasets and write checkpoints using standard file I/O. BlobFuse2 then saves data to Azure Blob Storage. This makes data available across pods and node pools. Local caching prevents GPU stalls during large training runs, and because data is decoupled from compute, Ray clusters can scale up and down without data loss.
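Because the mount behaves like a local directory, checkpoint code needs nothing Azure-specific. The sketch below assumes the mount path is supplied through a hypothetical RAY_CHECKPOINT_MOUNT environment variable and falls back to a temporary directory so it runs anywhere; the file naming scheme is also an assumption.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional

# Assumption: BlobFuse2 is mounted at the path named by this hypothetical
# variable. A temporary directory stands in for the mount when it is unset.
MOUNT = Path(os.environ.get("RAY_CHECKPOINT_MOUNT", tempfile.mkdtemp()))
MOUNT.mkdir(parents=True, exist_ok=True)


def save_checkpoint(step: int, state: dict) -> Path:
    """Write a checkpoint using ordinary file I/O."""
    path = MOUNT / f"step-{step:06d}.json"
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"step": step, "state": state}, f)
    return path


def load_latest_checkpoint() -> Optional[dict]:
    """Read the highest-numbered checkpoint, or None if there is none."""
    # Zero-padded step numbers make lexicographic order match numeric order.
    paths = sorted(MOUNT.glob("step-*.json"))
    if not paths:
        return None
    with open(paths[-1], encoding="utf-8") as f:
        return json.load(f)


save_checkpoint(1, {"loss": 0.9})
save_checkpoint(2, {"loss": 0.5})
latest = load_latest_checkpoint()
print(latest["step"])
```

The same functions could be called from inside a Ray task or actor; a worker that restarts on a different node would find the earlier checkpoints at the same path.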

To set this up, enable the Blob CSI driver when creating the cluster, then define a StorageClass that uses workload identity for authentication. Finally, create a PersistentVolumeClaim with ReadWriteMany access so that multiple Ray workers on different nodes can access shared data at the same time. This approach keeps Ray code portable while adding the durability and scalability of Azure-native storage at the infrastructure layer.
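The claim from the final step might look like the following, built here as a Python dictionary and serialized to JSON, which kubectl accepts alongside YAML. The claim name, the StorageClass name, and the 100Gi size are placeholders, and the StorageClass itself is not shown because its parameters depend on the cluster.

```python
import json


def build_pvc(name: str, storage_class: str, size: str) -> dict:
    """Build a PersistentVolumeClaim manifest with ReadWriteMany access."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets pods on different nodes mount the volume.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# All three values are hypothetical placeholders.
pvc = build_pvc(
    name="ray-shared-data",
    storage_class="blobfuse2-workload-identity",
    size="100Gi",
)

manifest_json = json.dumps(pvc, indent=2)
print(manifest_json)
```

The printed JSON can be saved to a file and applied with kubectl apply -f, after which Ray worker pods reference the claim by name in their volume definitions.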

Another important topic is authentication reliability. The Anyscale and Azure integration used to rely on CLI tokens or API keys that expired every 30 days. This meant manual rotation was needed, which risked service disruption.

The new method uses Microsoft Entra service principals and AKS workload identity. It issues short-lived tokens automatically. The Anyscale Kubernetes Operator pod uses a user-assigned managed identity. This identity requests an access token for the Anyscale service principal from Entra ID. Azure handles token refresh transparently, meaning no long-lived credentials are stored in the cluster and no manual rotation is required.
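The effect of short-lived, automatically refreshed tokens can be illustrated with a small cache that requests a new token shortly before the old one expires. This is a conceptual sketch, not the Azure or Anyscale implementation: fetch_token is a stand-in for the request to Entra ID, and the one-hour lifetime and 60-second refresh margin are made-up values.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Token:
    value: str
    expires_at: float  # seconds, on the same clock the cache uses


class RefreshingTokenCache:
    """Hand out a cached token and refresh it shortly before expiry."""

    def __init__(
        self,
        fetch: Callable[[float], Token],
        margin_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._margin_s = margin_s
        self._clock = clock
        self._token: Optional[Token] = None

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._token.expires_at - self._margin_s:
            self._token = self._fetch(now)
        return self._token.value


class FakeClock:
    """Controllable clock so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()
issued: List[float] = []


def fetch_token(now: float) -> Token:
    # Stand-in for the identity provider: issues a token valid for one hour.
    issued.append(now)
    return Token(value=f"token-{len(issued)}", expires_at=now + 3600.0)


cache = RefreshingTokenCache(fetch_token, margin_s=60.0, clock=clock)

first = cache.get()    # nothing cached yet, so a token is fetched
clock.now = 1800.0
second = cache.get()   # still well inside the lifetime, cache is reused
clock.now = 3550.0
third = cache.get()    # inside the refresh margin, a new token is fetched
```

Callers only ever ask the cache for a token, so nothing long-lived has to be stored or rotated by hand, which is the property the workload identity model provides at the platform level.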

The authors say this is especially important in multi-cluster environments, where managing credentials by hand across many clusters adds to the operational burden. The workload identity model also provides fine-grained RBAC for Azure resource access and, as a byproduct, produces full audit trails through Azure Activity Logs.

The Anyscale on AKS integration is currently in private preview. Teams wanting access should reach out to their Microsoft account team or file a request on the AKS GitHub repository, including details about their Ray workloads and target regions. Example setups and workloads, covering fine-tuning with DeepSpeed and LLaMA-Factory as well as LLM inference endpoints, are available in the Azure-Samples/aks-anyscale repository on GitHub.

Microsoft is not alone in making this bet. AWS announced its Anyscale partnership at Ray Summit 2024, connecting EKS clusters to the RayTurbo runtime. The AWS integration highlights hardware flexibility by combining NVIDIA GPUs with AWS’s Trainium and Inferentia accelerators. Additionally, SageMaker HyperPod is now a deployment target for long-running training jobs that need node-level resilience.

Google Cloud leads in open-source contributions. The GKE team worked with Anyscale engineers to upstream label-based scheduling into Ray v2.49. They also created a ray.util.tpu layer to reduce resource fragmentation in multi-chip TPU setups and added Dynamic Resource Allocation for the new GB200-backed instances.

All three hyperscalers have chosen the same managed Ray operator, and each has added its own infrastructure around it. This suggests the industry is converging on Kubernetes plus Ray for AI workloads. The competition is now less about the runtime and more about which cloud can best streamline the surrounding infrastructure.