AI workloads are growing more complex in terms of compute and data, and technologies like Kubernetes and PyTorch can help build production-ready AI systems to support them. Robert Nishihara from Anyscale recently spoke at KubeCon + CloudNativeCon North America 2025 about how an AI compute stack comprising Kubernetes, PyTorch, vLLM, and Ray can support these new AI workloads.
Ray is an open-source framework for building and scaling machine learning and Python applications, orchestrating infrastructure for distributed workloads. It was originally developed at UC Berkeley during a reinforcement learning research project and recently became part of the PyTorch Foundation to contribute to the broader open-source AI ecosystem.
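As a rough illustration of the programming model (a minimal sketch, not code from the talk), Ray lets ordinary Python functions be declared as remote tasks that the framework schedules across a cluster; the preprocess function below is a hypothetical example:

```python
import ray

# Connect to an existing cluster, or start a local one for testing.
ray.init()

# Declaring a function as a remote task lets Ray schedule it
# on any worker in the cluster.
@ray.remote
def preprocess(batch):
    return [record.lower() for record in batch]

# Invocations return futures immediately; ray.get blocks until results arrive.
futures = [preprocess.remote(chunk) for chunk in [["A", "B"], ["C", "D"]]]
print(ray.get(futures))  # [['a', 'b'], ['c', 'd']]
```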
Nishihara emphasized three main areas driving the evolution of AI workloads: data processing, model training, and model serving. Data processing must adapt to the emerging data types needed for AI applications, expanding beyond traditional tabular data to include multimodal datasets (which can encompass images, videos, audio, text, and sensor data). This evolution is crucial for supporting inference tasks, which are a fundamental component of AI-powered applications. Additionally, the hardware used for data storage and compute operations needs to support GPUs alongside standard CPUs. He noted that data processing has shifted from “SQL operations on CPUs” to “inferences on GPUs.”
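The shift from "SQL operations on CPUs" to "inference on GPUs" can be pictured with Ray Data's batch-inference pattern. The sketch below assumes a recent Ray version and a cluster with GPU nodes; the S3 paths and the Embedder model are hypothetical placeholders rather than anything shown in the talk:

```python
import ray

ray.init()

# Hypothetical multimodal dataset of images (placeholder S3 path).
ds = ray.data.read_images("s3://example-bucket/images/")

class Embedder:
    """Stateful callable: loads a model once per GPU worker."""
    def __init__(self):
        # Placeholder for loading a real vision model onto the GPU.
        self.model = lambda batch: {"embedding": [len(img) for img in batch["image"]]}

    def __call__(self, batch):
        return self.model(batch)

# Run inference as a streaming map over the dataset, one actor per GPU.
embeddings = ds.map_batches(Embedder, concurrency=4, num_gpus=1)
embeddings.write_parquet("s3://example-bucket/embeddings/")
```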
Model training increasingly involves reinforcement learning (RL) and post-training tasks, including generating new data by running inference on models. Ray’s Actor API can be leveraged for the Trainer and Generator components. An “Actor” is essentially a stateful worker: instantiating one creates a new worker process, and the actor’s methods are scheduled on that specific worker, where they can access and mutate its state. Furthermore, Ray’s native Remote Direct Memory Access (RDMA) support allows GPU objects to be transported directly over RDMA, improving performance.
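The actor pattern for post-training can be sketched roughly as follows; the Generator and Trainer classes are illustrative stand-ins, not Nishihara’s actual implementation, and in a real cluster each actor would typically reserve a GPU (e.g. @ray.remote(num_gpus=1)):

```python
import ray

ray.init()

@ray.remote
class Generator:
    """Stateful worker holding a serving engine; produces rollout data."""
    def __init__(self):
        self.weights_version = 0  # placeholder for loaded model weights

    def generate(self, prompts):
        # Placeholder for running inference via a serving engine.
        return [f"completion for {p} (weights v{self.weights_version})" for p in prompts]

    def update_weights(self, version):
        self.weights_version = version

@ray.remote
class Trainer:
    """Stateful worker that consumes generated data and produces new weights."""
    def __init__(self):
        self.step = 0

    def train_on(self, samples):
        self.step += 1  # placeholder for an actual optimizer step
        return self.step

# Instantiating an actor creates a dedicated worker; method calls are
# scheduled on that worker and share its state.
generator = Generator.remote()
trainer = Trainer.remote()

for _ in range(3):
    samples = ray.get(generator.generate.remote(["prompt-1", "prompt-2"]))
    new_version = ray.get(trainer.train_on.remote(samples))
    generator.update_weights.remote(new_version)
```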
Several open-source reinforcement learning frameworks have been built on top of Ray. For instance, Composer, the model behind the AI-powered code editor Cursor, is built on Ray. Nishihara also mentioned other notable frameworks, such as verl (ByteDance), OpenRLHF, ROLL (Alibaba), NeMo-RL (Nvidia), and SkyRL (UC Berkeley), which use training engines like Hugging Face, FSDP, DeepSpeed, and Megatron, and serving engines like Hugging Face, vLLM, SGLang, and OpenAI, all orchestrated by Ray.
He shared the application architecture around Ray, noting that complexity is increasing in both the upper and lower layers, which creates a growing need for software stacks that connect applications at the top to hardware at the bottom. The top layers include AI workloads and model training and inference frameworks like PyTorch, vLLM, Megatron, and SGLang, while the bottom layers consist of compute substrates (GPUs and CPUs) and orchestrators like Kubernetes and Slurm. Distributed compute frameworks such as Ray and Spark act as bridges between these top- and bottom-layer components, handling data ingestion and data movement.
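This bridging role can be sketched with Ray’s resource-aware scheduling: CPU tasks handle ingestion, GPU tasks handle inference, and Ray moves intermediate data between them through its object store. The example below is a hypothetical illustration, assuming a cluster with GPU nodes:

```python
import ray

ray.init()

# Ingestion runs on CPU-only workers in the lower layer of the stack.
@ray.remote(num_cpus=2)
def ingest(shard_path):
    # Placeholder: read and decode a shard of multimodal data.
    return [f"record-from-{shard_path}"] * 4

# Inference runs on GPU workers; in practice this would wrap a
# top-layer framework such as PyTorch or vLLM.
@ray.remote(num_gpus=1)
def infer(records):
    return [f"prediction for {r}" for r in records]

# Ray moves intermediate objects between workers via its object store,
# so the application never manages data transfer explicitly.
shards = [ingest.remote(p) for p in ["shard-0", "shard-1"]]
predictions = ray.get([infer.remote(s) for s in shards])
```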
Kubernetes and Ray complement one another for hosting AI applications: Ray extends Kubernetes’ container-level isolation with process-level isolation, and together they offer both vertical and horizontal autoscaling. Nishihara pointed out that because demand for the inference stage rises and falls relative to model training, it becomes beneficial to shift GPUs between the two stages, a capability made possible by using Ray and Kubernetes together.
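One concrete mechanism for this is Ray’s autoscaler SDK, which lets an application ask the cluster (for example, one provisioned by KubeRay on Kubernetes) to scale toward a desired resource shape. The split between serving and training below is purely illustrative, a sketch rather than the approach described in the talk:

```python
import ray
from ray.autoscaler.sdk import request_resources

ray.init()

# Illustrative policy: during peak serving hours, ask the autoscaler to
# provision GPUs for inference replicas; off-peak, hand that capacity
# to a training job instead.
SERVING_PEAK = True

if SERVING_PEAK:
    # Request eight single-GPU bundles for serving replicas.
    request_resources(bundles=[{"GPU": 1}] * 8)
else:
    # Request one larger allocation for a multi-GPU training job.
    request_resources(bundles=[{"GPU": 8}])
```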
In conclusion, Nishihara underscored the core requirements of AI platforms: they must support a native multi-cloud experience, workload prioritization across GPU reservations, observability and tooling, model and data lineage tracking, and overall governance. Observability is essential both at the container level and at the workload and process levels, to monitor metrics such as object transfer speeds.
