Generative AI technologies introduce new workloads, traffic patterns, and infrastructure demands, and they require a new set of tools for the age of GenAI. Erica Hughberg from Tetrate and Alexa Griffith from Bloomberg spoke last week at KubeCon + CloudNativeCon North America 2025 about what it takes to build GenAI platforms capable of serving model inference at scale.
The new requirements for GenAI-based applications include dynamic, model-based routing; token-level rate limiting; secure and centralized credential management; and observability, resilience, and failover for AI. Existing tools cannot support these use cases because they lack AI-native logic, offering only simple rate limiting and request-based routing. Kubernetes and tools like KServe, vLLM, Envoy, and llm-d can be used to implement these new requirements, while frameworks like OpenTelemetry, Prometheus, and Grafana cover monitoring and observability for AI applications.
The speakers discussed their AI application architecture developed using open-source projects like Envoy AI Gateway and KServe. Envoy AI Gateway helps manage traffic at the edge and provides unified access from application clients to GenAI services like an Inference Service or a Model Context Protocol (MCP) server. Its design is based on a two-tier gateway pattern, with the Tier One Gateway, referred to as the AI Gateway, functioning as a centralized entry point responsible for authentication, top-level routing, a unified LLM API, and token-based rate limiting. It can also act as an MCP proxy.
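To make the routing concrete, below is a minimal sketch of an AIGatewayRoute, the custom resource Envoy AI Gateway uses to route requests to backends based on the model named in the request. The Gateway name ai-gateway and backend name openai-backend are hypothetical, and exact field names may differ across project versions.

```yaml
# Hedged sketch: route requests to a backend based on the requested model.
# "ai-gateway" and "openai-backend" are hypothetical resource names.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: model-routing
spec:
  schema:
    name: OpenAI                  # clients use the unified OpenAI-compatible API
  parentRefs:
    - name: ai-gateway            # the Tier One (AI) Gateway entry point
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model # model name extracted from the request body
              value: gpt-4o-mini
      backendRefs:
        - name: openai-backend    # AIServiceBackend pointing at the provider
```

Routing on the model name rather than on the URL path is what lets a single unified endpoint fan out to different providers or self-hosted models.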
The Tier Two Gateway, referred to as the Inference Gateway, manages the ingress traffic to the AI models hosted on a Kubernetes cluster and is also responsible for fine-grained control over access to the models. Envoy AI Gateway supports different AI providers like OpenAI, Azure OpenAI, Google Gemini, Vertex AI, AWS Bedrock, and Anthropic.
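The token-based rate limiting handled by the Tier One Gateway is typically expressed in two parts: the route records how many tokens each request consumed, and a rate-limit policy charges that usage against a per-client budget. The sketch below is a hedged illustration assuming Envoy AI Gateway's llmRequestCosts feature together with Envoy Gateway's BackendTrafficPolicy; the x-user-id header is a hypothetical client identifier, and field names should be checked against the versions in use.

```yaml
# Hedged sketch: deduct each request's total token count from a per-user
# hourly budget. Metadata namespace/key follow Envoy AI Gateway conventions.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: model-routing
spec:
  llmRequestCosts:
    - metadataKey: llm_total_token
      type: TotalToken            # input + output tokens per request
  # ...routing rules as in the previous sketch...
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: token-budget
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: ai-gateway            # hypothetical Tier One Gateway
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id # hypothetical per-user identifier
                  type: Distinct
          limit:
            requests: 100000      # token budget per user, per hour
            unit: Hour
          cost:
            response:
              from: Metadata      # read token usage recorded by the AI Gateway
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token
```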
KServe is the open-source standard for self-hosted models, providing a unified platform for generative and predictive AI inference on Kubernetes. As a single, declarative API for models, it can provide a stable, internal endpoint for each model to which the Envoy AI Gateway can route traffic. It has recently been retooled to support generative AI capabilities like multi-framework LLM support, OpenAI-compatible APIs, LLM model caching, KV cache offloading, multi-node inference, metric-based autoscaling, and native support for Hugging Face models with streamlined deployment workflows.
KServe provides a Kubernetes custom resource definition (CRD), built on the foundation of llm-d, a Kubernetes-native LLM inference framework, for serving models on different frameworks like PyTorch, TensorFlow, ONNX, or Hugging Face. The CRD is configured through a Kubernetes YAML manifest of kind InferenceService, where you specify the model metadata and the Gateway API endpoint for external access.
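As a concrete example, a minimal InferenceService manifest for a Hugging Face LLM might look like the sketch below. The model name, model ID, and resource sizes are illustrative placeholders, and the arguments follow the conventions of KServe's Hugging Face serving runtime.

```yaml
# Minimal sketch of a KServe InferenceService serving a Hugging Face LLM.
# Model name/ID and resource sizes are placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-8b
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface         # KServe's Hugging Face serving runtime
      args:
        - --model_name=llama3
        - --model_id=meta-llama/Meta-Llama-3-8B-Instruct
      resources:
        limits:
          nvidia.com/gpu: "1"     # GPU-backed inference
```

Applying this manifest gives the model a stable in-cluster endpoint, which the Envoy AI Gateway can then reference as a routing backend.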
Hughberg and Griffith concluded the presentation by reiterating that GenAI brings stateful, resource-intensive, and token-based workloads. Supporting them requires AI-native capabilities like dynamic, model-based routing and token-level rate limiting with cost control. CNCF tools like Kubernetes, Envoy AI Gateway, and KServe can help with developing GenAI-based applications.
