Google Cloud has announced the general availability of NVIDIA GPU support for Cloud Run, its serverless runtime. With this enhancement, Google Cloud aims to provide a powerful, yet remarkably cost-efficient, environment for a wide range of GPU-accelerated use cases, particularly in AI inference and batch processing.
In a company blog post, Google highlights that developers favor Cloud Run for its simplicity, flexibility, and scalability. With the addition of GPU support, it now extends its core benefits to GPU resources:
- Pay-per-second billing: Users are now charged only for the GPU resources they consume, down to the second – thus minimizing waste.
- Scale to zero: Cloud Run automatically scales GPU instances down to zero when inactive, eliminating idle costs – particularly beneficial for sporadic or unpredictable workloads.
- Rapid startup and scaling: Instances with GPUs and drivers can start up in under 5 seconds, enabling applications to respond to demand very quickly.
- Full streaming support: Built-in support for HTTP and WebSocket streaming allows for interactive applications, such as real-time LLM responses.
Dave Salvator, director of accelerated computing products at NVIDIA, commented:
> Serverless GPU acceleration represents a major advancement in making cutting-edge AI computing more accessible. With seamless access to NVIDIA L4 GPUs, developers can now bring AI applications to production faster and more cost-effectively than ever before.
A significant barrier to entry has been removed, as NVIDIA L4 GPU support on Cloud Run is now available to all users with no quota request required. Developers can enable GPU support via a simple command-line flag (`--gpu 1`) or by checking a box in the Google Cloud console.
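As a minimal sketch of what that looks like in practice, the following deployment command attaches one L4 GPU to a service. The service name, image, and region are placeholders, and the accompanying CPU, memory, and throttling settings reflect the resource minimums documented for GPU workloads at the time of writing, which are worth verifying against the current documentation:

```shell
# Deploy a Cloud Run service with one NVIDIA L4 GPU attached.
# Service name, image path, and region are illustrative placeholders.
gcloud run deploy my-inference-service \
  --image us-docker.pkg.dev/my-project/my-repo/inference:latest \
  --region us-central1 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --no-cpu-throttling \
  --cpu 4 \
  --memory 16Gi
```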
Cloud Run with GPU support is production-ready and covered by Cloud Run’s Service Level Agreement (SLA) for reliability and uptime. It offers zonal redundancy by default for resilience; teams that can tolerate best-effort failover during a zonal outage can turn zonal redundancy off in exchange for lower pricing.
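Based on the announcement, this trade-off is configured per service; the flag name below matches recent gcloud releases but should be treated as an assumption and checked against the documentation:

```shell
# Assumed flag name; verify against the current gcloud docs.
# Opts the service out of GPU zonal redundancy for lower pricing,
# accepting best-effort failover during a zonal outage.
gcloud run services update my-inference-service \
  --region us-central1 \
  --no-gpu-zonal-redundancy
```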
The general availability of GPU support on Cloud Run has also sparked a discussion within the developer community regarding its competitive implications, particularly in relation to other major cloud providers. Rubén del Campo, a principal software engineer at ZenRows, highlighted Google’s move as something “AWS should have built years ago: serverless GPU compute that actually works.”
In his view, this exposes a “massive gap in AWS Lambda’s capabilities,” specifically Lambda’s 15-minute timeout and CPU-only compute, which rule out modern AI workloads such as Stable Diffusion inference, model fine-tuning, and real-time video analysis. “Try running Stable Diffusion inference, fine-tuning a model, or processing video with AI in Lambda. You can’t,” he commented, adding that Cloud Run GPUs make such tasks “trivial with serverless GPUs that scale to zero.”
While Cloud Run GPUs offer compelling features, some users on a Hacker News thread have raised concerns about the lack of hard billing limits, which could lead to unexpected costs: Cloud Run lets users cap the maximum number of instances, but it does not provide a dollar-based spending cap.
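Capping instances is the closest built-in mitigation, since it bounds how many GPU-attached instances can run concurrently; a brief sketch, again with a hypothetical service name:

```shell
# Bound worst-case spend by capping concurrent instances (and
# therefore GPUs). This is not a dollar-based billing limit.
gcloud run services update my-inference-service \
  --region us-central1 \
  --max-instances 3
```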
Comparisons on the same Hacker News thread also suggest that other providers, such as Runpod.io, may offer more competitive pricing for similar GPU instances. For example, some users pointed out that Runpod’s hourly rates for L4, A100, and H100 GPUs can be significantly lower than Google’s, even when Google’s per-second billing is taken into account.
Beyond real-time inference, Google has also announced GPU support for Cloud Run jobs (currently in private preview), unlocking new use cases for batch processing and asynchronous tasks. Cloud Run GPUs are currently available in five Google Cloud regions: us-central1 (Iowa, USA), europe-west1 (Belgium), europe-west4 (Netherlands), asia-southeast1 (Singapore), and asia-south1 (Mumbai, India), with additional regions planned.
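For the Cloud Run jobs scenario mentioned above, the interface may change while the feature is in private preview; purely as an assumption-laden sketch, and supposing the job flags mirror the service flags, a GPU-enabled job might be created and run like this (job name, image, and region are placeholders):

```shell
# Hypothetical sketch: GPU support for jobs is in private preview,
# so these flags are assumed to mirror the service flags and may
# differ from the actual interface.
gcloud beta run jobs create my-batch-job \
  --image us-docker.pkg.dev/my-project/my-repo/batch:latest \
  --region us-central1 \
  --gpu 1 \
  --gpu-type nvidia-l4

# Trigger an execution of the job.
gcloud beta run jobs execute my-batch-job --region us-central1
```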
Lastly, the company notes that developers can get started with Cloud Run GPUs through the official documentation, quickstarts, and best practices for optimizing model loading.