Google recently launched generative AI inference for third-party open models in BigQuery, allowing data teams to deploy and run models from Hugging Face or Vertex AI Model Garden using plain SQL. The capability, currently in preview, removes the need for separate ML infrastructure: BigQuery automatically spins up compute resources, manages endpoints, and cleans everything up, all through its SQL interface.
The new capability tackles a long-standing pain point for data teams: running open-source models previously meant managing Kubernetes clusters, configuring endpoints, and juggling multiple tools. Virinchi T, writing in a Medium article about the launch, put it this way:
This process requires multiple tools, different skill sets, and significant operational overhead. For many data teams, this friction means AI capabilities remain out of reach—even when the models themselves are freely available.
Yet with the new interface, the entire workflow boils down to two SQL statements. Users first create a model with a single CREATE MODEL statement that specifies a Hugging Face model ID (such as sentence-transformers/all-MiniLM-L6-v2) or a Vertex AI Model Garden model name. BigQuery automatically provisions compute resources with default configurations, typically completing deployment in 3-10 minutes depending on model size.
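In practice, the deployment step might look like the following sketch. The option name hugging_face_model_id is an illustrative stand-in, since the exact OPTIONS keys for the preview should be confirmed against Google's documentation; the model ID itself comes from the article:

```sql
-- Step 1: deploy an open model from Hugging Face with one statement.
-- NOTE: the option name `hugging_face_model_id` is illustrative; check
-- the preview documentation for the exact OPTIONS keys.
CREATE OR REPLACE MODEL `my_project.my_dataset.minilm_embedder`
OPTIONS (
  hugging_face_model_id = 'sentence-transformers/all-MiniLM-L6-v2'
);
```

Because defaults cover machine configuration, a first experiment requires no infrastructure setup beyond this statement.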
Next, users run inference using AI.GENERATE_TEXT for language models or AI.GENERATE_EMBEDDING for embeddings, querying data straight from BigQuery tables. The platform manages the resource lifecycle via the endpoint_idle_ttl option, which shuts down idle endpoints to prevent charges. Users can also manually undeploy endpoints with ALTER MODEL statements once batch jobs wrap up.
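The inference step could then be a single query. The sketch below assumes a call shape similar to BigQuery ML's existing table-valued functions (a model reference followed by a subquery supplying a content column); the table and column names are hypothetical, as is the option name in the manual-undeploy statement:

```sql
-- Step 2: batch inference directly over a BigQuery table.
SELECT *
FROM AI.GENERATE_EMBEDDING(
  MODEL `my_project.my_dataset.minilm_embedder`,
  (
    SELECT review_text AS content          -- hypothetical source column
    FROM `my_project.my_dataset.reviews`   -- hypothetical source table
  )
);

-- Optional: manually undeploy the endpoint once the batch job finishes.
-- The `deploy_model` option name is illustrative, not confirmed syntax.
ALTER MODEL `my_project.my_dataset.minilm_embedder`
SET OPTIONS (deploy_model = FALSE);
```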
The feature also supports customization for production use cases. Users can set machine types, replica counts, and endpoint idle times directly in the CREATE MODEL statement, and Compute Engine reservations can lock in GPU instances for steady performance. When a model is no longer needed, a single DROP MODEL statement automatically removes all associated Vertex AI resources.
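A production-oriented deployment might therefore resemble the sketch below. The machine type shown is a real GCP G2 (NVIDIA L4) shape, but every option name, and the endpoint_idle_ttl value format, is an illustrative assumption to be checked against the preview documentation:

```sql
-- Production-style deployment with explicit resource settings.
-- All option names and value formats here are illustrative.
CREATE OR REPLACE MODEL `my_project.my_dataset.gemma_generator`
OPTIONS (
  hugging_face_model_id = 'google/gemma-2-9b-it', -- example open model
  machine_type          = 'g2-standard-12',       -- GPU machine type (NVIDIA L4)
  min_replica_count     = 1,                      -- keep one replica warm
  max_replica_count     = 4,                      -- scale out under load
  endpoint_idle_ttl     = INTERVAL 1 DAY          -- auto-undeploy when idle
);

-- Dropping the model also tears down the associated Vertex AI resources.
DROP MODEL `my_project.my_dataset.gemma_generator`;
```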
Google’s blog post describes the system as providing “granular resource control” and “automated resource management,” letting teams balance performance and cost without leaving SQL. An earlier blog post from September 2025 demonstrated processing 38 million rows for roughly $2-3 using similar patterns with open-source embedding models.
The feature works with more than 13,000 Hugging Face text embedding models and over 170,000 text generation models, including Meta’s Llama series and Google’s Gemma family. Models must comply with Vertex AI Model Garden deployment requirements, including regional availability and quota limits.
Virinchi T highlighted what this means for different roles:
For Data Analysts: You can now experiment with ML models without leaving your SQL environment or waiting for engineering resources.

For Data Engineers: Building ML-powered data pipelines becomes dramatically simpler—no separate ML infrastructure to maintain.
The launch puts BigQuery up against Snowflake’s Cortex AI and Databricks’ Model Serving, both of which offer SQL-accessible ML inference. BigQuery’s edge may be its direct integration of Hugging Face’s massive model catalog into the data warehouse, which could appeal to teams already running on Google Cloud.
Documentation and tutorials are available for text generation with Gemma models and embedding generation.
