Cloudflare has launched a managed service for using retrieval-augmented generation (RAG) in LLM-based systems. Now in beta, Cloudflare AutoRAG aims to make it easier for developers to build pipelines that integrate rich context data into LLMs.
Retrieval-augmented generation can significantly improve how accurately LLMs answer questions involving proprietary or domain-specific knowledge. However, its implementation is far from trivial, explains Cloudflare product manager Anni Wang.
Building a RAG pipeline involves a patchwork of moving parts: you have to stitch together multiple tools and services, including data storage, a vector database, an embedding model, LLMs, and custom indexing, retrieval, and generation logic, all just to get started.
To make matters worse, the whole process must be repeated each time your knowledge base changes.
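What that patchwork looks like in practice can be sketched in a few dozen lines of Worker code. The following is only an illustration of the manual approach, with assumed binding names, a hypothetical chunk helper, and arbitrary model choices; a real pipeline would also need ingestion, re-indexing, and error handling.

```typescript
// Minimal sketch of the plumbing AutoRAG automates, using Cloudflare's own
// building blocks. Binding names (AI, VECTORIZE), the chunk() helper, and the
// model choices are illustrative; types come from @cloudflare/workers-types.
export interface Env {
  AI: Ai;               // Workers AI binding
  VECTORIZE: Vectorize; // Vectorize index binding
}

// Hypothetical helper: naive fixed-size chunking of a document.
function chunk(text: string, size = 1000): string[] {
  const pieces: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    pieces.push(text.slice(i, i + size));
  }
  return pieces;
}

export async function indexDocument(env: Env, docId: string, text: string): Promise<void> {
  const pieces = chunk(text);
  // Embed each chunk with a Workers AI embedding model.
  const { data } = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: pieces });
  // Store the vectors in Vectorize, keeping the chunk text as metadata.
  await env.VECTORIZE.upsert(
    data.map((values, i) => ({
      id: `${docId}-${i}`,
      values,
      metadata: { text: pieces[i] },
    }))
  );
}

export async function answer(env: Env, question: string): Promise<string> {
  // Embed the question with the same model used at indexing time.
  const { data } = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [question] });
  // Retrieve the most similar chunks from Vectorize.
  const { matches } = await env.VECTORIZE.query(data[0], { topK: 5, returnMetadata: "all" });
  const context = matches.map((m) => m.metadata?.text).join("\n\n");
  // Combine the retrieved context with the question and generate a response.
  const result = (await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      { role: "system", content: `Answer using only this context:\n${context}` },
      { role: "user", content: question },
    ],
  })) as { response?: string };
  return result.response ?? "";
}
```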
To improve on this, Cloudflare AutoRAG automates all steps required for retrieval-augmented generation: it ingests the data, automatically chunks and embeds it, stores the resulting vectors in Cloudflare’s Vectorize database, performs semantic retrieval, and generates responses using Workers AI. It also monitors all data sources in the background and reruns the pipeline when needed.
The two main processes behind AutoRAG are indexing and querying, explains Wang. Indexing begins by connecting a data source, which is ingested, transformed, vectorized using an embeddings model, and optimized for queries. Currently, AutoRAG supports only Cloudflare R2-based sources and can process PDFs, images, text, HTML, CSV, and more. All files are converted into structured Markdown; images are handled using a combination of object detection and vision-to-language transformation.
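Because R2 is currently the only supported source, keeping an AutoRAG knowledge base up to date comes down to writing objects into the connected bucket. Below is a minimal sketch of doing that from a Worker, where the SOURCE_DOCS binding and the key scheme are illustrative assumptions.

```typescript
// Minimal sketch: pushing a document into the R2 bucket AutoRAG indexes.
// The SOURCE_DOCS binding name and key scheme are illustrative assumptions.
export interface Env {
  SOURCE_DOCS: R2Bucket;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Derive an object key from the request path and store the uploaded body;
    // AutoRAG's background monitoring picks the new object up on its next run.
    const key = new URL(request.url).pathname.slice(1) || "docs/untitled.md";
    await env.SOURCE_DOCS.put(key, request.body, {
      httpMetadata: { contentType: request.headers.get("content-type") ?? "text/markdown" },
    });
    return new Response(`stored ${key}`, { status: 201 });
  },
};
```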
The querying process starts when an end user makes a request through the AutoRAG API. The prompt is optionally rewritten to improve its effectiveness, then vectorized using the same embeddings model applied during indexing. The resulting vector is used to search the Vectorize database, returning the relevant chunks and metadata that help retrieve the original content from the R2 data source. Finally, the retrieved context is combined with the user prompt and passed to the LLM.
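In a Worker, that whole query path is exposed through the AI binding. The snippet below is a minimal sketch assuming an AutoRAG instance named my-rag: aiSearch returns a generated answer along with the retrieved context, while search returns only the retrieved chunks.

```typescript
// Minimal sketch of querying AutoRAG from a Worker via the AI binding.
// The instance name "my-rag" is an illustrative assumption.
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { query } = (await request.json()) as { query: string };

    // aiSearch() runs the full pipeline described above: optional query
    // rewriting, embedding, Vectorize retrieval, and response generation.
    const result = await env.AI.autorag("my-rag").aiSearch({ query });

    // search() returns only the retrieved chunks, leaving generation to the caller:
    // const chunks = await env.AI.autorag("my-rag").search({ query });

    return Response.json(result);
  },
};
```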
On LinkedIn, Stratus Cyber CEO Ajay Chandhok noted that “in most cases AutoRAG implementation requires just pointing to an existing R2 bucket. You drop your content in, and the system automatically handles everything else”.
Another benefit of AutoRAG, says BBC senior software engineer Nicholas Griffin, is that it “makes querying just a few lines of code”.
Some skepticism surfaced on X, where Poojan Dalal pointed out that “production grade scalable RAG systems for enterprises have much more requirements and components than just a single pipeline”, adding that it’s not just about semantic search.
Engineer Pranit Bauva, who successfully used AutoRAG to create a RAG app, also pointed out several limitations in its current form: few options for embedding and chunking, slow query rewriting, and an AI Gateway that only works with Llama models—possibly due to an early-stage bug. He also noted that retrieval quality is lacking and emphasized that, for AutoRAG to be production-ready, it must offer a way to evaluate whether the correct context was retrieved to answer a given question.