Cloudflare recently announced support for aggregations in R2 SQL, a new feature that lets developers run SQL queries on data stored in R2. This enhancement expands R2 SQL beyond basic filtering and makes it more useful for analytical workloads without requiring separate data warehouse tools.
R2 SQL now supports SUM, COUNT, AVG, MIN, and MAX, as well as GROUP BY and HAVING clauses. These aggregation functions let developers run SQL analytics directly on data stored in R2 via the R2 Data Catalog, enabling them to quickly summarize data, spot trends, generate reports, and identify unusual patterns in logs. In addition to aggregations, the update introduces schema discovery commands, including SHOW TABLES and DESCRIBE.
Jérôme Schneider, staff software engineer at Cloudflare, Nikita Lapkov, senior software engineer at Cloudflare, and Marc Selwan, senior product manager at Cloudflare, summarize:
Whether you are generating reports, monitoring high-volume logs for anomalies, or simply trying to spot trends in your data, you can now easily do it all within Cloudflare’s Developer Platform without the overhead of managing complex OLAP infrastructure or moving data out of R2.
Jeremy Daly, director of research at CloudZero, comments in his newsletter:
Cloudflare continues to push data closer to the edge with aggregation support in R2 SQL, expanding the kinds of workloads developers can realistically run there.
Source: Cloudflare blog
Schneider, Lapkov, and Selwan explain how they built a distributed GROUP BY execution using scatter-gather and shuffling strategies to run analytics directly over the R2 Data Catalog:
Aggregate queries without “HAVING” and “ORDER BY” can be executed in a fashion similar to filter queries. For filter queries, R2 SQL picks one node to be the coordinator in query execution. This node analyzes the query and consults R2 Data Catalog to figure out which Parquet row groups may contain data relevant to the query. Each Parquet row group represents a relatively small piece of work that a single compute node can handle. Coordinator node distributes the work across many worker nodes and collects results to return them to the user.
Cloudflare has separately announced that R2 Data Catalog now supports automatic snapshot expiration for Apache Iceberg tables, complementing automatic compaction, which optimizes query performance by combining small data files into larger ones. Selwan comments:
These go hand in hand because the metadata cleanup/management that snapshot expiration helps with will speed up performance of these aggregation queries, especially with compaction enabled.
The hyperscaler has recently published a deep-dive article that documents how its distributed query engine works.
As R2 SQL is still in public beta, the supported SQL grammar may change over time. A documentation page covers the current limitations and best practices.
