In a recent article from the engineering team behind the Zero Trust product suite, Cloudflare explains why it chose TimescaleDB over ClickHouse to add analytics and reporting capabilities to its internal platform. The author highlights the “phenomenal balance” between the simplicity of storing analytical data alongside configuration data and the performance of a specialized OLAP system.
Focusing on the importance of minimalism in engineering, Cloudflare explains how it built Digital Experience Monitoring (DEX), an internal observability platform that provides visibility into device, network, and application performance across Cloudflare Zero Trust environments.
The team built a configuration plane, an interface for creating and managing synthetic tests, and an analytics plane, an ingestion pipeline that collects structured logs from WARP clients, stores them, and visualizes them in the dashboard.
While Cloudflare has been using ClickHouse since 2017, Robert Cepa, until recently a senior software engineer at Cloudflare, explains why the team chose not to use it for this project:
The default and most commonly used table engine in ClickHouse, MergeTree, is optimized for high-throughput batch inserts. It writes each insert as a separate partition, then runs background merges to keep data manageable. This makes writes very fast, but not when they arrive in lots of tiny batches, which was exactly our case with millions of individual devices uploading one log event every 2 minutes. Too many small writes can trigger write amplification, resource contention, and throttling.
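The write pattern the quote describes can be illustrated with a minimal client-side buffering sketch (hypothetical names, not Cloudflare's code): instead of issuing one insert per event, events accumulate in a buffer and are flushed as a single batch once a size threshold is reached, which is the insert shape MergeTree-style engines are optimized for.

```python
from typing import Callable

class BatchingWriter:
    """Buffers individual events and flushes them as one batch insert.

    Engines like MergeTree create a new on-disk part per insert, so
    thousands of one-row inserts cause write amplification and merge
    pressure; batching turns them into a few large parts instead.
    """

    def __init__(self, flush: Callable[[list], None], batch_size: int = 1000):
        self.flush = flush          # e.g. a function issuing one multi-row INSERT
        self.batch_size = batch_size
        self.buffer: list = []
        self.flush_count = 0        # number of actual inserts issued

    def write(self, event: dict) -> None:
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self._flush()

    def _flush(self) -> None:
        if self.buffer:
            self.flush(self.buffer)
            self.flush_count += 1
            self.buffer = []

# 5,000 single events become just 5 batch inserts of 1,000 rows each.
inserted_batches = []
writer = BatchingWriter(inserted_batches.append, batch_size=1000)
for i in range(5000):
    writer.write({"device_id": i % 50, "latency_ms": 20})
print(writer.flush_count)  # → 5
```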
To keep the initial release simple and deliver a working DEX MVP within four months, the team used PostgreSQL for both configuration data and analytical logs, handling 200 inserts per second at launch with query latencies in the hundreds of milliseconds for most customers. But PostgreSQL alone could not keep up as usage grew, as Cepa adds:

As adoption grew, we scaled to 1,000 inserts/sec, and our tables grew to billions of rows. That’s when we started to see performance degradation, particularly for large customers querying 7+ day time ranges across tens of thousands of devices.
As the project grew to billions of device logs, the team explored precomputing aggregates (downsampling): computing and storing summaries in advance rather than querying the raw data repeatedly. This change yielded a 1000x increase in query performance, and charts that previously took several seconds to render could now be displayed instantly, even for 7-day views across tens of thousands of devices.
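The idea behind precomputed aggregates can be sketched in a few lines (an illustrative model, not Cloudflare's implementation): raw per-device events are rolled up once into hourly summaries, and dashboard queries then scan the small summary table instead of billions of raw rows.

```python
from collections import defaultdict

def downsample_hourly(raw_events):
    """Roll raw events up into (hour, device) buckets with count and mean latency."""
    buckets = defaultdict(lambda: {"count": 0, "latency_sum": 0.0})
    for ts, device_id, latency_ms in raw_events:
        key = (ts - ts % 3600, device_id)   # truncate the timestamp to the hour
        b = buckets[key]
        b["count"] += 1
        b["latency_sum"] += latency_ms

    return {
        key: {"count": b["count"], "avg_latency_ms": b["latency_sum"] / b["count"]}
        for key, b in buckets.items()
    }

# One event every 2 minutes means 30 raw rows per device-hour
# collapse into a single summary row per hour.
raw = [(hour * 3600 + minute * 120, "device-1", 20.0)
       for hour in range(2) for minute in range(30)]
summary = downsample_hourly(raw)
print(len(raw), len(summary))  # → 60 2
```

A chart over a 7-day range then reads a few hundred summary rows per device instead of thousands of raw events, which is the core of the speedup described above.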
Source: Cloudflare blog
As PostgreSQL does not automatically refresh materialized views or manage table partitions, the team turned to TimescaleDB, which automates both and also provides columnstore compression and sparse indexes. Available under an Apache 2.0 license, TimescaleDB is an open-source time-series database built as an extension to PostgreSQL, optimizing storage and querying for time-stamped data while maintaining full SQL compatibility and ACID properties.
As TimescaleDB automates aggregation and data retention through automatic partition management and downsampling, Cloudflare was able to simplify its internal infrastructure by integrating the PostgreSQL extension into its existing setup. Cepa concludes:
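TimescaleDB's hypertables handle this transparently, but the underlying idea of automatic partition management and retention can be sketched as follows (a simplified toy model, not TimescaleDB's actual mechanism): rows are routed into fixed-width time chunks, and retention is enforced by dropping whole chunks rather than deleting individual rows.

```python
from collections import defaultdict

class TimePartitionedTable:
    """Toy model of time-based chunking: each row lands in a fixed-width
    time partition, and retention drops whole partitions at once."""

    def __init__(self, chunk_seconds: int = 86400):
        self.chunk_seconds = chunk_seconds
        self.chunks = defaultdict(list)   # chunk start time -> rows

    def insert(self, ts: int, row: dict) -> None:
        # Route the row to its chunk by truncating the timestamp.
        self.chunks[ts - ts % self.chunk_seconds].append(row)

    def drop_before(self, cutoff_ts: int) -> int:
        """Drop every chunk that ends at or before the cutoff; returns chunks dropped."""
        old = [start for start in self.chunks
               if start + self.chunk_seconds <= cutoff_ts]
        for start in old:
            del self.chunks[start]
        return len(old)

table = TimePartitionedTable(chunk_seconds=86400)
for day in range(10):
    table.insert(day * 86400, {"metric": day})
# Keep only the last 7 days: the 3 oldest daily chunks are dropped wholesale.
dropped = table.drop_before(3 * 86400)
print(dropped, len(table.chunks))  # → 3 7
```

Dropping a chunk is a metadata operation rather than a row-by-row delete, which is why partition-based retention stays cheap even at billions of rows.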
Not every team needs a hyper-specialized race car that requires 100 octane fuel, carbon ceramic brakes, and ultra-performance race tires: while each one of these elements boosts performance, there’s a real cost towards having those items in the form of maintenance and uniqueness. For many teams at Cloudflare, TimescaleDB strikes a phenomenal balance between the simplicity of storing your analytical data under the same roof as your configuration data, while also gaining much of the impressive performance of a specialized OLAP system.
Benchmarking a TimescaleDB compressed hypertable against a plain PostgreSQL table, Cepa measured performance improvements of 5x to 35x, depending on the query type and time range, thanks to compression and sparse indexes. The community mainly questioned the decision not to use ClickHouse. On Hacker News, user arunmu comments:
The reason given for not using Clickhouse which they are already using for analytics was vague and ambiguous. Clickhouse does support JSON which can be rewritten into a more structured table using MV. Aggregation and other performance tuning steps are the bread and butter of using Clickhouse.
Ajay Kulkarni, cofounder of TigerData, the company behind TimescaleDB, replies:
PostgreSQL with TimescaleDB did the job. Why overcomplicate things?
Jamie Lord, solution architect at CDS UK, writes:
For teams already invested in the PostgreSQL ecosystem, this represents a compelling evolution rather than revolution. You retain all existing tooling, knowledge, and operational procedures whilst gaining analytical capabilities that rival purpose-built OLAP systems.
Source: Cloudflare blog
Following the implementation of the DEX project, TimescaleDB has been adopted as the aggregation layer on top of raw logs in other Cloudflare projects, such as Zero Trust Analytics & Reporting, to generate analytics and long-term reports for systems ingesting millions of rows per second.