ClickHouse recently described how it migrated to AWS Graviton over the past six months, reporting a 25% performance improvement for end users. The engineering team outlines the steps it took to establish a performance baseline and move the managed ClickHouse Cloud service to the new ARM deployment while handling large-scale production workloads.
ClickHouse, whose deployment previously ran primarily on x86-based instances such as M5 and R5, highlights the challenges of migrating data-intensive applications and the performance and cost benefits of adopting Graviton instances.
Summarizing the compatibility assessment, benchmarking strategy, and risk analysis and mitigation, Kaushik Iska, engineering manager at ClickHouse, and Francesco Ciocchetti, senior cloud infrastructure engineer, write:
We had to migrate some instances that were relying on intel only codecs such as deflate_qpl and zstd_qat. (…) To accurately measure the impact of the migration, we used a ClickBench “like” benchmark to test performance across both arm64 and amd64.
Source: ClickHouse blog
Released in 2016 under an Apache 2.0 license, ClickHouse is a popular open-source, column-oriented database for real-time workloads, with ClickHouse Cloud as its managed cloud-based offering.
Because larger Graviton instances had limited availability, the ClickHouse team adopted a mixed-instance strategy, combining Graviton2 (m6gd) and Graviton3 (m7gd) general-purpose instances with local NVMe-based SSD block storage. All Graviton instances were managed using AWS Auto Scaling Groups and a custom ClickHouse autoscaler. Iska and Ciocchetti explain:
To ensure smooth autoscaling with mixed instance types, we adjusted memory allocation on m6gd instances to match m7gd, preventing unschedulable pods. We also implemented dynamic pod allocation, directing those under 236Gi to ARM instances using a webhook with node selection changes. Furthermore, we overprovisioned capacity and optimized autoscaler logic to manage the “architecture jump” where pods might initially be scheduled on larger x86 instances and then potentially downsized to smaller ARM instances.
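The article does not show the webhook itself, but the routing rule the authors describe can be illustrated with a minimal sketch. The function names and the unit table below are hypothetical; only the 236Gi threshold comes from the quote, and `kubernetes.io/arch` is the standard Kubernetes node label for CPU architecture. A real mutating admission webhook would apply the same decision by patching the pod spec.

```python
# Hypothetical sketch of the routing rule described above; this is NOT
# ClickHouse's actual webhook code.

ARM_MEMORY_LIMIT_GIB = 236  # threshold quoted in the article

# Binary suffixes used by Kubernetes memory quantities, expressed in GiB.
_UNITS_GIB = {"Ki": 1 / 1024**2, "Mi": 1 / 1024, "Gi": 1, "Ti": 1024}


def memory_to_gib(quantity: str) -> float:
    """Convert a Kubernetes-style binary quantity such as '64Gi' to GiB."""
    for suffix, factor in _UNITS_GIB.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported memory quantity: {quantity!r}")


def choose_node_selector(memory_request: str) -> dict:
    """Return a nodeSelector that pins the pod to one CPU architecture.

    Pods requesting less than the threshold go to ARM nodes; larger pods
    stay on x86, where bigger instances are available.
    """
    under_limit = memory_to_gib(memory_request) < ARM_MEMORY_LIMIT_GIB
    return {"kubernetes.io/arch": "arm64" if under_limit else "amd64"}
```

For example, `choose_node_selector("64Gi")` selects `arm64` nodes, while a request of `"512Gi"` stays on `amd64`.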
Since its debut in 2018, AWS Graviton has evolved from the early Graviton1 instances with 16 Cortex-A72 cores to the latest Graviton4, which features 96 Neoverse-V2 cores and 12 channels of DDR5-5600 memory, providing significantly improved throughput and reduced latency. Reflecting on ClickHouse’s article, Corey Quinn, chief cloud economist at The Duckbill Group, writes:
Graviton is interesting, because for a while AWS was pushing it so hard that you just by default assumed that anyone talking about it was being required to do so by some contractual term or other. The reality is that it’s insanely cost effective and you should be using it.
Across three testing scenarios, the benchmarks showed an overall 25% performance improvement over a wide range of queries. Gains on the CI logs cluster were modest, primarily because that workload is network-bound when reading from S3. ClickBench, a standardized benchmark simulating analytics workloads, showed larger improvements, particularly in the Graviton4 scenario (r8g instances).
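The article does not say how the per-query results were aggregated into a single figure. One common way to summarize such a comparison is the geometric mean of per-query speedups, which keeps a few very slow queries from dominating the result. The sketch below assumes that approach; the function name is hypothetical and the timings in the usage note are invented for illustration, not taken from the ClickHouse results.

```python
import math


def geometric_mean_speedup(baseline_s, candidate_s):
    """Geometric mean of per-query speedups (baseline time / candidate time).

    A result of 1.25 means the candidate is, on this measure, 25% faster.
    """
    if len(baseline_s) != len(candidate_s) or not baseline_s:
        raise ValueError("need two non-empty timing lists of equal length")
    ratios = [b / c for b, c in zip(baseline_s, candidate_s)]
    # Average the logarithms, then exponentiate: exp(mean(log(r))).
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

With made-up timings, `geometric_mean_speedup([2.0, 8.0], [1.0, 2.0])` combines a 2x and a 4x speedup into roughly 2.83x, the square root of 8.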
According to the article, nearly 80% of ClickHouse’s production vCPUs now run on Graviton3 general-purpose (m7gd or m7g) and memory-optimized (r7g and r7gd) instances. Meanwhile, AMD instances have declined to 17.32%.