Cloudflare recently experienced a global outage caused by a database permission update, triggering widespread 5xx errors across its CDN and security services.
The disruption started around 11:20 UTC on November 18, cutting off access to customer sites and even locking Cloudflare’s own team out of their internal dashboards. According to a post-mortem published by CEO Matthew Prince, the root cause was a subtle regression introduced during a routine permissions improvement to the company’s ClickHouse database cluster.
Engineers were rolling out a change designed to improve security by making table access explicit for users. However, the update had an unforeseen side effect on the Bot Management system: a metadata query that had historically returned a clean list of columns from the default database suddenly began pulling in duplicate rows from the underlying r0 database shards.
Prince explained the technical nuance in the blog post:
The change… resulted in all users accessing accurate metadata about tables they have access to. Unfortunately, past assumptions held that the list of columns returned by a query like this would include only the ‘default’ database.
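To make the failure concrete, here is a minimal Rust sketch of how a metadata query that suddenly returns rows for both the default database and its underlying r0 shards can silently double a derived column list unless the consumer filters or deduplicates. The struct, table name, and filter are illustrative assumptions, not Cloudflare’s actual code or schema:

```rust
// Hypothetical illustration -- not Cloudflare's actual code or schema.
// One row of column metadata, as a system-catalog query might return it.
#[derive(Debug)]
struct ColumnMeta {
    database: String, // e.g. "default" or the underlying "r0" shard database
    table: String,
    name: String,
}

/// Build the list of feature column names. Before the permission change,
/// the catalog only exposed the "default" database, so no filter was needed;
/// afterwards, the same query also returns the r0 rows, duplicating every column.
fn feature_columns(rows: &[ColumnMeta]) -> Vec<String> {
    rows.iter()
        .filter(|r| r.database == "default") // the guard that keeps the list stable
        .map(|r| r.name.clone())
        .collect()
}

fn main() {
    let rows = vec![
        ColumnMeta { database: "default".into(), table: "features".into(), name: "feature_a".into() },
        ColumnMeta { database: "default".into(), table: "features".into(), name: "feature_b".into() },
        // After the change, the same query also surfaces the shard-level copies:
        ColumnMeta { database: "r0".into(), table: "features".into(), name: "feature_a".into() },
        ColumnMeta { database: "r0".into(), table: "features".into(), name: "feature_b".into() },
    ];
    // Without the database filter the list would be twice as long; with it, it stays at 2.
    println!("{} feature columns", feature_columns(&rows).len());
}
```

According to the post-mortem, the real query had no equivalent filter on the database name, so every column appeared twice in the generated output.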
This extra data caused the “feature file”, a configuration set used to track bot threats, to double in size. Cloudflare’s core proxy software pre-allocates memory for this file as a performance optimization, with a hard safety limit of 200 features. When the bloated file propagated across the network, it blew past that limit, causing the Bot Management module to crash.
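The pre-allocation behavior can be illustrated with a hedged sketch along these lines. The names, error type, and limit handling are assumptions rather than the proxy’s real implementation, but they show how a hard cap turns an oversized input into a fatal condition if the caller never handles the error:

```rust
// Illustrative sketch of a fixed pre-allocation with a hard cap -- the names
// and error handling are assumptions, not Cloudflare's actual proxy code.
const MAX_FEATURES: usize = 200; // hard safety limit described in the post-mortem

#[derive(Debug)]
struct FeatureFile {
    features: Vec<String>, // capacity reserved up front as a performance optimization
}

#[derive(Debug)]
enum LoadError {
    TooManyFeatures { got: usize, limit: usize },
}

fn load_feature_file(names: Vec<String>) -> Result<FeatureFile, LoadError> {
    if names.len() > MAX_FEATURES {
        // Exceeding the pre-allocated limit is surfaced as an error; if the caller
        // simply unwraps it instead of handling it, the whole module goes down.
        return Err(LoadError::TooManyFeatures { got: names.len(), limit: MAX_FEATURES });
    }
    let mut features = Vec::with_capacity(MAX_FEATURES);
    features.extend(names);
    Ok(FeatureFile { features })
}

fn main() {
    // A doubled feature file blows past the 200-feature cap and is rejected.
    let doubled: Vec<String> = (0..260).map(|i| format!("feature_{i}")).collect();
    match load_feature_file(doubled) {
        Ok(f) => println!("loaded {} features", f.features.len()),
        Err(e) => eprintln!("refused oversized feature file: {e:?}"),
    }
}
```

A caller that unwraps such a result without a fallback converts an oversized input file into a crash rather than a recoverable error, which is consistent with the failure described in the post-mortem.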
The incident was challenging to diagnose because of how the symptoms presented. Since the database updates were rolling out gradually, each refresh of the configuration file could come from either updated or not-yet-updated nodes, so the system kept flipping between a “good” state and a “bad” state every few minutes. This erratic behavior initially convinced the engineering team that they were fighting a hyper-scale DDoS attack rather than an internal bug. Confusion peaked when Cloudflare’s external status page also went down, a complete coincidence that led some to believe the support infrastructure was being targeted.
A respondent on a Reddit thread commented:
You don’t realize how many websites use Cloudflare until Cloudflare stops working. Then you try to look up how many websites use Cloudflare, but you can’t because all the Google results that would answer your question also use Cloudflare.
“That there was a period of time where our network was not able to route traffic is deeply painful to every member of our team,” Prince wrote, noting this was the company’s most significant outage since 2019.
While users struggled with the outage, Dicky Wong, CEO of Syber Couture, pointed to the incident as validation of multi-vendor strategies. Responding to the event, he commented that while Cloudflare offers a brilliant suite of tools, “love is not the same as marriage without a prenup.” Wong argued that risk management requires a lifestyle shift towards active multi-hybrid strategies to avoid the “single-point-of-failure physics” that defined this outage.
This sentiment was echoed by users on the r/webdev subreddit, where user crazyrebel123 noted the fragility of the current internet landscape:
The problem nowadays is that you have a few large companies that runs or owns the majority of things on the internet. So when one of them goes down, the entirety of the internet goes down with it. Most sites now run on AWS or some form of other cloud service.
Senior Technology Leader Jonathan B. reinforced this view on LinkedIn, criticizing organizations’ tendency to bet the farm on a single vendor for the sake of “simplicity.”
It’s simple, yes — right up until that vendor becomes the outage everyone is tweeting about… People call hybrid ‘old school,’ but honestly? It’s just responsible engineering. It’s acknowledging that outages happen, no matter how big the logo is on the side of the cloud.
Service was eventually restored by manually pushing a known-good version of the configuration file into the distribution queue. Traffic flows normalized by 14:30 UTC, with the incident fully resolved by late afternoon. Cloudflare says it is now reviewing failure modes across all its proxy modules to ensure memory pre-allocation limits handle bad inputs more gracefully in the future.
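One common pattern for handling bad inputs more gracefully is to validate each new configuration and fall back to the last known-good one on failure. The sketch below is a generic illustration of that idea under assumed names, not a description of Cloudflare’s planned changes:

```rust
// Hypothetical sketch of graceful degradation: if a new config fails validation,
// keep serving with the last known-good one instead of crashing the module.
// Names and structure are assumptions, not Cloudflare's design.
#[derive(Clone, Debug)]
struct BotConfig {
    features: Vec<String>,
}

struct BotModule {
    active: BotConfig, // last known-good configuration
}

impl BotModule {
    /// Try to apply a new config; on failure, log the reason and keep the current one.
    fn apply(&mut self, candidate: Result<BotConfig, String>) {
        match candidate {
            Ok(cfg) => {
                self.active = cfg;
                println!("applied new config with {} features", self.active.features.len());
            }
            Err(reason) => {
                // The oversized or corrupt file is rejected, but traffic keeps flowing
                // with the previous configuration.
                eprintln!("rejected bad config ({reason}); keeping last known-good");
            }
        }
    }
}

fn main() {
    let mut module = BotModule { active: BotConfig { features: vec!["feature_a".into()] } };
    module.apply(Err("exceeds 200-feature limit".into()));
    module.apply(Ok(BotConfig { features: (0..60).map(|i| format!("f{i}")).collect() }));
}
```

The trade-off is that rejecting a corrupt configuration while continuing to serve with the previous one keeps the data plane up at the cost of temporarily stale bot-detection rules.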
