Due to human error in handling a phishing report and insufficient validation safeguards in admin tools, Cloudflare experienced an incident affecting its R2 Gateway service on February 6th. As part of a routine remediation of a phishing URL, the R2 service was inadvertently taken down, causing outages or degraded service across numerous other Cloudflare services for over an hour.
According to Cloudflare’s incident report, the R2 Gateway service was taken down by a Cloudflare employee attempting to block a phishing site hosted on R2. All operations involving R2 buckets and objects, including uploads, downloads, and metadata operations, were affected. Matt Silverlock, Senior Director of Product at Cloudflare, and Javier Castro explain:
The incident occurred due to human error and insufficient validation safeguards during a routine abuse remediation for a report about a phishing site hosted on R2. The action taken on the complaint resulted in an advanced product disablement action on the site that led to disabling the production R2 Gateway service responsible for the R2 API.
Source: Cloudflare blog
Cloudflare R2, an S3-compatible object storage service with no egress charges, has been generally available since 2022 and is one of Cloudflare’s core offerings. While the company emphasized that the incident did not result in data loss or corruption within R2, many services were impacted in a cascading manner: Stream, Images, and Vectorize experienced downtime or significantly elevated error rates, while only a small fraction (0.002%) of deployments to Workers and Pages projects failed during the primary incident window. Silverlock and Castro add:
At the R2 service level, our internal Prometheus metrics showed R2’s SLO near-immediately drop to 0% as R2’s Gateway service stopped serving all requests and terminated in-flight requests (…) Remediation and recovery was inhibited by the lack of direct controls to revert the product disablement action and the need to engage an operations team with lower level access than is routine. The R2 Gateway service then required a re-deployment in order to rebuild its routing pipeline across our edge network.
Source: Cloudflare blog
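For context, R2’s S3 compatibility means that bucket and object operations are typically issued through standard S3 SDKs pointed at an account-specific R2 endpoint, and all of those requests are served by the same R2 Gateway that the disablement action took offline. The sketch below illustrates this with the AWS SDK for JavaScript; the account ID, bucket, key, and credential variables are placeholders, not details from the incident.

```typescript
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

// R2 exposes an account-scoped, S3-compatible endpoint; every request sent
// to it is handled by the R2 Gateway service that the disablement took down.
const s3 = new S3Client({
  region: "auto",
  endpoint: "https://<ACCOUNT_ID>.r2.cloudflarestorage.com", // placeholder account ID
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID ?? "",          // placeholder R2 API token
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY ?? "",
  },
});

// Uploads, downloads, and metadata operations all travel through the same
// gateway, which is why every R2 bucket and object operation failed together.
const response = await s3.send(
  new GetObjectCommand({ Bucket: "example-bucket", Key: "example-object" })
);
console.log(response.ContentLength);
```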
The incident report was published just a few hours after the event, and in a popular Reddit thread, many users praised Cloudflare’s transparency and the level of detail provided. User JakeSteam writes:
Really appreciated the detailed minute by minute breakdown, helping highlight exactly why each minute of delay existed. Great work as always by cloudflare, turning something bad into a learning opportunity for all.
User Miasodasto13 adds:
Gotta love their transparency. Also, I can’t imagine the adrenaline rush of experiencing such an event as an engineer. It must feel like disarming a ticking bomb. With each minute of downtime passing, the higher the consequences.
Amanbolat Balabekov, Staff Software Engineer at Delivery Hero, offers a different perspective:
You’d think teams would build internal tools specifically for situations like this, but ironically, Cloudflare’s tools failed precisely when they were needed most. It looks like to recover the service, they need to use the service itself, which creates this crazy cyclic dependency.
Cloudflare has outlined several remediation and follow-up steps to address the validation gaps and prevent similar failures in the future. These include restricting access to product disablement actions and requiring two-party approval for ad-hoc product disablements. Additionally, the team is working on expanding abuse checks to prevent the accidental blocking of internal hostnames, thereby reducing the blast radius of both system- and human-driven actions.
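Cloudflare has not published the implementation of these safeguards; the sketch below is purely illustrative, showing how such a pre-action guard could combine the two measures. The request shape, hostname suffix list, and function name are all invented for the example.

```typescript
// Hypothetical guard (not Cloudflare's actual tooling) that an admin tool
// could run before executing an ad-hoc product disablement action.
interface DisablementRequest {
  targetHostname: string; // hostname named in the abuse report
  requestedBy: string;    // engineer acting on the report
  approvedBy?: string;    // second approver, required for ad-hoc disablements
}

// Illustrative list of internal service hostname suffixes that abuse
// remediation must never disable, regardless of what a report claims.
const PROTECTED_INTERNAL_SUFFIXES = [".internal.example.com", ".gateway.example.com"];

function validateDisablement(req: DisablementRequest): void {
  // Expanded abuse check: refuse to act on internal hostnames, limiting the
  // blast radius of both system- and human-driven actions.
  if (PROTECTED_INTERNAL_SUFFIXES.some((suffix) => req.targetHostname.endsWith(suffix))) {
    throw new Error(`refusing to disable protected internal hostname: ${req.targetHostname}`);
  }
  // Two-party approval: the approver must exist and differ from the requester.
  if (!req.approvedBy || req.approvedBy === req.requestedBy) {
    throw new Error("ad-hoc product disablement requires approval from a second person");
  }
  // ...proceed with a narrowly scoped disablement of the reported site only
}
```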