The content delivery network (CDN) Cloudflare suffered an outage lasting several hours yesterday that left a multitude of websites and online systems inaccessible. X, Canva, the game League of Legends and ChatGPT were just some of the services that stopped working due to the failure of this provider, which serves approximately 20% of the web. Even several services dedicated to detecting which websites or online services are having problems, such as Downdetector, stopped working.
Once the problem was resolved, Cloudflare's CEO, Matthew Prince, offered an explanation of what happened in a post on the company's blog, in which he confirms that the outage was due to an internal failure and not a cyber attack. The incident, described by Prince as the company's «worst outage since 2019», was caused by a problem in the bot management system, which controls which automated crawlers are or are not authorized to scan the content of the websites served through its CDN.
Cloudflare’s content delivery network is responsible for distributing the content load, with the aim not only of speeding up the operation of the websites it has among its clients, but also of keeping those pages online and serving them when they experience a sudden and/or notable increase in traffic, as well as when they suffer a distributed denial of service (DDoS) attack.
The bot management system, where the problem originated, is responsible for dealing with issues caused by, among other things, crawlers that extract information from pages and then use it to train generative AI models. The source of the failure, however, was not these crawlers, a cyber attack (the initial suspicion) or a DNS problem, but rather a series of changes to the permissions system of a database.
Apparently, the machine learning model that drives the aforementioned bot management system, and which is responsible for generating scores for the requests that travel through its network, relies on a configuration file that is updated frequently and helps with the centralized identification of those requests.
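To make the role of that file more concrete, here is a minimal sketch of what such a feature configuration could look like: a list of signal definitions that the proxy reloads periodically and feeds to the scoring model. The struct, field names and JSON format are assumptions for illustration only, not Cloudflare's actual schema.

```rust
use serde::Deserialize;

/// Hypothetical shape of one row in the bot-management feature file
/// (illustrative names only, not Cloudflare's real schema).
#[derive(Debug, Deserialize)]
struct FeatureRow {
    name: String,  // a traffic signal the model scores against
    version: u32,  // bumped each time the file is regenerated
    weight: f64,   // contribution of the signal to the final bot score
}

/// The proxy would periodically reload a file like this and feed its
/// rows to the machine-learning model that scores each request.
fn load_features(raw: &str) -> Result<Vec<FeatureRow>, serde_json::Error> {
    serde_json::from_str(raw)
}

fn main() {
    let raw = r#"[{"name":"example_signal","version":1,"weight":0.42}]"#;
    let rows = load_features(raw).expect("valid feature file");
    println!("loaded {} feature rows", rows.len());
}
```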
A change «in the behavior of the ClickHouse query that generates this file caused it to have a large number of duplicate feature rows». This change caused the ClickHouse database to duplicate information, making the file grow rapidly until it exceeded preconfigured memory limits. The result? The failure «of the main proxy system that handles traffic processing» for Cloudflare customers, and specifically «of any traffic that depends on the bot module».
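The sketch below illustrates that failure mode under a simple assumption: a hard, preconfigured cap on the number of rows the proxy will accept from the file. The constant, its value and the function names are invented for illustration; only the mechanism of a file growing past a preconfigured limit and taking down the module that depends on it comes from Cloudflare's explanation.

```rust
/// Hypothetical hard cap on feature rows; the post describes a
/// preconfigured memory limit, but this value is an assumption.
const MAX_FEATURE_ROWS: usize = 200;

/// Validate a freshly generated feature file before swapping it in.
/// Once the upstream ClickHouse query started emitting duplicate rows,
/// a check like this would start failing on every refresh, breaking
/// whatever depends on it, in this case the proxy's bot module.
fn validate_feature_file(rows: &[String]) -> Result<(), String> {
    if rows.len() > MAX_FEATURE_ROWS {
        return Err(format!(
            "feature file has {} rows, exceeding the preconfigured limit of {}",
            rows.len(),
            MAX_FEATURE_ROWS
        ));
    }
    Ok(())
}

fn main() {
    // Duplicated rows roughly double the file: a file that used to fit
    // now trips the limit and the new configuration is rejected.
    let duplicated: Vec<String> = (0..150)
        .flat_map(|i| vec![format!("feature_{i}"), format!("feature_{i}")])
        .collect();
    println!("{:?}", validate_feature_file(&duplicated));
}
```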
At this point, rules that companies had configured on Cloudflare to block certain bots began returning false positives and blocking real traffic. Cloudflare customers who do not use bot scoring in their rules were not affected.
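As a rough illustration of how those false positives could arise, the sketch below models a hypothetical customer rule that blocks requests below a certain bot score, together with the assumption that a request left unscored (because the module is down) collapses to the most bot-like value. Both the threshold and that fallback behavior are assumptions, not details from Cloudflare's write-up.

```rust
/// Hypothetical customer rule: block any request the bot-management
/// module scores below this threshold (lower score = more bot-like).
const BLOCK_BELOW_SCORE: u8 = 30;

/// If the bot module is down, a request may arrive with no score at all.
/// Treating a missing score as the most bot-like value is one plausible
/// way legitimate visitors end up blocked, as described above.
fn should_block(bot_score: Option<u8>) -> bool {
    bot_score.unwrap_or(0) < BLOCK_BELOW_SCORE
}

fn main() {
    // A real browser request that never got scored because the bot
    // module failed is treated exactly like an obvious automated client.
    assert!(should_block(None));
    assert!(!should_block(Some(90)));
    println!("unscored request blocked: {}", should_block(None));
}
```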
Now that the problem has been solved, the company will take measures to prevent it from happening again. It plans to harden the handling of configuration files generated by Cloudflare itself, treating them with the same caution as user-generated input.
Additionally, it will put in place more global “kill switches” for features, and will eliminate the possibility that crash dumps or other error reports could overwhelm system resources. Finally, it will also review the failure modes for error conditions across all major proxy modules.
