The Canva engineering team recently published their post-mortem on the outage they experienced last November, detailing the API Gateway failure and the lessons learned during the incident. Brendan Humphreys, Canva’s CTO, acknowledges:
On November 12, 2024, Canva experienced a critical outage that affected the availability of canva.com. From 9:08 AM UTC to approximately 10:00 AM UTC, canva.com was unavailable. This was caused by our API Gateway cluster failing due to multiple factors, including a software deployment of Canva’s editor, a locking issue, and network issues in Cloudflare, our CDN provider.
Canva’s editor is a single-page application, deployed multiple times a day, with client devices fetching new assets through Cloudflare using a tiered caching system. However, a routing issue within the CDN provider disrupted traffic between two regions, so requests for the new assets queued up instead of completing. When the assets finally became available on the CDN, all waiting clients began downloading them simultaneously, with over 270,000 pending requests completing at the same time. Humphreys explains:
Normally, an increase in errors would cause our canary system to abort a deployment. However, in this case, no errors were recorded because requests didn’t complete. As a result, over 270,000+ user requests for the JavaScript file waited on the same cache stream.
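The pile-up behind a single cache stream is a classic request-coalescing failure mode: a tiered cache deliberately collapses concurrent requests for the same object onto one upstream fetch, so when that fetch is slow, every waiting client is released at the same instant. The sketch below is a minimal illustration of that dynamic, not Canva's or Cloudflare's implementation; the `CoalescingCache` class, the asset name, the client count, and the delay are all assumptions made for the example.

```python
import asyncio
import time


class CoalescingCache:
    """Collapse concurrent requests for the same key onto one upstream fetch."""

    def __init__(self):
        self._inflight: dict[str, asyncio.Future] = {}

    async def get(self, key, fetch_origin):
        if key in self._inflight:
            # Another request already triggered the fetch: wait on its stream.
            return await self._inflight[key]
        self._inflight[key] = asyncio.get_running_loop().create_future()
        body = await fetch_origin(key)          # the single slow origin fetch...
        self._inflight[key].set_result(body)    # ...releases every waiter at once
        return body


async def slow_origin(key):
    await asyncio.sleep(3)                      # stand-in for the degraded CDN route
    return f"<contents of {key}>"


async def main():
    cache, completion_times = CoalescingCache(), []

    async def client():
        await cache.get("editor.js", slow_origin)
        completion_times.append(time.monotonic())

    await asyncio.gather(*(client() for _ in range(10_000)))
    spread = max(completion_times) - min(completion_times)
    print(f"{len(completion_times)} clients completed within {spread:.3f}s of each other")


asyncio.run(main())
```

Running the sketch shows thousands of simulated clients finishing within a few milliseconds of one another: the cache has protected the origin from repeated fetches, but at the cost of synchronizing every client's next move.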
Source: Canva Engineering Blog
Lorin Hochstein, staff software engineer at Airbnb and author of the Surfing Complexity blog, describes the outage as a tale of saturation and resilience. He highlights:
The incident wasn’t triggered by a bug in the code in the new version, or even by some unexpected emergent behavior in the code of this version. No, while the incident was triggered by a deploy, the changes from the previous version are immaterial to this outage. Rather, it was the system behavior that emerged from clients downloading the new version that led to the outage.
Once the file was finally served, the new object panel loaded simultaneously across all waiting devices, generating over 1.5 million requests per second to the API Gateway, a surge approximately three times the typical peak load. This wave turned the load balancer into an “overload balancer,” pushing healthy nodes into an unhealthy state. Hochstein adds:
This is a classic example of a positive feedback loop: the more tasks go unhealthy, the more traffic the healthy nodes received, the more likely those tasks will go unhealthy as well.
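A toy model makes the loop concrete: with a fixed volume of incoming traffic, each task that drops out of the load balancer's healthy set increases the per-task load on the survivors, which pushes them past their own limit in turn. The fleet size, per-task capacity, and failure fraction below are invented for illustration; only the roughly 1.5 million requests per second figure comes from the write-ups.

```python
# Illustrative numbers only: fleet size, per-task capacity, and the fraction of
# overloaded tasks that fail per step are assumptions, not Canva's figures.
TOTAL_RPS = 1_500_000        # surge reported in the post-mortem (~3x typical peak)
PER_TASK_CAPACITY = 9_000    # hypothetical requests/sec one gateway task can absorb
healthy_tasks = 150          # hypothetical fleet size at the start of the surge

step = 0
while healthy_tasks > 0:
    per_task_load = TOTAL_RPS / healthy_tasks
    print(f"step {step}: {healthy_tasks:3d} healthy tasks, {per_task_load:,.0f} rps each")
    if per_task_load <= PER_TASK_CAPACITY:
        print("fleet can absorb the load, loop stops")
        break
    # Overloaded tasks run out of memory faster than autoscaling can replace
    # them; assume a fifth of the fleet drops out before the next iteration.
    healthy_tasks -= max(1, int(healthy_tasks * 0.2))
    step += 1
else:
    print("no healthy tasks left: the feedback loop has collapsed the cluster")
```

Because the traffic volume never shrinks, the per-task load only grows as tasks fail, and the loop runs until the simulated cluster has no healthy tasks left.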
As autoscaling failed to keep pace, API Gateway tasks began failing due to memory exhaustion, ultimately leading to a complete collapse. To address the issue, Canva’s team attempted to manually increase capacity while simultaneously reducing the load on the nodes, achieving mixed results. The situation was finally mitigated when traffic was entirely blocked at the CDN layer. Humphreys details:
At 9:29 AM UTC, we added a temporary Cloudflare firewall rule to block all traffic at the CDN. This prevented any traffic reaching the API Gateway, allowing new tasks to start up without being overwhelmed with incoming requests. We later redirected canva.com to our status page to make it clear to users that we were experiencing an incident.
The Canva engineers gradually ramped traffic back up, fully restoring it in approximately 20 minutes. In a popular Hacker News thread, John Nagle comments:
This problem is similar to what electric utilities call “load takeup”. After a power outage, when power is turned back on, there are many loads that draw more power at startup. (…) Bringing up a power grid is thus done by sections, not all at once.
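Nagle's analogy maps onto how traffic is typically restored after such an outage: admit only a fraction of requests at the edge and widen the gate in stages, with the 0% stage corresponding to the temporary block-all rule. The sketch below simulates that staged admission; the percentages, batch size, and the hypothetical admit() gate are assumptions for illustration, not Canva's actual runbook.

```python
import random

RAMP_STEPS = [0, 10, 25, 50, 75, 100]   # % of traffic admitted at each stage


def admit(percent_allowed: int) -> bool:
    """Probabilistically let a request through to the origin; shed the rest."""
    return random.randrange(100) < percent_allowed


for percent in RAMP_STEPS:
    # In practice each stage would be held until gateway metrics look healthy;
    # here we simply sample a batch of simulated requests per stage.
    admitted = sum(admit(percent) for _ in range(10_000))
    print(f"admitting {percent:3d}%: {admitted:5d} of 10,000 requests reach the API Gateway")
```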
While the system behaved as designed, with all functional requirements met, its automated mechanisms ended up exacerbating the problem. Hochstein highlights:
It was up to the incident responders to adapt the behavior of the system, to change the way it functioned in order to get it back to a healthy state. (…) This is a classic example of resilience, of acting to reconfigure the behavior of your system when it enters a state that it wasn’t originally designed to handle.
Humphreys concludes on LinkedIn:
The full picture took some time to assemble, in coordination with our very capable and helpful partners at Cloudflare (…) a riveting tale involving lost packets, cache dynamics, traffic spikes, thread contention, and task headroom.
To minimize the likelihood of similar incidents in the future, the team focused on improving its incident response processes, including runbooks for traffic blocking and restoration, and on increasing the resilience of the API Gateway.