A detailed explanation of this week’s Amazon Web Services outage, released Thursday morning, confirms that it wasn’t a hardware glitch or an outside attack but a complex, cascading failure triggered by a rare software bug in one of the company’s most critical systems.
The company said a “faulty automation” in its internal systems, two independent programs that began racing each other to update the same records, erased key DNS entries for its DynamoDB database service (the records other systems use to locate it), triggering a domino effect that temporarily broke many other AWS tools.
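AWS's write-up describes the mechanism only in broad strokes, but the failure mode it sketches is a classic race condition. The Python sketch below is a hypothetical illustration, not Amazon's actual code; every name, data structure, and detail in it is an assumption. It shows how one updater lagging behind another, combined with a cleanup step that deletes records it considers outdated, can erase a critical entry entirely.

```python
import threading
import time

# Hypothetical illustration only; the names, data model, and logic here are
# assumptions, not AWS's actual systems.

dns_table = {}           # endpoint -> (plan_id, list of addresses)
lock = threading.Lock()  # guards individual writes, but not the whole sequence

def apply_plan(plan_id, addresses, delay=0.0):
    """One automation worker applying a DNS plan, possibly after a delay."""
    time.sleep(delay)  # simulate this worker lagging behind its peer
    with lock:
        dns_table["dynamodb.example"] = (plan_id, addresses)

def cleanup_stale_plans(latest_plan_id):
    """Delete any record belonging to a plan older than the latest one.

    If a lagging worker re-applied an old plan after the newer one landed,
    this step wipes the live record entirely.
    """
    with lock:
        current = dns_table.get("dynamodb.example")
        if current and current[0] < latest_plan_id:
            del dns_table["dynamodb.example"]

# Worker A applies the new plan 2 immediately; worker B lags, then
# re-applies the older plan 1, overwriting A's result.
a = threading.Thread(target=apply_plan, args=(2, ["10.0.0.2"]))
b = threading.Thread(target=apply_plan, args=(1, ["10.0.0.1"], 0.1))
a.start(); b.start(); a.join(); b.join()

cleanup_stale_plans(latest_plan_id=2)
print(dns_table.get("dynamodb.example"))  # None: the endpoint no longer resolves
```

Run as written, the lagging worker's stale write lands after the newer one, and the cleanup then removes the record outright, leaving lookups with nothing to resolve.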
AWS said it has turned off the flawed automation worldwide and will fix the bug before bringing it back online. The company also plans to add new safety checks and improve how quickly its systems recover if something similar happens again.
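AWS has not publicly detailed what those safety checks will look like. As a minimal, hypothetical sketch of the kind of guardrail that blocks this failure mode (assumed names and logic, continuing the example above), an updater could refuse stale or destructive writes before they reach the live table:

```python
def apply_plan_safely(table, endpoint, plan_id, addresses):
    # Hypothetical guardrail, not AWS's published fix: reject any update
    # older than the live plan, and never publish an empty record set.
    current = table.get(endpoint)
    if current and plan_id <= current[0]:
        raise ValueError(f"stale plan {plan_id}; live plan is {current[0]}")
    if not addresses:
        raise ValueError("refusing to publish an empty record set")
    table[endpoint] = (plan_id, addresses)

dns_table = {"dynamodb.example": (2, ["10.0.0.2"])}
try:
    apply_plan_safely(dns_table, "dynamodb.example", 1, ["10.0.0.1"])
except ValueError as err:
    print(err)  # stale plan 1; live plan is 2
```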
Amazon apologized and acknowledged the widespread disruption caused by the outage.
“While we have a strong track record of operating our services with the highest levels of availability, we know how critical our services are to our customers, their applications and end users, and their businesses,” the company said, promising to learn from the incident.
The outage began early Monday and affected sites and online services around the world, again illustrating the internet's deep reliance on Amazon's cloud and how a single failure inside AWS can quickly ripple across the web.
Related: The AWS outage is a warning about the risks of digital dependence and AI infrastructure
