The massive outage that hit Amazon Web Services early Monday and took down several major sites and services was due to an internal issue within the cloud giant’s infrastructure.
In a new update Monday at 8:43 a.m. PT, Amazon said the root cause of the outage was an “underlying internal subsystem responsible for monitoring the health of our network load balancers.”
The outage impacted everything from sites including Facebook, Coinbase, and Amazon itself, to check-in kiosks at LaGuardia Airport.
Amazon said it was seeing connectivity and API recovery for AWS services.
Dr. Aybars Tuncdogan, an associate professor at King’s College London, said the outage serves as a warning sign of a potentially more disruptive situation.
“If a comparable vulnerability were deliberately targeted by malicious actors, the damage would be far worse,” Tuncdogan said.
The problems began shortly after midnight Pacific in Amazon’s Northern Virginia (US-EAST-1) region, which is AWS’s oldest and largest cloud region, a popular nerve center for online services. Major outages originating from this same region also caused widespread disruptions in 2017, 2021, and 2023.
In an initial update, AWS said the outage was related to a DNS resolution issue with DynamoDB, meaning the internet’s phone book failed to find the correct address for a database service used by thousands of apps to store and find data.
The latest outage suggests that many sites have not adequately implemented the redundancy needed to quickly fall back to other regions or cloud providers in the event of AWS outages.
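For readers wondering what falling back to another region looks like in practice, here is a minimal sketch in Python. It is illustrative only and does not describe how AWS or any affected site is built. The function `call_with_fallback`, the exception `AllRegionsFailed`, and the stand-in `fake_lookup` are hypothetical names, and the simulated failure in us-east-1 is an assumption made for the example.

```python
from typing import Callable, Dict, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when every configured region has failed (hypothetical helper)."""


def call_with_fallback(
    regions: Sequence[str], operation: Callable[[str], T]
) -> Tuple[str, T]:
    """Try the operation in each region in order; return the first success."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return region, operation(region)
        except Exception as exc:  # a real service would catch narrower errors
            errors[region] = exc
    raise AllRegionsFailed(errors)


def fake_lookup(region: str) -> str:
    """Stand-in for a database read; simulates an outage in us-east-1."""
    if region == "us-east-1":
        raise ConnectionError("DNS resolution failed")
    return f"record served from {region}"


served_by, result = call_with_fallback(["us-east-1", "us-west-2"], fake_lookup)
print(served_by, "->", result)
```

The sketch only covers the retry logic. A real deployment would also need the data replicated to the second region ahead of time, which is typically the harder and more expensive part.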
Tuncdogan said the deeper issue is “tech monoculture” in a global infrastructure with little diversity in platforms or providers.
“It’s like agricultural monoculture — when everything relies on a single strain, one disease can wipe out entire plantations, because they all have the same genetics,” he said.
He said that while customers can design redundancy themselves, the providers can also develop different competing infrastructures within their own ecosystems.
“This incident will likely be resolved quickly,” he said. “However, unless we rethink the architecture (that is, we decentralize and diversify), we should expect more outages of this scale, whether from glitches or targeted attacks.”