Unless you’ve been on a “digital cleanse” this week, you know that Amazon Web Services (AWS) had a major outage at the start of the week.
You know this because apps and sites you use were down. Credible reports estimate at least 1,000 sites and apps were affected. Large swaths of modern digital life went dark: from finance (Venmo and Robinhood) to gaming (Roblox and Fortnite) to communications (Signal and Slack). Some people couldn’t even get a good night’s sleep because the outage took out “smart beds.” Even sporting events were impacted when Ticketmaster failed.
We’ve seen outages before, but this one seemed broader and harder to ignore.
In the wake of the outage, many well-intentioned hot takes boiled down to: “They should’ve used more cloud providers.”
Setting aside the subtle victim-blaming, there’s a practical problem: in a world with only three major cloud providers (AWS, Microsoft Azure and Google Cloud), there isn’t much diversity to “diversify” into.
And the argument for diversity in cloud providers is really about market diversity, not individual organizations juggling multiple vendors. More competition in the cloud market would mean fewer cascading failures when one provider goes down.
The key question when something like this happens is whether we take the lessons about risk and apply them beyond the immediate problem to the problems now emerging.
Instead of saying organizations need multiple cloud providers, we should be asking how we deal with the reality of highly concentrated risks with exceptionally broad impact. We just had an object lesson in what that really means.
The outage also points to where we should be applying this lesson proactively: generative AI. It offers two lessons for the emerging generative AI ecosystem.
Concentration crisis in AI
By the generative AI ecosystem, I don’t mean chatbots. I mean AI-native applications that are built on generative AI as a platform. We just saw that when there’s no cloud, there’s no cloud-native application. Likewise, when there’s no generative AI provider, there’s no AI-native application.
The first lesson from the AWS outage for AI-native applications is what happens when an industry depends on a limited number of providers for centralized resources and one of them has an outage. We just saw the answer: huge ripple effects across the industry and every walk of life built on it.
It’s a throwback to the mainframe era: when “the computer” is down, it’s down for everyone.
There are as few, if not fewer, generative AI providers as there are cloud providers. A major outage is inevitable — that’s just engineering reality. When that happens, every AI-native app built on that generative AI platform will also go down, full stop.
The impact could be even more severe than the AWS outage. It will be more like “the computer is down, and the people are gone” for many different industries and services. Ironically, the “smarter” the industry and service, the greater the potential fallout.
The second lesson is one of intertwined risk. OpenAI itself was affected by this week’s AWS outage.
That means AI-native apps have double exposure to the risks around a limited number of providers for critical, centralized resources. For AI-native apps, it’s like the mainframe era squared. If the generative AI platform fails, everything built on it fails. And if the cloud that hosts the AI platform fails, it all goes down, too.
This is not to say don’t do cloud or don’t do AI. It is to say we need to understand the new, complex intertwining of risks inherent in a world where everything relies on a small number of key providers, and those providers in turn rely on a small number of key providers of their own.
The realities of physical requirements and capital investment required for cloud and generative AI make a truly diverse ecosystem impracticable for either. I don’t think anyone sees more than a literal handful of providers for either of these in the future.
The bottom line
Highly concentrated risks with exceptionally broad impact aren’t going away anytime soon.
But the growth of generative AI providers, and their reliance on cloud providers, shows where the growth is headed and where and what those risks will be. The growth will be upward, as technologies stack on top of and rely on each other. And that means these risks are only going to become more concentrated and the impacts even broader.
In the world of security, there’s the “CIA” triad: “confidentiality,” “integrity” and “availability.” In the first days of “Trustworthy Computing” at Microsoft, the principles included “availability.” But in recent years, availability has often been overlooked as security and privacy concerns understandably dominate.
A thoughtful reading of the AWS outage tells us that outages like this aren’t an anomaly: they’re inherent in the nature of today’s technology reality. And since there are no easy solutions and only increasingly complex problems around this, we need to start understanding this new reality and thinking seriously about how to mitigate these risks.