The Massive Azure Outages Are Over, But The Problems Remain. This Is What Happened

ZDNET’s key takeaways

Microsoft Azure suffered a global outage on October 29.
Microsoft’s customer-facing services were affected.
Recovery came later that same day, but some problems still linger.

Last week, Amazon Web Services (AWS) went down, leaving many of us miserable. This week it was Microsoft Azure’s turn to fall over, and once again we’re pretty unhappy about it.

Microsoft said the latest Azure outage began around noon ET on October 29. However, Downdetector, which relies on user reports, showed that the problems surfaced earlier, around 11:40 AM ET.

ThousandEyes, Cisco’s network intelligence company, reported detecting “HTTP timeouts, server error codes, and increased packet loss at the edge of Microsoft’s network, preventing successful connections to affected services and frequently causing timeouts or recurrence of service-related errors.”

The latest status update

On October 29 at 5:30 PM ET, Microsoft reported: “We have begun deploying our ‘last known good’ configuration, which has now completed successfully. We are currently restoring nodes and redirecting traffic through healthy nodes.”

However, Microsoft continues: “As recovery continues, some requests may still end up on unhealthy nodes, resulting in periodic failures or reduced availability until more nodes are fully recovered. This recovery effort includes reloading configurations and redistributing traffic across a large number of nodes to restore full operational scale. The process is gradual in design, ensuring stability and preventing overload as dependent services recover. We expect continued improvement in the affected regions. This means we expect recovery to occur by 23:20 UTC on October 29, 2025.”

That’s 7:20 PM ET.
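For readers who want to double-check the conversion from Microsoft’s UTC estimate to Eastern time, here is a minimal sketch. It assumes US Eastern time was still on daylight time (UTC-4) on October 29, 2025, and uses a fixed offset instead of a time zone database. The function name is purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: US Eastern time was on daylight time (UTC-4) on October 29, 2025.
EDT = timezone(timedelta(hours=-4), name="EDT")


def utc_to_eastern(utc_time: datetime) -> datetime:
    """Convert a timezone-aware UTC datetime to Eastern Daylight Time."""
    return utc_time.astimezone(EDT)


# Microsoft's recovery estimate as quoted above: 23:20 UTC on October 29, 2025.
estimate_utc = datetime(2025, 10, 29, 23, 20, tzinfo=timezone.utc)
estimate_et = utc_to_eastern(estimate_utc)

print(estimate_et.strftime("%I:%M %p %Z"))  # prints 07:20 PM EDT
```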

In reality it took a little longer. Azure reported that it was back to normal yesterday at 8:05 PM ET. Even then, Microsoft warned that customer configuration changes to Azure Front Door (AFD) would remain temporarily blocked. Microsoft promised that it would notify customers once this block is lifted. While “error rates and latency have returned to pre-incident levels, a small number of customers may still experience issues, and we are still working to mitigate this long tail.”

If you are still having issues today, please contact Azure. If things are really messed up, Microsoft recommends that you implement existing failover strategies using Azure Traffic Manager to redirect traffic from Azure Front Door to your origin servers as an interim measure. This is far from an easy solution. If your staff has no experience with Azure traffic routing, I would grit my teeth and wait for Azure to come fully back online.
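To make the idea of failing over from a front door to an origin concrete, here is a conceptual sketch of priority-based failover: try the preferred endpoint first and fall back to the next one that passes a health check. This is not Azure Traffic Manager’s API or configuration. The `Endpoint` class, the `pick_endpoint` function, the example.com URLs, and the simulated probe results are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Endpoint:
    name: str      # hypothetical label for the endpoint
    url: str       # placeholder URL, not a real service
    priority: int  # lower number means more preferred


def pick_endpoint(
    endpoints: Iterable[Endpoint],
    is_healthy: Callable[[Endpoint], bool],
) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest priority number, if any."""
    for endpoint in sorted(endpoints, key=lambda e: e.priority):
        if is_healthy(endpoint):
            return endpoint
    return None


endpoints = [
    Endpoint("front-door", "https://app.example.com", priority=1),
    Endpoint("origin", "https://origin.example.com", priority=2),
]

# Simulated health probe results: the front door is failing, the origin is up.
probe_results = {"front-door": False, "origin": True}

chosen = pick_endpoint(endpoints, lambda ep: probe_results[ep.name])
print(chosen.name)  # prints origin
```

In a real setup the health check would be an actual probe against each endpoint, and the switch would happen at the DNS or routing layer, not in application code.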

The AWS outage, while massive in its damage, was limited to a single region (US-EAST-1). By contrast, all Azure regions were offline as of 1:30 PM ET, according to the Azure Status page.

Tracing the faulty deployment

We still don’t have a definitive report on what happened. Initially, Microsoft only said: “Starting at approximately 4:00 PM UTC, we began experiencing issues with Azure Front Door (AFD), resulting in a loss of availability of some services. We suspect that an inadvertent configuration change was the trigger for this issue. We are taking two simultaneous actions where we block all changes to the AFD services while simultaneously reverting back to our last known good state.”

Microsoft’s initial report on the incident stated: “An inadvertent tenant configuration change within AFD caused widespread service disruption that affected both Microsoft services and customer applications that rely on AFD for global content delivery.” The change caused an invalid configuration state, which in turn caused a significant number of AFD nodes to fail to load properly, leading to increased latencies, timeouts, and connection errors for downstream services. In other words, it was a complete mess.

As unhealthy nodes disappeared from the global pool, traffic distribution across healthy nodes became unbalanced, amplifying the impact and causing intermittent availability even in partially healthy regions. Microsoft “immediately blocked all further configuration changes to prevent further spread of the failed state and began deploying a ‘last known good’ configuration across the world. Recovery required reloading configurations across a large number of nodes and gradually rebalancing traffic to avoid overload as nodes were returned to service. This deliberate, phased recovery was necessary to stabilize the system while restoring scale and preventing the problem from reoccurring.”
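A toy model shows why losing nodes amplifies the damage and why recovery has to be phased. This sketch assumes traffic is spread evenly across the healthy nodes, which is a simplification, and every number in it (100 nodes, 100,000 requests, batches of 20) is made up for illustration and is not Microsoft’s data.

```python
from typing import Iterator


def load_per_node(total_requests: int, healthy_nodes: int) -> float:
    """Requests each healthy node must absorb, assuming an even split."""
    if healthy_nodes <= 0:
        raise ValueError("no healthy nodes left to serve traffic")
    return total_requests / healthy_nodes


def phased_recovery(healthy: int, total_nodes: int, batch: int) -> Iterator[int]:
    """Yield the healthy-node count after each batch of nodes returns."""
    while healthy < total_nodes:
        healthy = min(total_nodes, healthy + batch)
        yield healthy


TOTAL_NODES = 100         # illustrative pool size
TOTAL_REQUESTS = 100_000  # illustrative steady request volume

# As nodes drop out, the same traffic lands on fewer and fewer machines.
for healthy in (100, 50, 20):
    print(f"{healthy} healthy nodes: {load_per_node(TOTAL_REQUESTS, healthy):.0f} requests each")

# Bringing nodes back in batches lowers the per-node load step by step.
for healthy in phased_recovery(20, TOTAL_NODES, batch=20):
    print(f"recovering, {healthy} healthy nodes: {load_per_node(TOTAL_REQUESTS, healthy):.0f} requests each")
```

With these made-up numbers, a pool that has shrunk to 20 nodes carries five times the normal per-node load, which is the kind of pressure a gradual rebalancing is meant to relieve.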

The issue can be traced back to a faulty tenant configuration deployment process. “Our protection mechanisms, to validate and block any erroneous deployments, failed due to a software bug that allowed the deployment to bypass security validations. Since then, security measures have been revised and additional validation and rollback controls have been immediately implemented to prevent similar issues in the future.”
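The general pattern Microsoft describes (validate a change before deploying it, keep a “last known good” copy, and block further changes while rolling back) can be sketched in a few lines. This is a generic illustration and not how AFD is implemented. The `ConfigStore` class, its method names, and the `has_routes` check are all hypothetical.

```python
import copy
from typing import Callable


class ConfigStore:
    """Holds an active config plus a 'last known good' copy to roll back to."""

    def __init__(self, initial: dict, validate: Callable[[dict], bool]):
        if not validate(initial):
            raise ValueError("initial configuration is invalid")
        self._validate = validate
        self.active = copy.deepcopy(initial)
        self.last_known_good = copy.deepcopy(initial)
        self.changes_blocked = False

    def deploy(self, candidate: dict) -> bool:
        """Apply a candidate only if changes are allowed and it validates."""
        if self.changes_blocked or not self._validate(candidate):
            return False
        self.active = copy.deepcopy(candidate)
        return True

    def confirm_healthy(self) -> None:
        """Promote the active config once it has proven itself in service."""
        self.last_known_good = copy.deepcopy(self.active)

    def rollback(self) -> None:
        """Block further changes and restore the last known good config."""
        self.changes_blocked = True
        self.active = copy.deepcopy(self.last_known_good)


def has_routes(config: dict) -> bool:
    """Hypothetical validation rule: a config needs at least one route."""
    routes = config.get("routes")
    return isinstance(routes, list) and len(routes) > 0


store = ConfigStore({"routes": ["/"]}, has_routes)
print(store.deploy({"routes": []}))             # prints False (fails validation)
print(store.deploy({"routes": ["/", "/api"]}))  # prints True
store.rollback()
print(store.active)                             # prints {'routes': ['/']}
print(store.deploy({"routes": ["/v2"]}))        # prints False (changes blocked)
```

The weak point in any such design is the validation step itself. If a bug lets a change skip it, as Microsoft says happened here, the rollback path is all that is left.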

Although this report doesn’t mention it, early Azure reports placed some of the blame on, you guessed it, a DNS (Domain Name System) problem. Say it with me: If there’s a network problem, “It’s always DNS!”

Which sites and services were affected?

Ordinary people felt the pain too. Popular services like Microsoft 365 and Microsoft Intune for business users, and Xbox Live and Minecraft for people who just want to have fun, were also down. Others reported that Microsoft logins were slow or failed altogether.

The following services were affected:

Microsoft 365
Microsoft Azure
Microsoft Copilot
Microsoft login
Microsoft Store
Microsoft Teams
Minecraft
Xbox

It was a bad day if you trusted Microsoft.

Alaska Airlines experienced disruptions to its critical internal systems, including its website and operational infrastructure. Vodafone in the United Kingdom and Heathrow Airport are also said to have been affected by the outage.

Behind the scenes, Microsoft is now reporting that the following Azure services have been affected: App Service, Azure Active Directory B2C, Azure Communication Services, Azure Databricks, Azure Healthcare APIs, Azure Maps, Azure Portal, Azure SQL Database, Container Registry, Media Services, Microsoft Defender External Attack Surface Management, Microsoft Entra ID, Microsoft Purview, Microsoft Sentinel, Video Indexer, and Virtual Desktop.

Earlier, telecoms analyst Luke Kehoe of Ookla said: “Microsoft Azure has taken many services offline globally, with a large impact on airlines, banks and government agencies. It is the second such event this month, highlighting the systemic risks of concentration and some points of logical failure, regardless of how physically hardened the infrastructure is.”

He has a point. We rely too heavily on AWS, Azure and other cloud services, which, when the going gets tough, prove to be single points of failure.

Be that as it may, Microsoft reported in its latest quarterly earnings, released the same day after the closing bell, that it beat Wall Street expectations and that Azure revenue grew by about 40%. But the ongoing outage and Microsoft’s admission that it cannot keep up with AI and cloud demand sent Microsoft’s shares lower in after-hours trading.
