What Happened with This Week's Azure AD Outage
- By Kurt Mackie
- March 16, 2021
Microsoft has released a "preliminary root cause analysis" for Monday's Azure Active Directory issue that took out multiple Microsoft 365 and cloud applications for at least two hours.
According to Microsoft, an internal "cross-cloud migration" operation, aiming to improve the Azure AD service, ended up disrupting services -- specifically, "Azure Admin Portal, Teams, Exchange, Azure KeyVault, SharePoint, Storage and other major applications" -- for some organizations.
All services were restored, with the possible exception of "Intune and the Microsoft Managed Desktop," according to the Microsoft 365 Status Twitter feed.
The incident occurred because Microsoft had flagged a signing key to be retained past its normal expiration, a step taken to carry out the Azure AD migration. However, Microsoft's automated process ignored the key's "retain" state. As a result, tokens signed with that key were no longer trusted, leading to the service disruptions. Microsoft later rolled back operations to a prior state to address the issues.
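The failure mode described above can be illustrated with a minimal sketch. It is not Microsoft's implementation: the `SigningKey` class, the `retain` flag, and both functions are hypothetical names, and the sketch assumes the automation drops expired keys from the set that services use to validate tokens.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class SigningKey:
    key_id: str
    expires_at: datetime
    retain: bool = False  # hypothetical flag: keep the key past its expiration


def keys_to_remove(keys, now, honor_retain=True):
    """Return the IDs of keys an automated cleanup would remove."""
    removed = []
    for key in keys:
        expired = key.expires_at <= now
        # A retained key survives cleanup only if the flag is honored.
        if expired and not (honor_retain and key.retain):
            removed.append(key.key_id)
    return removed


def is_token_trusted(signing_key_id, published_key_ids):
    """A token is trusted only while its signing key is still published."""
    return signing_key_id in published_key_ids


now = datetime(2021, 3, 15, 19, 0, tzinfo=timezone.utc)
keys = [
    SigningKey("old-key", datetime(2021, 3, 1, tzinfo=timezone.utc), retain=True),
    SigningKey("new-key", datetime(2021, 6, 1, tzinfo=timezone.utc)),
]

for honor in (True, False):
    removed = keys_to_remove(keys, now, honor_retain=honor)
    published = {k.key_id for k in keys} - set(removed)
    trusted = is_token_trusted("old-key", published)
    print(f"honor_retain={honor}: removed={removed}, old-key tokens trusted={trusted}")
```

When the flag is honored, the retained key stays published and its tokens validate. When the flag is ignored, the key is removed and every token signed with it fails validation, which is the pattern the preliminary analysis describes.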
Here's Microsoft's timeline describing the Azure AD problems:
- March 15 (Monday) at approximately 19:00 UTC (12 noon PDT): Users start to see authentication errors for any application that uses the Azure AD service.
- March 15 (Monday) at 21:05 UTC (2:05 p.m. PDT): Microsoft rolls back the metadata for the key to its prior state. Application services start to recover, with the exception of some "Storage resources."
- March 16 (Tuesday) at approximately 9:25 UTC (2:25 a.m. PDT): Microsoft determines that most of the issues, including for Storage resources, have been mitigated for customers.
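The durations implied by this timeline can be checked with a short calculation. This is only a sketch: the variable names are illustrative, the start time is the approximate one Microsoft reported, and PDT is modeled as a fixed UTC-7 offset rather than looked up in a time zone database.

```python
from datetime import datetime, timedelta, timezone

# PDT modeled as a fixed UTC-7 offset (an assumption made for simplicity).
PDT = timezone(timedelta(hours=-7), "PDT")

start = datetime(2021, 3, 15, 19, 0, tzinfo=timezone.utc)     # errors begin (approximate)
rollback = datetime(2021, 3, 15, 21, 5, tzinfo=timezone.utc)  # key metadata rolled back
mitigated = datetime(2021, 3, 16, 9, 25, tzinfo=timezone.utc) # most issues mitigated

for label, moment in [("start", start), ("rollback", rollback), ("mitigated", mitigated)]:
    local = moment.astimezone(PDT)
    print(f"{label}: {moment:%b %d %H:%M} UTC = {local:%b %d %H:%M} PDT")

print("time to rollback:", rollback - start)
print("time to broad mitigation:", mitigated - start)
```

The rollback came about 2 hours and 5 minutes after the first errors, consistent with an outage of "at least two hours," while the broader mitigation, including Storage resources, took roughly 14 hours and 25 minutes.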
Safe Deployment Process
According to the notice, Microsoft is currently engaged in a two-stage process to improve the Azure AD service, including an effort to avoid the very same problems that occurred when Microsoft needed to alter a key. The process aims to add a "backend Safe Deployment Process (SDP) system to prevent a class of risks including this problem," the notice explained.
Microsoft has already completed the first stage of this Safe Deployment Process for the Azure AD service. The second stage is planned for completion in "mid-year."
Azure AD as 'Achilles Heel'
The March 15 incident wasn't the first of its kind. Microsoft admitted that "a previous Azure AD incident occurred on September 28th, 2020" in much the same way. However, the Safe Deployment Process, once completed, will address this "class of risks," Microsoft promised.
Users of Microsoft 365 services have experienced various outages tied to the Azure AD service. A couple of years ago, possible configuration changes by Microsoft caused 2.5 hours of downtime. In 2018, a lightning strike disrupted an Azure AD hub in Texas, causing outages of more than a day.
Tony Redmond, a Microsoft Most Valuable Professional, described the Azure AD outage in a Practical 365 post, noting that the outage only affected Microsoft 365 users who needed to make an Azure AD call for authentication. He also referred to Azure AD as the "Achilles heel" of Microsoft 365 services.
While Microsoft is planning to enhance the service-level agreement of Azure AD to 99.99 percent, that change will only take effect on April 1, 2021, and just for Azure AD Premium licensees, Redmond noted.
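To put the 99.99 percent figure in perspective, the downtime such a target allows can be worked out directly. The sketch below assumes availability is measured over a 30-day month or a 365-day year; the function name is illustrative, and the actual SLA terms define their own measurement windows and exclusions.

```python
def allowed_downtime_minutes(sla_percent, days):
    """Minutes of downtime permitted by an availability target over a period."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


monthly = allowed_downtime_minutes(99.99, days=30)
yearly = allowed_downtime_minutes(99.99, days=365)

# Roughly 125 minutes elapsed between the first errors and the rollback.
outage_minutes = 125

print(f"99.99% over 30 days: {monthly:.2f} minutes")
print(f"99.99% over 365 days: {yearly:.2f} minutes")
print(f"outage as a multiple of the yearly budget: {outage_minutes / yearly:.1f}x")
```

Under these assumptions, a 99.99 percent target leaves about 4.3 minutes of downtime per month, or under an hour per year, so an outage of roughly two hours would exceed a full year's allowance more than twice over.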
Kurt Mackie is senior news producer for 1105 Media's Converge360 group.