Microsoft: Unknown Flaw Caused Hours-Long Azure Outage
- By Kurt Mackie
- November 20, 2014
Microsoft on Thursday issued a preliminary explanation for the widespread outages that hit its Azure cloud service on Tuesday evening.
The outages lasted up to 11 hours, though Microsoft said that Azure components were "working properly" as of early Wednesday. The one remaining exception is Azure Virtual Machines in West Europe, where some virtual machines are reportedly in a "start state."
Jason Zander, corporate vice president on the Azure team, provided a summary analysis of the outage in a blog post Thursday. Azure had an unknown software flaw that was "discovered" after the rollout of a software patch to Azure datacenters, affecting operations worldwide, according to Zander. That's the basic story, but Microsoft also admits that it didn't follow "the standard protocol of applying production changes in incremental batches" before rolling out its software patch globally.
Zander indicated that "a limited subset of customers are [sic] still experiencing intermittent issues." He said Microsoft would provide a more detailed "root cause analysis" of the incident later on.
Here's how the incident unfolded, according to Zander's post: Microsoft had an undiscovered flaw in the Blob table front ends of Azure. An update that was expected to improve Azure Storage Services surfaced this flaw. The flaw caused an infinite loop in which the Blob front ends stopped taking on traffic. Many Azure services depend on Azure Storage Services, so the infinite loop problem spread to related Azure services worldwide, such as Virtual Machines and Websites, among others. Microsoft's fix for the problem entailed restarting the Blob front ends, which further delayed the recovery.
As with past Azure outages, this outage affected the Service Health Dashboard, which is the portal that customers use for understanding the state of various Azure services. Microsoft couldn't update the Service Health Dashboard for three hours after the outage. Consequently, it used Twitter and other social media to report the problems, according to Zander's post.
The outage also affected the ability of Azure customers to use the Service Health Dashboard to report support cases "during the early phase of the outage," according to Zander.
Zander promised that Microsoft would take the following steps:
- Ensure that the deployment tools enforce the standard protocol of applying production changes in incremental batches.
- Improve the recovery methods to minimize the time to recovery.
- Fix the infinite loop bug in the CPU reduction improvement from the Blob Front-Ends before it is rolled out into production.
- Improve Service Health Dashboard Infrastructure and protocols.
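For readers unfamiliar with the idea behind the first step, applying changes "in incremental batches" generally means updating a small group of systems, checking their health, and only then moving on to the next group. The following Python sketch illustrates that general pattern. It is not Microsoft's deployment tooling: the function name `rollout_in_batches`, its `apply_change` and `is_healthy` callbacks, and the batch labels are all hypothetical and exist only for illustration.

```python
from typing import Callable, List, Sequence


def rollout_in_batches(
    batches: Sequence[Sequence[str]],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply a change one batch at a time, halting after the first unhealthy batch.

    Returns the targets that were updated before the rollout finished or halted.
    All names here are hypothetical; this is a sketch of the general pattern.
    """
    updated: List[str] = []
    for batch in batches:
        for target in batch:
            apply_change(target)
            updated.append(target)
        # Verify the whole batch before touching the next one.
        if not all(is_healthy(target) for target in batch):
            break
    return updated


if __name__ == "__main__":
    # Hypothetical batches: a small first group, then progressively wider ones.
    batches = [["canary"], ["region-a", "region-b"], ["region-c", "region-d"]]
    broken = {"region-b"}  # pretend the change breaks this target
    done = rollout_in_batches(batches, lambda t: None, lambda t: t not in broken)
    print(done)  # ['canary', 'region-a', 'region-b']
```

In this sketch, a flaw that surfaces in the second batch stops the rollout before the third batch is touched, which limits how far a bad change can spread. A global rollout is the equivalent of putting every target in a single batch.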
In reaction to Microsoft's explanation, Microsoft MVP Aidan Finn offered thoughtful comments about Microsoft's production change process. He suggested in a Petri IT Knowledgebase article that IT pros may have a different perspective on updating systems than Microsoft's Azure developers. For instance, IT pros typically run tests of new software updates before delivering them to their production systems.
Finn also wondered about Microsoft's communications to its customers. While Microsoft used Twitter when its Service Health Dashboard wasn't functioning, Finn noted that Microsoft has the e-mail address of "every subscriber owner and delegate administrator."
Kurt Mackie is senior news producer for the 1105 Enterprise Computing Group.