Microsoft: Unknown Flaw Caused Hours-Long Azure Outage -- Redmond Channel Partner

By Kurt Mackie
November 20, 2014

Microsoft on Thursday issued a preliminary explanation for the widespread outages that hit its Azure cloud service on Tuesday evening.

The outages lasted up to 11 hours, though Microsoft said that Azure components were "working properly" as of early Wednesday. The one remaining exception is Azure Virtual Machines in West Europe, where some virtual machines are reportedly in a "start state."

Jason Zander, corporate vice president on the Azure team, provided a summary analysis of the outage in a blog post Thursday. Azure had an unknown software flaw that was "discovered" after the rollout of a software patch to Azure datacenters, affecting operations worldwide, according to Zander. That's the basic story, but Microsoft also admits that it didn't follow "the standard protocol of applying production changes in incremental batches" before rolling out its software patch globally.

Zander indicated that "a limited subset of customers are [sic] still experiencing intermittent issues." He said Microsoft would provide a more detailed "root cause analysis" of the incident later on.

Here's how the incident unfolded, according to Zander's post: First, Microsoft had an undiscovered flaw in the Blob table front ends of Azure. An update that was expected to improve Azure Storage Services surfaced this flaw. The flaw caused an infinite loop in which the Blob front ends stopped accepting traffic, affecting services worldwide. Many Azure services depend on Azure Storage Services, so the infinite loop problem spread to related Azure services, such as Virtual Machines and Websites, among others. Microsoft worked to address the problem, but doing so entailed restarting the Blob front ends, which further delayed the recovery.
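The way a storage failure reaches services built on top of it can be sketched as a walk over a dependency map. This is only an illustration: the `DEPENDS_ON` map and the `affected_by` function are hypothetical, and the map contains just the relationships named in this article, not Azure's actual architecture.

```python
from typing import Dict, List, Set

# Hypothetical dependency map based only on the services named in the article.
# Each service lists the components it relies on.
DEPENDS_ON: Dict[str, List[str]] = {
    "Storage": [],
    "Virtual Machines": ["Storage"],
    "Websites": ["Storage"],
}


def affected_by(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return the failed component plus every service that relies on it,
    directly or through another affected service."""
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for service, deps in depends_on.items():
            if service not in affected and any(d in affected for d in deps):
                affected.add(service)
                changed = True
    return affected


impacted = sorted(affected_by("Storage", DEPENDS_ON))
print(impacted)  # ['Storage', 'Virtual Machines', 'Websites']
```

A failure in a leaf service such as Websites would return only that service, which is why a flaw in a shared layer like storage has a much wider reach.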

As with past Azure outages, this outage affected the Service Health Dashboard, which is the portal that customers use to check the state of various Azure services. Microsoft couldn't update the Service Health Dashboard for three hours after the outage began. Consequently, it used Twitter and other social media to report the problems, according to Zander's post.

The outage also prevented Azure customers from using the Service Health Dashboard to report support cases "during the early phase of the outage," according to Zander.

Zander promised that Microsoft would take the following steps:

Ensure that the deployment tools enforce the standard protocol of applying production changes in incremental batches.
Improve the recovery methods to minimize the time to recovery.
Fix the infinite loop bug in the CPU reduction improvement from the Blob Front-Ends before it is rolled out into production.
Improve Service Health Dashboard Infrastructure and protocols.
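
The first step, applying production changes in incremental batches, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Microsoft's deployment tooling: the function `rollout_in_batches`, its parameters, and the region names are all hypothetical.

```python
from typing import Callable, List, Sequence


def rollout_in_batches(
    targets: Sequence[str],
    batch_size: int,
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply a change one batch at a time and halt at the first unhealthy batch.

    Returns the targets that received the change before the rollout
    finished or was halted.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    updated: List[str] = []
    for start in range(0, len(targets), batch_size):
        batch = targets[start:start + batch_size]
        for target in batch:
            apply_change(target)
            updated.append(target)
        # Verify the whole batch before touching the next one.
        if not all(is_healthy(t) for t in batch):
            break
    return updated


# Hypothetical usage: the change breaks "region-b", so the rollout stops there.
regions = ["region-a", "region-b", "region-c", "region-d"]
broken = {"region-b"}
applied: List[str] = []
result = rollout_in_batches(regions, 1, applied.append, lambda r: r not in broken)
print(result)  # ['region-a', 'region-b']
```

With a batch size of one, the faulty change reaches two of the four hypothetical regions before the health check halts it. A global rollout is the same call with `batch_size=len(regions)`, in which case every region receives the change before any check runs.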

In reaction to Microsoft's explanation, Microsoft MVP Aidan Finn offered thoughtful comments about Microsoft's production change process. He suggested in a Petri IT Knowledgebase article that IT pros may have a different perspective on updating systems than Microsoft's Azure developers. For instance, IT pros typically test new software updates before delivering them to their production systems.

Finn also wondered about Microsoft's communications to its customers. While Microsoft used Twitter when its Service Health Dashboard wasn't functioning, Finn noted that Microsoft has the e-mail address of "every subscriber owner and delegate administrator."

About the Author

Kurt Mackie is senior news producer for 1105 Media's Converge360 group.
