Microsoft Offers Explanation for December Windows Azure Outage

- By Kurt Mackie
- January 17, 2013
Customers of Microsoft's Windows Azure Storage service who were affected by December's two-day disruption received a detailed apology and explanation from Microsoft on Wednesday.
The service disruption began on Dec. 28, 2012 and affected 1.8 percent of Microsoft's total Windows Azure Storage accounts, all of them in the U.S. South region, according to a Microsoft blog post. Thousands of businesses may have experienced problems with online services and Web sites, although Microsoft didn't provide a number. Last May, a Microsoft official said that Windows Azure was supporting "high tens of thousands" of customers.
The Windows Azure Storage service was fully restored on Dec. 30, 2012, but customers were initially kept in the dark about outage details for 1.5 hours. The problem was associated with a single "storage stamp," which is Microsoft's name for a regional unit consisting of multiple storage node stacks. Microsoft's Windows Azure cloud-based service typically depends on multiple storage stamps per region.
 
Windows Azure subscribers get updates about the service's performance through Microsoft's Primary Service Health Dashboard. However, during this incident, they couldn't get the details for 1.5 hours because the Dashboard relied on the very storage stamp that had experienced problems, according to Microsoft's explanation.
"On December 28, 2012, from 7:30 am (PST) to approximately 9:00 am (PST) the Primary Service Health Dashboard was unavailable, because it relied on data in the affected storage stamp," Microsoft explained in the blog post.
Microsoft attributed the service disruption to human error, but it likely was an easy error to make, given the system's complexity, as described in the blog post. The problem arose from the way storage nodes are brought back into service after being taken out for maintenance. A configuration setting that protects the nodes from being overwritten needs to be turned on when bringing the nodes back into service, but a technician forgot to turn on that protection, according to Microsoft. That error led to a node overwrite and the service disruption.
The resulting two-day delay in restoring service was associated with Microsoft's attempt to restore the data at the failed storage stamp location with no loss of customer data. While Microsoft does have a georedundant service for Windows Azure that could have restored the data from another location, taking that approach would have meant losing about 8 GB of recent data across Microsoft's Windows Azure customers.
Microsoft's blog post indicated that the company would credit its affected Windows Azure Storage customers 100 percent for this service disruption in their December bills. Normally, Microsoft's service level agreement for Windows Azure Storage provides for a service credit of just 10 percent (when uptime falls below 99.9 percent) or 25 percent (when uptime falls below 99 percent).
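For a sense of scale, the standard credit tiers mentioned above can be sketched in a few lines of Python. This is an illustration only, not Microsoft's billing logic: the function names are hypothetical, the thresholds are read as "uptime below 99.9 percent" and "uptime below 99 percent," and uptime is assumed to be measured as simple clock time, which may differ from how the actual service level agreement measures it. The 48-hour figure is a rough stand-in for a two-day disruption.

```python
def service_credit_percent(uptime_percent: float) -> int:
    """Map a monthly uptime percentage to a credit tier (assumed thresholds)."""
    if uptime_percent < 99.0:
        return 25  # lower tier: uptime below 99 percent
    if uptime_percent < 99.9:
        return 10  # upper tier: uptime below 99.9 percent
    return 0       # SLA met, no credit


def time_based_uptime_percent(hours_in_month: float, hours_down: float) -> float:
    """Uptime as a share of clock time (an assumption, not the SLA's own formula)."""
    return 100.0 * (hours_in_month - hours_down) / hours_in_month


# Hypothetical scenario: about 48 hours of disruption in a 31-day month.
december_hours = 31 * 24
uptime = time_based_uptime_percent(december_hours, 48)
print(f"{uptime:.2f}% uptime -> {service_credit_percent(uptime)}% standard credit")
```

Under those assumptions, a two-day disruption would fall into the 25 percent tier, which puts the 100 percent credit Microsoft offered well above the standard terms.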
Microsoft is also promising to improve the service in the future. It plans to improve its georeplication service to respond more quickly should another such storage service disruption occur. Procedures associated with the Primary Service Health Dashboard failure have already been improved, according to Microsoft's blog post.
However, dashboard problems during a Windows Azure service disruption have been seen before. Last February, the dashboard went down in association with a purported "leap year bug" service failure. The dashboard management service was restored after a nearly 24-hour blackout period.

About the Author

Kurt Mackie is senior news producer for 1105 Media's Converge360 group.