Microsoft Gives Postmortem of Lync, Exchange Outages

Microsoft on Thursday issued an explanation for two separate Office 365 service outages that occurred this week.

In a Microsoft forum post, Rajesh Jha, corporate vice president for Office 365 engineering, said that only Microsoft's North American datacenters were affected by Monday's Lync Online outage and Tuesday's Exchange Online outage, and that the problems causing the outages have since been fixed.

With regard to the Lync Online problem, some users in North America couldn't log into the service. Microsoft fixed that specific log-in problem "in minutes," Jha said, but "the ensuing traffic spike caused several network elements to get overloaded, resulting in some of our customers being unable to access Lync functionality for an extended duration." That extended duration appears to have been a good part of the working day on June 23, according to a chronicle kept by veteran Microsoft reporter Mary Jo Foley.

The Exchange Online outage also appears to have started as a small problem that then escalated. Jha explained that a directory partition stopped responding to authentication requests. That problem caused "a small set of customers to lose email access." However, the failure then spread to Microsoft's broader e-mail traffic flow, and many Exchange Online users reported not being able to send or receive e-mail. Jha said that the initial Exchange Online failure led to an "unexpected issue":

Unfortunately, the nature of this failure led to an unexpected issue in the broader mail delivery system due to a previously unknown code flaw leading to mail flow delays for a larger set of customers. Our recovery strategy was two pronged: 1) We partitioned the mail delivery system away from the failed directory partition and 2) directly addressed the root cause for the failed directory partition. In addition to fixing the root cause trigger, we are working on further layers of hardening for this pattern.

The Exchange Online problem persisted through most of the day on June 24. Jha also noted that the Service Health Dashboard, which provides Office 365 service uptime reports to subscribers, had a problem with its "publishing process, meaning not all impacted customers were notified in a timely way." He said that the problem with the Service Health Dashboard has "since been addressed."

Microsoft plans to provide more details about the outages to its customers via a "post-incident report," which will appear in the Service Health Dashboard, Jha said. Microsoft doesn't have a publicly accessible portal showing its Office 365 service health, so much of the news about the outages on Monday and Tuesday was initially relayed through Twitter posts.

Microsoft offers a "three nines" or 99.9 percent uptime service level agreement as part of its Office 365 business plans. If Microsoft fails to meet 99.9 percent uptime in a given month, then the subscriber may be eligible to get a service credit. However, the subscriber has to file a claim with Microsoft to get the credit. The service credit is a percentage of the monthly service fees that gets returned to the customer, and it increases as service uptime degrades. Microsoft shows those uptime percentages and corresponding service credits in the following table:

Monthly Uptime    Service Credit
< 99.9%           25%
< 99%             50%
< 95%             100%

Service credit percentages based on monthly Office 365 uptime. Source: Microsoft's "Service Level Agreement for Microsoft Online Services" document.
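
As an illustration of how the tiers in the table stack up, the lookup can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes the lowest matching tier applies; it does not reproduce how Microsoft actually measures monthly uptime for a given subscriber.

```python
def service_credit_percent(monthly_uptime_percent: float) -> int:
    """Return the service credit (as a percent of monthly fees) for a given
    monthly uptime percentage, using the tiers from the table above.

    Hypothetical helper for illustration only; thresholds are taken from the
    table, and the most severe tier that matches is assumed to apply.
    """
    if monthly_uptime_percent < 95.0:
        return 100
    if monthly_uptime_percent < 99.0:
        return 50
    if monthly_uptime_percent < 99.9:
        return 25
    return 0  # SLA met, no credit due


if __name__ == "__main__":
    for uptime in (99.95, 99.5, 98.0, 90.0):
        print(f"{uptime}% uptime -> {service_credit_percent(uptime)}% credit")
```

Under this reading, a month at exactly 99.9 percent uptime earns no credit, since each tier in the table is a strict "less than" threshold.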

A 99.9 percent uptime translates to about 43 minutes of allowable downtime in a 30-day month, or roughly 8.8 hours of downtime per year. Microsoft's outages on Monday and Tuesday lasted perhaps six hours and nine hours, respectively, according to press reports.
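
The downtime arithmetic can be checked with a short Python sketch. It assumes a 30-day month and a 365-day year, and it treats each outage as if it affected the whole service for its full reported duration, which is a simplification; the function names are hypothetical, and the actual SLA calculation may be defined differently.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(uptime_percent: float, days: int) -> float:
    """Minutes of downtime allowed over `days` at the given uptime target."""
    return (1 - uptime_percent / 100) * days * MINUTES_PER_DAY


def uptime_after_outage(outage_hours: float, days: int = 30) -> float:
    """Uptime percentage for a period of `days` containing one outage.

    Assumption: the outage is counted in full against the whole period.
    """
    total_hours = days * 24
    return 100 * (1 - outage_hours / total_hours)


if __name__ == "__main__":
    monthly = downtime_budget_minutes(99.9, days=30)
    yearly_hours = downtime_budget_minutes(99.9, days=365) / 60
    print(f"Monthly budget: {monthly:.1f} minutes")
    print(f"Yearly budget:  {yearly_hours:.2f} hours")

    # Reported outage lengths from the article: about six and nine hours.
    for hours in (6, 9):
        print(f"{hours}-hour outage -> {uptime_after_outage(hours):.2f}% monthly uptime")
```

Under these assumptions, either outage on its own would exceed the monthly downtime budget many times over and push monthly uptime below the 99.9 percent threshold.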

About the Author

Kurt Mackie is senior news producer for 1105 Media's Converge360 group.
