Amazon: Dublin Cloud Outage Not Caused by Lightning -- Redmond Channel Partner

By Jeffrey Schwartz
August 18, 2011

Amazon has issued an apology and explanation in response to the massive cloud outage that crippled its Dublin datacenter last week. The upshot: Contrary to the company's original diagnosis, lightning does not appear to have been the cause.

According to Amazon's detailed post-mortem, it is not clear what caused the transformer failure that led to the datacenter outage on Aug. 7. In any case, the subsequent malfunction of a programmable logic controller (PLC), which is designed to ensure synchronization between generators, caused the cutover to backup generator power to fail.

Without utility power, and with the backup generators disabled, there wasn't enough power for all the servers in the Availability Zone to continue operating, Amazon said. The uninterruptible power supplies (UPSes) also quickly drained, resulting in power loss to most of the EC2 instances and 58 percent of the Elastic Block Store (EBS) volumes in the Availability Zone.

Power was also lost to the EC2 networking gear that connects the Availability Zone to the Internet and to other Amazon Availability Zones. That resulted in further connectivity issues that led to errors when customers targeted API requests to the impacted Availability Zone.

Ultimately, Amazon was able to bring some of the backup generators online manually, which restored power to many of the EC2 instances and EBS volumes, but it took longer to restore power to the networking devices.

Restoration of EBS took longer due to the atypically large number of EBS volumes that lost power. There wasn't enough spare capacity to support re-mirroring, Amazon said. That required Amazon to truck in more servers, which was a logistical problem as it was nighttime.

Another problem: When EC2 instances and all nodes containing EBS volume replicas concurrently lost power, Amazon said it couldn't verify that all of the writes to all of the nodes were "completely consistent." That being the case, Amazon assumed those volumes were in an inconsistent state, even though they may actually have been consistent.
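
The conservative rule described above can be illustrated with a minimal Python sketch. It is not Amazon's implementation; the function name and its boolean inputs are hypothetical and exist only to show the reasoning: a volume is treated as inconsistent when the instance and every node holding a replica lost power at the same time, because no surviving copy remains to confirm the final writes.

```python
from typing import List


def assume_inconsistent(instance_lost_power: bool,
                        replica_lost_power: List[bool]) -> bool:
    """Hypothetical rule: return True when a volume must be treated as
    inconsistent because nothing stayed powered to vouch for its writes."""
    if not replica_lost_power:
        raise ValueError("a volume needs at least one replica")
    # Only the case where the instance AND all replica nodes lost power
    # together is treated as unverifiable.
    return instance_lost_power and all(replica_lost_power)


# Every copy lost power: assume inconsistent and produce a recovery snapshot.
print(assume_inconsistent(True, [True, True]))
# One replica node stayed up: the volume is not flagged by this rule.
print(assume_inconsistent(True, [True, False]))
```

Under this rule a volume may be flagged even though its data is intact, which matches Amazon's point that some volumes assumed inconsistent may in fact have been consistent.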

"Bringing a volume back in an inconsistent state without the customer being aware could cause undetectable, latent data corruption issues which could trigger a serious impact later," Amazon said. "For the volumes we assumed were inconsistent, we produced a recovery snapshot to enable customers to create a new volume and check its consistency before trying to use it. The process of producing recovery snapshots was time-consuming because we had to first copy all of the data from each node to Amazon Simple Storage Service (Amazon S3), process that data to turn it into the snapshot storage format, and re-copy the data to make it accessible from a customer's account. Many of the volumes contained a lot of data (EBS volumes can hold as much as 1 TB per volume)."

It took until Aug. 10 to have 98 percent of the recovery snapshots available, Amazon said, with the remaining ones requiring manual intervention. The power outage also had a significant impact on Amazon's Relational Database Service (RDS).

Furthermore, Amazon engineers discovered a bug in the EBS software, unrelated to the power outage, that affected the cleanup of snapshots.

So what is Amazon going to do to prevent a repeat of last week's events?

For one, the company is planning to add redundancy and greater isolation of its PLCs "so they are insulated from other failures." Amazon said it is working with its vendors to deploy isolated backup PLCs. "We will deploy this as rapidly as possible," the company said.

Amazon also said it will implement better load balancing to take failed API management hosts out of production. And for EBS, the company said it will "drastically reduce the long recovery time required to recover stuck or inconsistent EBS volumes" during a major disruption.
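
Amazon did not describe how the improved load balancing will work. As a general illustration only, the Python sketch below shows one common approach: a host is pulled from rotation after a number of consecutive failed health checks and returned once a check succeeds. The `HostPool` class, the `max_failures` threshold and the host names are all hypothetical.

```python
from typing import Dict, List


class HostPool:
    """Hypothetical pool that drops hosts after repeated failed checks."""

    def __init__(self, hosts: List[str], max_failures: int = 3) -> None:
        self.max_failures = max_failures
        # Consecutive failed health checks recorded per host.
        self.failures: Dict[str, int] = {host: 0 for host in hosts}

    def record_check(self, host: str, ok: bool) -> None:
        # A success resets the counter; a failure increments it.
        self.failures[host] = 0 if ok else self.failures[host] + 1

    def in_service(self) -> List[str]:
        # Only hosts below the failure threshold receive traffic.
        return [h for h, n in self.failures.items() if n < self.max_failures]


pool = HostPool(["api-1", "api-2"], max_failures=2)
pool.record_check("api-2", ok=False)
pool.record_check("api-2", ok=False)
print(pool.in_service())  # api-2 has reached the threshold and is excluded
```

Requiring several consecutive failures before removing a host is a design choice that avoids pulling a host over a single transient error, at the cost of a slower reaction to a real failure.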

During Amazon's last major outage in late April, the company received a lot of heat for not providing better communications. "Based on prior customer feedback, we communicated more frequently during this event on our Service Health Dashboard than we had in other prior events, we had evangelists tweet links to key early dashboard updates, we staffed up our AWS support team to handle much higher forum and premium support contacts, and we tried to give an approximate time-frame early on for when the people with extra-long delays could expect to start seeing recovery," the company said.

For those awaiting recovery of snapshots, Amazon said it did not know how long the process would take "or we would have shared it." To improve communications, Amazon indicated it will expedite the staffing of the support team in the early hours of an event and will aim to make it easier for customers and Amazon to determine if their resources have been impacted.

Amazon said it will issue affected customers a 10-day credit equal to 100 percent of their usage of EBS volumes, EC2 instances and RDS database instances that were running in the affected Availability Zone in the Dublin datacenter.

Moreover, customers impacted by the EBS software bug that deleted blocks in their snapshots will receive a 30-day credit for 100 percent of their EBS usage in the Dublin region. Those customers will also have access to the company's Premium Support Engineers if they still require help recovering from the outage, Amazon said.

About the Author

Jeffrey Schwartz is editor of Redmond magazine and also covers cloud computing for Virtualization Review's Cloud Report. In addition, he writes the Channeling the Cloud column for Redmond Channel Partner. Follow him on Twitter @JeffreySchwartz.
