The Schwartz
Cloud Report

Amazon's Big Mistake

[UPDATE: Amazon released a detailed report explaining the cause of the outage on Friday. Read the story here.]

Amazon Web Services' four-day outage was a defining moment in the history of cloud computing -- not only for its impact but for the company's deafening silence.

The widely reported outage at Amazon's Northern Virginia datacenter left a number of sites crippled for several days, though Amazon most recently reported that service has been restored. However, the company has acknowledged that 0.07 percent of the Elastic Block Store (EBS) volumes apparently won't be fully recoverable.

"Every day, inside companies all over the world, there are technology outages," Rackspace Chief Strategy Officer Lew Moorman told The New York Times. "Each episode is smaller, but they add up to far more lost time, money and business."

As for the Amazon outage, he added: "We all have an interest in Amazon handling this well." Did Amazon handle this well? Let's presume the company did everything in its power to remedy the problem and get its customers back online. Amazon has promised to issue a post-mortem once it gets everyone restored and figures out what went wrong.

But the company went dark from a communications perspective. Sure, it posted periodic updates on its Service Health Dashboard, but the company issued no other public statements on the situation as it was unfolding (though it was in direct communication with affected customers). Considering how visible Amazon technologists are on social media, including Twitter, a mere reference to the dashboard felt shallow.

"Most customers are saying today they have not been very transparent and open about what has exactly happened," Forrester analyst Vanessa Alvarez told Bloomberg TV. "Their public relations to date has not been up to par."

Consider the communiqué of one of Amazon's customers affected by the outage. In a blog post called "Making it Right..." HootSuite explained to customers what happened and how it was going to make good on the downtime it experienced. Although its terms of service require reimbursement after a 24-hour outage and it was down for only 15 hours, HootSuite said it would offer credits.

"We acknowledge users were inconvenienced and we want to make things right," the company said. "We are taking steps to increase redundancy of our services and data across multiple geographic regions. This was a bit of a unique outage which is highly unlikely to occur again, but we'll be even more prepared for future emergencies."

During the outage and as of this writing, a week after it first hit, no such communication has come from Amazon. Pund-IT analyst Charles King said in a research note that datacenter failures, even major ones, are inevitable, but communication is critical. He wrote:

"The fact that disaster is inevitable is why good communications skills are so crucial for any company to develop, and why Amazon's anemic public response to the outage made a bad situation far worse than it needed to be. Yes, the company maintained a site that regularly updated how repairs were progressing, and, to its credit, Amazon says it will publish a full analysis of the outage after its investigation is complete.

"But while the company has been among the industry's most vocal cloud services cheerleaders, it seemed essentially tone deaf to the damage its inaction was doing to public perception of cloud computing. At the end of the day, we expect Amazon will use the lessons learned from the EC2 outage to significantly improve its service offerings. But if it fails to closely evaluate communications efforts around the event, the company's and its customers' suffering will be wasted."

I remember when, during the dotcom boom over a decade ago, companies like Charles Schwab, E-Trade and eBay had highly visible outages that affected many thousands of customers. They took big PR hits for their lack of availability, but their Web businesses prospered nonetheless.

While Amazon's outage will elevate the discussion of the importance of resiliency and redundancy (those discussions were already happening), it seems highly unlikely that it will alter the move to cloud computing, even if it serves as a historic speed bump. "We shouldn't let Amazon off the hook and should expect a very thorough postmortem. But in no way does this change the landscape for the age-old public-private debate," writes analyst Ben Kepes.

While Amazon's outage was a black eye for cloud computing, providers of all sizes, including Amazon, will undoubtedly learn from the mistakes that were made, both technical and procedural. Hopefully, that will include better communications moving forward.

Posted by Jeffrey Schwartz on April 28, 2011 at 11:58 AM


Reader Comments

Fri, Apr 29, 2011 Darrell Atlanta

Amazon's service health dashboard was almost zero help. "We have all hands on deck and are working on it..." tells me nothing. They knew it might be 36 hrs for some customers to get back on, so you need some ETAs. Also, telling us that we may need to use our backups in the interim was info that was shared in some of their forums but not on the Health Dashboard. So though Jeffrey might be right to say one line of communication is most effective, that's only true when it's complete and thorough.

Thu, Apr 28, 2011 Glenn Weinstein San Mateo, CA

Jeffrey, you're right about "crisis communications" perhaps being Amazon's biggest failing here. That said, Steve Jobs' quote in today's NY Times (http://nyti.ms/gTMjxk), about Apple's response to the iPhone location data flare-up, was instructive: "Mr. Jobs defended the timing of Apple’s response to the controversy, saying that “rather than run to the P.R. department,” it set out to determine exactly what happened. “The first thing we always do when a problem is brought to us is we try to isolate it and find out if it is real,” he said. “It took us about a week to do an investigation and write a response, which is fairly quick for something this technically complicated.” He added, “Scott and Phil and myself were all involved in writing the response because we think it is that important.”" There is something to be said for that approach, no?

Thu, Apr 28, 2011 Mary

True, but they could say "Hey, we're really sorry" in blog posts and whatnot, and refer people to the status dashboard, so it looked like they gave a damn.

Thu, Apr 28, 2011 Jeffrey

You can't use phrases like "deafening silence" and "went dark" to describe what happened here, and then go on to say that Amazon actually provided updates on their system status page. Which is it? It can't be both. The system status page is the one and only place to get information about the health of the Amazon platform. There is no reason to provide corporate press releases (etc.) during an outage, particularly when the outage has not yet been resolved. One of the things that you learn when you manage a platform is that having multiple communication channels doesn't help you communicate more effectively. This becomes a problem for platform users because it forces them to monitor every channel you use. Having one true place to get updates (in this case, the system status dashboard) is far preferable to having to slog through blog posts or whatnot.
