What Partners and Businesses Can Learn from the Facebook Outage

I got the first alert at 11:40 a.m. EST on Oct. 4 that there were problems with Facebook. As it's not business-critical for me, I didn't pay much attention -- but I did get puzzled when I couldn't connect to WhatsApp, as that is indeed a critical tool for me to interact with my different teams around the globe.

We've now learned that the outage lasted for six hours and involved not just Facebook but also services owned by it, like Instagram, Messenger, WhatsApp and Oculus VR. This was a costly outage for every business that depends on these services, and it shows how business-critical these social media resources have become.

Having led a large multinational hosting business, I know that sometimes problems occur that affect uptime. And any CEO in the hosting or managed services business knows that such incidents can have a big impact on reputation. Most outages aren't as big as the one that affected Facebook this week, but sometimes they are devastating.

Some problems should be expected, and you can make reasonable efforts to prepare for them. According to Facebook, the outage originated from an upgrade of routers. The ensuing problems shouldn't have been a surprise for anyone who works with infrastructure.

What is a surprise is that so much was connected to these routers. Not only did all of those customer-facing services go down, but Facebook's own e-mail system and a bunch of other internal systems -- including the entrance to the Facebook office building -- stopped working. To put it mildly, it looks like Facebook made the mistake of putting all of its eggs in the same basket and not following best practices for an enterprise-class online infrastructure.

Here's some advice not only for Facebook, but for everyone -- including partners -- running and managing business-critical infrastructure:

  1. Segment your infrastructure so that a problem doesn't spread across your whole environment. Your administrative network should be separated from the network where your customer-facing systems reside. Even if you're not as big as Facebook, separate your different services into several networks. This will also help security, as it will make it much harder for attackers to bring your entire environment down.
  2. Plan your upgrade. Make sure that it has been thoroughly analyzed and vetted. The higher its potential to impact business, the more you should plan and analyze prior to the actual upgrade taking place. Make sure that you have a decent change-management process in place.
  3. Never upgrade everything at once if you can avoid it. Simulate the upgrade in a test environment, then start the upgrade with something less business-critical than a system that is used by 3.5 billion users. The "big bang" model of upgrades fails way too often.
  4. Make sure that you know how to roll back an upgrade quickly and safely. Learn the right procedures for how to make it happen.
  5. Rehearse frequently so you know what to do when something goes wrong. It's like a fire drill; you should have procedures and protocols to follow.
  6. When all of your services are up again, make sure to create a written incident report and discuss the findings inside your organization. This is how my old company learned from past errors. Our mantra was that the same problem should never happen again.
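
To make points 3 and 4 concrete, here is a minimal sketch of a staged upgrade with automatic rollback. It is an illustration only, not a description of how Facebook or any particular tool works: the `Target` class, its `criticality` score, and the `is_healthy` check are all hypothetical names, and a real rollout would call your own deployment and monitoring systems instead of just changing a version string.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Target:
    name: str
    criticality: int  # hypothetical score: higher means more business-critical
    version: str


def staged_upgrade(
    targets: List[Target],
    new_version: str,
    is_healthy: Callable[[Target], bool],
) -> bool:
    """Upgrade targets one at a time, least critical first.

    Returns True if every target was upgraded. Returns False if a health
    check failed, in which case every change made so far is rolled back.
    """
    previous = {}  # name -> version before the upgrade, kept for rollback
    done = []
    for target in sorted(targets, key=lambda t: t.criticality):
        previous[target.name] = target.version
        target.version = new_version  # stand-in for the real upgrade step
        done.append(target)
        if not is_healthy(target):
            # Roll back in reverse order, including the target that failed.
            for upgraded in reversed(done):
                upgraded.version = previous[upgraded.name]
            return False
    return True
```

With a fleet such as `[Target("customer-web", 3, "1.0"), Target("internal-wiki", 1, "1.0")]`, the internal wiki is upgraded first. If the health check then fails on `customer-web`, both systems return to version `1.0` and the function reports the failure, rather than leaving the environment half-upgraded.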

I hope this helps you prepare for the unexpected.

Posted by Per Werngren on October 05, 2021

