What Partners and Businesses Can Learn from the Facebook Outage
I got the first alert at 11:40 a.m. ET on Oct. 4 that there were problems with Facebook. As Facebook itself isn't business-critical for me, I didn't pay much attention -- but I was puzzled when I couldn't connect to WhatsApp, which is a critical tool for interacting with my different teams around the globe.
We've now learned that the outage lasted roughly six hours and took down not just Facebook but also the services it owns, including Instagram, Messenger, WhatsApp and Oculus VR. This was a costly outage for every business that depends on these services, and it shows how business-critical these social media platforms have become.
Having led a large multinational hosting business, I know that sometimes problems occur that affect uptime. And any CEO in the hosting or managed services business knows that such incidents can have a big impact on reputation. Most outages aren't as big as the one that affected Facebook this week, but sometimes they are devastating.
Some problems should be expected, and you can make reasonable efforts to prepare for them. According to Facebook, the outage originated from a faulty configuration change made during routine maintenance on its backbone routers. The ensuing problems shouldn't have been a surprise to anyone who works with infrastructure.
What is surprising is how much was connected to those routers. Not only did all of those customer-facing services go down, but Facebook's own e-mail system and a number of other internal systems -- including the entrance to the Facebook office building -- stopped working as well. To put it mildly, it looks like Facebook put all of its eggs in one basket instead of following best practices for an enterprise-class online infrastructure.
Here's some advice not only for Facebook, but for everyone -- including partners -- running and managing business-critical infrastructure:
- Segment your infrastructure so that a single problem can't spread across your whole environment. Your administrative network should be separate from the network where your customer-facing systems reside. Even if you're not as big as Facebook, split your different services across several networks. This also helps security, because it makes it much harder for attackers to bring your entire environment down. (A simple segmentation check is sketched after this list.)
- Plan your upgrade. Make sure it has been thoroughly analyzed and vetted. The greater its potential business impact, the more planning and analysis you should do before the upgrade takes place. Make sure you have a decent change-management process in place.
- Never upgrade everything at once if you can avoid it. Simulate the upgrade in a test environment, then start with something less business-critical than a system used by 3.5 billion people. The "big bang" model of upgrades fails far too often. (See the staged-rollout sketch after this list.)
- Make sure that you know how to roll back an upgrade quickly and safely. Learn and document the procedure before you need it; the staged-rollout sketch below includes an automatic rollback.
- Rehearse frequently so you know what to do when something goes wrong. It's like a fire drill; you should have procedures and protocols to follow.
- When all of your services are up again, make sure to create a written incident report and discuss the findings inside your organization. This is how my old company learned from past errors. Our mantra was that the same problem should never happen again.
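To make the segmentation advice concrete, here is a minimal sketch of an automated check, assuming hypothetical subnets and a simplified firewall-rule format (nothing here reflects Facebook's actual setup). It flags any rule that would let a customer-facing subnet reach the administrative network:

```python
# Minimal, illustrative segmentation check. The subnets, rule format and rules
# below are hypothetical examples, not any real environment's configuration.
from ipaddress import ip_network

ADMIN_NET = ip_network("10.0.0.0/24")          # administrative / management network
CUSTOMER_NETS = [ip_network("10.1.0.0/16"),    # customer-facing web tier
                 ip_network("10.2.0.0/16")]    # customer-facing API tier

# Each rule: (source subnet, destination subnet, action)
firewall_rules = [
    ("10.1.0.0/16", "10.0.0.0/24", "deny"),
    ("10.2.0.0/16", "10.0.0.0/24", "deny"),
    ("10.9.0.0/24", "10.0.0.0/24", "allow"),   # jump hosts only
]

def violations(rules):
    """Return rules that allow a customer-facing subnet to reach the admin network."""
    bad = []
    for src, dst, action in rules:
        src_net, dst_net = ip_network(src), ip_network(dst)
        if (action == "allow"
                and dst_net.overlaps(ADMIN_NET)
                and any(src_net.overlaps(c) for c in CUSTOMER_NETS)):
            bad.append((src, dst))
    return bad

if __name__ == "__main__":
    for src, dst in violations(firewall_rules):
        print(f"Segmentation violation: {src} may reach admin network {dst}")
```

Running a check like this against every proposed rule change is one cheap way to keep the administrative network from quietly merging back into the rest of the environment.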
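Likewise, here is a minimal sketch of a staged rollout with automatic rollback, assuming hypothetical stage names and placeholder upgrade(), rollback() and health_check() functions that you would replace with your own deployment tooling. The point is the structure: start with the least business-critical group, let the change soak, and roll back everything touched so far the moment a health check fails:

```python
# Minimal, illustrative staged rollout. Stage names and the three helper
# functions are placeholders for whatever deployment tooling you actually use.
import time

# Ordered from least to most business-critical -- never start with the big one.
ROLLOUT_STAGES = ["test-lab", "internal-tools", "small-region", "global"]

def upgrade(stage: str) -> None:
    print(f"Upgrading {stage} ...")            # placeholder: apply the change

def rollback(stage: str) -> None:
    print(f"Rolling back {stage} ...")         # placeholder: restore the known-good version

def health_check(stage: str) -> bool:
    print(f"Checking health of {stage} ...")   # placeholder: probe endpoints, routes, DNS
    return True

def staged_rollout(stages=ROLLOUT_STAGES, soak_seconds=600) -> bool:
    completed = []
    for stage in stages:
        upgrade(stage)
        time.sleep(soak_seconds)               # let the change soak before judging it
        if not health_check(stage):
            # Roll back everything touched so far, newest first, and stop the rollout.
            for done in reversed(completed + [stage]):
                rollback(done)
            return False
        completed.append(stage)
    return True

if __name__ == "__main__":
    ok = staged_rollout(soak_seconds=1)        # short soak time just for the demo
    print("Rollout succeeded" if ok else "Rollout aborted and rolled back")
```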
I hope this helps you prepare for the unexpected.
Posted by Per Werngren on October 05, 2021