Live! 360: 15 Lessons Microsoft Learned Running DevOps
Microsoft Visual Studio Team Services/Team Foundation Server (VSTS/TFS) isn't just a toolset for DevOps; the large team at Microsoft behind the products is a long-running experiment in doing DevOps.
During the main keynote at the Live! 360 conference in Orlando, Fla., this week, Buck Hodges shared DevOps lessons learned at Microsoft scale. While Microsoft has tens of thousands of developers engaged to varying degrees in DevOps throughout the company, Hodges, director of engineering for Microsoft VSTS, focused on the 430-person team developing VSTS/TFS.
VSTS and TFS, which share a master code base, provide a set of services for software development and DevOps, providing services such as source control, agile planning, build automation, continuous deployment, continuous integration, test case management, release management, package management, analytics and insights, and dashboards. Microsoft updates VSTS every three weeks, while the schedule for new on-premises versions of TFS is every four months.
Hodges' narrower lens on the VSTS/TFS team provides a lengthy and deep set of experiences around DevOps. Hodges started on the TFS team in 2003 and helped lead the transformation into cloud as a DevOps team with VSTS. The group's real trial by fire in DevOps started when VSTS went online in April 2011.
"That's the last time we started from scratch. Everything's been an upgrade since then. Along the way, we learned a lot, sometimes the hard way," Hodges said.
Here are 15 DevOps tips gleaned from Hodges' keynote. (Editor's Note: This article has been updated to remove an incorrect reference to "SourceControl.Revert" in the third tip.)
1. Use Feature Flags
The whole point of a fast release cycle is fixing bugs and adding features. When it comes to features, Microsoft is using a technique called feature flags that allows them to designate how an individual feature gets deployed and to whom.
"Feature flags have been an excellent technique for us for both changes that you can see and also changes that you can't see. It allows us to decouple deployment from exposure," Hodges said. "The engineering team can build a feature and deploy it, and when we actually reveal it to the world is entirely separate."
The granularity allowed by Microsoft's implementation of feature flags is surprising. For example, Hodges said the team added support for the SSH protocol, which is a feature very few organizations need, but those that do need it are passionate about it. Rather than making it generally available across the codebase, Hodges posted a blog asking users who needed SSH to e-mail him. The VSTS team was able to turn on the feature for those customers individually.
2. Define Some Release Stages
In most cases, Microsoft will be working on features of interest to more than a handful of customers. By defining Release Stages, those features can get flagged for and tested by larger and larger circles of users. Microsoft's predefined groups before a feature is fully available are:
- Stage 0: internal Microsoft
- Stage 1: a few external customers
- Stage 2: private preview
- Stage 3: public preview
3. Use a Revert Button
Wouldn't it be nice to have a big emergency button allowing you to revert from a feature if it starts to cause major problems for the rest of the code? That's another major benefit of the feature flag approach. In cases where Microsoft has found that a feature is causing too many problems, it's possible to turn the feature flag off. The whole system then ignores the troublesome feature's code, and should revert to its previous working state.
4. Make It a Sprint, not a Marathon
In its DevOps efforts around VSTS/TFS, Microsoft is working in a series of well-defined sprints, and that applies to the on-premises TFS, as well as the cloud-based VSTS. You could think of the old Microsoft development model as a marathon, working on thousands of changes for an on-premises server and releasing a new version every few years.
The core timeframe for VSTS/TFS is three weeks. At the end of every sprint, Microsoft takes a release branch that ships to the cloud on the service. Roughly every four months, one of those release branches becomes the new release of TFS. This three-week development motion is pretty well ingrained. The first sprint came in August 2010. In November 2017, Microsoft was working on sprint No. 127.
5. Flip on One Feature at a Time
No lessons-learned list is complete without a disaster. The VSTS team's low point came at the Microsoft Connect() 2013 event four years ago. The plan was to wow customers with a big release. An hour before the keynote, Microsoft turned on about two dozen new features. "It didn't go well. Not only did the service tank, we started turning feature flags off and it wouldn't recover," Hodges said, describing the condition of the service as a "death spiral."
It was two weeks before all the bugs were fixed. Since then, Microsoft has taken to turning on new features one at a time, monitoring them very closely, and turning features on completely at least 24 hours ahead of an event.
6. Split Up into Services
One other big change was partly a response to the Microsoft Connect() 2013 incident. At the time of the big failure, all of VSTS ran as one service. Now Microsoft has split that formerly global instance of VSTS into 31 separate services, giving the product much greater resiliency.
7. Implement Circuit Breakers
Microsoft took a page out of the Netflix playbook and implemented circuit breakers in the VSTS/TFS code. The analogy is to an electrical circuit breaker, and the goal is to stop a failure from cascading across a complex system. Hodges said that while fast failures are usually relatively straightforward to diagnose, it's the slow failures in which system performance slowly degrades that can present the really thorny challenges.
The circuit breaker approach has helped Microsoft protect against latency, failure and concurrency/volume problems, as well as shed load quickly, fail fast and recover more quickly, he said. Additionally, having circuit breakers creates another way to test the code: "Let's say we have 50 circuit breakers in our code. Start flipping them to see what happens," he said.
Hodges offered two warnings about circuit breakers. One is to make sure the team doesn't start treating the circuit breakers as causes rather than symptoms of an event. The other is that it can be difficult to understand what opened a circuit breaker, requiring thoughtful and specialized telemetry.
8. Collect Telemetry
Here's a koan for DevOps: The absence of failure doesn't mean a feature is working. In a staged rollout environment like the one Microsoft runs, traffic on new features is frequently low. As the feature is exposed through the larger concentric circles of users in each release stage, it's getting more and more hits. Yet a problem may not become apparent until some critical threshold gets reached weeks or months after full availability of a feature.
In all cases, the more telemetry the system generates, the better. "When you run a 24x7 service, telemetry is absolutely key, it's your lifeblood," Hodges said. "Gather everything." For a benchmark, Microsoft is pulling in 7TB of data on average every day.
"When you run a 24x7 service, telemetry is absolutely key, it's your lifeblood. Gather everything."
Buck Hodges, Director of Engineering, Microsoft Visual Studio Team Services
9. Refine Alerts
A Hoover-up-everything approach is valuable when it comes to telemetry so that there's plenty of data available for pursuing root causes of incidents. The opposite is true on the alert side. "We were drowning in alerts," Hodges admitted. "When there are thousands of alerts and you're ignoring them, there's obviously a problem," he said, adding that too many alerts makes it more likely you'll miss problems. Cleaning up the alert system was an important part of Microsoft's DevOps journey, he said. Microsoft's main rules on alerts now are that every alert must be actionable and alerts should create a sense of urgency.
10. Prioritize User Experience
When deciding which types of problems to prioritize in telemetry, Hodges said Microsoft is emphasizing user experience measurements. Early versions might have concluded that performance was fine as long as user requests weren't failing. But understanding user experience expectations, and understanding those thresholds when a user loses his or her train of thought due to a delay, makes it important to not only measure failure of requests but to also recognize a problem if a user request takes too long. "If a request takes more than 10 seconds, we consider that a failed request," Hodges said.
11. Optimize DRIs' Time
Microsoft added a brick to the DevOps foundation in October 2013 with the formalization of designated responsible individuals, or DRIs. Responsible for rapid response to incidents involving the production systems, the DRIs represent a formalization on the operations side of DevOps. In Microsoft's case, the DRIs are on-call 24/7 on 12-hour shifts and are rotated weekly. In the absence of incidents, DRIs are supposed to conduct proactive investigation of service performance.
In case of an incident, the goal is to have a DRI on top of the issue in five minutes during the day and 15 minutes at night. Traditional seniority arrangements result in the most experienced people getting the plum day shifts. Microsoft has found that flipping the usual situation works best.
"We found that at night, inexperienced DRIs just needed to wake up the more experienced DRI anyway," Hodges said. As for off-hours DRIs accessing a production system, Microsoft also provides them with custom secured laptops to prevent malware infections, such as those from phishing attacks, from working their way into the system and wreaking havoc.
12. Assign Shield Teams
The VSTS/TFS team is organized into about 40 feature teams of 10 engineers and a program manager or two. With that aggressive every-three-weeks sprint schedule, those engineers need to be heads down on developing new features for the next release. Yet if an incident comes up in the production system involving one of their features, the team has to respond. Microsoft's process for that was to create a rotating "shield team" of two of the 10 engineers. Those engineers are assigned to address or triage any live-site issues or other interruptions, providing a buffer that allows the rest of the team to stay focused on the sprint.
13. Pursue Multiple Theories
In the case of a live site incident, there's usually a temptation to seize on a theory of the problem and dedicate all the available responders to pursuing that theory in the hopes that it will lead to a quick resolution. The problem comes when the theory is wrong. "It's surprising how easy it is to get myopic. If you pursue each of your theories sequentially, it's going to take longer to fix the problem for the customer," Hodges said. "You have to pursue multiple theories."
In a similar vein, Microsoft has found it's important and helpful to rotate out incident responders and bring in fresh, rested replacements if an incident goes more than eight or 10 hours without a resolution.
14. Combine Dev and Test Engineer Roles
One of the most critical evolutions affecting DevOps at Microsoft over the past few years involved a companywide change in the definition of an engineer. Prior to combining them in November 2014, Microsoft had separate development engineer and test engineer roles. Now developers who build code must test the code, which provides significant motivation to make the code more testable.
15. Tune the Tests
The three-week-sprint cycle led to a simultaneous acceleration in testing processes, with the biggest improvements coming in the last three years. As of September 2014, Hodges said the so-called "nightly tests" required 22 hours, while the full run of testing took two days.
A new testing regimen breaks down testing into different levels. A new test taxonomy allows Microsoft to run the code against progressively more involved levels, allowing simple problems to be addressed quickly. The first level tests only binaries and doesn't involve any dependencies. The next level adds the ability to use SQL and file system. A third level tests a service via the REST API, while the fourth level is a full environment to test end to end. The upshot is that Microsoft is running many, many more tests against the code in much less time.
Posted by Scott Bekker on November 16, 2017