Live! 360: 15 Lessons Microsoft Learned Running DevOps -- Redmond Channel Partner

Bekker's Blog by Scott Bekker, Editor in Chief

Live! 360: 15 Lessons Microsoft Learned Running DevOps

Related: Live! 360: Microsoft's 'No. 1 Persona' for BI is the Business User

Microsoft Visual Studio Team Services/Team Foundation Server (VSTS/TFS) isn't just a toolset for DevOps; the large team at Microsoft behind the products is a long-running experiment in doing DevOps.

During the main keynote at the Live! 360 conference in Orlando, Fla., this week, Buck Hodges shared DevOps lessons learned at Microsoft scale. While Microsoft has tens of thousands of developers engaged to varying degrees in DevOps throughout the company, Hodges, director of engineering for Microsoft VSTS, focused on the 430-person team developing VSTS/TFS.

VSTS and TFS, which share a master code base, provide a set of services for software development and DevOps, providing services such as source control, agile planning, build automation, continuous deployment, continuous integration, test case management, release management, package management, analytics and insights, and dashboards. Microsoft updates VSTS every three weeks, while the schedule for new on-premises versions of TFS is every four months.

Hodges' narrower lens on the VSTS/TFS team provides a lengthy and deep set of experiences around DevOps. Hodges started on the TFS team in 2003 and helped lead the transformation into cloud as a DevOps team with VSTS. The group's real trial by fire in DevOps started when VSTS went online in April 2011.

"That's the last time we started from scratch. Everything's been an upgrade since then. Along the way, we learned a lot, sometimes the hard way," Hodges said.

Here are 15 DevOps tips gleaned from Hodges' keynote. (Editor's Note: This article has been updated to remove an incorrect reference to "SourceControl.Revert" in the third tip.)

1. Use Feature Flags
The whole point of a fast release cycle is fixing bugs and adding features. When it comes to features, Microsoft is using a technique called feature flags that allows them to designate how an individual feature gets deployed and to whom.

"Feature flags have been an excellent technique for us for both changes that you can see and also changes that you can't see. It allows us to decouple deployment from exposure," Hodges said. "The engineering team can build a feature and deploy it, and when we actually reveal it to the world is entirely separate."

**[Click on image for larger view.]** *Buck Hodges, director of engineering for Microsoft Visual Studio Team Services, makes the case for feature flags as a key element of DevOps.*

The granularity allowed by Microsoft's implementation of feature flags is surprising. For example, Hodges said the team added support for the SSH protocol, which is a feature very few organizations need, but those that do need it are passionate about it. Rather than making it generally available across the codebase, Hodges posted a blog asking users who needed SSH to e-mail him. The VSTS team was able to turn on the feature for those customers individually.

2. Define Some Release Stages
In most cases, Microsoft will be working on features of interest to more than a handful of customers. By defining Release Stages, those features can get flagged for and tested by larger and larger circles of users. Microsoft's predefined groups before a feature is fully available are:

Stage 0: internal Microsoft
Stage 1: a few external customers
Stage 2: private preview
Stage 3: public preview

3. Use a Revert Button
Wouldn't it be nice to have a big emergency button allowing you to revert from a feature if it starts to cause major problems for the rest of the code? That's another major benefit of the feature flag approach. In cases where Microsoft has found that a feature is causing too many problems, it's possible to turn the feature flag off. The whole system then ignores the troublesome feature's code, and should revert to its previous working state.

4. Make It a Sprint, not a Marathon
In its DevOps efforts around VSTS/TFS, Microsoft is working in a series of well-defined sprints, and that applies to the on-premises TFS, as well as the cloud-based VSTS. You could think of the old Microsoft development model as a marathon, working on thousands of changes for an on-premises server and releasing a new version every few years.

The core timeframe for VSTS/TFS is three weeks. At the end of every sprint, Microsoft takes a release branch that ships to the cloud on the service. Roughly every four months, one of those release branches becomes the new release of TFS. This three-week development motion is pretty well ingrained. The first sprint came in August 2010. In November 2017, Microsoft was working on sprint No. 127.

5. Flip on One Feature at a Time
No lessons-learned list is complete without a disaster. The VSTS team's low point came at the Microsoft Connect() 2013 event four years ago. The plan was to wow customers with a big release. An hour before the keynote, Microsoft turned on about two dozen new features. "It didn't go well. Not only did the service tank, we started turning feature flags off and it wouldn't recover," Hodges said, describing the condition of the service as a "death spiral."

It was two weeks before all the bugs were fixed. Since then, Microsoft has taken to turning on new features one at a time, monitoring them very closely, and turning features on completely at least 24 hours ahead of an event.

6. Split Up into Services
One other big change was partly a response to the Microsoft Connect() 2013 incident. At the time of the big failure, all of VSTS ran as one service. Now Microsoft has split that formerly global instance of VSTS into 31 separate services, giving the product much greater resiliency.

7. Implement Circuit Breakers
Microsoft took a page out of the Netflix playbook and implemented circuit breakers in the VSTS/TFS code. The analogy is to an electrical circuit breaker, and the goal is to stop a failure from cascading across a complex system. Hodges said that while fast failures are usually relatively straightforward to diagnose, it's the slow failures in which system performance slowly degrades that can present the really thorny challenges.

The circuit breaker approach has helped Microsoft protect against latency, failure and concurrency/volume problems, as well as shed load quickly, fail fast and recover more quickly, he said. Additionally, having circuit breakers creates another way to test the code: "Let's say we have 50 circuit breakers in our code. Start flipping them to see what happens," he said.

Hodges offered two warnings about circuit breakers. One is to make sure the team doesn't start treating the circuit breakers as causes rather than symptoms of an event. The other is that it can be difficult to understand what opened a circuit breaker, requiring thoughtful and specialized telemetry.

8. Collect Telemetry
Here's a koan for DevOps: The absence of failure doesn't mean a feature is working. In a staged rollout environment like the one Microsoft runs, traffic on new features is frequently low. As the feature is exposed through the larger concentric circles of users in each release stage, it's getting more and more hits. Yet a problem may not become apparent until some critical threshold gets reached weeks or months after full availability of a feature.

In all cases, the more telemetry the system generates, the better. "When you run a 24x7 service, telemetry is absolutely key, it's your lifeblood," Hodges said. "Gather everything." For a benchmark, Microsoft is pulling in 7TB of data on average every day.

"When you run a 24x7 service, telemetry is absolutely key, it's your lifeblood. Gather everything."

Buck Hodges, Director of Engineering, Microsoft Visual Studio Team Services

9. Refine Alerts
A Hoover-up-everything approach is valuable when it comes to telemetry so that there's plenty of data available for pursuing root causes of incidents. The opposite is true on the alert side. "We were drowning in alerts," Hodges admitted. "When there are thousands of alerts and you're ignoring them, there's obviously a problem," he said, adding that too many alerts makes it more likely you'll miss problems. Cleaning up the alert system was an important part of Microsoft's DevOps journey, he said. Microsoft's main rules on alerts now are that every alert must be actionable and alerts should create a sense of urgency.

10. Prioritize User Experience
When deciding which types of problems to prioritize in telemetry, Hodges said Microsoft is emphasizing user experience measurements. Early versions might have concluded that performance was fine as long as user requests weren't failing. But understanding user experience expectations, and understanding those thresholds when a user loses his or her train of thought due to a delay, makes it important to not only measure failure of requests but to also recognize a problem if a user request takes too long. "If a request takes more than 10 seconds, we consider that a failed request," Hodges said.

11. Optimize DRIs' Time
Microsoft added a brick to the DevOps foundation in October 2013 with the formalization of designated responsible individuals, or DRIs. Responsible for rapid response to incidents involving the production systems, the DRIs represent a formalization on the operations side of DevOps. In Microsoft's case, the DRIs are on-call 24/7 on 12-hour shifts and are rotated weekly. In the absence of incidents, DRIs are supposed to conduct proactive investigation of service performance.

In case of an incident, the goal is to have a DRI on top of the issue in five minutes during the day and 15 minutes at night. Traditional seniority arrangements result in the most experienced people getting the plum day shifts. Microsoft has found that flipping the usual situation works best.

"We found that at night, inexperienced DRIs just needed to wake up the more experienced DRI anyway," Hodges said. As for off-hours DRIs accessing a production system, Microsoft also provides them with custom secured laptops to prevent malware infections, such as those from phishing attacks, from working their way into the system and wreaking havoc.

12. Assign Shield Teams
The VSTS/TFS team is organized into about 40 feature teams of 10 engineers and a program manager or two. With that aggressive every-three-weeks sprint schedule, those engineers need to be heads down on developing new features for the next release. Yet if an incident comes up in the production system involving one of their features, the team has to respond. Microsoft's process for that was to create a rotating "shield team" of two of the 10 engineers. Those engineers are assigned to address or triage any live-site issues or other interruptions, providing a buffer that allows the rest of the team to stay focused on the sprint.

13. Pursue Multiple Theories
In the case of a live site incident, there's usually a temptation to seize on a theory of the problem and dedicate all the available responders to pursuing that theory in the hopes that it will lead to a quick resolution. The problem comes when the theory is wrong. "It's surprising how easy it is to get myopic. If you pursue each of your theories sequentially, it's going to take longer to fix the problem for the customer," Hodges said. "You have to pursue multiple theories."

In a similar vein, Microsoft has found it's important and helpful to rotate out incident responders and bring in fresh, rested replacements if an incident goes more than eight or 10 hours without a resolution.

14. Combine Dev and Test Engineer Roles
One of the most critical evolutions affecting DevOps at Microsoft over the past few years involved a companywide change in the definition of an engineer. Prior to combining them in November 2014, Microsoft had separate development engineer and test engineer roles. Now developers who build code must test the code, which provides significant motivation to make the code more testable.

15. Tune the Tests
The three-week-sprint cycle led to a simultaneous acceleration in testing processes, with the biggest improvements coming in the last three years. As of September 2014, Hodges said the so-called "nightly tests" required 22 hours, while the full run of testing took two days.

A new testing regimen breaks down testing into different levels. A new test taxonomy allows Microsoft to run the code against progressively more involved levels, allowing simple problems to be addressed quickly. The first level tests only binaries and doesn't involve any dependencies. The next level adds the ability to use SQL and file system. A third level tests a service via the REST API, while the fourth level is a full environment to test end to end. The upshot is that Microsoft is running many, many more tests against the code in much less time.

Posted by Scott Bekker on November 16, 2017

Free Webcast! Learn about Password Management Best Practices

Featured

Report: Cost, Sustainability Drive DaaS Adoption Beyond Remote Work

Gartner's 2025 Magic Quadrant for Desktop as a Service reveals that while secure remote access remains a key driver of DaaS adoption, a growing number of deployments now focus on broader efficiency goals.
Windows 365 Reserve, Microsoft's Cloud PC Rental Service, Hits Preview

Microsoft has launched a limited public preview of its new "Windows 365 Reserve" service, which lets organizations rent cloud PC instances in the event their Windows devices are stolen, lost or damaged.
Hands-On AI Skills Now Outshine Certs in Salary Stakes

For AI-related roles, employers are prioritizing verifiable, hands-on abilities over framed certificates -- and they're paying a premium for it.
Roadblocks in Enterprise AI: Data and Skills Shortfalls Could Cost Millions

Businesses risk losing up to $87 million a year if they fail to catch up with AI innovation, according to the Couchbase FY 2026 CIO AI Survey released this month.

RCP Update

Email Address*Country*

Please type the letters/numbers you see above.

Bekker's Blog by Scott Bekker, Editor in Chief

Live! 360: 15 Lessons Microsoft Learned Running DevOps

Featured

Report: Cost, Sustainability Drive DaaS Adoption Beyond Remote Work

Windows 365 Reserve, Microsoft's Cloud PC Rental Service, Hits Preview

Hands-On AI Skills Now Outshine Certs in Salary Stakes

Roadblocks in Enterprise AI: Data and Skills Shortfalls Could Cost Millions

Windows 365 Reserve, Microsoft's Cloud PC Rental Service, Hits Preview

Microsoft To Standardize Online Services Pricing for Volume Licensing

Microsoft Offers Support Extensions for Exchange 2016 and 2019

Dynamics 365, Power Platform and Copilot Getting AI Boosts in Wave 2 Update

Roadblocks in Enterprise AI: Data and Skills Shortfalls Could Cost Millions

Windows 365 Reserve, Microsoft's Cloud PC Rental Service, Hits Preview

Microsoft To Standardize Online Services Pricing for Volume Licensing

Microsoft Offers Support Extensions for Exchange 2016 and 2019

Dynamics 365, Power Platform and Copilot Getting AI Boosts in Wave 2 Update

Roadblocks in Enterprise AI: Data and Skills Shortfalls Could Cost Millions

Partner Guides

Partner's Guide to the Windows Server 2008 Deadline

Partner's Guide to Office 365 Security Costs

Partner's Guide to UCaaS

Partner's Guide to Starter Workloads in Azure

Partner's Guide to Microsoft's Fiscal Year 2019

FREE WEBCASTS FROM OUR SPONSORS

Veeam Simplified: The Easy Button for MSPs Protecting Microsoft 365

Tech Talk | Stop Buying What You Already Own: The MSP's Guide to Microsoft 365 Optimization

Veeam Data Cloud for Microsoft 365

Seamless Transitions: Migrating Your VMware Workloads to Azure

FREE WHITE PAPERS FROM OUR SPONSORS

The Easy Button eBook: Simplicity of SaaS-Based Backup for Microsoft 365

Unlocking the power of Microsoft 365 management: A toolkit for MSP success

How MSPs can future-proof Microsoft 365 management with automation and security

The future of M365 management