Our journey to Cloud Cadence, lessons learned at Microsoft Developer Division
Author: Sam Guckenheimer
Last Update: 12/20/2016
Microsoft Developer Division transitioned to being a SaaS provider from a “box” software company. Whereas a decade ago, we delivered on-premises software releases on a multi-year cadence, now we deliver continuously from the public cloud. Our DevOps engineering practices, tools, and culture needed to evolve as we transformed to the second decade of Agile.
From Agile to DevOps at Microsoft Developer Division
On November 13, 2013, we launched Visual Studio 2013 and announced the commercial terms of our hosted service, Visual Studio Team Services. We promptly experienced a seven-hour outage. At that time, we were running our service in a single scale unit, serving about a million users when we announced. It was our peak traffic time, when Europe and both coasts of the US are online. We had many new capabilities hidden behind “feature flags” that we lifted right before the marketing announcement. We discovered that we did not have telemetry on a key piece of our network layer, the IP ports that communicate between services. And we were generating new demand. From a technical standpoint, this was a large embarrassment, as shown in Figure 1. You can follow the gory details.
From a marketing standpoint, the launch was actually a big success. This was the inflection point at which Team Services quickly started climbing to double-digit, month-over-month growth.
Balancing feature and live site work
During that time, Team Services was hosted on only one scale unit, in the Chicago data center. We knew that we would need to scale out across multiple, individually deployed stamps, but we had always prioritized feature work over the live site work to move to multiple scale units. The launch experience changed those priorities. The team decided to postpone planned feature work and to accelerate the live site work. We needed a way to canary the deployment of the Team Services updates. The existing Chicago scale unit remained unchanged, but we added another in front of it in the deployment sequence, which we named SU0.
Adding a canary to the release pipeline
In the new process, we would first deploy every sprint to San Antonio (Scale Unit 0), where we worked, and use the new release for a few hours. (We’ve since extended this to several days.) When satisfied, we would allow the deployment to continue to roll to Chicago (Scale Unit 1), as shown in Figure 2. Of course, if there is a problem in SU0, we can remediate and restart. Over time, we added several other Azure data centers. In the Fall of 2014, we added Amsterdam as the fifth, and subsequently we have added Australia, India, and Brazil as we continue to expand internationally. Visual Studio Release Management handles the workflow and automation of rolling out each sprint worldwide, grouping the data centers into successive environments, each of which is a separate exposure tier.
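The staged rollout described above can be sketched in a few lines. This is only an illustration, not the Release Management workflow itself: the `ScaleUnit`, `Ring`, and `roll_out` names, the ring names, and the health-check callback are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScaleUnit:
    name: str                     # "SU0" and "SU1" come from the text; others are illustrative
    deployed_build: str = "none"


@dataclass
class Ring:
    """One exposure tier: scale units that receive a build together."""
    name: str
    scale_units: List[ScaleUnit]


def roll_out(build: str, rings: List[Ring],
             is_healthy: Callable[[ScaleUnit], bool]) -> List[str]:
    """Deploy ring by ring and stop at the first ring that reports a problem.

    Returns the names of the rings that completed successfully.
    """
    completed = []
    for ring in rings:
        for su in ring.scale_units:
            su.deployed_build = build
        # In practice a bake period (hours to days) would precede this check.
        if not all(is_healthy(su) for su in ring.scale_units):
            break  # later rings keep their previous build
        completed.append(ring.name)
    return completed


rings = [
    Ring("canary", [ScaleUnit("SU0")]),
    Ring("tier-1", [ScaleUnit("SU1")]),
    Ring("tier-2", [ScaleUnit("SU2"), ScaleUnit("SU3")]),
]

# Simulate a regression that is caught on the canary.
done = roll_out("sprint-100", rings, is_healthy=lambda su: su.name != "SU0")
print(done)                                    # []
print(rings[1].scale_units[0].deployed_build)  # none
```

Because the canary fails its check, the build never reaches SU1 or the later tiers, which is the point of putting SU0 first in the sequence.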
How we moved from Agile to DevOps
Over seven years, Microsoft Developer Division (DevDiv) embraced Agile. We had achieved a 15x reduction in technical debt through solid engineering practices, drawn heavily from XP. We trained everyone on Scrum, multidisciplinary teams and product ownership across the division.
We focused heavily on the flow of value to our customers. By the time we shipped VS2010, the product line achieved a level of customer recognition that was unparalleled.
Hosting from the Azure Public Cloud
After we shipped VS2010, we knew that we needed to begin to work on converting Team Foundation Server into a Software as a Service (SaaS) offering. The SaaS version, now called Team Services, would be hosted on Microsoft Azure, and to succeed with that we needed to begin adopting DevOps practices. That meant we needed to expand our practices from Agile to DevOps. What’s the difference?
DevOps requires Build-Measure-Learn
Part of a DevOps culture is learning from usage. A tacit assumption of Agile was that the Product Owner was omniscient and could groom the backlog correctly. In contrast, when you run a high-reliability service, you can observe how customers are actually using its capabilities in near real-time. You can release frequently, experiment with improvements, measure, and ask customers how they perceive the changes. The data you collect becomes the basis for the next set of improvements you make. In this way, a DevOps product backlog is really a set of hypotheses that become experiments in the running software and allow a cycle of continuous feedback.
As shown in Figure 3, DevOps grew from Agile based on four trends:
Delivering multi-tenant SaaS and on-premises server updates
Unlike many “born-in-the-cloud” companies, we did not start with a SaaS offering. Most of our customers are using the on-premises version of our software (Team Foundation Server, originally released in 2005 and now available in Version 2015). When we started Team Services, we determined that we would maintain a single code base for both the SaaS and “box” versions of our product, developing cloud-first. When an engineer pushes code, it triggers a continuous integration pipeline. At the end of every three-week sprint, we release to the Cloud, and after four to five sprints, we release a quarterly update for the on-premises product. You can see five years of the details on the Features Timeline.
When you are working on a service, you have the blessing of frequent releases, in our case at the end of every three-week sprint. This creates a great opportunity to expose work, and a need to control when it is exposed. Some of the issues that arise are:
- How do you work on features that span sprints?
- How do you experiment with features in order to get usage and feedback, when you know they are likely to change?
- How do you do “dark launches” that introduce services or capabilities before you are ready to expose or market them?
Feature flags allow runtime control
In all of these cases, we have started to use the feature flag pattern. A feature flag is a mechanism to control production exposure of any feature to any user or group of users. As a team working on the new feature, you can register a flag with the feature flag service, and it will default to down (off). When you are ready to have someone try your work, you can raise the flag for that identity in production for as long as you need. If you want to modify the feature, you can lower the flag with no redeployment, and the feature is no longer exposed.
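As a rough illustration of the pattern, the sketch below uses an in-memory stand-in for a flag service. It is not the Team Services implementation: the `FeatureFlagService` class, its method names, and the `new-dashboard` flag are hypothetical and exist only to show the register, raise, and lower steps.

```python
from typing import Dict, Set


class FeatureFlagService:
    """In-memory stand-in for a feature flag service (hypothetical API)."""

    def __init__(self) -> None:
        # flag name -> identities for which the flag is currently raised
        self._raised: Dict[str, Set[str]] = {}

    def register(self, flag: str) -> None:
        """Register a flag; it defaults to down (off) for everyone."""
        self._raised.setdefault(flag, set())

    def raise_for(self, flag: str, identity: str) -> None:
        self._raised[flag].add(identity)

    def lower_for(self, flag: str, identity: str) -> None:
        self._raised[flag].discard(identity)

    def is_enabled(self, flag: str, identity: str) -> bool:
        # Unregistered flags are treated as down.
        return identity in self._raised.get(flag, set())


flags = FeatureFlagService()
flags.register("new-dashboard")


def render_home(user: str) -> str:
    # The new code path ships with every deployment but stays dark
    # until the flag is raised for the requesting identity.
    if flags.is_enabled("new-dashboard", user):
        return "new dashboard"
    return "classic dashboard"


print(render_home("alice"))  # classic dashboard
flags.raise_for("new-dashboard", "alice")
print(render_home("alice"))  # new dashboard
print(render_home("bob"))    # classic dashboard
flags.lower_for("new-dashboard", "alice")
print(render_home("alice"))  # classic dashboard
```

The check happens at run time on each request, so raising or lowering a flag changes exposure without a redeployment.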
By allowing progressive exposure control, feature flags also provide one form of testing in production. We will typically expose new capabilities initially to ourselves, then to our early adopters (who may opt into “Private Preview”), and then to progressively larger circles of customers. Monitoring the performance and usage allows us to ensure that there is no issue at scale in the new service components.
Code velocity and branching
When we first moved to Agile in 2008, we believed that we would enforce code quality with the right quality gates and branching structure. In the early days, developers worked in a fairly elaborate branch structure and could only promote code that satisfied a stringent definition of done, including a gated check-in that effectively did a “get latest” from the trunk, built the system with the new changesets, and ran the build policies.
The waste of merge debt
The unforeseen consequence of that branch structure was many days, sometimes months, of impedance in the flow of code from the leaf nodes to the trunk, and long periods of code sitting in branches unmerged. This created significant merge debt. When work was ready to merge, the trunk had moved considerably, and merge conflicts abounded, leading to a long reconciliation process and lots of waste.
Optimizing code flow
The first step we took, by 2010, was to significantly flatten the branch structure so that there are now very few branches, and they are usually quite temporary. We created an explicit goal to optimize code flow, in other words, to minimize the time between a check-in or commit and that changeset becoming available to every other developer.
Moving to Git
The next step was to move to distributed version control, using Git, which is now supported under Team Services and TFS. Most of our customers and colleagues continue to use centralized version control, and Team Services and TFS support both models. Git has the advantage of allowing very lightweight, temporary branches. A topic branch might be created per work item, and cleaned up when the changes are merged into the mainline.
Merging in tiny batches
All the code lives in Master (the trunk) when committed, and the pull-request workflow combines both code review and the policy gates. This makes merging continuous, easy, and in tiny batches, while the code is fresh in everyone’s mind.
This process isolates the developers’ work for the short period it is separate and then integrates it continuously. The branches have no bookkeeping overhead, and shrivel when they are no longer needed.
Agile on steroids
We continue to follow Scrum, but stripped to its essentials for easy communication and scaling across geographies. For example, the primary work unit is a feature crew, equivalent to a scrum team, with the product owner sitting inside the team and participating day in, day out. The product owner and engineering lead jointly speak for the team.
You build it, you run it
The definition of done is conceptually very simple. You build it, you run it. Your code will be deployed live to millions of users at the end of the sprint, and if there are live-site issues, you (and everyone else) will know immediately. You will remediate to root cause.
Monitoring our service
Arguably, the most important code in Team Services is its telemetry. The monitoring cannot become unavailable when the service has an issue. If there is a problem with the service, the monitoring alerts and dashboards need to report the issue immediately. The monitoring infrastructure is largely the same one that we make available to customers through Application Insights.
Every aspect of Team Services is instrumented to provide a 360-degree view of availability, performance, usage, and troubleshooting. Three principles apply to our telemetry.
- We err on the side of gathering everything.
- We try hard to focus on actionable metrics over vanity metrics.
- Rather than just inputs, we measure results and ratios that allow us to see the effectiveness of measures that we take.
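To make the third principle concrete, here is a minimal sketch contrasting a raw count with a ratio. The `WindowStats` shape and all of the numbers are made up for illustration; they are not actual Team Services telemetry.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts for one time window (hypothetical telemetry shape)."""
    total_requests: int
    failed_requests: int


def failure_ratio(stats: WindowStats) -> float:
    """Failed requests as a fraction of all requests in the window."""
    if stats.total_requests == 0:
        return 0.0
    return stats.failed_requests / stats.total_requests


quiet_hour = WindowStats(total_requests=1_000, failed_requests=50)
peak_hour = WindowStats(total_requests=100_000, failed_requests=200)

# The raw failure count is higher at peak, but the ratio shows that
# the quiet hour is when a user is more likely to hit an error.
print(failure_ratio(quiet_hour))  # 0.05
print(failure_ratio(peak_hour))   # 0.002
```

A dashboard that showed only the failure count would point at the peak hour, while the ratio points at the quiet hour as the window that needs attention.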
On average, we will gather 200GB of telemetry data per day for Team Services. (Across Microsoft, Application Insights gathers over 700TB daily.) Although we can drill to the individual account level, per our privacy policies, data is anonymized unless a customer chooses to share details with us.
Tracking live site
Of course, we take nothing for granted. Everyone is trained in “Live Site Culture.” In other words, the status of the live service always comes first. If there is a live site incident, that takes precedence over any other work, and detecting and mitigating the problem is top priority. We have live dashboards in all our hallways, so everyone is aware of our successes and shortcomings.
Remediate at root cause
All Live Site Incidents (LSIs) are logged in Team Services, driven to root cause, and reviewed weekly. The root cause analysis is a blameless post-mortem that seeks to identify what engineering needs to be done to prevent recurrence of similar problems. That work is then prioritized on the common product backlog.
Learnings: DevOps is the second decade of Agile
As we have moved to DevOps, we have come to assess our growth in seven practice areas, which we collectively think of as the Second Decade of Agile.
Agile scheduling and teams — This is consistent with Agile, but more lightweight. Feature crews are multidisciplinary, pull from a common product backlog, minimize work in process, and deliver work ready to deploy live at the end of each sprint.
Management of technical debt — Any technical debt you carry is a risk, which will generate unplanned work, such as Live Site Incidents, that will interfere with your intended delivery. We are very careful to be conscious of any debt items and to schedule paying them off before they can interfere with the quality of service we deliver. (We have occasionally misjudged, as in the VS 2013 launch story above, but we are always transparent in our communication).
Flow of value — This means keeping our backlog ranked according to what matters to the customers and focusing on the delivery of value for them. We always spoke of this during the first decade of Agile, but now with DevOps telemetry, we can measure how much we are succeeding and whether we need to correct our course.
Hypothesis-based backlog — Before DevOps, the product owner groomed the backlog based on the best input from stakeholders. Nowadays, we treat the backlog as a set of hypotheses, which we need to turn into experiments, and for which we need to collect data that supports or diminishes the hypothesis. Based on that evidence, we can determine the next move in the backlog and persevere (do more) or pivot (do something different).
Evidence and data — We instrument everything, not just for health, availability, performance, and other qualities of service, but to understand usage and to collect evidence relative to the backlog hypotheses. For example, we will experiment with changes to user experience and measure the impact on conversion rates in the funnel. We will contrast usage data among cohorts, such as weekday and weekend users, to hypothesize ways of improving the experience for each.
Production first mindset — That data is reliable only if the quality of service is consistently excellent. We always track the live site status, remediate any live site incidents at root cause, and proactively identify any outliers in performance to see why they are experiencing slowdowns.
Cloud ready — We can only deliver a 24x7x365 service by continually improving our architecture to refactor into more independent, discrete services and by using the flexible infrastructure of the public cloud. When we need more capacity, the cloud (in our case, Azure) provides it. We develop every new capability cloud-first before moving into our on-premises product, with a few very intentional exceptions. This gives us confidence that it has been hardened at scale and that we have received continuous feedback from constant usage.
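The hypothesis-based backlog and the conversion-rate example in the practice areas above can be illustrated with a small calculation. This is a simplified sketch under stated assumptions: the `Cohort` type, the `decide` function, the `min_lift` threshold, and the numbers are all hypothetical, and a real experiment would also test for statistical significance.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    visitors: int
    conversions: int

    @property
    def conversion_rate(self) -> float:
        return self.conversions / self.visitors if self.visitors else 0.0


def decide(control: Cohort, treatment: Cohort, min_lift: float) -> str:
    """Return "persevere" if the treatment beats control by at least min_lift.

    min_lift is an absolute difference in conversion rate. This sketch
    compares point estimates only and ignores sampling noise.
    """
    lift = treatment.conversion_rate - control.conversion_rate
    return "persevere" if lift >= min_lift else "pivot"


control = Cohort(visitors=10_000, conversions=400)    # 4.0% (made-up numbers)
treatment = Cohort(visitors=10_000, conversions=460)  # 4.6%

print(decide(control, treatment, min_lift=0.005))  # persevere
```

With a stricter threshold, for example `min_lift=0.01`, the same data would lead to a pivot, so the hypothesis should state its success criterion before the experiment runs.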