How To Reduce MTTR
This article discusses ten ideas that can help reduce Mean Time To Recover for your critical production services.
As a Site Reliability Engineer, one of the key metrics I use to track the effectiveness of incident management is Mean Time To Recover (MTTR). Wikipedia defines MTTR as the average time that a service or system takes to recover from any failure. Keeping MTTR low is key to meeting the service level objectives, and in turn the service level agreements, of any critical production service.
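As a minimal sketch of the metric itself, MTTR is simply the average of the recovery durations across incidents in a given period. The sample durations below are illustrative.

```python
# Illustrative sketch: MTTR as the average recovery duration across incidents.
incident_recovery_minutes = [42, 18, 95, 30]  # time from detection to full recovery

mttr_minutes = sum(incident_recovery_minutes) / len(incident_recovery_minutes)
print(f"MTTR: {mttr_minutes:.1f} minutes")  # MTTR: 46.2 minutes
```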
10 Things That Can Help Reduce the Mean Time to Recovery (MTTR)
1. Clearly Defined SLIs
Service level indicators (SLIs) are the key measurements of the health of your service. A few examples of SLIs are error rate, latency, and throughput.
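As a minimal sketch, assuming a window of per-request records with a status code and a latency field, two common SLIs could be computed like this:

```python
# Sketch of computing two common SLIs (error rate and p99 latency) from a window
# of request records. The field names and the 5xx convention are assumptions.
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    status_code: int
    latency_ms: float

def error_rate(requests: list[Request]) -> float:
    """Fraction of requests in the window that returned a server error (5xx)."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if r.status_code >= 500) / len(requests)

def p99_latency_ms(requests: list[Request]) -> float:
    """99th-percentile latency over the window."""
    if len(requests) < 2:
        return requests[0].latency_ms if requests else 0.0
    return quantiles([r.latency_ms for r in requests], n=100)[-1]
```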
2. Actionable Alerts Based on SLIs
Your alerting strategy should focus on improving the signal-to-noise ratio of the alerts: every alert your team receives should be actionable. Sending too many alerts causes alert fatigue and increases the risk that the on-call engineer ignores alerts that indicate real issues with the service.
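One common way to keep alerts actionable is to page only when an SLI stays out of bounds for several consecutive evaluation windows, so transient blips don't wake up the on-call. A minimal sketch, with illustrative thresholds:

```python
# Sketch of an SLI-based alert rule: page only on a sustained error-rate breach.
ERROR_RATE_THRESHOLD = 0.01   # alert if more than 1% of requests fail...
SUSTAINED_WINDOWS = 3         # ...for 3 consecutive evaluation windows

def should_page(recent_error_rates: list[float]) -> bool:
    """Return True only if the last N windows all breach the threshold."""
    if len(recent_error_rates) < SUSTAINED_WINDOWS:
        return False
    return all(rate > ERROR_RATE_THRESHOLD
               for rate in recent_error_rates[-SUSTAINED_WINDOWS:])
```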
3. Troubleshooting Guides Associated With Alerts
Every alert should have a clearly defined troubleshooting guide (TSG) on how to triage and mitigate the issue the alert identifies. A good methodology to use while writing these troubleshooting guides is the USE method, suggested by Brendan Gregg in his book, "Systems Performance." USE stands for Utilization, Saturation, and Errors.
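One way to make sure the guide is always at hand is to attach it to the alert definition itself, with triage steps organized around Utilization, Saturation, and Errors. The alert name, URL, and steps below are hypothetical placeholders:

```python
# Sketch of an alert definition that carries its troubleshooting guide.
# All names, URLs, and steps are illustrative placeholders.
alert_definition = {
    "name": "checkout-api-high-error-rate",
    "condition": "error_rate > 1% for 3 consecutive windows",
    "tsg_url": "https://wiki.example.com/tsg/checkout-api-error-rate",
    "triage_steps": [
        "Utilization: check CPU and memory usage of the serving hosts",
        "Saturation: check request queue depth and thread pool usage",
        "Errors: check error logs and the list of recent deployments",
    ],
}
```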
4. Practice Troubleshooting Guides
Practicing troubleshooting guides periodically will help mitigate incidents faster when they occur. It will also help identify gaps in the TSGs, since services evolve over time. A good time to practice a troubleshooting guide is when a new team member joins the team: they bring a fresh perspective to the TSG, which helps reduce assumptions about prior knowledge of the system.
5. Usable Dashboards
The observability strategy should include creating easy-to-use dashboards. The dashboards should have panels for the key metrics of the service and for the health of its dependencies, both upstream and downstream. A good starting point is the four golden signals suggested by the Google SRE book: latency, traffic, errors, and saturation.
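As a sketch of what the starting panel set could look like, with illustrative metric names that you would map to whatever your metrics backend actually exposes:

```python
# Sketch of an initial dashboard layout: golden signals plus dependency health.
# Metric names are illustrative assumptions, not a real backend's identifiers.
dashboard_panels = [
    {"title": "Latency (p50 / p95 / p99)",     "metric": "http_request_duration_ms"},
    {"title": "Traffic (requests per second)", "metric": "http_requests_total"},
    {"title": "Error rate (%)",                "metric": "http_errors_total"},
    {"title": "Saturation (CPU / memory)",     "metric": "resource_utilization"},
    {"title": "Upstream dependency health",    "metric": "upstream_success_rate"},
    {"title": "Downstream dependency health",  "metric": "downstream_success_rate"},
]
```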
6. Automated Actions To Mitigate Issues
Automating certain mitigation actions based on metrics and events is key to reducing MTTR. An example is taking servers out of the load balancer rotation when packet loss is observed from them; this reduces the impact on user experience and brings MTTR down.
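A minimal sketch of that packet-loss example follows. The monitoring probe and load balancer action are passed in as callables, since the real calls depend on your own tooling; the threshold is illustrative.

```python
# Sketch of an automated mitigation: drain hosts showing high packet loss.
from typing import Callable

PACKET_LOSS_THRESHOLD_PCT = 5.0  # illustrative threshold

def remediate_packet_loss(
    hosts: list[str],
    packet_loss_pct: Callable[[str], float],     # backed by your monitoring system
    drain_from_rotation: Callable[[str], None],  # backed by your load balancer API
) -> list[str]:
    """Drain hosts whose packet loss exceeds the threshold; return the drained hosts."""
    drained = []
    for host in hosts:
        if packet_loss_pct(host) > PACKET_LOSS_THRESHOLD_PCT:
            drain_from_rotation(host)
            drained.append(host)
    return drained
```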
7. Failover Rehearsals
In multi-data center architectures, it is crucial to have failover plans defined so that an outage of a specific data center can be recovered from quickly. Practicing these failover scenarios periodically helps the team execute them quickly during a real outage, and also surfaces gaps in the plans so they can be updated and fixed.
8. Automated Failovers
Once the failover plans are defined, implemented, and practiced, the next step is to automate them based on health checks of the service in each data center. This helps mitigate issues faster and thus reduces MTTR.
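A sketch of the core decision, assuming a success-rate health check per data center and a traffic-shift action supplied by your traffic-management system (both injected as callables, with an illustrative threshold):

```python
# Sketch of health-check-driven failover between two data centers.
from typing import Callable

HEALTHY_SUCCESS_RATE = 0.99  # illustrative health-check threshold

def maybe_failover(
    primary_dc: str,
    secondary_dc: str,
    success_rate: Callable[[str], float],
    shift_traffic: Callable[[str, str], None],
) -> bool:
    """Shift traffic to the secondary DC if the primary is degraded and the
    secondary is healthy. Returns True if a failover was triggered."""
    primary_degraded = success_rate(primary_dc) < HEALTHY_SUCCESS_RATE
    secondary_healthy = success_rate(secondary_dc) >= HEALTHY_SUCCESS_RATE
    if primary_degraded and secondary_healthy:
        shift_traffic(primary_dc, secondary_dc)
        return True
    return False
```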
9. Change Management Process
Changes to production systems are a major cause of outages, so it is important to have a well-thought-out change management process in place. Key elements of this process include clearly defined checklists, change review and approval procedures, automated deployment pipelines with built-in monitoring, and the ability to quickly roll back a change if any issues are observed.
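A sketch of one such element, a deployment step that watches an SLI during a bake period and rolls back automatically on a breach. The deploy, error-rate, and rollback actions are injected callables standing in for your pipeline's real steps; the bake time and threshold are illustrative.

```python
# Sketch of a deployment with built-in monitoring and automatic rollback.
import time
from typing import Callable

def deploy_with_rollback(
    version: str,
    deploy: Callable[[str], None],
    current_error_rate: Callable[[], float],
    rollback: Callable[[], None],
    bake_seconds: int = 300,
    max_error_rate: float = 0.01,
) -> bool:
    """Deploy, watch the error-rate SLI during a bake period, roll back on breach."""
    deploy(version)
    deadline = time.monotonic() + bake_seconds
    while time.monotonic() < deadline:
        if current_error_rate() > max_error_rate:
            rollback()
            return False
        time.sleep(30)
    return True
```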
10. Easy To Identify Change List and Automated Rollbacks
In distributed systems built as microservices, many changes are deployed continuously across services. Having a central system where one can easily see which changes were made during a given period of time makes it much easier to identify whether a specific change caused an outage and to roll it back quickly.
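A sketch of such a central change log, able to answer "what changed in the last N minutes across all services?" during an incident. The record fields, including the rollback command, are illustrative assumptions.

```python
# Sketch of a central change log queryable by time window.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class ChangeRecord:
    service: str
    change_id: str
    description: str
    deployed_at: datetime
    rollback_command: str

def changes_in_last(minutes: int, change_log: list[ChangeRecord]) -> list[ChangeRecord]:
    """Return changes deployed within the last `minutes` minutes, newest first."""
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=minutes)
    return sorted(
        (c for c in change_log if c.deployed_at >= cutoff),
        key=lambda c: c.deployed_at,
        reverse=True,
    )
```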
Conclusion
In this article, I have discussed 10 things that can help reduce the Mean Time To Recovery of any critical production service. This is not an exhaustive list, but a list of best practices based on my years of experience working as a Site Reliability Engineer on services such as TikTok, Microsoft Teams, Xbox, and Microsoft Dynamics.