Here Are the Metrics You Need to Understand Operational Health
In this blog post, we’ll walk you through holistic measures and best practices for understanding the operational health of your systems.
In recent polls we’ve conducted with engineers and leaders, around 70% of participants cited MTTA and MTTR among their main metrics. 20% of participants cited looking at planned versus unplanned work, and 10% said they currently look at no metrics at all. While MTTA and MTTR are good starting points, they're no longer enough. As system complexity rises, it can be difficult to gain insight into your services’ operational health.
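For readers newer to these terms, here is a minimal sketch of how MTTA (mean time to acknowledge) and MTTR (mean time to resolve) are typically derived from incident timestamps. The incident records below are hypothetical and not tied to any particular tool.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when each incident was created,
# acknowledged by a responder, and resolved.
incidents = [
    {"created": "2020-01-03 10:00", "acked": "2020-01-03 10:07", "resolved": "2020-01-03 11:02"},
    {"created": "2020-01-09 14:30", "acked": "2020-01-09 14:33", "resolved": "2020-01-09 15:10"},
    {"created": "2020-01-21 08:15", "acked": "2020-01-21 08:40", "resolved": "2020-01-21 09:55"},
]

def minutes_between(start: str, end: str) -> float:
    fmt = "%Y-%m-%d %H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

# MTTA: average time from incident creation to acknowledgement.
mtta = mean(minutes_between(i["created"], i["acked"]) for i in incidents)
# MTTR: average time from incident creation to resolution.
mttr = mean(minutes_between(i["created"], i["resolved"]) for i in incidents)

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

These averages are easy to compute, which is part of why they’re so popular; the rest of this post looks at what they leave out.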
In this blog post, we’ll walk you through holistic measures and best practices that you can employ starting today. These will include challenges and pain points in gaining insight as well as key metrics and how they evolve as organizations mature.
Pain Points for Creating Useful Metrics
It’s easy to fall into the trap of being data rich but information poor. Building metrics and dashboards with the right context is crucial to understanding operational health, but where do you start? It’s important to look at roadblocks to adoption thus far in your organization. Perhaps other teams (or even your team) have looked into the way you measure success before. What halted their progress? If metrics haven’t undergone any change recently, why is that?
Below are some of the top customer pain points and challenges that we typically see software and infrastructure teams encounter.
Lack of data: Your data is fragmented across your APM, ticketing, chatops, and other tools. Even worse, it's typically also siloed across teams that run at different speeds. A lot of it is tribal knowledge, or it simply doesn't exist.
No feedback loop: There's limited to no integration between incidents, retrospectives, follow-up action items, planned work, and customer experience. It’s challenging to understand how it all ties together as well as pinpoint how to improve customer experience. You’re constantly being redirected by unplanned work and incidents.
Blank slate: Traditional APM and analytics tools are great for insights, but without a baseline of metrics that are prescriptive and based on operational best practices, it's hard to know where to start.
One-size-fits-all: What works for one team won't necessarily work for another. Everything needs to be put in the right context to provide truly relevant insights.
With these pain points in mind, let’s look at some key metrics other organizations we’ve spoken to have found success with.
Key Metrics for Operational Health
For all metrics, the incentives across cross-functional software teams, development, operations, and product all need to be aligned. Metrics aren't there to crack the whip; they're simply a tool to help with planning. We like to think of them as the compass to help teams go down the right maturity path. If they're not being met, there should be an effort to understand why without blame. It’s highly likely that they're not being set in an achievable manner. It takes everyone on board to achieve a deeper understanding.
When we look at the metrics, we want to make sure that we have the right context, as metrics without context can lead to the wrong behaviors (such as fear-based management). We’re not measuring just for the sake of measuring, and we don’t need every metric in the world. We need to choose the metrics that help us get closer to doing what’s best for customers and the team.
Here are some common metrics that, with the right context, we’ve seen organizations have success with:
Velocity
Arguably one of the most frequently used dimensions of measurement.
Key measures include sprint capacity planning and how quickly the team pushes new features to production.
Availability
The probability that a system is operational at a given time.
Key measures include understanding the system’s and the team’s ability to recover from incidents and interrupt-driven work (a minimal availability/SLO sketch follows this list).
Engineering Toil
How much of the team’s time is spent on thrash, and how much operational inefficiency exists in the system.
Key measures include reducing manual work through automation and lowering cognitive overhead.
Product Quality and Customer Happiness
Understanding the customer happiness level.
Key measures include understanding the status of key user journeys (SLOs), reactive incident response, and more.
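To make the availability and SLO measures concrete, here is a minimal sketch with made-up numbers showing how a team might compute monthly availability for a key user journey and compare it against an SLO target to see how much error budget remains. The 99.9% objective and the downtime figure are assumptions for illustration only.

```python
# Hypothetical monthly figures; the SLO target and downtime are made up.
slo_target = 0.999            # 99.9% availability objective for a key user journey
total_minutes = 30 * 24 * 60  # a 30-day month
downtime_minutes = 38         # minutes the journey was unavailable this month

availability = 1 - downtime_minutes / total_minutes
error_budget_minutes = (1 - slo_target) * total_minutes     # 43.2 minutes allowed per month
budget_remaining = error_budget_minutes - downtime_minutes  # negative means the SLO was missed

print(f"Availability: {availability:.4%}")
print(f"Error budget remaining: {budget_remaining:.1f} of {error_budget_minutes:.1f} minutes")
```

Framing availability as a budget rather than a single number is what later lets teams make data-driven tradeoffs between reliability work and feature work.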
Each of these metrics on its own is unlikely to be indicative of operational health. For example, teams with higher deploy rates might seem more successful than teams with lower deploy rates. However, if the team with higher deploy rates also has a higher rate of failures or regressions, it’s no longer clear which team is more successful.
Teams will need to figure out the context relevant to each metric. This might take time. After all, each team and organization will be at a different level of the metrics maturity model.
Metrics Maturity Model
Based on our customer interviews, we’ve mapped out a maturity model that depicts how metrics evolve with operational maturity, ranging from fragile organizations to leaders in the space. We’ve outlined some key characteristics of each stage below.
Fragile: This level of maturity is where many organizations find themselves. It’s reactive and stressful. At this stage, teams are likely measuring one thing: the number of incidents or customer tickets. This is a context-less stage where only the number is relevant. A quarter with 53 incidents may be seen as less successful than a quarter with 45 incidents, even if the 53 incidents were quick fixes that caused little customer impact. Or perhaps many of the incidents are labeled as either Sev0 or Sev1, either because teams aren’t sure how to classify incidents properly, or because each incident is that dire. As a result, the majority of the time is spent on unplanned work.
Unified: At this level, teams are likely sorting incidents by types and tags, gaining some insight into problem areas that had previously gone undiscovered. With this increased visibility, incidents will likely be spread between severe and less severe due to improved incident classification and mitigation capabilities. However, 30-50% of the time is still spent on unplanned work.
Advantage: Teams at this maturity level have more advanced metrics as well as SLOs to help pre-empt customer impact. This allows for data-driven tradeoffs in prioritizing reliability work alongside feature work. More mature teams make smaller, faster changes and can better localize the blast radius of incidents, so the majority of incidents typically fall in the Sev2 or Sev3 categories. At this point, less than 30% of the time is spent on unplanned work.
Leader: This level of maturity is one that less than 1% of enterprise companies have achieved today. It is characterized by advanced practices such as graceful service degradation and fault tolerance to minimize customer impact even during the most unexpected events (i.e., massive, sudden changes in scale). Customers will see less than 1% of incidents, and 20% or less of the team’s time is spent on unplanned or reactive work.
While achieving the Leader stage may seem nearly impossible, it’s not out of reach for dedicated teams. Rather than try to reach this level all at once, however, teams should look at small, incremental improvements that allow them to gradually build their metrics base, as illustrated in the case study below.
A Metrics Case Study
At the beginning of 2019, one of our customers — a global eCommerce company — wasn’t tracking any of its incidents. We helped the team home in on a few basic metrics to start with, such as the cost of time spent on incidents, incidents by severity level, and the percentage of time spent on planned work (i.e., features) vs. unplanned work (i.e., incidents and bugs). During the first half of the year, the team worked hard to build those metrics.
In the second half of the year, the team had a solid baseline from which to understand trends and improvement opportunities. At the checkpoint, the team identified that 45% of total engineering time was spent on unplanned work, which amounted to roughly $200K per month.
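As a rough illustration of how such a figure is derived, the cost of unplanned work can be estimated by applying the unplanned-work percentage to the fully loaded monthly cost of the engineering team. The headcount and cost figures below are hypothetical, not the customer’s actual numbers.

```python
# Hypothetical team figures used only to illustrate the calculation.
engineers = 30
loaded_monthly_cost_per_engineer = 15_000  # fully loaded cost in dollars
unplanned_fraction = 0.45                  # 45% of engineering time on unplanned work

monthly_unplanned_cost = engineers * loaded_monthly_cost_per_engineer * unplanned_fraction
print(f"Estimated cost of unplanned work: ${monthly_unplanned_cost:,.0f}/month")
# With these made-up inputs, the estimate lands at roughly $200K/month,
# the same ballpark as the figure cited above.
```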
The majority of the incidents happened on the product page, involving load times and checkout process failures. With this data, the team began doing deeper triage to understand what caused incidents at these points in the user journey. Further investigation showed that the failures were associated with database latency and API calls to third-party fraud services and payment integrators.
During Q1 of 2020, we helped the team build a focused improvement plan that consisted of the following:
Rewrite the database queries and indexes to improve quality and performance.
Improve API connection handling and error handling (a sketch of one common approach follows this list).
Replace one of the fraud service providers.
Make changes to the CDN configuration to improve the delivery speed of dynamic objects and increase the TTL of static objects.
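The article doesn’t show the team’s actual code, but as one hedged illustration of the second item, here’s what tightening API connection handling and error handling might look like: explicit timeouts, connection reuse, and retries with backoff. The endpoint is hypothetical, and Python’s requests library is used purely for illustration; the same ideas apply in any stack.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Hypothetical endpoint; not the customer's actual integration.
FRAUD_CHECK_URL = "https://fraud-provider.example.com/v1/check"

def build_session() -> requests.Session:
    """Reuse connections and retry transient failures with exponential backoff."""
    retry = Retry(
        total=3,
        backoff_factor=0.5,  # 0.5s, 1s, 2s between attempts
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["POST"],
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry, pool_maxsize=20))
    return session

def check_order(session: requests.Session, order: dict) -> dict:
    # Explicit connect/read timeouts so a slow provider can't stall checkout threads.
    response = session.post(FRAUD_CHECK_URL, json=order, timeout=(3, 10))
    response.raise_for_status()
    return response.json()
```

The key design choice is bounding how long a slow or failing third party can hold up the checkout flow, which is exactly the kind of blast-radius control the Advantage and Leader stages rely on.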
After Q1 of 2020, the team ran the reports again and saw a 76% reduction in incidents on the key user journey flows (product pages and the checkout process). The percentage of total engineering time spent on unplanned work dropped by 40%. While this isn’t the end of their journey to understand operational health, it’s a great starting point.