Saving Interesting Observability Data Using SLO-Based Retention Policy
Observability data has a short shelf life, but that doesn’t stop it from growing to fill all available space.
In any data management policy, there are two extremes: save everything (just in case) and delete everything that ages out. In practice, the two extremes meet: even if you want to save it all, the realities of storage costs eventually force you to delete data arbitrarily.
Ideally, you would retain “interesting” data that might be useful and delete the rest. Even better would be to not collect the data in the first place.
Business and Environmental Impact of Excess Data
Storing vast quantities of data is not only terrible for business but also disastrous for the planet. According to The Shift Project think tank, emissions from cloud computing account for 2.5% to 3.7% of all global greenhouse gas emissions, and storing monitoring data, logs, traces, and other metrics you may never look at again is part of that footprint.
Taking a mindful, targeted approach that stores only relevant, valuable data and proactively cleans up the boring rest can significantly reduce these harmful emissions. Operations staff benefit from the additional annotated context and can be much more productive when analyzing past events. And the business can (for once) see hard cost savings from reduced storage and data transfer, a bottom-line impact you can't ignore.
Critical vs. Boring Time Periods
Operations teams always want to come back and analyze what happened after an incident. Perhaps you have automated runbooks that scale up your services during peak load times, and you’d like to know if it worked. Your automation should have rolled back, turned off feature flags, or failed over to another site — did it?
New deployments are usually critical periods for retrospectives, whether you are gauging how a release affects the production user experience or trying to move from maintenance windows to rolling updates. Save the critical data, and discard the rest.
Programmatically Separating Boring From Critical
Service level objectives (SLOs) are goals for how a service should perform against expectations; they keep users happy and provide an early warning of degrading performance. Many organizations automate SLOs by defining them in code (using OpenSLO, for example) to optimize their incident response and resource planning. Automated SLOs can also be used as part of a data management strategy because they give a clear picture of when a system is behaving normally (“boring” times) and when something unexpected or critical is happening (“interesting” times). Service level indicators (SLIs) that feed SLOs make excellent summary views and allow you to discard more detailed data, especially when combining multiple data sources into a single SLO.
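As a rough illustration, here is a minimal Python sketch that splits time windows into “boring” and “interesting” based on whether an SLI meets its objective. The window shape, field names, and the 99.5% availability target are assumptions for the example, not output from any particular SLO tool.

```python
from dataclasses import dataclass

# Hypothetical SLI samples: per-window counts pulled from your metrics store.
@dataclass
class Window:
    start: str          # ISO-8601 timestamp for the window start
    good_requests: int  # requests that met the availability/latency threshold
    total_requests: int

SLO_TARGET = 0.995  # assumed availability objective (99.5%)

def classify(windows: list[Window]) -> dict[str, list[Window]]:
    """Split windows into 'boring' (SLO met) and 'interesting' (SLO missed)."""
    result = {"boring": [], "interesting": []}
    for w in windows:
        sli = w.good_requests / w.total_requests if w.total_requests else 1.0
        result["interesting" if sli < SLO_TARGET else "boring"].append(w)
    return result

if __name__ == "__main__":
    sample = [
        Window("2023-04-01T10:00:00Z", 9990, 10000),  # boring
        Window("2023-04-01T11:00:00Z", 9400, 10000),  # interesting
    ]
    for label, ws in classify(sample).items():
        print(label, [w.start for w in ws])
```

The “interesting” windows are the ones worth annotating and retaining at full fidelity; the “boring” ones can be summarized by the SLI itself.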
Automated SLOs can also trigger time-stamped events that enrich the data (often referred to as “annotations”), which can later serve as the “system of record” for the reliability of the system. If you annotate the exciting times, you can use this additional context to build a data retention plan that keeps full fidelity for the dramatic moments while moving boring data to cold storage and scheduling it for eventual deletion.
With a sufficiently large and stable system, you may also be able to use machine-learning methods to identify and annotate critical times. AIOps (machine-learning-driven operational automation) can layer on top of SLOs and other ways of categorizing the time series data, feeding the same critical-event tracking system.
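You don't need a sophisticated model to start; even a rolling z-score over a metric series can flag candidate moments for annotation. The following sketch is one simple possibility, with the window size, threshold, and data shape all chosen for illustration.

```python
import statistics
from datetime import datetime, timedelta

def find_interesting_times(samples, window=30, threshold=3.0):
    """Flag timestamps whose metric value deviates strongly from the
    recent rolling baseline (a stand-in for fancier AIOps models)."""
    interesting = []
    for i in range(window, len(samples)):
        ts, value = samples[i]
        baseline = [v for _, v in samples[i - window:i]]
        mean = statistics.fmean(baseline)
        stdev = statistics.pstdev(baseline) or 1e-9  # avoid divide-by-zero
        if abs(value - mean) / stdev > threshold:
            interesting.append(ts)
    return interesting

# Hypothetical usage: error-rate samples at one-minute resolution.
start = datetime(2023, 4, 1)
series = [(start + timedelta(minutes=i), 0.01) for i in range(60)]
series[45] = (series[45][0], 0.50)  # a spike worth annotating
print(find_interesting_times(series))
```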
Case Study: Clean Up After Yourself
One company I worked with used SLOs to shorten their observability data retention windows. They already had a short window (7 days), after which the retention policy deleted all observability data. The sheer volume of data across a vast platform had forced them into this brute-force method of controlling storage costs. Of course, the team would have preferred to keep everything to investigate and learn from, so every incident became a scramble to understand what happened before the week was up. Unfortunately, the weekly purge also made it hard to comprehend infrequent recurring issues because they couldn't compare recent events to previous ones.
To solve this problem, they created a list of timestamps for changes to the running system that might be worth investigating. A simple API would record a timestamp along with a text field describing why the situation was worth a look. They then automated calls to this API whenever an SLO was trending down, whenever a new release was pushed to production, or whenever an operator noticed something in real time that might be the start of an incident. Because the timestamps came from both automated rules and manual tagging, plenty of context about each event was available, without additional effort, to understand why it was recorded.
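A sketch of what such an annotation call might look like follows; the endpoint URL, payload fields, and trigger strings are hypothetical stand-ins for whatever the real internal service would use.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical internal endpoint; the real service, auth, and payload
# shape would be specific to your environment.
ANNOTATIONS_URL = "https://observability.internal.example/api/annotations"

def annotate(reason: str, source: str = "automation") -> None:
    """Record a timestamp plus a short note explaining why this moment
    is worth keeping (SLO trending down, new release, operator hunch)."""
    payload = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "reason": reason,
    }
    req = urllib.request.Request(
        ANNOTATIONS_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

# Example triggers (illustrative):
# annotate("checkout SLO error budget burning fast", source="slo-alert")
# annotate("deployed release 2024.07.3 to production", source="ci-pipeline")
# annotate("operator noticed elevated queue depth", source="manual")
```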
Once they had the timestamps, it was easy to automate a collection job that scanned the source data and exported the interesting periods (+/- 1 hour around each timestamp) to cold storage before the retention policy permanently deleted them. Confident that the interesting data would be preserved, they shortened the primary retention period to three days (halving hot storage costs) and gained much more visibility into recurring issues, significantly increasing the team's ability to learn.
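A sketch of that export step, assuming the annotation timestamps are already collected: `export_to_cold_storage` is a placeholder for whatever actually copies raw telemetry (to object storage, for example) in your stack.

```python
from datetime import datetime, timedelta

PAD = timedelta(hours=1)  # keep +/- 1 hour around each annotation

def interesting_windows(annotation_times: list[datetime]) -> list[tuple[datetime, datetime]]:
    """Turn annotation timestamps into merged (start, end) export windows."""
    windows = sorted((t - PAD, t + PAD) for t in annotation_times)
    merged: list[tuple[datetime, datetime]] = []
    for start, end in windows:
        if merged and start <= merged[-1][1]:  # overlaps the previous window
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def archive_before_purge(annotation_times, export_to_cold_storage):
    """Export every interesting window; everything else ages out normally.
    `export_to_cold_storage(start, end)` is a hypothetical callback."""
    for start, end in interesting_windows(annotation_times):
        export_to_cold_storage(start, end)
```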
Case Study: Triggering Verbose Collection
Another company had a similar situation, but with a twist: they wanted to prevent the data from being written in the first place! The company had a large fleet of devices emitting log data that was collected and analyzed centrally. Moving that data over the network carried high transfer costs and was usually unnecessary (right up until it was absolutely essential!). They erred on the side of collecting more data rather than less, but even with a cleanup strategy to reduce storage costs, the upfront network costs remained a significant recurring expense.
To solve this, they built an API that could adjust the amount of verbosity produced by the devices and sent home to the central observability system. They could trigger verbose metrics across any part of the fleet remotely and also revert the devices into concise mode. This became especially valuable during critical periods like over-the-air (OTA) updates when the team pushed new software updates and needed to monitor the rollout closely.
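A sketch of a client for such an API is below; the endpoint, device selector syntax, and mode names are made up for illustration rather than taken from any real fleet-management product.

```python
import json
import urllib.request

# Hypothetical fleet-management endpoint.
FLEET_URL = "https://fleet.internal.example/api/logging-mode"

VALID_MODES = {"verbose", "concise", "heartbeat-only"}

def set_logging_mode(device_selector: str, mode: str) -> None:
    """Tell a slice of the fleet how much telemetry to send home."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    payload = {"selector": device_selector, "mode": mode}
    req = urllib.request.Request(
        FLEET_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)

# Example: crank up detail for the devices about to receive an OTA update.
# set_logging_mode("region=eu-west,ota-ring=canary", "verbose")
```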
Before an OTA update, they would trigger “verbose” mode, then perform the rollout with detailed logging available to manage the release. As each device completed the update, it would revert to “concise” mode. In addition to OTA updates, they could also adjust the tracing level using SLOs. For example, if an automated SLO detected more errors than expected, the fleet could automatically switch into verbose mode to improve the team's ability to investigate. On the other hand, if network bandwidth issues slowed the data pipeline, they could turn logging down from concise mode to an even lighter form of metrics – heartbeat-only mode – and reduce the data load on the service.
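Tying those triggers together, the mode selection itself can be a small rule. The sketch below assumes the priority order (protect the data pipeline first, then favor detail during OTA updates or SLO trouble); the real policy would depend on the system.

```python
def choose_logging_mode(ota_in_progress: bool,
                        slo_error_burn_high: bool,
                        pipeline_backlogged: bool) -> str:
    """Pick a fleet logging mode from the signals described above."""
    if pipeline_backlogged:
        return "heartbeat-only"  # shed load when the pipeline is slow
    if ota_in_progress or slo_error_burn_high:
        return "verbose"         # detail when something important is happening
    return "concise"             # the boring default
```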
The entire process, from starting the OTA update to modifying logging levels to observing increased error rates to completing the update, was annotated in a time series database using a simple API. This added context for later investigation and aided data retention and cleanup after the fact as well.
Keep the Interesting Data; Delete the Rest
Everyone is closely watching their budgets and looking for places to creatively cut costs while preserving productivity. Now is a perfect time to take a hard look at how much data you’re storing and what you’re actually doing with it. Observability data has a short shelf life, but that doesn’t stop it from growing to fill all available space. Even a small improvement to your collection, retention, summarization, and context building could be a huge boost and enable your organization to do much more with less.