Implementing SLAs, SLOs, and SLIs: A Practical Guide for SREs

Explore the definitions of SLAs, SLOs, and SLIs and how they help in monitoring and maintaining system performance effectively.

Karthigayan Devan

Jun. 13, 24 · Analysis

As digital transformation continues across Information Technology (IT), more applications are hosted in cloud environments every day. Monitoring and maintaining these applications daily is challenging, and we need proper metrics in place to measure performance and take action. This is where SLAs, SLOs, and SLIs come into the picture: they help in monitoring and maintaining system performance effectively.

Defining SLA, SLO, SLI, and SRE

What Is an SLA? (Commitment)

A Service Level Agreement is an agreement between the cloud provider and the client/user about measurable metrics (e.g., uptime). It is normally handled by the company's legal department in line with business and legal terms. It lists all the factors covered by the agreement and the consequences if the provider fails to meet them (e.g., credits or penalties). It mostly applies to paid services, not free ones.

What Is an SLO? (Objective)

A Service Level Objective is a target the cloud provider must meet to satisfy the agreement made with the client. It states the expectation for a specific individual metric (e.g., availability). This helps improve the overall service quality and reliability that clients receive.

What Is an SLI? (How Did We Do?)

A Service Level Indicator is the actual measurement used to check compliance with an SLO. It gives a quantified view of the service's performance (e.g., a measured availability of 99.92%).

Who Is an SRE?

A Site Reliability Engineer is an engineer who works to minimize the gaps between software development and operations. The role is related to DevOps, which focuses on identifying those gaps. An SRE creates and uses automation tools to monitor and observe software reliability in production environments.

In this article, we will discuss the importance of SLOs, SLIs, and SLAs and how a Site Reliability Engineer (SRE) can implement them in production applications.

Implementation of SLOs and SLIs

Let’s assume we have an application service that is up and running in a production environment. The first step is to determine what an SLO should be and what it should cover.

Example of SLOs

SLO = Target
- Above this target, GOOD
- Below this target, BAD: Needs an action item
  - When setting a target, do not aim for 100% reliability. It is not achievable in practice, because patches, deployments, downtime, etc. will cause the service to miss it most of the time. This is where the Error Budget (EB) comes into the picture. EB is the maximum amount of time that a service can fail without contractual consequences.

For example:

SLA = 99.99% uptime
- EB = 52 mins and 35 secs per year, or 4 mins and 23 secs per month. This is how long the system can be down without consequences.

The next step is to decide how to measure this SLO. This is where the SLI comes into the picture: it is an indicator of the level of service that you are providing.
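The error budget arithmetic for an uptime target can be reproduced with a short script. This is a minimal sketch, not a standard tool: the function name `error_budget` is hypothetical, and it assumes an average year of 365.25 days and a month of one twelfth of a year.

```python
from datetime import timedelta

DAYS_PER_YEAR = 365.25  # assumption: average year length, including leap years


def error_budget(target_percent: float, period: timedelta) -> timedelta:
    """Return the downtime allowed by an uptime target over the given period."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    allowed_fraction = 1 - target_percent / 100
    return timedelta(seconds=period.total_seconds() * allowed_fraction)


year = timedelta(days=DAYS_PER_YEAR)
month = year / 12  # assumption: a month is one twelfth of a year

yearly_budget = error_budget(99.99, year)
monthly_budget = error_budget(99.99, month)

print("Yearly error budget: ", yearly_budget)
print("Monthly error budget:", monthly_budget)
```

Changing the first argument (for example to 99.9 or 99.95) shows how quickly the budget grows as the target is relaxed.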

Example of SLIs

HTTP request success rate = number of successful requests / total number of requests
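As a rough illustration, the ratio above can be computed from two request counters. The function name `success_rate_sli` and the sample counts are hypothetical, and treating a period with no traffic as 100% is an assumption that may not suit every service.

```python
def success_rate_sli(successful_requests: int, total_requests: int) -> float:
    """Return the percentage of requests that succeeded."""
    if successful_requests < 0 or total_requests < 0:
        raise ValueError("request counts must not be negative")
    if successful_requests > total_requests:
        raise ValueError("successful_requests cannot exceed total_requests")
    if total_requests == 0:
        return 100.0  # assumption: no traffic means nothing failed
    return 100.0 * successful_requests / total_requests


# Hypothetical counts for one measurement window.
sli = success_rate_sli(successful_requests=99_920, total_requests=100_000)
print(f"HTTP success rate SLI: {sli:.2f}%")
```

In practice the two counts would come from whatever monitoring system records request outcomes.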

Common SLI Metrics

Durability
Response time
Latency
Availability
Error rate
Throughput

Use automated deployment, monitoring, and reporting tools (e.g., Prometheus, Grafana) to check SLIs and detect deviations from SLOs in real time.

Category	SLO	SLI
Availability	99.92% uptime per month	X% of the time the app is available
Latency	92% of requests with response time under 240 ms	X% of user requests with response time under 240 ms
Error rate	Less than 0.8% of requests result in errors	X% of requests that fail
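One way to compare measured SLIs with the SLO targets in the table is sketched below. The `Slo` class, the `percent_under` helper, and all measured values are hypothetical. The latency SLI is computed here as the percentage of requests under 240 ms so that it can be compared directly with the 92% target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    category: str
    target_percent: float
    higher_is_better: bool  # True for availability/latency, False for error rate

    def is_met(self, sli_percent: float) -> bool:
        """Return True when the measured SLI satisfies this SLO."""
        if self.higher_is_better:
            return sli_percent >= self.target_percent
        return sli_percent < self.target_percent  # "less than" in the table


def percent_under(latencies_ms: list[float], threshold_ms: float) -> float:
    """Return the percentage of samples faster than the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return 100.0 * fast / len(latencies_ms)


# Targets taken from the table above.
SLOS = [
    Slo("Availability", 99.92, higher_is_better=True),
    Slo("Latency", 92.0, higher_is_better=True),
    Slo("Error rate", 0.8, higher_is_better=False),
]

# Hypothetical measurements for one reporting period.
sample_latencies_ms = [120, 180, 230, 250, 90, 200, 310, 150, 220, 100]
measured = {
    "Availability": 99.95,
    "Latency": percent_under(sample_latencies_ms, threshold_ms=240),
    "Error rate": 0.5,
}

results = {slo.category: slo.is_met(measured[slo.category]) for slo in SLOS}
for category, met in results.items():
    print(f"{category}: {'GOOD' if met else 'BAD: needs an action item'}")
```

With these sample values, only 8 of the 10 latency samples are under 240 ms, so the latency SLO is missed while the other two are met.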

Challenges

SLA: Normally, SLAs are written by business or legal teams with no input from technical teams, so key aspects to measure are missed.
SLO: SLOs are often not measurable, or too broad to calculate.
SLI: There are too many metrics, and they are captured and calculated in different ways. This creates a lot of work for SREs and yields limited benefit.

Best Practices

SLA: Involve the technical team when the company's business/legal team and the provider write the SLA. This helps the agreement reflect real technical scenarios.
SLO: Keep it simple and easy to measure, so it is clear whether we are in line with the objectives.
SLI: Define all the standard metrics to monitor and measure. This helps SREs check the reliability and performance of the services.

Conclusion

SLAs, SLOs, and SLIs should be included in the system requirements and design, and they should be improved continuously. SREs need to understand and take responsibility for how the systems serve the business needs, and take the necessary measures to minimize the impact of failures.
