Why a Site Reliability Engineer Is Important to Your CI/CD Pipeline

Written from firsthand SRE experience, this article touches on the importance of SREs and some of the key benefits of their involvement in the CI/CD pipeline.

Alireza C

Mar. 29, 22 · Analysis

This is an article from DZone's 2022 DevOps Trend Report.

Continuous integration and continuous deployment are two major components of DevOps practice. Every organization that wants to move away from the traditional way of working has to learn, design, and implement a mature CI/CD pipeline. A mature CI/CD pipeline is a good start for site reliability engineering, but on its own, it’s not enough. The site reliability engineering (SRE) methodology brings a new perspective to the software development life cycle (SDLC) by aiming to achieve reliability at scale.

Drawing on my own experience of being an SRE for more than five years, I will touch on some of the key benefits I've experienced and why it's important for SREs to be involved in the CI/CD pipeline.

SRE Engineer vs. DevOps Engineer Approach Toward CI/CD

Although DevOps and the SRE approach have many things in common, they are still two different approaches that were created for different purposes. SRE was created after DevOps, when it became apparent that the DevOps way of working could not tackle all issues and satisfy all requirements. That’s why we can see these different approaches toward the CI/CD pipeline, where the most important activities of the SDLC happen. I had a chance to work as both a DevOps engineer and an SRE engineer, and here are some differences that I observed:

Subject	DevOps Approach	SRE Approach
CI/CD pipeline	Aims at establishing a CI/CD pipeline where either there are no pipelines at all, or the one in place has not been properly implemented	Aims at modifying an existing pipeline and identifying bottlenecks and problems that impact lead time
Automation in CI/CD	Tries to automate everything in the CI/CD pipeline	Takes it one step further and automates incident management and production issues
Incident management and CI/CD	Has to align with different parties’ engineers to apply any changes	Has more freedom to decide and execute operations in order to mitigate issues quickly
Normal in CI/CD	Having a good, working CI/CD pipeline	There is no normal. Reliability is never taken for granted, and it is assumed that unexpected incidents will occur. This results in constant improvement of the CI/CD pipeline, fewer incidents, and greater team maturity in mitigating issues on time.

Improvement From the Ground Up

KPIs are the core of decision-making for SRE engineers, and they feed a performance dashboard where developers can see the quality of their work across various metrics. This visibility makes developers aware of application and service performance before a release reaches production. Developers therefore have a chance to identify issues and inefficiencies much earlier than in a typical development life cycle without SRE.

SRE engineers start from measurements and look at the existing CI/CD KPIs, if there are any. Otherwise, they define the KPIs themselves to capture the current status of the pipelines. Then they can create a roadmap with measurable milestones to improve the pipeline. This approach helps engineers consider various performance factors while they are still designing functionality.
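
To make the idea of defining pipeline KPIs more concrete, here is a minimal Python sketch. It assumes two KPIs that are not prescribed by this article: lead time (commit to production) and change failure rate. The `Deployment` record and its field names are hypothetical; in practice the data would come from your own pipeline and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    """One pipeline run that reached production (hypothetical record shape)."""
    committed_at: datetime   # when the change was committed
    deployed_at: datetime    # when the change went live in production
    caused_incident: bool    # whether the release led to a production incident


def median_lead_time(deployments: list[Deployment]) -> timedelta:
    """Median time between commit and production deployment."""
    if not deployments:
        raise ValueError("need at least one deployment")
    return median(d.deployed_at - d.committed_at for d in deployments)


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that led to an incident (0.0 to 1.0)."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = sum(d.caused_incident for d in deployments)
    return failures / len(deployments)


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    start = datetime(2022, 3, 1, 9, 0)
    history = [
        Deployment(start, start + timedelta(hours=1), False),
        Deployment(start, start + timedelta(hours=3), True),
        Deployment(start, start + timedelta(hours=5), False),
    ]
    print("median lead time:", median_lead_time(history))
    print("change failure rate:", change_failure_rate(history))
```

Tracking values like these over time gives the roadmap its measurable milestones: each pipeline improvement should move at least one of the numbers.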

As SRE engineers create KPIs, they need to work closely with developers to understand the system logic, architecture, and relationships between components. This collaboration creates a team synergy in which engineers not only learn from one another but also master various skills and can cover for each other whenever needed.

Traditionally, application functionality is the most important part of the design architecture. That’s why aspects like high availability and reliability are sometimes not taken into account at the beginning. The SRE approach considers reliability, availability, and resiliency from the design stage. This results in significant savings in development and operations, since it is very costly to redesign projects if these issues surface later in production.

SRE Incident Management and CI/CD

When it comes to production incidents, it is crucial to detect issues quickly and restore the system to its normal state. SRE practices came as an enhancement to DevOps practices. One interesting SRE approach is that engineers can deploy new patches during an incident without impacting the other parts of the running system. Some downtime is hard to avoid when there is an incident or a new deployment is in progress, but SRE engineers constantly try to reduce it, using techniques such as zero-downtime deployment. SRE engineers can decide on the required change or fix and immediately trigger the pipeline to release the change from dev to production.
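
One common way to approach zero-downtime deployment is a rolling update: instances are upgraded one at a time, so the rest keep serving traffic, and the change is rolled back if a health check fails. The sketch below is a simplified in-memory simulation of that idea, not a description of any particular tool. The `rolling_update` function, the instance names, and the `is_healthy` callback are all hypothetical.

```python
from typing import Callable


def rolling_update(
    instances: dict[str, str],
    new_version: str,
    is_healthy: Callable[[str, str], bool],
) -> bool:
    """Upgrade instances one at a time; roll everything back on a failed check.

    `instances` maps an instance name to the version it currently runs and is
    modified in place. Returns True if every instance ends up on `new_version`.
    """
    upgraded: list[tuple[str, str]] = []  # (name, previous version)
    for name in list(instances):
        previous = instances[name]
        # Only this one instance is being changed; the others keep serving.
        instances[name] = new_version
        if not is_healthy(name, new_version):
            # Restore the failing instance and everything upgraded before it.
            instances[name] = previous
            for done_name, done_previous in upgraded:
                instances[done_name] = done_previous
            return False
        upgraded.append((name, previous))
    return True


if __name__ == "__main__":
    fleet = {"web-1": "1.0", "web-2": "1.0", "web-3": "1.0"}
    # Assumed health check for the demo: every instance reports healthy.
    succeeded = rolling_update(fleet, "1.1", lambda name, version: True)
    print(succeeded, fleet)
```

In a real pipeline, the health check would call the service itself, and the instance being upgraded would be taken out of the load balancer first.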

SRE engineers do not follow a bureaucratic approach in which a certain number of parties have to be involved in decisions about the production environment, and there is no hierarchy in place. The SRE approach accepts some risk, but it creates an autonomous team that can decide and act fast on incidents.

However, this doesn’t mean that the other parties are never involved or informed properly. Here are some examples of practical situations where communication should happen:

Suppose a high-priority operation will disrupt production applications and services. In that case, the client should be informed before, during, and after the operation to ensure that everything related to live operations, data loss, and so on is under control.
Suppose there is a rollback operation and an old version of an application or a service has to be installed. In that case, the development team should be informed and involved to ensure there is no problem with other services after this rollback.
Suppose any process, deployment step, or even any line of code written by developers has to be changed by SRE engineers in an emergency. In that case, the development team should be informed afterward to make sure they are aware of these changes and the reasons they were made.

SRE Proactively Trains Team Members on CI/CD

When we talk about quality, we should be able to turn it into numbers so that we can quantify it with metrics. I remember when we created our first-ever production dashboard for application performance. Most people did not understand what we were doing. As we rolled it out, however, it made visible how much memory and disk space every production server was using. After a couple of weeks, non-operational people started to notify us about application quality. They simply looked at the dashboard and spotted some warnings.

Since the dashboard was easy to understand, they were able to grasp the problem quickly, and they even took the initiative to make sure we were aware of it. This is an example of how we managed to create some basic metrics and define a baseline to check production performance. Even before you have an extensive monitoring dashboard, you can get better control over your platforms with basic monitoring metrics. Here are some common metrics you can create if you are still new to this area:

If you have IaaS, you can start with monitoring your infrastructure availability: resources such as CPU, memory, and hard disk. These areas are the most common troublemakers, so monitoring them lets you identify issues before they become a disaster.
If you have some web services running, you can start with monitoring your service availability by checking the endpoint. Additionally, you can monitor the HTTP errors to understand what is happening with requests.
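
As a starting point for the two kinds of basic metrics above, here is a minimal Python sketch that uses only the standard library. It covers disk usage for the infrastructure side, and an endpoint probe plus an HTTP error summary for the service side. The function names are hypothetical, and treating 2xx and 3xx responses as "available" is an assumption you may want to adjust. CPU and memory are left out because the standard library has no portable call for them.

```python
import shutil
import urllib.error
import urllib.request
from collections import Counter


def disk_used_percent(path: str = "/") -> float:
    """Percentage of the disk holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET on `url`, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx or 5xx status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, and similar


def summarize(status_codes: list[int]) -> dict:
    """Availability and error counts for a series of probe results."""
    if not status_codes:
        raise ValueError("need at least one probe result")
    errors = Counter(
        "unreachable" if code == 0 else f"{code // 100}xx"
        for code in status_codes
        if code == 0 or code >= 400
    )
    successes = len(status_codes) - sum(errors.values())
    return {
        "availability_percent": 100.0 * successes / len(status_codes),
        "errors": dict(errors),
    }


if __name__ == "__main__":
    print(f"disk used: {disk_used_percent('.'):.1f}%")
    # Status codes as they might be collected by calling probe() on a schedule.
    print(summarize([200, 200, 500, 0]))
```

Running `probe` on a schedule and feeding the collected codes to `summarize` gives a simple availability baseline that can later be replaced by a full monitoring stack.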

The SRE approach is strongly against creating silos. SRE engineers work closely with developers, testers, and anyone else who has an impact on the software project. This collaboration creates a strong knowledge-sharing loop in which most team members can pick up different tasks and responsibilities. Moreover, this approach creates new SRE engineers from developers and testers who are interested in understanding application design, implementation, and operations.

Common Misconceptions About SRE

When any methodology is used incorrectly, it might not be as useful or effective as when it’s properly implemented. The same goes for SRE, a new version of DevOps. When an organization that is not mature enough in DevOps considers implementing SRE, wrong perceptions can lead to a lot of confusion. After years of doing DevOps and SRE activities, I learned that you need a good understanding of DevOps to become a good SRE. The reason is that DevOps is the predecessor of SRE, and to understand why we do things the SRE way, you need to know the history behind it.

Another common misunderstanding happens when companies see the SRE as just another expert in handling incidents and operations. Let’s refer to the Google definition of SRE. We learn that SRE is considered a team made up of different experts who can build, run, and maintain application services autonomously. SRE goes one step further than DevOps and takes all responsibilities. This way, you have full control of your SDLC and have one team that communicates, decides, and implements things very quickly. Having a good understanding of the context of SRE is key to making sure you can implement it properly.

Bottom Line

The SRE approach is the latest advancement of the DevOps way of working. It offers best practices to keep all services and applications running reliably. SRE works smoothly with CI/CD pipelines; you can constantly see where you are and what can be improved in your SDLC. This keeps you on track at all times, and it helps you avoid taking any success for granted. SRE engineers are the frontrunners in these efforts, and they bring this mindset to an organization. SRE engineers define their KPIs based on customer requirements and what makes the platforms reliable. These requirements can change every day, so SRE engineers help teams adapt to these changes while production reliability stays intact.