Adopt Site Reliability Engineering to Win
Software companies can accelerate their business and deliver reliable products by adopting site reliability engineering principles.
The coronavirus pandemic accelerated the offering of online services even in the most traditionally "offline" sectors: fitness, banking, retail, and government.
As more companies moved their offerings online, the reliability of these critical services garnered particular attention from the public. In early July, Rogers, a Canadian telecom behemoth, experienced an outage that lasted 19 hours, disrupting internet and telecom services for over 10 million Canadians. For the duration of the incident, critical services, including 911 and hospitals, were disrupted.
Per a Forbes investigation into a 2013 outage, downtime cost Amazon roughly $66,000 per minute; given the sales numbers Amazon posted in FY22, that figure would be several times higher today.
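To put the per-minute figure in perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `downtime_cost` and the outage lengths are made up for the example, and the only number taken from the text is the roughly $66,000 per minute reported for 2013. It assumes cost scales linearly with outage length, which is a simplification.

```python
def downtime_cost(cost_per_minute: float, minutes: float) -> float:
    """Estimate the revenue lost during an outage.

    Assumes a constant loss rate for the whole outage, which is a
    simplification; real losses vary with time of day and season.
    """
    if cost_per_minute < 0 or minutes < 0:
        raise ValueError("cost_per_minute and minutes must be non-negative")
    return cost_per_minute * minutes


# Figure quoted in the article for the 2013 Amazon outage.
COST_PER_MINUTE_2013 = 66_000

# Hypothetical outage lengths, chosen only for illustration.
for label, minutes in [("1 minute", 1), ("1 hour", 60), ("1 day", 24 * 60)]:
    cost = downtime_cost(COST_PER_MINUTE_2013, minutes)
    print(f"{label:>8}: ${cost:,.0f}")
```

At that rate, a single hour of downtime already approaches $4 million, which is why even short outages draw executive attention.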
Service downtime or outages have become far too familiar across industries, causing financial losses and threatening the credibility of brands. It is not too far-fetched to say that companies can use their reliability status as a competitive strategy to stand apart in the ultra-competitive landscape. Historically neglected as a cost center, reliability engineering is slowly establishing itself as an integral part of engineering organizations with the growing popularity of SRE/DevOps.
The site reliability engineering (SRE) discipline leverages software engineering to manage IT operations effectively. Software engineering principles are applied to solve the complex operational challenges faced by today's ever-critical IT teams. The SRE concept was popularized by Google, which recruited software engineers to solve the complex operational challenges it faced during its hyper-growth trajectory.
In the pre-DevOps world, companies hired specialist IT teams to manage the operational workload of the company. The primary purpose of IT teams was to keep things running, but scaling these teams to support growing organizational needs was extremely difficult due to the lack of automation and software tools. These scaling issues resulted in slow velocity and mounting tech debt, mainly due to the division of responsibilities between software and operations teams. In contrast, SRE takes a holistic approach to operations by adopting software tools and automation to build robust CI/CD pipelines, observability platforms, incident management frameworks, disaster recovery processes, and resiliency protocols. Adopting SRE principles is essential to managing operational challenges effectively and to scaling the team quickly enough to meet the organization's operational needs.
It is never too early to start investing in reliability efforts as an organization, but adopting reliability engineering practices is not straightforward due to the lack of a standardized playbook. To successfully adopt SRE principles, organizations can start by assessing their needs and setting clear, actionable goals that align with the broader organizational goals. If the company does not have an in-house expert, an external counsel or advisor can help the leadership set clear goals around reliability, availability, and developer productivity. Setting unrealistic goals can be prohibitively expensive and might result in failed endeavors. After setting goals, organizations can go a long way toward hitting them with their current engineering resources by focusing their efforts on observability, automated deployments, and incident management. As the organization scales, it can start hiring specialist site reliability engineers to meet its reliability goals.
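One way to make an availability goal concrete, and to see why an unrealistic one gets expensive, is to translate the target percentage into the downtime it permits. The sketch below is a minimal illustration, not a prescribed method: the function name `allowed_downtime_minutes`, the choice of a 365-day year, and the list of example targets are all assumptions made for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(
    availability_percent: float,
    period_minutes: float = MINUTES_PER_YEAR,
) -> float:
    """Return the downtime permitted by an availability target.

    availability_percent is a percentage such as 99.9, not a fraction.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (1 - availability_percent / 100)


# Hypothetical targets an organization might weigh against each other.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows {minutes:,.1f} minutes of downtime per year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten: 99.9% leaves about 525.6 minutes (under nine hours) per year, while 99.99% leaves about 52.6 minutes. A goal that looks only slightly more ambitious on paper can therefore demand a very different level of investment.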
As reliability engineering efforts take hold in a company, it is not unusual to see two factions: one favors the reliability of the product, and the other favors faster feature development. On close observation, reliability and velocity pull against each other: extreme reliability requires careful execution, which reduces feature development velocity, and if higher velocity is favored, the reliability of the product suffers. To prevent a fierce tussle between the reliability and product teams, it is critical that leadership instill a "site up" culture. When the whole company buys into the "site up" mindset, the teams' decisions favor building a reliable product faster without sacrificing either reliability or velocity.
Major internet companies have shown that adopting SRE principles can have an outsized impact on the business by improving developer productivity and the reliability of online services. To conclude, software companies can adopt SRE principles to position themselves to win in the present hyper-competitive landscape.
Opinions expressed by DZone contributors are their own.