Five Steps To Building a Tier 1 Service That Is Resilient to Outages
If Tier 1 services fail, it can mean disaster for a business. To ensure resiliency against outages, follow a proven five-step process.
Tier 1 services are critical to a company’s profitability. They address the primary use cases for the organization’s products or support its underlying vital infrastructures, such as a product search service for an e-commerce site, a content posting service for a social media app, or a payment processing framework. Building a robust Tier 1 service, either a front-facing platform that powers customers' experience or the backend services behind that platform, is crucial to driving a positive customer journey.
Tier 1 services should be resilient to outages because availability issues lead to mistrust and dissatisfied customers. A bad encounter with a website or an app that locks up or responds with painfully high latency, especially when there are other options in the same business sector, can spell disaster for the company behind the service. That is why it is important to build these services right the first time.
Building a successful Tier 1 service starts during the design phase and involves planning and forecasting the user traffic load. It requires a mechanism to test resiliency through load, stress, and chaos tests. This includes envisioning a worst-case scenario to ensure the service can handle the predicted volume of traffic and can either scale up easily or shed load when traffic is unexpectedly high. Avoiding outages, especially extended ones, is critical, as is making any needed course corrections without delay. Gen Z customers, a significant segment of buyers in the current marketplace, and others do not have the patience to give a company’s online service or app a second chance to do its job.
How to Build a Successful Tier 1 Service
While there is no secret sauce for building a Tier 1 service, there are five steps companies can follow. They include:
1. Design the Service to Scale Horizontally
Scaling a service means increasing the resources available to it (such as CPU, memory, disk, and network I/O) so that it can handle additional customer traffic. With vertical scaling, companies boost the capacity of a single box. This is often costly, and there is a limit to how far the resources of one machine can be expanded. With horizontal scaling, developers add new hosts to meet the scaling needs. The design phase should also address latency that makes the end-user experience too slow. For example, it is better to rely on asynchronous processing for faster response times, where content upload and data synchronization across multiple data centers take place in the background without impacting the customer experience.
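The asynchronous approach described above can be sketched in a few lines of Python. This is only an illustration, not a production design: the data center names, the `handle_upload` function, and the in-memory queue are all assumptions made for the example, and a real service would more likely use a durable message queue.

```python
import asyncio

# Hypothetical data center names, used only for illustration.
REPLICA_DATA_CENTERS = ["dc-east", "dc-west"]


async def replicate(content_id: str, data_center: str, replicated: list) -> None:
    """Stand-in for a slow cross-data-center copy."""
    await asyncio.sleep(0.01)  # simulates network latency
    replicated.append((content_id, data_center))


async def replication_worker(queue: asyncio.Queue, replicated: list) -> None:
    """Drains the queue in the background, off the request path."""
    while True:
        content_id = await queue.get()
        await asyncio.gather(
            *(replicate(content_id, dc, replicated) for dc in REPLICA_DATA_CENTERS)
        )
        queue.task_done()


async def handle_upload(content_id: str, queue: asyncio.Queue) -> str:
    """Accepts the upload and returns without waiting for replication."""
    queue.put_nowait(content_id)
    return f"accepted:{content_id}"


async def demo() -> tuple:
    queue: asyncio.Queue = asyncio.Queue()
    replicated: list = []
    worker = asyncio.create_task(replication_worker(queue, replicated))
    response = await handle_upload("post-1", queue)
    done_at_response = len(replicated)  # replication has not happened yet
    await queue.join()  # wait for the background work (demo only)
    worker.cancel()
    return response, done_at_response, sorted(replicated)
```

Calling `asyncio.run(demo())` shows the point of the design: the customer gets the response before any replication has completed, and the copies to the other data centers finish afterward.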
2. Design the Physical Network With Redundancy
Hardware and network failures are common in distributed services. Physical failures range from the breakdown of a cooling fan in a single host, to a network malfunction that brings down multiple servers, to power outages and even earthquakes that affect an entire data center. To mitigate these failures, design the physical host capacity and network with redundancy. For example, spread the hosts across multiple data centers or availability zones. Use load balancers to distribute traffic among multiple servers that are in turn spread across multiple data centers. In this way, developers can eliminate single points of failure.
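To make the redundancy idea concrete, here is a minimal sketch of a round-robin load balancer that skips unhealthy hosts. The host names, the two data centers, and the `fail_data_center` helper are hypothetical; real load balancers rely on active health checks rather than a flag set by hand.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Host:
    name: str
    data_center: str
    healthy: bool = True


class RoundRobinBalancer:
    """Cycles through hosts, skipping any marked unhealthy."""

    def __init__(self, hosts: list):
        self._hosts = hosts
        self._ring = cycle(hosts)

    def pick(self) -> Host:
        # Make at most one full pass over the ring before giving up.
        for _ in range(len(self._hosts)):
            host = next(self._ring)
            if host.healthy:
                return host
        raise RuntimeError("no healthy hosts available")


def fail_data_center(hosts: list, data_center: str) -> None:
    """Simulates losing a whole data center (for example, a power outage)."""
    for host in hosts:
        if host.data_center == data_center:
            host.healthy = False


# Hypothetical fleet: hosts interleaved across two data centers.
fleet = [
    Host("a1", "dc-a"), Host("b1", "dc-b"),
    Host("a2", "dc-a"), Host("b2", "dc-b"),
]
balancer = RoundRobinBalancer(fleet)
before = [balancer.pick().name for _ in range(4)]
fail_data_center(fleet, "dc-a")
after = [balancer.pick().name for _ in range(4)]
```

Before the simulated outage, traffic rotates over all four hosts; afterward, it continues to flow to the hosts in `dc-b` only. The service stays available, although the surviving data center must have enough capacity to absorb the extra load.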
3. Test the Service for Resiliency Continuously
Just as service functionality is tested through unit and functional tests, the resiliency of a service can be tested through load, stress, and chaos tests. Load testing verifies the performance of the service against the expected throughput under realistic load conditions. Here, the test traffic is capped at the breaking point of the service, the level beyond which the system cannot handle more traffic. Stress testing verifies the robustness of the system under extremely heavy load that goes beyond the breaking point. This gives developers data points on whether the service can recover quickly and safely after being hit with more traffic than it has capacity to handle. Chaos testing verifies service integrity by deliberately injecting failures into the environment to find weaknesses before they lead to unplanned downtime or a negative user experience. These tests simulate issues such as increased latency, network packet loss, and dependency failures to verify that the system can still function appropriately under these failures and recover safely. Running these resiliency tests continuously alongside the functional tests helps ensure that the service stays resilient and robust.
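As a small illustration of the chaos-testing idea, the sketch below wraps a dependency in a proxy that injects failures at a configurable rate and then checks that the calling code degrades gracefully instead of crashing. Everything here is an assumption for the example: `ChaosProxy`, `lookup_price`, and `price_with_fallback` are hypothetical names, and dedicated chaos tooling injects faults at the network or infrastructure level rather than in-process.

```python
import random


class DependencyError(Exception):
    """Raised when the simulated dependency fails."""


class ChaosProxy:
    """Wraps a dependency call and injects failures at a configurable rate."""

    def __init__(self, call, failure_rate: float, seed: int = 0):
        self._call = call
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so runs are repeatable
        self.injected_failures = 0

    def __call__(self, *args):
        if self._rng.random() < self._failure_rate:
            self.injected_failures += 1
            raise DependencyError("injected failure")
        return self._call(*args)


def lookup_price(item: str) -> int:
    """Hypothetical downstream dependency."""
    return {"book": 12, "pen": 2}[item]


def price_with_fallback(dependency, item: str, retries: int = 2):
    """Service code under test: retry, then degrade instead of crashing."""
    for _ in range(retries + 1):
        try:
            return dependency(item)
        except DependencyError:
            continue
    return None  # degraded response: price unavailable


chaotic_lookup = ChaosProxy(lookup_price, failure_rate=0.5, seed=42)
results = [price_with_fallback(chaotic_lookup, "book") for _ in range(200)]
```

Every entry in `results` is either the real price or the degraded `None` response, so the caller never sees an unhandled exception even though half of the dependency calls are made to fail.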
4. Deploy the Changes to the System in a Staggered Fashion
After the changes pass the resiliency testing criteria, it is important to adhere to specific guidelines for deployment. Deploying a change everywhere at once risks downtime for the whole service if the change has defects. Instead, release the change to a single host (or a small set of hosts) in production first and validate it against customer traffic. Watch those initial hosts for any signs of defects and, once the change is verified, deploy it to the rest of the fleet. During the broad deployment, stagger the rollout so that no more than a small percentage of hosts are updated at once. Extend the same principle to worldwide deployments: deploy to one region, such as North America or Europe, at a time. Releasing changes incrementally to the smallest number of customers or hosts first limits the “blast radius” if something goes awry. Developers can also gate the changes behind feature toggles that can be enabled or disabled to ramp up the new release slowly.
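The staggered rollout and the feature toggle described above could look roughly like the following sketch. The function names, the 25 percent cap per wave, and the hash-based bucketing are illustrative assumptions rather than a prescribed method; real deployment systems also add bake time between waves and automatic rollback.

```python
import hashlib
import math


def plan_waves(hosts: list, canary_count: int = 1, max_fraction: float = 0.25) -> list:
    """Splits hosts into a canary wave followed by capped-size waves."""
    if not hosts:
        return []
    batch_size = max(1, math.floor(len(hosts) * max_fraction))
    waves = [hosts[:canary_count]]
    rest = hosts[canary_count:]
    for start in range(0, len(rest), batch_size):
        waves.append(rest[start:start + batch_size])
    return waves


def rollout(hosts: list, deploy, is_healthy) -> list:
    """Deploys wave by wave and stops after the first unhealthy wave."""
    updated = []
    for wave in plan_waves(hosts):
        for host in wave:
            deploy(host)
            updated.append(host)
        if not all(is_healthy(host) for host in wave):
            break  # halt so the remaining fleet keeps the old version
    return updated


def is_enabled(feature: str, user_id: str, percent: int) -> bool:
    """Feature toggle: enables the feature for a stable share of users."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # stable bucket 0..99
    return bucket < percent


# Hypothetical nine-host fleet.
fleet = [f"host-{i}" for i in range(9)]
wave_sizes = [len(wave) for wave in plan_waves(fleet)]
```

With nine hosts, the plan is one canary host followed by four waves of two hosts each. If the canary fails its health check, `rollout` stops after updating only that host. Because each user always hashes to the same bucket, raising `percent` in `is_enabled` only ever adds users to the new release.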
5. Implement Operational Safeguards
The last but most critical step to achieving and maintaining excellence in a Tier 1 service is implementing operational best practices. Over the lifecycle of a service or product, far more time is spent running and maintaining the service (operations) than designing it. To start, create a core set of metrics related to functionality and performance, and build meaningful dashboards to monitor them. Add automated alarms on these metrics to detect issues within the service as fast as possible. For example, any performance degradation should alert the service owners in under 10 minutes. When an alarm goes off, there should be well-documented standard operating procedures (SOPs) for the responder to follow so that the incident can be mitigated quickly and its root cause identified. Once the incident is mitigated, the team should investigate and come up with actions to prevent it from happening again. Another best practice is to hold a weekly operations review with the entire team. This review covers the alarms that fired over the last week, the metrics in the dashboards, and any other key metrics related to the operations process. Using this review as a retrospective mechanism to improve the overall health of operations pays dividends in the long run by promoting operational excellence within the team.
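An automated alarm of the kind described above can be sketched as a threshold check over consecutive samples. The class name, the 500 ms threshold, and the sample values are made up for illustration; in practice, a monitoring system would evaluate such rules rather than application code.

```python
from collections import deque


class ThresholdAlarm:
    """Fires when the last `datapoints_to_alarm` samples all breach the threshold."""

    def __init__(self, threshold: float, datapoints_to_alarm: int):
        self.threshold = threshold
        self._recent = deque(maxlen=datapoints_to_alarm)

    def record(self, value: float) -> bool:
        """Adds one sample and returns True while the alarm is firing."""
        self._recent.append(value > self.threshold)
        return len(self._recent) == self._recent.maxlen and all(self._recent)


# Hypothetical p99 latency samples in milliseconds, one per minute.
latency_alarm = ThresholdAlarm(threshold=500.0, datapoints_to_alarm=3)
samples = [120, 640, 130, 700, 710, 720, 150]
states = [latency_alarm.record(sample) for sample in samples]
```

Requiring three consecutive breaching samples keeps a single spike (the 640 ms reading) from paging anyone, while a sustained degradation fires on the third bad minute, comfortably inside a 10-minute detection target.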
Resiliency Culture
It is sad to say, but many businesses learn how to operate a Tier 1 service the hard way. There is no single solution for avoiding outages, but these five steps safeguard against most of the common causes and make a service more resilient to unexpected ones. It is also important to instill a corporate culture that prioritizes getting resiliency right not only before the launch but also with every single change to the service thereafter. Such a culture persists even through employee turnover. There are no shortcuts, and one system crash due to high customer traffic is one too many. Companies that define the right steps to take when there is an issue and test rigorously before their product hits the online marketplace are more likely to develop a loyal audience for that site or app and turn casual users into repeat customers.