Congestion Control in Cloud Scale Distributed Systems
Distributed systems must handle traffic surges and provide predictable performance for the best customer experience.
Distributed systems are composed of multiple systems wired together to provide a specific piece of functionality. Systems that operate at cloud scale can receive expected or unexpected surges of traffic from one or more callers and are expected to perform in a predictable manner.
This article analyzes the effects of traffic surges on a distributed system. It lays out a detailed breakdown of how every layer is affected and provides mechanisms to achieve predictable performance during traffic surges.
Anatomy of a Distributed System
To understand how traffic surges affect different layers of a distributed system, let's look at the composition of a typical distributed system:
- Client: Client software sends a request to an endpoint. The endpoint is resolved via DNS to point to a Virtual IP address (VIP).
- VIP (Virtual IP address) and Load Balancer: A VIP address, in turn, points to one or more load balancers. A load balancer is a networking device that proxies traffic to a set of service hosts.
- Service Fleet: A service fleet comprises multiple service hosts. A service host is a physical box where the service application logic runs.
- Dependent Service(s): A set of dependent services might do some work for a customer request. For example, a service that needs some stored state to answer a customer request may look that state up from a data storage system/service.
Traffic Surges, Congestion, and Predictable Performance
Traffic surges lead to congestion in one or more layers of the distributed system. Let's analyze how congestion affects each layer and how we can provide predictable performance, working through the call stack bottom up:
Dependent Service(s)
Congestion in dependent services results in request timeouts or increased latency at the service host layer. This, in turn, means threads are held longer in the service host per customer request. In extreme cases, this can lead to thread exhaustion on the service host, which in turn leads to request queuing on the load balancer. The queuing on the load balancer can lead to spillovers, which means that the load balancer cannot route a new request to a viable host, resulting in a request timeout for the client.
To provide predictable performance in case a dependent service is suffering from congestion, we can take the following approaches:
- Rate limiting: We can rate limit traffic on the service host so that the service host receives less traffic than the dependent service can process.
- Circuit breaker: Circuit breakers track the health of a dependent service and reject any new requests that require the dependent service to be available until the dependent service has recovered.
- Thread pool isolation: Another approach is thread pool isolation for requests that rely on the dependent service being available. Thread pool isolation allocates a fixed share of the service host's capacity to dependent-service work. This allows the service to continue serving other requests in a predictable manner while the dependent service works through the congestion. A sketch combining circuit breaking with thread pool isolation follows this list.
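To make the last two approaches concrete, here is a minimal Java sketch (illustrative only, not a specific library's API) that combines them: a dedicated thread pool caps how much host capacity the dependency can tie up, and a simple consecutive-failure circuit breaker rejects calls while the dependency appears unhealthy. The class name, thresholds, and pool size are assumptions made for the example.

```java
import java.util.concurrent.*;

// Minimal sketch: bulkhead-style isolated thread pool plus a failure-count
// circuit breaker around calls to a dependent service. Not production code.
public class DependencyGuard {
    // Dedicated pool: only this many threads can ever be tied up by the dependency.
    private final ExecutorService dependencyPool = Executors.newFixedThreadPool(10);

    private static final int FAILURE_THRESHOLD = 5;
    private static final long OPEN_MILLIS = 30_000; // how long the breaker stays open

    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public synchronized boolean isOpen() {
        if (consecutiveFailures < FAILURE_THRESHOLD) return false;
        // After the cool-off period, allow a single probe request (half-open).
        if (System.currentTimeMillis() - openedAt > OPEN_MILLIS) {
            consecutiveFailures = FAILURE_THRESHOLD - 1;
            return false;
        }
        return true;
    }

    private synchronized void recordSuccess() { consecutiveFailures = 0; }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures == FAILURE_THRESHOLD) openedAt = System.currentTimeMillis();
    }

    // Wraps a call to the dependent service with both protections.
    public <T> T call(Callable<T> dependencyCall, long timeoutMillis) throws Exception {
        if (isOpen()) {
            throw new RejectedExecutionException("Circuit open: dependency presumed unhealthy");
        }
        Future<T> future = dependencyPool.submit(dependencyCall);
        try {
            T result = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            recordSuccess();
            return result;
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true);
            recordFailure();
            throw e;
        }
    }
}
```

A request handler would route every dependent-service call through call(...), so threads serving unrelated requests are never blocked on a congested dependency.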
A dependent service can also act on requests asynchronously by tracking them in a queue and working offline on the queued requests. This is typically referred to as a producer-consumer pattern. We talk about the congestion characteristics of this pattern in a separate article.
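As a rough illustration of the producer-consumer pattern, the sketch below (with an assumed bounded queue size and a placeholder request type) accepts requests by enqueueing them and lets a background worker drain the queue at its own pace; the congestion behavior of this pattern is left to the separate article.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal producer-consumer sketch: requests are accepted quickly by queueing
// them, and a background worker processes the queued requests offline.
public class AsyncRequestProcessor {
    private final BlockingQueue<String> pending = new ArrayBlockingQueue<>(1_000);

    // Producer side: returns false (caller should reject or shed) when the queue is full.
    public boolean accept(String request) {
        return pending.offer(request);
    }

    // Consumer side: runs on its own thread and drains the queue.
    public void startWorker() {
        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    String request = pending.take(); // blocks until work is available
                    process(request);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    private void process(String request) {
        // Placeholder for the actual dependent-service work.
    }
}
```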
Service Fleet
Congestion in the service fleet caused by a traffic surge can lead to request timeouts for the customer. Congestion happens because the service fleet has hit a certain limit. This limit can be a software or a hardware limit. Some examples are as follows:
- Thread pool size limit that governs how many request threads can be worked upon at a given time on a service host.
- Physical hardware limits such as disk space, network bandwidth, or CPU usage limits.
Let's look at the ways we can configure the service fleet to provide predictable performance in the face of a traffic surge:
- Understand service host limits: We should test the service fleet and understand the various scaling cliffs. In most systems, the hardware limits are difficult to reach, and we typically run into a software limit before hitting a hardware one. By understanding the maximum request volume at which our service host(s) perform predictably, we can set limits that ensure a host does not process more requests than that.
- Rate limit traffic: We should rate limit traffic on service host(s) so that any additional traffic outside those bounds is rejected. This allows a service host to provide predictable performance. A good way to implement rate limiting is via a token bucket algorithm; a minimal sketch follows this list.
- Scale on demand: An external capacity monitoring system can track how much traffic is being served by the service fleet and add hosts when traffic increases beyond a certain threshold. While the scaling kicks in, some traffic will be rejected, but this is a better alternative than taking in more traffic than the hosts can handle, which can lead to a collapse of the entire service.
- Prevent performance regressions: Performance regressions are possible whenever new code is deployed across a service fleet or hardware is swapped. A good way to prevent this is to bake performance regression testing into the release workflow for the software to be deployed on the service fleet, and also for any planned hardware changes.
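Below is a minimal token bucket sketch in Java for per-host rate limiting, as referenced in the rate-limiting item above. The capacity and refill rate are illustrative placeholders; in practice they should be derived from the load testing described in the first item.

```java
// Minimal token bucket sketch for per-host rate limiting.
public class TokenBucket {
    private final long capacity;        // maximum burst size
    private final double refillPerNano; // tokens added per nanosecond
    private double tokens;
    private long lastRefill;

    public TokenBucket(long capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefill = System.nanoTime();
    }

    // Returns true if the request may proceed, false if it should be rejected
    // so the host stays within its tested limits.
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false;
    }
}
```

A request filter would call tryAcquire() before doing any work and reject the request (for example, with an HTTP 429 response) when it returns false.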
VIP and Load Balancer
VIPs are just DNS mappings. DNS congestion is out of scope for this article, as it is a separate topic; DNS systems rely heavily on caching and rarely get congested.
Load balancers direct requests to a service host. Congestion in load balancers can lead to request timeouts for customer requests. Like any other physical computing device, the load balancer has its limits as well. One such limit that most systems run into is the network throughput limit on the load balancers. A simple way to avoid running into this limit is to track how close we are to that cliff and, when needed, add more load balancers and provision their IP addresses in the VIP configuration.
Client
The client plays a pivotal role in how congestion happens and in how quickly the system can recover. When a system is congested, it rate limits traffic to aid recovery. Clients retry requests when they receive timeouts or are rate limited. Retries lead to more traffic for the service, feeding the cycle of traffic surge -> timeouts/rate limiting -> retries -> traffic surge. An effective way to break this cycle is to add exponential backoff with jitter to client retries, as sketched below.
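Here is a minimal client-side sketch of capped exponential backoff with full jitter. The attempt count, base delay, and cap are illustrative assumptions, and a real client would retry only on retryable failures such as timeouts or throttling responses.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Minimal client-side retry sketch with capped exponential backoff and full jitter.
public class RetryingClient {
    private static final int MAX_ATTEMPTS = 4;
    private static final long BASE_DELAY_MILLIS = 100;
    private static final long MAX_DELAY_MILLIS = 5_000;

    public static <T> T callWithRetry(Callable<T> request) throws Exception {
        Exception lastFailure = null;
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            try {
                return request.call();
            } catch (Exception e) { // in practice, retry only on timeouts/throttles
                lastFailure = e;
                long cap = Math.min(MAX_DELAY_MILLIS, BASE_DELAY_MILLIS << attempt);
                // Full jitter: sleep a random duration in [0, cap) so retries
                // from many clients do not arrive in synchronized waves.
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap));
            }
        }
        throw lastFailure;
    }
}
```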
Conclusion
Traffic surges can catch any cloud-scale distributed system off guard. Understanding the various system limits, tuning each layer for predictable performance, rate limiting, and scaling on demand can help us deliver the best possible customer experience.