Analysis of Failure Modes in Producer-Consumer Systems

Producer-consumer systems have interesting failure modes that can result in system downtime. Putting protections while building these systems can prevent failures.

Ankit Kumar

Dec. 04, 23 · Analysis

Likes (1)

Comment

Save

2.1K Views

Producer-consumer patterns are used extensively in systems all around us. In producer-consumer pattern-based systems, producers write data that one or multiple consumers consume. This pattern allows producer systems to scale while distributing functionality among multiple consumer systems. Just like any other distributed system, the uptime of a producer-consumer system depends on how well it protects itself from the various failure modes and, when impacted, how quickly it can recover.

In this article, we analyze the various failure modes of producer-consumer systems, their impact, how we can detect the failures, mitigate them, and review the protections we can put in place to prevent the failure modes from happening.

To understand the failure modes in this article, let us take an example of a Compute Reservation System that is used by customers to request compute capacity. When compute capacity is requested, many sub-systems need to work together to grant the capacity, such as :

Request management system: This system is used to validate customer requests, queue them for processing by other systems, and track them for completion. This system is the producer in our example.
Capacity reservation system: This consumer system reserves physical/virtual compute capacity for a customer request.
Storage reservation system: This consumer system reserves the storage capacity for a customer request.
Network allocation system: This consumer system allocates network addresses for the compute capacity requested by the customer.

Failure Modes

Let us look at the various failure modes:

Producer Is Down

If the producer system is down, no new requests will be queued up. The impact in our example system would be that customers will not be able to submit new requests for reserving compute capacity.

This failure mode can be detected in the following ways:

Emitting metrics on the number of requests being processed by the producer system and alarming if this number remains at 0 for multiple data points.
Implementing an external health monitoring system that can continuously submit special compute reservation requests called as "health checks" and poll their status for completion. If the Compute Reservation System is down, then the health checks will fail.

Mitigation of this failure mode requires mitigating the underlying cause of what resulted in the producer coming down in the first place. Some likely causes are:

A bad code deployment: The recovery would be rollback to the previous good code version.
Hardware faults: The recovery would be swapping with good hardware.
Disk full scenarios: The recovery would be disk cleanup and then fixing the underlying cause of whatever resulted in the disk being full in the first place (ex, log rotation not working).

The protections against this failure mode depend on what scenarios can cause the producer to go down. A very likely cause of this is bad code deployments. To protect against this, one strategy is to perform prod-preview code deployments. The prod-preview stage can be a dedicated stage in the producer's deployment infrastructure where new code is baked in for a couple of days before full production rollout. This stage can accept a small percentage of production traffic, thereby ensuring the stability of code before it is rolled out to the production fleet. Other protections are cell-based architectures, sharding, etc., which are outside the scope of this analysis.

Queue Is Full

The queue where the producer writes the requests can get full. When this happens, it results in requests failing at the producer layer. The impact would be an availability drop on the producer. There are a number of reasons why the queue gets full:

Connection issues between the queue and consumers result in messages not being picked up from the queue.
Consumers fail to process messages from the queue, thereby not acknowledging them, leading to the queue being full.
A rapid burst in customer requests at the producer layer results in the queue being overwhelmed.

This failure mode can be detected by emitting metrics such as the number of unprocessed requests in the queue, queue drain rate, etc., and alarming them.

Mitigation of this failure mode requires doing the following:

Diagnose and fix the underlying cause of why messages are not being drained from the queue. This would include actions such as fixing the connection issues between the queue and consumers or fixing the consumer so that it can process messages.
Slow down the rate of incoming calls to the producer. This can be done by implementing a rate-limiting algorithm (such as the token bucket algorithm) and then using it to reduce the rate at which requests are accepted by the system. This would give the consumers breathing room to process all queued-up messages and recover.

To protect against this failure mode, the producer can use a back-pressure mechanism to slow down calls whenever it sees the queue getting full. One way to implement the back-pressure mechanism is by having a worker that continuously polls the queue for unprocessed requests. This worker then toggles the apply-back-pressure mode on/off based on the number of unprocessed requests in the queue. Whenever a customer request is received at the producer system, the apply-back-pressure mode is checked — if it's on, the customer request is rejected; if it's off, the customer request is accepted.

One or More Consumers Are Down

This failure mode impacts the overall availability of the system by bringing it down. In the context of our example, let us say the Network Allocation System is down. The customers will be able to submit new requests for computing, but their requests will not be completed as this critical consumer system is down, thereby bringing the availability of the overall Compute Reservation System down.

The detection mechanisms of this failure mode are similar to that of the "producer is down" failure mode. We can emit availability, the number of requests being processed, and other health metrics from all consumers and alarm them when they breach critical thresholds.

The mitigation plan for this failure mode depends on what resulted in the consumer being down in the first place. We have already talked about bad code deployments in the "producer is down" failure mode, and the same mitigations apply here. Let us talk about how poison pills can take down consumers. If a consumer fails to process the data in the queue, the consumer is said to be poison-pilled. A poison-pilled consumer can recover via the following ways:

Operator intervention: An operator can execute tooling to ignore the request from the queue that the consumer poison pilled on, thereby triggering its recovery. While this may work, the consumer can take a poison pill on the next request from the queue, making this a tedious process. The real exit from these scenarios is to deploy new code to consumers to correctly process the requests that cause poisoning, but this can prolong mitigation time.
Automatically moving requests that cause poisoning to a dead letter queue: Consumers can move all requests that they cannot process to a dead letter queue. This can help consumers recover faster. The dead letter queue needs to be drained asynchronously after updating the consumer's code to handle the requests that were causing the poisoning.

The protections against this failure mode depend on the underlying reason why consumers went down in the first place. We talked about protections against bad code deployment earlier, so let us talk about the protections against poison pills here. To protect against poison pills, producers and consumers can rely on pre-commit checks. Pre-commit checks are used by producers to ensure a request will be compliant for all consumers by running the consumer parsing code before submitting the requests to the queue.

Conclusion

The above analysis of failure modes tells us that while designing a producer-consumer pattern-based system, the following protections when put in from the ground up, help increase the system's availability:

Implement rate-limiting and back-pressure support at the producer layer.
Implement external health checks for the overall system that validate the critical code paths in all sub-systems, including producers, consumers, and the shared queue.
Emit and alarm on critical health metrics for all systems.
Perform prod-preview (or canary deployments) to catch the impact of bad code deployments early on.
Perform pre-commit checks in producers to minimize chances of consumers getting poison pilled.
Build and test poison pill mitigation tooling to recover from operational events.

Producer-consumer systems have various failure modes that can result in degraded availability or complete system downtime. These systems need to be designed from the ground up with these failure modes in mind. Detective methods and mitigation plans like the ones mentioned above help provide the best uptime and, hence, an improved customer experience of these systems.

consumer Data (computing) producer systems

Opinions expressed by DZone contributors are their own.

Related

Trending