Techniques for Chaos Testing Your Redis Cluster
This article explores a few techniques to create chaos testing scenarios on a Redis cluster and uncover potential weaknesses in a controlled way.
For large-scale, distributed systems, chaos testing becomes an essential tool. It helps uncover potential failure points and strengthen overall system resilience. This article delves into practical and straightforward methods for injecting chaos into your Redis cluster, enabling you to proactively identify and address weaknesses before they cause real-world disruptions.
Set Up
- Create a Redis cluster. You can follow this article to set up a Redis cluster locally before taking it to production.
- Then generate load on your Redis cluster. You can use memtier_benchmark or any other load-generation framework.
- Inject the following chaos scenarios into your Redis cluster to test its performance and recovery. If the results do not meet your expectations, apply fixes and repeat the tests to ensure the solutions work, ultimately enhancing the reliability of your cluster.
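Before injecting any chaos, it helps to confirm that the cluster is healthy so that later failures can be attributed to the test itself. The following is a minimal sketch, assuming bash and a local `redis-cli`; the helper names `cluster_info_field` and `cluster_is_healthy` are hypothetical and not part of Redis.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `CLUSTER INFO` output on stdin and prints the
# value of one field (for example cluster_state or cluster_slots_ok).
cluster_info_field() {
  # CLUSTER INFO lines end in CRLF, so strip the carriage returns first.
  tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2 }'
}

# Hypothetical helper: succeeds only if the node listening on the given port
# reports cluster_state:ok.
cluster_is_healthy() {
  local port="$1"
  [ "$(redis-cli -p "$port" cluster info | cluster_info_field cluster_state)" = "ok" ]
}
```

For example, `cluster_is_healthy 30001 || echo "cluster is not healthy, skipping chaos test"` could gate each scenario.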
Let's explore a few techniques below to create chaos test scenarios.
Promote Replica to Primary (Failover)
CLUSTER FAILOVER
Run this command on a replica to promote it to primary; the original primary then becomes a replica of the promoted node.
Here’s What Happens Under the Hood
Once the command is invoked, the replica asks its primary to stop processing client requests. The primary pauses its clients and reports its current replication offset, and the replica waits until it has processed the replication stream up to that offset so that its data matches the primary's state. After this synchronization, the replica obtains a new configuration epoch and broadcasts the updated configuration, then begins serving as the new primary, while the original primary transitions to a replica role.
In the above screenshot, we can observe that the Redis node with ID 2b570b9c76127bdf38955ea7181ff8f8bbe62cdf (port 30001) is a replica of the node with ID aa24dc9d601a2ae348e4902ed8b38a08f915f21c.
After invoking the command, we can see in the screenshot below that this node (2b570b9c76127bdf38955ea7181ff8f8bbe62cdf, port 30001) has become the primary and the original primary (node ID aa24dc9d601a2ae348e4902ed8b38a08f915f21c) has become the replica.
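The role swap can also be checked from the command line by reading the flags column of `CLUSTER NODES`. The following is a rough sketch, assuming bash and a local `redis-cli`; `node_role` and `failover_and_wait` are hypothetical helper names, and the 10-second polling window is an arbitrary choice.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `CLUSTER NODES` output on stdin and prints
# "master" or "slave" for the given node ID, based on the flags field.
node_role() {
  awk -v id="$1" '$1 == id {
    n = split($3, flags, ",")
    for (i = 1; i <= n; i++)
      if (flags[i] == "master" || flags[i] == "slave") print flags[i]
  }'
}

# Hypothetical wrapper: trigger a manual failover on the replica listening on
# the given port, then poll once per second (up to 10 times) until that node
# reports itself as a master.
failover_and_wait() {
  local port="$1" id i
  id="$(redis-cli -p "$port" cluster myid)"
  redis-cli -p "$port" cluster failover
  for ((i = 0; i < 10; i++)); do
    if [ "$(redis-cli -p "$port" cluster nodes | node_role "$id")" = "master" ]; then
      return 0
    fi
    sleep 1
  done
  return 1
}
```

Calling `failover_and_wait 30001` on the replica from the example above should return success once the promotion is visible in that node's own view of the cluster.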
In normal circumstances, clients connected to the cluster should not experience any issues, as a replica's state is typically very close to that of its primary. However, if you inject a failover scenario and observe issues like latency spikes or decreased throughput, it's crucial to investigate the root cause. This could indicate potential bottlenecks in your cluster that require further optimization.
Remove a Replica
In this scenario, we remove a replica node so that it is not available for any operation. Removal can be of two types: soft removal and hard removal.
Soft (Temporary) Replica Removal
In this case, we just stop the replica node, so it becomes unavailable but remains part of the cluster topology.
We can use the following command to stop:
redis-cli -p <port> shutdown
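As we can see from the above screenshot, the replica node is now in a "fail" state, which indicates that this node is not available although it is still a part of the cluster topology.
To bring it back, we can run the following command, adding the same cluster options (such as cluster mode and the node's original cluster config file) that the node was originally started with.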
redis-server --port <port>
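Without the screenshot, the same "fail" state can be read from `CLUSTER NODES`, where it appears in the flags column. Here is a small sketch, assuming bash; `failed_nodes` is a hypothetical helper that deliberately ignores the `fail?` flag, which only marks a suspected failure.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `CLUSTER NODES` output on stdin and prints
# "<node_id> <address>" for every node whose flags include "fail".
# Nodes flagged only "fail?" (suspected, not yet confirmed) are skipped.
failed_nodes() {
  awk '{
    n = split($3, flags, ",")
    for (i = 1; i <= n; i++)
      if (flags[i] == "fail") { print $1, $2; break }
  }'
}
```

A typical use would be `redis-cli -p <port> cluster nodes | failed_nodes`, run against any node that is still up.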
Hard (Permanent) Replica Removal
In this case, the replica is removed from the cluster itself, which is why we call it a hard removal. We can use the "CLUSTER FORGET <node_id>" command for this. The command removes the supplied node_id from the node table of the node on which it is run. To completely remove the node from the cluster, we need to run this command on all the remaining nodes of the cluster, as shown below. Each node that receives the command also bans the forgotten node ID for 60 seconds, so the command should reach every node within that window; otherwise the node can be re-added through gossip from a node that still knows it.
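# Replace the placeholders before running
NODE_ID="<node_id_of_the_node_to_be_removed>"
PORTS="<space-separated list of ports of the remaining nodes>"
for port in $PORTS; do
  # Run the CLUSTER FORGET command on each remaining node
  redis-cli -p "$port" CLUSTER FORGET "$NODE_ID"
done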
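After the loop finishes, it is worth checking that no remaining node still lists the removed ID. The following is a minimal sketch, assuming bash and a local `redis-cli`; `still_known` and `ports_still_knowing` are hypothetical helper names.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds if the `CLUSTER NODES` output on stdin still
# contains a line for the given node ID.
still_known() {
  awk -v id="$1" '$1 == id { found = 1 } END { exit !found }'
}

# Hypothetical wrapper: the first argument is the removed node ID, the rest
# are ports of the remaining nodes. Prints every port whose node still knows
# the removed ID.
ports_still_knowing() {
  local id="$1" port
  shift
  for port in "$@"; do
    if redis-cli -p "$port" cluster nodes | still_known "$id"; then
      echo "$port"
    fi
  done
}
```

Empty output from `ports_still_knowing <node_id> <port> <port> ...` suggests the node has been forgotten everywhere.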
Remove a Primary
Following the same steps as above to remove a replica, we can also remove a primary node. This can be done through soft removal (where the node is marked as failed but remains part of the cluster topology) or hard removal (where the node is completely removed from the cluster and its topology) as stated above.
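The key difference is that once the primary is marked as failed, one of its replicas is promoted to take over as the new primary. Note that CLUSTER FORGET returns an error when it is sent to a node that is currently a replica of the node being forgotten, so for a hard removal, run it on that node only after the promotion has happened.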
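Before removing a primary, it can be useful to record which replicas are attached to it, since one of them is expected to take over. A small sketch, assuming bash; `replicas_of` is a hypothetical helper that matches on the fourth column of `CLUSTER NODES`, which holds the ID of the node being replicated.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `CLUSTER NODES` output on stdin and prints
# "<node_id> <address>" for every node that replicates the given primary ID.
replicas_of() {
  awk -v primary="$1" '$4 == primary { print $1, $2 }'
}
```

For instance, `redis-cli -p <port> cluster nodes | replicas_of <primary_node_id>` lists the candidates; a primary with no output here has no replica to fail over to.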
Special Chaos Scenario When Both Replica and Primary Are Removed
This is a special chaos scenario designed to test the reliability of your system and the behavior of different clients when both the replica and primary are removed. You can follow these steps to create this scenario.
- Update the redis.conf file so that the cluster stays available even when some of the key slots are not covered. To do that, set the following option to "no" in redis.conf:
cluster-require-full-coverage no
- Remove the replica using the CLUSTER FORGET command as mentioned above, so that it is removed from the cluster topology.
- Stop the primary node using the following command, which keeps it in the cluster topology with a "fail" status. Clients will continue sending requests to the node, providing an opportunity to test cluster stability and observe how different client versions behave in this chaos test scenario.
redis-cli -p <port> shutdown
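While this scenario is running, the slot counters in `CLUSTER INFO` give a quick view of how much of the keyspace is affected. The following is a rough sketch, assuming bash; `coverage_summary` is a hypothetical helper and its output format is an arbitrary choice.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `CLUSTER INFO` output on stdin and prints the
# cluster state together with the slot counters on a single line.
coverage_summary() {
  # CLUSTER INFO lines end in CRLF, so strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "cluster_state"          { state = $2 }
    $1 == "cluster_slots_assigned" { assigned = $2 }
    $1 == "cluster_slots_ok"       { ok = $2 }
    $1 == "cluster_slots_fail"     { fail = $2 }
    END { printf "state=%s ok=%s fail=%s assigned=%s\n", state, ok, fail, assigned }'
}
```

Running `redis-cli -p <port> cluster info | coverage_summary` against a surviving node before and after stopping the primary makes the change in slot counters easy to compare.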
Conclusion
We have explored a few straightforward techniques to create chaos scenarios on a Redis cluster for testing cluster stability and client behavior in those situations. However, please exercise caution, as these operations and commands are risky. Only perform them in test environments, ensure safeguards are in place, and execute them in a controlled manner.