Chaos Engineering — Simulate AZ Failures on AWS
In this article, I will walk you through creating a chaos experiment that simulates an Availability Zone (AZ) failure on AWS.
Chaos engineering is about introducing the turbulent conditions that systems are likely to face in production environments. Chaos experiments uncover new information, which can then be used to change code and make our systems more resilient than they were before. Chaos experiments are not equivalent to testing. In testing, we check the system's response against a predefined expected result. In a chaos experiment, we don't have a predefined outcome; the experiment gives us new information about the system, which can then be used to improve it.
Highly available applications need to be resilient against AZ failures. Your application, for example a Kubernetes cluster spanning multiple AZs, should be able to survive such failures. Chaos simulations like this one let you check and prepare for that.
Chaos Toolkit provides a good framework for defining chaos experiments. I have forked the chaostoolkit-aws repo and added AZ failure probes and actions to the ec2 module, using the boto3 Python library for AWS to build them. You can access the code here: AZ Failure Git Repo
This is how an AZ failure experiment comes together:
- Steady State Hypothesis: Before we kick off the experiment, we establish a steady-state hypothesis, that is, "what normal looks like". In this case, I assume that if I can successfully SSH into the EC2 instance, there is no AZ failure and the system is in its normal state.
# ... Refer to the code repo for the full function ...
logger.info('Starting SSH into ec2 instance - ' + instance.instance_id)
ssh = paramiko.SSHClient()
# Accept unknown host keys automatically (acceptable for a throwaway test instance)
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
privkey = paramiko.RSAKey.from_private_key_file(pem_file_path)
try:
    ssh.connect(instance.public_dns_name, username='ec2-user', pkey=privkey, timeout=10)
except Exception:
    # Any connection failure (including a timeout) breaks the steady state
    logger.info('SSH timed out - waited for 10 seconds')
    return False
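If a full SSH login is more than the probe needs, a plain TCP check against port 22 is a lighter way to observe the same network path. The following is a minimal sketch using only the standard library; the function name `tcp_port_reachable` is hypothetical and is not part of the forked repo.

```python
import socket


def tcp_port_reachable(host: str, port: int = 22, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name-resolution failures.
        return False
```

This only shows that the port accepts connections; unlike the paramiko probe, it does not verify that the key pair still authenticates.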
- Action (simulate AZ failure): To simulate an AZ failure, I create a blackhole network ACL and attach it to the subnet of our instance. This blackhole ACL has one rule, which denies all ingress traffic for CIDR '0.0.0.0/0' across all protocols and ports.
# ... Refer to the code repo for the full function ...
logger.info('Simulating AZ failure for - ' + subnet.availability_zone)
# Create a new network ACL
acl_response = create_network_acl(vpc_id)
logger.info('Created new network ACL - ' + str(acl_response))
acl_id = acl_response['NetworkAcl']['NetworkAclId']
# Add the blackhole rule: deny all ingress traffic from any source.
# allow=False makes this a deny rule, matching the description above.
logger.info('Creating blackhole ACL')
create_network_acl_ingress_entry(acl_id, rule_num=1, protocol="-1", cidr_block="0.0.0.0/0", from_port=0, to_port=65535, allow=False)
NetworkAclAssociationId = None
prev_NetworkAclId = None
# Find the ACL currently associated with the subnet
nw_acl_dict = get_network_acls()
for x in nw_acl_dict['NetworkAcls']:
    for y in x['Associations']:
        if y['SubnetId'] == subnet_id:
            NetworkAclAssociationId = y['NetworkAclAssociationId']
            prev_NetworkAclId = y['NetworkAclId']
# d2 is the experiment-state dict defined earlier in the full function;
# the rollback also reads d2["subnet_id"] from it.
d2["acl_id"] = prev_NetworkAclId
d2["blackhole_acl_id"] = acl_id
# Persist the state so the rollback can find both ACLs later
with open("exp_data1.txt", 'w') as f:
    json.dump(d2, f)
logger.info('Replacing original ACL - ' + prev_NetworkAclId + ' with blackhole ACL ' + acl_id + ' on subnet ' + subnet_id)
# Associate the subnet with the blackhole ACL
change_network_acl_association(acl_id, NetworkAclAssociationId)
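The helper functions in the snippet above live in the forked repo. As a rough illustration of the underlying boto3 EC2 client calls such helpers could wrap, here is a minimal sketch. The function name `attach_blackhole_acl` and its return shape are assumptions for illustration, not the repo's actual code, and `ec2` is assumed to be a client created with `boto3.client("ec2")`.

```python
def attach_blackhole_acl(ec2, vpc_id: str, subnet_id: str) -> dict:
    """Swap the subnet's network ACL for a new ACL that denies all ingress.

    `ec2` is assumed to be a boto3 EC2 client. Returns the IDs needed
    to undo the change later.
    """
    # Find the association that currently links the subnet to its ACL.
    acls = ec2.describe_network_acls(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )
    association = next(
        (
            assoc
            for acl in acls["NetworkAcls"]
            for assoc in acl["Associations"]
            if assoc["SubnetId"] == subnet_id
        ),
        None,
    )
    if association is None:
        raise ValueError(f"No network ACL association found for {subnet_id}")

    # Create the blackhole ACL and add an explicit deny-all ingress rule.
    blackhole_id = ec2.create_network_acl(VpcId=vpc_id)["NetworkAcl"]["NetworkAclId"]
    ec2.create_network_acl_entry(
        NetworkAclId=blackhole_id,
        RuleNumber=1,
        Protocol="-1",  # all protocols, so no port range is given
        RuleAction="deny",
        Egress=False,  # ingress rule
        CidrBlock="0.0.0.0/0",
    )

    # Point the subnet's existing association at the blackhole ACL.
    ec2.replace_network_acl_association(
        AssociationId=association["NetworkAclAssociationId"],
        NetworkAclId=blackhole_id,
    )
    return {
        "subnet_id": subnet_id,
        "acl_id": association["NetworkAclId"],
        "blackhole_acl_id": blackhole_id,
    }
```

Passing the client in as a parameter keeps the function easy to exercise against a stub before pointing it at a real account.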
- Check the steady-state hypothesis again: Our steady-state hypothesis was a successful SSH connection to the EC2 instance. With the AZ failure simulated, SSH will time out and the steady-state hypothesis is broken. This failure tells us that the system, as probed, does not survive an AZ failure. We now have new information that can be used to make it more resilient.
- Rollback AZ failure: It is just as important to roll back the simulated failure. Here, we re-attach the previous ACL to the subnet and delete the blackhole ACL.
# Roll back the AZ failure by restoring the original ACL to the subnet and deleting the blackhole ACL
def rollback_az_failure():
    # Load the state saved by the az_failure action
    with open("exp_data1.txt") as f:
        d2 = json.load(f)
    prev_NetworkAclId = d2["acl_id"]
    blackhole_acl_id = d2["blackhole_acl_id"]
    subnet_id = d2["subnet_id"]
    logger.info('Rolling back ACL for subnet ' + subnet_id + ' from blackhole ACL - ' + blackhole_acl_id + ' to original ACL - ' + prev_NetworkAclId)
    # Look up the subnet's current association, which now points at the blackhole ACL
    nw_acl_dict = get_network_acls()
    for x in nw_acl_dict['NetworkAcls']:
        for y in x['Associations']:
            if y['SubnetId'] == subnet_id:
                NetworkAclAssociationId = y['NetworkAclAssociationId']
                change_network_acl_association(prev_NetworkAclId, NetworkAclAssociationId)
    logger.info('Removing blackhole ACL - ' + blackhole_acl_id)
    delete_network_acl(blackhole_acl_id)
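The rollback depends on state written during the action, so it helps to see the save, load and restore steps side by side. The sketch below is a self-contained illustration under stated assumptions: the function names are hypothetical, the state keys mirror those used in the article (`subnet_id`, `acl_id`, `blackhole_acl_id`), and `ec2` is assumed to be a boto3 EC2 client.

```python
import json
from pathlib import Path


def save_state(path, state: dict) -> None:
    """Write the experiment state to disk as JSON."""
    Path(path).write_text(json.dumps(state))


def load_state(path) -> dict:
    """Read the experiment state written by save_state."""
    return json.loads(Path(path).read_text())


def restore_original_acl(ec2, state: dict) -> None:
    """Re-attach the original ACL to the subnet, then delete the blackhole ACL."""
    acls = ec2.describe_network_acls(
        Filters=[{"Name": "association.subnet-id", "Values": [state["subnet_id"]]}]
    )
    for acl in acls["NetworkAcls"]:
        for assoc in acl["Associations"]:
            if assoc["SubnetId"] == state["subnet_id"]:
                # The association ID changed when the blackhole ACL was attached,
                # so it is looked up again rather than read from the saved state.
                ec2.replace_network_acl_association(
                    AssociationId=assoc["NetworkAclAssociationId"],
                    NetworkAclId=state["acl_id"],
                )
    # The blackhole ACL can only be deleted once no subnet is associated with it.
    ec2.delete_network_acl(NetworkAclId=state["blackhole_acl_id"])
```

Storing `subnet_id` alongside the two ACL IDs is what lets the rollback run in a separate process from the action.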
The Chaos Toolkit framework lets us piece this experiment together. The YAML below shows how the experiment is defined in Chaos Toolkit:
version: 1.0.0
title: What happens if there is an AZ Failure
description: Simulate AZ failure by creating blackhole Network ACL
configuration:
  aws_region: us-east-2
steady-state-hypothesis:
  title: SSH access to EC2 machine is working
  probes:
  - type: probe
    name: Check SSH Access to EC2 Instance
    tolerance: true
    provider:
      type: python
      module: chaosaws.ec2.probes
      func: ssh_test
      arguments:
        pem_file_path: Test-Chaos.pem
method:
- type: action
  title: Simulate AZ Failure by creating Blackhole ACL and attaching to Subnet
  name: AZ Failure Action creates a blackhole ACL, attaches it to a subnet thereby simulating AZ failure
  provider:
    type: python
    module: chaosaws.ec2.actions
    func: az_failure
rollbacks:
- type: action
  title: Rollback AZ failure and restore original ACL to Subnet
  name: Rollback AZ failure by restoring Original ACL to Subnet and deleting blackhole ACL
  provider:
    type: python
    module: chaosaws.ec2.actions
    func: rollback_az_failure
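Each `provider` entry of type `python` names a module, a function and optional keyword arguments. To make that mapping concrete, here is a simplified illustration of how such an entry can be resolved and called, and how a probe result can be compared with its `tolerance`. It is a sketch for explanation only, not Chaos Toolkit's actual implementation, and both function names are hypothetical.

```python
import importlib


def run_python_provider(provider: dict):
    """Import provider['module'], look up provider['func'], and call it
    with provider['arguments'] as keyword arguments."""
    module = importlib.import_module(provider["module"])
    func = getattr(module, provider["func"])
    return func(**provider.get("arguments", {}))


def within_tolerance(result, tolerance) -> bool:
    """Simplest tolerance form: the probe result must equal the tolerance value."""
    return result == tolerance


# Example with a standard-library function standing in for a real probe.
provider = {
    "type": "python",
    "module": "json",
    "func": "dumps",
    "arguments": {"obj": [1, 2]},
}
result = run_python_provider(provider)
print(within_tolerance(result, "[1, 2]"))
```

With Chaos Toolkit and the forked chaosaws package installed, the experiment itself is run with `chaos run` followed by the path to the experiment file.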
Happy Coding!