Chaos Engineering — Simulate AZ Failures on AWS

In this article, I will walk you through how to create a chaos experiment that simulates an Availability Zone (AZ) failure on AWS.

Gaurav Gupta

Jun. 08, 20 · Tutorial

Chaos engineering is about introducing the turbulent conditions that systems are likely to face in production. Chaos experiments uncover new information, which can then be used to change code and make our systems more resilient than they were before. Chaos experiments are not equivalent to testing. In testing, we check a system's response against a predefined expected result. A chaos experiment has no predefined outcome; it gives us new information about the system, which we can then use to improve it.

In this article, I will walk you through how to create a chaos experiment that simulates an Availability Zone (AZ) failure on AWS. Highly available applications need to be resilient against AZ failures. Your application, for example a Kubernetes cluster spanning multiple AZs, should be able to survive such a failure. Chaos simulations like this one let you check and prepare for that.

Chaos Toolkit provides a good framework for defining chaos experiments. I have forked the chaostoolkit-aws repo and added AZ failure probes and actions to the ec2 module. I used boto3, the AWS SDK for Python, to build these experiments. You can access the code here: AZ Failure Git Repo

This is how an AZ failure experiment comes together:

Steady State Hypothesis: Before we kick off the experiment, we establish a steady-state hypothesis, that is, "what normal looks like". In this case, I assume that if I can successfully SSH into the EC2 instance, there is no AZ failure and the system is in its normal state.

# Excerpt from the ssh_test probe; refer to the code repo for the full function.
# Requires `import paramiko` at module level.
logger.info('Starting SSH into EC2 instance: ' + instance.instance_id)
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
privkey = paramiko.RSAKey.from_private_key_file(pem_file_path)
try:
    ssh.connect(instance.public_dns_name, username='ec2-user', pkey=privkey, timeout=10)
except Exception:
    # Any connection failure, including the 10-second timeout, breaks the hypothesis
    logger.info('SSH failed or timed out (timeout: 10 seconds)')
    return False
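
The same hypothesis can be approximated without a full SSH login. The sketch below is not the repo's probe: it assumes that a plain TCP connection to port 22 is an acceptable stand-in for "SSH works", and the function name `tcp_port_reachable` is made up for illustration. It uses only the Python standard library.

```python
import socket


def tcp_port_reachable(host: str, port: int = 22, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (including timeouts) when the port is unreachable
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Called as `tcp_port_reachable(instance.public_dns_name)`, this should return False once the instance's subnet is blackholed, after waiting up to the timeout.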

Action: Simulate AZ failure. To simulate an AZ failure, I create a blackhole network ACL and attach it to the subnet of our instance. This blackhole ACL has one rule, which denies all ingress traffic: it covers the CIDR '0.0.0.0/0' and all ports.

# Excerpt from the az_failure action; refer to the code repo for the full function.
logger.info('Simulating AZ failure for ' + subnet.availability_zone)
# Create a new network ACL in the VPC
acl_response = create_network_acl(vpc_id)
logger.info('Created new network ACL: ' + str(acl_response))
acl_id = acl_response['NetworkAcl']['NetworkAclId']
# Add the blackhole rule. allow=False makes it a deny rule, as the description above requires.
# With protocol "-1" (all protocols), the rule covers all ports regardless of the port range given.
logger.info('Creating blackhole ACL')
create_network_acl_ingress_entry(acl_id, rule_num=1, protocol="-1", cidr_block="0.0.0.0/0", from_port=-30000, to_port=30000, allow=False)
NetworkAclAssociationId = None
prev_NetworkAclId = None
# Find the ACL currently associated with the subnet
nw_acl_dict = get_network_acls()
for x in nw_acl_dict['NetworkAcls']:
    for y in x['Associations']:
        if y['SubnetId'] == subnet_id:
            NetworkAclAssociationId = y['NetworkAclAssociationId']
            prev_NetworkAclId = y['NetworkAclId']
# Save the IDs that rollback_az_failure reads back, including the subnet ID
d2 = {"acl_id": prev_NetworkAclId, "blackhole_acl_id": acl_id, "subnet_id": subnet_id}
with open("exp_data1.txt", 'w') as f:
    json.dump(d2, f)
logger.info('Replacing original ACL ' + prev_NetworkAclId + ' with blackhole ACL ' + acl_id + ' on subnet ' + subnet_id)
# Associate the subnet with the blackhole ACL
change_network_acl_association(acl_id, NetworkAclAssociationId)
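
For readers who want to see the underlying AWS calls, here is a minimal sketch of the same action written directly against a boto3 EC2 client. It is an illustration, not the repo's code: the function names are hypothetical, the client is passed in as `ec2`, and the sketch adds an explicit egress deny rule as well, which the original action does not do. A newly created custom network ACL already denies all traffic until rules are added, so the explicit rules mainly make the intent visible.

```python
from typing import Tuple


def find_subnet_association(acls_response: dict, subnet_id: str) -> Tuple[str, str]:
    """Return (association_id, acl_id) for the ACL currently attached to subnet_id."""
    for acl in acls_response["NetworkAcls"]:
        for assoc in acl["Associations"]:
            if assoc["SubnetId"] == subnet_id:
                return assoc["NetworkAclAssociationId"], assoc["NetworkAclId"]
    raise LookupError(f"no network ACL association found for subnet {subnet_id}")


def attach_blackhole_acl(ec2, vpc_id: str, subnet_id: str) -> dict:
    """Create a deny-all ACL, attach it to the subnet, and return the IDs needed for rollback."""
    acl_id = ec2.create_network_acl(VpcId=vpc_id)["NetworkAcl"]["NetworkAclId"]
    # One deny rule per direction; Protocol "-1" covers all protocols and ports.
    for egress in (False, True):
        ec2.create_network_acl_entry(
            NetworkAclId=acl_id,
            RuleNumber=1,
            Protocol="-1",
            RuleAction="deny",
            Egress=egress,
            CidrBlock="0.0.0.0/0",
        )
    # Ask only for ACLs that have an association with this subnet.
    response = ec2.describe_network_acls(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )
    association_id, previous_acl_id = find_subnet_association(response, subnet_id)
    # Swap the subnet over to the blackhole ACL.
    ec2.replace_network_acl_association(AssociationId=association_id, NetworkAclId=acl_id)
    return {"acl_id": previous_acl_id, "blackhole_acl_id": acl_id, "subnet_id": subnet_id}
```

With real credentials, the client would come from `boto3.client("ec2", region_name="us-east-2")`. Passing the client in as a parameter also lets the function be exercised against a stub before it touches a real VPC.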

Check steady-state hypothesis again: Our steady-state hypothesis was a successful SSH connection to the EC2 instance. Now that the AZ failure is simulated, SSH will time out and the steady-state hypothesis is broken. This failure means that our system will not survive an AZ failure. We have new information, which we can now use to make the system more resilient.

Rollback AZ failure: It is also important to roll back the AZ failure. To do so, we attach the previous ACL to the subnet and delete the blackhole ACL.

import json

# Roll back the AZ failure by restoring the original ACL to the subnet and deleting the blackhole ACL
def rollback_az_failure():
    # Load the IDs saved by the az_failure action
    with open("exp_data1.txt") as f:
        d2 = json.load(f)
    prev_NetworkAclId = d2["acl_id"]
    blackhole_acl_id = d2["blackhole_acl_id"]
    subnet_id = d2["subnet_id"]
    logger.info('Rolling back ACL for subnet ' + subnet_id + ' from blackhole ACL ' + blackhole_acl_id + ' to original ACL ' + prev_NetworkAclId)
    # Look up the subnet's current association; its ID changed when the blackhole ACL was attached
    NetworkAclAssociationId = None
    nw_acl_dict = get_network_acls()
    for x in nw_acl_dict['NetworkAcls']:
        for y in x['Associations']:
            if y['SubnetId'] == subnet_id:
                NetworkAclAssociationId = y['NetworkAclAssociationId']
    change_network_acl_association(prev_NetworkAclId, NetworkAclAssociationId)
    logger.info('Removing blackhole ACL ' + blackhole_acl_id)
    delete_network_acl(blackhole_acl_id)
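
The order of the steps described so far (check the steady state, inject the failure, check again, always roll back) can be sketched in plain Python. This is only an illustration of the sequence, not how Chaos Toolkit is implemented; `run_experiment` and its three callables are hypothetical names.

```python
from typing import Callable


def run_experiment(
    probe: Callable[[], bool],
    action: Callable[[], None],
    rollback: Callable[[], None],
) -> dict:
    """Run probe -> action -> probe, and roll back even if the action or probe raises."""
    result = {"steady_before": probe(), "steady_after": None, "rolled_back": False}
    if not result["steady_before"]:
        # The system is not healthy to begin with, so do not inject a failure.
        return result
    try:
        action()
        result["steady_after"] = probe()
    finally:
        # Restore the original state no matter what happened above.
        rollback()
        result["rolled_back"] = True
    return result
```

The `try`/`finally` is the important design choice: if the rollback were skipped after an error, the subnet would stay attached to the blackhole ACL.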

The Chaos Toolkit framework allows us to piece this experiment together. The YAML below shows how to define it in Chaos Toolkit:

version: 1.0.0
title: What happens if there is an AZ Failure
description: Simulate AZ failure by creating blackhole Network ACL
configuration:
  aws_region: us-east-2
steady-state-hypothesis:
  title: SSH access to EC2 machine is working
  probes:
    - type: probe
      name: Check SSH Access to EC2 Instance
      tolerance: true
      provider:
        type: python
        module: chaosaws.ec2.probes
        func: ssh_test
        arguments:
          pem_file_path: Test-Chaos.pem
method:
- type: action
  title: Simulate AZ Failure by creating Blackhole ACL and attaching to Subnet
  name: AZ Failure Action creates a blackhole ACL, attaches it to a subnet thereby simulating AZ failure
  provider:
    type: python
    module: chaosaws.ec2.actions
    func: az_failure
rollbacks:
- type: action
  title: Rollback AZ failure and restore original ACL to Subnet
  name: Rollback AZ failure by restoring Original ACL to Subnet and deleting blackhole ACL
  provider:
    type: python
    module: chaosaws.ec2.actions
    func: rollback_az_failure
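
Chaos Toolkit also accepts experiments written as JSON, so the same definition can be generated from Python. The sketch below mirrors the YAML above field for field; the function names `build_experiment` and `python_provider` are assumptions made for illustration.

```python
import json
from typing import Optional


def python_provider(module: str, func: str, arguments: Optional[dict] = None) -> dict:
    """Build a Chaos Toolkit 'python' provider block."""
    provider = {"type": "python", "module": module, "func": func}
    if arguments:
        provider["arguments"] = arguments
    return provider


def build_experiment(pem_file_path: str, region: str = "us-east-2") -> dict:
    """Return the AZ failure experiment as a plain dict with the same fields as the YAML."""
    return {
        "version": "1.0.0",
        "title": "What happens if there is an AZ Failure",
        "description": "Simulate AZ failure by creating blackhole Network ACL",
        "configuration": {"aws_region": region},
        "steady-state-hypothesis": {
            "title": "SSH access to EC2 machine is working",
            "probes": [{
                "type": "probe",
                "name": "Check SSH Access to EC2 Instance",
                "tolerance": True,
                "provider": python_provider(
                    "chaosaws.ec2.probes", "ssh_test", {"pem_file_path": pem_file_path}
                ),
            }],
        },
        "method": [{
            "type": "action",
            "name": "Simulate AZ failure with a blackhole ACL",
            "provider": python_provider("chaosaws.ec2.actions", "az_failure"),
        }],
        "rollbacks": [{
            "type": "action",
            "name": "Roll back AZ failure and restore the original ACL",
            "provider": python_provider("chaosaws.ec2.actions", "rollback_az_failure"),
        }],
    }


if __name__ == "__main__":
    print(json.dumps(build_experiment("Test-Chaos.pem"), indent=2))
```

Saving the printed JSON to a file (for example `az-failure.json`, a hypothetical name) and passing that file to `chaos run` should behave like the YAML version, assuming the forked chaosaws module is installed.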

Happy Coding!
