Chaos Monkey for Spring Boot Microservices

Learn how to use the Chaos Monkey library to test the resiliency and performance of your Spring Boot microservices.

Piotr Mińkowski

May. 24, 18 · Tutorial


How many of you have never encountered a crash or a failure of your systems in a production environment? Certainly, each one of you, sooner or later, has experienced it. If we are not able to avoid failure, the solution seems to be to keep our system in a state of permanent failure. This concept underpins Chaos Monkey, the tool invented by Netflix to test the resilience of its IT infrastructure. A few days ago, I came across a solution, based on the idea behind Netflix's tool, designed to test Spring Boot applications. The library has been implemented by Codecentric. Until now, I knew them only as the authors of another interesting solution dedicated to the Spring Boot ecosystem: Spring Boot Admin. I have already described that library in one of my previous articles, Monitoring Microservices With Spring Boot Admin.

Today, I'm going to show you how to include Codecentric's Chaos Monkey in your Spring Boot application, and then implement chaos engineering in a sample system consisting of some microservices. The Chaos Monkey library can be used together with Spring Boot 2.0, and its current release version is 1.0.1. However, I'll implement the sample using version 2.0.0-SNAPSHOT, because it has some interesting new features not available in earlier versions of the library. To be able to download the SNAPSHOT version of Codecentric's Chaos Monkey library, you have to include the Maven repository that hosts it in the repositories section of your pom.xml.

1. Enable Chaos Monkey for an Application

There are two required steps for enabling Chaos Monkey for a Spring Boot application. First, let's add the library chaos-monkey-spring-boot to the project's dependencies.

<dependency>
  <groupId>de.codecentric</groupId>
  <artifactId>chaos-monkey-spring-boot</artifactId>
  <version>2.0.0-SNAPSHOT</version>
</dependency>

Then, we should activate the profile chaos-monkey on application startup.

$ java -jar target/order-service-1.0-SNAPSHOT.jar --spring.profiles.active=chaos-monkey

2. Sample System Architecture

Our sample system consists of three microservices, each started in two instances, and a service discovery server. Microservices register themselves against a discovery server and communicate with each other through an HTTP API. The Chaos Monkey library is included in every single instance of all running microservices, but not the discovery server. Here's a diagram that illustrates the architecture of our sample system:

The source code of the sample application is available on GitHub in the repository sample-spring-chaosmonkey. After cloning this repository and building it using mvn clean install, you should first run discovery-service. Then, run two instances of every microservice on different ports by setting the -Dserver.port property to an appropriate number. Here's the set of commands I ran:

$ java -jar target/discovery-service-1.0-SNAPSHOT.jar
$ java -jar target/order-service-1.0-SNAPSHOT.jar --spring.profiles.active=chaos-monkey
$ java -jar -Dserver.port=9091 target/order-service-1.0-SNAPSHOT.jar --spring.profiles.active=chaos-monkey
$ java -jar target/product-service-1.0-SNAPSHOT.jar --spring.profiles.active=chaos-monkey
$ java -jar -Dserver.port=9092 target/product-service-1.0-SNAPSHOT.jar --spring.profiles.active=chaos-monkey
$ java -jar target/customer-service-1.0-SNAPSHOT.jar --spring.profiles.active=chaos-monkey
$ java -jar -Dserver.port=9093 target/customer-service-1.0-SNAPSHOT.jar --spring.profiles.active=chaos-monkey

3. Process Configuration

In version 2.0.0-SNAPSHOT of the chaos-monkey-spring-boot library, Chaos Monkey is enabled by default for applications that include it. You may disable it using the property chaos.monkey.enabled. However, the only assault that is enabled by default is latency. This type of assault adds a random delay to the requests processed by the application, in the range determined by the properties chaos.monkey.assaults.latencyRangeStart and chaos.monkey.assaults.latencyRangeEnd. The number of attacked requests depends on the property chaos.monkey.assaults.level, where 1 means each request and 10 means each 10th request. We can also enable exception and appKiller assaults for our application. For simplicity, I set the same configuration for all the microservices. Let's take a look at the settings provided in application.yml:

chaos:
  monkey:
    assaults:
      level: 8
      latencyRangeStart: 1000
      latencyRangeEnd: 10000
      exceptionsActive: true
      killApplicationActive: true
    watcher:
      repository: true
      restController: true

In theory, the configuration visible above should enable all three available types of assaults. However, if you enable latency and exceptions, killApplication will never happen. Also, if you enable both latency and exceptions, all the requests sent to the application will be attacked, no matter which level is set with the chaos.monkey.assaults.level property. It is important to remember to activate the restController watcher, which is disabled by default.
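
To make the meaning of level and the latency range more concrete, here is a minimal sketch of the latency assault on its own, as described in this step. It is an illustration only, not the library's implementation: the class LatencyAssaultSketch and its methods are hypothetical names, and the deterministic "every N-th request" counter is an assumption taken from the description above.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration only; this is NOT how chaos-monkey-spring-boot is implemented.
public class LatencyAssaultSketch {

    // Values mirror the application.yml shown in this step.
    public static final int LEVEL = 8;
    public static final int LATENCY_RANGE_START = 1000;
    public static final int LATENCY_RANGE_END = 10000;

    // Assumption: level N means "attack every N-th request" (request numbers start at 1).
    public static boolean isAttacked(long requestNumber) {
        return requestNumber % LEVEL == 0;
    }

    // Picks a delay in milliseconds between the start and end of the range, both inclusive.
    public static int randomLatencyMillis() {
        return ThreadLocalRandom.current()
                .nextInt(LATENCY_RANGE_START, LATENCY_RANGE_END + 1);
    }

    public static void main(String[] args) {
        // With level 8, only requests 8 and 16 out of the first 16 are delayed.
        for (long n = 1; n <= 16; n++) {
            if (isAttacked(n)) {
                System.out.println("request " + n + " delayed by "
                        + randomLatencyMillis() + " ms");
            }
        }
    }
}
```

Setting LEVEL to 1 in this sketch would delay every request, which matches the meaning of level 1 given above.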

4. Enable Spring Boot Actuator Endpoints

Codecentric has implemented a new feature in version 2.0 of their Chaos Monkey library: an endpoint for Spring Boot Actuator. To enable it for our applications, we have to activate it following the actuator convention, by setting the property management.endpoint.chaosmonkey.enabled to true. Additionally, beginning with version 2.0 of Spring Boot, we have to expose that HTTP endpoint for it to be available after application startup.

management:
  endpoint:
    chaosmonkey:
      enabled: true
  endpoints:
    web:
      exposure:
        include: health,info,chaosmonkey

chaos-monkey-spring-boot provides several endpoints that allow you to check and modify the configuration. You can call GET /chaosmonkey to fetch the whole configuration of the library. You may also disable Chaos Monkey after starting the application by calling POST /chaosmonkey/disable. The full list of available endpoints is listed here.
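
As a rough sketch, the two endpoints mentioned above could be called from plain Java (11 or later) with the JDK's built-in HTTP client. The class name ChaosMonkeyActuatorClient is hypothetical, and the base URL http://localhost:8080/actuator is an assumption: it combines Spring Boot 2's default actuator base path with an example port, so adjust it to the instance you want to query.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper for calling the Chaos Monkey actuator endpoints.
public class ChaosMonkeyActuatorClient {

    private final String baseUrl;

    // baseUrl is the actuator root of one instance, e.g. http://localhost:8080/actuator (assumed).
    public ChaosMonkeyActuatorClient(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    // GET /chaosmonkey: fetches the whole configuration of the library.
    public HttpRequest configurationRequest() {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/chaosmonkey"))
                .GET()
                .build();
    }

    // POST /chaosmonkey/disable: disables Chaos Monkey on a running instance.
    public HttpRequest disableRequest() {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/chaosmonkey/disable"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) throws Exception {
        ChaosMonkeyActuatorClient client =
                new ChaosMonkeyActuatorClient("http://localhost:8080/actuator");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(client.configurationRequest(), HttpResponse.BodyHandlers.ofString());
        // Prints the HTTP status code and the raw response body.
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Running main requires one of the sample services to be up with the chaosmonkey endpoint enabled and exposed as shown in the configuration above.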

5. Running Applications

All the sample microservices store data in MySQL, so the first step is to run the MySQL database locally using its Docker image. The Docker command below also creates a database and a user with a password.

$ docker run -d --name mysql -e MYSQL_DATABASE=chaos -e MYSQL_USER=chaos -e MYSQL_PASSWORD=chaos123 -e MYSQL_ROOT_PASSWORD=123456 -p 33306:3306 mysql

After running all the sample applications, where all microservices are multiplied in two instances listening on different ports, our environment looks like the figure below.

You will see the following information in your logs during application boot:

We may check out the Chaos Monkey configuration settings for every running instance of the application by calling the following actuator endpoint:

6. Testing the System

For testing purposes, I used a popular performance testing library: Gatling. The test creates 20 simultaneous threads, each of which calls the POST /orders and GET /orders/{id} methods exposed by order-service via the API gateway 500 times.

import java.util.concurrent.TimeUnit

import scala.concurrent.duration.{Duration, FiniteDuration}
import scala.util.Random

import io.gatling.core.Predef._
import io.gatling.http.Predef._

class ApiGatlingSimulationTest extends Simulation {

  // Each virtual user adds an order and then reads it back, 500 times
  val scn = scenario("AddAndFindOrders").repeat(500, "n") {
    exec(
      http("AddOrder-API")
        .post("http://localhost:8090/order-service/orders")
        .header("Content-Type", "application/json")
        .body(StringBody("""{"productId":""" + Random.nextInt(20) + ""","customerId":""" + Random.nextInt(20) + ""","productsCount":1,"price":1000,"status":"NEW"}"""))
        .check(status.is(200), jsonPath("$.id").saveAs("orderId"))
    ).pause(Duration.apply(5, TimeUnit.MILLISECONDS))
      .exec(
        http("GetOrder-API")
          .get("http://localhost:8090/order-service/orders/${orderId}")
          .check(status.is(200))
      )
  }

  // Start 20 users at once and cap the whole run at 10 minutes
  setUp(scn.inject(atOnceUsers(20))).maxDuration(FiniteDuration.apply(10, "minutes"))

}

A POST endpoint is implemented inside OrderController in the add(...) method. It calls find methods exposed by customer-service and product-service using OpenFeign clients. If a customer has sufficient funds and there are still products in stock, it accepts the order and performs changes for the customer and product using PUT methods. Here's the implementation of two methods tested by a Gatling performance test:

@RestController
@RequestMapping("/orders")
public class OrderController {

 @Autowired
 OrderRepository repository;
 @Autowired
 CustomerClient customerClient;
 @Autowired
 ProductClient productClient;

 @PostMapping
 public Order add(@RequestBody Order order) {
  Product product = productClient.findById(order.getProductId());
  Customer customer = customerClient.findById(order.getCustomerId());
  int totalPrice = order.getProductsCount() * product.getPrice();
  if (customer != null && customer.getAvailableFunds() >= totalPrice && product.getCount() >= order.getProductsCount()) {
   order.setPrice(totalPrice);
   order.setStatus(OrderStatus.ACCEPTED);
   product.setCount(product.getCount() - order.getProductsCount());
   productClient.update(product);
   customer.setAvailableFunds(customer.getAvailableFunds() - totalPrice);
   customerClient.update(customer);
  } else {
   order.setStatus(OrderStatus.REJECTED);
  }
  return repository.save(order);
 }

 @GetMapping("/{id}")
 public Order findById(@PathVariable("id") Integer id) {
  Optional<Order> order = repository.findById(id);
  if (order.isPresent()) {
   Order o = order.get();
   Product product = productClient.findById(o.getProductId());
   o.setProductName(product.getName());
   Customer customer = customerClient.findById(o.getCustomerId());
   o.setCustomerName(customer.getName());
   return o;
  } else {
   return null;
  }
 }

 // ...

}

Chaos Monkey sets a random latency between 1000 and 10000 milliseconds (as shown in step 3). It is important to change the default timeouts for the Feign and Ribbon clients before starting a test. I decided to set readTimeout to 5000 milliseconds. This causes some delayed requests to time out, while others succeed (roughly half and half). Here is the timeout configuration for the Feign client:

feign:
  client:
    config:
      default:
        connectTimeout: 5000
        readTimeout: 5000
  hystrix:
    enabled: false
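
The split between timed-out and successful delayed calls can be sanity-checked with a little arithmetic. The sketch below assumes the injected delay is uniformly distributed over the configured range and ignores the service's own processing time; both are simplifying assumptions, and TimeoutShare is a hypothetical helper, not part of any library.

```java
// Hypothetical helper: estimates how many delayed calls beat the read timeout.
public class TimeoutShare {

    // Fraction of delays in [start, end] that are shorter than the timeout,
    // assuming the delay is uniformly distributed over that range.
    public static double shareBelowTimeout(int start, int end, int timeout) {
        if (timeout <= start) {
            return 0.0;
        }
        if (timeout >= end) {
            return 1.0;
        }
        return (double) (timeout - start) / (end - start);
    }

    public static void main(String[] args) {
        // Latency range from step 3 and the readTimeout configured above.
        double share = shareBelowTimeout(1000, 10000, 5000);
        System.out.printf("%.0f%% of delayed calls finish within the read timeout%n",
                share * 100);
    }
}
```

Under these assumptions the share is 4000 / 9000, a little under half, so raising readTimeout or narrowing the latency range shifts the balance towards successful calls.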

Here's the Ribbon client's timeout configuration for the API gateway. We have also changed the Hystrix settings to disable the circuit breaker for Zuul.

ribbon:
  ConnectTimeout: 5000
  ReadTimeout: 5000

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 15000
      fallback:
        enabled: false
      circuitBreaker:
        enabled: false

To launch the Gatling performance test, go to the performance-test directory and run gradle loadTest. Here's the result generated with latency assaults enabled. Of course, we can change this result by manipulating the Chaos Monkey latency values or the Ribbon and Feign timeout values.

Here's a Gatling graph with average response times. The results do not look good. However, we should remember that a single POST method from order-service calls two methods exposed by product-service and two methods exposed by customer-service.

Here's the next Gatling result graph; this time it illustrates a timeline with error and success responses. All HTML reports generated by Gatling during the performance test are available under the directory performance-test/build/gatling-results.


Published at DZone with permission of Piotr Mińkowski, DZone MVB. See the original article here.
