Achieving Fault Tolerance With Resilience4j

Learn how Resilience4j, a fault tolerance library, can help design each layer of your application to handle errors and operate through failures.

Bohdan Storozhuk

Aug. 09, 17 · Tutorial

This is the first article of a short series about the Resilience4j library. It gives a short introduction to Resilience4j's functionality, its unique features, and the motivation behind it. The remaining articles in the series will share some insights into library internals such as data structures, algorithms, and other tricks.

Intro

Resilience4j is a fault tolerance library designed for Java 8 and functional programming. It is lightweight, modular, and really fast. We will talk about its modules and functionality later, but first, let's briefly discuss why you should even bother with fault tolerance.

Fault Tolerance

Fault tolerance is the ability of a system to keep operating properly when some of its components fail. It sounds easy, but it is hard to achieve: if you want a system to be fault tolerant, this has to be part of the design at every level and in every subsystem. And it is not only about proper error handling; you should also keep your failure domains as small as possible and work on fault isolation and the possibility of self-stabilization. Error handling seems to be the easiest problem here, but I'll have to disappoint you.

Error Handling Is Still a Thing

There is a very interesting paper called "What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems." In this study, error-handling bugs are the second largest category (18%) after logic bugs. The authors break down error-handling bugs into three classes of problems:

Error/Failure Detection - Errors are often ignored or incorrectly detected;
Error Propagation - This class of problems arises in layered systems where error detection and error handling code are located on different layers and errors are not propagated correctly across those layers;
Error Handling - Sometimes it's not clear how to handle rare corner cases, and the lack of such specifications leads to error-prone code.

Example

Let's continue with a small example of a subsystem that could potentially fail:

// Simulates a microservice for user management
public interface UserService {
  Picture fetchProfilePicture(String userId);
}

We Can Do Better

So we all know that things break down from time to time, and we often turn a blind eye to it, like this:

try {
    profilePicture = userService.fetchProfilePicture(userId);
} catch(Exception e) {
    Logger.error("The world is not a perfect place ", e);
}

Yes, logging is a very important aspect of failure detection, but we can be a little bit smarter about it, and this is where Resilience4j can help you. The key word here is "help." The library won't automatically fix all possible bugs; all the important work and choices are still on you. The library can only make this hard road a little easier. Besides logging, there are countless other things we can do when a failure occurs. Here are a few options just off the top of my head:

Define "fallback" operations that can go to another host, query a backup DB, or reuse the latest successful response. The example uses Vavr's Try monad to recover from an exception and invoke another lambda expression as a fallback:

Supplier<Picture> fetchTargetPicture = () -> userService.fetchProfilePicture(targetID);
// in case of failure you'll receive some stub picture
Picture profilePicture = Try.ofSupplier(fetchTargetPicture)
    .recover(throwable -> Picture.defaultForProfile()).get();

Apply automatic retrying and configure max attempts count and wait duration before retries:

Supplier<Picture> fetchTargetPicture = () -> userService.fetchProfilePicture(targetID);

RetryConfig retryConfig = RetryConfig.custom()
    .maxAttempts(3)
    .waitDuration(Duration.ofMillis(500)) // java.time.Duration
    .build();
Retry retry = Retry.of("userService", retryConfig);
// it will try to fetch the image up to 3 times with a 500 ms pause between attempts
fetchTargetPicture = Retry.decorateSupplier(retry, fetchTargetPicture);

Picture profilePicture = Try.ofSupplier(fetchTargetPicture)
    .recover(throwable -> Picture.defaultForProfile()).get();

Use circuit breaking, where you can track error rates of some service/component and, in case of problems, stop all operations with it to help it recover:

Supplier<Picture> fetchTargetPicture = () -> userService.fetchProfilePicture(targetID);

CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("userService");
// once the circuit is open (too many recent failures of UserService),
// it will reject calls without invoking the original fetchTargetPicture
fetchTargetPicture = CircuitBreaker
    .decorateSupplier(circuitBreaker, fetchTargetPicture);

RetryConfig retryConfig = RetryConfig.custom().maxAttempts(3).build();
Retry retry = Retry.of("userService", retryConfig);
fetchTargetPicture = Retry.decorateSupplier(retry, fetchTargetPicture);

Picture profilePicture = Try.ofSupplier(fetchTargetPicture)
    .recover(throwable -> Picture.defaultForProfile()).get();

Send an event directly to the monitoring system to speed up problem detection:

Supplier<Picture> fetchTargetPicture = () -> userService.fetchProfilePicture(targetID);

CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("userService");
fetchTargetPicture = CircuitBreaker
    .decorateSupplier(circuitBreaker, fetchTargetPicture);

// they know what to do with it
circuitBreaker.getEventPublisher()
  .onError(event -> Houston.weHaveAProblem(event));

RetryConfig retryConfig = RetryConfig.custom().maxAttempts(3).build();
Retry retry = Retry.of("userService", retryConfig);
fetchTargetPicture = Retry.decorateSupplier(retry, fetchTargetPicture);

Picture profilePicture = Try.ofSupplier(fetchTargetPicture)
    .recover(throwable -> Picture.defaultForProfile()).get();
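If you wonder what circuit breaking with an error callback boils down to, here is a deliberately tiny plain-JDK sketch. It is not how Resilience4j implements its CircuitBreaker: the class name SimpleCircuitBreaker, the "open after N consecutive failures" rule, and the single onError callback are all assumptions made for illustration, and the sketch never closes the circuit again.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical, simplified circuit breaker: opens after N consecutive failures. */
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final Consumer<Throwable> onError; // invoked for every recorded failure
    private int consecutiveFailures = 0;
    private boolean open = false;

    SimpleCircuitBreaker(int failureThreshold, Consumer<Throwable> onError) {
        this.failureThreshold = failureThreshold;
        this.onError = onError;
    }

    synchronized boolean isOpen() {
        return open;
    }

    /** Wraps the supplier; once the circuit is open, the supplier is no longer called. */
    <T> Supplier<T> decorate(Supplier<T> supplier) {
        return () -> {
            if (isOpen()) {
                throw new IllegalStateException("circuit is open, call not permitted");
            }
            try {
                T result = supplier.get();
                recordSuccess();
                return result;
            } catch (RuntimeException e) {
                recordFailure(e);
                throw e;
            }
        };
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    private synchronized void recordFailure(Throwable failure) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
        }
        onError.accept(failure); // e.g. forward to a monitoring system
    }
}
```

A real circuit breaker also needs a way back to the closed state (usually via a half-open probing phase) and a smarter failure metric than a consecutive-failure counter; the sketch only shows the "stop calling a failing component and tell someone about it" part.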

Instant event-based notifications are really great, but in general, you should always have a separate monitoring system that polls the health checks of all your nodes and watches for anomalies in your metrics. Resilience4j has add-on modules for integration with Prometheus and Dropwizard Metrics, so you can easily publish your metrics to these systems. For example, with Dropwizard Metrics:

final MetricRegistry collectorRegistry = new MetricRegistry();
collectorRegistry.registerAll(CircuitBreakerMetrics.ofCircuitBreaker(circuitBreaker));

Now you can see what makes Resilience4j unique from the API standpoint: it is just decoration of method references or any other functional interfaces by means of higher-order functions. For those of you who love FP, it should be very appealing. If you like OOP more, just use the Proxy pattern, which works especially well with AOP interceptors.
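To show that there is no magic in this style of API, here is a small plain-JDK sketch of decoration with higher-order functions. The Decorators class and its withRetry and withFallback methods are hypothetical names made up for this example, not part of Resilience4j, and the retry here does not wait between attempts.

```java
import java.util.function.Supplier;

/** Hypothetical helpers that decorate a Supplier with extra behavior. */
final class Decorators {
    private Decorators() { }

    /** Calls the supplier up to maxAttempts times and rethrows the last failure. */
    static <T> Supplier<T> withRetry(int maxAttempts, Supplier<T> supplier) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        return () -> {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return supplier.get();
                } catch (RuntimeException e) {
                    last = e; // remember the failure and try again
                }
            }
            throw last; // non-null here: every attempt failed
        };
    }

    /** Returns the fallback value if the supplier throws. */
    static <T> Supplier<T> withFallback(Supplier<T> supplier, Supplier<T> fallback) {
        return () -> {
            try {
                return supplier.get();
            } catch (RuntimeException e) {
                return fallback.get();
            }
        };
    }
}
```

Because each decorator takes a Supplier and returns a Supplier, they compose freely, for example Decorators.withFallback(Decorators.withRetry(3, fetch), () -> stub), and the calling code keeps working with a plain Supplier.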

There are no library interfaces that you have to implement in order to guard an operation. You aren't forced to delegate all operations to some separate thread pool. From this standpoint, Resilience4j is very flexible and can be used with any programming paradigm or concurrency model.
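For the OOP-minded, the Proxy-pattern route can be sketched with the JDK's dynamic proxies. Everything named here is an assumption for illustration: PictureService is a simplified stand-in for the UserService interface from the example (it returns a String instead of a Picture), and RetryingProxy is a hypothetical helper, not a Resilience4j class.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

/** Hypothetical stand-in for the article's UserService. */
interface PictureService {
    String fetchProfilePicture(String userId);
}

/** Hypothetical helper: wraps any interface implementation in a retrying proxy. */
final class RetryingProxy {
    private RetryingProxy() { }

    @SuppressWarnings("unchecked")
    static <T> T wrap(Class<T> iface, T target, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        return (T) Proxy.newProxyInstance(
            iface.getClassLoader(),
            new Class<?>[] { iface },
            (proxy, method, args) -> {
                Throwable last = null;
                for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                    try {
                        return method.invoke(target, args);
                    } catch (InvocationTargetException e) {
                        last = e.getCause(); // the exception thrown by the target
                    }
                }
                throw last; // every attempt failed
            });
    }
}
```

The caller still sees a plain PictureService, so the guarding logic stays out of both the service implementation and the calling code. In a framework with AOP support, an interceptor would typically play the role of the invocation handler.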

Main Modules

Resilience4j can help you apply a range of fault tolerance techniques. It also has some bug prevention capabilities: you can restrict the call rate of a method to at most N [req/timeUnit] or limit the number of concurrent executions. Everything is highly configurable, and there are metrics in place (where it makes sense). All features have very low overhead, and CircuitBreaker, RateLimiter, and Bulkhead can be configured to be completely garbage free. By using the internal event system, you can react immediately to any problem or failure and get notified about it.
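To give a feeling for what "limit the number of concurrent executions" means, here is a minimal plain-JDK sketch built on java.util.concurrent.Semaphore. The SimpleBulkhead class is a hypothetical name for this example and says nothing about how the Resilience4j Bulkhead is implemented; in particular, this sketch rejects a call immediately instead of optionally waiting for a free slot.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical, simplified bulkhead: at most maxConcurrentCalls run at the same time. */
final class SimpleBulkhead {
    private final Semaphore permits;

    SimpleBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Wraps the supplier; rejects the call immediately if no permit is free. */
    <T> Supplier<T> decorate(Supplier<T> supplier) {
        return () -> {
            if (!permits.tryAcquire()) {
                throw new IllegalStateException("bulkhead is full");
            }
            try {
                return supplier.get();
            } finally {
                permits.release(); // always give the permit back
            }
        };
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Rejecting early keeps a slow dependency from tying up every thread of the caller, which is the fault isolation idea from the beginning of the article.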

Here is a full list of core and add-on modules:

Core modules:

resilience4j-circuitbreaker
resilience4j-ratelimiter
resilience4j-bulkhead
resilience4j-retry
resilience4j-cache

Add-on modules:

resilience4j-metrics: Dropwizard Metrics exporter
resilience4j-prometheus: Prometheus Metrics exporter
resilience4j-spring-boot: Spring Boot Starter
resilience4j-ratpack: Ratpack Starter
resilience4j-retrofit: Retrofit Call Adapter Factories
resilience4j-vertx: Vertx Future decorator
resilience4j-consumer: Circular Buffer Event consumer
resilience4j-rxjava2: integration of internal event system with rxjava2

Additional Resources

If you are interested, please visit our GitHub page or take a look at the User Guide.

For Spring Boot users, we have a starter module and a small demo project.
