Distributed Sagas for Microservices
Learn about using the distributed saga pattern and AWS Step Functions to ensure correctness and consistency in your microservices.
Join the DZone community and get the full member experience.
Join For FreeIn this article, learn about the distributed saga pattern, how it helps ensure correctness & consistency in microservices, and how you can use AWS Step Functions as a Saga Execution Coordinator.
The Problem
Microservices has taken over software. While it has its benefits, the approach brings some added complexity.
Imagine that we’re building a travel booking platform using the microservices approach.
In the above diagram, a high-level business action (‘Booking a trip’) involves making several low-level actions to distinct microservices.
To handle client requests, we create a single-purpose edge service (a Backend for Frontend) that provides the logic of composing calls to all the downstream services. At its core, the Travel Agent Service is an orchestration layer that exposes coarse grained APIs by composing finer grained functionality provided by different microservices.
Transactions are an essential part of software applications. Without them, it would be impossible to maintain data consistency. However, in microservices we no longer have a single source of truth. State is spread across distinct services each with its own data store.
If all of the service calls completed successfully, great! But in the real world, failures occur on a regular basis. How do we handle partial executions - when a subset of requests in the high-level action failed?
With the microservices approach, you can’t just book a flight, a car, and a hotel in a single ACID transaction. To do it consistently, you would be required to create a distributed transaction.
If the flight reservation failed, would you like to keep the hotel and car? Probably not.
In this situation, we would need to implement some ad-hoc concurrency control logic in the edge Travel Agent service to handle any potential failures and try to recover from it in some way (Do we cancel the other bookings? What if the flight booking retried and succeeded later on? How do we know what state we are in?)
Without some kind of concurrency control mechanism, we risk having inconsistent data in our application - which is especially bad in distributed systems.
Understandably, this control logic can get very complicated. Dealing with partial failures and asynchrony is hard… And that’s where distributed sagas come in.
Distributed Sagas
In distributed systems, business transactions spanning multiple services require a mechanism to ensure data consistency across services. The Distributed Saga pattern is a pattern for managing failures, where each action has a compensating action for rollback. Distributed Sagas help ensure consistency and correctness across microservices.
The term "saga" was first used in a 1987 research paper by Hector Garcia-Molina and Kenneth Salem. It’s introduced as an conceptual alternative for long lived database transactions.
More recently, sagas have been applied to help solve consistency problems in modern distributed systems - as documented in the 2015 distributed sagas paper.
A Saga represents a high-level business process (such as booking a trip) that consists of several low-level Requests that each update data within a single service. Each Request has a Compensating Request that is executed when the Request fails or the saga is aborted.
For example, our travel booking platform’s ‘Book a Trip’ saga consists of the following Requests: BookHotel
, BookCar
, and BookFlight
.
A Request has a corresponding Compensating Request that semantically undoes the Request. CancelHotel
undoes BookHotel
, CancelFlight
undoes BookFlight
and so on.
Note that certain actions are not undo-able in the conventional sense. An email that was sent to the wrong recipient cannot be un-sent. However, we can semantically undo the action by sending another email that says ‘Sorry, please ignore the previous email.
Compensating Requests semantically undoes a Request by restoring the application’s state to the original state of equilibrium before the Request was made.
Distributed Saga Guarantee
Amazingly, a distributed saga guarantees one of the following two outcomes:
- Either all Requests in the Saga are succesfully completed, or
- A subset of Requests and their Compensating Requests are executed.
The catch is for distributed sagas to work, both Requests and Compensating Requests need to obey certain characteristics:
- Requests and Compensating Requests must be idempotent, because the same message may be delivered more than once. However many times the same idempotent request is sent, the resulting outcome must be the same. An example of an idempotent operation is an UPDATE operation. An example of an operation that is NOT idempotent is a CREATE operation that generates a new
id
every time. - Compensating Requests must be commutative, because messages can arrive in order. In the context of a distributed saga, it’s possible that a Compensating Request arrives before its corresponding Request. If a
BookHotel
completes afterCancelHotel
, we should still arrive at a cancelled hotel booking (not re-create the booking!) - Requests can abort, which triggers a Compensating Request. Compensating Requests CANNOT abort, they have to execute to completion no matter what.
Sagas as State Machines
How do we define a saga? As a state machine.
State machines have been a central concept in programming for a very long time, and are ideal for coordinating many small components with fast, predictable performance.
State machines consist of different states that each perform a specific task. The state machine passes data between components, and decides the next step in the operation of the application.
Distributed Saga Implementation Approaches
There are a couple of different ways to implement a Saga transaction, but the two most popular are:
- Event-driven choreography: When there is no central coordination, each service produces and listen to other service’s events and decides if an action should be taken or not.
- Command/Orchestration: When a coordinator service is responsible for centralizing the saga’s decision making and sequencing business logic.
In this guide, we’ll look at the latter. With the orchestration approach, we define a new Saga Execution Coordinator service whose sole responsibility is to manage a workflow and invoke downstream services when it needs to.
Saga Execution Coordinator
The Saga Execution Coordinator is an orchestration service that:
- Stores & interprets a Saga’s state machine
- Executes the Requests of a Saga by talking to other services
- Handles failure recovery by executing Compensating Requests
Building such a service is a non-trivial exercise, but fortunately we can repurpose an existing service to accomplish our goal.
AWS Step Functions
AWS Step Functions is a workflow management service that can be used as an Saga Execution Coordinator to oversee the execution of our distributed sagas.
AWS Step Functions is a web service that enables you to coordinate the components of distributed applications and microservices using visual workflows. You build applications from individual components that each perform a discrete function, or task, allowing you to scale and change applications quickly.
Step Functions provides a graphical console to visualize the components of your application as a series of steps. It automatically triggers and tracks each step, and retries when there are errors, so your application executes in order and as expected, every time. Step Functions logs the state of each step, so when things do go wrong, you can diagnose and debug problems quickly.
The way the Step Functions works is as follows:
It uses its own AWS States Language domain-specific language to define state machines:
There are seven state types you can use to create your state machine. Task
can be used as Requests and Compensating Requests, while Choice
and Parallel
is used for control flow.
By definining state machines with the AWS States Language and Step Functions, your can coordinate components into workflows that match your business requirements, add retry logic, error handling, and more:
AWS also gives you a web interface you can use to debug and track the execution of your Step Functions state machines from the AWS console:
Hands-On With AWS Lambda and Step Functions
In this section, we’ll build a distributed saga for the travel booking example in the beginning of this guide.
You can clone and run the example application on GitHub.
git clone git@github.com:yosriady/serverless-sagas.git
cd serverless-sagas
serverless deploy
Note that the above example requires you to have the Serverless framework and AWS credentials set up on your machine.
After deploying, you will get a series of AWS Lambda Function ARNs that you can use as the Requests and Compensating Requests in your saga.
Go to the AWS Step Functions console and create a new step function with the states.json
file included in the example application. We create the following state machine using the AWS States Language:
In the above state machine:
- We define a starting root state
"StartAt": "BookTrip"
- In the
States
section, we define a list of states in our state machine, such asBookHotel
andCancelHotel
Task
states as well as theBookTrip
Parallel
state for executing things in parallel. - Each state can have a
Catch
clause which lets you route execution to other states if a certain type of error is returned. In our example, we have a catch configured forBookTrip
so that if any of the branchesBookHotel
,BookFlight
, andBookCar
errors out the compensating requestsCancelHotel
,CancelFlight
, andCancelCar
is executed. - We also use the
Catch
to make our compensating requests retry itself if it failed to execute.
Remember to update any "Resource": "arn:aws:lambda:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:function:serverless-sagas-dev-cancelCar"
to use your deployed AWS Lambda function ARNs from the previous step.
Once created, you will then be able to visualize and execute your state machine on the AWS console:
AWS Step Functions Summary
AWS Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows. The service makes it simple to manage distributed sagas that communicate with multiple components.
AWS Step Functions manages the operations and underlying infrastructure for you to help ensure your application is available at any scale.
In Closing
We learned about what Distributed Sagas are (a pattern for handling failure in microservices) and how it solves the problem of distributed transactions. It helps ensure correctness and consistency in microservices.
We also learned about Saga Execution Coordinators, and how you can get started with defining state machines using the AWS States Language and AWS Step Functions.
Thank you for reading!
Published at DZone with permission of Yos Riady, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments