AWS Serverless Lambda Resiliency: Part 2
We continue to address the patterns and considerations for resilience in cloud-native serverless systems with 5 additional scenarios.
In this series of articles, we are addressing the patterns for the resilience of cloud-native serverless systems. Please refer to article one for the introduction of this series.
Pattern 3: Lambda Synchronous Invocation and Circuit State Validated by API Gateway and Leveraging Fallback Switchover
In this option, let's consider Lambdas serving synchronous requests through the API Gateway. When there are issues with the external service, the API Gateway starts sending traffic to the fallback service instead of the Lambda function.
Circuit is closed:
- The API Gateway calls the Lambda function.
- Lambda function calls the external service.
- The calls to the external service are observed using CloudWatch (based on errors and error rates).
Circuit switches to open:
- Now, let's say, there are issues with the external service.
- A new request comes in through the API Gateway, which then calls the Lambda function.
- Lambda function calls the external service.
- The calls to the external service are observed using CloudWatch.
- CloudWatch identifies the issues (failure/error/error code) and raises alarm/event.
- The event source configuration in CloudWatch triggers the Circuit breaker Lambda.
- The Circuit Breaker Lambda creates an item in DynamoDB that represents the open circuit. The item's Time To Live (TTL) sets how long the circuit stays open. The Lambda then uses the SDK to set the API Gateway to call the fallback service.
- Circuit is now switched to an open state.
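The open-circuit switchover above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the two clients passed in are `boto3.client("dynamodb")` and `boto3.client("apigateway")`, and the table name, the `circuit_id` key, the `expires_at` TTL attribute, and the `OPEN_SECONDS` value are hypothetical names chosen for illustration.

```python
import time

OPEN_SECONDS = 60  # hypothetical duration of the open circuit


def open_circuit(dynamodb, apigateway, *, table, rest_api_id, resource_id,
                 http_method, stage, fallback_uri, now=None):
    """Record the open state with a TTL, then route the API to the fallback."""
    now = int(time.time()) if now is None else int(now)
    expires_at = now + OPEN_SECONDS

    # DynamoDB TTL reads a Number attribute holding epoch seconds.
    # "circuit_id" and "expires_at" are assumed attribute names.
    dynamodb.put_item(
        TableName=table,
        Item={
            "circuit_id": {"S": "external-service"},
            "state": {"S": "OPEN"},
            "expires_at": {"N": str(expires_at)},
        },
    )

    # Point the method's integration at the fallback service.
    apigateway.update_integration(
        restApiId=rest_api_id,
        resourceId=resource_id,
        httpMethod=http_method,
        patchOperations=[
            {"op": "replace", "path": "/uri", "value": fallback_uri},
        ],
    )
    # The integration change only takes effect once the stage is redeployed.
    apigateway.create_deployment(restApiId=rest_api_id, stageName=stage)
    return expires_at
```

Passing the clients in as arguments keeps the function easy to exercise with stand-ins. The table must have TTL enabled on the `expires_at` attribute for the expiry step to happen.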
Circuit is open:
- A new request comes in through the API Gateway.
- The API gateway invokes the fallback service.
Circuit switches to closed:
- The TTL of the item (representing the open state) in DynamoDB expires.
- The TTL expiry will result in the DynamoDB stream triggering the Circuit Breaker Lambda.
- The Circuit Breaker Lambda will use the SDK to point the API Gateway back to the original Lambda function.
- Circuit is now switched to a closed state.
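For the closing step, the Circuit Breaker Lambda needs to tell a TTL expiry apart from any other deletion in the stream. DynamoDB Streams marks TTL deletions as `REMOVE` records whose `userIdentity` is the DynamoDB service. A minimal sketch follows; the `close_circuit` callback is a hypothetical hook that would restore the original integration, and it is passed in here only so the example stays self-contained.

```python
def is_ttl_expiry(record):
    """True when a DynamoDB stream record is a deletion performed by TTL."""
    identity = record.get("userIdentity") or {}
    return (
        record.get("eventName") == "REMOVE"
        and identity.get("type") == "Service"
        and identity.get("principalId") == "dynamodb.amazonaws.com"
    )


def handler(event, context, close_circuit=lambda keys: None):
    """Close the circuit for each expired open-state item in the batch."""
    closed = 0
    for record in event.get("Records", []):
        if is_ttl_expiry(record):
            # Keys identify which circuit item expired.
            close_circuit(record["dynamodb"]["Keys"])
            closed += 1
    return {"closed": closed}
```

Note that DynamoDB deletes expired items in a background process, so the removal (and therefore the stream event) can arrive later than the TTL timestamp. The open duration is a lower bound rather than an exact value.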
When may this approach be applicable?
- This approach works for synchronous services.
- This option does not invoke the original Lambda function when the circuit is open. Hence the cost involved with Lambda invocations is avoided when the circuit is open. However, there will still be cost involved to invoke the fallback service.
- Calling a fallback service is viable when the external service has issues.
- The Lambda function is not aware of the circuit breaker.
- The circuit is either open or closed. There is no half-open state.
Based on the nature of the Lambda service, the canary deployment feature may be suitable so that you can send part of the traffic to the fallback service and the remaining to the original Lambda. The traffic to the original Lambda can be iteratively increased to avoid full load as soon as the external service recovers. This will enable the circuit to achieve a half-open state.
Pattern 4: Lambda Synchronous Invocation and Circuit State Validated by API Gateway and Leveraging Interceptor Function
Circuit is closed:
- The API Gateway calls the Interceptor Lambda function.
- Interceptor Lambda function checks the status of the circuit (which is closed) from DynamoDB.
- As the circuit is closed, the Interceptor Lambda function further calls the Lambda function.
- Lambda function calls the external service.
- The calls to the external service are observed using CloudWatch (based on errors and error rates).
Circuit switches to open:
- Now, let's say, there are issues with the external service.
- A new request comes in through the API Gateway, which then calls the Interceptor Lambda function.
- Interceptor Lambda function checks the status of the circuit (which is still closed) in DynamoDB.
- Interceptor Lambda function calls the Lambda function.
- Lambda function calls the external service.
- The calls to the external service are observed using CloudWatch.
- CloudWatch identifies the issues (failure/error/error code) and raises alarm/event.
- The event source configuration in CloudWatch triggers the Interceptor Lambda function.
- Interceptor Lambda function creates an item in the DynamoDB, which will have a duration for the open circuit. We will set the duration using the item's Time To Live.
- Circuit is now switched to an open state.
Circuit is open:
- A new request comes in through the API Gateway, which then calls the Interceptor Lambda function.
- Interceptor Lambda function checks the status of the circuit (which is open) from DynamoDB.
- As the circuit is open, the Interceptor Lambda function invokes the fallback service (or alternatively returns an error) instead of calling the Lambda function.
Circuit switches to half-open:
- The TTL of the item (representing the open state) in DynamoDB expires.
- The TTL expiry will result in the DynamoDB stream triggering the Interceptor Lambda function.
- The Interceptor Lambda function creates an item in the DynamoDB, which will have a duration for which the circuit is half open and an invocation limit on the Lambda function for that specific duration. We set the duration using the item's Time To Live.
- Circuit is now switched to a half-open state.
Circuit is half-open:
- A new request comes in through the API Gateway, which then calls the Interceptor Lambda function.
- Interceptor Lambda function checks the status of the circuit (which is half-open) from DynamoDB.
- As the circuit is half-open, the Interceptor Lambda function checks whether the invocation limit is reached. If the limit is reached, the Interceptor Lambda function calls the fallback service. If the limit is not reached, the Interceptor Lambda function decreases the remaining invocation limit and then calls the Lambda function.
- If CloudWatch identifies any issues (failure/error/error code), it raises the alarm/event, which results in the invocation of the Interceptor Lambda function to re-open the circuit.
Circuit switches to closed:
- Let's say the external service handled the requests as per the service objective.
- The TTL of the item (representing the half-open state) in DynamoDB expires.
- The TTL expiry will result in the DynamoDB stream triggering the Interceptor Lambda function.
- The Interceptor Lambda function can now increase the invocation limit OR switch the circuit to Closed. Let's consider the design here where the circuit is closed.
- Circuit is now switched to a closed state.
- The Lambda function starts showing normal behavior expected when the circuit is closed.
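The routing decision the Interceptor Lambda makes in each state can be sketched as a small pure function. This is an illustration under assumptions: the `state` and `remaining` field names and the `"primary"`/`"fallback"` labels are hypothetical, and `item` stands for the circuit record already read from DynamoDB (or `None` when no record exists, which is treated as a closed circuit).

```python
def route(item):
    """Decide where the interceptor sends a request.

    Returns a (target, updated_item) tuple, where target is "primary"
    (the functional Lambda) or "fallback".
    """
    if item is None or item["state"] == "CLOSED":
        return "primary", item

    if item["state"] == "OPEN":
        return "fallback", item

    if item["state"] == "HALF_OPEN":
        if item["remaining"] <= 0:
            # Invocation limit for the half-open window is used up.
            return "fallback", item
        # Consume one trial invocation before calling the functional Lambda.
        return "primary", {**item, "remaining": item["remaining"] - 1}

    raise ValueError(f"unknown circuit state: {item['state']!r}")
```

For example, `route({"state": "HALF_OPEN", "remaining": 2})` sends the request to the functional Lambda and returns a record with one trial left. Against a real table, the decrement would likely be a conditional `UpdateItem` so that concurrent requests cannot exceed the limit.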
We can use different approaches to set the open/half-open state time if issues are identified repeatedly. The duration can be based on exponential back-off or random jitter.
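One way to combine the two ideas is capped exponential back-off with jitter applied to the open duration. The sketch below is one possible scheme, not the only one; the `base` and `cap` defaults and the `trip_count` parameter (how many times in a row the circuit has re-opened) are assumptions for illustration.

```python
import random


def open_duration(trip_count, base=30.0, cap=900.0, rng=random):
    """Seconds to keep the circuit open after the given consecutive trip.

    The ceiling doubles with every trip up to `cap`. Half of it is fixed
    and half is random, so the circuit never re-closes almost immediately
    and many circuits do not all retry at the same moment.
    """
    if trip_count < 1:
        raise ValueError("trip_count starts at 1")
    ceiling = min(cap, base * 2 ** (trip_count - 1))
    return ceiling / 2 + rng.uniform(0.0, ceiling / 2)
```

With these defaults, the first trip keeps the circuit open for 15 to 30 seconds, the second for 30 to 60 seconds, and so on until the 900-second cap bounds the range at 450 to 900 seconds. The result would be added to the current time to produce the item's TTL.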
When may this approach be applicable?
- This approach works for synchronous services.
- It is acceptable from a business standpoint to function with reduced functionality.
- There is a fallback function or service, which can be an alternate implementation with full or reduced functionality.
- The advantage here is that the functional Lambda has no knowledge of the circuit breaker and it is handled/abstracted from it by the interceptor Lambda. This would mean that circuit breaker logic does not need to be coded within the functional Lambda.
Pattern 5: Lambda Synchronous Invocation and Circuit State Managed Between Cold Start and Warm Start
Circuit is closed:
- The API Gateway calls the Lambda function.
- Lambda function checks the status of the circuit (which is closed) from Global variables.
- As the circuit is closed, the Lambda function calls the external service.
Circuit switches to open:
- Now, let's say, there are issues with the external service.
- A new request comes in through the API Gateway, which then calls the Lambda function.
- Lambda function checks the status of the circuit (which is still closed) in Global variables.
- As the circuit is closed, the Lambda function calls the external service.
- The Lambda function encounters issues (failure/error/error code) with the external service. When the external service fails, the number of failures is captured in global variables along with the time of those failures. When a configurable number of failures occurs within a configurable time window, the circuit is opened. The global variables also hold the durations for the open and half-open states. We set these durations as Time To Live values.
- Circuit is now switched to an open state.
Circuit is open:
- A new request comes in through the API Gateway, which then calls the Lambda function.
- Lambda function checks the status of the circuit (which is open) from Global variables.
- As the circuit is open, the Lambda function invokes fallback (or alternatively returns an error).
Circuit switches to half-open:
- The TTL of the Global variables (for the open circuit) setting expires.
- The TTL of the Global variables (for the half-open circuit) is still active.
- Circuit is now switched to a half-open state.
Circuit is half-open:
- A new request comes in through the API Gateway, which then calls the Lambda function.
- Lambda function checks the status of the circuit (which is half-open) from Global variables.
- As the circuit is half-open, the Lambda function checks whether the invocation limit is reached. If the limit is reached, the Lambda function calls the fallback (or alternatively returns an error). If the limit is not reached, the Lambda function decreases the remaining invocation limit and then calls the external service.
- If the Lambda function identifies any issues (failure/error/error code) it will re-open the circuit.
Circuit switches to closed:
- Let's say the external service handled the requests as per the service objective.
- The TTL of the item (representing the half-open state) in Global variables expires.
- Circuit is now switched to a closed state.
- The Lambda function starts showing normal behavior expected when the circuit is closed.
When may this approach be applicable?
- This approach works for synchronous services.
- It is acceptable from a business standpoint to function with reduced functionality.
- There is a fallback function or service, which can be an alternate implementation with full or reduced functionality.
- The advantage here is that there is no dependency on external circuit breaker state management (through DynamoDB) or requiring an external circuit breaker function. Everything can be managed inside the functional Lambda.
- The disadvantage is that the circuit breaker only works within a single instance of the Lambda function, from its cold start through its subsequent warm starts. The circuit state is not shared with other concurrent instances.
- Each new cold start would start a new circuit breaker cycle with the assumption that the circuit is closed.
Pattern 6: Lambda Synchronous Invocation and Circuit State Managed Between Cold Start and Warm Start With Externalized Circuit State Management
This combines pattern 4 with pattern 5. Here the circuit state would be managed both within Lambda global variables and externally within DynamoDB. This would provide additional efficiencies, as DynamoDB would be accessed only on Lambda cold starts and not on Lambda warm starts.
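A rough sketch of this layering is shown below. It assumes two hypothetical callables, `load_external` and `save_external`, that would wrap the DynamoDB read and write; they are passed in so the example does not depend on any particular table layout.

```python
# Module-level copy of the circuit state for this execution environment.
_cache = {"loaded": False, "state": "CLOSED"}


def get_state(load_external):
    """Return the circuit state, reading the external store only once
    per execution environment (that is, on the cold start)."""
    if not _cache["loaded"]:
        # A missing external record is treated as a closed circuit.
        _cache["state"] = load_external() or "CLOSED"
        _cache["loaded"] = True
    return _cache["state"]


def set_state(state, save_external):
    """Update the local copy and persist it for future cold starts."""
    _cache["state"] = state
    _cache["loaded"] = True
    save_external(state)
```

The trade-off in this sketch is that a warm instance does not see a state change written by another instance until its own next cold start, so the design suits cases where that delay is acceptable.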
Pattern 7: Lambda Asynchronous Invocation and Circuit State Managed by Invoking Lambda
In this option, let's consider Lambdas serving asynchronous requests through SQS. Here we adjust the event source mapping (its enabled status, batch size, and batch window) based on the status of the external service. Messages that could not be processed based on the business requirements will be sent to a dead letter queue.
Let's walk through how this option works.
Circuit is closed:
- The Lambda service reads the messages from SQS (through the event source mapping) and calls the Lambda function.
- Lambda function calls the external service.
- The calls to the external service are observed using CloudWatch (based on errors and error rates).
Circuit switches to open:
- Now, let's say, there are issues with the external service.
- New messages arrive in SQS, and the Lambda service calls the Lambda function with them.
- Lambda function calls the external service.
- The calls to the external service are observed using CloudWatch.
- CloudWatch identifies the issues (failure/error/error code) and raises alarm/event.
- The event source configuration in CloudWatch triggers the Circuit breaker Lambda.
- Circuit Breaker Lambda function creates an item in the DynamoDB, which will have a duration for the open circuit. We will set the duration using the item's Time To Live.
- The Circuit Breaker Lambda function uses the SDK to set the status of the event source mapping to disabled.
- Circuit is now switched to an open state.
Circuit is open:
- As the circuit is open, the Lambda function will not be invoked when there are new messages in SQS.
Circuit switches to half-open:
- The TTL of the item (representing the open state) in DynamoDB expires.
- The TTL expiry will result in the DynamoDB stream triggering the Circuit Breaker Lambda.
- The Lambda function creates an item in the DynamoDB, which will have a duration for which the circuit is half-open. The Lambda function will use the SDK to enable the event source mapping and update the batch size and batch window for that specific duration. We set the duration using the item's Time To Live.
- Circuit is now switched to a half-open state.
Circuit is half-open:
- New messages arrive in SQS, and the Lambda service calls the Lambda function with them.
- The batch size and batch window are now as per the half-open setting.
- If CloudWatch identifies any issues (failure/error/error code), it raises the alarm/event, which results in the invocation of the Circuit Breaker Lambda to re-open the circuit.
- SQS moves the messages to the dead letter queue based on a pre-defined setting.
Circuit switches to closed:
- Let's say the external service handled the requests as per the service objective.
- The TTL of the item (representing the half-open state) in DynamoDB expires.
- The TTL expiry will result in the DynamoDB stream triggering the Circuit Breaker Lambda.
- The Circuit Breaker Lambda function can now increase the batch size and batch window values OR switch the circuit to Closed. Let's consider the design here where the circuit is closed.
- Circuit is now switched to the closed state.
- The Lambda function starts showing normal behavior expected when the circuit is closed.
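The event source mapping changes across the three states can be sketched with the Lambda `update_event_source_mapping` call. This is a minimal sketch: it assumes `lambda_client` is `boto3.client("lambda")`, and the batch size and batch window values are hypothetical settings chosen for illustration.

```python
# Hypothetical settings: a small trickle while half-open, normal flow when closed.
HALF_OPEN = {"BatchSize": 1, "MaximumBatchingWindowInSeconds": 0}
CLOSED = {"BatchSize": 10, "MaximumBatchingWindowInSeconds": 5}


def apply_circuit_state(lambda_client, mapping_uuid, state):
    """Reconfigure the SQS event source mapping for the given circuit state."""
    if state == "OPEN":
        # Stop polling: messages stay in the queue while the circuit is open.
        params = {"Enabled": False}
    elif state == "HALF_OPEN":
        params = {"Enabled": True, **HALF_OPEN}
    elif state == "CLOSED":
        params = {"Enabled": True, **CLOSED}
    else:
        raise ValueError(f"unknown circuit state: {state!r}")
    return lambda_client.update_event_source_mapping(UUID=mapping_uuid, **params)
```

The Circuit Breaker Lambda would call this with `"OPEN"` when the CloudWatch alarm fires, and with `"HALF_OPEN"` or `"CLOSED"` when the corresponding DynamoDB item expires. While the mapping is disabled, messages remain in the queue subject to its retention period.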
We can use different approaches to set the open/half-open state time if issues are identified repeatedly. The duration can be based on exponential back-off or random jitter.
When may this approach be applicable?
- This approach works for asynchronous services.
- It is acceptable from a business standpoint to process messages at a lesser rate.
- It is feasible to implement event correlation so that each message is processed only once (based on requirement).