Application Self-Healing: Common Failures and How to Avoid Them

In this article, get familiar with self-healing and review some common areas in which applications experience failures.

Alireza C

Jan. 29, 22 · Opinion

This is an article from DZone's 2021 Application Performance Management Trend Report.

Today, automation is one of the major goals in IT projects. Many platforms run on cloud infrastructure and are fully automated at both the platform and infrastructure levels. Companies are moving forward with automation and extending it to disaster recovery. As a result, many applications are designed to avoid failures and to recover automatically when failures occur. This is often called "self-healing." In this article, we introduce self-healing and review some common areas in which applications experience failures.

What Is a Self-Healing Application?

A self-healing application detects a failure and tries to correct it before it escalates into a larger issue. For users, self-healing applications reduce system downtime. For developers, self-healing frees up time for development instead of fixing issues. When something fails, a self-healing application keeps running, for example by relying on replicas of the application. In short, it tries to restore the application to its default state. The primary tasks of a self-healing feature are to detect and repair problems. A self-healing application can automatically detect failures and system errors and, if necessary, stop the system automatically. For the purposes of this article, when we say application, we mean the whole system to which the application or platform belongs.

Self-Healing Levels

An application or platform is more than the code you develop. The hardware on which you run your code and the third parties you connect to are also part of your application. It is quite common for applications to depend on several third parties. On one hand, it is easier to focus on your main application’s functionality and use third parties for other services. On the other hand, if a third party fails, your application can fail with it.

Then, it is up to you to take action and fix the issue. Moreover, when running your application in the cloud, you are dealing with virtual infrastructure and are responsible for your own infrastructure disaster recovery strategy and setup. Although cloud providers publish an SLA and promise to keep their services up and running at the highest level they can, it is up to you to ensure your application stays accessible no matter where the failure is.

Before thinking about how to create a self-healing mechanism, you need to identify the points of failure. To design a self-healing system, you need a holistic monitoring overview of your application. Make sure nothing is off your radar. Then, you can define the possible scenarios and act accordingly to keep your application up and running. To get a better picture of what needs to be done on the monitoring side, let’s break monitoring into some sub-areas.

Observability Level

Monitoring is one of the most important parts of any application. A monitoring solution lets you observe application behavior at runtime. It also lets you check infrastructure performance, network transactions, and third-party availability. The monitoring setup is not only about the application itself; it covers everything related to the application, from infrastructure to application components and third parties.

Smart Alerts

When setting up alerts, engineers have to specify warning and critical thresholds for each alert. However, people often begin to receive notifications or emails as soon as the alerts are set up. These notifications do not necessarily mean that there is a failure. They may only mean that the system has passed a threshold. After a while, most engineers tend to ignore alert notifications that seem unimportant. That can lead to real issues being missed among many false-positive alerts. The correct approach is to fine-tune the alerts so that each one is something you should take seriously and act on. Alerts that are not important should be either tuned or removed.
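
One possible way to tune an alert, sketched here in Python, is to raise it only when a metric stays above its threshold for several consecutive checks, so that a single spike does not notify anyone. The class name, the thresholds, and the choice of three consecutive readings are assumptions made for illustration and are not taken from any particular monitoring tool.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fire only when a metric stays above a threshold for several checks in a row."""

    def __init__(self, warning: float, critical: float, required_breaches: int = 3):
        self.warning = warning
        self.critical = critical
        self.required_breaches = required_breaches
        # Keep only the most recent readings needed for a decision.
        self.recent = deque(maxlen=required_breaches)

    def observe(self, value: float) -> str:
        """Record one reading and return "ok", "warning", or "critical"."""
        self.recent.append(value)
        if len(self.recent) < self.required_breaches:
            return "ok"  # not enough history yet to judge
        if all(v >= self.critical for v in self.recent):
            return "critical"
        if all(v >= self.warning for v in self.recent):
            return "warning"
        return "ok"


# Hypothetical CPU readings (percent): one isolated spike, then a sustained rise.
cpu_alert = SustainedThresholdAlert(warning=70.0, critical=90.0, required_breaches=3)
readings = [95.0, 40.0, 75.0, 80.0, 92.0, 93.0, 97.0]
states = [cpu_alert.observe(r) for r in readings]
```

In this sketch the isolated 95% reading is ignored, and the alert only escalates once three readings in a row are above the warning or critical level.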

Log Everything

Logging is not necessarily part of monitoring, but the data you collect makes it important. With logging, you can record all events with the exact time and date. When something fails, logs are the most valuable information you have: they tell you what happened, when, and where. That’s why it is vital to log everything, so you can track all possible reasons behind failures. It is recommended to centralize logging in your monitoring system.

In most monitoring tools, you can connect the logging system to the monitoring setup, and the monitoring system will process the logging data. Smart monitoring systems can identify the relationship between application components, hardware, and third parties. Therefore, they can create a summary out of monitoring and logging data at the time of failure, which helps find the root cause faster.
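
As a minimal sketch of logging that a monitoring system could process, the Python example below writes each event as one JSON line with a timestamp, a level, and the component name. The formatter class, the logger name "payment-service", and the in-memory buffer that stands in for a central log collector are all assumptions for illustration.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger


buffer = io.StringIO()  # stands in for a central log collector
log = build_logger("payment-service", buffer)
log.error("database connection dropped after %d ms", 1500)
event = json.loads(buffer.getvalue())  # what a log processor would parse
```

Because every line is machine-readable, a log processor can filter by component or level and line events up by time when a failure is investigated.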

Common Failure Areas

These are some of the common areas in which we experience application failures. Let’s look at each failure area separately and the associated solution for each.

Loss of Network Connectivity

One of the most common failures is losing network connectivity. Connection loss can affect the entire application or happen inside an application between components, like a dropped database connection. The best approach to self-heal these failures is a retry mechanism, which increases the chance of recovery within a short period. A good monitoring tool comes in handy here: you can trigger a retry operation from the monitoring alert. That way, you can record an incident, as well as resolve it, automatically.
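
A retry mechanism can be sketched in a few lines of Python. The example below retries an operation when it raises ConnectionError and doubles the wait between attempts (exponential backoff, a common refinement that the text above does not prescribe). The function names, the number of attempts, and the delays are assumptions for illustration.

```python
import time


def retry(operation, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on ConnectionError wait and retry with a growing delay."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # give up and let the caller (or an alert) handle it
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...


# Hypothetical flaky dependency: fails twice, then succeeds.
calls = {"count": 0}


def connect_to_database():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection dropped")
    return "connected"


delays = []  # record the waits instead of really sleeping in this demo
result = retry(connect_to_database, sleep=delays.append)
```

Limiting the number of attempts matters: if the connection never comes back, the error should surface so that an incident is recorded instead of retrying forever.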

Lack of Scalability

When the number of requests is higher than an application can handle, the application starts to fail. The solution is to make the application scalable. Scaling can be designed and handled at different stages of application development. The best time to think about scalability is the architecture stage, when you design your application with all its components. You can choose technologies that handle scalability in an automated manner. One example is using container-based architectures and tools like Kubernetes, which handle scalability at different levels.
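
To make automated scaling concrete, the Python sketch below computes how many replicas to run from the current load per replica and a target load per replica. It follows the basic ratio documented for the Kubernetes Horizontal Pod Autoscaler (current replicas multiplied by current metric over target metric, rounded up), but it is a simplification: the function name and the minimum and maximum bounds are assumptions, and a real autoscaler adds tolerances and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_load, target_load,
                     min_replicas=1, max_replicas=10):
    """Scale the replica count in proportion to load, within fixed bounds."""
    wanted = math.ceil(current_replicas * current_load / target_load)
    return max(min_replicas, min(max_replicas, wanted))


# Load is expressed as average CPU percent per replica in these examples.
scale_up = desired_replicas(current_replicas=2, current_load=90, target_load=60)
scale_down = desired_replicas(current_replicas=4, current_load=20, target_load=60)
capped = desired_replicas(current_replicas=5, current_load=300, target_load=50)
```

The upper bound keeps a traffic spike from scaling the application past what the infrastructure or budget allows.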

Long-Running Transactions

Long-running transactions can fail partway through, and without extra measures, the transaction has to start again from the beginning after each failure. To make these transactions resilient, you can create checkpoints that record the stage at which the failure occurred. Then, the system can restart the transaction and continue from where it left off.
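
A checkpoint can be as simple as a small file that records how many steps have finished. The Python sketch below assumes a transaction that can be split into ordered steps which are safe to resume between; the function name, the JSON file layout, and the extract/transform/load steps are hypothetical.

```python
import json
import os
import tempfile


def run_with_checkpoints(steps, checkpoint_path):
    """Run steps in order, skipping any that a previous run already completed."""
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["completed_steps"]
    for index in range(done, len(steps)):
        steps[index]()  # may raise; the checkpoint then still points at this step
        with open(checkpoint_path, "w") as f:
            json.dump({"completed_steps": index + 1}, f)
    os.remove(checkpoint_path)  # transaction finished; clear the checkpoint


# Demo: the second step fails on the first run and succeeds on the second.
executed = []
state = {"fail_once": True}


def extract():
    executed.append("extract")


def transform():
    if state["fail_once"]:
        state["fail_once"] = False
        raise RuntimeError("simulated failure")
    executed.append("transform")


def load():
    executed.append("load")


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
steps = [extract, transform, load]
try:
    run_with_checkpoints(steps, path)
except RuntimeError:
    pass  # in practice a monitor or scheduler would trigger the restart
run_with_checkpoints(steps, path)  # resumes at "transform", not at "extract"
```

On the second run the first step is not repeated, which is the point of the checkpoint.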

Instance Failure

If an instance of an application cannot be reached, the only solution is to fail over to another instance. This should be considered at the design stage, and instances should be added or removed based on need. If the instance is a database, it can be replicated to other instances that are ready to take over. If the instance is an application, you can use a load balancer or another traffic distribution service and add instances behind it. Major cloud providers support this as a high availability feature, so it should be configured at the same time the infrastructure is created.
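
The core of what a load balancer does during failover can be sketched in Python as "send traffic to the first instance that passes its health check." The instance names and the health table below are hypothetical, and a real load balancer would probe instances over the network and spread traffic across all healthy ones.

```python
from typing import Callable, Sequence


def route_request(instances: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first instance that passes its health check."""
    for instance in instances:
        if is_healthy(instance):
            return instance
    raise RuntimeError("no healthy instance available")


# Hypothetical instance names and health states for illustration.
health = {"app-1": False, "app-2": True, "app-3": True}
chosen = route_request(["app-1", "app-2", "app-3"], lambda name: health[name])
```

Here "app-1" is unreachable, so the request goes to "app-2" without the caller noticing the failure.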

Overwhelmed APIs

Sometimes, sudden spikes in traffic put high pressure on APIs and prevent applications from processing requests properly. This can be mitigated by using a queue so that jobs are accepted immediately and processed asynchronously.
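
A minimal in-process sketch of this idea in Python uses a bounded queue and a background worker: the request handler only enqueues the job and returns, while the worker processes jobs at its own pace. The handler name, the queue size, and the placeholder processing step are assumptions, and a production system would more likely use an external message queue.

```python
import queue
import threading

jobs = queue.Queue(maxsize=100)  # bounded, so a spike cannot exhaust memory
results = []


def worker():
    """Process queued jobs one at a time until a stop signal arrives."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: stop the worker
            jobs.task_done()
            break
        results.append(job * 2)  # placeholder for the real processing
        jobs.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()


def handle_request(payload):
    """Accept the job immediately; reject it only if the queue is full."""
    try:
        jobs.put_nowait(payload)
        return "accepted"
    except queue.Full:
        return "rejected"


responses = [handle_request(n) for n in range(5)]
jobs.join()      # wait until every accepted job has been processed
jobs.put(None)   # ask the worker to stop
thread.join()
```

Rejecting requests when the queue is full is a deliberate choice in this sketch: it gives callers a clear signal to retry later instead of letting the backlog grow without limit.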

Bottom Line

First, have a good architecture to ensure that you cover all possible scenarios related to workload, scalability, resiliency, and high availability. This includes the application, infrastructure, and third parties. Next, establish a good monitoring system that identifies anomalies. AIOps solutions are one option for covering these monitoring requirements. They can process application behavior and find anomalies that fall outside the basic monitoring radar. Do not forget to add logging to your monitoring so that all events are recorded.

Last but not least, test your application and infrastructure to make sure they remain well configured for the current workload. Consider running load tests from time to time to simulate a higher workload and measure how the system responds. This is a continuous activity that helps keep your application running and reliable.