Fighting Java Memory Leak in Production Systems
Want to learn how to fight Java memory leaks in your production system? Here is how to hunt them down without any of the fancy tools.
Memory leaks in Java are common, and we have some fancy tools to fight them: JProfiler, VisualVM, and JMC are some of the popular ones. These tools give us sophisticated ways to solve the mystery, complete with nice-looking UIs. But as Java developers, if we are tasked with solving a memory leak without any of those fancy tools, we are in for some trouble. There are real situations in which we are not allowed to use our typical weapons. In production systems especially, we might not get permission to connect one of these tools, mostly owing to security concerns. Moreover, our clients will ask why the same scenario cannot be reproduced in the local testbed environment, but some memory leaks are simply hard to reproduce locally.
Production-only memory leaks are special in that they may not show up with our usual ways of testing. A well-planned, thorough test pass certainly reduces the chances of a memory leak, but complete coverage is difficult to achieve. Let's take a closer look.
Reasons for Memory Leaks in Production Systems
1. Unexpected External Party Behaviors
In production, a software system may be integrated with several external third-party systems, and those third parties can behave in unexpected ways. Their behavior might cause our side to leak memory. For example, if a third-party application is expected to reuse a persistent connection to fire its requests, it might instead open a new connection for every single request. If we cache that per-connection information in memory, we end up accumulating data that fills the heap for no good reason.
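As a rough sketch of this pattern (the class and method names below are hypothetical, not taken from any real system), consider a registry that stores per-connection details and cleans them up only on a graceful disconnect, which a misbehaving client never sends:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: per-connection state is cached on connect and removed
// only on a clean disconnect.
public class ConnectionRegistry {

    // Placeholder for whatever per-connection state the system keeps.
    static class ConnectionInfo { }

    private final Map<String, ConnectionInfo> sessions = new ConcurrentHashMap<>();

    public void onConnect(String connectionId) {
        // If the third party opens a brand-new connection for every request and
        // never disconnects cleanly, this map grows without bound.
        sessions.put(connectionId, new ConnectionInfo());
    }

    public void onDisconnect(String connectionId) {
        sessions.remove(connectionId);
    }
}

One common defense is to bound such caches with a size limit or an expiry time instead of trusting the remote side to behave as expected.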
2. Longer Uptime
Production systems are usually expected to run for long periods without a restart. During these long runs, the system can accumulate tiny bits of leaked memory that never look significant in local runs. These kinds of memory leaks are very hard to reproduce in the testbed because the accumulation rate is so small: we would need to keep the system running under a higher load for two to three days continuously to reproduce a leak that might take two to three months to show strange signs in production. While trying to reproduce it, we have to take jmap histograms and compare them from time to time. Slowly accumulating leak objects climb the histogram little by little until they reach the top, so we need continuous monitoring throughout the reproduction attempt.
3. Unexpected Outages in Production
In a big, complicated system, we assume most of the external services are up and running 24/7. But databases can have outages, and third-party servers can go down for periods of time. When testing in the local environment, we rarely consider these failures; moreover, with many third-party servers there are simply too many failure combinations to cover. Some memory leaks are triggered only while the system is trying to reconnect, so third-party outages can contribute to a leak that grows over time. To solve this, we need to inspect the production system's error logs and build a list of the errors and exceptions that actually occur. Then we can configure the local setup to reproduce the same kinds of errors one by one, take jmap histograms with some time gap, and compare them with each other to see which of those errors is actually causing the memory leak.
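As one hedged illustration of how an outage-only leak can look in code (the names here are made up for the example), imagine in-flight requests tracked in a map whose entries are removed only on the happy path:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: pending requests are tracked while waiting for a
// third-party response, but the error path forgets the cleanup.
public class PendingRequestTracker {

    static class PendingRequest { }

    private final Map<Long, PendingRequest> pending = new ConcurrentHashMap<>();

    public void onRequestSent(long correlationId) {
        pending.put(correlationId, new PendingRequest());
    }

    public void onResponse(long correlationId) {
        pending.remove(correlationId);  // happy path cleans up
    }

    public void onConnectionError(long correlationId, Exception cause) {
        // BUG: the entry is never removed here, so every failure during an
        // outage leaves one PendingRequest behind on the heap.
    }
}

Entries like these accumulate only while the third party is down, which is exactly why the leak never shows up in a local testbed where everything is healthy.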
How Can We Hunt for the Memory Leak in Production?
As we discussed earlier, production systems usually cannot be connected to any of the typical remote profiling tools, so we are left with the traditional JDK-based tools. In this article, we will primarily use the jmap histogram command. Let's see how it can be used.
1) Take a class histogram of the heap using the following command:
jmap -histo:live <process-id>
Here, it is important to go with ":live," as it counts only the objects that are still referenced by someone; the unreferenced objects would be eliminated in the next GC run anyway. Note that the ":live" option triggers a GC first, so if much of the heap is filled up it might pause the application threads for a considerable amount of time. It is always better to execute the histogram command during off-peak hours, as it can slow the system down and your customers may experience longer delays.
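For reference, the output looks roughly like the following; the exact header varies slightly by JDK version, and the class names and numbers below are made up purely for illustration:

 num     #instances         #bytes  class name
----------------------------------------------
   1:        500000       12000000  java.lang.String
   2:        300000        9600000  [B
   3:        150000        7200000  com.example.ConnectionInfo
   ...
Total      2500000      120000000

The per-class instance and byte counts are what we compare between snapshots, and the Total line at the bottom gives the overall figure for live objects.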
2) Time gap between the jmap-histos:
Take these recordings with an appropriate time gap so you can learn the memory-accumulation pattern. The histograms can then be compared to see whether certain objects are increasing, or the total used heap memory at the bottom of the histogram can be checked. If the system receives requests at a high rate per second, the histograms can be taken a few seconds apart; if the accumulation rate is slower, a gap of a few days may be needed. The right time gap depends on the situation.
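One simple way to keep comparable snapshots (assuming a Linux shell on the production host; the file-name pattern is just a suggestion) is to redirect each histogram into a timestamped file:

jmap -histo:live <process-id> > histo-$(date +%Y%m%d-%H%M%S).txt

The resulting files can later be diffed, or their top entries compared side by side.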
3) Compare the memory usage of certain local objects
By comparing the histograms, we can identify the objects responsible for the memory leak. It is better to start by suspecting our own objects first, so begin by comparing the counts of those objects across the multiple histos. If any suspicious objects are found, we can check the code to see whether they are actually causing the issue. A careful comparison of each object will help identify the real culprits.
Improvisations to Make the System More Traceable
As the points above suggest, memory leaks in production are hard to avoid completely in large systems. So, how can we make our systems more traceable so that future memory leaks can be identified faster?
1) Monitor the Production Servers From Time to Time During the Early Stages
It is always good to be cautious with your system during the early stages of its deployment. You can ask your system support engineers to capture a memory footprint of the system with "jmap -histo" once a week during off-peak hours and compare the memory growth. If there is growth, the total used memory gives an early indication of a possible memory leak. Some memory leaks take more than a few months to hit an OutOfMemoryError, and once the system hits that error, we are left with no option other than a restart. If we have taken memory histos along the way, we are in a much better position to predict the failure, or at least we will have enough facts to solve it. If we fail to take them, we may have to monitor for weeks or months more before we understand the reason for the memory leak.
2) Introduce a Seconds-Based System Progress Summary Log
It is good to track the system's progress on a per-second basis to understand its health. The following fields can be covered in that log, which can be printed once per second to a separate file in a shortened format:
i) Number of requests that come into the system
ii) How many requests are fired to the other end, and how many of them are successful
iii) How many requests fail with a timeout while waiting for the response from the other end
For example:
Rq-I[300] Rq-A[240] Rq-Rj[60] Rq-Fw[235] Rs-Rc[230] Rs-To[5]
Here are the explanations for the shorthand used in the example above:
Rq-I — Number of requests received into the system
Rq-A — Number of requests accepted into the system
Rq-Rj — Number of requests rejected during initial validation
Rq-Fw — Number of requests forwarded to the other end
Rs-Rc — Number of responses received from the other end
Rs-To — Number of requests that timed out waiting for a response
This progress log always gives you an indication of the current status of your system. Especially when a weird situation happens because of an external entity, it records the number of such failures per second. If there is a connection loss, or a third-party system is down, incoming requests will fail because of the dependency on those third-party systems, and there is a chance that these failures lead to a memory leak. Even though this log will not fully explain the situation every time, it will support the investigation in coming up with a possible reason for the memory leak.
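A minimal sketch of such a per-second summary logger could look like the following; the counter names mirror the shorthand above, while the class name and the use of standard output are assumptions for the example rather than a prescribed implementation:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Illustrative per-second progress summary in the shortened format shown above.
public class ProgressSummaryLogger {

    final LongAdder requestsIn = new LongAdder();        // Rq-I
    final LongAdder requestsAccepted = new LongAdder();  // Rq-A
    final LongAdder requestsRejected = new LongAdder();  // Rq-Rj
    final LongAdder requestsForwarded = new LongAdder(); // Rq-Fw
    final LongAdder responsesReceived = new LongAdder(); // Rs-Rc
    final LongAdder responseTimeouts = new LongAdder();  // Rs-To

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(this::printAndReset, 1, 1, TimeUnit.SECONDS);
    }

    private void printAndReset() {
        // sumThenReset() returns the count for the last second only.
        System.out.printf("Rq-I[%d] Rq-A[%d] Rq-Rj[%d] Rq-Fw[%d] Rs-Rc[%d] Rs-To[%d]%n",
                requestsIn.sumThenReset(), requestsAccepted.sumThenReset(),
                requestsRejected.sumThenReset(), requestsForwarded.sumThenReset(),
                responsesReceived.sumThenReset(), responseTimeouts.sumThenReset());
    }

    public void stop() {
        scheduler.shutdown();
    }
}

In a real system, the line would go to a dedicated, rotated log file rather than standard output, and the request-handling code would call increment() on the matching counter at each stage.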
3) Enable GC Logs
Make sure the GC log is enabled and that each entry is printed with a date/time stamp. These GC logs can tell us whether the leaked memory grows in a smooth increment or starts suddenly due to some external reason.
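For example, with the unified logging available since JDK 9, flags like the following print a time stamp on every GC entry (the jar name and log path are placeholders); for JDK 8 and earlier, the older PrintGCDetails/PrintGCDateStamps/Xloggc options serve the same purpose.

JDK 9 and later (unified logging):
java -Xlog:gc*:file=/var/log/myapp/gc.log:time,uptime -jar myapp.jar

JDK 8 and earlier:
java -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/myapp/gc.log -jar myapp.jar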
4) Print Proper Error Logs for Exceptional Cases
These error logs give us an idea of the failures and error situations the production system is experiencing. To see whether the errors are causing the memory leak, we can compare the number of suspect objects in memory with the number of such error messages.
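As a rough way to make that comparison (the log path and exception class below are placeholders), the number of error occurrences over a period can be counted from the log and set against the instance counts seen in the jmap histograms taken over the same period:

grep -c "java.net.ConnectException" /var/log/myapp/error.log

If the error count and the count of a suspect object grow at roughly the same rate, that error path is a strong candidate for the leak.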
I have just shared the findings from my own experience. I would love to hear more thoughts on this subject in the comments below!