A complete troubleshooting record of a JVM off-heap memory leak

Preface

This article records the troubleshooting process and the reasoning behind an online JVM off-heap memory leak, covering some JVM memory allocation mechanics and the commands and tools commonly used for JVM troubleshooting. I hope it is helpful to you.

I took quite a few detours during the investigation, but I will still write out the complete train of thought here as a lesson for others. At the end, the article also summarizes several principles for quickly troubleshooting memory leaks.

The main contents of this article are as follows:

  • Fault description and troubleshooting process
  • Root cause and solution analysis
  • How the JVM allocates heap and off-heap memory
  • Common commands and tools for troubleshooting process memory leaks

Writing this article was not easy. Please support my original work. Official account: Back-end Technology Gossip

Fault description

During the noon break on August 12, our commercial service raised an alarm: the physical memory used by the service process in its container (16G total) had exceeded the 80% threshold and was still rising.

The chart pulled up from the monitoring system:

It looked like the Java process had a memory leak, yet our heap is capped at 4G. Memory usage well beyond 4G that keeps climbing toward the machine limit suggested a leak outside the JVM heap.

We confirmed the startup configuration of the service process at the time:

-Xms4g -Xmx4g -Xmn2g -Xss1024K -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80

Although no new code had gone live that day, that morning we had been using the message queue to push a historical-data repair script. This task makes a large number of calls to one of our service interfaces, so we initially suspected the leak was related to that interface.

The following figure shows how the call volume of that interface changed during the day:

You can see that the call volume at the time of the incident (5000+ calls per minute) was far higher than normal (200+ calls per minute).

We temporarily stopped the script from sending messages. The interface's call volume dropped back to 200+ per minute, the container memory stopped climbing at a steep slope, and everything seemed to return to normal.

Next, we checked whether this interface had a memory leak.

Troubleshooting process

First, let's review how a Java process allocates memory, which will make the troubleshooting steps below easier to follow.

Take JDK 1.8, the version we run online, as an example. There are already plenty of summaries of JVM memory allocation on the web, so I won't reinvent the wheel.

JVM memory is divided into two parts: the heap area and the non-heap area.

  • Heap area: the young generation and old generation we are all familiar with.
  • Non-heap area: as shown in the figure, the non-heap area contains the metaspace and direct memory.

One extra point to note here: the permanent generation (removed natively in JDK 8, which added metaspace instead) stores the classes used by the JVM at runtime, and permanent-generation objects are garbage collected during a full GC.

With the JVM memory layout refreshed, let's get back to the fault.

Heap memory analysis

Although we were fairly sure from the start that the heap was not involved, because the leaked memory exceeded the 4G heap limit, we still looked at the heap for clues, just to be safe.

We checked the heap usage curves and the collection counts of the young and old generations; as usual, nothing looked wrong. We then dumped a JVM heap snapshot on the container at the scene of the incident.

Heap memory Dump

Heap memory snapshot dump command:

jmap -dump:live,format=b,file=xxxx.hprof pid

Aside: you can also use jmap -histo:live pid to view the live objects in the heap directly.
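
If you only need a quick look at the largest live objects rather than a full dump, a minimal example is (pid is a placeholder; note that -histo:live forces a full GC before counting):

jmap -histo:live <pid> | head -n 20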

After downloading the dump file to a local machine, you can analyze it with Eclipse's Memory Analyzer Tool (MAT).

Open the file with MAT, as shown in the figure:

You can see some large nio-related objects in the heap, such as the nioChannel that receives message queue messages and nio.HeapByteBuffer, but their numbers are not large, so they cannot serve as a basis for judgment. We set this aside and kept observing.

Next, I started reading through the interface code. The main logic of the interface is to call the group's WCS client and, after looking up the table, write the data from the database table into WCS. There is no other additional logic.

Having found no special logic, I began to suspect a memory leak inside the WCS client wrapper. The reason for this suspicion is that the WCS client is built on top of the SCF client, and as an RPC framework its underlying transport layer may allocate direct memory.

Could my code have triggered a bug in the WCS client that kept allocating direct memory and finally ate up the machine's memory?

I contacted the WCS on-call engineer and described the problem we had hit. They replied that they would run a local stress test of write operations to see whether they could reproduce our issue.

Since waiting for their feedback would take time, we set out to find the cause ourselves.

I focused my suspicion on direct memory, guessing that the surge in interface calls, combined with improper use of nio on the client side, had caused too much direct memory to be allocated through ByteBuffer.

Aside: the final result proved that this preconceived idea sent the investigation down a detour. Reasonable guesses can narrow the scope of an investigation, but it is best to list every possibility clearly first; when you find that digging into one possibility leads nowhere, go back and examine the others carefully.

Sandbox environment reproduction

To reproduce the fault as it happened, I requested a stress-test machine in the sandbox environment to ensure consistency with the online environment.

First, we simulated the memory blow-up (calling the interface at high volume):

We let the script keep pushing data and calling our interface, and kept watching the memory usage.

Once the calls started, memory grew continuously and appeared to have no bound (no full GC was triggered, because no limit was being hit).

Next, we simulated the normal call volume (calling the interface at the usual rate):

We routed the interface's normal call volume (relatively small, with batch calls every 10 minutes) to the stress-test machine and got the old-generation and physical-memory trends shown in the figure below:

The question was: why did memory keep rising until it ate up the machine?

At the time, we speculated that because the JVM process had no limit on direct memory (-XX:MaxDirectMemorySize), off-heap memory kept rising and never triggered a full GC.

Two conclusions can be drawn from the above figure:

  • When the leaking interface is called at high volume, if the old generation and the other conditions in the heap never meet the full GC criteria, no full GC occurs and direct memory just keeps rising.
  • At the usual low call volume, the leak grows slowly, and a full GC always arrives in time to reclaim the leaked portion. That is why the service had run normally for a long time without any problem.

As mentioned above, our process startup parameters set no limit on direct memory, so we added the -XX:MaxDirectMemorySize option and tested again in the sandbox environment.
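
For reference, the option takes a size value and is simply appended to the existing startup flags; for example (2g matches the value that appears in the parameters shown later in this article):

-XX:MaxDirectMemorySize=2g    # caps memory allocated via ByteBuffer.allocateDirect and other direct allocations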

We found that the physical memory used by the process kept rising past the limit we had set; the configuration "seemed" not to work.

This surprised me. Was there a problem with the JVM's memory limiting?

By this point you can see that my thinking was stubbornly fixated on a direct memory leak, and that fixation was leading nowhere.

Aside: trust the JVM's control over its memory. If a parameter seems to have no effect, look for the cause on your own side first and check whether you are using the parameter incorrectly.

Direct memory analysis

To dig further into what was sitting in direct memory, I started working on it. Unlike heap memory, direct memory does not let you easily see all the resident objects, so we need some commands to inspect it. I used several approaches.

View process memory information pmap

PMAP - report memory map of a process

The pmap command reports the memory mappings of a process and is a handy tool for Linux debugging and operations.

pmap -x pid | sort -n -k3    # pipe through sort if you want the mappings ordered by RSS (the third column)

After running it, I got the following (abridged) output:

..
00007fa2d4000000    8660    8660    8660 rw---   [ anon ]
00007fa65f12a000    8664    8664    8664 rw---   [ anon ]
00007fa610000000    9840    9832    9832 rw---   [ anon ]
00007fa5f75ff000   10244   10244   10244 rw---   [ anon ]
00007fa6005fe000   59400   10276   10276 rw---   [ anon ]
00007fa3f8000000   10468   10468   10468 rw---   [ anon ]
00007fa60c000000   10480   10480   10480 rw---   [ anon ]
00007fa614000000   10724   10696   10696 rw---   [ anon ]
00007fa6e1c59000   13048   11228       0 r-x-- libjvm.so
00007fa604000000   12140   12016   12016 rw---   [ anon ]
00007fa654000000   13316   13096   13096 rw---   [ anon ]
00007fa618000000   16888   16748   16748 rw---   [ anon ]
00007fa624000000   37504   18756   18756 rw---   [ anon ]
00007fa62c000000   53220   22368   22368 rw---   [ anon ]
00007fa630000000   25128   23648   23648 rw---   [ anon ]
00007fa63c000000   28044   24300   24300 rw---   [ anon ]
00007fa61c000000   42376   27348   27348 rw---   [ anon ]
00007fa628000000   29692   27388   27388 rw---   [ anon ]
00007fa640000000   28016   28016   28016 rw---   [ anon ]
00007fa620000000   28228   28216   28216 rw---   [ anon ]
00007fa634000000   36096   30024   30024 rw---   [ anon ]
00007fa638000000   65516   40128   40128 rw---   [ anon ]
00007fa478000000   46280   46240   46240 rw---   [ anon ]
0000000000f7e000   47980   47856   47856 rw---   [ anon ]
00007fa67ccf0000   52288   51264   51264 rw---   [ anon ]
00007fa6dc000000   65512   63264   63264 rw---   [ anon ]
00007fa6cd000000   71296   68916   68916 rwx--   [ anon ]
00000006c0000000 4359360 2735484 2735484 rw---   [ anon ]

You can see that the bottom line is the heap memory mapping, occupying 4G. The other lines show many small memory regions, but this information alone still did not reveal the problem.

Off-heap memory tracking NativeMemoryTracking

Native Memory Tracking (NMT) is a HotSpot VM feature for analyzing the VM's internal memory usage. We can use jcmd (a tool bundled with the JDK) to access NMT data.

NMT must first be enabled via a VM startup parameter; note that enabling it brings a 5%-10% performance overhead.

-XX:NativeMemoryTracking=[off | summary | detail]
# off: disabled (the default)
# summary: only collect memory usage aggregated by category
# detail: collect memory usage by individual call sites
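
For example, to enable summary-level tracking, the flag is added to the java command that starts the service (the jar name here is just a placeholder):

java -XX:NativeMemoryTracking=summary -Xms4g -Xmx4g -jar your-service.jar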

Then start the process; while it is running, you can use the following command to inspect its native memory:

jcmd <pid> VM.native_memory [summary | detail | baseline | summary.diff | detail.diff | shutdown] [scale= KB | MB | GB]

# summary: memory usage aggregated by category
# detail: detailed memory usage, including virtual memory mappings in addition to the summary information
# baseline: create a snapshot of the current memory usage for later comparison
# summary.diff: compare with the previous baseline at the summary level
# detail.diff: compare with the previous baseline at the detail level
# shutdown: turn NMT off
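
A common workflow, sketched below with the subcommands listed above, is to record a baseline and then diff against it after the suspect traffic has run for a while:

jcmd <pid> VM.native_memory baseline
# ... let the suspicious traffic run for a while ...
jcmd <pid> VM.native_memory summary.diff scale=MB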

We used:

jcmd pid VM.native_memory detail scale=MB > temp.txt

The results are shown in the figure below:

The information in the figure above did not clearly show the problem, or at least I could not see it from this output at the time.

The investigation seemed to have reached an impasse.

Seemingly no way forward

While the investigation was stalled, we received replies from both WCS and SCF. Both confirmed that there was no memory leak in their wrappers. WCS does not use direct memory at all, and although SCF sits underneath as the RPC protocol, it would not contain such an obvious internal bug; otherwise there would already be plenty of reports from other users online.

View JVM memory information jmap

At this point, still unable to find the problem, I started another fresh sandbox container, ran the service process in it, and then ran the jmap command to look at the JVM memory configuration actually in effect:

jmap -heap pid

Results obtained:

Attaching to process ID 1474, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.66-b17

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 4294967296 (4096.0MB)
   NewSize                  = 2147483648 (2048.0MB)
   MaxNewSize               = 2147483648 (2048.0MB)
   OldSize                  = 2147483648 (2048.0MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 1932787712 (1843.25MB)
   used     = 1698208480 (1619.5378112792969MB)
   free     = 234579232 (223.71218872070312MB)
   87.86316621615607% used
Eden Space:
   capacity = 1718091776 (1638.5MB)
   used     = 1690833680 (1612.504653930664MB)
   free     = 27258096 (25.995346069335938MB)
   98.41346682518548% used
From Space:
   capacity = 214695936 (204.75MB)
   used     = 7374800 (7.0331573486328125MB)
   free     = 207321136 (197.7168426513672MB)
   3.4349974840697497% used
To Space:
   capacity = 214695936 (204.75MB)
   used     = 0 (0.0MB)
   free     = 214695936 (204.75MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 2147483648 (2048.0MB)
   used     = 322602776 (307.6579818725586MB)
   free     = 1824880872 (1740.3420181274414MB)
   15.022362396121025% used

29425 interned Strings occupying 3202824 bytes

From this output, the young and old generations looked quite normal, metaspace was only using about 20M, and direct memory also seemed to be around 2G.

Huh? Why is MaxMetaspaceSize = 17592186044415 MB? That looks like no limit at all.

Take a closer look at our startup parameters:

-Xms4g -Xmx4g -Xmn2g -Xss1024K -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80

The configuration contains -XX:PermSize=256m -XX:MaxPermSize=512m, which sets the size of the permanent generation. Since 1.8, however, the HotSpot VM has removed the permanent generation and replaced it with metaspace, and we run JDK 1.8 online. So we actually had no limit at all on the maximum size of metaspace: -XX:PermSize=256m -XX:MaxPermSize=512m are obsolete parameters on 1.8.

The following figure shows how the permanent generation changed from 1.7 to 1.8:

Could it be that metaspace was leaking?

I chose to test locally, where it is easy to change parameters and to use JVisualVM to visualize the memory changes.

Use JVisualVM to observe the process running

First, limit metaspace with the parameters -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=128m, then call the problematic interface in a local loop.
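
If you prefer the command line to JVisualVM, a rough way to watch metaspace while the loop runs is jstat; in its -gc output the MC and MU columns are the metaspace capacity and usage in KB:

jstat -gc <pid> 1000    # print GC and metaspace statistics every 1000 ms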

As shown in the figure:

You can see that when metaspace is exhausted, the system triggers a full GC, metaspace memory is reclaimed, and many classes are unloaded.

Then we removed the metaspace limit, i.e. we used the parameters that had caused the trouble before:

-Xms4g -Xmx4g -Xmn2g -Xss1024K -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80 -XX:MaxDirectMemorySize=2g -XX:+UnlockDiagnosticVMOptions

As shown in the figure:

You can see that metaspace keeps rising, and the number of loaded classes rises with the call volume, showing a clear positive correlation.

Light at the end of the tunnel

The problem suddenly became clear: very likely, every interface call was creating a new class, filling up metaspace.

Observe JVM class loading with -verbose

When debugging a program, you sometimes need to see which classes it loads, how memory is reclaimed, which native interfaces are called, and so on. This is what the -verbose options are for. In MyEclipse you can set them by right-clicking the run configuration (as below), or you can pass them to the java command on the command line.

-verbose:class  print class loading
-verbose:gc     print garbage collection activity in the VM
-verbose:jni    print information about native (JNI) method usage
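
For example, to log class loading when starting the service from the command line (the jar name here is just a placeholder):

java -verbose:class -jar your-service.jar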

In the local environment, we added the startup parameter -verbose:class and called the interface in a loop.

You can see countless loads of com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto:

[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]

After many calls had accumulated a certain number of classes, we manually triggered a full GC, and a large number of fastjson-related classes were reclaimed.

Using jmap to inspect class-loader statistics before the collection also shows a large number of fastjson-related classes:

jmap -clstats 7984

Now we had a direction. This time we went through the code carefully, checked where fastjson is used in the logic, and found the following code:

/**
 * Returns the JSON string, converting camel-case field names to snake_case
 * @param bean entity class
 */
public static String buildData(Object bean) {
    try {
        SerializeConfig CONFIG = new SerializeConfig();   // a new SerializeConfig on every call
        CONFIG.propertyNamingStrategy = PropertyNamingStrategy.SnakeCase;
        return JSON.toJSONString(bean, CONFIG);
    } catch (Exception e) {
        return null;
    }
}

Root cause of the problem

Before calling WCS, we serialize entity classes with camel-case fields into underscored (snake_case) fields. This requires fastjson's SerializeConfig, which we instantiated inside the static method. When a SerializeConfig is created, it will by default create an ASM proxy class to serialize the target object; that is the frequently created class com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto seen above. If we reuse the SerializeConfig, fastjson looks up the proxy class it already generated and reuses it. But because we call new SerializeConfig() every time, the previously generated proxy class is never found, and a new WlkCustomerDto proxy class is generated on every single call.

The following two figures show the source code where the problem was located:

We made SerializeConfig a static field of the class, and the problem was solved.

private static final SerializeConfig CONFIG = new SerializeConfig();

static {
    CONFIG.propertyNamingStrategy = PropertyNamingStrategy.SnakeCase;
}
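
Putting the two pieces together, the corrected helper looks roughly like this (the surrounding class name JsonUtil is illustrative, not from the original code; the fastjson calls are the same ones used in the snippet above):

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.PropertyNamingStrategy;
import com.alibaba.fastjson.serializer.SerializeConfig;

public class JsonUtil {

    // One shared SerializeConfig: fastjson caches the ASM-generated serializer
    // inside the config, so the proxy class is created once and then reused.
    private static final SerializeConfig CONFIG = new SerializeConfig();

    static {
        CONFIG.propertyNamingStrategy = PropertyNamingStrategy.SnakeCase;
    }

    /**
     * Returns the JSON string with camel-case field names converted to snake_case.
     * @param bean entity class
     */
    public static String buildData(Object bean) {
        try {
            return JSON.toJSONString(bean, CONFIG);
        } catch (Exception e) {
            return null;
        }
    }
}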

What does fastjson SerializeConfig do

SerializeConfig introduction:

The main job of SerializeConfig is to configure and record the serializer (an implementation of the ObjectSerializer interface) for each Java type. For example, Boolean.class uses BooleanCodec (as the name suggests, this class implements both serialization and deserialization) and float[].class uses FloatArraySerializer. Some of these serializers are built into fastjson (for example, for the basic Java types), some are generated through the ASM framework (for example, for user-defined classes), and some are even user-defined serializers (for example, the Date type is converted to milliseconds by default, while an application may need seconds). Whether ASM is used to generate the serializer or a plain JavaBean serializer is used depends, among other things, on whether the code is running on Android (the environment variable "java.vm.name" being "dalvik" or "lemur" indicates Android), though that is not the only place where this decision is made.

In theory, as long as every serialization goes through the same SerializeConfig instance, it will find the proxy class generated earlier for that class and reuse it. But our service instantiated a new ParseConfig object on every interface call to configure fastjson deserialization. With the ASM proxy not disabled, each new ParseConfig never looks up the proxy classes that were already created, so fastjson keeps creating new proxy classes and loading them into metaspace, which finally makes metaspace expand without bound and exhausts the machine's memory.
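
The same reuse rule applies on the deserialization side. Below is a hedged sketch contrasting the safe and unsafe patterns, assuming fastjson's ParserConfig class (written as ParseConfig in the paragraph above) and an arbitrary target type:

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.parser.ParserConfig;

public class ParseConfigUsage {

    // Reused config: generated deserializers are cached inside it and found again on later calls.
    private static final ParserConfig SHARED_CONFIG = new ParserConfig();

    public static <T> T parseShared(String json, Class<T> clazz) {
        return JSON.parseObject(json, clazz, SHARED_CONFIG);
    }

    // Anti-pattern: a fresh config per call never finds the previously generated
    // deserializer classes, so new ones keep being loaded into metaspace.
    public static <T> T parsePerCall(String json, Class<T> clazz) {
        return JSON.parseObject(json, clazz, new ParserConfig());
    }
}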

Why upgrading to JDK 1.8 exposed the problem

The cause is still worth dwelling on: why did this problem not appear before the upgrade? To answer that, we need to look at how the HotSpot VM shipped with JDK 1.8 differs from the one in 1.7.

Starting with JDK 1.8, the bundled HotSpot VM removed the old permanent generation and added the metaspace area. Functionally, metaspace can be regarded as similar to the permanent generation, its main role being to store class metadata, but the actual mechanism is quite different.

First, the default maximum size of metaspace is the physical memory of the whole machine, so unbounded metaspace growth lets the Java process eat up all the available system memory, whereas the permanent generation has a fixed default size and never grows beyond it. Both trigger a full GC when their allocated space is exhausted, but the reclamation rules differ. During a full GC, class metadata (Class objects) in the permanent generation can be reclaimed under roughly the same rules as heap objects: anything unreachable from the GC roots can be collected. Metaspace, by contrast, decides whether class metadata can be reclaimed based on whether the classloader that loaded it can be reclaimed; as long as the classloader is alive, none of the class metadata it loaded will be released. This explains why our services had no problem until they were upgraded to 1.8: on the earlier JDK, although each fastjson call created a proxy class and loaded its Class instance into the permanent generation, those Class instances were created inside a method call and became unreachable once the call returned, so when the permanent generation filled up and triggered a full GC they were collected.

On 1.8, these proxy classes are loaded through the main thread's classloader, and that classloader is never reclaimed for the lifetime of the program. Consequently the proxy classes loaded through it can never be reclaimed either, which is what causes metaspace to keep expanding until the machine's memory is exhausted.

This problem is not limited to fastjson; it applies to anything that needs to generate and load classes at runtime. Frameworks in particular often make heavy use of bytecode-enhancement tools such as ASM and javassist. Following the analysis above, before JDK 1.8 dynamically loaded classes could in most cases be reclaimed during a full GC, so problems rarely surfaced, and many frameworks and libraries never handled this case. Once they are upgraded to 1.8, these latent problems may be exposed.

Summary

The problem was solved. Reviewing the whole troubleshooting process afterwards exposed quite a few issues on my side. The most important one is that I was not familiar with how memory allocation differs between JVM versions, which led me to misjudge the old generation and metaspace; I took many detours and spent a long time digging into direct memory, wasting a lot of time.

Second, an investigation needs to be careful and comprehensive. It is best to lay out all the possibilities first; otherwise it is easy to box yourself into a narrow scope and get stuck in a dead end.

Finally, the takeaways from this incident:

  • Since JDK 1.8, the bundled HotSpot VM has removed the old permanent generation and added the metaspace area. Functionally, metaspace can be regarded as similar to the permanent generation, its main role being to store class metadata, but the actual mechanism is quite different.
  • Memory used by the JVM should be capped at startup, and that includes not only the familiar heap memory but also direct memory and the metaspace area; this is the last line of defense for keeping an online service running normally.
  • When using third-party class libraries, pay attention to how the calling code is written and try not to introduce obvious memory leaks.
  • Be extra careful with class libraries that use bytecode-enhancement tools such as ASM, especially on JDK 1.8 and later.

Writing this article was not easy. Please support my original work. Official account: Back-end Technology Gossip

References

Observing class loading while a program is running

blog.csdn.net/tenderhearted/article/details/39642275

An overall introduction to Metaspace (why the permanent generation was replaced, the characteristics of metaspace, and how to analyze metaspace memory)

https://www.cnblogs.com/duanxz/p/3520829.html

A common troubleshooting process for abnormal Java memory usage (including off-heap memory)

https://my.oschina.net/haitaohu/blog/3024843

JVM source code analysis: a full walkthrough of off-heap memory

http://lovestblog.cn/blog/2015/05/12/direct-buffer/

Unloading of JVM classes

https://www.cnblogs.com/caoxb/p/12735525.html

fastjson: enabling ASM on JDK 1.8 and above

https://github.com/alibaba/fastjson/issues/385

fastjson: PropertyNamingStrategy_cn

https://github.com/alibaba/fastjson/wiki/PropertyNamingStrategy_cn

Beware of Metaspace memory leaks caused by dynamic proxies

https://blog.csdn.net/xyghehehehe/article/details/78820135

Follow me

I am a back-end development engineer, focusing mainly on back-end development, data security, crawlers, the Internet of Things, edge computing and related areas. Feel free to get in touch.

You can find me on all the major platforms.

Main topics of my original articles

  • Back end development practice
  • Java interview knowledge
  • Design patterns / data structures / algorithm solutions
  • Reading notes / anecdotes / programming life

Personal official account: talk about back-end technology

If this article helped you, please give it a like and bookmark it~
