JVM Performance Tuning and Off-Heap Memory Leak Troubleshooting in Practice

January 16, 2018 | Java | JVM, GC, Performance, Debugging

Introduction

JVM performance tuning and memory issue troubleshooting are common challenges in Java development. Starting from the three core goals of GC tuning — latency, throughput, and capacity — this article combines them with an off-heap memory leak investigation case to outline a systematic tuning approach and troubleshooting methods.

Off-heap memory leaks occur outside the JVM heap and are hard to detect with conventional GC monitoring tools, often causing systems to crash suddenly in production. This article documents a complete off-heap memory leak investigation process.

Three Goals of GC Tuning: Latency, Throughput, Capacity

Core Concepts

GC tuning, like other performance optimization, requires following a scientific methodology. Before starting tuning, we need to clarify three core performance goals:

1. Latency

Latency refers to the upper bound requirement for GC pause times. Such goals typically come from business requirements:

All user transactions must receive a response within 10 seconds
90% of order payment operations must be processed within 3 seconds
Recommended products must be displayed to users within 100ms

With these performance metrics, you need to ensure that GC pauses do not exceed the latency requirements during transactions.

2. Throughput

Throughput refers to the amount of work a system can complete per unit of time. Common metrics include:

Requests processed per second
Operations completed per hour
System resource utilization

Throughput focuses on the system’s overall performance rather than just individual operation response times.

3. Capacity

Capacity tuning is more about cost considerations — minimizing hardware configuration while meeting latency and throughput requirements, optimizing system resource utilization efficiency.

Factory Assembly Line Analogy

To better understand these three concepts, we can use a factory assembly line as an analogy:

Latency: From the first bicycle frame part entering the assembly line to the finished bicycle leaving the line takes a total of 4 hours
Throughput: One bicycle leaves the assembly line every minute, producing 60 bicycles per hour
Capacity: The maximum production capacity the assembly line can support, which can be increased by adding more assembly lines

This analogy tells us that tuning can involve hardware upgrades (increasing capacity) or software optimization (reducing latency) — the right approach depends on actual requirements.

Tuning Parameters in Detail with Experimental Data

Basic Example Program

To demonstrate the practical effects of GC tuning, let’s look at a sample program:

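import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class Producer implements Runnable {
  private static final ScheduledExecutorService executorService = Executors.newScheduledThreadPool(2);
  private final Deque<byte[]> deque;
  private final int objectSize;
  private final int queueSize;

  public Producer(int objectSize, int ttl) {
    this.deque = new ArrayDeque<byte[]>();
    this.objectSize = objectSize;
    // Each run adds 100 objects and runs happen every 100 ms (1000 objects/s),
    // so ttl * 1000 entries keep an object alive for roughly ttl seconds.
    this.queueSize = ttl * 1000;
  }

  @Override
  public void run() {
    for (int i = 0; i < 100; i++) {
      deque.add(new byte[objectSize]);
      if (deque.size() > queueSize) {
        deque.poll(); // drop the oldest object so it becomes garbage
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Short-lived objects: about 200 KB each, retained for about 5 seconds.
    executorService.scheduleAtFixedRate(new Producer(200 * 1024 * 1024 / 1000, 5), 0, 100, TimeUnit.MILLISECONDS);
    // Longer-lived objects: about 50 KB each, retained for about 120 seconds.
    executorService.scheduleAtFixedRate(new Producer(50 * 1024 * 1024 / 1000, 120), 0, 100, TimeUnit.MILLISECONDS);
    TimeUnit.MINUTES.sleep(10);
    executorService.shutdownNow();
  }
}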

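This program schedules two tasks to run every 100 milliseconds. Each task allocates byte arrays, holds them in a queue for a fixed time (about 5 seconds for the first task, about 120 seconds for the second), and then drops them, simulating objects with different lifecycles.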
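To make the two lifecycles concrete, the following sketch estimates each task's allocation rate and the amount of memory it retains once its queue is full. The class and method names are illustrative and not part of the original program, and the estimate ignores array header overhead.

```java
/** Rough footprint estimate for the Producer tasks (illustrative helper, not part of the original program). */
final class ProducerFootprint {
    static final int OBJECTS_PER_RUN = 100; // loop count in Producer.run()
    static final int RUNS_PER_SECOND = 10;  // one run every 100 ms

    /** Bytes allocated per second by one task, ignoring array headers. */
    static long allocationRateBytes(int objectSize) {
        return (long) objectSize * OBJECTS_PER_RUN * RUNS_PER_SECOND;
    }

    /** Bytes held once the deque is full (queueSize = ttl * 1000 arrays). */
    static long retainedBytes(int objectSize, int ttl) {
        return (long) objectSize * ttl * 1000L;
    }

    public static void main(String[] args) {
        // {objectSize, ttl} pairs taken from Producer.main
        int[][] tasks = { {200 * 1024 * 1024 / 1000, 5}, {50 * 1024 * 1024 / 1000, 120} };
        double mib = 1024.0 * 1024.0;
        for (int[] t : tasks) {
            System.out.printf("objectSize=%d B, ttl=%d s: %.0f MiB/s allocated, %.0f MiB retained%n",
                    t[0], t[1], allocationRateBytes(t[0]) / mib, retainedBytes(t[0], t[1]) / mib);
        }
    }
}
```

Under these assumptions the first task allocates about 200 MiB/s and retains about 1 GiB, while the second allocates about 50 MiB/s and retains roughly 6 GiB, which gives a sense of why the experiments use multi-gigabyte heaps.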

GC Log Configuration

When running the above program, you can enable GC logging with the following JVM parameters (these flags apply to JDK 8 and earlier; JDK 9 and later use unified logging via -Xlog:gc*):

-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

Typical GC log output:

2015-06-04T13:34:16.119-0200: 1.723: [GC (Allocation Failure) [PSYoungGen: 114016K->73191K(234496K)] 421540K->421269K(745984K), 0.0858176 secs] [Times: user=0.04 sys=0.06, real=0.09 secs] 
2015-06-04T13:34:16.738-0200: 2.342: [GC (Allocation Failure) [PSYoungGen: 234462K->93677K(254976K)] 582540K->593275K(766464K), 0.2357086 secs] [Times: user=0.11 sys=0.14, real=0.24 secs] 
2015-06-04T13:34:16.974-0200: 2.578: [Full GC (Ergonomics) [PSYoungGen: 93677K->70109K(254976K)] [ParOldGen: 499597K->511230K(761856K)] 593275K->581339K(1016832K), [Metaspace: 2936K->2936K(1056768K)], 0.0713174 secs] [Times: user=0.21 sys=0.02, real=0.07 secs]
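The pause of each collection is the "secs" value that closes the event, just before the [Times: ...] block. As a minimal sketch, assuming the JDK 8 log format shown above, the longest pause could be extracted like this (the class name and regular expression are illustrative and would need adjusting for other collectors or log formats):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Finds the longest GC pause in JDK 8 style -XX:+PrintGCDetails output (illustrative sketch). */
final class GcPauseScanner {
    // Matches the total duration of an event, e.g. ", 0.0858176 secs]".
    private static final Pattern PAUSE = Pattern.compile(", (\\d+\\.\\d+) secs\\]");

    static double maxPauseMillis(List<String> logLines) {
        double max = 0.0;
        for (String line : logLines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) {
                max = Math.max(max, Double.parseDouble(m.group(1)) * 1000.0);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "1.723: [GC (Allocation Failure) [PSYoungGen: 114016K->73191K(234496K)] 421540K->421269K(745984K), 0.0858176 secs] [Times: user=0.04 sys=0.06, real=0.09 secs]",
            "2.342: [GC (Allocation Failure) [PSYoungGen: 234462K->93677K(254976K)] 582540K->593275K(766464K), 0.2357086 secs] [Times: user=0.11 sys=0.14, real=0.24 secs]");
        System.out.printf("max pause: %.1f ms%n", maxPauseMillis(sample));
    }
}
```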

Experimental Configuration Comparison

We ran the same code with three different configurations and obtained different results:

Heap Size	GC Algorithm	Effective Work Ratio	Max Pause Time
-Xmx12g	-XX:+UseConcMarkSweepGC	89.8%	560 ms
-Xmx12g	-XX:+UseParallelGC	91.5%	1,104 ms
-Xmx8g	-XX:+UseConcMarkSweepGC	66.3%	1,610 ms

Latency Tuning

Assume the requirement: each task must be processed within 1000ms. The actual task processing takes 100ms, so GC pauses cannot exceed 900ms.

From the experimental results, only the 12 GB ConcMarkSweepGC configuration meets this requirement:

java -Xmx12g -XX:+UseConcMarkSweepGC Producer

The corresponding GC log shows a maximum pause time of 560ms, meeting the 900ms latency target.
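The latency check amounts to subtracting the task's own processing time from the deadline and comparing each configuration's maximum pause against the remainder. A small sketch using the figures from the comparison table (class and method names are illustrative):

```java
/** Checks measured maximum pauses against a GC pause budget (illustrative sketch). */
final class LatencyBudget {
    /** Pause budget = end-to-end deadline minus the task's own processing time. */
    static long pauseBudgetMillis(long deadlineMillis, long processingMillis) {
        return deadlineMillis - processingMillis;
    }

    static boolean meets(long maxPauseMillis, long budgetMillis) {
        return maxPauseMillis <= budgetMillis;
    }

    public static void main(String[] args) {
        long budget = pauseBudgetMillis(1000, 100);
        String[] configs = { "-Xmx12g CMS", "-Xmx12g Parallel", "-Xmx8g CMS" };
        long[] maxPauses = { 560, 1104, 1610 }; // ms, from the comparison table
        for (int i = 0; i < configs.length; i++) {
            System.out.println(configs[i] + ": " + (meets(maxPauses[i], budget) ? "within" : "over") + " budget");
        }
    }
}
```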

Throughput Tuning

Assume a throughput target of 13 million operations per hour.

Analyzing the experimental data, the ParallelGC configuration meets the requirement:

java -Xmx12g -XX:+UseParallelGC Producer

The effective work ratio is 91.5%. Over the program's 10-minute run, the time spent on useful work rather than GC is:

10 * 60 * 1000 ms * 91.5% = 549,000 ms = 549 s
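As a cross-check of this arithmetic, the sketch below computes the useful-work time for a 10-minute run (the duration of the sleep in Producer.main) at each effective work ratio from the comparison table. The helper name is illustrative.

```java
/** Converts an effective work ratio into useful-work time for a run (illustrative sketch). */
final class UsefulWork {
    static long usefulMillis(long runMinutes, double effectiveRatio) {
        return Math.round(runMinutes * 60 * 1000 * effectiveRatio);
    }

    public static void main(String[] args) {
        double[] ratios = { 0.898, 0.915, 0.663 }; // 12g CMS, 12g Parallel, 8g CMS
        for (double ratio : ratios) {
            System.out.printf("%.1f%% -> %d ms of useful work%n", ratio * 100, usefulMillis(10, ratio));
        }
    }
}
```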

Capacity Tuning

While meeting latency and throughput requirements, we can try reducing hardware configuration. In the experimental data, however, the 8 GB configuration has a maximum pause of 1,610 ms, which exceeds the 900 ms latency budget, and an effective work ratio of only 66.3%, indicating that 8 GB is insufficient for this workload.

In Practice: A Complete Off-Heap Memory Leak Investigation

Symptom Discovery

A system that had been running stably in production for three years was migrated from physical machines to a Docker environment. After running for a while, the monitoring system suddenly issued alerts for unavailable instances, and the load balancer automatically removed the failed nodes. The JVM startup parameters were:

-Xmx1792m -Xms1792m -Xmn900m -XX:PermSize=256m -XX:MaxPermSize=256m -server -Xss512k

Checking OS monitoring revealed abnormal memory usage:

Total memory usage rose continuously to about 4G, at which point it exceeded the system limit
The maximum heap size was set to 1792M, so the growth had to come from outside the heap: an off-heap memory leak

Emergency Measures

We restarted the application instances as an emergency measure. After the restart, memory usage returned to normal and everything appeared fine.

Initial Investigation

GC Log Analysis

First, we examined the GC logs and found that heap usage consistently dropped back to around 170M after each collection, with no significant upward trend. Since the JVM process as a whole was using nearly 4G of memory, this further pointed to off-heap memory as the cause.

Code Investigation

Examining the production service code, we found:

No explicit use of off-heap memory
No dependencies on additional native methods
Network I/O code was managed by Tomcat, which was unlikely to have off-heap memory leaks

Deep Investigation

JVM Heap Dump

The problematic server had already been killed, but fortunately several other production machines were running the same service. They also showed significant off-heap memory usage and simply had not reached the OOM threshold yet.

Using jmap to dump the JVM heap:

jmap -dump:format=b,file=heap.bin [pid]

MAT Analysis

Using MAT to analyze the heap file, the heap usage showed a total of just over 200M — consistent with the 170M reported in the GC logs, far below the 4G level.

MAT indicated a potential memory leak point: the CachedBnsClient class had 12,452 instances, accounting for 61.92% of the entire heap.

Code Review

Most calls to CachedBnsClient in the system were through @Autowired annotations, and these instances were few. The only code that frequently created such instances was:

@Override
public void fun() {
    BnsClient bnsClient = new CachedBnsClient();
    // do something
    return ;
}

Examining the CachedBnsClient class:

public class CachedBnsClient {
    private ConcurrentHashMap<String, List<String>> authCache = new ConcurrentHashMap<String, List<String>>();
    private ConcurrentHashMap<String, List<URI>> validUriCache = new ConcurrentHashMap<String, List<URI>>();
    private ConcurrentHashMap<String, List<URI>> uriCache = new ConcurrentHashMap<String, List<URI>>();
    ......
}

Nothing appeared to suggest a memory leak.

Thread Information Analysis

Working on the principle that more on-site data yields more clues, we used jstack to dump the thread information. In addition to the normal I/O threads and framework daemon threads, the dump contained an astonishing number of threads, 12,563 in total, most of them with the following stack:

"Thread-5" daemon prio=10 tid=0x00007fb79426e000 nid=0x7346 waiting on condition [0x00007fb7b5678000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at com.xxxxx.CachedBnsClient$1.run(CachedBnsClient.java:62)

These threads were all sleeping inside the run method of an anonymous class in CachedBnsClient. There were exactly 12,452 of them, matching the CachedBnsClient instance count reported by MAT!
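Counting the threads that share a particular stack frame is easy to script. The sketch below assumes the usual jstack layout, in which each thread entry starts with a quoted thread name and entries are separated by blank lines; the class name is illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Counts threads in a jstack dump whose stack mentions a given frame (illustrative sketch). */
final class ThreadDumpCounter {
    static int countThreadsWithFrame(String dump, String frameFragment) {
        int count = 0;
        // Thread entries are separated by blank lines.
        for (String block : dump.split("\\n\\s*\\n")) {
            String entry = block.trim();
            if (entry.startsWith("\"") && entry.contains(frameFragment)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ThreadDumpCounter thread.txt 'CachedBnsClient$1.run'
        String dump = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.println(countThreadsWithFrame(dump, args[1]));
    }
}
```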

Re-examining the Code

Re-examining the CachedBnsClient code revealed a critical oversight:

public CachedBnsClient(BnsClient client) {
    super();
    this.backendClient = client;
    // Every constructor call starts a new thread that never terminates.
    new Thread() {
        @Override
        public void run() {
            for (;;) {
                refreshCache();
                try {
                    Thread.sleep(60 * 1000);
                } catch (InterruptedException e) {
                    logger.error("Error", e);
                }
            }
        }
    }.start();
}

This code is the CachedBnsClient constructor. Every new instance starts a thread that loops forever, refreshing the cache every 60 seconds, and nothing ever stops it!
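The article does not show how the code was eventually fixed. One possible direction, sketched here under our own assumptions, is to let all instances share a single daemon scheduler thread and to cancel an instance's refresh task when the instance is no longer needed. SharedRefreshClient, its thread name, and the empty refreshCache body are hypothetical stand-ins, not the real CachedBnsClient API.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical alternative to a thread-per-instance cache refresher. */
final class SharedRefreshClient implements AutoCloseable {
    static final String THREAD_NAME = "bns-cache-refresher"; // illustrative name

    // One daemon thread serves every instance, however many are created.
    private static final ScheduledThreadPoolExecutor REFRESHER = newRefresher();

    private ScheduledFuture<?> refreshTask;

    private SharedRefreshClient() { }

    private static ScheduledThreadPoolExecutor newRefresher() {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1, runnable -> {
            Thread thread = new Thread(runnable, THREAD_NAME);
            thread.setDaemon(true);
            return thread;
        });
        executor.setRemoveOnCancelPolicy(true); // cancelled tasks release their client promptly
        return executor;
    }

    /** Builds the client first, then schedules its refresh, so "this" never escapes a constructor. */
    static SharedRefreshClient create(long periodMillis) {
        SharedRefreshClient client = new SharedRefreshClient();
        client.refreshTask = REFRESHER.scheduleWithFixedDelay(
                client::refreshCache, 0, periodMillis, TimeUnit.MILLISECONDS);
        return client;
    }

    private void refreshCache() {
        // Placeholder for the real cache refresh logic.
    }

    /** Stops refreshing; without this the scheduler would keep the instance reachable. */
    @Override
    public void close() {
        refreshTask.cancel(false);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            create(60_000).close();
        }
        System.out.println("refresher threads: " + REFRESHER.getPoolSize());
    }
}
```

With this shape, the number of refresher threads stays constant no matter how many clients are created. Sharing one injected instance, as the @Autowired call sites already did, would avoid the problem more simply still.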

Key Discovery

Seeing the 12,452 business threads waiting in CachedBnsClient.run, it was immediately clear that these threads were causing the off-heap memory leak. Next, we needed to verify whether the leaked memory volume could indeed cause an OOM.

Memory Calculation Problem

Since Xss is configured as 512K, each thread stack is nominally 512K, so the calculation is:

12563 * 512K = 6,432,256K ≈ 6281M ≈ 6.1G

The entire environment has only 4G in total. Adding the JVM heap of 1.8G (1792M) gives a figure far above that:

6.1G + 1.8G = 7.9G > 4G

But this calculation was obviously problematic. If it were true, the application would have OOM'd long ago.

Deep Analysis

Java Thread Implementation at the OS Level

JVM threads on Linux are created through NPTL (the Native POSIX Thread Library). Each JVM thread corresponds to a Linux LWP (lightweight process), and a Thread.start() call is essentially a do_fork in the kernel.

When the JVM starts with -Xss512k (thread stack size), 8K of this 512K is mandatory overhead, shared by the process kernel stack and thread_info. The available user-mode stack memory is:

512K - 8K = 504K

Linux Physical Memory Mapping

Linux is very frugal with physical memory usage. Initially, only virtual memory linear regions are allocated, not actual physical memory. Physical memory is only allocated when actually needed — known as demand paging.

Checking smaps for Process Memory Usage

Use the following command to check actual physical memory usage:

cat /proc/[pid]/smaps > smaps.txt

Actual physical memory usage information:

7fa69a6d1000-7fa69a74f000 rwxp 00000000 00:00 0 
Size:                504 kB
Rss:                  92 kB
Pss:                  92 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        92 kB
Referenced:           92 kB
Anonymous:            92 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB

Searching for 504 kB entries turned up exactly 12,563, one per thread. Rss shows the physical memory actually resident for the mapping (92 kB here). Pss is the proportional set size, which divides pages shared with other processes among the processes sharing them; since this mapping shares nothing, Rss == Pss.

Examining dozens of matching entries, most had an Rss between 92K and 152K. Taking the midpoint and adding the 8K kernel stack:

(92K + 152K) / 2 + 8K = 130K

We rounded this to 128K as the average thread stack footprint for this application.
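Rather than sampling a few dozen entries by hand, the same figures can be gathered over every matching mapping. The sketch below assumes the smaps layout shown above, where each mapping lists a Size line before its Rss line; the class name is illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Summarizes Rss over all smaps mappings of a given Size (illustrative sketch). */
final class SmapsStackSummary {
    /** Returns {count, totalRssKb} for mappings whose Size equals sizeKb. */
    static long[] summarize(List<String> smapsLines, long sizeKb) {
        long count = 0;
        long totalRssKb = 0;
        boolean matching = false;
        for (String line : smapsLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 2) {
                continue;
            }
            if (fields[0].equals("Size:")) {
                matching = Long.parseLong(fields[1]) == sizeKb;
            } else if (fields[0].equals("Rss:") && matching) {
                count++;
                totalRssKb += Long.parseLong(fields[1]);
                matching = false;
            }
        }
        return new long[] { count, totalRssKb };
    }

    public static void main(String[] args) throws IOException {
        // Usage: java SmapsStackSummary smaps.txt
        long[] result = summarize(Files.readAllLines(Paths.get(args[0])), 504);
        long average = result[0] == 0 ? 0 : result[1] / result[0];
        System.out.println(result[0] + " mappings of 504 kB, average Rss " + average + " kB");
    }
}
```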

Recalculating Memory

The JVM initially requested:

-Xmx1792m -Xms1792m

That’s 1.8G of on-heap memory, allocated immediately with physical page frames from the start.

12,563 threads, each with an average thread stack of 128K:

128K * 12563 = 1570M = 1.5G of off-heap memory

Adding the JVM’s 1.8G brings us to 3.3G, plus memory used by the kernel, log transfer processes, and others — indeed approaching 4G. The memory numbers match up!
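The estimate is simple enough to capture in a small helper, which makes it easy to rerun with a different thread count or per-thread footprint. This is only a sketch of the arithmetic in this section; the names are illustrative and results are truncated to whole megabytes.

```java
/** Estimates process memory as heap plus resident thread stacks (illustrative sketch). */
final class ThreadMemoryEstimate {
    /** Memory held by thread stacks, in MB, given the per-thread footprint in KB. */
    static long threadStacksMb(long threads, long perThreadKb) {
        return threads * perThreadKb / 1024;
    }

    static long totalMb(long heapMb, long threads, long perThreadKb) {
        return heapMb + threadStacksMb(threads, perThreadKb);
    }

    public static void main(String[] args) {
        long heapMb = 1792;   // -Xmx1792m
        long threads = 12563; // thread count from the jstack dump
        System.out.println("reserved (512K per thread): " + totalMb(heapMb, threads, 512) + " MB");
        System.out.println("resident (128K per thread): " + totalMb(heapMb, threads, 128) + " MB");
    }
}
```

With the measured 128K average, this gives about 1570 MB of thread stacks and about 3362 MB in total for the Docker instance.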

Physical Machine Verification

Logging into the original physical machine, we found the same off-heap memory leak phenomenon — its physical memory usage had already reached over 5G. Dumping the application threads on the physical machine:

A total of 28,737 threads, of which 28,626 were waiting in CachedBnsClient.

Using smaps to check the process's actual memory information again showed the same average of 128K. Continuing the physical memory calculation:

1.8G + (28737 * 128K) / (1024 * 1024) = 1.8G + 3.5G = 5.3G

This further validated our reasoning.

Why No Stuttering

Because almost all threads were sleeping on:

Thread.sleep(60 * 1000); // Sleep for 60s at a time

They only occupied memory — actual CPU time consumed was minimal.

Summary and Tool List

Lessons Learned

More on-site information is better: When troubleshooting bugs, collect as much on-site information as possible
Quantitative analysis is key: Memory leaks require quantitative analysis using inferred models
Deep analysis matters: When quantitative and actual results don’t match, dig deeper — you’ll discover new insights

Tool List

JVM Tuning Tools

jstat: JVM statistics monitoring tool

jstat -gcutil [pid] 1s

jinfo: View JVM runtime parameters

jinfo -flags [pid]

jmap: Memory mapping tool

jmap -heap [pid]                           # View heap info
jmap -dump:format=b,file=heap.bin [pid]    # Dump heap memory

jstack: Thread stack tool

jstack [pid] > thread.txt

GC Analysis Tools

GCViewer: Visual GC log analysis tool
GCEasy: Online GC log analysis platform
JConsole: JVM monitoring console
VisualVM: JVM performance analysis tool

System-Level Analysis Tools

/proc/[pid]/smaps: View detailed process memory usage
top: System process monitoring
free: Memory usage overview
vmstat: Virtual memory statistics

Tuning Principles

Define goals first: Determine the priority of latency, throughput, and capacity goals
Optimize incrementally: Adjust only one parameter at a time and compare results
Test thoroughly: Verify extensively in test environments before making production changes
Monitoring first: Establish a comprehensive monitoring system to detect anomalies early

Best Practices

Avoid over-tuning: Not all applications need complex GC tuning
Focus on business metrics: The ultimate goal of GC tuning is to improve user experience and business value
Continuous monitoring: Establish long-term monitoring and alerting mechanisms
Document everything: Record the tuning process and results for future maintenance

This off-heap memory leak investigation drilled down through the evidence step by step: GC logs and heap dumps ruled out the heap, the thread dump exposed an abnormal thread count, the code review traced it to the infinite-loop thread started in the constructor, and /proc/[pid]/smaps accounted for the actual physical memory footprint under demand paging.