JVM Performance Tuning and Off-Heap Memory Leak Troubleshooting in Practice
Introduction
JVM performance tuning and memory issue troubleshooting have always been significant challenges for Java developers. This article starts from the three core goals of GC tuning and combines them with a real-world off-heap memory leak investigation case to provide a systematic approach to performance tuning and practical troubleshooting methods.
In complex distributed systems, memory issues are often the hardest to diagnose. Off-heap memory leaks in particular, occurring outside the JVM heap, are difficult to detect with conventional GC monitoring tools and can easily cause systems to crash suddenly in production environments. This article shares a complete off-heap memory leak investigation process, hoping to provide valuable reference for readers.
Three Goals of GC Tuning: Latency, Throughput, Capacity
Core Concepts
GC tuning, like other performance optimization, requires following a scientific methodology. Before starting tuning, we need to clarify three core performance goals:
1. Latency
Latency refers to the upper bound requirement for GC pause times. Such goals typically come from business requirements:
- All user transactions must receive a response within 10 seconds
- 90% of order payment operations must be processed within 3 seconds
- Recommended products must be displayed to users within 100ms
With these performance metrics, you need to ensure that GC pauses do not exceed the latency requirements during transactions.
2. Throughput
Throughput refers to the amount of work a system can complete per unit of time. Common metrics include:
- Requests processed per second
- Operations completed per hour
- System resource utilization
Throughput focuses on the system’s overall performance rather than just individual operation response times.
3. Capacity
Capacity tuning is more about cost considerations — minimizing hardware configuration while meeting latency and throughput requirements, optimizing system resource utilization efficiency.
Factory Assembly Line Analogy
To better understand these three concepts, we can use a factory assembly line as an analogy:
- Latency: From the first bicycle frame part entering the assembly line to the finished bicycle leaving the line takes a total of 4 hours
- Throughput: One bicycle leaves the assembly line every minute, producing 60 bicycles per hour
- Capacity: The maximum production capacity the assembly line can support, which can be increased by adding more assembly lines
This analogy tells us that tuning can involve hardware upgrades (increasing capacity) or software optimization (reducing latency) — the right approach depends on actual requirements.
Tuning Parameters in Detail with Experimental Data
Basic Example Program
To demonstrate the practical effects of GC tuning, we use a small sample program.
This program submits two tasks every 100 milliseconds, simulating different object lifecycles.
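A minimal sketch of a program with this behaviour is shown below. The class name `Producer`, the buffer sizes, and the retention windows are assumptions chosen for illustration, not the values used in the experiments.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Allocates fixed-size buffers and keeps only the newest maxRetained of them reachable. */
class Producer implements Runnable {
    private final Deque<byte[]> retained = new ArrayDeque<>();
    private final int objectSize;
    private final int maxRetained;

    public Producer(int objectSize, int maxRetained) {
        this.objectSize = objectSize;
        this.maxRetained = maxRetained;
    }

    @Override
    public void run() {
        // Each tick allocates a batch; the oldest buffers drop off the queue and become garbage.
        for (int i = 0; i < 100; i++) {
            retained.add(new byte[objectSize]);
            if (retained.size() > maxRetained) {
                retained.poll();
            }
        }
    }

    public int retainedCount() {
        return retained.size();
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService executor = Executors.newScheduledThreadPool(2);
        // Short-lived objects: small retention window (hypothetical sizes).
        executor.scheduleAtFixedRate(new Producer(16 * 1024, 500), 0, 100, TimeUnit.MILLISECONDS);
        // Long-lived objects: large retention window (hypothetical sizes).
        executor.scheduleAtFixedRate(new Producer(4 * 1024, 12_000), 0, 100, TimeUnit.MILLISECONDS);
        TimeUnit.SECONDS.sleep(5);
        executor.shutdownNow();
    }
}
```

The retention window is what separates the two lifecycles: a small window produces mostly short-lived garbage, while a large one keeps objects alive long enough to be promoted to the old generation.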
GC Log Configuration
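When running the above program, GC logging can be enabled with JVM parameters. On JDK 8, for example, -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps prints each collection with its timestamps; on JDK 9 and later the unified logging option -Xlog:gc* serves the same purpose. The GC log is where the pause times quoted below come from.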
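As a complement to GC logs, collector activity can also be sampled from inside the process through the standard `java.lang.management` API. The sketch below is only an approximation: the class name `GcTimeShare` is an assumption, and for concurrent collectors the reported time can include concurrent work, so the GC log remains the better source for pause times.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcTimeShare {
    /** Sums the accumulated collection time (ms) reported by all collectors. */
    public static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Approximate fraction of JVM uptime spent in GC so far. */
    public static double gcTimeShare() {
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMs <= 0 ? 0.0 : (double) totalGcTimeMs() / uptimeMs;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.printf("approx. GC time share: %.2f%%%n", gcTimeShare() * 100);
    }
}
```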
Experimental Configuration Comparison
We ran the same code with three different configurations and obtained different results:
| Heap Size | GC Algorithm | Effective Work Ratio | Max Pause Time |
|---|---|---|---|
| -Xmx12g | -XX:+UseConcMarkSweepGC | 89.8% | 560 ms |
| -Xmx12g | -XX:+UseParallelGC | 91.5% | 1,104 ms |
| -Xmx8g | -XX:+UseConcMarkSweepGC | 66.3% | 1,610 ms |
Latency Tuning
Assume the requirement: each task must be processed within 1000ms. The actual task processing takes 100ms, so GC pauses cannot exceed 900ms.
From the experimental results, only the 12 GB ConcMarkSweepGC configuration (-Xmx12g -XX:+UseConcMarkSweepGC) meets this requirement: its GC log shows a maximum pause time of 560 ms, well within the 900 ms budget.
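The budget check can be written out explicitly. The sketch below uses the deadline, processing time, and maximum pause times from this section and the comparison table; the class and method names are assumptions made for illustration.

```java
class LatencyBudget {
    /** GC pause budget = end-to-end deadline minus the time the task itself needs. */
    public static long gcPauseBudgetMs(long deadlineMs, long processingMs) {
        return deadlineMs - processingMs;
    }

    public static boolean meetsBudget(long maxPauseMs, long budgetMs) {
        return maxPauseMs <= budgetMs;
    }

    public static void main(String[] args) {
        long budget = gcPauseBudgetMs(1000, 100); // 900 ms
        String[] configs = {
            "-Xmx12g -XX:+UseConcMarkSweepGC",
            "-Xmx12g -XX:+UseParallelGC",
            "-Xmx8g -XX:+UseConcMarkSweepGC"
        };
        long[] maxPauseMs = {560, 1104, 1610}; // from the comparison table
        for (int i = 0; i < configs.length; i++) {
            System.out.printf("%-34s max pause %5d ms -> %s%n", configs[i], maxPauseMs[i],
                    meetsBudget(maxPauseMs[i], budget) ? "within budget" : "over budget");
        }
    }
}
```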
Throughput Tuning
Assume a throughput target of 13 million operations per hour.
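Analyzing the experimental data, the ParallelGC configuration (-Xmx12g -XX:+UseParallelGC) meets the requirement. Its effective work ratio of 91.5% is the highest of the three configurations, meaning that GC pauses take up the remaining 8.5% of the run time.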
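One way to relate the effective work ratio to a throughput target is sketched below. The inputs are hypothetical: 306,000 ms of total GC pause in a one-hour run (chosen so the ratio comes out at 91.5%) and a pause-free capacity of 14.4 million operations per hour. Neither figure comes from the experiment itself.

```java
class ThroughputEstimate {
    /** Fraction of wall-clock time left for application work after GC pauses. */
    public static double effectiveWorkRatio(long totalGcPauseMs, long elapsedMs) {
        return 1.0 - (double) totalGcPauseMs / elapsedMs;
    }

    /** Scales a pause-free throughput figure by the effective work ratio. */
    public static long achievableOpsPerHour(long pauseFreeOpsPerHour, double ratio) {
        return Math.round(pauseFreeOpsPerHour * ratio);
    }

    public static void main(String[] args) {
        double ratio = effectiveWorkRatio(306_000, 3_600_000); // hypothetical pause total
        long ops = achievableOpsPerHour(14_400_000, ratio);    // hypothetical pause-free capacity
        System.out.printf("effective work ratio: %.1f%%%n", ratio * 100);
        System.out.println("achievable ops/hour: " + ops);
        System.out.println("meets 13M target: " + (ops >= 13_000_000));
    }
}
```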
Capacity Tuning
While meeting latency and throughput requirements, we can try reducing hardware configuration. The experimental data shows where that stops working: with 8 GB of heap the maximum pause grows to 1,610 ms, which breaks the 900 ms latency budget, and the effective work ratio falls to 66.3%. Both figures indicate that 8 GB is insufficient for this workload.
In Practice: A Complete Off-Heap Memory Leak Investigation
Symptom Discovery
A system that had been running stably in production for three years was migrated from physical machines to a Docker environment. After running for a while, the monitoring system suddenly issued alerts for unavailable instances. The load balancer automatically removed the failed nodes.
Checking OS monitoring revealed abnormal memory usage:
- Total memory usage (the blue line on the monitoring chart) rose continuously until it reached 4G and exceeded the system limit
- The maximum heap size was set to 1792M, so the growth had to come from outside the heap: clearly an off-heap memory leak
Emergency Measures
We urgently restarted the application instances. After the restart, memory usage returned to normal and everything appeared fine.
Initial Investigation
GC Log Analysis
First, we examined the GC logs and found that heap usage consistently dropped back to around 170M after each collection, with no significant upward trend. Since the JVM process as a whole was using nearly 4G of memory, this further confirmed off-heap memory as the cause.
Code Investigation
Examining the production service code, we found:
- No explicit use of off-heap memory
- No dependencies on additional native methods
- Network I/O code was managed by Tomcat, which was unlikely to have off-heap memory leaks
Deep Investigation
JVM Heap Dump
The problematic production server had already been killed, but fortunately several other machines were still running. They also showed significant off-heap memory usage and simply had not reached the OOM threshold yet.
We used jmap to dump the JVM heap, for example with jmap -dump:format=b,file=heap.hprof [pid] (the file name here is arbitrary).
MAT Analysis
Using MAT to analyze the heap file, the heap usage showed a total of just over 200M — consistent with the 170M reported in the GC logs, far below the 4G level.
MAT indicated a potential memory leak point: the CachedBnsClient class had 12,452 instances, accounting for 61.92% of the entire heap.
Code Review
Most uses of CachedBnsClient in the system were injected through @Autowired annotations, and those instances were few. Only one code path created new instances frequently.
At first glance, neither that call site nor the CachedBnsClient class itself appeared to suggest a memory leak.
Thread Information Analysis
The more on-site data we had, the more clues we could find, so we used jstack to dump the thread information. In addition to the normal I/O threads and framework daemon threads, there were an astonishing 12,563 extra threads.
And these were running in CachedBnsClient’s run method! The number of these specific threads was exactly 12,452 — matching the CachedBnsClient instance count!
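Counting threads by stack frame is easy to automate once the jstack output has been saved to a file. The sketch below relies on the fact that jstack starts each thread entry with a quoted thread name and separates entries with blank lines. The sample dump, its thread names, and the `com.example` package are made up for illustration.

```java
class ThreadDumpCounter {
    /** Counts thread entries in a jstack dump whose stack contains the given frame text. */
    public static int countThreadsWithFrame(String dump, String frame) {
        int count = 0;
        // jstack separates thread entries with blank lines.
        for (String entry : dump.split("\\R\\s*\\R")) {
            String trimmed = entry.trim();
            // Thread entries start with the quoted thread name.
            if (trimmed.startsWith("\"") && trimmed.contains(frame)) {
                count++;
            }
        }
        return count;
    }

    /** A made-up three-thread dump in jstack's layout, used only for demonstration. */
    public static String sampleDump() {
        return "\"Thread-101\" #120 daemon prio=5 os_prio=0 tid=0x01 nid=0x1a2b waiting on condition\n"
             + "   java.lang.Thread.State: TIMED_WAITING (sleeping)\n"
             + "\tat java.lang.Thread.sleep(Native Method)\n"
             + "\tat com.example.CachedBnsClient$1.run(CachedBnsClient.java:42)\n"
             + "\n"
             + "\"http-nio-8080-exec-1\" #30 daemon prio=5 os_prio=0 tid=0x02 nid=0x1a2c waiting on condition\n"
             + "   java.lang.Thread.State: WAITING (parking)\n"
             + "\tat sun.misc.Unsafe.park(Native Method)\n"
             + "\n"
             + "\"Thread-102\" #121 daemon prio=5 os_prio=0 tid=0x03 nid=0x1a2d waiting on condition\n"
             + "   java.lang.Thread.State: TIMED_WAITING (sleeping)\n"
             + "\tat java.lang.Thread.sleep(Native Method)\n"
             + "\tat com.example.CachedBnsClient$1.run(CachedBnsClient.java:42)\n";
    }

    public static void main(String[] args) {
        System.out.println(countThreadsWithFrame(sampleDump(), "CachedBnsClient"));
    }
}
```

For a real dump, the file contents can be read with `Files.readAllBytes` and passed in as the `dump` argument.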
Re-examining the Code
Re-examining the CachedBnsClient code revealed a critical oversight in its constructor: it starts a thread that runs an infinite loop, refreshing the cache every 60 seconds!
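The pattern can be reproduced in a few lines. `LeakyClient` below is a hypothetical stand-in, not the real CachedBnsClient: the thread naming, the empty `refreshCache` method, and the daemon flag (set only so the demo can exit) are all assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the leaking class: every instance starts its own refresh thread. */
class LeakyClient {
    private static final AtomicInteger SEQ = new AtomicInteger();

    public LeakyClient() {
        Thread refresher = new Thread(() -> {
            while (true) {                 // never exits, so the thread and its stack are never released
                refreshCache();
                try {
                    Thread.sleep(60_000);  // 60-second refresh interval
                } catch (InterruptedException e) {
                    return;
                }
            }
        }, "cache-refresher-" + SEQ.incrementAndGet());
        refresher.setDaemon(true);         // daemon only so that this demo can terminate
        refresher.start();
    }

    private void refreshCache() {
        // Placeholder for the real cache refresh.
    }

    /** Number of refresh threads currently alive in this JVM. */
    public static long liveRefreshThreads() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().startsWith("cache-refresher-"))
                .count();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            new LeakyClient();             // the instance is discarded, but its thread lives on
        }
        System.out.println("live refresh threads: " + liveRefreshThreads());
    }
}
```

Because the runnable calls an instance method, each thread also keeps its client instance reachable. That would be consistent with the instance count matching the thread count in the heap dump.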
Key Discovery
Seeing the 12,452 business threads waiting in CachedBnsClient.run, it was immediately clear that these threads were causing the off-heap memory leak. Next, we needed to verify whether the leaked memory volume could indeed cause an OOM.
Memory Calculation Problem
The configured Xss is 512K, so each thread stack is nominally 512K. For 12,563 threads that gives 12,563 × 512K ≈ 6.1G.
The entire environment has only 4G in total. Adding the 1.8G (1792M) JVM heap gives 6.1G + 1.8G ≈ 7.9G, which clearly exceeds 4G.
But this calculation was obviously problematic — if true, the application would have OOM’d long ago.
Deep Analysis
Java Thread Implementation at the OS Level
JVM threads on Linux are created through NPTL (the Native POSIX Thread Library). Each JVM thread corresponds to a Linux LWP (lightweight process), and a Thread.start is essentially a do_fork.
When the JVM starts with -Xss512k (thread stack size), 8K of this 512K is mandatory, shared by the process kernel stack and thread_info. The available user-mode stack memory is therefore 512K - 8K = 504K.
Linux Physical Memory Mapping
Linux is very frugal with physical memory usage. Initially, only virtual memory linear regions are allocated, not actual physical memory. Physical memory is only allocated when actually needed — known as demand paging.
Checking smaps for Process Memory Usage
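Actual physical memory usage can be read from the process's smaps file, for example with cat /proc/[pid]/smaps. For each mapping, the entry reports Size (the virtual size), Rss (the resident physical memory) and Pss (the proportional share of pages shared with other processes).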
Searching for entries with a Size of 504KB turned up exactly 12,563 of them, one per thread. In a typical entry, Rss (resident physical memory) was 92KB and Pss (physical memory with shared pages divided proportionally among the processes sharing them) was also 92KB. Since a thread stack is not shared with any other process, Rss == Pss.
Examining dozens of matching entries, most had an Rss between 92K and 152K. Adding the 8K kernel stack gives roughly 100K to 160K per thread, which we rounded to 128K as the average thread stack size for this application.
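Rather than sampling a few dozen entries by hand, the whole smaps file can be summarised in one pass. The sketch below assumes the usual smaps layout, where each mapping has a `Size:` line followed by an `Rss:` line, both in kB. The class name and the default of inspecting `/proc/self/smaps` are assumptions, and it runs only on Linux.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

class SmapsStackSummary {
    /** Returns {matchingMappings, totalRssKb} for mappings whose Size equals sizeKb. */
    public static long[] summarize(List<String> lines, long sizeKb) {
        long count = 0;
        long rssKb = 0;
        boolean inMatch = false;
        for (String line : lines) {
            String[] f = line.trim().split("\\s+");
            if (f.length >= 2 && f[0].equals("Size:")) {
                // A Size line opens the statistics of a new mapping.
                inMatch = Long.parseLong(f[1]) == sizeKb;
                if (inMatch) {
                    count++;
                }
            } else if (inMatch && f.length >= 2 && f[0].equals("Rss:")) {
                rssKb += Long.parseLong(f[1]);
                inMatch = false;
            }
        }
        return new long[] {count, rssKb};
    }

    public static void main(String[] args) throws IOException {
        String pid = args.length > 0 ? args[0] : "self";
        List<String> lines = Files.readAllLines(Paths.get("/proc", pid, "smaps"));
        long[] r = summarize(lines, 504); // 504 kB user-mode stacks, as in the case above
        System.out.printf("mappings=%d totalRss=%d kB%n", r[0], r[1]);
        if (r[0] > 0) {
            System.out.printf("average Rss per mapping: %d kB%n", r[1] / r[0]);
        }
    }
}
```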
Recalculating Memory
At startup the JVM requested its full 1.8G of on-heap memory, which was backed by physical page frames from the start.
The 12,563 threads, each with an average thread stack of 128K, add 12,563 × 128K ≈ 1.5G.
Adding the JVM’s 1.8G brings us to 3.3G, plus memory used by the kernel, log transfer processes, and others — indeed approaching 4G. The memory numbers match up!
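The two estimates can be put side by side in a short calculation. It uses the thread count, stack sizes, and heap size quoted above; the class and method names are illustrative.

```java
class StackMemoryEstimate {
    /** Thread stacks plus heap, in GiB. */
    public static double totalGiB(int threads, long perThreadKiB, long heapMiB) {
        double stackMiB = threads * perThreadKiB / 1024.0;
        return (stackMiB + heapMiB) / 1024.0;
    }

    public static void main(String[] args) {
        int threads = 12_563;
        long heapMiB = 1792;
        // Naive model: every thread uses its full 512K reservation.
        System.out.printf("reserved (512K per thread): %.2f GiB%n", totalGiB(threads, 512, heapMiB));
        // Measured model: about 128K per thread is actually resident.
        System.out.printf("resident (128K per thread): %.2f GiB%n", totalGiB(threads, 128, heapMiB));
    }
}
```

The naive model comes out at about 7.9G, far beyond the 4G limit, while the measured model comes out at about 3.3G, which matches the observed behaviour.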
Physical Machine Verification
Logging into the original physical machine, we found the same off-heap memory leak: its physical memory usage had already reached over 5G. We dumped the application threads there as well and checked the process with smaps again, which showed the same average of 128K per thread stack. Repeating the physical memory calculation with that machine's thread count further validated our reasoning.
Why No Stuttering
Because almost all of these threads were sleeping, blocked in the 60-second wait inside CachedBnsClient's run method. They only occupied memory; the CPU time they actually consumed was minimal.
Summary and Tool List
Lessons Learned
- More on-site information is better: When troubleshooting bugs, collect as much on-site information as possible
- Quantitative analysis is key: Memory leaks require quantitative analysis using inferred models
- Deep analysis matters: When quantitative and actual results don’t match, dig deeper — you’ll discover new insights
Tool List
JVM Tuning Tools
jstat: JVM statistics monitoring tool
    jstat -gcutil [pid] 1s
jinfo: View JVM runtime parameters
    jinfo -flags [pid]
jmap: Memory mapping tool
    jmap -heap [pid]                             # View heap info
    jmap -dump:format=b,file=heap.hprof [pid]    # Dump heap memory (example file name)
jstack: Thread stack tool
    jstack [pid] > thread.txt
GC Analysis Tools
- GCViewer: Visual GC log analysis tool
- GCEasy: Online GC log analysis platform
- JConsole: JVM monitoring console
- VisualVM: JVM performance analysis tool
System-Level Analysis Tools
- /proc/[pid]/smaps: View detailed process memory usage
- top: System process monitoring
- free: Memory usage overview
- vmstat: Virtual memory statistics
Tuning Principles
- Define goals first: Determine the priority of latency, throughput, and capacity goals
- Optimize incrementally: Adjust only one parameter at a time and compare results
- Test thoroughly: Verify extensively in test environments before making production changes
- Monitoring first: Establish a comprehensive monitoring system to detect anomalies early
Best Practices
- Avoid over-tuning: Not all applications need complex GC tuning
- Focus on business metrics: The ultimate goal of GC tuning is to improve user experience and business value
- Continuous monitoring: Establish long-term monitoring and alerting mechanisms
- Document everything: Record the tuning process and results for future maintenance
Through this off-heap memory leak investigation, we not only learned specific troubleshooting methods but, more importantly, developed a systematic approach to problem analysis. In complex systems, even seemingly simple configurations can trigger serious issues — only by deeply understanding the underlying principles can we quickly locate and resolve problems.