Most Java production performance problems (CPU spikes, slowdowns, hangs, deadlocks) are reflected in the behaviour of the application’s threads. Capturing and analyzing thread dumps in production can therefore help you identify the root cause of these complex issues quickly. In this post, I’ll share best practices for capturing thread dumps effectively.

There are 9 different options to capture a thread dump. Use whichever option suits your organization’s security requirements and preferences.

Video

In this webinar, we delved into 9 expert tips and tricks to help you master thread dump analysis. Whether you’re dealing with high CPU usage, unresponsive applications, or other performance challenges, these insights equipped participants with practical knowledge to navigate through intricate thread dump data with ease and precision.

1. Capture Multiple Dumps at Regular Intervals

A thread dump is a snapshot of all the threads running in your application at a specific moment. To determine whether a thread is genuinely stuck or just momentarily paused, it’s essential to capture multiple snapshots at regular intervals. For most business applications, capturing three thread dumps at 10-second intervals is a good practice. This approach allows you to observe if a thread remains stuck in the same code line across multiple snapshots.
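The comparison across snapshots can be sketched in code. The example below is a minimal in-process illustration, not how a production capture is normally done (there you would run a tool such as jstack against the process ID). It uses the standard `ThreadMXBean` API to take several snapshots and lists the threads whose top stack frame never changed. The class name `StuckThreadFinder` and its method names are hypothetical, chosen for this illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StuckThreadFinder {

    /** Records the top stack frame of every live thread, keyed by thread ID. */
    static Map<Long, String> snapshotTopFrames() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Long, String> frames = new HashMap<>();
        // false, false: skip lock details, only stack traces are needed here
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            StackTraceElement[] stack = info.getStackTrace();
            if (stack.length > 0) {
                frames.put(info.getThreadId(), info.getThreadName() + " @ " + stack[0]);
            }
        }
        return frames;
    }

    /** Takes 'count' snapshots, sleeping 'intervalMillis' between them. */
    static List<Map<Long, String>> capture(int count, long intervalMillis)
            throws InterruptedException {
        List<Map<Long, String>> snapshots = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            if (i > 0) {
                Thread.sleep(intervalMillis);
            }
            snapshots.add(snapshotTopFrames());
        }
        return snapshots;
    }

    /** Returns the entries whose top frame is identical in every snapshot. */
    static List<String> unchangedAcross(List<Map<Long, String>> snapshots) {
        List<String> unchanged = new ArrayList<>();
        if (snapshots.isEmpty()) {
            return unchanged;
        }
        for (Map.Entry<Long, String> entry : snapshots.get(0).entrySet()) {
            boolean same = true;
            for (Map<Long, String> snapshot : snapshots) {
                if (!entry.getValue().equals(snapshot.get(entry.getKey()))) {
                    same = false;
                    break;
                }
            }
            if (same) {
                unchanged.add(entry.getValue());
            }
        }
        return unchanged;
    }

    public static void main(String[] args) throws InterruptedException {
        // Three snapshots, 10 seconds apart, as suggested in the text
        List<Map<Long, String>> snapshots = capture(3, 10_000);
        unchangedAcross(snapshots).forEach(System.out::println);
    }
}
```

An unchanged top frame only marks a thread as a candidate. Idle pool threads parked while waiting for work also look identical across snapshots, so the listed frames still need to be read in context.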

2. Capture During the Peak of the Performance Problem

In many organizations, the instinctive response 😊 to a performance issue is to restart the application. However, thread dumps captured after a restart are of little use, because the condition that caused the problem is usually gone. Capture thread dumps while the issue is actively occurring. Think of a doctor drawing a blood sample while the patient is experiencing symptoms: a sample taken after the symptoms have subsided won’t provide the information needed for an accurate diagnosis.

3. Capture Supporting Artifacts for Comprehensive Analysis

Thread dumps alone can provide a lot of information, but capturing additional artifacts will give you a more comprehensive view of the issue and help with effective troubleshooting:

top -H -p <PROCESS_ID>: A CPU spike is a common performance issue. Combining thread dumps with the top -H -p <PROCESS_ID> command is an effective strategy for troubleshooting this problem. This command shows the CPU and memory usage for each thread in the specified process, allowing you to identify high-consuming threads. By linking this top data with the thread dump data, you can pinpoint the exact lines of code causing the CPU spike. Thread dump analysis tools like fastThread automatically perform this linkage and generate a “CPU Consumption by Thread” report, as shown below:

Fig: CPU Consumption by Thread generated by fastThread tool
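To make the linkage concrete: top -H reports each thread's native ID in decimal, while HotSpot thread dumps record the same ID in the `nid=` field of the thread header. Many JDK versions print it in hexadecimal (for example `nid=0x3039`), and some newer releases appear to print it in decimal, so the sketch below accepts both. The class `NidMatcher`, its methods, and the sample header line are made up for illustration and are not part of fastThread or any JDK tool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NidMatcher {

    // Matches "nid=0x3039" (hex) as well as "nid=12345" (decimal)
    private static final Pattern NID = Pattern.compile("nid=(0x[0-9a-fA-F]+|\\d+)");

    /** Extracts the native thread ID from a thread dump header line, or -1 if absent. */
    static long nativeThreadId(String headerLine) {
        Matcher matcher = NID.matcher(headerLine);
        if (!matcher.find()) {
            return -1;
        }
        String raw = matcher.group(1);
        return raw.startsWith("0x")
                ? Long.parseLong(raw.substring(2), 16)
                : Long.parseLong(raw);
    }

    /** Converts a decimal thread ID from top -H into the hex form used in many dumps. */
    static String toHexNid(long topThreadId) {
        return "0x" + Long.toHexString(topThreadId);
    }

    public static void main(String[] args) {
        // Hypothetical header line in the style of a HotSpot thread dump
        String header = "\"http-nio-8080-exec-7\" #42 daemon prio=5 os_prio=0 "
                + "tid=0x00007f3c4c0a1000 nid=0x3039 runnable";
        long threadIdFromTop = 12345; // hypothetical value from the PID column of top -H

        System.out.println(toHexNid(threadIdFromTop));                  // 0x3039
        System.out.println(nativeThreadId(header) == threadIdFromTop);  // true
    }
}
```

Once the hot thread from top is matched to its header in the dump, the stack trace under that header shows the code it was executing.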

Garbage Collection Log: Frequent or long garbage collection pauses can cause application threads to stall, appearing as if they are not progressing in the thread dump. By analyzing the Garbage Collection log alongside the thread dump, you can determine whether thread stalling is due to garbage collection events. For example, if application threads show no progress across consecutive thread dumps and the GC log records frequent or long pauses in the same time window, this correlation suggests that GC activity is causing the threads to pause.

netstat: Slow backend systems are another common performance bottleneck. Combining thread dumps with the netstat command is an effective way to troubleshoot this issue. The netstat command provides information about the network connections established by your application, helping you identify if threads are waiting for responses from external systems such as databases or remote services. By linking this netstat data with thread dump data, you can determine whether a high number of open connections corresponds with threads that appear to be waiting indefinitely, indicating that the bottleneck may be due to slow responses from backend systems.
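One simple way to read netstat output next to a thread dump is to count connections per remote endpoint and state. The sketch below assumes the column layout printed by the Linux net-tools netstat (protocol, Recv-Q, Send-Q, local address, foreign address, state). Other platforms lay the columns out differently. The class `NetstatSummary` and the sample addresses are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class NetstatSummary {

    /** Counts TCP connections per "foreignAddress state" pair. */
    static Map<String, Integer> countByRemoteAndState(List<String> netstatLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : netstatLines) {
            String[] fields = line.trim().split("\\s+");
            // Assumed layout: proto recv-q send-q local-address foreign-address state
            if (fields.length < 6 || !fields[0].startsWith("tcp")) {
                continue; // skips headers, UDP rows, and anything unrecognized
            }
            counts.merge(fields[4] + " " + fields[5], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample in the style of netstat output
        List<String> sample = Arrays.asList(
                "Proto Recv-Q Send-Q Local Address           Foreign Address         State",
                "tcp        0      0 10.0.0.5:43210          10.0.0.9:5432           ESTABLISHED",
                "tcp        0      0 10.0.0.5:43211          10.0.0.9:5432           ESTABLISHED",
                "tcp        0      0 10.0.0.5:51000          10.0.0.7:8443           CLOSE_WAIT");
        countByRemoteAndState(sample)
                .forEach((key, count) -> System.out.println(count + "  " + key));
    }
}
```

If one backend endpoint dominates the counts and the thread dump shows many threads blocked in socket reads, that backend is a reasonable first suspect.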

You can use the open-source yc script, which captures 16 troubleshooting artifacts, including three snapshots of thread dumps, top -H output, the GC log, and netstat output.

4. Archive Thread Dumps Securely

Thread dumps often contain sensitive information, such as framework details, third-party libraries, hostnames, and even IP addresses. In many enterprises, production problems are diagnosed by sharing dumps across multiple locations—SRE engineers download them to local machines, upload them to shared drives, and developers and QA engineers access and transfer them multiple times. This widespread handling of dumps poses significant security risks.

To mitigate these risks, it is crucial to securely archive or purge thread dumps after analysis. If you use the yCrash tool, production dumps are securely transmitted directly from your production servers to a yCrash server running within your corporate network. This approach ensures that raw dumps remain inaccessible to individuals, and only generated reports are available for analysis. By centralizing and securing dump storage, yCrash minimizes the risk of sensitive data exposure. For more information on yCrash’s security features, refer to this documentation.

5. Use Forced Option When JVM is Unresponsive

If the JVM becomes unresponsive, it might not respond to regular thread dump capture requests. In such cases, you can use the -F option with the jstack tool to force a thread dump capture. Although the dump captured using this method might be incomplete, with limited details such as missing thread states or lock information, it’s better than having no data at all. Note that the jstack -F option is not available in JDK 11 and above. In such situations, the jhsdb jstack option is a reliable alternative. If you use the open-source yc script, it automatically attempts six different options to capture thread dumps in pristine format, as detailed in this documentation.
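The fallback idea can be sketched as an ordered list of capture commands that are tried until one succeeds. This is an illustration only and does not reproduce what the yc script does. The class `DumpFallback`, the output file name, and the 60-second timeout are assumptions made for this example. The commands themselves (jstack, jstack -F on older JDKs, and jhsdb jstack --pid) are standard JDK tools and must be on the PATH.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class DumpFallback {

    /** Capture commands in the order they are attempted. */
    static List<List<String>> fallbackCommands(long pid) {
        String p = Long.toString(pid);
        return Arrays.asList(
                Arrays.asList("jstack", p),                   // regular capture
                Arrays.asList("jstack", "-F", p),             // forced capture, older JDKs only
                Arrays.asList("jhsdb", "jstack", "--pid", p)  // alternative on newer JDKs
        );
    }

    /** Runs one command, writing its output to 'out'; true if it produced a dump. */
    static boolean tryCapture(List<String> command, Path out) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(out.toFile())
                    .start();
            // Assumed timeout: give up on a tool that itself hangs
            if (!process.waitFor(60, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0 && Files.size(out) > 0;
        } catch (IOException e) {
            return false; // tool missing from PATH or output file not writable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        long pid = Long.parseLong(args[0]);
        Path out = Paths.get("threaddump-" + pid + ".txt");
        for (List<String> command : fallbackCommands(pid)) {
            if (tryCapture(command, out)) {
                System.out.println("Captured with: " + String.join(" ", command));
                return;
            }
        }
        System.out.println("All capture attempts failed");
    }
}
```

Run it with the target process ID as the only argument. On a JDK where jstack -F is unavailable, that attempt is expected to fail and the loop moves on to jhsdb.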

Conclusion

By following these best practices for capturing thread dumps, you can gain valuable insights into your application’s behavior during performance issues. Whether you use automated tools like the ‘yc script’ or capture thread dumps manually, remember to prioritize data security and capture supporting artifacts for a more thorough analysis.