Diagnosing and troubleshooting CPU problems in production, especially in a cloud environment, can be tricky and tedious. Your application might have millions of lines of code, and identifying the exact line that is causing the CPU to spike can be like finding a needle in a haystack. In this article, let’s learn how to find that needle (i.e. the CPU-spiking line of code) in a matter of seconds or minutes.

To help readers better understand this troubleshooting technique, we built a sample application and deployed it to an AWS EC2 instance. Once this application was launched, CPU consumption spiked to 199.1%. Now let’s walk through the steps that we followed to troubleshoot this problem. Basically, there are 3 simple steps:

  1. Identify threads that consume CPU
  2. Capture thread dumps
  3. Identify lines of code that are causing the CPU to spike

1. Identify threads that are causing CPU to spike

Multiple processes could be running in the EC2 instance. The first step is to identify the process that is causing the CPU to spike. The best way to do this is to use the ‘top’ command, which is available on *nix flavors of operating systems.

Issue the ‘top’ command from the console:

$ top

This command displays all the processes running in the EC2 instance, sorted by CPU consumption, with the highest consumers at the top. When we issued the command in the EC2 instance, we saw the output below:

Fig: ‘top’ command issued from an AWS EC2 instance

From the output, you can see that process# 31294 is consuming 199.1% of CPU. That is pretty high consumption. We have now identified the process in the EC2 instance that is causing the CPU to spike. The next step is to identify the threads within this process that are causing the CPU to spike.

Issue the command ‘top -H -p {pid}’ from the console. Example:

$ top -H -p 31294

This command displays all the threads in process 31294, along with the CPU consumed by each of them. When we issued this command in the EC2 instance, we saw the output below:

Fig: ‘top -H -p {pid}’ command issued from an AWS EC2 instance

From the output you can notice:

  • Thread Id 31306 consuming 69.3% of CPU
  • Thread Id 31307 consuming 65.6% of CPU
  • Thread Id 31308 consuming 64.0% of CPU
  • All the remaining threads consume a negligible amount of CPU.

This is a good step forward, as we have identified the threads that are causing the CPU to spike. As the next step, we need to capture thread dumps so that we can identify the lines of code that are causing the CPU to spike.
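One detail is worth keeping in mind before moving on: ‘top -H’ reports thread Ids in decimal, whereas thread dumps from many JDK versions list each thread’s native Id in hexadecimal, in a field of the form nid=0x... (some newer JDK releases print it in decimal instead, so check which form your dump uses). Below is a minimal sketch of the conversion, in case you want to search the raw dump text yourself. The class name NidConverter is hypothetical and used only for illustration.

```java
// Hypothetical helper (not part of the sample application): converts the
// decimal thread Ids shown by 'top -H' into the hexadecimal form that many
// JDK versions use in the nid field of a thread dump.
final class NidConverter {

    // Returns the thread Id as lowercase hex with a "0x" prefix.
    static String toNid(int threadId) {
        return "0x" + Integer.toHexString(threadId);
    }

    public static void main(String[] args) {
        // Thread Ids identified with 'top -H -p 31294' in step 1.
        int[] threadIds = {31306, 31307, 31308};
        for (int threadId : threadIds) {
            System.out.println(threadId + " -> nid=" + toNid(threadId));
        }
    }
}
```

For the three threads above, this gives 0x7a4a, 0x7a4b and 0x7a4c.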

2. Capture thread dumps

A thread dump is a snapshot of all the threads present in the application. For each thread, the dump reports its state, its stack trace (i.e. the code path that the thread is executing) and thread Id related information.
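To make the idea of a snapshot concrete, here is a rough sketch that prints similar information for the JVM it runs in, using the standard Thread.getAllStackTraces() method. It is only an illustration of what a dump holds, not how jstack works, and the class name ThreadSnapshot is hypothetical. Note that the Id printed here is the Java-level thread Id, not the operating system thread Id that ‘top -H’ shows.

```java
import java.util.Map;

// Illustrative only: builds a thread-dump-like snapshot of the current JVM.
final class ThreadSnapshot {

    // One header line per thread (name, Java thread Id, state),
    // followed by that thread's stack frames.
    static String capture() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append('"')
               .append(" id=").append(thread.getId())
               .append(" state=").append(thread.getState())
               .append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(capture());
    }
}
```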

There are 8 different options to capture thread dumps. You can choose the option that is convenient for you. One of the simplest is the ‘jstack’ tool, which is packaged in the JDK and can be found in the $JAVA_HOME/bin folder. Below is the command to capture a thread dump:

 jstack -l  {pid} > {file-path} 

where

pid: the process Id of the application whose thread dump should be captured

file-path: the file path to which the thread dump will be written

Example:

jstack -l 31294 > /opt/tmp/threadDump.txt

As per the example, the thread dump of the process would be written to the /opt/tmp/threadDump.txt file.

3. Identify lines of code that are causing the CPU to spike

The next step is to analyze the thread dump to identify the lines of code that are causing the CPU to spike. We recommend analyzing thread dumps with fastThread, a free online thread dump analysis tool.

We uploaded the captured thread dump to the fastThread tool, which generated a visual report with multiple sections. In the top right corner of the report, there is a search box. There we entered the Ids of the threads that were consuming high CPU, i.e. the thread Ids that we identified in step #1: ‘31306, 31307, 31308’.

The fastThread tool displayed the stack traces of all 3 threads, as shown below.

Fig: fastThread tool displaying CPU consuming thread

You can see that all 3 threads are in the RUNNABLE state and executing this line of code:

 com.buggyapp.cpuspike.Object1.execute(Object1.java:13) 

Following is the application source code:

 
1: package com.buggyapp.cpuspike;
2:
3: /**
4: * 
5: * @author Test User
6: */
7: public class Object1 {
8:	
9:	public static void execute() {
10:		
11:		while (true) {
12:		
13:			doSomething();
14:		}		
15:	}
16:	
17:	public static void doSomething() {
18:		
19:	}
20: } 

You can see that line #13 in Object1.java is ‘doSomething();’. The ‘doSomething()’ method does nothing, but it is invoked an infinite number of times because of the non-terminating while loop in line #11. If a thread loops an infinite number of times, the CPU will start to spike. That is exactly what is happening in this sample program. If the non-terminating loop in line #11 is fixed, this CPU spike problem will go away.
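How the loop should terminate depends on what the code is meant to do, which this sample program does not tell us. As one possible sketch, assuming the work is supposed to stop on request, the loop in line #11 could check a stop flag instead of ‘true’. The class name Object1Fixed, the running flag and the stop() method are hypothetical additions, not part of the original application.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical variant of Object1 with a terminating loop.
final class Object1Fixed {

    // Assumed stop flag; AtomicBoolean makes the update visible across threads.
    private static final AtomicBoolean running = new AtomicBoolean(true);

    static void execute() {
        // The loop now exits once stop() has been called.
        while (running.get()) {
            doSomething();
        }
    }

    static void stop() {
        running.set(false);
    }

    static void doSomething() {
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(Object1Fixed::execute);
        worker.start();
        Thread.sleep(100);   // let the worker spin briefly
        stop();              // request termination
        worker.join(1000);   // wait for the loop to exit
        System.out.println("worker alive: " + worker.isAlive());
    }
}
```

Keep in mind that this loop still spins at full speed until stop() is called. If the real code is only waiting for something to happen, blocking or sleeping between iterations would be the next thing to consider.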

Note: If you would like to learn more about CPU basics & fundamentals, here is a good article.

Conclusion

To summarize: first we use the ‘top’ tool to identify the Ids of the threads that are causing the CPU to spike, then we capture thread dumps, and finally we analyze the thread dumps to identify the exact lines of code that are causing the CPU to spike. Enjoy troubleshooting, happy hacking!