Even unpredictable weather is being forecast these days. But after all these technological advancements, are we able to forecast our application’s performance and availability? Are we able to forecast it even for the next 20 minutes? Can you say whether, in the next 20 minutes, your application is going to experience an OutOfMemoryError, CPU spikes, or crashes? Most likely not. That is because we focus only on macro-metrics:

  • Memory utilization
  • Response time
  • CPU utilization

EXAMPLE 1

Fig: You can notice repeated full GCs triggered (graph from GCeasy.io) 

Memory related micrometrics

There are 4 memory/garbage collection related micrometrics that you can monitor:

  • Garbage collection Throughput
  • Garbage collection Pause time
  • Object creation rate
  • Peak heap size

Let’s discuss them in this section.

# 1. GARBAGE COLLECTION THROUGHPUT

Garbage Collection throughput is the percentage of time the application spends processing customer transactions, as opposed to the time it spends doing garbage collection.

Let’s say your application has been running for 60 minutes. Of these 60 minutes, 2 minutes were spent on GC activities.

It means the application has spent 3.33% of its time on GC activities (i.e. (2 / 60) * 100).

It means the Garbage Collection throughput is 96.67% (i.e. 100 – 3.33).

A degradation in GC throughput is an indication that some sort of memory problem is brewing in the application.
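The throughput arithmetic above can be sketched in a few lines of Python. This is only an illustration; the function name `gc_throughput` and its parameters are assumptions made for this example, not part of any tool mentioned here.

```python
def gc_throughput(total_runtime_min, gc_time_min):
    """Return GC throughput as a percentage of the total runtime."""
    if total_runtime_min <= 0:
        raise ValueError("total runtime must be positive")
    if not 0 <= gc_time_min <= total_runtime_min:
        raise ValueError("GC time must be between 0 and the total runtime")
    gc_percent = (gc_time_min / total_runtime_min) * 100  # share of time lost to GC
    return 100 - gc_percent


# The example from the text: 60 minutes of runtime, 2 minutes spent in GC.
print(f"{gc_throughput(60, 2):.2f}%")  # prints 96.67%
```

Tracking this number over time, rather than looking at a single value, is what makes a degradation visible.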

# 2. GARBAGE COLLECTION LATENCY

Fig: GC Throughput & GC Latency micrometric

# 3. OBJECT CREATION RATE

Fig: Object creation rate micrometric

# 4. PEAK HEAP SIZE

Fig: Peak Heap size micrometric

How to generate memory related micrometrics?

All the memory related micrometrics can be sourced from garbage collection logs.

(1). You can enable garbage collection logs by passing the following JVM arguments:

Up to Java 8:

-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<file-path>

From Java 9:

-Xlog:gc*:file=<file-path>

EXAMPLE 2



Fig: Growing RUNNABLE state Thread count (graph from fastThread.io)

Thread related micrometrics

There are 4 thread related micrometrics that you can monitor:

  • Thread Count
  • Thread States
  • Thread Groups
  • Thread Execution patterns

Let’s discuss them in this section.

# 5. THREAD COUNT

# 6. THREAD STATES

Fig: Thread states micrometric

# 7. THREAD GROUPS

Fig: Thread Group micrometric

# 8. THREAD EXECUTION PATTERNS

Fig: Threads with identical stacktrace

Whenever a significant number of threads start to exhibit identical/repetitive stack traces, it may be indicative of performance problems. Consider these scenarios:

(a). Say your SOR (system of record) or an external service is slowing down; a significant number of threads will then start to wait for its response. In such circumstances, those threads will exhibit the same stack trace.

(b). Say a thread acquired a lock and never released it; several other threads on the same execution path will then get into the BLOCKED state, exhibiting the same stack trace.

(c). If the termination condition of a loop (for loop, while loop, do..while loop) is never met, several threads executing that loop will exhibit the same stack trace.

When any of the above scenarios occurs, the application’s performance and availability will be jeopardized. That is why you might want to focus on thread execution patterns.
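To make the idea concrete, here is a minimal Python sketch that groups threads by identical stack trace. It assumes the thread dump has already been parsed into a mapping of thread name to stack frames; the function name `group_by_stack_trace`, the thread names and the `com.example` frames are all hypothetical.

```python
from collections import defaultdict


def group_by_stack_trace(threads):
    """Group thread names by identical stack trace, largest group first.

    `threads` maps a thread name to its list of stack frames (top frame first).
    """
    groups = defaultdict(list)
    for name, frames in threads.items():
        groups[tuple(frames)].append(name)  # identical frames share one key
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)


# Hypothetical, already-parsed thread dump (all names are made up).
threads = {
    "http-worker-1": ["java.net.SocketInputStream.read", "com.example.SorClient.fetch"],
    "http-worker-2": ["java.net.SocketInputStream.read", "com.example.SorClient.fetch"],
    "http-worker-3": ["java.net.SocketInputStream.read", "com.example.SorClient.fetch"],
    "scheduler-1": ["java.lang.Thread.sleep", "com.example.Scheduler.run"],
}

for frames, names in group_by_stack_trace(threads):
    print(len(names), "thread(s) at", frames[0])
```

A large group at the top of this list, such as many threads waiting on the same socket read, is the kind of pattern described in scenario (a).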

How to generate thread related micrometrics?

All the thread related micrometrics can be sourced from thread dumps:

  1. There are several options to capture a thread dump from your application. You may choose the option that is convenient to you. It’s advisable to capture several thread dumps at intervals of 5 to 10 seconds for analysis.
  2. Once the thread dumps are generated, you can either analyze them manually with thread dump analysis tools such as fastThread.io or use a programmatic REST API. The REST API is useful when you want to automate the report generation process. It can be used in a CI/CD pipeline as well.
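Capturing dumps at a fixed interval can be sketched as follows. This example assumes the JDK’s `jstack` tool is on the PATH, which is just one of the capture options. The function names, the default of 3 dumps and the 10-second interval are assumptions chosen for illustration.

```python
import subprocess
import time


def jstack_dump(pid):
    """Capture one thread dump by running the JDK's `jstack <pid>` command."""
    result = subprocess.run(
        ["jstack", str(pid)], capture_output=True, text=True, check=True
    )
    return result.stdout


def capture_dumps(pid, count=3, interval_sec=10, capture=jstack_dump, sleep=time.sleep):
    """Capture `count` thread dumps, pausing `interval_sec` between captures.

    `capture` and `sleep` are injectable so the loop can be tried out
    without a running JVM.
    """
    dumps = []
    for i in range(count):
        dumps.append(capture(pid))
        if i < count - 1:  # no need to wait after the last dump
            sleep(interval_sec)
    return dumps
```

For a JVM running as process ID 5666, `capture_dumps(5666, interval_sec=5)` would return three dumps taken 5 seconds apart, ready to be saved or uploaded for analysis.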

Network related micrometrics

There are 3 network related micrometrics that you can focus on:

  • TCP/IP connection count by host
  • TCP/IP states
  • Open File descriptors

Let’s discuss them in this section.

# 9. TCP/IP CONNECTION COUNT BY HOST

Modern applications connect with multiple external applications (all the more so in the microservices world, where there is a great deal of external connectivity). Connections are established over various protocols: HTTP, HTTPS, SOAP, REST, JDBC, JMS, Kafka… In this kind of ecosystem, your application’s responsiveness and availability depend on the availability and responsiveness of the external applications as well. Thus, you need to monitor the number of connections established from your application to external applications. If the connection count grows beyond what the normal traffic volume pattern explains, it can be a concern. Whenever an external application slows down, your application may open more and more connections to it to handle the incoming transactions.

You can find the number of established connections to external systems using the ‘netstat’ command, as shown below:

$ netstat -an | grep ESTABLISHED | grep '162.187.223.11' | wc -l

The above command shows the number of established connections to the host ‘162.187.223.11’.
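If you want the count for every remote host at once, the same idea can be sketched in Python. The sketch assumes the Linux-style `netstat -an` column layout (protocol, two queue columns, local address, foreign address, state). The sample output is made up for illustration, and `established_by_host` is a hypothetical helper name.

```python
from collections import Counter


def established_by_host(netstat_output):
    """Count ESTABLISHED TCP connections per remote host."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skip headers and non-TCP lines
        foreign_address, state = fields[4], fields[5]
        if state == "ESTABLISHED":
            host = foreign_address.rsplit(":", 1)[0]  # drop the port
            counts[host] += 1
    return counts


# Made-up sample in the assumed Linux `netstat -an` layout.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:51234          162.187.223.11:443      ESTABLISHED
tcp        0      0 10.0.0.5:51240          162.187.223.11:443      ESTABLISHED
tcp        0      0 10.0.0.5:51198          162.187.223.11:443      TIME_WAIT
tcp        0      0 10.0.0.5:40022          10.1.2.3:5432           ESTABLISHED
"""

for host, count in established_by_host(SAMPLE).most_common():
    print(host, count)
```

Comparing these per-host counts against the normal traffic pattern shows which external application is accumulating connections.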

# 10. TCP/IP STATES

$ netstat -an | grep 'TIME_WAIT' | wc -l

The above command shows the number of connections in the ‘TIME_WAIT’ state. Similarly, you can grep for other states as well.
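Instead of grepping for one state at a time, all states can be tallied in a single pass. The Python sketch below again assumes the Linux-style `netstat -an` column layout, with the state in the sixth column. The sample output is made up, and `connections_by_state` is a hypothetical helper name.

```python
from collections import Counter


def connections_by_state(netstat_output):
    """Count TCP connections per state (ESTABLISHED, TIME_WAIT, ...)."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1  # sixth column holds the TCP state
    return states


# Made-up sample in the assumed Linux `netstat -an` layout.
SAMPLE_STATES = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:51234          10.1.2.3:5432           ESTABLISHED
tcp        0      0 10.0.0.5:51198          10.1.2.4:443            TIME_WAIT
tcp        0      0 10.0.0.5:51199          10.1.2.4:443            TIME_WAIT
tcp        1      0 10.0.0.5:51300          10.1.2.5:443            CLOSE_WAIT
"""

for state, count in connections_by_state(SAMPLE_STATES).most_common():
    print(state, count)
```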

# 11. OPEN FILE DESCRIPTORS

A file descriptor is a handle used to access:

(a). File

(b). Pipe (a mechanism for inter-process communication using message passing, e.g. ls -l | grep key | less)

(c). Network connection

If you notice the file descriptor count growing in your application, it can be a lead indicator that the application isn’t closing resources properly. File descriptors that are left open after use will lead to performance/availability problems.

The command below reports all the open file descriptors for the process ID ‘5666’.

$ lsof -p 5666

If you want to know the number of open file descriptors for the same process, issue the command:

$ lsof -p 5666 | wc -l
153
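On Linux, the same count can also be sampled from a script, which is handy when you want to track it over time. The sketch below assumes a Linux `/proc` filesystem, where each open descriptor of a process appears as an entry under `/proc/<pid>/fd`. The helper name `count_open_fds` and its `proc_root` parameter are assumptions for this example.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count open file descriptors by listing <proc_root>/<pid>/fd (Linux only)."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    return len(os.listdir(fd_dir))


# Try it on the current process, but only where /proc is available.
if os.path.isdir("/proc/self/fd"):
    print(count_open_fds(os.getpid()))
```

The number can differ from the lsof line count, because lsof output also includes a header row and entries such as memory-mapped files. What matters for this micrometric is whether the count keeps growing between samples.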

Storage related micrometrics

There are 3 storage related micrometrics that you can focus on:

  • IOPS
  • Storage Throughput
  • Storage Latency

Let’s discuss them in this section.

# 12. IOPS

IOPS stands for IO operations per second, i.e. the number of read or write operations that can be done in one second. For certain IO operations, the IO request size can be very small. Examples of IO sizes are 4 KB, 8 KB, 32 KB and so on. So, for the same throughput, larger IO request sizes mean fewer IOPS.

# 13. STORAGE THROUGHPUT

Average IO size x IOPS = Throughput in MB/s
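The formula above can be worked through with a short Python sketch. It assumes IO sizes are given in KB and that 1 MB = 1024 KB. The function names and the sample numbers are hypothetical, not measurements.

```python
KB_PER_MB = 1024  # assumption: binary units


def storage_throughput_mb_per_sec(avg_io_size_kb, iops):
    """Throughput in MB/s = average IO size x IOPS."""
    return avg_io_size_kb * iops / KB_PER_MB


def iops_for_throughput(throughput_mb_per_sec, avg_io_size_kb):
    """IOPS needed to sustain a given throughput at a given IO size."""
    return throughput_mb_per_sec * KB_PER_MB / avg_io_size_kb


# 2000 IOPS at an average IO size of 8 KB.
print(storage_throughput_mb_per_sec(8, 2000))  # prints 15.625

# The same 100 MB/s needs far fewer operations with larger requests.
print(iops_for_throughput(100, 4))   # prints 25600.0
print(iops_for_throughput(100, 32))  # prints 3200.0
```

The last two lines show why IOPS and throughput should be read together: at a fixed throughput, a larger IO request size translates into fewer IOPS.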

# 14. STORAGE LATENCY

Database related micrometrics

# 15. DB – LOCKS

# 16. LONG RUNNING QUERIES

# 17. EVICTIONS OF IN-MEMORY TABLES

# 18. HITS/MISS RATIO

Conclusion