A travel organization in North America found its microservices unresponsive because they were running too many threads. The Site Reliability Engineering (SRE) team analyzed thread dumps and found 2,319 threads stuck waiting for network responses, which they traced to a problem with a Cassandra database. Fixing a disk space shortage restored normal performance and helped prevent future problems. Thread dump analysis was what let the team reach the root cause quickly.
Analyzing thread dumps is essential for troubleshooting production problems. A single dump may contain hundreds or thousands of threads, each in one of six states: NEW, RUNNABLE, WAITING, TIMED_WAITING, BLOCKED, or TERMINATED. The fastThread application offers a comparative summary view that shows how the number of threads in each state changes across multiple thread dump snapshots, which makes trends such as a growing pool of waiting threads easier to spot.
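
To make the idea of a comparative view concrete, here is a minimal sketch that counts thread states in each snapshot and lines the counts up side by side. It is not fastThread's implementation. It assumes the dumps are in the text format produced by the JDK's jstack tool, where each thread has a "java.lang.Thread.State:" line. The sample snapshots and the names count_states and compare_snapshots are hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Matches the state line jstack prints under each thread header, for example
# "   java.lang.Thread.State: TIMED_WAITING (sleeping)". Only the state name
# is captured; the parenthesised detail is ignored.
STATE_LINE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+([A-Z_]+)", re.MULTILINE)


def count_states(dump_text):
    """Return a Counter mapping each thread state to its thread count."""
    return Counter(STATE_LINE.findall(dump_text))


def compare_snapshots(dumps):
    """Return {state: [count in first dump, count in second dump, ...]}."""
    counts = [count_states(d) for d in dumps]
    states = sorted(set().union(*counts))
    # A Counter returns 0 for states that are absent from a snapshot.
    return {state: [c[state] for c in counts] for state in states}


# Hypothetical, heavily shortened snapshots (assumption: jstack-style text).
SNAPSHOT_1 = """
"worker-1" #21 prio=5
   java.lang.Thread.State: RUNNABLE
"worker-2" #22 prio=5
   java.lang.Thread.State: WAITING (parking)
"""

SNAPSHOT_2 = """
"worker-1" #21 prio=5
   java.lang.Thread.State: RUNNABLE
"worker-2" #22 prio=5
   java.lang.Thread.State: RUNNABLE
"worker-3" #23 prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
"""

if __name__ == "__main__":
    for state, per_dump in compare_snapshots([SNAPSHOT_1, SNAPSHOT_2]).items():
        print(f"{state:<14}", per_dump)
```

A state whose count keeps climbing from one snapshot to the next is usually the place to start reading stack traces.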
