Christian Vögele
16 July 2018
5 min

Automatic Problem Diagnosis with inspectIT

With release 1.8 of our Application Performance Management (APM) tool inspectIT, we integrated a new feature named Automatic Problem Diagnosis.

APM tools mainly provide alerting and visualization capabilities in order to detect performance problems during operation. However, isolating and diagnosing the actual root cause of a problem is still a highly manual and repetitive task that requires a lot of APM expert knowledge. In response to this challenge, we implemented the Automatic Problem Diagnosis feature to automate this recurring APM activity. The feature was developed as part of the diagnoseIT project funded by the German Federal Ministry of Education and Research (BMBF).

Take a Break, Let’s Diagnose It Automatically

Our Automatic Problem Diagnosis feature automatically analyses traces recorded by inspectIT for performance problems and the root cause(s) of each problem. Problems and root causes are then persisted in InfluxDB and visualized using Grafana. An integration into the inspectIT UI is not yet available, but it is planned for the future.

We implemented a set of generic rules that find performance problems in any type of trace. The results are then grouped by problem and root cause, as multiple independent traces may exhibit the same problems. inspectIT structures the results of the “Automatic Problem Diagnosis” in the following hierarchy:

  • Global Context: The deepest node in a trace that contains all performance problems
  • Problem Context: The deepest node in a trace that contains one specific performance problem
  • Root Cause: The calls to a specific method that characterize a performance problem. A Problem Context can have one or multiple Root Causes. Calls to a specific method are considered a Root Cause when their aggregated response times account for at least 80 % of the duration of the trace. As an example, assume a trace contains calls to three different methods a(), b(), and c(). First, we sum the response times per method and sort the methods by their aggregated response time in descending order. Afterwards, we iteratively cumulate the percentage share of each method. We stop as soon as at least 80 % of the trace duration is explained (see the sketch after this list). In this example, the calls to a() and b() are considered Root Causes; the calls to c() are not considered any further.

    Method | Sum of response times (ms) | Percentage | Cumulative percentage
    a      | 180                        | 60 %       | 60 %
    b      | 90                         | 30 %       | 90 %
    c      | 30                         | 10 %       | 100 %
  • Root Cause Type: Defines the type of the root cause. Currently, we distinguish between 3 different types:
    • Single: One specific method call is responsible for the problem
    • Recursive: One specific method is called recursively (with a minimum stack size of 2)
    • Iterative: One specific method is called iteratively (at least 15 times)
  • Root Source Type: Defines the source of the root cause. Currently, we distinguish between 3 different types:
    • Http: Calls of the root cause are HTTP calls
    • Database: Calls of the root cause are database calls
    • Timerdata: All other calls.
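
To make the Root Cause selection and the Root Cause Types above more concrete, the following Java sketch shows how the 80 % rule and the type classification could be implemented. The class and method names (MethodCall, selectRootCauses, classifyRootCauseType) are illustrative only and are not the actual inspectIT classes; the thresholds mirror the values listed above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; these are NOT the actual inspectIT classes.
public class RootCauseSelection {

    /** One invocation of a method within a trace (hypothetical data structure). */
    static class MethodCall {
        final String methodName;
        final double responseTimeMs;

        MethodCall(String methodName, double responseTimeMs) {
            this.methodName = methodName;
            this.responseTimeMs = responseTimeMs;
        }
    }

    /**
     * Selects the methods whose aggregated response times explain at least
     * 80 % of the trace duration, following the procedure described above.
     */
    static List<String> selectRootCauses(List<MethodCall> calls, double traceDurationMs) {
        // 1. Sum the response times per method.
        Map<String, Double> aggregated = new HashMap<>();
        for (MethodCall call : calls) {
            aggregated.merge(call.methodName, call.responseTimeMs, Double::sum);
        }

        // 2. Sort the methods by aggregated response time, descending.
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(aggregated.entrySet());
        sorted.sort((e1, e2) -> Double.compare(e2.getValue(), e1.getValue()));

        // 3. Cumulate the shares until at least 80 % of the trace duration is explained.
        List<String> rootCauses = new ArrayList<>();
        double cumulatedMs = 0;
        for (Map.Entry<String, Double> entry : sorted) {
            rootCauses.add(entry.getKey());
            cumulatedMs += entry.getValue();
            if (cumulatedMs / traceDurationMs >= 0.8) {
                break;
            }
        }
        return rootCauses;
    }

    /** Classifies the Root Cause Type using the thresholds mentioned above. */
    static String classifyRootCauseType(int invocationCount, int maxStackDepth) {
        if (maxStackDepth >= 2) {
            return "Recursive"; // the method appears nested within itself
        }
        if (invocationCount >= 15) {
            return "Iterative"; // the method is called at least 15 times
        }
        return "Single";
    }
}
```

Applied to the table above, a() and b() together already explain 90 % of the trace duration, so the loop stops and c() is not considered any further.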

The following example visualizes the results for one trace containing 3 different performance problems.
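
Sketched as a call tree (the exact nesting is assumed here for illustration), the trace looks roughly like this:

    A                                   (Global Context)
    ├── B                               (Problem Context)
    │   └── calculate()                 (Root Cause, Single)
    ├── D                               (Problem Context)
    │   └── search() → search() → …     (Root Cause, Recursive)
    └── searchInDB()                    (Problem Context)
        └── executeQuery(), executeQuery(), …   (Root Cause, Iterative)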

  • Method A is the Global Context, as it contains all three Problem Contexts.
  • Method B is the Problem Context of calculate(). As calculate() is only called once, its Root Cause Type is Single.
  • Method D is the Problem Context of the search() method calls which are called in a Recursive way.
  • Method searchInDB() is the Problem Context of the executeQuery() method calls which are called in an Iterative way (a typical N+1 problem pattern).

Configuring and Enabling the Automatic Problem Diagnosis

In the inspectIT UI, the diagnosis feature can be enabled in the preference settings, where you can also set the minimum duration a trace must have to be diagnosed. The default baseline is 1000 ms. Be careful when configuring this value: if the baseline is too low, many requests will be analyzed, which increases the load on inspectIT. Furthermore, you need to activate InfluxDB in the Database Settings, as the results are stored in InfluxDB and visualized using Grafana.

For demonstration purposes, we now have a look at a demo application called “DVD_Store” that we instrumented with inspectIT. We put some load on the application and analyze the requests in inspectIT. In the inspectIT UI, we see several requests whose duration is higher than the baseline of 1000 ms.

But why are these requests slower than the others? Let’s have a closer look at this in Grafana.

Visualize Performance Problems

In Grafana, we provide a sample dashboard for the detected performance problems (download here). If required, you can easily adapt this dashboard to your needs. The dashboard visualizes the number of found problems as well as their mean response times. In the table at the bottom (Applications and Business Transactions of Found Problems) you can see that all slow requests belong to the business transaction “/dvdstore/browse”.

The root cause table (RootCauses of Found Problems) shows that the Problem Context is the method service() of the class FacesServlet. For simplicity, we do not show the Global Context here. Additionally, we can see that there are a lot of single and iterative JDBC database calls related to the “/dvdstore/browse” business transaction.
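
If you want to inspect the stored diagnosis results outside of Grafana, you can also query InfluxDB directly. The following sketch uses the influxdb-java client; the connection parameters as well as the database, measurement, tag, and field names ("inspectit", "problems", "businessTransaction", "duration") are assumptions for illustration and need to be replaced by the schema your inspectIT installation actually writes.

```java
import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Query;
import org.influxdb.dto.QueryResult;

public class ProblemQueryExample {

    public static void main(String[] args) {
        // Placeholder connection data; use the credentials of your own InfluxDB instance.
        InfluxDB influxDB = InfluxDBFactory.connect("http://localhost:8086", "admin", "admin");

        // Hypothetical query: mean duration of the diagnosed problems of the last hour,
        // grouped by business transaction. Measurement and field names are assumed.
        Query query = new Query(
                "SELECT MEAN(\"duration\") FROM \"problems\""
                        + " WHERE time > now() - 1h GROUP BY \"businessTransaction\"",
                "inspectit");

        QueryResult result = influxDB.query(query);
        result.getResults().forEach(r -> {
            if (r.getSeries() != null) {
                r.getSeries().forEach(series ->
                        System.out.println(series.getTags() + " -> " + series.getValues()));
            }
        });

        influxDB.close();
    }
}
```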

Back in the inspectIT UI, we now have a look at the database calls for a slow request of the business transaction “/dvdstore/browse”. In this example, one SQL statement had a response time of 773 ms, while another statement had low response times but was called 541 times. We could have analyzed each slow request manually in this way, but it would have been much harder to identify problems that relate to multiple requests.

Outlook

Using the Automatic Problem Diagnosis feature can help you find performance problems much faster than manual inspection. In the future, we plan to extend this feature with anomaly detection: the current constant baseline value can then be replaced by a dynamically learned baseline for the response time of the application, so that the baseline adapts automatically and does not have to be configured manually. Furthermore, we want to make the threshold values of the generic rules configurable. Moreover, we plan to adapt our Automatic Problem Diagnosis implementation to other tracers such as Zipkin or Jaeger, as these tracers do not provide this kind of feature yet.

Tell us what you think! If you are interested in automatic root cause detection or APM in general and would like to know more, talk to us in the comment section below or reach out to us via email at apm@novatec-gmbh.de.
