Why JVMTI sampling is better than async sampling on modern JVMs

In recent years, “async sampling” has been hyped as a better way of CPU profiling on the JVM. While this has been true for some time, it is no longer the case. This blog post explains the history of sampling and the current state of the art.

The problem

The fundamental operation of a CPU profiler is to associate time measurements with call stacks. To obtain call stacks from all live threads, most CPU profilers perform sampling. Sampling means that data is measured periodically rather than continuously. However, this measurement is not trivial because the JVM does not store call stack information for easy access via an API.
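To make the idea of associating time with call stacks concrete, here is a minimal sketch of a naive in-process sampler written in plain Java. It is not how JProfiler or any native profiling agent works: real profilers obtain stacks through native interfaces, and the class name NaiveSampler and its methods are made up for this illustration. Each sample attributes one sampling interval to the method at the top of every runnable thread's stack.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: a sampler that estimates time per method
// by attributing one sampling interval to the top frame of each sampled stack.
final class NaiveSampler {
    private final long intervalMillis;
    private final Map<String, Long> estimatedMillisByMethod = new HashMap<>();

    NaiveSampler(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Attribute one sampling interval to the method on top of the given stack.
    void record(StackTraceElement[] stack) {
        if (stack.length == 0) {
            return; // no Java frames available for this thread
        }
        String method = stack[0].getClassName() + "." + stack[0].getMethodName();
        estimatedMillisByMethod.merge(method, intervalMillis, Long::sum);
    }

    // Take one sample of all live threads that are currently in the RUNNABLE state.
    void sampleOnce() {
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            if (entry.getKey().getState() == Thread.State.RUNNABLE) {
                record(entry.getValue());
            }
        }
    }

    long estimatedMillis(String method) {
        return estimatedMillisByMethod.getOrDefault(method, 0L);
    }

    public static void main(String[] args) throws InterruptedException {
        NaiveSampler sampler = new NaiveSampler(10);
        for (int i = 0; i < 100; i++) {
            sampler.sampleOnce();
            Thread.sleep(10);
        }
        sampler.estimatedMillisByMethod.forEach(
            (method, millis) -> System.out.println(millis + " ms  " + method));
    }
}
```

The estimate is only as good as the samples: anything that influences when and where a stack can be captured shows up directly in the reported times.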

The JVM compiles Java bytecode to native code for hot execution paths. The stack traces then have a native part that needs to be translated back to Java in order to be useful. Also, it is not possible to ask a running thread what it is currently doing; you have to interrupt it to get a defined state. Depending on how you do it, this introduces an observer effect that can severely alter the execution of the program.

Historically, sampling could only be done at a global safepoint. Global safepoints are states where the JVM can pause all threads for operations that require a consistent view of memory. At a safepoint, all Java threads are paused, ensuring they are not in the middle of modifying shared data. While safepoints originated from the stop-the-world phase of garbage collectors, they could also be used for other purposes, such as code deoptimization, class redefinition, and finally to safely get all stack traces for the purposes of sampling.

In contrast to garbage collector activity, sampling is performed quite frequently, on the order of once per millisecond. Requesting a global safepoint so many times per second can cause substantial overhead and a severe distortion of the observed hot spots. The adverse effects of global safepoints are especially pronounced for heavily multithreaded applications, where safepoints can limit concurrency, reduce thread cooperation and increase contention and synchronization overhead. The observed hot spots will then be skewed towards the safepoints, an effect known as safepoint bias.

Async profiling to the rescue

Unhappy with this state of affairs, the HotSpot JVM developers added the experimental AsyncGetCallTrace API that allowed profilers to get the stack trace of threads without requiring a safepoint. On Unix systems, profilers can use various signal mechanisms to periodically interrupt a running thread, execute a handler on the interrupted thread, and call this API from the handler.

While this adds minimal overhead, it is unfortunately not an ideal solution. The main problem is that the interrupted thread and the JVM in general are in an unsafe state. The handler must be very careful not to perform any operation that might crash the process. These restrictions, especially regarding memory access and allocations, make some advanced features of a profiler infeasible. Also, the retranslation of the native stack traces back to Java stack traces results in a lot of stack traces being truncated or otherwise invalid. In addition, async sampling is also sampling from the pool of all live threads, so rarely scheduled threads will sometimes be missed completely.

JProfiler supports JVMTI sampling as well as async sampling, and as part of our tests we perform a lot of data comparisons. While async sampling is near-zero overhead and eliminates safepoint bias, it introduces a certain “trashiness” to the data and places limits on which features are possible to implement. For a long time, it seemed like an unavoidable trade-off with only bad options. One could choose one or the other sampling method based on the type of application and accept the corresponding drawbacks.

ZGC and thread-local handshakes

In the quest for an ultra-low latency garbage collector, the JVM developers needed a way to perform operations on individual threads without requiring a global VM safepoint. In Java 10, JEP 312 (“Thread-Local Handshakes”) was delivered with no concrete use cases. While intriguing at the time, we were unable to make any use of this feature because it was internal to the JVM and not available for profiling agents.

The only publicly mentioned purpose in JEP 312 was that it blocked JEP 333 for the Z Garbage Collector (ZGC). In Java 11, ZGC was delivered as one of the big-ticket items, and thread-local handshakes helped it push maximum GC pause times below 10 ms.

ZGC continued to be improved, and JEP 376 aimed to evict its last major work item from global safepoints: The processing of thread-related GC roots was moved to a concurrent phase. That JEP included a goal to “provide a mechanism by which other HotSpot subsystems can lazily process stacks” and issues like https://bugs.openjdk.org/browse/JDK-8248362 added this capability to the JVMTI, the interface that is used by native profiling agents.

JVMTI sampling strikes back

With Java 16, JEP 376 was delivered, and it became possible for profiling agents to use the lazy stack processing based on thread-local handshakes and avoid global safepoints for sampling. JVMTI sampling (called “full sampling” in JProfiler) is now comparable in overhead to async sampling and, given the frequency of local safepoints, the remaining local safepoint bias is irrelevant for the vast majority of applications.

Let’s compare a real-world use case. A multithreaded Maven compilation for a medium-sized project was recorded with both JVMTI sampling and async sampling. The overhead is not measurably different, and the hot spot distribution is very similar. See the “Hot spots” view in JProfiler below, first for JVMTI sampling:

Sampling with JVMTI

… and then for async sampling:

Is the difference the remaining safepoint bias? Not necessarily so, and most probably not. There are two other important factors at play: First, async sampling measures threads that are actually running while JVMTI sampling measures if threads are scheduled for execution, that is if they are “runnable”. The JProfiler UI reflects this in the labelling of the thread status.

Second, async sampling operations do not work all the time. The technical “outages” are summed up at the bottom of the call tree without contributing to the total percentage:

These are substantial times and they do contribute some skew. So while async sampling measures actual CPU times, they are only partial CPU times and not a more useful measure than the runnable time measured by JVMTI sampling.

As an example of a feature that async sampling cannot provide, try to change the thread state to “Waiting”, “Blocking” or “Net I/O”. With async sampling, these are not available, unlike with JVMTI sampling.

When working with databases or REST services, having access to the “Net I/O” state is an invaluable benefit, though, because these are the times waiting for the external service.

What remains for async sampling? Async sampling can profile native stack traces, so if you are interested in that, it remains a useful tool. Other than that, full sampling is now by far the better alternative, in terms of data quality and available features, without compromising on overhead.

We invite you to try it out in the latest JProfiler release. Happy profiling!

How invokedynamic makes lambdas fast

Recently, we have been at work rewriting our website in Kotlin. Instead of a view technology that uses string templates with embedded logic, we now use the Kotlin HTML builder to develop views as pure Kotlin code. This has a number of advantages, like being able to easily refactor common code. Also, the performance of such views is much better than that of string templates, which contain interpreted code snippets.

When measuring the performance, we noticed that a lot of anonymous classes were created for our views and that their loading time was significant. Code that uses the Kotlin HTML builder is very lambda-heavy, and as of Kotlin 1.9, lambdas are compiled to anonymous classes by default. The JVM has a sophisticated mechanism to avoid creating such classes at compile time: the invokedynamic instruction together with the LambdaMetafactory, which Java has used for lambdas since Java 8. The JVM developers also claimed that the performance would be better than with anonymous classes. So why does Kotlin not use that?

As it turns out, Kotlin can optionally compile lambdas with invokedynamic in the same way that Java does, by passing -Xlambdas=indy to the Kotlin compiler. This has been supported since Kotlin 1.5 and will become the default in the upcoming Kotlin 2.0 release. The great thing about having both compilation strategies available is that we can measure how anonymous classes and invokedynamic compare in a real-world example.
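The two strategies can also be observed from plain Java, where javac compiles anonymous classes to separate class files and lambdas to invokedynamic call sites. The following is a small sketch for illustration; the class and method names are made up, and the exact runtime name of the lambda class is a JVM implementation detail that may differ between versions.

```java
// Hypothetical illustration: an anonymous class exists as its own class file,
// while the class behind a lambda is only created at runtime.
final class LambdaVsAnonymous {

    // javac emits a separate class file (LambdaVsAnonymous$1.class) for this.
    static Runnable anonymous() {
        return new Runnable() {
            @Override
            public void run() {
            }
        };
    }

    // javac emits an invokedynamic instruction plus a private static method
    // in LambdaVsAnonymous itself; no extra class file is written.
    static Runnable lambda() {
        return () -> {
        };
    }

    public static void main(String[] args) {
        System.out.println("anonymous: " + anonymous().getClass().getName());
        System.out.println("lambda:    " + lambda().getClass().getName());
        System.out.println("anonymous class? " + anonymous().getClass().isAnonymousClass());
        System.out.println("lambda class anonymous? " + lambda().getClass().isAnonymousClass());
    }
}
```

Listing the compiler output directory after compiling this file shows one additional class file for the anonymous class and none for the lambda.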

First of all, the number of classes for our entire website project was reduced by 60% (!) when compiling with -Xlambdas=indy. Here you can see the list of classes for our store view with both compilation modes:

For that particular view, the cold rendering time was improved by 20%. This was simply measured by wrapping the rendering with a measurement function. How about the warmed-up rendering time? The times there are much shorter, and we need to introduce some statistics by making many invocations. This is easily done with JProfiler and has the added benefit that we can also compare the internal call structure.

With the default compilation to anonymous classes, we recorded 50 invocations after warm-up and got 12.0 ms per rendering:

With invokedynamic compilation, the time per rendering was 10.6 ms:

This is an improvement of about 12%, which is surprisingly large. This test does not measure the difference between the invocation mechanisms in isolation; it includes the actual work that is done to render the store view. Against that baseline duration, a speed-up of 12%, achieved just by changing the compilation mode for lambdas, is quite impressive. Many DSL-based libraries in Kotlin are lambda-heavy, so other use cases may produce similar numbers.
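For readers who want to reproduce this kind of comparison without a profiler, a measurement wrapper and the relative improvement can be sketched as follows. This is not the code used for the numbers above; RenderTiming and its methods are hypothetical names, and a simple System.nanoTime() loop is a much coarser tool than a profiler.

```java
// Hypothetical sketch of a measurement wrapper; not the code used for the article.
final class RenderTiming {

    // Runs the task warmUp times without measuring, then returns the mean
    // duration in milliseconds over the given number of measured invocations.
    static double averageMillis(Runnable task, int warmUp, int invocations) {
        for (int i = 0; i < warmUp; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < invocations; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / 1_000_000.0 / invocations;
    }

    // Relative improvement in percent when going from `before` to `after`.
    static double improvementPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for rendering a view.
        Runnable render = () -> {
            StringBuilder html = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                html.append("<li>").append(i).append("</li>");
            }
        };
        System.out.printf("average: %.3f ms%n", averageMillis(render, 100, 50));
        System.out.printf("improvement: %.1f%%%n", improvementPercent(12.0, 10.6));
    }
}
```

With the two measured values from above, (12.0 - 10.6) / 12.0 comes out at roughly 11.7%, which is the figure rounded to 12% in the text.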

By looking at the call tree, we can see that the version with anonymous classes makes three calls instead of one. First, it instantiates the anonymous class and passes all the captured parameters:

Then it calls a bridge method without the captured parameters, which in turn calls the actual implementation:

Looking at the bytecode, we can see that a number of instructions are required to store the captured parameters into fields, and the bridge method also contains instructions that add to the overhead.
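The shape of such a generated class can be approximated in hand-written Java. This is only a rough analogy, not the actual output of the Kotlin compiler: the names AnonymousClassShape and GreetingLambda are made up, and java.util.function.Function stands in for the Kotlin function interface. It shows the three calls described above: the constructor that stores the captured value, the bridge method and the actual implementation.

```java
import java.lang.reflect.Method;
import java.util.function.Function;

// Hypothetical hand-written analogy of a lambda that was compiled to a class.
final class AnonymousClassShape {

    static final class GreetingLambda implements Function<String, String> {
        private final String prefix; // captured variable, stored in a field

        // Call 1: instantiation, passing all captured parameters.
        GreetingLambda(String prefix) {
            this.prefix = prefix;
        }

        // Call 3: the actual implementation with the specific parameter type.
        @Override
        public String apply(String name) {
            return prefix + name;
        }

        // Call 2: javac additionally emits a synthetic bridge method
        // apply(Object) that casts its argument and delegates to apply(String).
    }

    // Counts the compiler-generated bridge methods declared by a class.
    static long bridgeMethodCount(Class<?> type) {
        long count = 0;
        for (Method method : type.getDeclaredMethods()) {
            if (method.isBridge()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Function<String, String> greeting = new GreetingLambda("Hello, ");
        System.out.println(greeting.apply("JVM"));
        System.out.println("bridge methods: " + bridgeMethodCount(GreetingLambda.class));
    }
}
```

Callers that only know the generic interface go through the bridge method, which is why it appears as a separate node in the call tree.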

With invokedynamic compilation, the generated lambda methods are in the same class:

This works because the lambda instances are created by invokedynamic calls to so-called bootstrap methods.

Bootstrap methods are structures in the class file that contain signature information for the lambda and a method handle reference to a static method. The LambdaMetafactory then efficiently creates an instance for the lambda.
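What the invokedynamic call site does on first execution can be imitated by calling the LambdaMetafactory directly. The following is a sketch under simplifying assumptions: MetafactoryDemo and its render method are made-up names, the static method stands in for the compiler-generated lambda body, and its first parameter plays the role of a captured variable. In compiled code, the JVM performs the bootstrap step itself and caches the result in the call site.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.LambdaMetafactory;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.function.Function;

// Hypothetical illustration of the bootstrap step behind a lambda.
final class MetafactoryDemo {

    // Stands in for the static method that the compiler generates for a lambda body.
    // The first parameter plays the role of a captured variable.
    static String render(String prefix, String name) {
        return prefix + name;
    }

    @SuppressWarnings("unchecked")
    static Function<String, String> createLambda(String captured) {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodHandle implementation = lookup.findStatic(
                MetafactoryDemo.class, "render",
                MethodType.methodType(String.class, String.class, String.class));

            CallSite callSite = LambdaMetafactory.metafactory(
                lookup,
                "apply",                                              // name of the interface method
                MethodType.methodType(Function.class, String.class),  // captured values -> lambda instance
                MethodType.methodType(Object.class, Object.class),    // erased signature of Function.apply
                implementation,                                       // handle to the static method
                MethodType.methodType(String.class, String.class));   // signature enforced at invocation

            // Invoking the call site target creates the lambda instance,
            // binding the captured value without writing it to a hand-made field.
            return (Function<String, String>) callSite.getTarget().invoke(captured);
        } catch (Throwable t) {
            throw new IllegalStateException("Could not create lambda", t);
        }
    }

    public static void main(String[] args) {
        Function<String, String> lambda = createLambda("Hello, ");
        System.out.println(lambda.apply("JVM"));
    }
}
```

The method handle and the two signatures correspond to the information that the bootstrap method entry in the class file carries for each lambda.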

This intricate mechanism makes lambda calls on the JVM as fast as they are – and from Kotlin 2.0 on this will be the default for Kotlin as well.

Garbage collector analysis in JProfiler

This screencast shows how to use the garbage collector probe in JProfiler. Having access to detailed information about the overall activity of the GC, as well as about individual garbage collections, is crucial for tuning the garbage collector and achieving optimal performance for your application.

Recording JFR snapshots with JProfiler

This screencast shows JProfiler’s versatile functionality as a JFR recording controller. As an example, a JFR snapshot is recorded on a Kubernetes cluster and then shown in JProfiler. In this context, you can see the wizard for configuring JFR recording settings. In addition, JFR recordings of terminated JVMs and the handling of externally started JFR recordings are demonstrated.

Enhanced JFR snapshot analysis with JProfiler

JProfiler has excellent support for viewing JFR snapshots. This screencast focuses on the event browser, which is specific to JFR snapshots, and also gives an overview of the other view sections that offer some of the same views as regular profiling sessions.

Working with probe events in JProfiler

Probe events are of great help in debugging specific performance problems. To find events of interest, JProfiler gives you a lot of tools to narrow down the set of displayed events.

This screencast shows the HTTP server and HTTP client probes, the JDBC and JPA/Hibernate probes as well as the socket probe when profiling a real-world application. The various ways of filtering probe events as well as duration and throughput histograms are explained.

The profiled application is the CommaFeed RSS reader.

Customizing telemetries in JProfiler

Telemetries are an essential feature of a profiler. They help you get an idea of when things happen in the profiled JVM and how various subsystems are correlated.

This screencast shows how to customize the telemetries section in JProfiler by adding probe telemetries. It discusses bookmarks, recording actions and setting time range filters for probe events in probe telemetries.

Improved Kubernetes authentication handling in JProfiler

Since version 13, JProfiler supports profiling on Kubernetes clusters with no extra configuration.

JProfiler 13.0.6 added an important improvement for profiling a JVM in Kubernetes clusters where authentication is set up in such a way that the authentication plugin prints instructions on stdout.

For example, when the Azure Kubernetes Service (AKS) is configured to authenticate with Active Directory (AD), users will be required to authenticate their identity using the kubelogin authentication plugin. The plugin will initiate a multi-factor authentication (MFA) process by prompting the user to open a URL and enter a one-time code. Previously, the information was not visible when JProfiler was making a connection to a Kubernetes cluster.

Since JProfiler 13.0.6, the progress dialog will be expanded if such output is detected. The command line output is then shown in a text area where text can be copied to the clipboard and links can be opened directly in the system web browser:

After completing the instructions, the connection will be made and the pods in the cluster will be listed.

Profiling Java applications in a Kubernetes cluster

This screencast shows how you can profile JVMs running in a Kubernetes cluster with JProfiler. A profiling session with a note-taking demo application is started from the IDE, which provides additional benefits like source code navigation and the automatic detection of profiled packages. Also, a standalone session is started, where an additional SSH connection is required to reach the kubectl command that can connect to the cluster.