Why JVMTI sampling is better than async sampling on modern JVMs

In recent years, “async sampling” has been hyped as a better way of CPU profiling on the JVM. While that was true for some time, it is no longer the case. This blog post explains the history of sampling and the current state of the art.

The problem

The fundamental operation of a CPU profiler is to associate time measurements with call stacks. To obtain call stacks from all live threads, most CPU profilers perform sampling. Sampling means that data is measured periodically rather than continuously. However, this measurement is not trivial because the JVM does not store call stack information for easy access via an API.

The JVM compiles Java bytecode to native code for hot execution paths, so stack traces have a native part that needs to be translated back to Java frames in order to be useful. Also, it is not possible to ask a running thread what it’s currently doing; you have to interrupt it to get a defined state. Depending on how you do that, this introduces an observer effect that can severely alter the execution of the program.

Historically, sampling could only be done at a global safepoint. Global safepoints are states where the JVM can pause all threads for operations that require a consistent view of memory. At a safepoint, all Java threads are paused, ensuring they are not in the middle of modifying shared data. While safepoints originated from the stop-the-world phase of garbage collectors, they were also used for other purposes, such as code deoptimization, class redefinition and, finally, safely collecting all stack traces for sampling.

In contrast to garbage collector activity, sampling is performed quite frequently, on the order of once per millisecond. Requesting a global safepoint so many times per second can cause substantial overhead and a severe distortion of the observed hot spots. The adverse effects of global safepoints are especially pronounced for heavily multithreaded applications, where safepoints can limit concurrency, reduce thread cooperation and increase contention and synchronization overhead. The observed hot spots will then be skewed towards the safepoints, an effect known as safepoint bias.
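
To illustrate why this is problematic, here is a minimal sketch of a naive safepoint-based sampler using only public JDK APIs; this is not how JProfiler’s agent works. On HotSpot, Thread.getAllStackTraces() is backed by a VM operation that brings all Java threads to a safepoint, so sampling at the millisecond intervals mentioned above pauses the whole JVM again and again:

import java.util.concurrent.Executors
import java.util.concurrent.ScheduledFuture
import java.util.concurrent.TimeUnit

// Naive in-process sampler: every sample brings all Java threads to a global
// safepoint, which is the source of the overhead and safepoint bias described above.
fun startNaiveSampler(
    periodMillis: Long = 1,
    sink: (Map<Thread, Array<StackTraceElement>>) -> Unit
): ScheduledFuture<*> {
    val executor = Executors.newSingleThreadScheduledExecutor { r ->
        Thread(r, "naive-sampler").apply { isDaemon = true }
    }
    return executor.scheduleAtFixedRate(
        { sink(Thread.getAllStackTraces()) },
        0, periodMillis, TimeUnit.MILLISECONDS
    )
}

// Usage: print the top frame of every runnable thread for each sample
fun main() {
    startNaiveSampler { samples ->
        samples.forEach { (thread, stack) ->
            if (thread.state == Thread.State.RUNNABLE && stack.isNotEmpty()) {
                println("${thread.name}: ${stack[0]}")
            }
        }
    }
    Thread.sleep(100)
}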

Async profiling to the rescue

Unhappy with this state of affairs, the HotSpot JVM developers added the experimental AsyncGetCallTrace API that allows profilers to get the stack trace of a thread without requiring a safepoint. On Unix systems, profilers can use signal mechanisms to periodically interrupt a running thread and call this API from a handler that executes on the interrupted thread.

While this adds minimal overhead, it is unfortunately not an ideal solution. The main problem is that the interrupted thread, and the JVM in general, are in an unsafe state. The handler must be very careful not to perform any operation that might crash the process. These restrictions, especially regarding memory access and allocations, make some advanced profiler features unfeasible. Also, the translation of the native stack back to a Java stack trace results in many stack traces being truncated or otherwise invalid. In addition, async sampling draws from the pool of all live threads, so rarely scheduled threads will sometimes be missed completely.

JProfiler supports JVMTI sampling as well as async sampling, and as part of our tests we perform a lot of data comparisons. While async sampling is near-zero overhead and eliminates safepoint bias, it introduces a certain “trashiness” to the data and places limits on which features are possible to implement. For a long time, it seemed like an unavoidable trade-off with only bad options. One could choose one or the other sampling method based on the type of application and accept the corresponding drawbacks.

ZGC and thread-local handshakes

In the quest for an ultra-low latency garbage collector, the JVM developers needed a way to perform operations on individual threads without requiring a global VM safepoint. In Java 10, JEP 312 (Thread-Local Handshakes) was delivered, at that point without concrete use cases. While intriguing at the time, we were unable to make any use of this feature because it was internal to the JVM and not available to profiling agents.

The only publicly mentioned purpose in JEP 312 was that it blocked JEP 333 for the Z Garbage Collector (ZGC). In Java 11, ZGC was delivered as one of the big-ticket items, and thread-local handshakes helped it push maximum GC pause times below 10ms.

ZGC continued to be improved, and JEP 376 aimed to evict its last major work item from global safepoints: The processing of thread-related GC roots was moved to a concurrent phase. That JEP included a goal to “provide a mechanism by which other HotSpot subsystems can lazily process stacks” and issues like https://bugs.openjdk.org/browse/JDK-8248362 added this capability to the JVMTI, the interface that is used by native profiling agents.

JVMTI sampling strikes back

With Java 16, JEP 376 was delivered, and it became possible for profiling agents to use lazy stack processing based on thread-local handshakes and avoid global safepoints for sampling. JVMTI sampling (called “full sampling” in JProfiler) is now comparable in overhead to async sampling and, given the frequency of local safepoints, the remaining local safepoint bias is irrelevant for the vast majority of applications.

Let’s compare a real-world use case. A multithreaded Maven compilation for a medium-sized project was recorded with both JVMTI sampling and async sampling. The overhead is not measurably different, and the hot spot distribution is very similar. See the “Hot spots” view in JProfiler below, first for JVMTI sampling:

Sampling with JVMTI

… and then for async sampling:

Is the difference due to the remaining safepoint bias? Not necessarily, and most probably not. There are two other important factors at play: First, async sampling measures threads that are actually running on a CPU, while JVMTI sampling measures threads that are scheduled for execution, that is, threads in the “runnable” state. The JProfiler UI reflects this in the labelling of the thread status.

Second, async sampling operations do not work all the time. The technical “outages” are summed up at the bottom of the call tree without contributing to the total percentage:

These are substantial times, and they do contribute some skew. So while async sampling measures actual CPU times, they are only partial CPU times and not a more useful measure than the runnable times reported by JVMTI sampling.

As an example of a feature that async sampling cannot provide, try changing the thread state to “Waiting”, “Blocking” or “Net I/O”. With async sampling, these states are not available, unlike with JVMTI sampling.

When working with databases or REST services, access to the “Net I/O” state is an invaluable benefit, because these are the times spent waiting for the external service.

What remains for async sampling? Async sampling can profile native stack traces, so if you are interested in that, it remains a useful tool. Other than that, full sampling is now by far the better alternative, in terms of data quality and available features, without compromising on overhead.

We invite you to try it out in the latest JProfiler release. Happy profiling!

How invokedynamic makes lambdas fast

Recently, we have been at work rewriting our website in Kotlin. Instead of a view technology that uses string templates with embedded logic, we now use the Kotlin HTML builder to develop views as pure Kotlin code. This has a number of advantages, like being able to easily refactor common code. Also, the performance of such views is much better than that of string templates, which contain interpreted code snippets.
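
For illustration, here is a small, hypothetical view function in the style of the Kotlin HTML builder (kotlinx.html); it is not taken from our website, but every brace-delimited block in it is a lambda, which shows why builder-based views are so lambda-heavy:

import kotlinx.html.*
import kotlinx.html.stream.createHTML

// Illustrative view: each nested block is a lambda passed to the tag builder
fun productCard(name: String, price: String): String =
    createHTML().div("product") {
        h2 { +name }
        span("price") { +price }
        a(href = "/store/order") { +"Order now" }
    }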

When measuring the performance, we noticed that a lot of anonymous classes were created for our views and that their loading time was significant. Code that uses the Kotlin HTML builder is very lambda-heavy, and as of Kotlin 1.9, lambdas are compiled to anonymous classes by default. The JVM has a sophisticated mechanism, introduced in Java 8, to avoid creating such classes at compile time: invokedynamic together with the LambdaMetafactory. The JVM developers also claimed that its performance would be better than that of anonymous classes. So why does Kotlin not use it?

As it turns out, Kotlin can optionally compile lambdas with invokedynamic in the same way that Java does, by passing -Xlambdas=indy to the Kotlin compiler. This has been supported since Kotlin 1.5 and will become the default in the upcoming Kotlin 2.0 release. The great thing about having both compilation strategies available is that we can compare anonymous classes and invokedynamic in a real-world example.
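
If you want to try this yourself, the flag can be passed through your build tool. A possible Gradle Kotlin DSL snippet, assuming the Kotlin JVM plugin is applied and you are on a Kotlin version where anonymous classes are still the default, could look like this:

// build.gradle.kts
import org.jetbrains.kotlin.gradle.tasks.KotlinCompile

tasks.withType<KotlinCompile>().configureEach {
    kotlinOptions {
        // Compile lambdas with invokedynamic instead of anonymous classes
        freeCompilerArgs = freeCompilerArgs + "-Xlambdas=indy"
    }
}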

First of all, the number of classes for our entire website project was reduced by 60% (!) when compiling with -Xlambdas=indy. Here you can see the list of classes for our store view with both compilation modes:

For that particular view, the cold rendering time improved by 20%. This was simply measured by wrapping the rendering with a measurement function. How about the warmed-up rendering time? The times there are much shorter, so we need to gather some statistics by making many invocations. This is easily done with JProfiler and has the added benefit that we can also compare the internal call structure.
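
As a sketch of that simple wall-clock measurement, with renderStoreView() as a hypothetical stand-in for the real view function:

import kotlin.system.measureTimeMillis

// Hypothetical stand-in for the real store view rendering
fun renderStoreView(): String =
    buildString { repeat(1_000) { append("<li>item ").append(it).append("</li>") } }

fun main() {
    // Cold rendering time: the very first invocation after JVM start
    val cold = measureTimeMillis { renderStoreView() }
    println("Cold rendering: $cold ms")

    // Warm up so that class loading and JIT compilation no longer dominate
    repeat(1_000) { renderStoreView() }

    // Average over many invocations for a stable warmed-up number
    val invocations = 50
    val warm = measureTimeMillis { repeat(invocations) { renderStoreView() } }
    println("Warmed-up rendering: ${warm.toDouble() / invocations} ms per call")
}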

With the default compilation to anonymous classes, we recorded 50 invocations after warm-up and got 12.0 ms per rendering:

With invokedynamic compilation, the time per rendering was 10.6 ms:

This is an improvement of 12%, which is surprisingly large. The test does not measure the difference between the invocation mechanisms in isolation, but includes the actual work that is done to render the store view. Against that baseline duration, a speed-up of 12% just by changing the compilation mode for lambdas is quite impressive. Many DSL-based libraries in Kotlin are lambda-heavy, so other use cases may produce similar numbers.

By looking at the call tree, we can see that the version with anonymous classes makes three calls instead of one: First, it instantiates the anonymous class and passes all the captured parameters:

Then it calls a bridge method without the captured parameters, which in turn calls the actual implementation:

Looking at the bytecode, we can see that a number of instructions are required to store the captured parameters into fields, and the bridge method also contains instructions that add to the overhead.
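
A conceptual sketch, with invented names, of what the anonymous-class strategy generates for a capturing lambda may make this clearer: the constructor copies the captured values into fields, and an erased bridge invoke() forwards to the typed lambda body.

// Conceptual sketch only, not actual compiler output
class StoreItemLambda(
    private val name: String,   // captured parameter stored in a field
    private val price: Int      // captured parameter stored in a field
) : Function1<Any?, Any?> {

    // Erased bridge method: the extra frame that shows up in the call tree
    override fun invoke(p1: Any?): Any? {
        invokeTyped(p1 as StringBuilder)
        return Unit
    }

    // The actual lambda body with the real parameter types
    private fun invokeTyped(sb: StringBuilder) {
        sb.append(name).append(": ").append(price)
    }
}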

With invokedynamic compilation, the generated lambda methods are in the same class:

This works because the lambda instances are created by invokedynamic calls to so-called bootstrap methods.

Bootstrap methods are structures in the class file that contain signature information for the lambda and a method handle reference to a static method. The LambdaMetafactory then efficiently creates an instance for the lambda.
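
To illustrate what that linkage does, here is the equivalent written out by hand against the LambdaMetafactory API. The names LambdaBodies and greeting are invented, and the compiler of course emits a bootstrap method entry in the class file instead of explicit code like this:

import java.lang.invoke.LambdaMetafactory
import java.lang.invoke.MethodHandles
import java.lang.invoke.MethodType
import java.util.function.Supplier

// Static method that would hold the lambda body after desugaring
object LambdaBodies {
    @JvmStatic
    fun greeting(): String = "Hello from a desugared lambda"
}

fun main() {
    val lookup = MethodHandles.lookup()

    // Method handle to the static method containing the lambda body
    val body = lookup.findStatic(
        LambdaBodies::class.java, "greeting",
        MethodType.methodType(String::class.java)
    )

    // The bootstrap call: LambdaMetafactory links a call site whose target
    // creates Supplier instances backed by the method handle above
    val callSite = LambdaMetafactory.metafactory(
        lookup,
        "get",                                        // name of the interface method
        MethodType.methodType(Supplier::class.java),  // factory type: () -> Supplier
        MethodType.methodType(Any::class.java),       // erased signature of Supplier.get()
        body,                                         // implementation method handle
        MethodType.methodType(String::class.java)     // specialized signature at this call site
    )

    @Suppress("UNCHECKED_CAST")
    val supplier = callSite.target.invokeWithArguments() as Supplier<String>
    println(supplier.get())
}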

This intricate mechanism makes lambda calls on the JVM as fast as they are – and from Kotlin 2.0 on this will be the default for Kotlin as well.

Improved Kubernetes authentication handling in JProfiler

Since version 13, JProfiler supports profiling on Kubernetes clusters with no extra configuration.

JProfiler 13.0.6 added an important improvement for profiling a JVM in Kubernetes clusters where authentication is set up in such a way that the authentication plugin prints instructions on stdout.

For example, when the Azure Kubernetes Service (AKS) is configured to authenticate with Active Directory (AD), users will be required to authenticate their identity using the kubelogin authentication plugin. The plugin will initiate a multi-factor authentication (MFA) process by prompting the user to open a URL and enter a one-time code. Previously, the information was not visible when JProfiler was making a connection to a Kubernetes cluster.

Since JProfiler 13.0.6, the progress dialog is expanded if such output is detected. The command line output is then shown in a text area where text can be copied to the clipboard and links can be opened directly in the system web browser:

After completing the instructions, the connection will be made and the pods in the cluster will be listed.

New web license service and improvements for the on-premises server

Customers with floating licenses now have more flexibility: Starting with the most recent releases of JProfiler and install4j, we offer a web license service, so you do not have to install a license server yourself. If you choose that option, you will receive a license key that can be distributed to all developers and is entered just like a single license key. This option requires the ability to make an outgoing HTTP request to our license server.

Going forward, we will be offering both the web-based and the on-premises solution. For the time being, the on-premises option remains the default, and you can contact us if you would like to switch to the web option.


Support for macOS Apple Silicon

Please note: Several JDK providers now offer the macos-aarch64 architecture, so there is no longer any need to create the bundle yourself; install4j can do this for you automatically.

(Edited on 2021-01-07 to include changes for install4j 8.0.10)

Apple machines with the new ARM architecture are now available. While you can run existing x64 binaries on ARM machines through Rosetta, the performance may be impacted significantly. install4j 8.0.9 addresses this concern with support for native ARM binaries.


Introducing perfino

Today we’re releasing a major new product: perfino is a JVM monitoring tool for in-production use. Over the years, we have lost count of the number of times that our customers have asked us how to best deploy JProfiler in production. While our standard response was to recommend a monitoring tool, our customers were not so easily dissuaded. They wanted the power of JProfiler to solve their particular problems.

Out of this dilemma, the idea for perfino was born. Would it be possible to develop a monitoring tool that could be used in production, yet provide a way to escalate from monitoring to profiling if necessary? We are firmly convinced that perfino succeeds with respect to this original goal and provides you with a defence in depth. When a problem is too difficult to solve with monitoring techniques, perfino offers low-risk, low-overhead native JVMTI sampling to get a picture of the entire JVM. If even that is not enough, perfino offers an easy way to attach JProfiler to a problematic JVM. At that point, you have the full arsenal of a Java profiler at your disposal.

However, the much larger part of perfino is not its emergency handling but its monitoring capabilities. Here, we wanted to make a difference as well. perfino uses a Java agent with ultra-low overhead and measures what are called “business transactions” in the APM space. Business transactions capture important method calls with specially constructed names that help you interpret what is going on in your application.

For business transactions, we brought in successful concepts from the profiling space and integrated them into perfino. For example, transactions are shown in a call tree and you can see hot spots of transactions. With perfino, it is possible to define many nested transactions. This gives you more informational depth, and correspondingly more insight, than the flat list of top-level business transactions that is common for APM tools.

The amount of useful information in an APM tool is directly related to the amount and quality of the recorded business transactions. This is why we expended a lot of energy on the business transaction engine and the configuration of business transactions in the perfino UI. Also, we wanted to make it really easy to define business transactions directly in your code. The DevOps annotations offered by perfino are a great way to achieve this. Rather than thinking about monitoring as external to the application, you just annotate methods of interest.

The features mentioned above revolve around measuring method calls. Of course, a monitoring tool needs to do a lot more, and we’ve strived to make perfino great in all these aspects: telemetries, policies, triggers, alerts, end-user experience monitoring and lots more. Take a look at the feature list or – even better – try it out in our live demo or on your own machines. Tell us what you think and what you would like to see in future versions.

perfino is a powerful APM solution today, but our vision for it is not yet complete. There are many more things to come, and we hope you’ll bear with us.

All screen casts now with HTML5 video

We’ve just converted all our screen casts to HTML5 video with MP4 and WebM sources, so you can enjoy them on mobile and other Flash-less devices.

There is still a Flash fallback for ancient browsers that do not support the “video” tag. Some older browsers (such as Firefox 3) that support the video tag but do not support either the MP4 or the WebM format may show an error. In that case, please go to our YouTube channel to watch the screen casts.

— Update 2013-07-24

Since Firefox 21, MP4 is supported in Firefox on Windows 7 or higher. There may be problems with colors, which can be resolved by going to about:config and setting

media.windows-media-foundation.use-dxva=false

Welcome!

In this blog we’ll show you tips and tricks around JProfiler and install4j. Comments and questions are always welcome. Enjoy!