Measure Java Performance – Sampling or Instrumentation

Source: https://blog.codecentric.de/en/2011/10/measure-java-performance-sampling-or-instrumentation/

In recent discussions, I noticed some confusion about the differences between measuring with sampling and instrumentation.
I often hear claims that one of them is better than the other, but I think it is important to understand how they both work. Some tools even ask you to choose between the two directly at startup, like JProfiler 7 does.
But how can you choose whichever fits a given situation best without understanding these concepts?

So let's step back for a moment and think about how to measure the performance of a computer program:
a) We add measurement code into the real code. The measurement code tells us how long the real code took to execute.
b) We add an external observer, which examines the running real code and tells us what code executed at any given time.

Both approaches will work and give results; however, they are fundamentally different! So I am going to explain how they both work.

The examples I will use in this post are based on Java and the JVM, but are applicable to other languages as well, especially .NET with its CLR.

So how can we add measurement code in Java?
It turns out there are actually a few established methods for measuring by adding measurement code:

  • Manually add some System.out.println code for important methods. Execution time is printed to the log.
  • Create some kind of javax.management beans to record time. They can be recorded manually and queried with tools later on.
  • Use AOP libraries to build aspects which record code execution time.
  • Build a JVMTI agent, which uses APIs to add code and record execution time.

Those methods overlap in one way or another. In the end, all of them put code on top of the real application code to calculate the time used for execution. The former approaches usually involve some kind of basic file logging or JMX (JSR-3). JMX was designed to gather metrics for management and to be able to change settings. The latter methods are more dynamic and do not require hardcoding during development. Adding code like this is called “instrumentation” and usually involves bytecode modification.
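As a small illustration of the javax.management option from the list above, here is a minimal sketch of an MBean that records execution times. All names in it (Timing, TimingMBean, the object name) are my own hypothetical choices, not part of the demo code:

import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Standard MBean interface: by JMX convention it must be named <Impl>MBean.
interface TimingMBean {
	long getInvocationCount();
	long getTotalMillis();
}

public class Timing implements TimingMBean {

	private final AtomicLong invocations = new AtomicLong();
	private final AtomicLong totalMillis = new AtomicLong();

	// Called manually around the code we want to measure.
	public void record(final long millis) {
		invocations.incrementAndGet();
		totalMillis.addAndGet(millis);
	}

	public long getInvocationCount() {
		return invocations.get();
	}

	public long getTotalMillis() {
		return totalMillis.get();
	}

	public static void main(final String[] args) throws Exception {
		final Timing timing = new Timing();
		final MBeanServer server = ManagementFactory.getPlatformMBeanServer();
		server.registerMBean(timing, new ObjectName("demo:type=Timing"));

		final long start = System.currentTimeMillis();
		Thread.sleep(100); // stand-in for real work
		timing.record(System.currentTimeMillis() - start);
	}
}

Once registered, the bean shows up in JConsole or any other JMX client, where its attributes can be queried at runtime.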

How can we observe externally in Java?
Ideally we would like to observe from outside the runtime (the JVM). JMX was made accessible externally in JSR-160, but JMX prevents us from seeing many details, as it operates only on a high level. While there are other APIs that allow us to read the state of the JVM, none really tells us how fast code executes. To do better pseudo-external observation, we create a thread with this observing duty inside the monitored JVM itself. That thread will look into the other threads from time to time and record their activity. The interval between those inspections should be small enough to capture many details. This external observation is called “(time-)sampling”. With time sampling, the monitoring is not continuous, but it does cover all requests or threads.
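The basic idea of such a pseudo-external observer fits in a few lines. The following is a minimal sketch with hypothetical names; the real sampling agent used for the measurements follows later in this post:

public class MiniSampler implements Runnable {

	private final Thread monitoredThread;

	public MiniSampler(final Thread monitoredThread) {
		this.monitoredThread = monitoredThread;
	}

	@Override
	public void run() {
		while (monitoredThread.isAlive()) {
			// Ask the JVM for a snapshot of the monitored thread's stack.
			final StackTraceElement[] stack = monitoredThread.getStackTrace();
			if (stack.length > 0) {
				System.out.println("currently in: " + stack[0]);
			}
			try {
				Thread.sleep(10); // the sampling interval
			} catch (final InterruptedException e) {
				return; // stop sampling when interrupted
			}
		}
	}

	public static void main(final String[] args) throws InterruptedException {
		final Thread sampler = new Thread(new MiniSampler(Thread.currentThread()), "mini-sampler");
		sampler.setDaemon(true);
		sampler.start();
		Thread.sleep(50); // stand-in for real work on the monitored thread
	}
}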

In this post, I am going to compare both approaches using an easy-to-understand example. Because it is designed to be easily understandable, it uses naive code and does not contain optimizations.

Example code

So first off, here is the code:
Attached as ZIP, or SamplingVsInstrumentation on my private GitHub.

We have a class Demo, which runs all our fake production code. It has a few methods named like this: method100ms().
The name includes the average execution time to allow easier reading of the results. Unfortunately, in real code no method name will carry this information.
There is also a method0ms(), which performs some minor code execution, so it does not take exactly zero milliseconds, but is much faster than one millisecond.

DemoRunner has two methods for executing the Demo class business methods (a sketch of the code's shape follows below):
a) Mixed Demo will run Demo with [100, 1, 100, 500, 1, 100, 1, 50, 50] methods
b) Mass Demo will run the 0ms method a hundred million times.
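
Roughly, the shape of this code looks as follows. This is my own sketch of the idea; the actual classes are in the linked repository:

public class Demo {

	// Each business method fakes work that takes about as long as its name says.
	public void method100ms() throws InterruptedException {
		Thread.sleep(100);
	}

	// Much faster than one millisecond, but not free.
	public void method0ms() {
		Math.sqrt(System.nanoTime());
	}

	// method1ms(), method50ms() and method500ms() follow the same pattern.
}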

So we can run it like this:

public static void main(final String[] args) {
	mixedDemo();
	massDemo();
}

And it will complete. But we do not know anything about what happened. We can use an external tool to get some kind of result: a stopwatch.
On my machine (Dell E6420, Intel 2520 2.5GHz 4-core CPU, 64-bit Windows, Java 1.6.0_27) it takes about a second to run the mixed demo and almost three seconds for the many 0ms method invocations.

So let us add some outside measurement code to get more precise numbers:

public static void main(final String[] args) {
	long start = System.currentTimeMillis();
	mixedDemo();
	long end = System.currentTimeMillis();
	System.out.printf("%s Demo completed in %dms%n", DemoType.MIXED, end - start);
 
	start = System.currentTimeMillis();
	massDemo();
	end = System.currentTimeMillis();
	System.out.printf("%s Demo completed in %dms%n", DemoType.MASS, end - start);
}

Which gives us:

Running Demo with [100, 1, 100, 500, 1, 100, 1, 50, 50] methods
MIXED Demo completed in 967ms
Running Demo with 100000000 0ms methods
MASS Demo completed in 2781ms

Let's talk about Overhead
When measuring, you will distort the results. Generally speaking, the measured value will differ from the true value by a so-called systematic error and a random error. Systematic errors are introduced by the measurement instruments and can be measured or estimated to a certain extent, while random errors cannot be predicted.
When the CPU executes measuring code instead of real code, we usually speak of “overhead”, which results in systematic errors in the measurements. It also consumes CPU cycles which could have been used by other production code, and can thus influence the behaviour of unmeasured code as well. Additionally, the really important overhead is the delay of the regular transactions through the system; additional system resource usage can usually be tolerated.
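
To get a feeling for the cost of the measurement itself, one can time the timer. The following micro-benchmark is my own naive sketch (not part of the demo code) and only gives a rough order of magnitude:

public class TimerOverhead {

	public static void main(final String[] args) {
		final int calls = 10000000;
		long dummy = 0;
		final long start = System.nanoTime();
		for (int i = 0; i < calls; i++) {
			// This is the measurement call whose own cost we want to estimate.
			dummy += System.currentTimeMillis();
		}
		final long end = System.nanoTime();
		// dummy is printed so the JIT cannot remove the loop entirely.
		System.out.printf("%d calls took %dms in total (dummy=%d)%n",
				calls, (end - start) / 1000000, dummy);
	}
}

On typical hardware, such a loop reports somewhere between a few nanoseconds and a few tens of nanoseconds per call, which is exactly the kind of systematic error that accumulates when millions of instrumented invocations are measured.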

Instrumentation

To better measure what the Demo code is doing, I will build an instrumentation agent based on AOP with AspectJ load-time weaving. This adds some extra code invocation to the methods I specify with a so-called “pointcut expression”.
AspectJ enhances the bytecode of classes when they are loaded. The pointcut expression describes the signature of the methods which shall be instrumented.
In this case an @Around advice is used, which is passed a so-called JoinPoint, which is effectively a pointer to the real code that was about to be executed. AspectJ uses a JVMTI agent and does the hard work for me. I just have to write a so-called aspect to do my measurements.

The interesting part of the code is this:

@Around("call(void de.codecentric.performance.Demo.method* (..))")
public void aroundDemoMethodCall(final ProceedingJoinPoint thisJoinPoint) throws Throwable {
	long start = System.currentTimeMillis();
	thisJoinPoint.proceed(); // execute the real business method
	long end = System.currentTimeMillis();
	String currentMethod = thisJoinPoint.getSignature().toString();
	if (executionPath.size() < MAX_EXECUTION_PATH) {
		executionPath.add(currentMethod);
	}
	MethodStatistics statistics = methodStatistics.get(currentMethod);
	if (statistics == null) {
		statistics = new MoreMethodStatistics(currentMethod);
		methodStatistics.put(currentMethod, statistics);
	}
	statistics.addTime(end - start);
	// everything after proceed() is bookkeeping, so track it as overhead
	overhead += System.currentTimeMillis() - end;
}

As you can see, I give explicit method names to intercept: call(void de.codecentric.performance.Demo.method* (..)).
I record the start time at the beginning and the end time after executing the method. Additionally, I store the current method name in the “execution path” (unless it has reached its maximum), and I record method statistics for the current method. I also record how much time I spent recording this data in a field called “overhead”.
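
For context, the fields used by the advice are not shown in the snippet. They might be declared roughly like this; the class name and the limit value are my assumptions, while MethodStatistics and MoreMethodStatistics come with the demo project:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.aspectj.lang.annotation.Aspect;

@Aspect
public class TraceAspect {

	private static final int MAX_EXECUTION_PATH = 1000; // assumed value for the limit

	private final List<String> executionPath = new ArrayList<String>();
	private final Map<String, MethodStatistics> methodStatistics = new HashMap<String, MethodStatistics>();
	private long overhead;

	// ... aroundDemoMethodCall(...) from above goes here ...
}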

Running this Instrumentation gives me this:

Running Demo with [100, 1, 100, 500, 1, 100, 1, 50, 50] methods
MIXED Demo completed in 950ms
Trace Aspect recorded following results:
  void de.codecentric.performance.Demo.method500ms() 501ms (min: 501ms, max: 501ms) - 1 invocations
  void de.codecentric.performance.Demo.method100ms() 303ms (min: 101ms, max: 101ms) - 3 invocations
  void de.codecentric.performance.Demo.method50ms() 102ms (min: 51ms, max: 51ms) - 2 invocations
  void de.codecentric.performance.Demo.method1ms() 6ms (min: 2ms, max: 2ms) - 3 invocations
Code Execution Path:
  void de.codecentric.performance.Demo.method100ms()
  void de.codecentric.performance.Demo.method1ms()
  void de.codecentric.performance.Demo.method100ms()
  void de.codecentric.performance.Demo.method500ms()
  void de.codecentric.performance.Demo.method1ms()
  void de.codecentric.performance.Demo.method100ms()
  void de.codecentric.performance.Demo.method1ms()
  void de.codecentric.performance.Demo.method50ms()
  void de.codecentric.performance.Demo.method50ms()
Agent internal Overhead 2ms
Agent Overhead 91ms
 
Running Demo with 100000000 0ms methods
MASS Demo completed in 7261ms
Trace Aspect recorded following results:
  void de.codecentric.performance.Demo.method0ms() 2892ms (min: 0ms, max: 2ms) - 100000000 invocations
Code Execution Path:
  void de.codecentric.performance.Demo.method0ms()
  void de.codecentric.performance.Demo.method0ms()
  [...]
  void de.codecentric.performance.Demo.method0ms()
  void de.codecentric.performance.Demo.method0ms()
Execution Path incomplete!
Agent internal Overhead 2836ms
Agent Overhead 4ms

We can clearly see that the instrumentation caught all nine method invocations in the first example and recorded the time spent quite accurately. It can also tell us in which order these methods executed. But it has a problem, as the second output shows us. The execution path was very long: the aspect would have had to keep one hundred million entries in memory. That is why I put a limit in there.

So what about Overhead?

There are two kinds of overhead measured by my demo code. Neither is really accurate, but they give a good indication of where the agent spends its time.
The internal one counts the time the agent spends doing the statistics. It is internal because it cannot be differentiated externally and looks like time the actual business method takes to execute. And there is the overhead which can be seen externally: the time required to set up the instrumentation and to print the results.
We can see that the overhead of instrumentation is low in the first case, but outputting the data to standard out took some time. In the second demo, the output was faster because there was less data, but the internal overhead was huge. However, there is a problem with the overall overhead: the internal overhead differs from the total time minus the method time. 7261ms - 2892ms = 4369ms of the time was not spent running real code, but the agent only claims to account for 2836ms. The delta is attributable to inaccuracies in both the external and internal time measurements. And of course there is some code execution inside the instrumentation which is not added to the overhead time (like the method invocation cost of aroundDemoMethodCall(ProceedingJoinPoint thisJoinPoint) itself).

Sampling

My sampling code will create a daemon thread which looks into the main thread every 10ms and tracks its activity. The interesting code of that agent is this:

@Override
public void run() {
	lastSample = System.currentTimeMillis();
	while (true) {
		try {
			Thread.sleep(interval); // the sampling interval, e.g. 10ms
		} catch (InterruptedException e) {
			Thread.currentThread().interrupt();
		}

		String currentMethod = getCurrentMethod();
		long currentSample = System.currentTimeMillis();

		addMeasurementsIfStillInMethod(currentMethod, currentSample);

		lastMethod = currentMethod;
		lastSample = currentSample;

		// everything after waking up is bookkeeping, so track it as overhead
		overhead += System.currentTimeMillis() - currentSample;
	}
}

private void addMeasurementsIfStillInMethod(final String currentMethod, final long currentSample) {
	if (currentMethod.equals(lastMethod)) {
		// same method seen twice in a row: attribute the elapsed interval to it
		MethodStatistics statistics = methodStatistics.get(currentMethod);
		if (statistics == null) {
			statistics = new MethodStatistics(currentMethod);
			methodStatistics.put(currentMethod, statistics);
		}
		statistics.addTime(currentSample - lastSample);
	} else {
		// method changed since the last sample: record it in the execution path
		if (executionPath.size() < MAX_EXECUTION_PATH) {
			executionPath.add(getParentMethod() + " > " + currentMethod);
		}
	}
}

private String getCurrentMethod() {
	// the top stack frame is the method currently executing on the monitored thread
	StackTraceElement topOfStack = monitoredThread.getStackTrace()[0];
	return formatStackElement(topOfStack);
}

So the agent will sleep for its given interval, wake up, and find out what method is executing on the monitored thread using monitoredThread.getStackTrace()[0]. Then it records the current time to find out how long it has been sleeping since the last sample (this will likely be around 10ms, but might differ!). Then it finds out whether the code is still in the same method as last time. This is quite important, because the agent can only record the execution time when the same method is seen back-to-back. If the method is seen for the first time, we add it to the execution path (also respecting the same limit here). Then we update internal state and calculate the overhead for the statistics part.
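
One helper is not shown above: getParentMethod(). Judging from the output below, it presumably reads the next frame of the stack trace, roughly like this sketch:

private String getParentMethod() {
	// One frame below the top: the caller of the currently executing method.
	final StackTraceElement[] stack = monitoredThread.getStackTrace();
	return stack.length > 1 ? formatStackElement(stack[1]) : "";
}

Running the sampling agent gives me this: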

Agent monitoring thread main with sampling interval of 10ms
Running Demo with [100, 1, 100, 500, 1, 100, 1, 50, 50] methods
MIXED Demo completed in 916ms
Agent stopped - Results:
  void de.codecentric.performance.Demo.method500ms() 488ms
  void de.codecentric.performance.Demo.method100ms() 285ms
  void java.lang.Thread.sleep() 101ms
Code Execution Path:
  void de.codecentric.performance.Demo.runCode() > void de.codecentric.performance.Demo.method100ms()
  void de.codecentric.performance.Demo.runCode() > void de.codecentric.performance.Demo.method500ms()
  void de.codecentric.performance.Demo.runCode() > void de.codecentric.performance.Demo.method100ms()
  void de.codecentric.performance.Demo.method50ms() > void java.lang.Thread.sleep()
Agent internal Overhead 4ms
Agent Overhead 36ms
Agent monitoring thread main with sampling interval of 10ms
 
Running Demo with 100000000 0ms methods
MASS Demo completed in 2959ms
Agent stopped - Results:
  void de.codecentric.performance.Demo.method0ms() 2736ms
Code Execution Path:
  void de.codecentric.performance.DemoRunner.massDemo() > void de.codecentric.performance.DemoRunner.massDemo()
  void de.codecentric.performance.Demo.runCode() > void de.codecentric.performance.Demo.method0ms()
Agent internal Overhead 0ms
Agent Overhead 0ms

So we can clearly see that sampling had problems capturing the 1-millisecond methods. But we see a Thread.sleep(), which we have not seen with instrumentation. Because sampling has much easier access to the previously executing method using monitoredThread.getStackTrace()[1], we discover that it is method50ms which is invoking Thread.sleep(). But the execution path is missing a few short invocations. The invocations of 100ms, 1ms and 100ms are seen as one roughly 200ms-long invocation of method100ms. Sampling automatically filters out the performance-wise irrelevant 1ms execution, so this chain is presented as a 200ms execution of method100ms. This is mainly due to the fact that the agent will not see code which returns faster than the sampling interval. When doing sampling, there are other aspects to consider with respect to the sampling interval. A good paper on that topic is “Evaluating the Accuracy of Java Profilers”.

As we can see, sampling gives the expected results on the second demo code without problems with the execution path length.

Overhead again

So we can see that in the first example, the overhead is quite similar to instrumentation. But in the second example the internal overhead is drastically lower. We only miss 223ms (2959ms - 2736ms), but this does not seem to be caused by our internal measures. Also, this agent runs in a different thread, so some of its overhead does not result in longer execution time of the real code, and it can easily make use of a separate core.

Conclusion

Both instrumentation and sampling have different characteristics with their own advantages and disadvantages.
These differences are caused by the fundamentally different approaches, and can be mitigated to some extent by clever construction of the agents, but never removed entirely.

Instrumentation

  • Has access to invocation count, precise actual/max/min/avg execution times and precise invocation order.
  • Needs configuration of which methods to instrument. Instrumentation needs to be balanced to exclude mass invocations or invocations where the measuring code outweighs the measured code.
  • Generally has much more data to process.

Sampling

  • Stable overhead, mainly determined by the sampling interval, not by the measured code.
  • Execution hot spots are shown instead of a fine-grained execution path and times.
  • Can discover unknown code.
  • Runs easily on a separate core.