Performance profiling on Linux (dynamic analysis)

Profiling is an alternative to benchmarking that is often more effective, as it gives you finer grained measurements of the components of the system you're measuring, which minimises the effect of external influences. It also gives the relative cost of the various components, which further discounts external influence.

Because it gives fine grained information about a component, profiling is really just a special case of monitoring, and it often uses the same infrastructure. As systems become more complex, it's increasingly important to know what monitoring tools are available. By the way, being able to drill down into software components, as I'll describe below, is a large advantage that open systems have over closed ones.

GNU/Linux profiling and monitoring tools are currently progressing rapidly and are in some flux, but I'll summarise the readily available utilities below.

System wide profiling

The Linux kernel has recently implemented a very useful perf infrastructure for profiling various CPU and software events. To get the perf command, install linux-tools-common on Ubuntu, linux-base on Debian, perf-utils on Arch Linux, or perf on Fedora. Then you can profile the system like this:

$ perf record -a -g sleep 10  # record system for 10s
$ perf report --sort comm,dso # display report

That displays a handy curses interface on basically any hardware platform, which you can use to drill down to the area of interest.

See Brendan Gregg's perf examples for a more up to date and detailed exploration of perf's capabilities.

Other system wide profiling tools to consider are sysprof and oprofile.

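It's worth noting that profiling can be problematic on x86_64 at least, because -fno-omit-frame-pointer has been removed from builds to increase performance, and without frame pointers the call graphs that perf record -g relies on by default can be incomplete. 32 bit Fedora, at least, may be going the same way.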

Application level profiling

One can use perf to profile a particular command too, with a variant of the above, like perf record -g $command, or see Ingo Molnar's example of using perf to analyze the string comparison bottleneck in `git gc`. There are other useful userspace tools available though. For example, a graphical profile of ulc_casecoll can be generated using the following commands:

valgrind --tool=callgrind ./a.out
kcachegrind callgrind.out.*

Note kcachegrind is part of the "kdesdk" package on my Fedora system, and it can also be used to read oprofile data (mentioned above) or to profile Python code.

Profiling hardware events

I've detailed previously how important efficient use of the memory hierarchy is for performance. Newer CPUs provide counters to help tune your use of this hierarchy, and the previously mentioned Linux perf tools expose these well. Unfortunately my Pentium-M laptop doesn't expose any cache counters, but the following example from Ingo Molnar shows how useful this technique can be.

static char array[1000][1000];

int main (void)
{
  int i, j;

  /* Column-first traversal: consecutive accesses are 1000 bytes apart */
  for (i = 0; i < 1000; i++)
    for (j = 0; j < 1000; j++)
       array[j][i]++;

  return 0;
}

On hardware that supports enumerating cache hits and misses, you can run:

$ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./a.out

Performance counter stats for './a.out' (10 runs):

        6,719,130 cycles:u                   ( +-   0.662% )
        5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
        1,037,032 l1-dcache-loads:u          ( +-   0.009% )
        1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )

       0.003802098  seconds time elapsed   ( +-  13.395% )

Note the large ratio of cache misses: 1,003,604 of the 1,037,032 L1 data cache loads (about 97%) miss.
Now if we change array[j][i]++; to array[i][j]++; and re-run perf stat:

$ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./a.out

Performance counter stats for './a.out' (10 runs):

        2,395,407 cycles:u                   ( +-   0.365% )
        5,084,788 instructions:u           #      2.123 IPC     ( +-   0.000% )
        1,035,731 l1-dcache-loads:u          ( +-   0.006% )
            3,955 l1-dcache-load-misses:u    ( +-   4.872% )

       0.001806438  seconds time elapsed   ( +-   3.831% )

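We can see the L1 cache is much more effective: misses drop to 3,955 of 1,035,731 loads (about 0.4%), IPC rises from 0.757 to 2.123, and the cycle count falls by almost two thirds.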
To identify hot spots to concentrate on you can use:

$ perf top -e l1-dcache-load-misses -e l1-dcache-loads

   PerfTop:    1923 irqs/sec  kernel: 0.0%  exact:  0.0% [l1-dcache-load-misses...
----------------------------------------------------------------------------------

   weight    samples  pcnt funct DSO
   ______    _______ _____ _____ ______________________

      1.9       6184 98.8% func2 /home/padraig/a.out
      0.0         69  1.1% func1 /home/padraig/a.out

Specialised profiling

system entry points

  • strace -c $cmd
  • ltrace -c $cmd

heap memory

I/O

GCC

misc

Date: 2024-11-09 05:00:27
