ARM NEON Optimization Example


http://hilbert-space.de/?p=22

Since there is so little information about NEON optimizations out there I thought I’d write a little about it.

Some weeks ago someone on the beagle-board mailing-list asked how to optimize a color to grayscale conversion for images. I haven't done much pixel processing with ARM NEON yet, so I gave it a try. The results I got were quite spectacular, but more on this later.

For the color to grayscale conversion I used a very simple conversion scheme: a weighted average of the red, green and blue components. This conversion ignores the effect of gamma but works well enough in practice. Also I decided not to do proper rounding. It's just an example after all.
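To make the scaling concrete, here is the arithmetic for one pixel; the input values are made up purely for illustration:

  r = 200, g = 100, b = 50
  y = 200*77 + 100*151 + 50*28 = 15400 + 15100 + 1400 = 31900
  y >> 8 = 31900 / 256 = 124 (truncated)

The weights 77, 151 and 28 sum to 256, so the final shift by 8 bits undoes the scaling; a uniform gray input (r = g = b = v) comes out as v again.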

First a reference implementation in C:

#include <stdint.h>

void reference_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
  int i;
  for (i=0; i<n; i++)
  {
    int r = *src++; // load red
    int g = *src++; // load green
    int b = *src++; // load blue

    // build weighted average:
    int y = (r*77)+(g*151)+(b*28);

    // undo the scale by 256 and write to memory:
    *dest++ = (y>>8);
  }
}
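A minimal driver for the reference function might look like this; it assumes reference_convert from above is in scope, and the buffer size and test pattern are arbitrary choices for illustration, not part of the original post:

#include <stdint.h>
#include <stdio.h>

int main (void)
{
  enum { N = 8 };                       /* number of pixels, arbitrary */
  uint8_t src[N * 3];                   /* interleaved R, G, B bytes   */
  uint8_t dest[N];
  int i;

  for (i = 0; i < N; i++)               /* fill with some test data */
  {
    src[i*3 + 0] = (uint8_t)(i * 30);   /* red   */
    src[i*3 + 1] = (uint8_t)(i * 20);   /* green */
    src[i*3 + 2] = (uint8_t)(i * 10);   /* blue  */
  }

  reference_convert (dest, src, N);

  for (i = 0; i < N; i++)
    printf ("pixel %d: gray = %u\n", i, dest[i]);

  return 0;
}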

Optimization with NEON Intrinsics
Let's start optimizing the code using the compiler intrinsics. Intrinsics are nice to use because they behave just like C functions but compile to a single assembler statement. At least in theory, as I'll show you later.

Since NEON works on 64 bit or 128 bit registers, it's best to process eight pixels in parallel. That way we can exploit the parallel nature of the SIMD unit. Here is what I came up with:

#include <stdint.h>
#include <arm_neon.h>

void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
  int i;
  uint8x8_t rfac = vdup_n_u8 (77);
  uint8x8_t gfac = vdup_n_u8 (151);
  uint8x8_t bfac = vdup_n_u8 (28);
  n/=8;

  for (i=0; i<n; i++)
  {
    uint16x8_t  temp;
    uint8x8x3_t rgb  = vld3_u8 (src);
    uint8x8_t   result;

    temp = vmull_u8 (rgb.val[0], rfac);
    temp = vmlal_u8 (temp, rgb.val[1], gfac);
    temp = vmlal_u8 (temp, rgb.val[2], bfac);

    result = vshrn_n_u16 (temp, 8);
    vst1_u8 (dest, result);
    src  += 8*3;
    dest += 8;
  }
}

Let's take a look at it step by step:

First off I load my weight factors into three NEON registers. The vdup.8 instruction does this and also replicates the byte into all 8 bytes of the NEON register.

uint8x8_t rfac = vdup_n_u8 (77);
uint8x8_t gfac = vdup_n_u8 (151);
uint8x8_t bfac = vdup_n_u8 (28);

Now I load 8 pixels at once into three registers.

uint8x8x3_t rgb  = vld3_u8 (src);

The vld3.8 instruction is a specialty of the NEON instruction set. With NEON you can not only do loads and stores of multiple registers at once, you can de-interleave the data on the fly as well. Since I expect my pixel data to be interleaved the vld3.8 instruction is a perfect fit for a tight loop.

After the load, I have all the red components of 8 pixels in the first loaded register. The green components end up in the second and blue in the third.
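To visualize the de-interleaving, here is how the bytes end up; the pixel indices are just labels for illustration:

src memory:    R0 G0 B0 R1 G1 B1 R2 G2 B2 ... R7 G7 B7    (24 bytes)

after vld3.8:
  rgb.val[0] = { R0, R1, R2, R3, R4, R5, R6, R7 }   /* all red components   */
  rgb.val[1] = { G0, G1, G2, G3, G4, G5, G6, G7 }   /* all green components */
  rgb.val[2] = { B0, B1, B2, B3, B4, B5, B6, B7 }   /* all blue components  */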

Now calculate the weighted average:

temp = vmull_u8 (rgb.val[0], rfac);
temp = vmlal_u8 (temp, rgb.val[1], gfac);
temp = vmlal_u8 (temp, rgb.val[2], bfac);

vmull.u8 multiplies each byte of the first argument with each corresponding byte of the second argument. Each result becomes a 16 bit unsigned integer, so no overflow can happen. The entire result is returned as a 128 bit NEON register pair.

vmlal.u8 does the same thing as vmull.u8 but also adds the content of another register to the result.

So we end up with just three instructions for the weighted average of eight pixels. Nice.

Now it's time to undo the scaling of the weight factors. To do so I shift each 16 bit result to the right by 8 bits, which is equivalent to a division by 256. ARM NEON has lots of shift instructions, among them a "narrow" variant that does two things at once: it performs the shift and then converts the 16 bit integers back to 8 bit by dropping the high byte of each result. We get back from the 128 bit register pair to a single 64 bit register.

result = vshrn_n_u16 (temp, 8);
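Carrying on with the illustrative pixel from the beginning (r = 200, g = 100, b = 50), its 16 bit lane of temp holds 31900, and the narrowing shift reduces it to a single byte:

  31900 = 0x7C9C
  0x7C9C >> 8 = 0x7C = 124    /* fits in one byte after narrowing */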

And finally store the result.

vst1_u8 (dest, result);

First Results:
How do the reference C function and the NEON optimized version compare? I did a test on my OMAP3 Cortex-A8 CPU on the beagle-board and got the following timings:

C-version:       15.1 cycles per pixel.
NEON-version:     9.9 cycles per pixel.

That's only a speed-up of factor 1.5. I expected much more from the NEON implementation. It processes 8 pixels with just 6 instructions after all. What's going on here? A look at the assembler output explained it all. Here is the inner-loop part of the convert function:

160:   f46a040f        vld3.8  {d16-d18}, [sl]
 164:   e1a0c005        mov     ip, r5
 168:   ecc80b06        vstmia  r8, {d16-d18}
 16c:   e1a04007        mov     r4, r7
 170:   e2866001        add     r6, r6, #1      ; 0x1
 174:   e28aa018        add     sl, sl, #24     ; 0x18
 178:   e8bc000f        ldm     ip!, {r0, r1, r2, r3}
 17c:   e15b0006        cmp     fp, r6
 180:   e1a08005        mov     r8, r5
 184:   e8a4000f        stmia   r4!, {r0, r1, r2, r3}
 188:   eddd0b06        vldr    d16, [sp, #24]
 18c:   e89c0003        ldm     ip, {r0, r1}
 190:   eddd2b08        vldr    d18, [sp, #32]
 194:   f3c00ca6        vmull.u8        q8, d16, d22
 198:   f3c208a5        vmlal.u8        q8, d18, d21
 19c:   e8840003        stm     r4, {r0, r1}
 1a0:   eddd3b0a        vldr    d19, [sp, #40]
 1a4:   f3c308a4        vmlal.u8        q8, d19, d20
 1a8:   f2c80830        vshrn.i16       d16, q8, #8
 1ac:   f449070f        vst1.8  {d16}, [r9]
 1b0:   e2899008        add     r9, r9, #8      ; 0x8
 1b4:   caffffe9        bgt     160

Note the store at offset 168? The compiler decides to write the three just-loaded registers onto the stack. After a handful of useless memory accesses from the GPP side, it reloads them (offsets 188, 190 and 1a0) into exactly the same physical NEON registers.

What do all the ordinary integer instructions do? I have no idea. Lots of memory accesses target the stack for no good reason. There is definitely no shortage of registers anywhere. For reference: I used the GCC 4.3.3 compiler (CodeSourcery 2009q1 lite).
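A compile command roughly along these lines enables NEON code generation for this kind of setup; the toolchain prefix, file name and flag set shown here are typical choices rather than the exact invocation used for the numbers above:

arm-none-linux-gnueabi-gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -c neon_convert.c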

NEON and assembler
Since the compiler can't generate good code, I wrote the same loop in assembler. In a nutshell I just took the intrinsic-based loop and converted the instructions one by one. The loop control is a bit different, but that's all.

      .text
      .global convert_asm_neon
convert_asm_neon:

      @ r0: Ptr to destination data
      @ r1: Ptr to source data
      @ r2: Pixel count (n)

      push        {r4-r5, lr}
      lsr         r2, r2, #3

      @ build the three constants:
      mov         r3, #77
      mov         r4, #151
      mov         r5, #28
      vdup.8      d3, r3
      vdup.8      d4, r4
      vdup.8      d5, r5

.loop:

      @ load 8 pixels:
      vld3.8      {d0-d2}, [r1]!

      @ do the weighted average:
      vmull.u8    q3, d0, d3
      vmlal.u8    q3, d1, d4
      vmlal.u8    q3, d2, d5

      @ shift and store:
      vshrn.i16   d6, q3, #8
      vst1.8      {d6}, [r0]!

      subs        r2, r2, #1
      bne         .loop

      pop         {r4-r5, pc}
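Calling the assembler routine from C only needs a matching prototype; under the AAPCS the three arguments arrive in r0, r1 and r2, exactly as the register comments above expect. A minimal sketch, assuming the routine lives in its own .S file:

/* prototype for the routine defined in the .S file */
extern void convert_asm_neon (uint8_t *dest, const uint8_t *src, int n);

/* example call, e.g. for a 640x480 image: */
convert_asm_neon (dest, src, 640 * 480);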

Final Results:
Time for some benchmarking again. How does the hand-written assembler version compare? Well, here are the results:

C-version:       15.1 cycles per pixel.
NEON-version:     9.9 cycles per pixel.
Assembler:        2.0 cycles per pixel.

That's roughly a factor of five over the intrinsic version and 7.5 times faster than my not-so-bad C implementation. And keep in mind: I didn't even optimize the assembler loop.
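A timing harness along the following lines produces numbers of this kind; it is a minimal sketch rather than the exact code behind the figures above, the clock frequency and image size are placeholders, and on older glibc versions clock_gettime may additionally need -lrt:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

extern void convert_asm_neon (uint8_t *dest, const uint8_t *src, int n);

int main (void)
{
  const int    n       = 640 * 480;   /* pixel count, arbitrary       */
  const double cpu_hz  = 500e6;       /* placeholder core clock in Hz */
  const int    repeats = 100;

  uint8_t *src  = malloc (n * 3);     /* interleaved RGB input        */
  uint8_t *dest = malloc (n);         /* grayscale output             */
  struct timespec t0, t1;
  double secs;
  int i;

  for (i = 0; i < n * 3; i++)         /* arbitrary test pattern       */
    src[i] = (uint8_t) i;

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (i = 0; i < repeats; i++)
    convert_asm_neon (dest, src, n);
  clock_gettime (CLOCK_MONOTONIC, &t1);

  secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  printf ("%.1f cycles per pixel\n", secs * cpu_hz / ((double) n * repeats));

  free (src);
  free (dest);
  return 0;
}

The same harness can time reference_convert and neon_convert for comparison.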

My conclusion: if you want performance out of your NEON unit, stay away from the intrinsics. They are nice as a prototyping tool. Use them to get your algorithm working and then rewrite the NEON parts of it in assembler.

Btw: Sorry for the ugly syntax-highlighting. I’m still looking for a nice wordpress plug-in.
