Improving Reduction Summation on the GPU Using CUDA

We can never be satisfied with a program that merely runs correctly. The reduction-summation program described in the previous blog post needs to be optimized.

1. Make the best use of the hardware, and do not forget the CPU!

During the second stage of the reduction summation, the amount of data to be calculated has already been greatly reduced by the time the second kernel function runs: at that point it equals the number of threads per block. Given the architectural differences between the two, CPUs are designed for running a small number of potentially quite complex tasks, while GPUs are designed for running a large number of potentially quite simple tasks. When you have only a small amount of data, do not forget the CPU, which can process it much faster than the GPU.

We can delete the second kernel function, pass each block's partial sum back to the CPU, and add them up on the CPU.

// Copy one partial sum per block back to the host.
cudaMemcpy(a, dev_a, BlockPerGrid * sizeof(int), cudaMemcpyDeviceToHost);

// Finish the reduction on the CPU.
int c = 0;
for (int i = 0; i < BlockPerGrid; i++)
{
     c += a[i];
}
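
For context, here is a minimal sketch of a first-stage kernel that produces one partial sum per block. This is the standard shared-memory tree reduction, not necessarily the exact kernel from the previous post; the names partialSum, dev_in, and ThreadPerBlock are assumptions for illustration.

__global__ void partialSum(const int *in, int *out, int n)
{
    extern __shared__ int cache[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread strides over the input and accumulates a private sum.
    int sum = 0;
    while (i < n)
    {
        sum += in[i];
        i   += blockDim.x * gridDim.x;
    }
    cache[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            cache[tid] += cache[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0)
        out[blockIdx.x] = cache[0];
}

// Launch: one partial sum per block lands in dev_a.
partialSum<<<BlockPerGrid, ThreadPerBlock, ThreadPerBlock * sizeof(int)>>>(dev_in, dev_a, N);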

2. The appropriate number of threads per block: more is not always better.

As we all know, if there are too few threads, the GPU cannot hide memory latency behind computation. Therefore we had better not choose too few threads.

However, the number of threads also matters when there are synchronization points in the kernel. For threads per block, more is not always better.

The time to execute a given block is undefined, and a block cannot be retired from an SM until it has completed its entire execution. Sometimes all the other warps sit idle waiting for a single warp to finish, leaving the SM idle as well.

It follows that the larger the thread block, the greater the potential for waiting on a slow warp to catch up. As a general rule, a value of 256 gets you 100% utilization across all levels of the hardware, so we had better aim for either 192 or 256. Alternatively, you can look up the utilization table and select the smallest number of threads that gives the highest device utilization.
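
If you would rather not read the table by hand, the CUDA runtime (6.5 and later) can also suggest a block size. A minimal sketch, assuming the partialSum kernel above; for kernels whose dynamic shared memory grows with the block size, the cudaOccupancyMaxPotentialBlockSizeVariableSMem variant is the more precise choice:

int minGridSize = 0, blockSize = 0;
// Ask the runtime for the block size that maximizes occupancy.
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, partialSum, 0, 0);
printf("suggested threads per block: %d\n", blockSize);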

3. Not too many branches.

The hardware can fetch only a single instruction stream per warp, so when a branch appears, the threads that do not meet the condition stall, and device utilization drops. However, in terms of instruction execution the actual scheduler is half-warp based, not warp based. Therefore, if we arrange the divergence to fall on a half-warp (16-thread) boundary, the device can execute both sides of the branch condition at the same time.

if ((thread_idx % 32) < 16)
{
     // first half-warp: do something
}
else
{
     // second half-warp: do something else
}

However, this only works when the data are contiguous in memory. Sometimes we can pad the end of the array with zeros, as the previous blog post mentioned, up to a length that is an integral multiple of 32. That helps keep the number of divergent branches to a minimum, as the sketch below shows.
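
A minimal host-side sketch of the padding idea; the names input and N are assumptions, and calloc is used because it zero-fills the tail for free:

// Round N up to the next multiple of 32 and zero-fill the tail.
int padded = ((N + 31) / 32) * 32;
int *a = (int *)calloc(padded, sizeof(int));
memcpy(a, input, N * sizeof(int));
// The zero padding contributes nothing to the sum, so the kernel can
// process `padded` elements with every warp fully populated and no
// tail branch.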
