On multiple programs simultaneously launching kernels on the same GPU

Apologies for mixing Chinese and English.

I now need to run several programs at the same time, and each of them launches GPU kernels repeatedly. Can those kernels execute in parallel? The answer is no, they cannot (unless the CUDA Multi-Process Service, MPS, is used).

If it is the primary context created by the runtime API, multiple threads of the same process can share it; by using streams, multiple kernels can then execute concurrently.
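
A minimal sketch of this case (the kernel, buffer sizes, and stream count are illustrative): several kernels are issued to different streams of the same context and may overlap when resources permit.

    #include <cuda_runtime.h>

    // Trivial kernel; real workloads must be independent enough to overlap.
    __global__ void scale(float *data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int N = 1 << 20;
        const int kStreams = 4;
        float *buf[kStreams];
        cudaStream_t streams[kStreams];

        for (int s = 0; s < kStreams; ++s) {
            cudaMalloc(&buf[s], N * sizeof(float));
            cudaStreamCreate(&streams[s]);
        }

        // Kernels issued to different streams of the same context may run
        // concurrently; kernels issued to the same stream are serialized.
        for (int s = 0; s < kStreams; ++s) {
            scale<<<(N + 255) / 256, 256, 0, streams[s]>>>(buf[s], N, 2.0f);
        }
        cudaDeviceSynchronize();

        for (int s = 0; s < kStreams; ++s) {
            cudaStreamDestroy(streams[s]);
            cudaFree(buf[s]);
        }
        return 0;
    }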

If it is a standard context created by the driver API, multiple threads of the same process cannot share it (under the legacy driver-API threading model); instead, context migration can be used to transfer the context from one thread to another.
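
A rough sketch of this migration using the driver API's cuCtxPopCurrent / cuCtxPushCurrent (the split of work between the two threads here is purely illustrative):

    #include <cuda.h>
    #include <thread>

    // Thread A creates a context, does some work, then pops it so that
    // another thread can take ownership.
    CUcontext handOffContext() {
        CUcontext ctx;
        CUdevice dev;
        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);   // context is current to this thread
        // ... allocate memory, load modules, launch kernels ...
        cuCtxPopCurrent(&ctx);       // context now "floats", owned by no thread
        return ctx;
    }

    // Thread B pushes the floating context and continues the work.
    void continueWork(CUcontext ctx) {
        cuCtxPushCurrent(ctx);       // context is now current to this thread
        // ... launch more kernels against the same allocations ...
        cuCtxPopCurrent(&ctx);
    }

    int main() {
        CUcontext ctx = handOffContext();
        std::thread worker(continueWork, ctx);
        worker.join();
        cuCtxDestroy(ctx);
        return 0;
    }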

If they are separate processes, they cannot share a context at all, which means those processes end up using the GPU serially.

The explanation is as follows:

First: what is a GPU context?

The CUDA device context is discussed in the programming guide. It represents all of the state (memory map, allocations, kernel definitions, and other state-related information) associated with a particular process (i.e. associated with that particular process's use of a GPU). Separate processes will normally have separate contexts (as will separate devices), as these processes have independent GPU usage and independent memory maps.

If you have multi-process usage of a GPU, you will normally create multiple contexts on that GPU. As you've discovered, it's possible to create multiple contexts from a single process, but not usually necessary.

And yes, when you have multiple contexts, kernels launched in those contexts will require context switching to go from one kernel in one context to another kernel in another context. Those kernels cannot run concurrently.

CUDA runtime API usage manages contexts for you. You normally don't explicitly interact with a CUDA context when using the runtime API. However, in driver API usage, the context is explicitly created and managed.
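
For contrast, a minimal driver-API sketch in which the context is created and destroyed explicitly (with the runtime API the equivalent state is set up implicitly on the first CUDA call):

    #include <cuda.h>

    int main() {
        CUdevice dev;
        CUcontext ctx;

        cuInit(0);                  // must precede any other driver API call
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);  // explicit context, current to this thread

        CUdeviceptr dptr;
        cuMemAlloc(&dptr, 1024);    // allocations belong to this context
        cuMemFree(dptr);

        cuCtxDestroy(ctx);          // all context state is released here
        return 0;
    }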

Context swapping isn't a cheap operation. At least in Linux, multiple contexts compete for GPU resources on a first come, first served basis. This includes memory (there is no concept of swapping or paging). WDDM versions of Windows might work differently because there is an OS level GPU memory manager in play, but I don't have any experience with it.

If you have a single GPU, I think you would do better running a persistent thread to hold the GPU context for the life of the application, and then feed the thread work from producer threads. That offers you the ability to impose your own scheduling logic on the GPU and explicitly control how work is processed. That is probably the GPUWorker model, but I am not very familiar with that code's inner workings.
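
A minimal sketch of that persistent-thread pattern (the queue, the work items, and doWork are hypothetical; only the worker thread ever touches the CUDA API, so only it owns the context):

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <cuda_runtime.h>

    // Hypothetical kernel standing in for the real workload.
    __global__ void doWork(int job) { (void)job; /* real work goes here */ }

    std::queue<int> jobs;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // The only thread that issues CUDA calls; it owns the context for the
    // life of the application and decides the order work is submitted.
    void gpuWorker() {
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return !jobs.empty() || done; });
            if (jobs.empty() && done) break;
            int job = jobs.front();
            jobs.pop();
            lk.unlock();
            doWork<<<1, 1>>>(job);     // custom scheduling logic could go here
            cudaDeviceSynchronize();
        }
    }

    // Producer threads never touch CUDA; they only enqueue work.
    void producer(int base) {
        for (int i = 0; i < 4; ++i) {
            std::lock_guard<std::mutex> lk(m);
            jobs.push(base + i);
            cv.notify_one();
        }
    }

    int main() {
        std::thread worker(gpuWorker);
        std::thread p1(producer, 0), p2(producer, 100);
        p1.join(); p2.join();
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
        worker.join();
        return 0;
    }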

Streams are a mechanism for emitting asynchronous commands to a single GPU context so that overlap can occur between CUDA function calls (for example copying during kernel execution). It doesn't break the basic 1:1 thread to device context paradigm that CUDA is based around. Kernel execution can't overlap on current hardware (the new Fermi hardware is supposed to eliminate this restriction).
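
A small sketch of the copy/compute overlap that streams enable (pinned host memory is required for the asynchronous copies to actually overlap; names and sizes are illustrative, and cleanup is omitted for brevity):

    #include <cuda_runtime.h>

    __global__ void process(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] += 1.0f;
    }

    int main() {
        const int N = 1 << 20;
        float *h0, *h1, *d0, *d1;
        cudaMallocHost(&h0, N * sizeof(float));   // pinned, so copies can be async
        cudaMallocHost(&h1, N * sizeof(float));
        cudaMalloc(&d0, N * sizeof(float));
        cudaMalloc(&d1, N * sizeof(float));

        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // While the kernel in s0 runs, the copy in s1 can proceed, and vice versa.
        cudaMemcpyAsync(d0, h0, N * sizeof(float), cudaMemcpyHostToDevice, s0);
        process<<<(N + 255) / 256, 256, 0, s0>>>(d0, N);
        cudaMemcpyAsync(d1, h1, N * sizeof(float), cudaMemcpyHostToDevice, s1);
        process<<<(N + 255) / 256, 256, 0, s1>>>(d1, N);

        cudaDeviceSynchronize();
        return 0;
    }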

______________________ A clearer explanation ______________________

CUDA activity from independent host processes will normally create independent CUDA contexts, one for each process. Thus, the CUDA activity launched from separate host processes will take place in separate CUDA contexts, on the same device.

CUDA activity in separate contexts will be serialized. The GPU will execute the activity from one process, and when that activity is idle, it can and will context-switch to another context to complete the CUDA activity launched from the other process. The detailed inter-context scheduling behavior is not specified. (Running multiple contexts on a single GPU also cannot normally violate basic GPU limits, such as memory availability for device allocations.)

The "exception" to this case (serialization of GPU activity from independent host processes) would be the CUDA Multi-Process Server. In a nutshell, the MPS acts as a "funnel" to collect CUDA activity emanating from several host processes, and run that activity as if it emanated from a single host process. The principal benefit is to avoid the serialization of kernels which might otherwise be able to run concurrently. The canonical use-case would be for launching multiple MPI ranks that all intend to use a single GPU resource.

Note that the above description applies to GPUs which are in the "Default" compute mode. GPUs in "Exclusive Process" or "Exclusive Thread" compute modes will reject any attempts to create more than one process/context on a single device. In one of these modes, attempts by other processes to use a device already in use will result in a CUDA API reported failure. The compute mode is modifiable in some cases using the nvidia-smi utility.
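
A quick sketch of checking a device's compute mode from code (the mode itself is changed administratively, for example with nvidia-smi -c EXCLUSIVE_PROCESS):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        switch (prop.computeMode) {
        case cudaComputeModeDefault:
            printf("Default: multiple processes/contexts allowed\n");
            break;
        case cudaComputeModeExclusiveProcess:
            printf("Exclusive Process: only one context on this device\n");
            break;
        case cudaComputeModeProhibited:
            printf("Prohibited: no contexts may be created\n");
            break;
        default:
            printf("Other/legacy compute mode\n");
            break;
        }
        return 0;
    }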

________________________________________________

A CUDA context is a virtual execution space that holds the code and data owned by a host thread or process. Only one context can ever be active on a GPU with all current hardware. So to answer your first question, if you have seven separate threads or processes all trying to establish a context and run on the same GPU simultaneously, they will be serialised and any process waiting for access to the GPU will be blocked until the owner of the running context yields. There is, to the best of my knowledge, no time slicing and the scheduling heuristics are not documented and (I would suspect) not uniform from operating system to operating system. You would be better to launch a single worker thread holding a GPU context and use messaging from the other threads to push work onto the GPU. Alternatively there is a context migration facility available in the CUDA driver API, but that will only work with threads from the same process, and the migration mechanism has latency and host CPU overhead.
