Forgive the mix of Chinese and English.
Now I need several programs running at the same time, and each program launches GPU kernels many times. Can those kernels execute in parallel? The answer is no, they cannot (unless you use the NVIDIA Multi-Process Service, MPS).
If the context is the primary context created by the runtime API, multiple threads of one process can share it; by launching into different streams, multiple kernels can execute concurrently (see the sketch after these three points).
If the context is a standard context created by the driver API, multiple threads of one process cannot share it directly; context migration can be used to move a context from one thread to another.
If there are multiple processes, they cannot share a context at all, which means those processes end up using the GPU serially.
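To illustrate the point about the primary context and streams, here is a minimal runtime-API sketch (my own, not from the original discussion): one process, one primary context, several streams. Kernels launched into different streams may run concurrently if resources allow; the kernel body and sizes are made up for the example.

// Minimal sketch: one process, one primary context, several streams.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k)   // artificial work so overlap is visible
            v = v * 1.0001f + 0.0001f;
        data[i] = v;
    }
}

int main()
{
    const int n = 1 << 20;
    const int numStreams = 4;

    cudaStream_t streams[numStreams];
    float *buffers[numStreams];

    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buffers[s], n * sizeof(float));
        cudaMemset(buffers[s], 0, n * sizeof(float));
    }

    // All launches share the same (primary) context; kernels in different
    // streams may run concurrently if resources allow.
    for (int s = 0; s < numStreams; ++s)
        busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buffers[s], n);

    cudaDeviceSynchronize();

    for (int s = 0; s < numStreams; ++s) {
        cudaFree(buffers[s]);
        cudaStreamDestroy(streams[s]);
    }
    printf("done\n");
    return 0;
}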
The explanation follows.
First: what is a GPU context?
The CUDA device context is discussed in the programming guide. It represents all of the state (memory map, allocations, kernel definitions, and other state-related information) associated with a particular process (i.e. associated with that particular process's use of a GPU). Separate processes will normally have separate contexts (as will separate devices), as these processes have independent GPU usage and independent memory maps.
If you have multi-process usage of a GPU, you will normally create multiple contexts on that GPU. As you've discovered, it's possible to create multiple contexts from a single process, but not usually necessary.
And yes, when you have multiple contexts, kernels launched in those contexts will require context switching to go from one kernel in one context to another kernel in another context. Those kernels cannot run concurrently.
CUDA runtime API usage manages contexts for you. You normally don't explicitly interact with a CUDA context when using the runtime API. However, in driver API usage, the context is explicitly created and managed.
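To make the runtime/driver distinction concrete, here is a minimal driver-API sketch (illustrative only, error handling omitted) in which the context is created, made current, and destroyed explicitly:

// Driver-API sketch: the context is managed explicitly, unlike the runtime API.
#include <cstdio>
#include <cuda.h>

int main()
{
    CUdevice  dev;
    CUcontext ctx;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    // Explicitly create a context and bind it to the calling thread.
    cuCtxCreate(&ctx, 0, dev);

    // Allocations, module loads and launches now happen inside ctx.
    CUdeviceptr dptr;
    cuMemAlloc(&dptr, 1024 * sizeof(float));

    // ... cuModuleLoad / cuLaunchKernel would go here ...

    cuMemFree(dptr);
    cuCtxDestroy(ctx);      // tear the context down explicitly as well
    printf("driver-API context created and destroyed\n");
    return 0;
}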
Context swapping isn't a cheap operation. At least in Linux, multiple contexts compete for GPU resources on a first come, first served basis. This includes memory (there is no concept of swapping or paging). WDDM versions of Windows might work differently because there is an OS-level GPU memory manager in play, but I don't have any experience with it.
If you have a single GPU, I think you would do better running a persistent thread to hold the GPU context for the life of the application, and then feeding that thread work from producer threads. That offers you the ability to impose your own scheduling logic on the GPU and explicitly control how work is processed. That is probably the GPUWorker model, but I am not very familiar with that code's inner workings.
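Here is a sketch of that persistent-thread idea (my illustration, not the GPUWorker code): producer threads only push work items into a queue, while a single worker thread owns every CUDA call, and therefore the context, for the life of the application.

// Persistent worker thread owning the GPU context; producers only touch a queue.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void scaleKernel(float *d, int n, float f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

struct WorkItem { float factor; };

std::queue<WorkItem>    workQueue;
std::mutex              queueMutex;
std::condition_variable queueCv;
bool done = false;

void gpuWorker()
{
    cudaSetDevice(0);                     // this thread owns the GPU context
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    for (;;) {
        std::unique_lock<std::mutex> lock(queueMutex);
        queueCv.wait(lock, [] { return done || !workQueue.empty(); });
        if (workQueue.empty() && done) break;
        WorkItem item = workQueue.front();
        workQueue.pop();
        lock.unlock();

        // All GPU work is issued from this one thread.
        scaleKernel<<<(n + 255) / 256, 256>>>(d, n, item.factor);
        cudaDeviceSynchronize();
    }
    cudaFree(d);
}

int main()
{
    std::thread worker(gpuWorker);

    // Producer threads never call CUDA; they only enqueue work.
    std::vector<std::thread> producers;
    for (int p = 0; p < 3; ++p)
        producers.emplace_back([p] {
            for (int j = 0; j < 5; ++j) {
                { std::lock_guard<std::mutex> lg(queueMutex);
                  workQueue.push({1.0f + 0.1f * p}); }
                queueCv.notify_one();
            }
        });

    for (auto &t : producers) t.join();
    { std::lock_guard<std::mutex> lg(queueMutex); done = true; }
    queueCv.notify_one();
    worker.join();
    printf("all work processed by a single GPU-owning thread\n");
    return 0;
}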
Streams are a mechanism for issuing asynchronous commands to a single GPU context so that overlap can occur between CUDA function calls (for example, copying during kernel execution). They don't break the basic 1:1 thread-to-device-context paradigm that CUDA is based around. Kernel execution can't overlap on current hardware (the new Fermi hardware is supposed to eliminate this restriction).
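For example, a common pattern within a single context is overlapping an asynchronous host-to-device copy with a kernel running in another stream. The sketch below (pinned memory plus cudaMemcpyAsync; the kernel and names are mine) shows the shape of it.

// Copy/compute overlap within one context using two streams.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main()
{
    const int n = 1 << 22;
    float *hA, *hB, *dA, *dB;
    cudaStream_t s0, s1;

    cudaMallocHost(&hA, n * sizeof(float));   // pinned host memory is required
    cudaMallocHost(&hB, n * sizeof(float));   // for truly asynchronous copies
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // The copy in s1 can overlap with the kernel running in s0.
    cudaMemcpyAsync(dA, hA, n * sizeof(float), cudaMemcpyHostToDevice, s0);
    addOne<<<(n + 255) / 256, 256, 0, s0>>>(dA, n);
    cudaMemcpyAsync(dB, hB, n * sizeof(float), cudaMemcpyHostToDevice, s1);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFree(dA); cudaFree(dB);
    cudaFreeHost(hA); cudaFreeHost(hB);
    printf("overlapped copy and kernel issued\n");
    return 0;
}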
______________________ A clearer explanation ______________
CUDA activity from independent host processes will normally create independent CUDA contexts, one for each process. Thus, the CUDA activity launched from separate host processes will take place in separate CUDA contexts, on the same device.
CUDA activity in separate contexts will be serialized. The GPU will execute the activity from one process, and when that activity is idle, it can and will context-switch to another context to complete the CUDA activity launched from the other process. The detailed inter-context scheduling behavior is not specified. (Running multiple contexts on a single GPU also cannot normally violate basic GPU limits, such as memory availability for device allocations.)
The "exception" to this case (serialization of GPU activity from independent host processes) would be the CUDA Multi-Process Service (MPS). In a nutshell, the MPS acts as a "funnel" to collect CUDA activity emanating from several host processes, and run that activity as if it emanated from a single host process. The principal benefit is to avoid the serialization of kernels which might otherwise be able to run concurrently. The canonical use-case would be for launching multiple MPI ranks that all intend to use a single GPU resource.
Note that the above description applies to GPUs which are in the "Default" compute mode. GPUs in "Exclusive Process" or "Exclusive Thread" compute modes will reject any attempts to create more than one process/context on a single device. In one of these modes, attempts by other processes to use a device already in use will result in a CUDA API reported failure. The compute mode is modifiable in some cases using the nvidia-smi utility.
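A small runtime-API sketch (assuming device 0) that checks which compute mode a GPU is in before relying on multi-process access; the mode itself is changed with nvidia-smi, not from CUDA code.

// Query the compute mode of device 0 via the runtime API.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    switch (prop.computeMode) {
    case cudaComputeModeDefault:
        printf("Default: multiple host processes/contexts may use this GPU\n");
        break;
    case cudaComputeModeExclusiveProcess:
        printf("Exclusive Process: only one process may create a context\n");
        break;
    case cudaComputeModeProhibited:
        printf("Prohibited: no process may create a context on this GPU\n");
        break;
    default:
        printf("Other/legacy compute mode (%d)\n", prop.computeMode);
        break;
    }
    return 0;
}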
________________________________________________
A CUDA context is a virtual execution space that holds the code and data owned by a host thread or process. Only one context can ever be active on a GPU with all current hardware. So to answer your first question, if you have seven separate threads or processes all trying to establish a context and run on the same GPU simultaneously, they will be serialised, and any process waiting for access to the GPU will be blocked until the owner of the running context yields. There is, to the best of my knowledge, no time slicing, and the scheduling heuristics are not documented and (I would suspect) not uniform from operating system to operating system.
You would be better off launching a single worker thread that holds a GPU context and using messaging from the other threads to push work onto the GPU. Alternatively, there is a context migration facility available in the CUDA driver API, but that will only work with threads from the same process, and the migration mechanism has latency and host CPU overhead.
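For completeness, here is a sketch of that driver-API migration mechanism using cuCtxPopCurrent/cuCtxPushCurrent (my illustration; error handling omitted): the context is created in one thread, popped so it becomes floating, then pushed and used by another thread of the same process.

// Migrate a driver-API context between two threads of the same process.
#include <cstdio>
#include <thread>
#include <cuda.h>

int main()
{
    CUdevice  dev;
    CUcontext ctx;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);   // bound to this (main) thread
    cuCtxPopCurrent(&ctx);       // detach: ctx is now "floating"

    std::thread worker([ctx] {
        cuCtxPushCurrent(ctx);   // attach the same context to this thread
        CUdeviceptr dptr;
        cuMemAlloc(&dptr, 256);  // work happens in the migrated context
        cuMemFree(dptr);
        CUcontext popped;
        cuCtxPopCurrent(&popped); // detach again before the thread exits
    });
    worker.join();

    cuCtxDestroy(ctx);
    printf("context migrated between threads\n");
    return 0;
}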