Intro to Parallel Programming
How do you dig a hole faster?
THE GPU DESIGN PHILOSOPHY
Many, many simple compute units;
An explicit parallel programming model;
Optimized for throughput, not latency.
CPU: HOST
GPU: DEVICE
A Typical GPU Program
1. CPU allocates storage on the GPU
cudaMalloc
2. CPU copies input data from CPU to GPU
cudaMemcpy
3. CPU launches kernel(s) on the GPU to process the data
kernel launch
4. CPU copies results back to CPU from GPU
cudaMemcpy
Ideally, copying the data back from the GPU happens only once, as the last step.
DEFINING THE GPU COMPUTATION
BIG IDEA
KERNELS LOOK LIKE SERIAL PROGRAMS
WRITE YOUR PROGRAM AS IF IT WILL RUN ON ONE THREAD
THE GPU WILL RUN THAT PROGRAM ON MANY THREADS
MAKE SURE YOU UNDERSTAND THIS
THIS IS IMPORTANT
WHAT IS THE GPU GOOD AT?
1. EFFICIENTLY LAUNCHING LOTS OF THREADS
2. RUNNING LOTS OF THREADS IN PARALLEL
SIMPLE EXAMPLE:
IN: FLOAT ARRAY [0 1 2 … 63]
OUT: FLOAT ARRAY [0 1*1 2*2 … 63*63]
= [0 1 4 9 … ]
KERNEL: SQUARE
CPU CODE: SQUARE EACH ELEMENT OF AN ARRAY
for (int i = 0; i < 64; i++) {
out[i] = in[i] * in[i];
}
1. ONLY ONE THREAD OF EXECUTION
("thread" = one independent path of execution through the code)
2. NO EXPLICIT PARALLELISM
GPU CODE: A HIGH-LEVEL VIEW
CPU
ALLOCATE MEMORY
COPY DATA TO/FROM GPU
LAUNCH KERNEL
SPECIFIES DEGREE OF PARALLELISM
GPU
EXPRESS OUT = IN * IN
SAYS NOTHING
ABOUT THE DEGREE OF PARALLELISM
CPU CODE: square_kernel<<<1, 64>>>(outArray, inArray)
BUT HOW DOES IT WORK IF I LAUNCH 64 INSTANCES OF THE SAME PROGRAM?
CPU LAUNCHES 64 THREADS
#include <stdio.h>
__global__ void cube(float * d_out, float * d_in){
int idx = threadIdx.x;      // each thread picks its own element
float f = d_in[idx];
d_out[idx] = f * f * f;
}
int main(int argc, char ** argv) {
const int ARRAY_SIZE = 96;
const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);
//generate the input array on the host
float h_in[ARRAY_SIZE];
for(int i = 0; i < ARRAY_SIZE; i++) {
h_in[i] = float(i);
}
float h_out[ARRAY_SIZE];
//declare GPU memory pointers
float* d_in;
float* d_out;
//allocate GPU memory
cudaMalloc((void**)&d_in, ARRAY_BYTES);
cudaMalloc((void**)&d_out, ARRAY_BYTES);
//transfer the array to the GPU
cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);
//launch the kernel
cube<<<1,ARRAY_SIZE>>>(d_out, d_in);
//copy back the result array to the CPU
cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);
//print out the resulting array
for(int i = 0; i < ARRAY_SIZE; i++) {
printf("%f", h_out[i]);
printf(((i % 4) != 3) ? "\t" : "\n");
}
cudaFree(d_in);
cudaFree(d_out);
return 0;
}
Configuring the Kernel Launch
square<<<1, 64>>>(d_out, d_in)
<<<number of blocks, threads per block>>>
1. You can run many blocks at once
2. There is a maximum number of threads per block (512 on older GPUs, 1024 on newer ones)
128 threads:  square<<<1, 128>>>( … )
1280 threads: square<<<10, 128>>>( … )
              square<<<5, 256>>>( … )
KERNEL<<<GRID OF BLOCKS, BLOCK OF THREADS>>>( … )
The grid and each block can be 1-, 2-, or 3-dimensional, specified with dim3:
dim3(x, y, z)
dim3(w, 1, 1) == dim3(w) == w
square<<<1, 64>>> == square<<<dim3(1,1,1), dim3(64,1,1)>>>
kernel<<<grid of blocks, block of threads>>>( … )
square<<<dim3(bx,by,bz), dim3(tx,ty,tz), shmem>>>( … )
i.e. square<<<grid of blocks (bx·by·bz), block of threads (tx·ty·tz), shared memory per block in bytes>>>
threadIdx: thread within block (threadIdx.x, threadIdx.y, …)
blockDim: size of a block
blockIdx: block within grid
gridDim: size of grid
MAP
set of elements to process [64 floats]
function to run on each element ["square"]
map(elements, function)
GPUs are good at map:
-- GPUs have many parallel processors
-- GPUs optimize for throughput
struct uchar4 {
unsigned char x;
unsigned char y;
unsigned char z;
unsigned char w;
};
Converting color to black and white
I = (R + G + B) / 3                   (naive: plain average)
I = .299f*R + .587f*G + .114f*B       (weights channels by perceived brightness)