CUDA learning record two

This is a sample program that ships with CUDA; it covers the general workflow of a CUDA computation.

This page has a fairly clear introduction to CUDA. Thanks to the author for sharing: http://blog.csdn.net/hjimce/article/details/51506207

In general, the workflow of a CUDA computation is:

1. Select the GPU: cudaSetDevice. This matters mainly on machines with multiple GPUs; device numbering starts at 0.

2. Allocate device memory: cudaMalloc. Usage: cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));

The returned pointer holds a device memory address and cannot be dereferenced on the host side.

3. Copy data from the host to the device: cudaMemcpy. Usage:

cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);

Note that the destination address, the source address, and the direction of the transfer (cudaMemcpyHostToDevice here) all have to be specified.

4. Launch the kernel, a function declared __global__:

addKernel<<<blocksPerGrid, threadsPerBlock>>>( ... );

Here blocksPerGrid and threadsPerBlock are both dim3 values (a plain integer also works for a one-dimensional launch); see the sketch after this list.

5. Copy the results back to the host (direction cudaMemcpyDeviceToHost).

6. Free the device memory with cudaFree.
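
The sample below launches a single block (addKernel<<<1, size>>>), so plain integers suffice for the launch configuration. As a minimal sketch of what a multi-block dim3 launch would look like (addKernelN, threadsPerBlock, blocksPerGrid, and n are names introduced here, not part of the sample), the kernel then has to compute a global index and guard against threads past the end of the array:

    __global__ void addKernelN(int *c, const int *a, const int *b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index across all blocks
        if (i < n)                                      // the last block may overshoot n
            c[i] = a[i] + b[i];
    }

    // Launch enough 256-thread blocks to cover all n elements (rounding up).
    dim3 threadsPerBlock(256);
    dim3 blocksPerGrid((n + threadsPerBlock.x - 1) / threadsPerBlock.x);
    addKernelN<<<blocksPerGrid, threadsPerBlock>>>(dev_c, dev_a, dev_b, n);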

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>   // for exit() and system()

static void HandleError(cudaError_t err,
    const char *file,
    int line) {
    if (err != cudaSuccess) {
        printf("%s in %s at line %d\n", cudaGetErrorString(err),
            file, line);
        exit(EXIT_FAILURE);
    }
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);
void printCudaInformation();

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;   // one thread per element, single block
    c[i] = a[i] + b[i];
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    HANDLE_ERROR( addWithCuda(c, a, b, arraySize) );

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    HANDLE_ERROR( cudaDeviceReset() );

    system("pause");
    printCudaInformation();
    system("pause");
    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus = cudaSuccess;

    // Choose which GPU to run on, change this on a multi-GPU system.
    HANDLE_ERROR(cudaSetDevice(0));

    // Allocate GPU buffers for three vectors (two input, one output).
    HANDLE_ERROR(cudaMalloc((void**)&dev_c, size * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_a, size * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_b, size * sizeof(int)));

    // Copy input vectors from host memory to GPU buffers.
    HANDLE_ERROR(cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice));

    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

    // Check for any errors launching the kernel.
    HANDLE_ERROR(cudaGetLastError());

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    HANDLE_ERROR(cudaDeviceSynchronize());

    // Copy output vector from GPU buffer to host memory.
    HANDLE_ERROR(cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost));

    // Free the device buffers (step 6 of the workflow above).
    HANDLE_ERROR(cudaFree(dev_c));
    HANDLE_ERROR(cudaFree(dev_a));
    HANDLE_ERROR(cudaFree(dev_b));

    return cudaStatus;
}

void printCudaInformation()
{
    int count;
    HANDLE_ERROR(cudaGetDeviceCount(&count));
    printf("count=%d \n", count);
    cudaDeviceProp myProp;
    HANDLE_ERROR(cudaGetDeviceProperties(&myProp, 0));
    printf(" --- General Information of My Cuda Device ---\n");
    printf("    Device name: %s\n", myProp.name);
    printf("    Compute capability: %d.%d\n", myProp.major, myProp.minor);
    printf("    Clock rate: %d kHz\n", myProp.clockRate);

    printf(" --- Memory Information of My Cuda Device ---\n");
    printf("    Total global memory: %zu = %zu double \n", myProp.totalGlobalMem, myProp.totalGlobalMem / sizeof(double));
    printf("    Total const memory: %zu = %zu int \n", myProp.totalConstMem, myProp.totalConstMem / sizeof(int));
    printf("    Max memory pitch: %zu \n", myProp.memPitch);

    printf(" --- Multiprocessor Information of My Cuda Device ---\n");
    printf("    Multiprocessor count= %d\n", myProp.multiProcessorCount);
    printf("    Shared mem per block=%zu\n", myProp.sharedMemPerBlock);
    printf("    Registers per block=%d\n", myProp.regsPerBlock);
    printf("    Threads in warp=%d\n", myProp.warpSize);
    printf("    Max threads per block=%d\n", myProp.maxThreadsPerBlock);
    printf("    Max threads dimensions= (%d, %d, %d) \n",
        myProp.maxThreadsDim[0], myProp.maxThreadsDim[1], myProp.maxThreadsDim[2]);
    printf("    Max grid dimensions= (%d, %d, %d) \n",
        myProp.maxGridSize[0], myProp.maxGridSize[1], myProp.maxGridSize[2]);
    printf("\n");
}
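
Assuming the source is saved as kernel.cu (a name chosen here; it is also what the Visual Studio CUDA project template uses), it can be built from the command line with nvcc, e.g. nvcc kernel.cu -o vecadd. The system("pause") calls are Windows-specific and can be dropped on Linux. If everything runs correctly, the addition part prints:

{1,2,3,4,5} + {10,20,30,40,50} = {11,22,33,44,55}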