Using Constant Memory for Ray Tracing

Download the packaged project

/*
 * Copyright 1993-2010 NVIDIA Corporation.  All rights reserved.
 *
 * NVIDIA Corporation and its licensors retain all intellectual property and
 * proprietary rights in and to this software and related documentation.
 * Any use, reproduction, disclosure, or distribution of this software
 * and related documentation without an express license agreement from
 * NVIDIA Corporation is strictly prohibited.
 *
 * Please refer to the applicable NVIDIA end user license agreement (EULA)
 * associated with this source code for terms and conditions that govern
 * your use of this NVIDIA software.
 *
 */

#include <GL/glut.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "../common/book.h"
#include "../common/cpu_bitmap.h"

#define DIM 1024

// random float in [0, x]
#define rnd( x ) ((x) * rand() / RAND_MAX)
#define INF     2e10f

struct Sphere {
    float   r, b, g;
    float   radius;
    float   x, y, z;

    // Intersect a ray fired from pixel (ox, oy) along the z axis with this
    // sphere. On a hit, return the z coordinate of the hit point and store a
    // shading factor in *n (1 at the sphere's center, falling to 0 at its
    // edge). On a miss, return -INF.
    __device__ float hit(float ox, float oy, float *n) {
        float dx = ox - x;
        float dy = oy - y;
        if (dx*dx + dy*dy < radius*radius) {
            float dz = sqrtf(radius*radius - dx*dx - dy*dy);
            *n = dz / sqrtf(radius * radius);
            return dz + z;
        }
        return -INF;
    }
};
#define SPHERES 20

// the sphere array lives in constant memory: read-only for kernels,
// statically sized, and filled from the host with cudaMemcpyToSymbol
__constant__ Sphere s[SPHERES];

__global__ void kernel(unsigned char *ptr) {
    // map from threadIdx/BlockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;
    // shift the origin to the center of the image
    float   ox = (x - DIM / 2);
    float   oy = (y - DIM / 2);

    // keep the color of the hit with the largest z (closest to the camera)
    float   r = 0, g = 0, b = 0;
    float   maxz = -INF;
    for (int i = 0; i<SPHERES; i++) {
        float   n;
        float   t = s[i].hit(ox, oy, &n);
        if (t > maxz) {
            float fscale = n;
            r = s[i].r * fscale;
            g = s[i].g * fscale;
            b = s[i].b * fscale;
            maxz = t;
        }
    }

    // write the pixel as RGBA
    ptr[offset * 4 + 0] = (int)(r * 255);
    ptr[offset * 4 + 1] = (int)(g * 255);
    ptr[offset * 4 + 2] = (int)(b * 255);
    ptr[offset * 4 + 3] = 255;
}

// globals needed by the update routine
struct DataBlock {
    unsigned char   *dev_bitmap;
};

int main(void) {
    DataBlock   data;
    // capture the start time
    cudaEvent_t     start, stop;
    HANDLE_ERROR(cudaEventCreate(&start));
    HANDLE_ERROR(cudaEventCreate(&stop));
    HANDLE_ERROR(cudaEventRecord(start, 0));

    CPUBitmap bitmap(DIM, DIM, &data);
    unsigned char   *dev_bitmap;

    // allocate memory on the GPU for the output bitmap
    HANDLE_ERROR(cudaMalloc((void**)&dev_bitmap,
        bitmap.image_size()));

    // allocate temp memory, initialize it, copy to constant
    // memory on the GPU, then free our temp memory
    Sphere *temp_s = (Sphere*)malloc(sizeof(Sphere)* SPHERES);
    for (int i = 0; i<SPHERES; i++) {
        temp_s[i].r = rnd(1.0f);
        temp_s[i].g = rnd(1.0f);
        temp_s[i].b = rnd(1.0f);
        temp_s[i].x = rnd(1000.0f) - 500;
        temp_s[i].y = rnd(1000.0f) - 500;
        temp_s[i].z = rnd(1000.0f) - 500;
        temp_s[i].radius = rnd(100.0f) + 20;
    }
    /*
    Store the SPHERES sphere objects in constant memory.
    This is done with cudaMemcpyToSymbol.
    */
    HANDLE_ERROR(cudaMemcpyToSymbol(s, temp_s,
        sizeof(Sphere)* SPHERES));
    free(temp_s);

    // generate a bitmap from our sphere data
    dim3    grids(DIM / 16, DIM / 16);
    dim3    threads(16, 16);
    kernel <<<grids, threads >>>(dev_bitmap);
    // report a failed kernel launch right away
    HANDLE_ERROR(cudaGetLastError());

    // copy our bitmap back from the GPU for display
    HANDLE_ERROR(cudaMemcpy(bitmap.get_ptr(), dev_bitmap,
        bitmap.image_size(),
        cudaMemcpyDeviceToHost));

    // get stop time, and display the timing results
    HANDLE_ERROR(cudaEventRecord(stop, 0));
    HANDLE_ERROR(cudaEventSynchronize(stop));
    float   elapsedTime;
    HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime,
        start, stop));
    printf("Time to generate:  %3.1f ms\n", elapsedTime);

    HANDLE_ERROR(cudaEventDestroy(start));
    HANDLE_ERROR(cudaEventDestroy(stop));

    HANDLE_ERROR(cudaFree(dev_bitmap));

    // display
    bitmap.display_and_exit();
}
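
To check the intersection and shading logic without a GPU, the same per-pixel work can be run on the host. The block below is only a sketch under some assumptions: `HostSphere`, `Pixel`, `shade_pixel` and `kInf` are names made up for this illustration and are not part of the project above, and the color fields are declared in r, g, b order for readability. It mirrors the math of `Sphere::hit` and the loop body of `kernel`, using plain host C++ (which also compiles under nvcc).

```cpp
#include <cmath>

// Same sentinel as the INF macro in the listing above.
const float kInf = 2e10f;

// Hypothetical host-side copy of Sphere (no __device__ qualifier).
struct HostSphere {
    float r, g, b;
    float radius;
    float x, y, z;

    // Ray through (ox, oy) parallel to the z axis. On a hit, returns the z
    // coordinate of the hit point and writes a shading factor in (0, 1] to *n.
    // On a miss, returns -kInf and leaves *n untouched.
    float hit(float ox, float oy, float *n) const {
        float dx = ox - x;
        float dy = oy - y;
        float d2 = dx * dx + dy * dy;
        if (d2 < radius * radius) {
            float dz = std::sqrt(radius * radius - d2);
            *n = dz / radius;
            return dz + z;
        }
        return -kInf;
    }
};

// Hypothetical RGBA pixel, laid out like the four bytes the kernel writes.
struct Pixel {
    unsigned char r, g, b, a;
};

// Mirrors the kernel's loop: keep the color of the hit with the largest z.
Pixel shade_pixel(const HostSphere *spheres, int count, float ox, float oy) {
    float r = 0, g = 0, b = 0;
    float maxz = -kInf;
    for (int i = 0; i < count; i++) {
        float n;
        float t = spheres[i].hit(ox, oy, &n);
        if (t > maxz) {
            r = spheres[i].r * n;
            g = spheres[i].g * n;
            b = spheres[i].b * n;
            maxz = t;
        }
    }
    return Pixel{static_cast<unsigned char>(static_cast<int>(r * 255)),
                 static_cast<unsigned char>(static_cast<int>(g * 255)),
                 static_cast<unsigned char>(static_cast<int>(b * 255)),
                 255};
}
```

Calling `shade_pixel(spheres, count, x - DIM / 2, y - DIM / 2)` for a few pixels and comparing against the bytes copied back from the GPU is one way to spot-check the kernel, provided the host array holds the same sphere data that was copied to constant memory.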

The result is shown below:

Time: 2024-10-07 03:58:07
