R – GPU Programming for All with ‘gpuR’

INTRODUCTION

GPUs (Graphics Processing Units) have become much more popular in recent years for computationally intensive calculations.  Despite these gains, use of this hardware has been very limited within the R programming language.  Although possible, programming in either OpenCL or CUDA is daunting for many programmers unaccustomed to working with such low-level interfaces.  Creating bindings to R's high-level language that abstract away the complex GPU code would make GPUs far more accessible to R users.  This is the core idea behind the gpuR package.  There are three novel aspects of gpuR:

  1. Applicable to ‘ALL’ GPUs
  2. Abstracts away CUDA/OpenCL code so it can easily be incorporated into existing R algorithms
  3. Separates copy and compute functions to allow objects to persist on the GPU

BROAD APPLICATION:

The ‘gpuR’ package was created to bring the power of GPU computing to any R user with a GPU device.  Although there are a handful of packages that provide some GPU capability (e.g. gputools, cudaBayesreg, HiPLARM, HiPLARb, and gmatrix), all are strictly limited to NVIDIA GPUs.  A backend based upon OpenCL would instead allow all users to benefit from GPU hardware.  The ‘gpuR’ package therefore utilizes the ViennaCL linear algebra library, which contains auto-tuned OpenCL kernels (among others) that can be leveraged for GPUs.  The headers have been conveniently repackaged in the RViennaCL package.  ViennaCL also allows for a CUDA backend for those with NVIDIA GPUs who may see further improved performance (contained within the companion gpuRcuda package, not yet formally released).
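As a quick sketch of how device discovery looks in practice, gpuR exposes query functions such as detectGPUs() and gpuInfo(); the exact output depends on your OpenCL drivers and hardware, so the snippet below is illustrative rather than definitive:

```r
library(gpuR)

# How many OpenCL-capable GPUs can gpuR see?
detectGPUs()

# Details (device name, memory, compute units) of the current device
gpuInfo()
```

If multiple devices or platforms are present, listContexts() shows the available contexts so a specific device can be selected.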

ABSTRACT AWAY GPU CODE:

The gpuR package uses the S4 object oriented system to provide explicit classes and methods that allow the user to simply cast their matrix or vector and continue programming in R as normal.  For example:


library(gpuR)

ORDER = 1024

A = matrix(rnorm(ORDER^2), nrow=ORDER)
B = matrix(rnorm(ORDER^2), nrow=ORDER)

gpuA = gpuMatrix(A, type="double")
gpuB = gpuMatrix(B, type="double")

C = A %*% B
gpuC = gpuA %*% gpuB

all(C == gpuC[])
[1] TRUE

The gpuMatrix object points to a matrix in RAM which is then computed by the GPU when a desired function is called.  This avoids R’s habit of copying the memory of objects.  For example:


library(pryr)

# Initially points to same object
x = matrix(rnorm(16), 4)
y = x

address(x)
[1] "0x16177f28"
address(y)
[1] "0x16177f28"

# But once we modify the second object, R creates a copy
y[1,1] = 0

address(x)
[1] "0x16177f28"
address(y)
[1] "0x15fbb1d8"

In contrast, the same syntax for a gpuMatrix will modify the original object in-place without any copy.


library(pryr)
library(gpuR)

# Initially points to same object
x = gpuMatrix(rnorm(16), 4, 4)
y = x

x@address
[1] <pointer: 0x6baa040>
y@address
[1] <pointer: 0x6baa040>

# Modification affects both objects without a copy
y[1,1] = 0

x@address
[1] <pointer: 0x6baa040>
y@address
[1] <pointer: 0x6baa040>

Each new variable assigned to this object will only copy the pointer, thereby making the program more memory efficient.  However, the gpuMatrix class does still require allocating GPU memory and copying data to the device for each function call.  The most commonly used methods have been overloaded, such as %*%, +, -, *, /, crossprod, tcrossprod, and the trig functions among others.  In this way, an R user can create these objects and leverage GPU resources without needing to learn a host of new functions that would break existing algorithms.
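As a brief illustrative sketch of these overloaded methods (assuming gpuR is installed and an OpenCL device is available; the values are random):

```r
library(gpuR)

A <- gpuMatrix(rnorm(100), 10, 10, type = "double")
B <- gpuMatrix(rnorm(100), 10, 10, type = "double")

# These read exactly like base R but dispatch to GPU kernels
S <- A + B
P <- crossprod(A, B)   # equivalent to t(A) %*% B, computed on the GPU

# Data is only copied back to the host when explicitly requested
P[][1:3, 1:3]
```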

DISTINCT COPY/COMPUTE FUNCTIONALITY:

For the gpuMatrix and gpuVector classes there are companion vclMatrix and vclVector classes that point to objects that persist in the GPU RAM.  In this way, the user explicitly decides when data needs to be moved back to the host.  By avoiding unnecessary data transfers between host and device, performance can improve significantly.  For example:


vclA = vclMatrix(rnorm(10000), nrow = 100)
vclB = vclMatrix(rnorm(10000), nrow = 100)
vclC = vclMatrix(rnorm(10000), nrow = 100)

# GEMM
vclD = vclA %*% vclB

# Element-wise addition
vclD = vclD + vclC

In this code, the three initial matrices already exist in GPU memory, so no data transfer takes place in the GEMM call.  Furthermore, the returned matrix remains in GPU memory; since the ‘vclD’ object is still in GPU RAM, the element-wise addition also happens directly on the GPU with no data transfers.  It is also worth noting that the user can still modify elements, rows, or columns with exactly the same syntax as a normal R matrix.


vclD[1,1] = 42
vclD[,2] = rep(12, 100)
vclD[3,] = rep(23, 100)

These operations simply copy the new elements to the GPU and modify the object in-place within the GPU memory. The ‘vclD’ object is never copied to the host.

BENCHMARKS:

With all that in mind, how does gpuR perform?  Here are some general benchmarks of the popular GEMM operation.  I currently only have access to a single NVIDIA GeForce GTX 970 for these simulations.  Users should expect to see differences with high-performance GPUs (e.g. AMD FirePro, NVIDIA Tesla, etc.).  Speedup relative to the CPU will also vary depending upon user hardware.

(1) DEFAULT DGEMM VS BASE R

Base R supports only two numeric types (integer and double).  Figure 1 shows the fold speedup achieved by using the gpuMatrix and vclMatrix classes.  Since R is not known for speed, an implementation with the OpenBLAS backend is included as well for reference, using a 4-core Intel i5-2500 CPU @ 3.30GHz.  As can be seen, there is a dramatic speedup from just using OpenBLAS or the gpuMatrix class (essentially equivalent).  Of interest is the impact of the host-device-host transfer time that is typical of many GPU implementations.  This cost is eliminated by the vclMatrix class, which continues to scale with matrix size.

Figure 1 – Fold speedup achieved using OpenBLAS (CPU) as well as the gpuMatrix/vclMatrix (GPU) classes provided in gpuR.

(2) SGEMM VS BASE R

Many GPU benchmarks also measure float operations.  As noted above, R does not provide a float type by default.  One way to work around this is to use the RcppArmadillo package and explicitly cast R objects to float types.  The Armadillo library will also default to whichever BLAS backend is provided (i.e. OpenBLAS).  Figure 2 shows the impact of using float data types.  OpenBLAS continues to provide a noticeable speedup, but gpuMatrix begins to outperform it once the matrix order exceeds 1500.  The vclMatrix class continues to demonstrate the value of retaining objects in GPU memory and avoiding memory transfers.
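The float cast mentioned above is not expressible in base R; a sketch of the kind of RcppArmadillo helper involved (an illustration of the technique, not the exact benchmark code) is:

```r
library(Rcpp)

# Small C++ helper: cast to single precision, multiply, cast back.
# Requires the RcppArmadillo package and a C++ toolchain.
cppFunction(depends = "RcppArmadillo", '
  arma::mat sgemm(const arma::mat& A, const arma::mat& B) {
    arma::fmat fA = arma::conv_to<arma::fmat>::from(A);  // double -> float
    arma::fmat fB = arma::conv_to<arma::fmat>::from(B);
    arma::fmat fC = fA * fB;  // single-precision GEMM via the BLAS backend
    return arma::conv_to<arma::mat>::from(fC);           // float -> double
  }
')

A <- matrix(rnorm(4), 2)
B <- matrix(rnorm(4), 2)
max(abs(sgemm(A, B) - A %*% B))  # small but nonzero: float rounding
```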

Figure 2 – Float type GEMM comparisons.  Fold speedup achieved using OpenBLAS (via RcppArmadillo) as well as the gpuMatrix/vclMatrix (GPU) classes provided in gpuR.

An additional view of the performance achieved by gpuMatrix and vclMatrix comes from comparing directly against OpenBLAS.  The gpuMatrix class reaches a ~2-3 fold speedup over OpenBLAS, whereas vclMatrix scales to over a 100-fold speedup!  It is curious why vclMatrix is so much faster, given that the two classes differ only in host-device-host transfers.  Further optimization of gpuMatrix will need to be explored (fresh eyes are welcome), accepting the limitations of bus transfer speed.  Performance will certainly improve with hardware advances such as NVIDIA's NVLink.

Figure 3 – Fold speedup achieved over OpenBLAS (via RcppArmadillo) in float type GEMM comparisons vs the gpuMatrix/vclMatrix (GPU) classes provided in gpuR.

CONCLUSION

The gpuR package has been created to bring GPU computing to as many R users as possible.  The intention is to use gpuR to more easily supplement current and future algorithms that could benefit from GPU acceleration.  The gpuR package is currently available on CRAN.  The development version can be found on my GitHub, in addition to existing issues and wiki pages (assisting primarily with installation).  Future developments include solvers (e.g. QR, SVD, Cholesky, etc.), scaling across multiple GPUs, ‘sparse’ class objects, and custom OpenCL kernels.

As noted above, this package is intended to be used with a multitude of hardware and operating systems (it has been tested on Windows, Mac, and multiple Linux flavors).  I only have access to a limited set of hardware (I can't access every GPU, let alone the most expensive ones).  As such, the development of gpuR depends upon the R user community.  Volunteers who possess different hardware are always welcome and encouraged to submit issues regarding any discovered bugs.  I have started a Gitter channel for users to report successful usage with alternate hardware.  Suggestions and general conversation about gpuR are welcome.

Source: http://www.parallelr.com/r-gpu-programming-for-all-with-gpur/
