Running R jobs quickly on many machines (repost)

As we demonstrated in “A gentle introduction to parallel computing in R”, one of the great things about R is how easy it is to take advantage of parallel processing capabilities to speed up calculation. In this note we will show how to move from running jobs on multiple CPUs/cores to running jobs on multiple machines (for even larger scaling and greater speedup). Using the technique on Amazon EC2 even turns your credit card into a supercomputer.


[Image: the Colossus supercomputer, from “Colossus: The Forbin Project”]

R itself is not a language designed for parallel computing. It doesn’t have a lot of great user-exposed parallel constructs. What saves us is that the data science tasks we tend to use R for are themselves very well suited for parallel programming, and many people have prepared very good pragmatic libraries to exploit this. There are three main ways for a user to benefit from library-supplied parallelism:

  • Link against superior parallel libraries such as the Intel BLAS library (supplied on Linux, OS X, and Windows as part of the Microsoft R Open distribution of R). This replaces the libraries you are already using with parallel ones, and you get a speedup for free (on appropriate tasks, such as the linear algebra portions of lm()/glm()).
  • Ship your modeling tasks out of R into an external parallel system for processing. This is the strategy of systems such as the rx methods from RevoScaleR (now Microsoft Open R), the h2o methods from h2o.ai, or RHadoop.
  • Use R’s parallel facility to ship jobs to cooperating R instances. This is the strategy used in “A gentle introduction to parallel computing in R” and in many libraries that sit on top of parallel. It is essentially implementing remote procedure call through sockets or networking (a minimal sketch follows this list).
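
For example, on a single machine this looks like the following (a minimal sketch; the squaring task is just a stand-in for real work):

library(parallel)
# start one local R worker process per core
parallelCluster <- makeCluster(detectCores())
# ship the task to the workers and collect the results
squares <- parLapply(parallelCluster, 1:100, function(x) x*x)
stopCluster(parallelCluster)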

We are going to write more about the third technique.

The third technique is essentially a very coarse-grained remote procedure call. It depends on shipping copies of code and data to remote processes and then returning results. It is ill suited for very small tasks, but very well suited to a reasonable number of moderate to large tasks. This is the strategy used by R’s parallel library and Python’s multiprocessing library (though with Python’s multiprocessing you pretty much need to bring in additional libraries to move from single machine to cluster computing).
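
To make the “ship copies of code and data, then collect results” idea concrete, here is a minimal sketch (bigData and scoreUpTo are invented names standing in for your own data and task function):

library(parallel)
cl <- makeCluster(2)
bigData <- data.frame(x = rnorm(1000))            # stand-in data set
scoreUpTo <- function(i) { sum(bigData$x[1:i]) }  # stand-in task
# copy the data and the function to every worker before using them
clusterExport(cl, c('bigData', 'scoreUpTo'))
res <- parLapply(cl, c(100, 500, 1000), scoreUpTo)
stopCluster(cl)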

This method may seem less efficient and less sophisticated than shared memory methods, but relying on object transmission means it is in principle very easy to extend the technique from a single machine to many machines (also called “cluster computing”). That extension is the R portion of what we will demonstrate here (moving from a single machine to a cluster necessarily brings in a lot of systems/networking/security issues, which we will have to defer).

Here is the complete R portion of the lesson. This assumes you already understand how to configure “ssh” or have a systems person who can help you with the ssh system steps.

Take the examples from “A gentle introduction to parallel computing in R”, and instead of starting your parallel cluster with the command “parallelCluster <- parallel::makeCluster(parallel::detectCores())”, do the following:

Collect a list of addresses of machines you can ssh to. This is the hard part: it depends on your operating system, and it is something you should get help with if you have not tried it before. In this case I am using IPv4 addresses, but when using Amazon EC2 I use hostnames.

In my case my list is:

  • My machine (primary): “192.168.1.235”, user “johnmount”
  • Another Win-Vector LLC machine: “192.168.1.70”, user “johnmount”

Notice we are not collecting passwords, as we are assuming we have set up proper “authorized_keys” and keypairs in the “.ssh” configurations of all of these machines. We are calling the machine we are using to issue the overall computation “primary.”

It is vital you try all of these addresses with “ssh” in a terminal shell before trying them with R. Also, the machine address you choose as “primary” must be an address the worker machines can use to reach back to the primary machine (so you can’t use “localhost”, or use an unreachable machine as primary). Try ssh by hand back and forth from the primary to all of these machines and from all of these machines back to your primary before trying to use ssh with R.
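
If you want to sanity-check the primary-to-worker direction from inside R, a convenience sketch such as the following can help (not part of the original recipe; substitute your own user name and hosts, and still test the reverse direction by hand as described above):

# try a trivial remote command on each worker; each call should print "ok"
hosts <- c('192.168.1.235', '192.168.1.70')
for (h in hosts) {
  system2('ssh', args = c(paste0('johnmount@', h), 'echo ok'))
}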

Now with the system stuff behind us the R part is as follows. Start your cluster with:

primary <- '192.168.1.235'
machineAddresses <- list(
  list(host=primary,user='johnmount',
       ncore=4),
  list(host='192.168.1.70',user='johnmount',
       ncore=4)
)

# expand the machine list into one worker entry per core
spec <- lapply(machineAddresses,
               function(machine) {
                 rep(list(list(host=machine$host,
                               user=machine$user)),
                     machine$ncore)
               })
spec <- unlist(spec,recursive=FALSE)

parallelCluster <- parallel::makeCluster(type='PSOCK',
                                         master=primary,
                                         spec=spec)
print(parallelCluster)
## socket cluster with 8 nodes on hosts
##                   '192.168.1.235', '192.168.1.70'

And that is it. You can now run your job on many cores on many machines. For the right tasks this represents a substantial speedup. As always, separate your concerns when starting: first get a trivial “hello world” task to work on your cluster, then get a smaller version of your computation to work on a local machine, and only after both of these throw your real work at the cluster.
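
Once the cluster object exists you use it exactly as in the single-machine examples, and you should release it when you are done. A minimal sketch of both steps:

# a trivial "hello world" style check that all the remote workers respond
hello <- parallel::parLapply(parallelCluster, 1:8,
                             function(i) paste('hello from task', i))
print(unlist(hello))
# always shut the cluster down when finished to release the remote R processes
parallel::stopCluster(parallelCluster)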

As we have mentioned before, with some more system work you can spin up transient Amazon EC2 instances to join your computation. At this point your credit card becomes a supercomputer (though you do have to remember to shut the instances down to prevent extra expenses!).

Source: http://www.win-vector.com/blog/2016/01/running-r-jobs-quickly-on-many-machines/
