This project aims to develop a runtime framework for parallelizing R that provides automatic and transparent parallel programming capabilities. The goal is to achieve speedup and scalability for R applications, benefiting users in the R community.
pR: Automatic, Transparent Runtime Parallelization of the R Scripting Language
Jiangtian Li
Department of Computer Science, North Carolina State University
Acknowledgement
• This project originated from, and is in collaboration with, Dr. Samatova's group at Oak Ridge National Laboratory
  • Dr. Nagiza Samatova
  • Guru Kora
  • Srikanth Yoginath
• Advisors
  • Dr. Xiaosong Ma
  • Dr. Nagiza Samatova
• Supported by grants from NSF and DOE
Outline
• Motivation
• Background
• Architecture
• Design
• Performance
• Conclusion and Future Work
Motivation
• Increasing demand for massive scientific data processing
  • Statistical analysis of gene/protein data (61 billion sequence records in GenBank)
  • Time series analysis of climate data (~300 GB for 10 years)
• Widely used computing tools such as R and Matlab are interpreted languages
  • Their interpreted nature facilitates runtime parallelization
  • Workloads involve both computation-intensive and data-intensive tasks
  • Can exploit both task and data parallelism
What is R?
• Portable and extensible software as well as an interpreted language
• Lisp-like read-eval-print loop
• Performs diverse statistical analyses
• Many extension packages are under active development
• Can be used in either interactive mode or batch mode
Example R script: "example.R"

# Assign an integer
a <- 1
# Construct a vector of 9 real numbers
# drawn from a normal distribution
c <- rnorm(9)
# Initialize a two-dimensional array
d <- array(0:0, dim=c(9,9))
# Loop, reading data from files
for (i in 1:length(c)) {
  d[i,] <- matrix(scan(paste("test.data", i, sep="")))
}
Example – batch mode execution

From the R prompt:
> source("example.R")
> a
[1] 1
> c
[1]  1.16808  0.15877  1.40785  1.73696 -1.19267  0.41321
[7] -0.39817 -0.13059 -0.67247
> d
     [,1] [,2] [,3] [,4] [,5] ...
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
...

From the shell:
R CMD BATCH example.R
Research Goal
• Propose a runtime framework for parallelizing R
• Provide automatic and transparent parallel R programming
• Achieve speedup and scalability for R applications, benefiting the R community
Related Work
• Embarrassingly parallel
  • snow package – Rossini et al.
• Message passing
  • MultiMATLAB – Trefethen et al.
  • pyMPI – Miller
• Back-end support
  • RScaLAPACK – Yoginath et al.
  • Star-P – Choy et al.
• Compilers
  • Otter – Quinn et al.
• Shared memory
  • MATmarks – Almasi et al.
Related Work (continued)
• Parallelizing compilers
  • SUIF – Hall et al.
  • Polaris – Blume et al.
• Runtime parallelization
  • Jrpm – Chen et al.
• Dynamic compilation
  • DyC – Grant et al.
Design Rationale
• Most R code consists of high-level prebuilt functions, e.g., svd for singular value decomposition, eigen for eigenvalue and eigenvector computation
• Loops usually have few inter-iteration dependencies and a high per-iteration execution cost, e.g., R applications from Bioconductor
• No pointers, hence no aliasing problems
A sketch of the kind of code this design targets follows.
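The snippet below is illustrative only (it is not from the deck): it shows the code shape the rationale describes, with the expensive work inside prebuilt functions and inside a loop whose iterations are mutually independent.

# Illustrative example of the targeted code shape (not from the deck).
m <- matrix(rnorm(500 * 500), nrow = 500)
s <- svd(m)               # expensive prebuilt function: one coarse-grained task
e <- eigen(crossprod(m))  # an independent coarse-grained task

res <- numeric(10)
for (i in 1:10) {
  # no cross-iteration dependence: each iteration could run on a worker
  res[i] <- sum(svd(m + i)$d)
}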
Approach
• Selective parallelization scheme that focuses on function calls and loops
• Dynamic and incremental dependence analysis with runtime evaluation – pause where a dependence cannot be determined statically, e.g., at a dynamic loop bound or a conditional branch
• Master-worker paradigm to reduce scheduling and data communication overhead
  • "Outsource" expensive tasks, i.e., function calls and loops, to workers
  • Data are distributed across workers
Framework Architecture
• Inter-node communication – MPI
• Inter-process communication – Unix domain sockets
Analyzer
• Input – R script
• Output – Task Precedence Graph
• Task – the finest unit of scheduling
• Identifies precedence relationships among tasks
Parsing
• Identify the basic execution unit – the R statement
• Retrieve expressions such as variable names and array subscripts
• Output a parse tree
An example of a parse tree (figure omitted)
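As a hedged illustration of what such a parse tree contains, base R's own parser can be used to inspect one statement from the example script (the deck does not specify pR's parser internals; this uses R's standard utils API):

# Inspect the parse tree of one R statement using base R (illustrative).
p <- parse(text = "d[i,] <- matrix(scan('test.data1'))", keep.source = TRUE)
getParseData(p)[, c("token", "text")]
# Rows show tokens such as SYMBOL 'd', LEFT_ASSIGN '<-', and
# SYMBOL_FUNCTION_CALL 'matrix', linked by parent/child ids into a tree.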
Dependence Analysis
• Identify tasks – the finest unit of scheduling
• Statement dependence analysis
• Loop dependence analysis – GCD test (a worked sketch follows)
• Incremental analysis
  • Pause at points where runtime information is needed for dependence analysis or a branch decision
  • Obtain runtime evaluation results and proceed
• Output – Task Precedence Graph
  • Vertex – task
  • Edge – dependence
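A worked sketch of the GCD test, written here as a hypothetical R helper (pR's actual implementation is not shown in the deck). For a write to a[c1*i + k1] and a read of a[c2*i + k2], a cross-iteration dependence can exist only if gcd(c1, c2) divides (k2 - k1):

gcd <- function(a, b) if (b == 0) abs(a) else gcd(b, a %% b)

# TRUE means a dependence MAY exist; FALSE proves independence.
gcd_test <- function(c1, k1, c2, k2) (k2 - k1) %% gcd(c1, c2) == 0

gcd_test(1, 0, 1, -1)  # c[i] <- c[i-1] + a : TRUE, possible dependence
gcd_test(2, 0, 2, -1)  # d[2*i] <- d[2*i-1] : FALSE, safe to parallelize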
Loop Parallelization
• Parallelize a loop if no dependence is discovered
  • Execute it in an embarrassingly parallel manner (see the chunking sketch below)
• Adjust the Task Precedence Graph accordingly
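One way to picture the embarrassingly parallel execution is splitting the iteration space into contiguous chunks, one per worker. This is a conceptual sketch, not pR's internal scheduling code:

# Split iterations 1..9 into 3 contiguous chunks (conceptual sketch).
chunk <- function(iters, nworkers)
  split(iters, cut(seq_along(iters), nworkers, labels = FALSE))
chunk(1:9, 3)
# $`1`: 1 2 3   $`2`: 4 5 6   $`3`: 7 8 9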
A running example

task 1: a <- 1
task 2: b <- 2
task 3: c <- rnorm(9)
task 4: d <- array(0:0, dim=c(9,9))
task 5 (∥): for (i in 1:length(c)) { d[i,] <- matrix(scan(paste("test.data", i, sep=""))) }
  No cross-iteration dependence, so the iterations are dispatched in an embarrassingly parallel manner.
task 6: for (i in b:length(c)) { c[i] <- c[i-1] + a }
  Pause point: the lower bound b is unknown until runtime. After evaluation the loop becomes for (i in 2:9), and the c[i-1] dependence forces sequential execution.
task 7: if (c[length(c)] > 10) { e <- eigen(d) } else { e <- sum(c) }
  Pause point: the branch taken depends on the runtime value of c.
Parallel Execution Engine
• Dispatches "ready" tasks (those whose predecessors in the Task Precedence Graph have completed)
• Outsources expensive tasks (loops or function calls) to workers
• Coordinates peer-to-peer data communication and monitors execution status
• Updates the analyzer with runtime results
A sketch of the ready-task computation follows.
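A minimal sketch of how "ready" tasks can be derived from a Task Precedence Graph stored as an adjacency list; the names and representation are hypothetical, not pR internals:

# Hypothetical Task Precedence Graph: each task lists its predecessors.
tpg <- list(
  t1 = character(0),   # a <- 1        : no predecessors
  t2 = character(0),   # c <- rnorm(9) : no predecessors
  t3 = c("t1", "t2")   # loop using a and c : waits for t1 and t2
)
# Ready tasks are those with no unfinished predecessors.
ready <- names(tpg)[vapply(tpg, length, integer(1)) == 0]
ready  # "t1" "t2"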
Ease-of-Use Demonstration
• Comparison of pR and snow (an R add-on package)
  • pR – no modification of user source code is required
  • snow – the user must insert API calls
An illustrative contrast follows.
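A hedged illustration of the contrast: the cluster size, function body, and pR invocation shown below are assumptions rather than taken from the deck, while makeCluster/parLapply/stopCluster are snow's real API.

# With snow, the user rewrites the loop around explicit cluster calls.
library(snow)
cl <- makeCluster(4, type = "MPI")   # explicit cluster setup (size assumed)
results <- parLapply(cl, 1:100,
                     function(i) mean(rnorm(1e6)))
stopCluster(cl)

# Under pR, the equivalent sequential script is left untouched:
#   results <- list()
#   for (i in 1:100) results[[i]] <- mean(rnorm(1e6))
# and is simply run through the pR framework instead of plain R.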
Performance
• Testbed
  • Opt cluster: 16 nodes, each with two dual-core Opteron 265 processors, 1 Gbps Ethernet
  • Fedora Core 5 Linux x86_64 (Linux kernel 2.6.16)
• Benchmarks
  • Boost – a statistics application
  • Bootstrap
  • SVD
Boost
• Analysis overhead is very small
• From 16 to 32 processors, the computation speedup drops to 1.5
Bootstrap (results figure omitted)
SVD
• Analysis overhead is very small
• Serialization of large data sets in R is the major overhead (about 1.9 MB/s)
Task Parallelism Test
• Statistical functions
  • prcomp – principal component analysis
  • svd – singular value decomposition
  • lm.fit – linear model fitting
  • cor – correlation computation
  • fft – Fast Fourier Transform
  • qr – QR decomposition
• Execution time of each task ranges from 3 to 27 seconds
An illustrative script shape follows.
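The benchmark code itself is not in the deck; the following assumed script shape shows how such calls form independent tasks that the execution engine can dispatch to different workers (sizes and inputs are illustrative):

# Each statement below is independent of the others, so pR can
# run them as parallel tasks once m is available.
m <- matrix(rnorm(1000 * 1000), nrow = 1000)
p <- prcomp(m)                # principal component analysis
s <- svd(m)                   # singular value decomposition
l <- lm.fit(m, rnorm(1000))   # linear model fitting
r <- cor(m[, 1], m[, 2])      # correlation
f <- fft(m[, 1])              # Fast Fourier Transform
q <- qr(m)                    # QR decomposition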
Future Work
• Apply loop transformation techniques
• Intelligent scheduling to exploit data locality
• Explore finer granularity – interprocedural parallelization
• Improve load balancing
• Optimize high-level R functions such as serialization
Conclusion
• Presented the pR framework, a first step toward parallelizing R automatically and transparently
• Further optimization is needed to improve efficiency