This project aims to develop a runtime framework for parallelizing R that provides automatic and transparent parallel programming capabilities. The goal is to achieve speedup and scalability for R applications, benefiting users in the R community.
pR: Automatic, Transparent Runtime Parallelization of the R Scripting Language
Jiangtian Li
Department of Computer Science, North Carolina State University
Acknowledgement
• This project originated from, and is in collaboration with, Dr. Samatova's group at Oak Ridge National Laboratory
  • Dr. Nagiza Samatova
  • Guru Kora
  • Srikanth Yoginath
• Advisors
  • Dr. Xiaosong Ma
  • Dr. Nagiza Samatova
• Supported by grants from NSF and DOE
Outline
• Motivation
• Background
• Architecture
• Design
• Performance
• Conclusion and Future Work
Motivation
• Increasing demand for massive scientific data processing
  • Statistical analysis of gene/protein data (61 billion sequence records in GenBank)
  • Time series analysis of climate data (~300 GB for 10 years)
• Widely used computing tools such as R and Matlab are interpreted languages
  • Their interpreted nature facilitates runtime parallelization
  • Workloads involve both computation-intensive and data-intensive tasks
  • Can exploit both task and data parallelism
What is R?
• Portable and extensible software as well as an interpreted language
• Lisp-like read-eval-print loop
• Performs diverse statistical analyses
• Many extension packages are under active development
• Can be used in either interactive mode or batch mode
Example R script: "example.R"

# Assign an integer
a <- 1
# Construct a vector of 9 real numbers
# drawn from a normal distribution
c <- rnorm(9)
# Initialize a two-dimensional array
d <- array(0:0, dim=c(9,9))
# Loop, reading data from files
for (i in 1:length(c)) {
  d[i,] <- matrix(scan(paste("test.data", i, sep="")))
}
Example – batch mode execution

From the R prompt:
> source("example.R")
> a
[1] 1
> c
[1]  1.16808  0.15877  1.40785  1.73696 -1.19267  0.41321
[7] -0.39817 -0.13059 -0.67247
> d
     [,1] [,2] [,3] [,4] [,5] ...
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
...

From the shell:
R CMD BATCH example.R
Research Goal
• Propose a runtime framework for parallelizing R
• Provide automatic and transparent parallel R programming
• Achieve speedup and scalability for R applications, benefiting the R community
Related Work
• Embarrassingly parallel
  • snow package – Rossini et al.
• Message passing
  • MultiMATLAB – Trefethen et al.
  • pyMPI – Miller
• Back-end support
  • RScaLAPACK – Yoginath et al.
  • Star-P – Choy et al.
• Compilers
  • Otter – Quinn et al.
• Shared memory
  • MATmarks – Almasi et al.
Related Work (continued)
• Parallelizing compilers
  • SUIF – Hall et al.
  • Polaris – Blume et al.
• Runtime parallelization
  • Jrpm – Chen et al.
• Dynamic compilation
  • DyC – Grant et al.
Design Rationale
• Most R code consists of high-level prebuilt functions, e.g., svd for singular value decomposition, eigen for eigenvalue and eigenvector computation
• Loops usually have few inter-iteration dependencies and a high per-iteration execution cost, e.g., R applications from Bioconductor
• No pointers, hence no aliasing problems
A sketch of the kind of code this design targets follows.
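The snippet below is illustrative only (it is not from the deck): it shows the code shape the rationale describes, with the expensive work inside prebuilt functions and inside a loop whose iterations are mutually independent.

# Illustrative example of the targeted code shape (not from the deck).
m <- matrix(rnorm(500 * 500), nrow = 500)
s <- svd(m)               # expensive prebuilt function: one coarse-grained task
e <- eigen(crossprod(m))  # an independent coarse-grained task

res <- numeric(10)
for (i in 1:10) {
  # no cross-iteration dependence: each iteration could run on a worker
  res[i] <- sum(svd(m + i)$d)
}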
Approach
• Selective parallelization scheme that focuses on function calls and loops
• Dynamic and incremental dependence analysis with runtime evaluation – pause where a dependence cannot be determined statically, e.g., at a dynamic loop bound or a conditional branch
• Master-worker paradigm to reduce scheduling and data communication overhead
  • "Outsource" expensive tasks, i.e., function calls and loops, to workers
  • Data are distributed across workers
Framework Architecture
• Inter-node communication – MPI
• Inter-process communication – Unix domain sockets
Analyzer
• Input – R script
• Output – Task Precedence Graph
• Task – the finest unit of scheduling
• Identifies precedence relationships among tasks
Parsing
• Identify the basic execution unit – the R statement
• Retrieve expressions such as variable names and array subscripts
• Output a parse tree
An example of a parse tree (figure omitted)
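As a hedged illustration of what such a parse tree contains, base R's own parser can be used to inspect one statement from the example script (the deck does not specify pR's parser internals; this uses R's standard utils API):

# Inspect the parse tree of one R statement using base R (illustrative).
p <- parse(text = "d[i,] <- matrix(scan('test.data1'))", keep.source = TRUE)
getParseData(p)[, c("token", "text")]
# Rows show tokens such as SYMBOL 'd', LEFT_ASSIGN '<-', and
# SYMBOL_FUNCTION_CALL 'matrix', linked by parent/child ids into a tree.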
Dependence Analysis
• Identify tasks – the finest unit of scheduling
• Statement dependence analysis
• Loop dependence analysis – GCD test (a worked sketch follows)
• Incremental analysis
  • Pause at points where runtime information is needed for dependence analysis or a branch decision
  • Obtain runtime evaluation results and proceed
• Output – Task Precedence Graph
  • Vertex – task
  • Edge – dependence
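A worked sketch of the GCD test, written here as a hypothetical R helper (pR's actual implementation is not shown in the deck). For a write to a[c1*i + k1] and a read of a[c2*i + k2], a cross-iteration dependence can exist only if gcd(c1, c2) divides (k2 - k1):

gcd <- function(a, b) if (b == 0) abs(a) else gcd(b, a %% b)

# TRUE means a dependence MAY exist; FALSE proves independence.
gcd_test <- function(c1, k1, c2, k2) (k2 - k1) %% gcd(c1, c2) == 0

gcd_test(1, 0, 1, -1)  # c[i] <- c[i-1] + a : TRUE, possible dependence
gcd_test(2, 0, 2, -1)  # d[2*i] <- d[2*i-1] : FALSE, safe to parallelize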
Loop Parallelization
• Parallelize a loop if no dependence is discovered
  • Execute it in an embarrassingly parallel manner (see the chunking sketch below)
• Adjust the Task Precedence Graph accordingly
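One way to picture the embarrassingly parallel execution is splitting the iteration space into contiguous chunks, one per worker. This is a conceptual sketch, not pR's internal scheduling code:

# Split iterations 1..9 into 3 contiguous chunks (conceptual sketch).
chunk <- function(iters, nworkers)
  split(iters, cut(seq_along(iters), nworkers, labels = FALSE))
chunk(1:9, 3)
# $`1`: 1 2 3   $`2`: 4 5 6   $`3`: 7 8 9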
A running example

task 1: a <- 1
task 2: b <- 2
task 3: c <- rnorm(9)
task 4: d <- array(0:0, dim=c(9,9))
task 5 (∥): for (i in 1:length(c)) { d[i,] <- matrix(scan(paste("test.data", i, sep=""))) }
  No cross-iteration dependence, so the iterations are dispatched in an embarrassingly parallel manner.
task 6: for (i in b:length(c)) { c[i] <- c[i-1] + a }
  Pause point: the lower bound b is unknown until runtime. After evaluation the loop becomes for (i in 2:9), and the c[i-1] dependence forces sequential execution.
task 7: if (c[length(c)] > 10) { e <- eigen(d) } else { e <- sum(c) }
  Pause point: the branch taken depends on the runtime value of c.
Parallel Execution Engine
• Dispatches "ready" tasks (those whose predecessors in the Task Precedence Graph have completed)
• Outsources expensive tasks (loops or function calls) to workers
• Coordinates peer-to-peer data communication and monitors execution status
• Updates the analyzer with runtime results
A sketch of the ready-task computation follows.
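A minimal sketch of how "ready" tasks can be derived from a Task Precedence Graph stored as an adjacency list; the names and representation are hypothetical, not pR internals:

# Hypothetical Task Precedence Graph: each task lists its predecessors.
tpg <- list(
  t1 = character(0),   # a <- 1        : no predecessors
  t2 = character(0),   # c <- rnorm(9) : no predecessors
  t3 = c("t1", "t2")   # loop using a and c : waits for t1 and t2
)
# Ready tasks are those with no unfinished predecessors.
ready <- names(tpg)[vapply(tpg, length, integer(1)) == 0]
ready  # "t1" "t2"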
Ease-of-Use Demonstration
• Comparison of pR and snow (an R add-on package)
  • pR – no modification of user source code is required
  • snow – the user must insert API calls
An illustrative contrast follows.
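A hedged illustration of the contrast: the cluster size, function body, and pR invocation shown below are assumptions rather than taken from the deck, while makeCluster/parLapply/stopCluster are snow's real API.

# With snow, the user rewrites the loop around explicit cluster calls.
library(snow)
cl <- makeCluster(4, type = "MPI")   # explicit cluster setup (size assumed)
results <- parLapply(cl, 1:100,
                     function(i) mean(rnorm(1e6)))
stopCluster(cl)

# Under pR, the equivalent sequential script is left untouched:
#   results <- list()
#   for (i in 1:100) results[[i]] <- mean(rnorm(1e6))
# and is simply run through the pR framework instead of plain R.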
Performance
• Testbed
  • Opt cluster: 16 nodes, each with two dual-core Opteron 265 processors, 1 Gbps Ethernet
  • Fedora Core 5 Linux x86_64 (Linux kernel 2.6.16)
• Benchmarks
  • Boost – a statistics application
  • Bootstrap
  • SVD
Boost
• Analysis overhead is very small
• From 16 to 32 processors, the computation speedup drops to 1.5
Bootstrap (results figure omitted)
SVD
• Analysis overhead is very small
• Serialization of large data sets in R is the major overhead (about 1.9 MB/s)
Task Parallelism Test
• Statistical functions
  • prcomp – principal component analysis
  • svd – singular value decomposition
  • lm.fit – linear model fitting
  • cor – correlation computation
  • fft – Fast Fourier Transform
  • qr – QR decomposition
• Execution time of each task ranges from 3 to 27 seconds
An illustrative script shape follows.
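The benchmark code itself is not in the deck; the following assumed script shape shows how such calls form independent tasks that the execution engine can dispatch to different workers (sizes and inputs are illustrative):

# Each statement below is independent of the others, so pR can
# run them as parallel tasks once m is available.
m <- matrix(rnorm(1000 * 1000), nrow = 1000)
p <- prcomp(m)                # principal component analysis
s <- svd(m)                   # singular value decomposition
l <- lm.fit(m, rnorm(1000))   # linear model fitting
r <- cor(m[, 1], m[, 2])      # correlation
f <- fft(m[, 1])              # Fast Fourier Transform
q <- qr(m)                    # QR decomposition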
Future Work
• Apply loop transformation techniques
• Intelligent scheduling to exploit data locality
• Explore finer granularity – interprocedural parallelization
• Improve load balancing
• Optimize high-level R functions such as serialization
Conclusion
• Presented the pR framework, a first step toward parallelizing R automatically and transparently
• Further optimization is needed to improve efficiency