
pR: Automatic, Transparent Runtime Parallelization of the R Scripting Language

This project aims to develop a runtime framework for parallelizing R that provides automatic, transparent parallel programming. The goal is to achieve speedup and scalability for R applications, benefiting users in the R community.





Presentation Transcript


  1. pR: Automatic, Transparent Runtime Parallelization of the R Scripting Language
Jiangtian Li, Department of Computer Science, North Carolina State University

  2. Acknowledgement • This project originated from, and is in collaboration with, Dr. Samatova's group at Oak Ridge National Laboratory: Dr. Nagiza Samatova, Guru Kora, Srikanth Yoginath • Advisors: Dr. Xiaosong Ma, Dr. Nagiza Samatova • Supported by grants from NSF and DOE

  3. Outline • Motivation • Background • Architecture • Design • Performance • Conclusion and Future Work

  4. Motivation • Increasing demand for massive scientific data processing • Statistical analysis of gene/protein data (61 billion sequence records in GenBank) • Time-series analysis of climate data (~300 GB for 10 years) • Widely used computing tools such as R and Matlab are interpreted languages by nature, which facilitates runtime parallelization • These workloads involve both computation-intensive and data-intensive tasks • Both task and data parallelism can be exploited

  5. What is R? • A portable and extensible software environment as well as an interpreted language • Lisp-like read-eval-print loop • Performs diverse statistical analyses • Many extension packages are under active development • Can be used in either interactive mode or batch mode

  6. Example R script: "example.R"

# Assign an integer
a <- 1
# Construct a vector of 9 real numbers drawn from the normal distribution
c <- rnorm(9)
# Initialize a two-dimensional array
d <- array(0:0, dim=c(9,9))
# Loop, reading data from files
for (i in 1:length(c)) {
  d[i,] <- matrix(scan(paste("test.data", i, sep="")))
}

  7. Example – interactive and batch mode execution

From the R prompt:
> source("example.R")
> a
[1] 1
> c
[1]  1.16808  0.15877  1.40785  1.73696 -1.19267  0.41321
[7] -0.39817 -0.13059 -0.67247
> d
     [,1] [,2] [,3] [,4] [,5] ...
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
...

From the shell (batch mode):
R CMD BATCH example.R

  8. Research Goal • Propose a runtime framework for parallelizing R • Provide automatic and transparent parallel R programming • Achieve speedup and scalability for R applications, benefiting users in the R community

  9. Outline • Motivation • Background • Architecture • Design • Performance • Conclusion and Future Work

  10. Related Work • Embarrassingly parallel: snow package – Rossini et al. • Message passing: MultiMATLAB – Trefethen et al.; pyMPI – Miller • Back-end support: RScaLAPACK – Yoginath et al.; Star-P – Choy et al. • Compilers: Otter – Quinn et al. • Shared memory: MATmarks – Almasi et al.

  11. Related Work • Parallelizing compilers: SUIF – Hall et al.; Polaris – Blume et al. • Runtime parallelization: Jrpm – Chen et al. • Dynamic compilation: DyC – Grant et al.

  12. Outline • Motivation • Background • Architecture • Design • Performance • Conclusion and Future Work

  13. Design Rationale • Most R code consists of high-level pre-built functions, e.g., svd for singular value decomposition, eigen for eigenvalue and eigenvector computation • Loops usually have few inter-iteration dependencies and high per-iteration execution cost, e.g., in R applications from Bioconductor • R has no pointers, hence no aliasing problem

  14. Approach • A selective parallelization scheme that focuses on function calls and loops • Dynamic and incremental dependency analysis with runtime evaluation: the analyzer pauses where a dependency cannot be determined statically, such as at a dynamic loop bound or a conditional branch (see the sketch below) • A master-worker paradigm to reduce scheduling and data-communication overhead • Expensive tasks, i.e., function calls and loops, are "outsourced" to workers • Data are distributed among the workers
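As an illustration, here are the two kinds of constructs that force the incremental analyzer to pause until runtime values are known (adapted from the running example later in the talk; the behavior described in the comments is a sketch of the approach, not pR's exact code):

a <- 1
b <- 2
c <- rnorm(9)
d <- array(0:0, dim=c(9,9))
for (i in b:length(c)) {   # loop bound depends on runtime values of b and c,
  c[i] <- c[i-1] + a       # and the body carries an i-1 -> i dependence
}
if (c[length(c)] > 10) {   # branch outcome is unknown until c is computed,
  e <- eigen(d)            # so analysis pauses here and resumes after
} else {                   # runtime evaluation
  e <- sum(c)
}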

  15. Framework Architecture • Inter-node communication – MPI • Inter-process communication – Unix domain sockets

  16. Outline • Motivation • Background • Architecture • Design • Performance • Conclusion and Future Work

  17. Analyzer • Input – an R script • Output – a Task Precedence Graph • A task is the finest unit of scheduling • Identifies precedence relationships among tasks

  18. Parsing • Identify the basic execution unit – the R statement • Retrieve expressions such as variable names and array subscripts • Output a parse tree (see the sketch below)
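R itself exposes its parse trees, which makes this kind of analysis natural to prototype. A minimal sketch (not pR's implementation) of inspecting one statement from the example script:

expr <- parse(text = "d[i,] <- matrix(scan(paste('test.data', i, sep='')))")[[1]]
class(expr)      # "<-": the statement is an assignment call
expr[[2]]        # left-hand side of the assignment: d[i, ]
expr[[3]]        # right-hand side: the matrix(scan(...)) call
all.vars(expr)   # variables referenced: "d" "i"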

  19. An example of a parse tree (figure)

  20. Dependence Analysis • Identify tasks, the finest units of scheduling • Statement dependence analysis • Loop dependence analysis via the GCD test (see the worked sketch below) • Incremental analysis: pause at points where runtime information is needed for dependence analysis or a branch decision, obtain the runtime evaluation results, and proceed • Output: the Task Precedence Graph, with a vertex per task and an edge per dependence
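For context, the classic GCD test declares a possible cross-iteration dependence between a write to a[c1*i + k1] and a read of a[c2*i + k2] only when gcd(c1, c2) divides (k2 - k1). A minimal R sketch of the test (illustrative only, assuming nonzero coefficients; this is not pR's code):

gcd <- function(x, y) if (y == 0) x else gcd(y, x %% y)

# TRUE means a dependence is possible; FALSE means the iterations are
# independent with respect to this pair of accesses.
gcd_test <- function(c1, k1, c2, k2) {
  (k2 - k1) %% gcd(abs(c1), abs(c2)) == 0
}

gcd_test(2, 0, 2, 1)   # a[2*i] vs a[2*i+1]: FALSE, parallelizable
gcd_test(1, -1, 1, 0)  # c[i-1] vs c[i]:     TRUE, loop-carried dependence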

  21. Loop Parallelization • A loop is parallelized if no dependence is discovered • Its iterations are executed in an embarrassingly parallel manner • The Task Precedence Graph is adjusted accordingly

  22. A running example

  23. [Figure: the Task Precedence Graph built incrementally for the running example. The first tasks are the assignments a <- 1, b <- 2, c <- rnorm(9), and d <- array(0:0, dim=c(9,9)). Two loop tasks follow, each shown before and after runtime evaluation of its bound: the file-reading loop d[i,] <- matrix(scan(paste("test.data", i, sep=""))), marked "||" (parallelizable), with for (i in 1:length(c)) resolved to for (i in 1:5), and the loop c[i] <- c[i-1] + a, with for (i in b:length(c)) resolved to for (i in 2:9). The dynamic loop bounds and the final branch, if (c[length(c)] > 10) { e <- eigen(d) } else { e <- sum(c) }, are the pause points where analysis waits for runtime results.]

  24. Parallel Execution Engine • Dispatches "ready" tasks • Outsources expensive tasks (loops and function calls) to workers • Coordinates peer-to-peer data communication and monitors execution status • Updates the analyzer with runtime results

  25. Outline • Motivation • Background • Architecture • Design • Performance • Conclusion and Future Work

  26. Ease-of-use demonstration • Comparison of pR and snow (an R add-on package) • pR – no user modification of the source code • snow – the user must insert API calls (see the sketch below)
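A hedged sketch of the contrast (illustrative code, not the exact benchmark): with snow, the file-reading loop from the example script must be rewritten against the cluster API, while pR accepts the original sequential script unchanged.

library(snow)
cl <- makeCluster(4, type = "SOCK")       # start 4 workers
rows <- parLapply(cl, 1:9, function(i)    # explicit parallel apply
  matrix(scan(paste("test.data", i, sep = "")), nrow = 1))
d <- do.call(rbind, rows)
stopCluster(cl)
# With pR, the unmodified example.R is simply submitted for execution;
# the runtime discovers and exploits the parallelism itself.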

  27. Performance • Testbed: the Opt cluster – 16 nodes, each with two dual-core Opteron 265 processors, connected by 1 Gbps Ethernet, running Fedora Core 5 Linux x86_64 (kernel 2.6.16) • Benchmarks: Boost (a statistics application), Bootstrap, and SVD

  28. Boost • The analysis overhead is very small • Going from 16 to 32 processors, the additional computation speedup drops to 1.5

  29. Bootstrap

  30. SVD • The analysis overhead is very small • Serialization of the large data set in R is the major overhead (about 1.9 MB/s; see the measurement sketch below)
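R's built-in serialize() makes this bottleneck easy to observe; a rough, illustrative measurement (throughput will vary by machine and R version, and this is not the benchmark's code):

x <- matrix(rnorm(1000 * 1000), 1000, 1000)   # ~8 MB of doubles
elapsed <- system.time(raw <- serialize(x, connection = NULL))["elapsed"]
length(raw) / 2^20 / elapsed                  # serialization throughput, MB/s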

  31. Task Parallelism Test • Statistical functions: • prcomp – principal component analysis • svd – singular value decomposition • lm.fit – linear model fitting • cor – correlation computation • fft – fast Fourier transform • qr – QR decomposition • The execution time of each task ranges from 3 to 27 seconds (see the illustrative calls below)
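These calls are mutually independent, which is what lets pR dispatch each one to a different worker as a separate task. An illustrative set of calls (the matrix sizes are made up for this sketch; the benchmark's actual inputs are not given in the slides):

x <- matrix(rnorm(500 * 500), 500, 500)
y <- rnorm(500)
p <- prcomp(x)      # principal component analysis
s <- svd(x)         # singular value decomposition
l <- lm.fit(x, y)   # linear model fitting
r <- cor(x)         # correlation matrix
f <- fft(y)         # fast Fourier transform
q <- qr(x)          # QR decomposition
# No result feeds another call, so all six tasks can run concurrently.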

  32. Outline • Motivation • Background • Architecture • Design • Performance • Conclusion and Future Work

  33. Future Work • Apply loop transformation techniques • Intelligent scheduling to exploit data locality • Explore finer granularity – interprocedural parallelization • Load balancing • Optimize high-level R functions such as serialization

  34. Conclusion • Presented the pR framework, a first step toward parallelizing R automatically and transparently • Further optimization is needed to improve efficiency
