R with Distributed Systems

R with Distributed Systems • RHIPE - R and Hadoop Integrated Processing Environment • http://www.stat.purdue.edu/~sguha/rhipe/ • Ricardo: Integrating R and Hadoop, SIGMOD 2010 • Segue • http://code.google.com/p/segue/ • HadoopInteractiVE • https://r-forge.r-project.org/projects/rhadoop/ • Big Data Analysis with Revolution R Enterprise • Revolution R Enterprise • http://www.revolutionanalytics.com/ • The RevoScaleR package provides a mechanism for scaling the R language to handle very large data sets. • Elastic-R • https://www.elastic-r.org • Biopara • http://hedwig.mgh.harvard.edu/biostatistics/node/20 • http://hedwig.mgh.harvard.edu/biostatistics/files/biopara/biopara.html • RIOT: I/O-Efficient Numerical Computing without SQL, CIDR 2009 • Uses a relational database as the backend for R, not Hadoop

Ricardo: Integrating R and Hadoop • Sudipto Das*, Yannis Sismanis**, Kevin S. Beyer**, Rainer Gemulla**, Peter J. Haas**, John McPherson** • * UC Santa Barbara • ** IBM Almaden Research Center • SIGMOD 2010

Deep Analytics on Big Data • Enterprises collect huge amounts of data • Amazon, eBay, Netflix, iTunes, Yahoo, Google, VISA, … • User interaction data and history • Click and Transaction logs • Deep analysis critical for competitive edge • Understanding/Modeling data • Recommendations to users • Ad placement • Challenge: Enable Deep Analysis and Understanding over massive data volumes • Exploiting data to its full potential

Motivating Examples • Data Exploration/Model Evaluation/Outlier Detection • Personalized Recommendations • For each individual customer/product • Many applications to Netflix, Amazon, eBay, iTunes, … • Difficulty: Discern particular customer preferences • Sampling loses Competitive advantage • Application Scenario: Movie Recommendations, Netflix • Millions of Customers • Hundreds of thousands of Movies • Billions of Movie Ratings

Big Data and Deep Analytics – The Gap • R, SPSS, SAS – A Statistician’s toolbox • Rich statistical, modeling, visualization functionality • Operate on small data amounts entirely in memory • Extensions for data handling cumbersome • Hadoop – Scalable Data Management Systems • Scalable, Fault-Tolerant, Elastic, … • “Magnetic”: easy to store data • Limited deep analytics: mostly descriptive analytics

Filling the Gap: Existing Approaches • Reducing Data size by Sampling • Approximations might result in losing competitive advantage • Loses important features of the long tail of data distributions [Cohen et al., VLDB 2009] • Scaling out R • Efforts from the statistics community to build parallel and distributed variants [SNOW, Rmpi] • Main memory based in most cases • Re-implementing DBMS and distributed processing functionality • Deep Analysis within a DBMS • Port statistical functionality into a DBMS [Cohen et al., VLDB 2009], [Apache Mahout] • Not Sustainable: misses out on R’s community development and rich libraries

Ricardo: Bridging the Gap • David Ricardo, famous 19th-century economist • “Comparative Advantage” • Deep Analytics is decomposable into a “large part” and a “small part” [Chu et al., NIPS ‘06] • Linear/logistic regression, k-means clustering, Naïve Bayes, SVMs, PCA • Recommender Systems/Latent Factorization [in the paper] • Large part includes joins, group bys, distributive aggregations • Hadoop + Jaql: excellent scalability to large-scale data management • Small part includes matrix/vector operations • R: excellent support for numerically stable matrix inversions, factorizations, optimizations, eigenvector decompositions, etc. • Ricardo: Establishes “trade” between R and Hadoop/Jaql

R in a Nutshell • R supports Rich statistical functionality

Jaql in a Nutshell • Scalable Descriptive Analysis using Hadoop • Jaql: a representative declarative interface • JSON view of the data: • Jaql example:

Ricardo: The Trading Architecture • Complexity of Trade between R and Hadoop • Simple Trading: Data Exploration • Complex Trading: Data Modeling

Simple Trading: Exploratory Analytics • Gain insights about data • Example - top-k outliers for a model • Identify data items on which the model performed most poorly • Helpful for improving accuracy of model • The trade: • Use complex statistical models using rich R functionality • Parallelize processing over entire data using Hadoop/Jaql
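The top-k outlier search described above can be sketched as follows. This is a minimal illustration in Python (the deck itself uses R and Jaql); the tuple layout, the `predict` callback, and the name `top_k_outliers` are assumptions made for the example and are not part of Ricardo.

```python
import heapq

def top_k_outliers(ratings, predict, k):
    """Return the k ratings with the largest squared prediction error.

    ratings: iterable of (customer, movie, rating) tuples (hypothetical layout).
    predict: function (customer, movie) -> predicted rating, standing in for
             a model fitted in R.
    """
    def squared_error(row):
        customer, movie, rating = row
        return (predict(customer, movie) - rating) ** 2

    # nlargest scans the data once and keeps only k candidates in memory.
    return heapq.nlargest(k, ratings, key=squared_error)

# Toy data: a constant model that always predicts 3.0.
ratings = [(1, 10, 5.0), (1, 11, 0.5), (2, 10, 3.0), (2, 11, 4.0)]
outliers = top_k_outliers(ratings, lambda customer, movie: 3.0, k=2)
```

In the setting of the slides, the per-record error would be evaluated in parallel over the full data set, and only the k worst records would be returned to the analyst.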

Complex Trading: Latent Factors • SVD-like matrix factorization • Minimize Square Error: $\sum_{i,j} (p_i q_j - r_{ij})^2$ • The trade: • Use complex statistical models in R • Parallelize aggregate computations using Hadoop/Jaql

Latent Factor Models with Ricardo • Goal • Minimize Square Error: $e = \sum_{i,j} (p_i q_j - r_{ij})^2$ • Numerical methods needed (large, sparse matrix) • Pseudocode • 1. Start with an initial guess of the parameters $p_i$ and $q_j$ • 2. Compute error & gradient • e.g., $\partial e / \partial p_i = \sum_j 2 q_j (p_i q_j - r_{ij})$ • (Data intensive, but parallelizable) • 3. Update parameters • R implements many different optimization algorithms • 4. Repeat steps 2 and 3 until convergence • R code • optim( c(p,q), fe, fde, method="L-BFGS-B" )
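The loop in the pseudocode can be sketched in a few lines. The example below is an assumption-laden illustration in Python, not the authors' code: it uses scalar factors per customer and movie, and it swaps in plain gradient descent with a fixed step size where the slides call R's `optim` with L-BFGS-B. The names `error_and_gradient` and `fit` are hypothetical.

```python
def error_and_gradient(p, q, ratings):
    """Squared error e and its gradients for the model r_ij ~ p[i] * q[j].

    p, q: dicts of scalar latent factors per customer i and movie j.
    ratings: list of (i, j, r_ij) tuples.
    """
    e = 0.0
    grad_p = {i: 0.0 for i in p}
    grad_q = {j: 0.0 for j in q}
    for i, j, r in ratings:
        residual = p[i] * q[j] - r
        e += residual ** 2
        grad_p[i] += 2.0 * q[j] * residual   # contribution to de/dp_i
        grad_q[j] += 2.0 * p[i] * residual   # contribution to de/dq_j
    return e, grad_p, grad_q

def fit(ratings, steps=500, learning_rate=0.01):
    """Plain gradient descent: a simple stand-in for optim(), not L-BFGS-B."""
    p = {i: 1.0 for i, _, _ in ratings}      # step 1: initial guess
    q = {j: 1.0 for _, j, _ in ratings}
    for _ in range(steps):
        _, grad_p, grad_q = error_and_gradient(p, q, ratings)  # step 2
        for i in p:                                            # step 3
            p[i] -= learning_rate * grad_p[i]
        for j in q:
            q[j] -= learning_rate * grad_q[j]
    return p, q

# Toy ratings that are exactly rank one, so the error can reach zero.
ratings = [(1, 1, 2.0), (1, 2, 4.0), (2, 1, 3.0), (2, 2, 6.0)]
p, q = fit(ratings)
final_error, _, _ = error_and_gradient(p, q, ratings)
```

Only `error_and_gradient` touches the ratings, which is why that step is the one worth pushing to Hadoop, while the parameter update stays small enough for a single R process.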

Computing the Model • $e = \sum_{i,j} (p_i q_j - r_{ij})^2$ • 3-way join of Movie Ratings, Movie Parameters, and Customer Parameters to match $r_{ij}$, $p_i$, and $q_j$, then aggregate • Similarly compute the gradients

Aggregation In Jaql/Hadoop

    res = jaqlTable(channel, "
      ratings
      -> hashJoin( fn(r) r.j, moviePars, fn(m) m.j, fn(r, m) { r.*, m.q } )
      -> hashJoin( fn(r) r.i, custPars, fn(c) c.i, fn(r, c) { r.*, c.p } )
      -> transform { $.*, diff: $.rating - $.p*$.q }
      -> expand [ { value: pow($.diff, 2.0) },
                  { $.i, value: -2.0 * $.diff * $.q },
                  { $.j, value: -2.0 * $.diff * $.p } ]
      -> group by g={ $.i, $.j }
         into { g.*, gradient: sum($[*].value) }
    ")

Result in R

    i     j     gradient
    ----  ----  --------
    null  null  325235
    1     null  21
    2     null  357
    …
    null  1     9
    null  2     64
    …
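To make the shape of the result table easier to follow, here is a single-machine sketch of the same join, transform, expand, and group-by pipeline. It is written in Python purely for illustration; the dictionary layouts and the name `aggregate` are assumptions, and the gradient terms follow the formula $\partial e / \partial p_i = \sum_j 2 q_j (p_i q_j - r_{ij})$ given earlier in the deck.

```python
from collections import defaultdict

def aggregate(ratings, movie_pars, cust_pars):
    """Mirror the join / transform / expand / group-by steps in plain Python.

    ratings: list of dicts {"i": customer, "j": movie, "rating": r_ij}
    movie_pars: dict j -> q_j; cust_pars: dict i -> p_i
    Returns a dict keyed by (i, j); None plays the role of Jaql's null.
    """
    result = defaultdict(float)
    for r in ratings:
        q = movie_pars[r["j"]]        # join on movie id
        p = cust_pars[r["i"]]         # join on customer id
        diff = r["rating"] - p * q    # transform
        # expand into three records, then group by (i, j) and sum
        result[(None, None)] += diff ** 2            # total squared error
        result[(r["i"], None)] += -2.0 * diff * q    # de/dp_i
        result[(None, r["j"])] += -2.0 * diff * p    # de/dq_j
    return dict(result)

res = aggregate(
    [{"i": 1, "j": 1, "rating": 5.0}],
    movie_pars={1: 2.0},
    cust_pars={1: 1.0},
)
```

The `(None, None)` row carries the error, rows with only `i` set carry the customer gradients, and rows with only `j` set carry the movie gradients, matching the layout of the table returned to R.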

Experimental Evaluation • 50 nodes at EC2 • Each node: 8 cores, 7GB Memory, 320GB Disk • Total: 400 cores, 320GB Memory, 70TB Disk Space

Result • Leveraging Hadoop’s Scalability • Leveraging R’s Rich Functionality • optim( c(p,q), fe, fde, method="CG" ) • optim( c(p,q), fe, fde, method="L-BFGS-B" )

Extending the Trade: R – Jaql – R • Invoking R through Jaql: distributed statistical computation • Example: Augment the model with customer preferences that change over time • Time series model for each customer incorporated into the global model

Conclusion • Scaled Latent Factor Models to Terabytes of data • Provided a bridge through which other algorithms with Summation Form can be mapped and scaled • Many Algorithms have Summation Form • Decompose into “large part” and “small part” • [Chu et al. NIPS ‘06]: LWLR, Naïve Bayes, GDA, k-means, logistic regression, neural network, PCA, ICA, EM, SVM • Future & Current Work • Tighter language integration • More algorithms • Performance tuning

RHIPE - R and Hadoop Integrated Processing Environment • Saptarshi Guha

RHIPE • R package • INSTALL • Set an environment variable $HADOOP that points to the Hadoop installation directory. • It is expected that $HADOOP/bin contains the Hadoop shell executable hadoop. • RHIPE needs to be installed on all the computers: the one on which you run your R environment and all the task computers. • Using RHIPE is much easier if your filesystem layout (i.e. location of R, Hadoop, libraries etc.) is identical across all computers.

Tests • In R • should work successfully • should successfully write the list to the HDFS • should return a list of length 3 each element a list of 2 objects.

Tests (cont’d) • A quick run of this should also work

R and Hadoop Integrated Programming Environment • The R and Hadoop Integrated Programming Environment is an R package to • compute across massive data sets • create subsets • apply routines to subsets • produce displays on subsets across a cluster of computers • using the Hadoop DFS and Hadoop MapReduce framework. • Uses Hadoop Streaming • Users can write MapReduce programs in other languages, e.g. Python, Ruby, Perl, which are then deployed over the cluster. • Hadoop Streaming then transfers the input data from Hadoop to the user program and vice versa.

R and Hadoop Integrated Programming Environment • RHIPE is just that. • RHIPE consists of several functions to interact with the HDFS • e.g. save data sets, read data created by RHIPE MapReduce, delete files. • Commands in R • Compose and launch MapReduce jobs from R using the commands rhmr and rhex. • Monitor the status using rhstatus, which returns an R object. • Stop jobs using rhkill. • Compute side effect files. • The output of parallel computations may include the creation of PDF files, R data sets, CSV files etc. • These will be copied by RHIPE to a central location on the HDFS, removing the need for the user to copy them from the compute nodes or to set up a network file system. • Data sets that are created by RHIPE can be read using other languages such as Java, Perl, Python and C. • The serialization format used by RHIPE (converting R objects to binary data) uses Google's Protocol Buffers, which is very fast and creates compact representations for R objects. Ideal for massive data sets. • Data sets created using RHIPE are key-value pairs. • A key is mapped to a value. A MapReduce computation iterates over the key-value pairs in parallel. If a RHIPE job produces unique keys, the output can be treated as an external-memory associative dictionary. RHIPE can thus be used as a medium-scale (millions of keys) disk-based dictionary, which is useful for loading R objects into R.

Example: Airline Dataset • Copying the Data to the HDFS

Example: Airline Dataset (cont’d) • rhstatus

Example: Airline Dataset (cont’d) • Job

Example: Airline Dataset (cont’d) • Demonstration of using Hadoop as a Queryable Database

Demonstration of using Hadoop as a Queryable Database • Top 20 cities by total volume of flights.
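The code for this query did not survive in the transcript. As a rough single-machine analogue, the sketch below counts flights per city and keeps the largest counts. It is written in Python for illustration only; the `(origin, destination)` tuple layout, the choice to count a flight for both endpoints, and the name `top_cities` are assumptions, not the RHIPE job from the slides.

```python
from collections import Counter

def top_cities(flights, n=20):
    """Count flights per city and return the n busiest as (city, count) pairs.

    flights: iterable of (origin, destination) pairs (hypothetical layout).
    Each flight is counted once for its origin and once for its destination.
    """
    volume = Counter()
    for origin, destination in flights:
        # In MapReduce terms: the map step emits (city, 1) for both endpoints,
        # and the reduce step sums the ones per city.
        volume[origin] += 1
        volume[destination] += 1
    return volume.most_common(n)

flights = [("ORD", "ATL"), ("ATL", "LAX"), ("ATL", "ORD")]
busiest = top_cities(flights, n=2)
```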

Example: Transforming Text Data • Text data • The carrier name is column 9. • Southwest carrier code is WN, Delta is DL. • Only those rows with column 9 equal to WN or DL will be saved.
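The filtering rule above amounts to a map-only job that passes through matching rows. A minimal sketch in Python, assuming comma-separated input lines; the name `filter_carriers` is hypothetical and this is not the RHIPE code from the slides.

```python
import csv

KEEP = {"WN", "DL"}  # Southwest and Delta carrier codes

def filter_carriers(lines):
    """Yield only the CSV lines whose 9th column is WN or DL."""
    for line in lines:
        fields = next(csv.reader([line]), [])
        # Column 9 in the slide's 1-based numbering is index 8 here.
        if len(fields) >= 9 and fields[8] in KEEP:
            yield line

rows = [
    "1,2,3,4,5,6,7,8,WN,x",
    "1,2,3,4,5,6,7,8,AA,x",
    "1,2,3,4,5,6,7,8,DL,x",
]
kept = list(filter_carriers(rows))
```

Because each row is judged independently, no reduce step is needed and the work can be split across the cluster by input block.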

Example: Transforming Text Data (cont’d) • The output • 1 • 2
