R with Distributed Systems
R with Distributed Systems • RHIPE - R and Hadoop Integrated Processing Environment • http://www.stat.purdue.edu/~sguha/rhipe/ • Ricardo: Integrating R and Hadoop, SIGMOD 2010 • Segue • http://code.google.com/p/segue/ • Hadoop InteractiVE • https://r-forge.r-project.org/projects/rhadoop/ • Big Data Analysis with Revolution R Enterprise • Revolution R Enterprise • http://www.revolutionanalytics.com/ • The RevoScaleR package provides a mechanism for scaling the R language to handle very large data sets. • Elastic-R • https://www.elastic-r.org • Biopara • http://hedwig.mgh.harvard.edu/biostatistics/node/20 • http://hedwig.mgh.harvard.edu/biostatistics/files/biopara/biopara.html • RIOT: I/O-Efficient Numerical Computing without SQL, CIDR 2009 • Uses a relational database, not Hadoop, as the backend for R
Ricardo: Integrating R and Hadoop • Sudipto Das*, Yannis Sismanis**, Kevin S. Beyer**, Rainer Gemulla**, Peter J. Haas**, John McPherson** • * UC Santa Barbara • ** IBM Almaden Research Center • SIGMOD 2010
Deep Analytics on Big Data • Enterprises collect huge amounts of data • Amazon, eBay, Netflix, iTunes, Yahoo, Google, VISA, … • User interaction data and history • Click and Transaction logs • Deep analysis critical for competitive edge • Understanding/Modeling data • Recommendations to users • Ad placement • Challenge: Enable Deep Analysis and Understanding over massive data volumes • Exploiting data to its full potential
Motivating Examples • Data Exploration/Model Evaluation/Outlier Detection • Personalized Recommendations • For each individual customer/product • Many applications to Netflix, Amazon, eBay, iTunes, … • Difficulty: Discern particular customer preferences • Sampling loses Competitive advantage • Application Scenario: Movie Recommendations, Netflix • Millions of Customers • Hundreds of thousands of Movies • Billions of Movie Ratings
Big Data and Deep Analytics – The Gap • R, SPSS, SAS – A Statistician’s toolbox • Rich statistical, modeling, visualization functionality • Operate on small data amounts entirely in memory • Extensions for data handling cumbersome • Hadoop – Scalable Data Management Systems • Scalable, Fault-Tolerant, Elastic, … • “Magnetic”: easy to store data • Limited deep analytics: mostly descriptive analytics
Filling the Gap: Existing Approaches • Reducing Data size by Sampling • Approximations might result in losing competitive advantage • Loses important features of the long tail of data distributions [Cohen et al., VLDB 2009] • Scaling out R • Efforts from the statistics community to build parallel and distributed variants [SNOW, Rmpi] • Main memory based in most cases • Re-implements DBMS and distributed processing functionality • Deep Analysis within a DBMS • Port statistical functionality into a DBMS [Cohen et al., VLDB 2009], [Apache Mahout] • Not sustainable: misses out on R’s community development and rich libraries
Ricardo: Bridging the Gap • David Ricardo, famous economist from the 19th century • “Comparative Advantage” • Deep Analytics decomposable into a “large part” and a “small part” [Chu et al., NIPS ‘06] • Linear/logistic regression, k-means clustering, Naïve Bayes, SVMs, PCA • Recommender Systems/Latent Factorization [in the paper] • Large part includes joins, group bys, distributive aggregations • Hadoop + Jaql: excellent scalability to large-scale data management • Small part includes matrix/vector operations • R: excellent support for numerically stable matrix inversions, factorizations, optimizations, eigenvector decompositions, etc. • Ricardo: Establishes “trade” between R and Hadoop/Jaql
R in a Nutshell • R supports Rich statistical functionality
Jaql in a Nutshell • Scalable Descriptive Analysis using Hadoop • Jaql: a representative declarative interface • JSON view of the data: • Jaql example:
Ricardo: The Trading Architecture • Complexity of Trade between R and Hadoop • Simple Trading: Data Exploration • Complex Trading: Data Modeling
Simple Trading: Exploratory Analytics • Gain insights about data • Example - top-k outliers for a model • Identify data items on which the model performed most poorly • Helpful for improving accuracy of model • The trade: • Use complex statistical models using rich R functionality • Parallelize processing over entire data using Hadoop/Jaql
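The top-k outlier idea can be illustrated on a single machine with a minimal base R sketch. It does not use Ricardo or Jaql; the toy data, the column names `x` and `y`, and the helper `top_k_outliers` are all assumptions made for illustration. In Ricardo the scoring step would be pushed to Hadoop/Jaql, while the model fit stays in R.

```r
# Toy data (hypothetical): a linear trend with two planted outliers in rows 7 and 15.
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 2 * d$x + rnorm(20, sd = 0.5)
d$y[c(7, 15)] <- d$y[c(7, 15)] + c(25, -30)

# "Small part": fit the statistical model in R.
fit <- lm(y ~ x, data = d)

# "Large part": score every row by absolute prediction error and keep the
# k rows on which the model performed worst. This assumes the response
# column is named y.
top_k_outliers <- function(model, data, k) {
  err <- abs(data$y - predict(model, newdata = data))
  data[order(err, decreasing = TRUE)[seq_len(k)], , drop = FALSE]
}

worst <- top_k_outliers(fit, d, k = 2)
print(worst)
```

The per-row scoring is independent across rows, which is what makes it a natural candidate for parallel execution over the full data set.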
Complex Trading: Latent Factors • SVD-like matrix factorization • Minimize Square Error: $\sum_{i,j} (p_i q_j - r_{ij})^2$ • The trade: • Use complex statistical models in R • Parallelize aggregate computations using Hadoop/Jaql
Latent Factor Models with Ricardo • Goal • Minimize Square Error: $e = \sum_{i,j} (p_i q_j - r_{ij})^2$ • Numerical methods needed (large, sparse matrix) • Pseudocode • 1. Start with initial guess of parameters $p_i$ and $q_j$ • 2. Compute error & gradient • e.g., $\partial e / \partial p_i = \sum_j 2 q_j (p_i q_j - r_{ij})$ • (Data intensive, but parallelizable) • 3. Update parameters • R implements many different optimization algorithms • 4. Repeat steps 2 and 3 until convergence. • R code • optim( c(p,q), fe, fde, method="L-BFGS-B" )
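A self-contained sketch of the `optim` call above, run on a tiny in-memory ratings table. The names `fe` and `fde` follow the slide, but their bodies here are assumptions: in Ricardo they would trigger a Jaql/Hadoop aggregation instead of computing over a local data frame. The toy ratings and the all-ones starting point are hypothetical.

```r
# Toy ratings as sparse (i, j, r) triples: 4 customers, 3 movies.
ratings <- data.frame(
  i = c(1, 1, 2, 2, 3, 3, 4, 4),
  j = c(1, 2, 2, 3, 1, 3, 1, 2),
  r = c(2, 4, 2, 1.5, 6, 9, 4, 8)
)
n_cust <- 4
n_movie <- 3

# Split the flat parameter vector c(p, q) back into its two parts.
split_theta <- function(theta) {
  list(p = theta[seq_len(n_cust)], q = theta[n_cust + seq_len(n_movie)])
}

# Squared error e = sum over observed (i, j) of (p_i q_j - r_ij)^2.
fe <- function(theta) {
  s <- split_theta(theta)
  sum((s$p[ratings$i] * s$q[ratings$j] - ratings$r)^2)
}

# Gradient: de/dp_i = sum_j 2 q_j (p_i q_j - r_ij), and symmetrically for q_j.
fde <- function(theta) {
  s <- split_theta(theta)
  diff <- s$p[ratings$i] * s$q[ratings$j] - ratings$r
  gp <- vapply(seq_len(n_cust), function(i) {
    sum(2 * diff[ratings$i == i] * s$q[ratings$j[ratings$i == i]])
  }, numeric(1))
  gq <- vapply(seq_len(n_movie), function(j) {
    sum(2 * diff[ratings$j == j] * s$p[ratings$i[ratings$j == j]])
  }, numeric(1))
  c(gp, gq)
}

theta0 <- rep(1, n_cust + n_movie)  # initial guess (step 1)
fit <- optim(theta0, fe, fde, method = "L-BFGS-B")  # steps 2 to 4
print(fit$value)
```

Only `fe` and `fde` touch the ratings, so they are the pieces that would be replaced by distributed aggregations; the optimizer itself only sees the parameter vector, the error, and the gradient.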
Computing the Model • $e = \sum_{i,j} (p_i q_j - r_{ij})^2$ • 3-way join to match $r_{ij}$, $p_i$, and $q_j$, then aggregate • Inputs: Movie Ratings, Movie Parameters, Customer Parameters • Similarly compute the gradients
Aggregation In Jaql/Hadoop

    res = jaqlTable(channel, "
      ratings
      -> hashJoin( fn(r) r.j, moviePars, fn(m) m.j, fn(r, m) { r.*, m.q } )
      -> hashJoin( fn(r) r.i, custPars, fn(c) c.i, fn(r, c) { r.*, c.p } )
      -> transform { $.*, diff: $.rating - $.p*$.q }
      -> expand [ { value: pow($.diff, 2.0) },
                  { $.i, value: -2.0 * $.diff * $.q },
                  { $.j, value: -2.0 * $.diff * $.p } ]
      -> group by g={ $.i, $.j }
         into { g.*, gradient: sum($[*].value) }
    ")

Result in R

    i     j     gradient
    ----  ----  --------
    null  null  325235
    1     null  21
    2     null  357
    …
    null  1     9
    null  2     64
    …
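One way the "Result in R" table could be consumed on the R side is sketched below. This is an assumption about the post-processing, not code from the slides: it takes the rows visible in the table (with `null` represented as `NA`), reads the row with both keys missing as the total squared error, and orders the remaining rows into gradient vectors. The helper name `unpack_result` is hypothetical.

```r
# The visible rows of the result table, with null encoded as NA.
res <- data.frame(
  i = c(NA, 1, 2, NA, NA),
  j = c(NA, NA, NA, 1, 2),
  gradient = c(325235, 21, 357, 9, 64)
)

# Separate the aggregate error from the per-customer and per-movie gradients.
unpack_result <- function(res) {
  is_err <- is.na(res$i) & is.na(res$j)
  by_i <- res[!is.na(res$i), ]
  by_j <- res[!is.na(res$j), ]
  list(
    error  = res$gradient[is_err],
    grad_p = by_i$gradient[order(by_i$i)],  # one entry per customer i
    grad_q = by_j$gradient[order(by_j$j)]   # one entry per movie j
  )
}

out <- unpack_result(res)
str(out)
```

A function like `fe` could return `out$error` and a function like `fde` could return `c(out$grad_p, out$grad_q)`, which matches the `c(p, q)` parameter layout passed to `optim`.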
Experimental Evaluation • 50 nodes at EC2 • Each node: 8 cores, 7GB Memory, 320GB Disk • Total: 400 cores, 320GB Memory, 70TB Disk Space
Result • Leveraging Hadoop’s Scalability • Leveraging R’s Rich Functionality • optim( c(p,q), fe, fde, method="CG" ) • optim( c(p,q), fe, fde, method="L-BFGS-B" )
Extending the Trade: R – Jaql – R • Invoking R through Jaql – distributed statistical computation • Example: Augment the model with customer preferences that change over time • Time series model for each customer incorporated into the global model
Conclusion • Scaled Latent Factor Models to Terabytes of data • Provided a bridge through which other algorithms with Summation Form can be mapped and scaled • Many Algorithms have Summation Form • Decompose into “large part” and “small part” • [Chu et al. NIPS ‘06]: LWLR, Naïve Bayes, GDA, k-means, logistic regression, neural network, PCA, ICA, EM, SVM • Future & Current Work • Tighter language integration • More algorithms • Performance tuning
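To make "Summation Form" concrete, here is a minimal base R sketch using ordinary least squares as the example, chosen for illustration because it is closely related to the LWLR entry in the list above. Each data chunk contributes partial sums $X^T X$ and $X^T y$ (the "large part"); the final small linear solve happens in R (the "small part"). The simulated data and the four-way chunking are assumptions; on a cluster each chunk would be handled by a separate mapper.

```r
# Simulated data (hypothetical): intercept plus two predictors.
set.seed(42)
n <- 1000
X <- cbind(1, rnorm(n), rnorm(n))
y <- drop(X %*% c(3, -2, 0.5)) + rnorm(n, sd = 0.1)

# Pretend the rows live in 4 separate partitions.
chunks <- split(seq_len(n), rep(1:4, length.out = n))

# "Large part": per-chunk partial sums, independent of one another.
partials <- lapply(chunks, function(idx) {
  Xi <- X[idx, , drop = FALSE]
  list(xtx = crossprod(Xi), xty = crossprod(Xi, y[idx]))
})

# Combine the partial sums (a distributive aggregation).
xtx <- Reduce(`+`, lapply(partials, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(partials, `[[`, "xty"))

# "Small part": a 3 x 3 linear solve in R.
beta <- drop(solve(xtx, xty))
print(beta)
```

The combined sums are only 3 x 3 and 3 x 1 regardless of how many rows there are, which is why the final step fits comfortably in a single R process.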
RHIPE - R and Hadoop Integrated Processing Environment • Saptarshi Guha
RHIPE • R package • INSTALL • Set an environment variable $HADOOP that points to the Hadoop installation directory. • It is expected that $HADOOP/bin contains the Hadoop shell executable hadoop. • RHIPE needs to be installed on all the computers: the one where you run your R environment and all the task computers. • Using RHIPE is much easier if your filesystem layout (i.e., the location of R, Hadoop, libraries, etc.) is identical across all computers.
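The installation expectations above can be checked from R before loading the package. The following is a small sketch using base R only; the helper name `check_hadoop` and its return strings are hypothetical and are not part of RHIPE.

```r
# Verify that $HADOOP is set and that $HADOOP/bin/hadoop exists.
# Returns "ok" or a short description of what is missing.
check_hadoop <- function(hadoop_home = Sys.getenv("HADOOP")) {
  if (!nzchar(hadoop_home)) {
    return("HADOOP is not set")
  }
  exe <- file.path(hadoop_home, "bin", "hadoop")
  if (!file.exists(exe)) {
    return(paste("no hadoop executable at", exe))
  }
  "ok"
}

print(check_hadoop())
```

Running the same check on every node is a quick way to confirm that the filesystem layout is identical across the cluster.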
Tests • In R • should work successfully • should successfully write the list to the HDFS • should return a list of length 3, each element a list of 2 objects.
Tests (cont’d) • A quick run of this should also work
R and Hadoop Integrated Programming Environment • The R and Hadoop Integrated Programming Environment is an R package that lets you • compute across massive data sets • create subsets • apply routines to subsets • produce displays on subsets across a cluster of computers • using the Hadoop DFS and Hadoop MapReduce framework. • Uses Hadoop Streaming • Users can write MapReduce programs in other languages, e.g. Python, Ruby, Perl, which are then deployed over the cluster. • Hadoop Streaming then transfers the input data from Hadoop to the user program and vice versa.
R and Hadoop Integrated Programming Environment • RHIPE is just that. • RHIPE consists of several functions to interact with the HDFS • e.g. save data sets, read data created by RHIPE MapReduce, delete files. • Commands in R • Compose and launch MapReduce jobs from R using the commands rhmr and rhex. • Monitor the status using rhstatus, which returns an R object. • Stop jobs using rhkill. • Compute side-effect files. • The output of parallel computations may include the creation of PDF files, R data sets, CSV files, etc. • These will be copied by RHIPE to a central location on the HDFS, removing the need for the user to copy them from the compute nodes or to set up a network file system. • Data sets that are created by RHIPE can be read using other languages such as Java, Perl, Python and C. • The serialization format used by RHIPE (converting R objects to binary data) uses Google's Protocol Buffers, which is very fast and creates compact representations for R objects. Ideal for massive data sets. • Data sets created using RHIPE are key-value pairs. • A key is mapped to a value. A MapReduce computation iterates over the key-value pairs in parallel. If the output of a RHIPE job creates unique keys, the output can be treated as an external-memory associative dictionary. RHIPE can thus be used as a medium-scale (millions of keys) disk-based dictionary, which is useful for loading R objects into R.
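The key-value model described above can be mimicked in plain R to see how the pieces fit together. The sketch below is an in-memory simulation only: it does not call RHIPE or Hadoop, and `map_reduce`, its arguments, and the toy input are all hypothetical. It also shows why unique output keys let the result behave like an associative dictionary.

```r
# A single-process imitation of MapReduce over (key, value) pairs.
map_reduce <- function(pairs, map, reduce) {
  # Map: each input pair produces a list of intermediate (key, value) pairs.
  mapped <- do.call(c, lapply(pairs, function(kv) map(kv$key, kv$value)))
  # Shuffle: group the intermediate values by key.
  keys <- vapply(mapped, function(kv) kv$key, character(1))
  grouped <- split(lapply(mapped, function(kv) kv$value), keys)
  # Reduce: collapse each group into one output pair with a unique key.
  lapply(names(grouped), function(k) {
    list(key = k, value = reduce(k, grouped[[k]]))
  })
}

# Toy input: line number as key, a line of carrier codes as value.
input <- list(
  list(key = 1L, value = "WN DL WN"),
  list(key = 2L, value = "DL AA")
)

counts <- map_reduce(
  input,
  map = function(k, line) {
    lapply(strsplit(line, " ", fixed = TRUE)[[1]],
           function(w) list(key = w, value = 1L))
  },
  reduce = function(k, values) sum(unlist(values))
)

# Because the output keys are unique, the result works as a dictionary.
dict <- setNames(lapply(counts, `[[`, "value"),
                 vapply(counts, `[[`, character(1), "key"))
print(dict[["WN"]])
```

In RHIPE the map and reduce steps run on the cluster and the pairs are serialized to the HDFS; only the overall shape of the computation is the same here.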
Example: Airline Dataset • Copying the Data to the HDFS
Example: Airline Dataset (cont’d) • rhstatus
Example: Airline Dataset (cont’d) • Demonstration of using Hadoop as a Queryable Database
Demonstration of using Hadoop as a Queryable Database • Top 20 cities by total volume of flights.
Example: Transforming Text Data • Text data • The carrier name is column 9. • Southwest carrier code is WN, Delta is DL. • Only those rows with column 9 equal to WN or DL will be saved.
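The filtering rule described above can be sketched as a plain R function applied to a batch of text lines. This is not the RHIPE job from the slides: the helper `keep_carriers`, the assumption that fields are comma-separated, and the three sample lines are all illustrative.

```r
# Keep only the lines whose 9th comma-separated field is one of the
# requested carrier codes.
keep_carriers <- function(lines, carriers = c("WN", "DL"), col = 9) {
  fields <- strsplit(lines, ",", fixed = TRUE)
  # Lines with fewer than `col` fields yield NA and are dropped.
  carrier <- vapply(fields, function(f) f[col], character(1))
  lines[carrier %in% carriers]
}

# Hypothetical sample rows; the carrier code is in column 9.
lines <- c(
  "2008,1,3,4,2003,1955,2211,2225,WN,335",
  "2008,1,3,4,754,735,1002,1000,AA,3231",
  "2008,1,3,4,628,620,804,750,DL,448"
)

kept <- keep_carriers(lines)
print(kept)
```

Inside a map step, a function of this shape would be applied to each batch of input lines, and only the returned lines would be emitted as output.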
Example: Transforming Text Data (cont’d) • The output • 1 • 2