Ricardo: Integrating R and Hadoop • Angel Trifonov • Yun Lu • Ying Wang
Contents • Introduction • Motivating Examples • Preliminaries • Ricardo Design • Experimental Study • Conclusion
Data collection • Enterprise datasets • Why are these datasets important? • Statistical analysis on datasets • Data analyst workflow: • Explore/summarize the data • Build a model • Use it to improve business practices • A statistical package is needed
R and DMS • R's design: • Single server • Main memory • Fails on large data! • A problem for analysts – they work with large datasets • Workarounds: vertical scaling, or working on subsets • Neither is ideal! • Large-scale data management systems (DMS) • Example: Hadoop • Good at aggregation processing
Ricardo • Overview • Scalable platform for deep analytics • Part of the eXtreme Analytics Platform (XAP) project • Named after the economist David Ricardo • Facilitates trading between R and Hadoop • Previous work on MapReduce: combined approaches succeed for small data • Several advantages
Ricardo advantages • Familiar working environment – analysts keep working inside a statistical environment • Data attraction – Hadoop's flexible data store together with the Jaql query language • Integration of data processing into the analytical workflow – handle large data by preprocessing and reducing it • Reliability and community support – built from open-source projects • Improved user code – facilitates better code • Deep analytics – can handle many kinds of advanced statistical analyses • No re-inventing of wheels – combines existing statistical and DMS technology
Example 1: Simple trading • Analyst workflow: exploration • A graph shows how a movie's perception changes over time • How does an analyst get this visualization? • R is good for the job, BUT… • Ricardo can help!
Example 2: Simple trading • Analyst workflow: evaluation – the analyst already has a model • The analysis must run over all the data • Ricardo can help once again • What did we see? Simple trading: • In the first case, a small aggregate is passed to R • In the second case, the model is passed to Hadoop • More complicated analyses? No problem!
Example 3: Complex trading • Analyst workflow: modeling • How? • The simple-trading scheme is no good here – it loses information • Ricardo permits complex trading • The data needs decomposition • Small parts handled by R • Large parts handled by Hadoop • Consider an example: a latent-factor model • Every piece of data must be taken into account • Simple trading won't work
The R project • Developed at the University of Auckland, New Zealand • Open-source language and statistical environment • Small maintenance team, but big popularity • Example of functionality:
fit <- lm(df$mean ~ df$year)
plot(df$year, df$mean)
abline(fit)
• Data frames: R's table-like data structure
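For a self-contained version of the snippet above (the numbers are illustrative, not taken from the paper):
df <- data.frame(year = 2000:2005,                       # toy per-year means
                 mean = c(3.9, 3.8, 3.7, 3.6, 3.4, 3.3))
fit <- lm(mean ~ year, data = df)   # least-squares fit of mean rating on year
plot(df$year, df$mean)              # scatter plot of the yearly means
abline(fit)                         # overlay the fitted regression line
summary(fit)                        # slope, intercept, R-squared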
Large-scale DMS • Enterprise data warehouses – the dominant type of DMS • Designed for clean/structured data – a poor fit • Analysts want their data "dirty" (raw) • What to do? Use Hadoop! • The Hadoop method: • Hadoop Distributed File System (HDFS) • Operates on raw data files • Processes them according to MapReduce • Map-phase results are fed to reducers (rough analogy below) • Used successfully on large-scale datasets • An appealing alternative
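As a rough analogy in R itself (not Hadoop code), the map/reduce pattern behind the per-year rating average looks like this: the map step keys each rating by year, and the reduce step averages the ratings sharing a key.
ratings <- data.frame(year   = c(2002, 2002, 2003, 2003, 2004),   # toy data
                      rating = c(5, 4, 3, 4, 2))
# "Map": emit (year, rating) pairs; "Reduce": average the values per key
tapply(ratings$rating, ratings$year, mean)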
Jaql: A JSON Query Language • Hadoop drawback – its programming interface • Several projects attempt to remedy this • Ricardo uses Jaql • Open-source dataflow language • Jaql scripts are compiled automatically into MapReduce jobs • Operates directly on data files • JSON view:
[{ customer: "Michael",
   movie: { name: "About Schmidt", year: 2002 },
   rating: 5 }, ...]
• Jaql query (mean rating per year, sorted by year):
read("ratings")
  -> group by year = $.movie.year
     into { year, mean: avg($[*].rating) }
  -> sort by [ $.year ]
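For concreteness, this query produces one small record per year, e.g. (values illustrative):
[ { year: 2002, mean: 4.1 },
  { year: 2003, mean: 3.8 },
  ... ]
Through the bridge this arrives in R as an ordinary two-column data frame.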
Problem Statement • How to bridge between them? • DMS (Hadoop) – Advantage: large-scale processing. Disadvantage: insufficient analytical functionality • R – Advantages: statistical software, data analysis. Disadvantages: operates in main memory, limited data size
Ricardo Design • R driver: • The dataset is not memory-resident in R • R only needs memory for small, aggregated data • Hadoop: • Performs the data-intensive operations • Stores the data in HDFS • R–Jaql bridge: • Connects the R driver to the Hadoop cluster • Executes Jaql queries • Sends the results back to R as data frames • Allows Jaql queries to spawn R processes on Hadoop worker nodes
R–Jaql Bridge • Components: • An R package and a Jaql module, supporting calls in both directions (R → Hadoop and Hadoop → R)
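A minimal sketch of the R → Hadoop direction of the bridge, assuming hypothetical function names jaqlConnect and jaqlTable (modeled on the bridge described in the paper, not a verified API):
ch <- jaqlConnect("cluster-host")   # hypothetical: connect the R driver to the cluster
df <- jaqlTable(ch,                 # run a Jaql query, get the result as an R data frame
  'read("ratings")
     -> group by year = $.movie.year
        into { year, mean: avg($[*].rating) }')
head(df)                            # the aggregate is small and fits easily in R
The reverse direction (Jaql spawning R processes on worker nodes) is what enables complex trading.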
Ricardo Workflow • Analyst's typical workflow: • Data exploration – preliminary observations – simple trading • Model building – deep analytics – complex trading • Model evaluation – judging model quality – simple trading • Why is model building complex trading?
Review Example • Movie recommendation • Data exploration → simple trading: linear regression • Model building → complex trading: latent-factor model
Simple Trading – Linear Regression • Get the aggregated data from Hadoop • Fit the data in R (see the sketch below)
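Continuing the bridge sketch above, the R side of this slide is plain R once the small aggregate has arrived:
# df is the data frame returned by the (hypothetical) jaqlTable call
fit <- lm(mean ~ year, data = df)   # fit mean rating as a linear trend in year
plot(df$year, df$mean)              # visualize the aggregate
abline(fit)                         # overlay the fitted regression line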
Simple Trading – Evaluate Model • Apply the fitted model to all of the data • Select the top-10 outliers (see the sketch below)
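A hedged sketch of the evaluation round trip; jaqlExport and scoringQuery are placeholders standing in for the paper's mechanism, not a verified API:
jaqlExport(ch, coef(fit), "model")   # hypothetical: ship the fitted coefficients to the cluster
# A Jaql script (elided here) scores every rating against the model and
# keeps only the 10 largest residuals; just those rows come back to R
outliers <- jaqlTable(ch, scoringQuery)
outliers                             # small data frame: the top-10 outliers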
Complex Trading – Model Building • Objectives
Model Building • Randomly initialize p and q • Set up the optimization method • Compute: • the squared error e • the derivative of e with respect to p • the derivative of e with respect to q • Update p and q • Repeat until convergence (in symbols below)
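In symbols (following the paper's latent-factor setup, with one factor p_i per customer, one factor q_j per movie, and observed ratings r_ij), the quantities in the loop above are:
e = \sum_{(i,j)} \bigl(r_{ij} - p_i q_j\bigr)^2
\frac{\partial e}{\partial p_i} = -2 \sum_{j} \bigl(r_{ij} - p_i q_j\bigr)\, q_j
\frac{\partial e}{\partial q_j} = -2 \sum_{i} \bigl(r_{ij} - p_i q_j\bigr)\, p_i
Each iteration feeds e and its gradient to the optimizer, which updates p and q.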
Model Building • Table r: stores the ratings • Tables p and q: store the latent factors • (Figure: the three tables r, p, and q)
Details • Compute the sum of squared errors • Compute the gradient (sketch below)
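A minimal R sketch of these two computations on toy in-memory data (in Ricardo the big sums would run in Hadoop, with only the aggregated error and gradient returned to R):
# Toy ratings: customer i gave movie j the given rating (illustrative)
r <- data.frame(i = c(1, 1, 2), j = c(1, 2, 1), rating = c(5, 3, 4))
p <- c(1.0, 0.8)                             # one latent factor per customer
q <- c(1.2, 0.9)                             # one latent factor per movie
err <- r$rating - p[r$i] * q[r$j]            # residual of each observed rating
e   <- sum(err^2)                            # sum of squared errors
gp  <- -2 * tapply(err * q[r$j], r$i, sum)   # gradient of e w.r.t. each p_i
gq  <- -2 * tapply(err * p[r$i], r$j, sum)   # gradient of e w.r.t. each q_j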
Other Models • Principal component analysis (PCA) • Compute eigenvectors and eigenvalues • The eigenvectors are mutually orthogonal • Generalized linear models (GLM) • The response variable is modeled through a nonlinear function • …
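For reference, PCA in plain R via the built-in prcomp (this runs entirely in memory; in Ricardo the data-intensive aggregation would be traded to Hadoop):
x  <- matrix(rnorm(200), ncol = 4)   # toy data: 50 observations, 4 variables
pc <- prcomp(x, scale. = TRUE)       # PCA via SVD of the centered, scaled data
pc$rotation                          # loadings: mutually orthogonal eigenvectors
pc$sdev^2                            # component variances (the eigenvalues)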
Implementation • Java Native Interface (JNI) as the bridge between C (which R is written in) and Java (Jaql) • How is data transferred across the JNI boundary? • A naïve way vs. a better solution • A Jaql wrapper handles data-representation incompatibilities • This lives inside the bridge • Which component of the R–Jaql bridge is this?
Related work • Scaling out R: • Low-level message-passing packages • Task- and data-parallel computing systems • Automatic parallelization of high-level code • Deepening a DMS
Conclusion • Ricardo combines the data management capabilities of Hadoop and Jaql with the statistical functionality provided by R. • Future work: identifying and integrating additional statistical analyses that are amenable to the Ricardo approach.
References • S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating R and Hadoop. In SIGMOD, pages 987–998, 2010.