Ricardo: Integrating R and Hadoop • Angel Trifonov • Yun Lu • Ying Wang
Contents • Introduction • Motivating Examples • Preliminaries • Ricardo Design • Experimental Study • Conclusion
Data collection • Enterprise datasets • Why are these datasets important? • Statistical analysis on datasets • Data analyst workflow: • Explore/summarize the data • Build a model • Use it to improve business practices • A statistical package is needed
R and DMS • R's design: • Single server • Main memory • Fails on large data! • A problem for analysts – they work with large datasets • Workarounds: vertical scaling, or working on subsets • Neither is ideal! • Large-scale data management systems (DMS) • Example: Hadoop • Good at aggregation processing
Ricardo • Overview • Scalable platform for deep analytics • Part of the eXtreme Analytics Platform (XAP) project • Named after the economist David Ricardo • Facilitates trading between R and Hadoop • Previous work on MapReduce: combined approaches succeed for small data • Several advantages
Ricardo advantages • Familiar working environment – analysts keep working inside a statistical environment • Data attraction – Hadoop's flexible data store together with the Jaql query language • Integration of data processing into the analytical workflow – handle large data by preprocessing and reducing it • Reliability and community support – built from open-source projects • Improved user code – facilitates better code • Deep analytics – can handle many kinds of advanced statistical analyses • No re-inventing of wheels – combines existing statistical and DMS technology
Example 1: Simple trading • Analyst workflow: exploration • A graph shows how a movie's perception changes over time • How does an analyst get this visualization? • R is good for the job, BUT… • Ricardo can help!
Example 2: Simple trading • Analyst workflow: evaluation – the analyst already has a model • The analysis must run over all the data • Ricardo can help once again • What did we see? Simple trading: • In the first case, a small aggregate is passed to R • In the second case, the model is passed to Hadoop • More complicated analyses? No problem!
Example 3: Complex trading • Analyst workflow: modeling • How? • The simple-trading scheme is no good here – it loses information • Ricardo permits complex trading • The data needs decomposition • Small parts handled by R • Large parts handled by Hadoop • Consider an example: a latent-factor model • Every piece of data must be taken into account • Simple trading won't work
The R project • Developed at the University of Auckland, New Zealand • Open-source language and statistical environment • Small maintenance team, but big popularity • Example of functionality:
fit <- lm(df$mean ~ df$year)
plot(df$year, df$mean)
abline(fit)
• Data frames: R's table-like data structure
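For a self-contained version of the snippet above (the numbers are illustrative, not taken from the paper):
df <- data.frame(year = 2000:2005,                       # toy per-year means
                 mean = c(3.9, 3.8, 3.7, 3.6, 3.4, 3.3))
fit <- lm(mean ~ year, data = df)   # least-squares fit of mean rating on year
plot(df$year, df$mean)              # scatter plot of the yearly means
abline(fit)                         # overlay the fitted regression line
summary(fit)                        # slope, intercept, R-squared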
Large-scale DMS • Enterprise data warehouses – the dominant type of DMS • Designed for clean/structured data – a poor fit • Analysts want their data "dirty" (raw) • What to do? Use Hadoop! • The Hadoop method: • Hadoop Distributed File System (HDFS) • Operates on raw data files • Processes them according to MapReduce • Map-phase results are fed to reducers (rough analogy below) • Used successfully on large-scale datasets • An appealing alternative
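As a rough analogy in R itself (not Hadoop code), the map/reduce pattern behind the per-year rating average looks like this: the map step keys each rating by year, and the reduce step averages the ratings sharing a key.
ratings <- data.frame(year   = c(2002, 2002, 2003, 2003, 2004),   # toy data
                      rating = c(5, 4, 3, 4, 2))
# "Map": emit (year, rating) pairs; "Reduce": average the values per key
tapply(ratings$rating, ratings$year, mean)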
Jaql: A JSON Query Language • Hadoop drawback – its programming interface • Several projects attempt to remedy this • Ricardo uses Jaql • Open-source dataflow language • Jaql scripts are compiled automatically into MapReduce jobs • Operates directly on data files • JSON view:
[{ customer: "Michael",
   movie: { name: "About Schmidt", year: 2002 },
   rating: 5 }, ...]
• Jaql query (mean rating per year, sorted by year):
read("ratings")
  -> group by year = $.movie.year
     into { year, mean: avg($[*].rating) }
  -> sort by [ $.year ]
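For concreteness, this query produces one small record per year, e.g. (values illustrative):
[ { year: 2002, mean: 4.1 },
  { year: 2003, mean: 3.8 },
  ... ]
Through the bridge this arrives in R as an ordinary two-column data frame.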
Problem Statement • How to bridge between them? • DMS (Hadoop) – Advantage: large-scale processing. Disadvantage: insufficient analytical functionality • R – Advantages: statistical software, data analysis. Disadvantages: operates in main memory, limited data size
Ricardo Design • R driver: • The dataset is not memory-resident in R • R only needs memory for small, aggregated data • Hadoop: • Performs the data-intensive operations • Stores the data in HDFS • R–Jaql bridge: • Connects the R driver to the Hadoop cluster • Executes Jaql queries • Sends the results back to R as data frames • Allows Jaql queries to spawn R processes on Hadoop worker nodes
R–Jaql Bridge • Components: • An R package and a Jaql module, supporting calls in both directions (R → Hadoop and Hadoop → R)
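A minimal sketch of the R → Hadoop direction of the bridge, assuming hypothetical function names jaqlConnect and jaqlTable (modeled on the bridge described in the paper, not a verified API):
ch <- jaqlConnect("cluster-host")   # hypothetical: connect the R driver to the cluster
df <- jaqlTable(ch,                 # run a Jaql query, get the result as an R data frame
  'read("ratings")
     -> group by year = $.movie.year
        into { year, mean: avg($[*].rating) }')
head(df)                            # the aggregate is small and fits easily in R
The reverse direction (Jaql spawning R processes on worker nodes) is what enables complex trading.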
Ricardo Workflow • Analyst's typical workflow: • Data exploration – preliminary observations – simple trading • Model building – deep analytics – complex trading • Model evaluation – judging model quality – simple trading • Why is model building complex trading?
Review Example • Movie recommendation • Data exploration → simple trading: linear regression • Model building → complex trading: latent-factor model
Simple Trading – Linear Regression • Get the aggregated data from Hadoop • Fit the data in R (see the sketch below)
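Continuing the bridge sketch above, the R side of this slide is plain R once the small aggregate has arrived:
# df is the data frame returned by the (hypothetical) jaqlTable call
fit <- lm(mean ~ year, data = df)   # fit mean rating as a linear trend in year
plot(df$year, df$mean)              # visualize the aggregate
abline(fit)                         # overlay the fitted regression line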
Simple Trading – Evaluate Model • Apply the fitted model to all of the data • Select the top-10 outliers (see the sketch below)
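A hedged sketch of the evaluation round trip; jaqlExport and scoringQuery are placeholders standing in for the paper's mechanism, not a verified API:
jaqlExport(ch, coef(fit), "model")   # hypothetical: ship the fitted coefficients to the cluster
# A Jaql script (elided here) scores every rating against the model and
# keeps only the 10 largest residuals; just those rows come back to R
outliers <- jaqlTable(ch, scoringQuery)
outliers                             # small data frame: the top-10 outliers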
Complex Trading – Model Building • Objectives
Model Building • Randomly initialize p and q • Set up the optimization method • Compute: • the squared error e • the derivative of e with respect to p • the derivative of e with respect to q • Update p and q • Repeat until convergence (in symbols below)
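In symbols (following the paper's latent-factor setup, with one factor p_i per customer, one factor q_j per movie, and observed ratings r_ij), the quantities in the loop above are:
e = \sum_{(i,j)} \bigl(r_{ij} - p_i q_j\bigr)^2
\frac{\partial e}{\partial p_i} = -2 \sum_{j} \bigl(r_{ij} - p_i q_j\bigr)\, q_j
\frac{\partial e}{\partial q_j} = -2 \sum_{i} \bigl(r_{ij} - p_i q_j\bigr)\, p_i
Each iteration feeds e and its gradient to the optimizer, which updates p and q.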
Model Building • Table r: stores the ratings • Tables p and q: store the latent factors • (Figure: the three tables r, p, and q)
Details • Compute the sum of squared errors • Compute the gradient (sketch below)
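A minimal R sketch of these two computations on toy in-memory data (in Ricardo the big sums would run in Hadoop, with only the aggregated error and gradient returned to R):
# Toy ratings: customer i gave movie j the given rating (illustrative)
r <- data.frame(i = c(1, 1, 2), j = c(1, 2, 1), rating = c(5, 3, 4))
p <- c(1.0, 0.8)                             # one latent factor per customer
q <- c(1.2, 0.9)                             # one latent factor per movie
err <- r$rating - p[r$i] * q[r$j]            # residual of each observed rating
e   <- sum(err^2)                            # sum of squared errors
gp  <- -2 * tapply(err * q[r$j], r$i, sum)   # gradient of e w.r.t. each p_i
gq  <- -2 * tapply(err * p[r$i], r$j, sum)   # gradient of e w.r.t. each q_j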
Other Models • Principal component analysis (PCA) • Compute eigenvectors and eigenvalues • The eigenvectors are mutually orthogonal • Generalized linear models (GLM) • The response variable is modeled through a nonlinear function • …
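For reference, PCA in plain R via the built-in prcomp (this runs entirely in memory; in Ricardo the data-intensive aggregation would be traded to Hadoop):
x  <- matrix(rnorm(200), ncol = 4)   # toy data: 50 observations, 4 variables
pc <- prcomp(x, scale. = TRUE)       # PCA via SVD of the centered, scaled data
pc$rotation                          # loadings: mutually orthogonal eigenvectors
pc$sdev^2                            # component variances (the eigenvalues)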
Implementation • Java Native Interface (JNI) as the bridge between C (which R is written in) and Java (Jaql) • How is data transferred across the JNI boundary? • A naïve way vs. a better solution • A Jaql wrapper handles data-representation incompatibilities • This lives inside the bridge • Which component of the R–Jaql bridge is this?
Related work • Scaling out R: • Low-level message-passing packages • Task- and data-parallel computing systems • Automatic parallelization of high-level code • Deepening a DMS
Conclusion • Ricardo combines the data management capabilities of Hadoop and Jaql with the statistical functionality provided by R. • Future work: identifying and integrating additional statistical analyses that are amenable to the Ricardo approach.
References • S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating R and Hadoop. In SIGMOD, pages 987–998, 2010.