SIMR Spark In MapReduce

SIMR Spark In MapReduce • Ali Ghodsi, Ahir Reddy • UC Berkeley, Databricks

Background • Hard to try out Spark on MapReduce v1 clusters: needs separate machines, installing Scala and Spark, admin rights • Generally difficult to try out on any cluster: configure, compile, then choose Standalone, Mesos, or YARN

SIMR • MapReduce job with Spark inside it • Launches Spark, Scala, your job, Spark shell
bash> ./simr --shell
bash> ./simr my.jar testClass %spark_url%

How does it work?

Under the hood • Ship to all mappers: the Scala & Spark fat jar, your job jar

Setting up Spark • Mappers write their ID to HDFS • Mapper with the lowest timestamp becomes leader • Leader mapper executes the Spark driver • Other mappers execute the Spark executors
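The election step on this slide can be sketched roughly as follows. This is a minimal illustration, not SIMR's actual code: a local temporary directory stands in for HDFS, and the names `register_mapper`, `elect_leader`, `role_for`, and the `mapper-<id>` file layout are all hypothetical. Timestamps are passed in explicitly so the example is deterministic.

```python
import os
import tempfile

def register_mapper(election_dir, mapper_id, timestamp):
    """Each mapper writes one marker file: the name carries its ID,
    the content carries its registration timestamp."""
    path = os.path.join(election_dir, f"mapper-{mapper_id}")
    with open(path, "w") as f:
        f.write(str(timestamp))

def elect_leader(election_dir):
    """Return the ID of the mapper with the lowest timestamp (ties broken by ID)."""
    entries = []
    for name in os.listdir(election_dir):
        with open(os.path.join(election_dir, name)) as f:
            ts = float(f.read())
        entries.append((ts, name[len("mapper-"):]))
    return min(entries)[1]

def role_for(election_dir, mapper_id):
    """The leader runs the driver; every other mapper runs an executor."""
    return "driver" if elect_leader(election_dir) == mapper_id else "executor"

# Local directory used as a stand-in for a shared HDFS directory (assumption).
election_dir = tempfile.mkdtemp()
register_mapper(election_dir, "m2", 1002.0)
register_mapper(election_dir, "m0", 1000.5)
register_mapper(election_dir, "m1", 1001.0)

roles = {m: role_for(election_dir, m) for m in ("m0", "m1", "m2")}
print(roles)
```

Because every mapper reads the same set of files and applies the same rule, all of them agree on the leader without talking to each other directly.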

Connecting everyone • Connect executors with driver • Driver writes URL to HDFS, executors busy-read • Spark is ready!
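A rough sketch of the "driver writes URL, executors busy-read" handshake is shown here. It assumes a local temporary directory in place of HDFS and threads in place of mappers; `publish_url`, `wait_for_url`, the file name `driver_url`, and the URL value are hypothetical. The write-then-rename step is a design choice of this sketch, not something the slides describe.

```python
import os
import tempfile
import threading
import time

def publish_url(url_file, url):
    """Driver side: write to a temp file, then rename, so that
    readers never observe a partially written URL."""
    tmp = url_file + ".tmp"
    with open(tmp, "w") as f:
        f.write(url)
    os.replace(tmp, url_file)

def wait_for_url(url_file, poll_interval=0.05, timeout=5.0):
    """Executor side: poll until the URL file appears, then read it."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(url_file):
            with open(url_file) as f:
                return f.read()
        time.sleep(poll_interval)
    raise TimeoutError(f"driver URL not published at {url_file}")

shared_dir = tempfile.mkdtemp()               # stand-in for HDFS (assumption)
url_file = os.path.join(shared_dir, "driver_url")
driver_url = "spark://leader-host:7077"       # hypothetical placeholder URL

found = []
executors = [
    threading.Thread(target=lambda: found.append(wait_for_url(url_file)))
    for _ in range(3)
]
for t in executors:
    t.start()

time.sleep(0.2)                               # executors are polling meanwhile
publish_url(url_file, driver_url)

for t in executors:
    t.join()
print(found)
```

Polling keeps the executors free of any dependency on the driver's start-up time: they only need to know where in the shared file system to look.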

Interacting with Spark • Relay keyboard input & screen output • Relay server runs on the leader mapper • Relay client runs on the client machine • Connecting the two: relay server writes its URL to HDFS, relay client reads it and connects to the server • All input/output is relayed between client and driver
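The relay idea can be illustrated with a small line-based TCP sketch. Everything here is an assumption made for illustration: `fake_driver` stands in for the Spark shell, the wire format is one line per message, and the server address is handed to the client directly rather than being written to and read from HDFS as the slide describes.

```python
import socket
import threading

def fake_driver(line):
    """Hypothetical stand-in for the Spark driver consuming one input line."""
    return f"driver got: {line}"

def relay_server(server_sock):
    """Leader side: forward each input line to the driver, send its output back."""
    conn, _ = server_sock.accept()
    with conn, conn.makefile("r", encoding="utf-8") as rfile, \
            conn.makefile("w", encoding="utf-8") as wfile:
        for line in rfile:                     # ends when the client disconnects
            wfile.write(fake_driver(line.rstrip("\n")) + "\n")
            wfile.flush()

def relay_client(address, lines):
    """Client side: send keyboard input lines, collect screen output lines."""
    replies = []
    with socket.create_connection(address) as sock, \
            sock.makefile("r", encoding="utf-8") as rfile, \
            sock.makefile("w", encoding="utf-8") as wfile:
        for line in lines:
            wfile.write(line + "\n")
            wfile.flush()
            replies.append(rfile.readline().rstrip("\n"))
    return replies

server_sock = socket.socket()
server_sock.bind(("127.0.0.1", 0))             # any free local port
server_sock.listen(1)
address = server_sock.getsockname()            # would be published via HDFS

server = threading.Thread(target=relay_server, args=(server_sock,))
server.start()
replies = relay_client(address, ["val x = 1", "x + 1"])
server.join()
server_sock.close()
print(replies)
```

Separate read and write file objects are used on each socket so that buffered input is never discarded when output is written.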

Hadoop versions • Precompiled for Hadoop 1.0.4 (HDP 1.0-1.2), 1.2.x (HDP 1.3), 0.20 (CDH3), 2.0.0 (CDH4) • Instructions on how to compile your own • http://databricks.github.io/simr

DEMO TIME
