SIMR: Spark In MapReduce
Ali Ghodsi, Ahir Reddy
UC Berkeley, Databricks

Background
• Hard to try out Spark on MapReduce v1 clusters
  • Separate machines
  • Installing Scala, Spark
  • Admin rights
• Generally difficult to try out on a cluster
  • Configure, compile, Standalone, Mesos, YARN

SIMR
• MapReduce job with Spark inside it
• Launches Spark, Scala, your job, Spark-shell

bash> ./simr --shell
bash> ./simr my.jar testClass %spark_url%

Under the hood
• Ship to all mappers
  • Scala & Spark fat jar
  • Your job jar

Setting up Spark
• Mappers write their ID to HDFS
• Lowest timestamped mapper becomes leader
• Leader mapper executes
  • Spark driver
• Other mappers execute
  • Executors
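
The election described on this slide can be sketched in a few lines. This is a minimal illustration, not SIMR's actual code: a local directory stands in for HDFS, file modification time stands in for the timestamp, and the names `register_mapper`, `elect_leader`, and `role_for` are hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def register_mapper(shared_dir: Path, mapper_id: str,
                    timestamp_ns: Optional[int] = None) -> Path:
    """Write this mapper's ID file into the shared directory (stand-in for HDFS)."""
    shared_dir.mkdir(parents=True, exist_ok=True)
    path = shared_dir / mapper_id
    path.write_text(mapper_id)
    if timestamp_ns is not None:
        # Assumption for reproducibility: pin the modification time explicitly.
        os.utime(path, ns=(timestamp_ns, timestamp_ns))
    return path


def elect_leader(shared_dir: Path) -> str:
    """Return the ID whose file has the lowest timestamp; ties break on the ID."""
    entries = [(p.stat().st_mtime_ns, p.name)
               for p in shared_dir.iterdir() if p.is_file()]
    if not entries:
        raise RuntimeError("no mappers have registered yet")
    return min(entries)[1]


def role_for(shared_dir: Path, mapper_id: str) -> str:
    """The leader runs the Spark driver; every other mapper runs an executor."""
    return "driver" if elect_leader(shared_dir) == mapper_id else "executor"
```

Every mapper applies the same rule to the same directory listing, so they agree on the leader without talking to each other. The sketch assumes all mappers have registered before anyone calls `elect_leader`; how SIMR waits for that is not covered in the slides.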

Connecting everyone
• Connect executors with driver
  • Driver writes URL to HDFS, executors busy-read
• Spark is ready!
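
The write-then-busy-read handshake might look roughly like the following. Again a sketch under assumptions: a local file stands in for the HDFS path, the function names, poll interval, and timeout are invented for illustration, and the write-to-temp-then-rename step is a common precaution rather than something the slides say SIMR does.

```python
import os
import time
from pathlib import Path


def publish_url(url_file: Path, url: str) -> None:
    """Driver side: write the URL to a temp name, then rename it into place
    so that a polling reader never sees a half-written file."""
    tmp = url_file.with_name(url_file.name + ".tmp")
    tmp.write_text(url)
    os.replace(tmp, url_file)


def wait_for_url(url_file: Path, poll_interval: float = 0.05,
                 timeout: float = 5.0) -> str:
    """Executor side: poll until the URL file appears, then return its contents."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            return url_file.read_text().strip()
        except FileNotFoundError:
            time.sleep(poll_interval)  # not published yet; try again shortly
    raise TimeoutError(f"driver URL did not appear at {url_file}")
```

An executor would call `wait_for_url` once at startup and then connect to whatever address it returns. The timeout is there so that a failed driver does not leave executors polling forever.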

Interacting with Spark
• Relay keyboard input & screen output
  • Relay Server executed on leader mapper
  • Relay Client executed on client machine
• Connecting the two
  • Relay server writes URL to HDFS
  • Relay client reads and connects to server
  • Relay all input/output between client/driver
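
To make the relay idea concrete, here is a toy line-based relay over TCP. It is only a sketch: the slides do not describe SIMR's wire protocol, so the newline-delimited framing, the single-client server, and the names `start_relay_server` and `relay_client` are all assumptions. The `handle_line` callback stands in for the Spark shell running next to the driver.

```python
import socket
import threading
from typing import Callable, List, Tuple


def start_relay_server(handle_line: Callable[[str], str]) -> Tuple[socket.socket, str]:
    """Listen on an ephemeral local port and return (socket, 'host:port').
    In SIMR the address would be written to HDFS for the client to find."""
    server = socket.create_server(("127.0.0.1", 0))
    host, port = server.getsockname()

    def serve() -> None:
        conn, _ = server.accept()  # a single client, for simplicity
        with conn, conn.makefile("r", encoding="utf-8") as reader, \
                conn.makefile("w", encoding="utf-8") as writer:
            for line in reader:  # keyboard input sent by the client
                writer.write(handle_line(line.rstrip("\n")) + "\n")
                writer.flush()   # screen output sent back to the client

    threading.Thread(target=serve, daemon=True).start()
    return server, f"{host}:{port}"


def relay_client(url: str, lines: List[str]) -> List[str]:
    """Connect to the relay server, send each input line, collect each reply."""
    host, port = url.rsplit(":", 1)
    replies = []
    with socket.create_connection((host, int(port)), timeout=5) as conn, \
            conn.makefile("r", encoding="utf-8") as reader, \
            conn.makefile("w", encoding="utf-8") as writer:
        for line in lines:
            writer.write(line + "\n")
            writer.flush()
            replies.append(reader.readline().rstrip("\n"))
    return replies
```

A real relay would forward raw bytes in both directions rather than one reply per line, since a shell can print several lines (or nothing) in response to a single input.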

Hadoop versions
• Precompiled for Hadoop
  • 1.0.4 (HDP 1.0-1.2)
  • 1.2.x (HDP 1.3)
  • 0.20 (CDH3)
  • 2.0.0 (CDH4)
• Instructions on how to compile your own
  • http://databricks.github.io/simr