Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data
Reynold Xin 辛湜 (shi2), UC Berkeley
My Background • PhD student in the AMP Lab at UC Berkeley • 50-person lab focusing on big data • Works on the Berkeley Data Analytics System (BDAS) • Work/intern experience at Google Research, IBM DB2, Altera
AMPLab @ Berkeley • NSF & DARPA funded • Sponsors: Amazon Web Services / Google / SAP • Intel, Huawei, Facebook, Hortonworks, Cloudera … (21 companies) • Collaboration across databases (Mike Franklin, sitting right here), systems (Ion Stoica), networking (Scott Shenker), architecture (David Patterson), and machine learning (Michael Jordan).
What is Spark? • Not a modified version of Hadoop • Separate, fast, MapReduce-like engine • In-memory data storage for very fast iterative queries • General execution graphs and powerful optimizations • Up to 100x faster than Hadoop MapReduce • Compatible with Hadoop's storage APIs • Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
What is Shark? • A SQL analytics engine built on top of Spark • Compatible with Apache Hive data, metastores, and queries (HiveQL, UDFs, etc) • Similar speedups of up to 100x
Project History • Spark project started in 2009, open sourced 2010 • Shark started summer 2011, open sourced 2012 • Spark 0.7 released Feb 2013 • Streaming alpha in Spark 0.7 • GraphLab support soon
Adoption • In use at Yahoo!, Foursquare, Berkeley, Princeton & many others (possibly Taobao, Netease) • 600+ member meetup, 800+ watchers on GitHub • 30+ contributors (including Intel Shanghai) • AMPCamp 150 attendees, 5000 online attendees • AMPCamp on Mar 15, ECNU, Shanghai
This Talk • Introduction • Spark • Shark: SQL on Spark • Why is Hadoop MapReduce slow? • Streaming Spark
Why go Beyond MapReduce? • MapReduce simplified big data analysis by giving a reliable programming model for large clusters • But as soon as it got popular, users wanted more: • More complex, multi-stage applications • More interactive ad-hoc queries
Why go Beyond MapReduce? • Complex jobs and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing • In MapReduce, the only way to share data across jobs is stable storage (e.g. HDFS) -> slow!
[Figure: a multi-stage iterative algorithm and an interactive data-mining session, each re-reading data between stages/queries]
Examples
• Iterative job: HDFS read -> iter. 1 -> HDFS write -> HDFS read -> iter. 2 -> HDFS write -> …
• Interactive mining: each query re-reads the input from HDFS (query 1 -> result 1, query 2 -> result 2, query 3 -> result 3)
• I/O and serialization can take 90% of the time
Goal: In-Memory Data Sharing
• Iterative job: input is loaded once; iterations read and write distributed memory
• Interactive mining: one-time processing loads the input into distributed memory; queries 1, 2, 3 run against it
• 10-100× faster than network and disk
Solution: ResilientDistributed Datasets (RDDs) • Distributed collections of objects that can be stored in memory for fast reuse • Automatically recover lost data on failure • Support a wide range of applications
Programming Model Resilient distributed datasets (RDDs) • Immutable, partitioned collections of objects • Can be cached in memory for efficient reuse • Transformations (e.g. map, filter, groupBy, join) • Build RDDs from other RDDs Actions (e.g. count, collect, save) • Return a result or write it to storage
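The transformation/action split above can be sketched in plain Python. This is a toy illustration (the `ToyRDD` class is hypothetical, not Spark's code): transformations only record what to do, and an action forces evaluation.

```python
# Toy sketch of lazy transformations vs. eager actions (not Spark internals).
class ToyRDD:
    def __init__(self, data_fn):
        self.data_fn = data_fn          # thunk that produces the elements

    @staticmethod
    def parallelize(seq):
        return ToyRDD(lambda: iter(seq))

    # Transformations: return a new ToyRDD, compute nothing yet
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self.data_fn()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self.data_fn() if pred(x)))

    # Actions: actually run the pipeline and return a result
    def collect(self):
        return list(self.data_fn())

    def count(self):
        return sum(1 for _ in self.data_fn())

rdd = ToyRDD.parallelize(range(10))
evens = rdd.filter(lambda x: x % 2 == 0)   # nothing computed yet
squares = evens.map(lambda x: x * x)       # still nothing computed
print(squares.collect())                   # the action triggers evaluation
```

Because nothing runs until an action, an engine built this way can see the whole chain of transformations before executing it, which is what enables the pipelining and caching described later.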
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

  lines = spark.textFile("hdfs://...")            // base RDD
  errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
  messages = errors.map(_.split('\t')(2))
  cachedMsgs = messages.cache()

  cachedMsgs.filter(_.contains("foo")).count      // action
  cachedMsgs.filter(_.contains("bar")).count

[Figure: the driver ships tasks to workers; each worker reads one block of the file and caches its partition of messages]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Word Count

  myrdd.flatMap { doc => doc.split("\\s") }
       .map { word => (word, 1) }
       .reduceByKey { case (v1, v2) => v1 + v2 }

Or equivalently, with placeholder syntax:

  myrdd.flatMap(_.split("\\s"))
       .map((_, 1))
       .reduceByKey(_ + _)
Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.:

  messages = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))

Lineage chain: HadoopRDD (path = hdfs://…) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(…))
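Lineage-based recovery can be illustrated with a small pure-Python sketch (the `LineageRDD` class and its helpers are hypothetical, not Spark's implementation): each dataset records its parent and the function applied, so a lost result can be rebuilt by replaying the chain from the source data.

```python
# Toy sketch of lineage-based recomputation (assumptions, not Spark internals).
class LineageRDD:
    def __init__(self, parent, fn, source=None):
        self.parent, self.fn, self.source = parent, fn, source

    @staticmethod
    def from_source(rows):
        return LineageRDD(None, None, source=rows)

    def map(self, f):
        return LineageRDD(self, lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return LineageRDD(self, lambda rows: [r for r in rows if pred(r)])

    def compute(self):
        if self.parent is None:
            return list(self.source)           # base data (e.g. a file)
        return self.fn(self.parent.compute())  # replay the lineage

log = LineageRDD.from_source(["INFO ok", "error\tdisk\tfull", "error\tnet\tdown"])
messages = log.filter(lambda l: "error" in l).map(lambda l: l.split("\t")[2])

cached = messages.compute()      # normally this result is cached in memory
recovered = messages.compute()   # after a failure: recompute from lineage
print(recovered)
```

Because only the (small) chain of functions is logged rather than the data itself, recovery is cheap to track and only the lost partitions need to be recomputed.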
Fault Recovery Results
[Chart: per-iteration running times; the iteration where a failure happens runs slower while lost partitions are recomputed from lineage]
Tradeoff Space
[Chart: granularity of updates (fine to coarse) vs. write throughput (low to high)]
• Fine-grained updates, transactional workloads: K-V stores, databases, RAMCloud (limited by network bandwidth)
• Coarse-grained updates, batch analytics: HDFS, RDDs (can run at memory bandwidth)
Example: Logistic Regression

  // Load data in memory once
  val data = spark.textFile(...).map(readPoint).cache()

  // Initial parameter vector
  var w = Vector.random(D)

  // Repeated MapReduce steps to do gradient descent
  for (i <- 1 to ITERATIONS) {
    val gradient = data.map { p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }

  println("Final w: " + w)
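The same gradient-descent loop can be run in plain Python on a toy dataset (no Spark; the data and iteration count are made up for illustration). The map step computes each point's gradient contribution and the reduce step sums them, exactly mirroring the Scala snippet.

```python
# Plain-Python analogue of the logistic regression loop (toy data, no Spark).
import math, random

random.seed(0)
D = 2
# Toy linearly separable points: label y in {-1, +1}, x in R^2
data = [((1.0, 1.0), 1.0), ((2.0, 1.5), 1.0),
        ((-1.0, -1.0), -1.0), ((-2.0, -0.5), -1.0)]

w = [random.random() for _ in range(D)]   # initial parameter vector

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

for _ in range(100):                      # ITERATIONS
    gradient = [0.0] * D
    for x, y in data:                     # the "map" step
        scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        for j in range(D):                # the "reduce" step (summing)
            gradient[j] += scale * x[j]
    w = [wj - gj for wj, gj in zip(w, gradient)]

# All training points should now be classified correctly
print(all((1 if dot(w, x) > 0 else -1) == y for x, y in data))
```

The key point of the slide is that `data` is loaded and parsed once; only the cheap gradient computation repeats each iteration, which is why Spark's later iterations are so much faster than the first.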
Logistic Regression Performance
• Hadoop: 110 s / iteration
• Spark: 80 s for the first iteration (loading data), 1 s for further iterations
Supported Operators
map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
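To make one of these operators concrete, here is a plain-Python sketch of what `reduceByKey` computes (the `reduce_by_key` helper is hypothetical, not Spark's implementation): values for each key are merged with a binary function.

```python
# What reduceByKey computes, in plain Python (illustrative helper only).
def reduce_by_key(pairs, fn):
    merged = {}
    for k, v in pairs:
        merged[k] = fn(merged[k], v) if k in merged else v
    return sorted(merged.items())

words = "to be or not to be".split()
counts = reduce_by_key([(w, 1) for w in words], lambda a, b: a + b)
print(counts)   # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In Spark the merge happens per-partition before the shuffle, so only one partially reduced value per key crosses the network.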
Scheduler
• Dryad-like task DAG
• Pipelines functions within a stage
• Cache-aware data locality & reuse
• Partitioning-aware to avoid shuffles
[Figure: example DAG over RDDs A-G split into Stage 1 (groupBy), Stage 2 (map, join), and Stage 3 (union); shaded boxes mark previously computed partitions]
Implementation
• Use Mesos / YARN to share resources with Hadoop & other frameworks
• Can access any Hadoop input source (HDFS, S3, …)
• Core engine is only 20k lines of code
[Figure: Spark, Hadoop, MPI and other frameworks running on YARN / Mesos across cluster nodes]
User Applications
• In-memory analytics & anomaly detection (Conviva)
• Interactive queries on data streams (Quantifind)
• Exploratory log analysis (Foursquare)
• Traffic estimation w/ GPS data (Mobile Millennium)
• Twitter spam classification (Monarch)
• ...
Conviva GeoReport
• Group aggregations on many keys w/ same filter
• 40× gain over Hive from avoiding repeated reading, deserialization and filtering
[Chart: running time in hours, Hive vs. Spark]
Mobile Millennium Project
• Estimate city traffic from crowdsourced GPS data
• Iterative EM algorithm scaling to 160 nodes
Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
This Talk • Introduction • Spark • Shark: SQL on Spark • Why is Hadoop MapReduce slow? • Streaming Spark
Challenges • Data volumes keep expanding. • Faults and stragglers complicate parallel database design. • Complexity of analysis: machine learning, graph algorithms, etc. • Need for low latency and interactivity.
MPP Databases • Vertica, SAP HANA, Teradata, Google Dremel... • Fast! • Generally not fault-tolerant; long-running queries become challenging as clusters scale up. • Lack rich analytics such as machine learning and graph algorithms.
MapReduce • Apache Hive, Google Tenzing, Turn Cheetah … • Deterministic, idempotent tasks enable fine-grained fault tolerance and resource sharing. • Expressive enough for complex machine learning algorithms. • High latency; dismissed for interactive workloads.
Shark • A data warehouse that • builds on Spark, • scales out and is fault-tolerant, • supports low-latency, interactive queries through in-memory computation, • supports both SQL and complex analytics, • is compatible with Apache Hive (storage, serdes, UDFs, types, metadata).
Hive Architecture
• Client: CLI, JDBC
• Driver: SQL Parser -> Query Optimizer -> Physical Plan -> Execution
• Metastore: table metadata
• Execution engine: MapReduce, over HDFS
Shark Architecture
• Client: CLI, JDBC
• Driver: SQL Parser -> Query Optimizer -> Physical Plan -> Execution, plus a Cache Manager
• Metastore: shared with Hive
• Execution engine: Spark, over HDFS
Engine Features • Columnar Memory Store • Machine Learning Integration • Partial DAG Execution • Hash-based Shuffle vs Sort-based Shuffle • Data Co-partitioning • Partition Pruning based on Range Statistics • ...
Efficient In-Memory Storage
• Simply caching Hive records as Java objects is inefficient due to high per-object overhead
• Instead, Shark employs column-oriented storage using arrays of primitive types
• Benefit: similarly compact size to serialized data, but >5x faster to access

Row Storage: (1, john, 4.1), (2, mike, 3.5), (3, sally, 6.4)
Column Storage: [1, 2, 3], [john, mike, sally], [4.1, 3.5, 6.4]
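The two layouts above can be shown in plain Python (an illustration of the idea, not Shark's code): `array('d', ...)` stores doubles unboxed in a contiguous buffer, like Shark's primitive arrays, whereas a list of per-row tuples pays per-object overhead.

```python
# Row vs. column layouts from the slide, sketched in plain Python.
from array import array

# Row storage: one record object per row
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]

# Column storage: one primitive-backed array per column
ids = array("i", [1, 2, 3])
names = ["john", "mike", "sally"]
scores = array("d", [4.1, 3.5, 6.4])

# Scanning one column touches only that column's contiguous array
avg = sum(scores) / len(scores)
print(round(avg, 2))   # 4.67
```

Beyond compactness, the columnar form means an aggregation over one column never deserializes or even reads the other columns.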
Machine Learning Integration • Unified system for query processing and machine learning • Query processing and ML share the same set of workers and caches
Performance
• 1.7 TB real warehouse data on 100 EC2 nodes
[Chart: query running times, Shark vs. Hive]
This Talk • Introduction • Spark • Shark: SQL on Spark • Why is Hadoop MapReduce slow? • Streaming Spark
Why are previous MR-based systems slow? • Disk-based intermediate outputs. • Inferior data format and layout (no control of data co-partitioning). • Execution strategies (lack of optimization based on data statistics). • Task scheduling and launch overhead!
Task Launch Overhead
• Hadoop uses heartbeats to communicate scheduling decisions; task launch delay is 5 - 10 seconds.
• Spark uses an event-driven architecture and can launch tasks in 5 ms, enabling:
• better parallelism
• easier straggler mitigation
• elasticity
• multi-tenancy resource sharing