Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data
Reynold Xin 辛湜 (shi2), UC Berkeley
My Background • PhD student in the AMP Lab at UC Berkeley • 50-person lab focusing on big data • Works on the Berkeley Data Analytics System (BDAS) • Work/intern experience at Google Research, IBM DB2, Altera
AMPLab @ Berkeley • NSF & DARPA funded • Sponsors: Amazon Web Services / Google / SAP • Intel, Huawei, Facebook, Hortonworks, Cloudera … (21 companies) • Collaboration across databases (Mike Franklin, sitting right here), systems (Ion Stoica), networking (Scott Shenker), architecture (David Patterson), and machine learning (Michael Jordan).
What is Spark? • Not a modified version of Hadoop • Separate, fast, MapReduce-like engine • In-memory data storage for very fast iterative queries • General execution graphs and powerful optimizations • Up to 100x faster than Hadoop MapReduce • Compatible with Hadoop's storage APIs • Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
What is Shark? • A SQL analytics engine built on top of Spark • Compatible with Apache Hive data, metastores, and queries (HiveQL, UDFs, etc) • Similar speedups of up to 100x
Project History • Spark project started in 2009, open sourced 2010 • Shark started summer 2011, open sourced 2012 • Spark 0.7 released Feb 2013 • Streaming alpha in Spark 0.7 • GraphLab support soon
Adoption • In use at Yahoo!, Foursquare, Berkeley, Princeton & many others (possibly Taobao, Netease) • 600+ member meetup, 800+ watchers on GitHub • 30+ contributors (including Intel Shanghai) • AMPCamp 150 attendees, 5000 online attendees • AMPCamp on Mar 15, ECNU, Shanghai
This Talk • Introduction • Spark • Shark: SQL on Spark • Why is Hadoop MapReduce slow? • Streaming Spark
Why go Beyond MapReduce? • MapReduce simplified big data analysis by giving a reliable programming model for large clusters • But as soon as it got popular, users wanted more: • More complex, multi-stage applications • More interactive ad-hoc queries
Why go Beyond MapReduce? • Complex jobs and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing • In MapReduce, the only way to share data across jobs is stable storage (e.g. HDFS) -> slow!
[Figure: a multi-stage iterative algorithm and an interactive data-mining session, each re-reading data between stages/queries]
Examples
• Iterative job: HDFS read -> iter. 1 -> HDFS write -> HDFS read -> iter. 2 -> HDFS write -> …
• Interactive mining: each query re-reads the input from HDFS (query 1 -> result 1, query 2 -> result 2, query 3 -> result 3)
• I/O and serialization can take 90% of the time
Goal: In-Memory Data Sharing
• Iterative job: input is loaded once; iterations read and write distributed memory
• Interactive mining: one-time processing loads the input into distributed memory; queries 1, 2, 3 run against it
• 10-100× faster than network and disk
Solution: ResilientDistributed Datasets (RDDs) • Distributed collections of objects that can be stored in memory for fast reuse • Automatically recover lost data on failure • Support a wide range of applications
Programming Model Resilient distributed datasets (RDDs) • Immutable, partitioned collections of objects • Can be cached in memory for efficient reuse • Transformations (e.g. map, filter, groupBy, join) • Build RDDs from other RDDs Actions (e.g. count, collect, save) • Return a result or write it to storage
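The transformation/action split above can be sketched in plain Python. This is a toy illustration (the `ToyRDD` class is hypothetical, not Spark's code): transformations only record what to do, and an action forces evaluation.

```python
# Toy sketch of lazy transformations vs. eager actions (not Spark internals).
class ToyRDD:
    def __init__(self, data_fn):
        self.data_fn = data_fn          # thunk that produces the elements

    @staticmethod
    def parallelize(seq):
        return ToyRDD(lambda: iter(seq))

    # Transformations: return a new ToyRDD, compute nothing yet
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self.data_fn()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self.data_fn() if pred(x)))

    # Actions: actually run the pipeline and return a result
    def collect(self):
        return list(self.data_fn())

    def count(self):
        return sum(1 for _ in self.data_fn())

rdd = ToyRDD.parallelize(range(10))
evens = rdd.filter(lambda x: x % 2 == 0)   # nothing computed yet
squares = evens.map(lambda x: x * x)       # still nothing computed
print(squares.collect())                   # the action triggers evaluation
```

Because nothing runs until an action, an engine built this way can see the whole chain of transformations before executing it, which is what enables the pipelining and caching described later.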
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

  lines = spark.textFile("hdfs://...")            // base RDD
  errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
  messages = errors.map(_.split('\t')(2))
  cachedMsgs = messages.cache()

  cachedMsgs.filter(_.contains("foo")).count      // action
  cachedMsgs.filter(_.contains("bar")).count

[Figure: the driver ships tasks to workers; each worker reads one block of the file and caches its partition of messages]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Word Count

  myrdd.flatMap { doc => doc.split("\\s") }
       .map { word => (word, 1) }
       .reduceByKey { case (v1, v2) => v1 + v2 }

Or equivalently, with placeholder syntax:

  myrdd.flatMap(_.split("\\s"))
       .map((_, 1))
       .reduceByKey(_ + _)
Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.:

  messages = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))

Lineage chain: HadoopRDD (path = hdfs://…) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(…))
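Lineage-based recovery can be illustrated with a small pure-Python sketch (the `LineageRDD` class and its helpers are hypothetical, not Spark's implementation): each dataset records its parent and the function applied, so a lost result can be rebuilt by replaying the chain from the source data.

```python
# Toy sketch of lineage-based recomputation (assumptions, not Spark internals).
class LineageRDD:
    def __init__(self, parent, fn, source=None):
        self.parent, self.fn, self.source = parent, fn, source

    @staticmethod
    def from_source(rows):
        return LineageRDD(None, None, source=rows)

    def map(self, f):
        return LineageRDD(self, lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return LineageRDD(self, lambda rows: [r for r in rows if pred(r)])

    def compute(self):
        if self.parent is None:
            return list(self.source)           # base data (e.g. a file)
        return self.fn(self.parent.compute())  # replay the lineage

log = LineageRDD.from_source(["INFO ok", "error\tdisk\tfull", "error\tnet\tdown"])
messages = log.filter(lambda l: "error" in l).map(lambda l: l.split("\t")[2])

cached = messages.compute()      # normally this result is cached in memory
recovered = messages.compute()   # after a failure: recompute from lineage
print(recovered)
```

Because only the (small) chain of functions is logged rather than the data itself, recovery is cheap to track and only the lost partitions need to be recomputed.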
Fault Recovery Results
[Chart: per-iteration running times; the iteration where a failure happens runs slower while lost partitions are recomputed from lineage]
Tradeoff Space
[Chart: granularity of updates (fine to coarse) vs. write throughput (low to high)]
• Fine-grained updates, transactional workloads: K-V stores, databases, RAMCloud (limited by network bandwidth)
• Coarse-grained updates, batch analytics: HDFS, RDDs (can run at memory bandwidth)
Example: Logistic Regression

  // Load data in memory once
  val data = spark.textFile(...).map(readPoint).cache()

  // Initial parameter vector
  var w = Vector.random(D)

  // Repeated MapReduce steps to do gradient descent
  for (i <- 1 to ITERATIONS) {
    val gradient = data.map { p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }

  println("Final w: " + w)
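The same gradient-descent loop can be run in plain Python on a toy dataset (no Spark; the data and iteration count are made up for illustration). The map step computes each point's gradient contribution and the reduce step sums them, exactly mirroring the Scala snippet.

```python
# Plain-Python analogue of the logistic regression loop (toy data, no Spark).
import math, random

random.seed(0)
D = 2
# Toy linearly separable points: label y in {-1, +1}, x in R^2
data = [((1.0, 1.0), 1.0), ((2.0, 1.5), 1.0),
        ((-1.0, -1.0), -1.0), ((-2.0, -0.5), -1.0)]

w = [random.random() for _ in range(D)]   # initial parameter vector

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

for _ in range(100):                      # ITERATIONS
    gradient = [0.0] * D
    for x, y in data:                     # the "map" step
        scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        for j in range(D):                # the "reduce" step (summing)
            gradient[j] += scale * x[j]
    w = [wj - gj for wj, gj in zip(w, gradient)]

# All training points should now be classified correctly
print(all((1 if dot(w, x) > 0 else -1) == y for x, y in data))
```

The key point of the slide is that `data` is loaded and parsed once; only the cheap gradient computation repeats each iteration, which is why Spark's later iterations are so much faster than the first.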
Logistic Regression Performance
• Hadoop: 110 s / iteration
• Spark: 80 s for the first iteration (loading data), 1 s for further iterations
Supported Operators
map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
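To make one of these operators concrete, here is a plain-Python sketch of what `reduceByKey` computes (the `reduce_by_key` helper is hypothetical, not Spark's implementation): values for each key are merged with a binary function.

```python
# What reduceByKey computes, in plain Python (illustrative helper only).
def reduce_by_key(pairs, fn):
    merged = {}
    for k, v in pairs:
        merged[k] = fn(merged[k], v) if k in merged else v
    return sorted(merged.items())

words = "to be or not to be".split()
counts = reduce_by_key([(w, 1) for w in words], lambda a, b: a + b)
print(counts)   # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In Spark the merge happens per-partition before the shuffle, so only one partially reduced value per key crosses the network.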
Scheduler
• Dryad-like task DAG
• Pipelines functions within a stage
• Cache-aware data locality & reuse
• Partitioning-aware to avoid shuffles
[Figure: example DAG over RDDs A-G split into Stage 1 (groupBy), Stage 2 (map, join), and Stage 3 (union); shaded boxes mark previously computed partitions]
Implementation
• Use Mesos / YARN to share resources with Hadoop & other frameworks
• Can access any Hadoop input source (HDFS, S3, …)
• Core engine is only 20k lines of code
[Figure: Spark, Hadoop, MPI and other frameworks running on YARN / Mesos across cluster nodes]
User Applications
• In-memory analytics & anomaly detection (Conviva)
• Interactive queries on data streams (Quantifind)
• Exploratory log analysis (Foursquare)
• Traffic estimation w/ GPS data (Mobile Millennium)
• Twitter spam classification (Monarch)
• ...
Conviva GeoReport
• Group aggregations on many keys w/ same filter
• 40× gain over Hive from avoiding repeated reading, deserialization and filtering
[Chart: running time in hours, Hive vs. Spark]
Mobile Millennium Project
• Estimate city traffic from crowdsourced GPS data
• Iterative EM algorithm scaling to 160 nodes
Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
This Talk • Introduction • Spark • Shark: SQL on Spark • Why is Hadoop MapReduce slow? • Streaming Spark
Challenges • Data volumes keep expanding. • Faults and stragglers complicate parallel database design. • Complexity of analysis: machine learning, graph algorithms, etc. • Need for low latency and interactivity.
MPP Databases • Vertica, SAP HANA, Teradata, Google Dremel... • Fast! • Generally not fault-tolerant; long-running queries become challenging as clusters scale up. • Lack rich analytics such as machine learning and graph algorithms.
MapReduce • Apache Hive, Google Tenzing, Turn Cheetah … • Deterministic, idempotent tasks enable fine-grained fault tolerance and resource sharing. • Expressive enough for complex machine learning algorithms. • High latency; dismissed for interactive workloads.
Shark • A data warehouse that • builds on Spark, • scales out and is fault-tolerant, • supports low-latency, interactive queries through in-memory computation, • supports both SQL and complex analytics, • is compatible with Apache Hive (storage, serdes, UDFs, types, metadata).
Hive Architecture
• Client: CLI, JDBC
• Driver: SQL Parser -> Query Optimizer -> Physical Plan -> Execution
• Metastore: table metadata
• Execution engine: MapReduce, over HDFS
Shark Architecture
• Client: CLI, JDBC
• Driver: SQL Parser -> Query Optimizer -> Physical Plan -> Execution, plus a Cache Manager
• Metastore: shared with Hive
• Execution engine: Spark, over HDFS
Engine Features • Columnar Memory Store • Machine Learning Integration • Partial DAG Execution • Hash-based Shuffle vs Sort-based Shuffle • Data Co-partitioning • Partition Pruning based on Range Statistics • ...
Efficient In-Memory Storage
• Simply caching Hive records as Java objects is inefficient due to high per-object overhead
• Instead, Shark employs column-oriented storage using arrays of primitive types
• Benefit: similarly compact size to serialized data, but >5x faster to access

Row Storage: (1, john, 4.1), (2, mike, 3.5), (3, sally, 6.4)
Column Storage: [1, 2, 3], [john, mike, sally], [4.1, 3.5, 6.4]
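The two layouts above can be shown in plain Python (an illustration of the idea, not Shark's code): `array('d', ...)` stores doubles unboxed in a contiguous buffer, like Shark's primitive arrays, whereas a list of per-row tuples pays per-object overhead.

```python
# Row vs. column layouts from the slide, sketched in plain Python.
from array import array

# Row storage: one record object per row
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]

# Column storage: one primitive-backed array per column
ids = array("i", [1, 2, 3])
names = ["john", "mike", "sally"]
scores = array("d", [4.1, 3.5, 6.4])

# Scanning one column touches only that column's contiguous array
avg = sum(scores) / len(scores)
print(round(avg, 2))   # 4.67
```

Beyond compactness, the columnar form means an aggregation over one column never deserializes or even reads the other columns.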
Machine Learning Integration • Unified system for query processing and machine learning • Query processing and ML share the same set of workers and caches
Performance
• 1.7 TB real warehouse data on 100 EC2 nodes
[Chart: query running times, Shark vs. Hive]
This Talk • Introduction • Spark • Shark: SQL on Spark • Why is Hadoop MapReduce slow? • Streaming Spark
Why are previous MR-based systems slow? • Disk-based intermediate outputs. • Inferior data format and layout (no control of data co-partitioning). • Execution strategies (lack of optimization based on data statistics). • Task scheduling and launch overhead!
Task Launch Overhead
• Hadoop uses heartbeats to communicate scheduling decisions; task launch delay is 5 - 10 seconds.
• Spark uses an event-driven architecture and can launch tasks in 5 ms, enabling:
• better parallelism
• easier straggler mitigation
• elasticity
• multi-tenancy resource sharing