Introduction to Spark • Matei Zaharia • spark-project.org
What is Spark? • Fast and expressive cluster computing system interoperable with Apache Hadoop • Improves efficiency through: • In-memory computing primitives • General computation graphs (up to 100× faster; 2-10× on disk) • Improves usability through: • Rich APIs in Scala, Java, Python • Interactive shell (often 5× less code)
Project History • Started in 2009, open sourced 2010 • 25 companies now contributing code • Yahoo!, Intel, Adobe, Quantifind, Conviva, Bizo, … • Entered Apache incubator in June
A General Stack • Spark is the basis of a wide set of projects in the Berkeley Data Analytics Stack (BDAS): Shark (SQL), Spark Streaming (real-time), GraphX (graph), MLbase (machine learning), and more, all running on Spark • More details: amplab.berkeley.edu
This Talk • Spark introduction & use cases • Implementation • Other stack projects • The power of unification
Why a New Programming Model? • MapReduce greatly simplified big data analysis • But as soon as it got popular, users wanted more: • More complex, multi-pass analytics (e.g. ML, graph) • More interactive ad-hoc queries • More real-time stream processing • All 3 need faster data sharing across parallel jobs
Data Sharing in MapReduce [Diagram: each iteration and each ad-hoc query does its own HDFS read and HDFS write between steps] • Slow due to replication, serialization, and disk I/O
Data Sharing in Spark [Diagram: after one-time processing, iterations and queries share the input through distributed memory] • 10-100× faster than network and disk
Spark Programming Model • Key idea: resilient distributed datasets (RDDs) • Distributed collections of objects that can be cached in memory across cluster • Manipulated through parallel operators • Automatically recomputed on failure • Programming interface • Functional APIs in Scala, Java, Python • Interactive use from Scala shell
Example: Log Mining • Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda x: x.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()

messages.filter(lambda x: "foo" in x).count()   # action: triggers computation
messages.filter(lambda x: "bar" in x).count()   # served from the cache

[Diagram: the driver ships tasks to workers; each worker reads its block of the file once and caches its partition of messages in memory]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Fault Tolerance • RDDs track lineage info to rebuild lost data:

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda pair: pair[1] > 10)

[Lineage graph: input file → map → reduceByKey → filter; a lost partition is rebuilt by rerunning these steps on its inputs]
Example: Logistic Regression [Chart: Hadoop, 110 s per iteration; Spark, 80 s for the first iteration and 1 s for each further iteration once the data is cached]
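The pattern behind those numbers: parse and cache the dataset once, then run each gradient step as a parallel map-and-reduce over memory. A minimal sketch of that pattern; the input path, feature count, and iteration count are placeholders, and the input is assumed to be space-separated "label f1 f2 ..." lines:

import org.apache.spark.SparkContext
import scala.math.exp

object LogisticRegression {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "LogisticRegression")

    // Parse once and cache: every later iteration reads from memory, not HDFS.
    val points = sc.textFile("hdfs://.../points.txt").map { line =>
      val nums = line.split(" ").map(_.toDouble)
      (nums.head, nums.tail) // (label, feature vector)
    }.cache()

    val numFeatures = 10 // assumed dimensionality
    var w = Array.fill(numFeatures)(0.0)
    for (_ <- 1 to 10) {
      // One iteration = one parallel map + reduce over the cached RDD.
      val gradient = points.map { case (y, x) =>
        val dot = w.zip(x).map { case (wi, xi) => wi * xi }.sum
        val scale = (1.0 / (1.0 + exp(-y * dot)) - 1.0) * y
        x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }
    println("Final weights: " + w.mkString(", "))
    sc.stop()
  }
}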
Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();
Supported Operators • map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin • reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip • sample • take • first • partitionBy • mapWith • pipe • save • ...
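A few of these in action, assuming an existing SparkContext `sc` (the data is made up):

val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val visits = sc.parallelize(Seq((1, "home"), (1, "cart"), (3, "search")))

users.join(visits).collect()          // inner join: only key 1 survives
users.leftOuterJoin(visits).collect() // key 2 kept, with None on the visits side
visits.map { case (_, page) => (page, 1) }
      .reduceByKey(_ + _)
      .collect()                      // per-page counts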
Spark Community • 1200+ meetup members • 90+ devs contributing • 25 companies contributing
This Talk • Spark introduction & use cases • Implementation • Other stack projects • The power of unification
Software Components • Spark client is a library in your program (one instance per app) • Runs tasks locally or on a cluster • Mesos, YARN, or standalone mode • Accesses storage systems via the Hadoop InputFormat API • Can use HBase, HDFS, S3, … [Diagram: your application holds a SparkContext that talks to a cluster manager or local threads; workers run Spark executors against HDFS or other storage]
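For example, a driver program just instantiates the client library; a minimal sketch, where the master URL and paths are placeholders:

import org.apache.spark.SparkContext

// The SparkContext is the client inside your application: it runs tasks
// locally or hands them to a cluster manager depending on the master URL
// ("local[*]", "spark://host:7077", or a Mesos/YARN URL).
val sc = new SparkContext("local[*]", "MyApp")

// Storage goes through the Hadoop InputFormat API, so any Hadoop-readable
// source works here (HDFS, S3, HBase, ...).
val data = sc.textFile("hdfs://namenode/path/to/data")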
Task Scheduler • General task graphs • Automatically pipelines functions • Data locality aware • Partitioning aware to avoid shuffles [Diagram: an RDD graph split into stages at shuffle boundaries (groupBy, join); narrow operations like map and filter are pipelined within a stage, and stages over already-cached partitions are skipped]
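To make the stage boundaries concrete, a small sketch (paths and names invented): narrow operations are pipelined inside one stage, while a key-based operation forces a shuffle into a new stage.

// Stage 1: flatMap and filter are pipelined, record at a time,
// with no intermediate materialization.
val words = sc.textFile("hdfs://.../input.txt")
  .flatMap(_.split(" "))
  .filter(_.nonEmpty)

// reduceByKey needs all values for a key on one node, so the scheduler
// inserts a shuffle here and starts a second stage.
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// If `counts` is cached, later jobs that reuse it skip the earlier stages.
counts.cache()
counts.filter { case (_, n) => n > 100 }.collect()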
Advanced Features • Controllable partitioning • Speed up joins against a dataset • Controllable storage formats • Keep data serialized for efficiency, replicate to multiple nodes, cache on disk • Shared variables: broadcasts, accumulators • See online docs for details!
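A sketch of the shared variables, assuming an existing SparkContext `sc`; the data is invented and the accumulator call is the original 0.x-era API:

// Broadcast: ship a read-only lookup table to every worker once,
// instead of once per task.
val countries = sc.broadcast(Map("us" -> "United States", "de" -> "Germany"))

// Accumulator: workers add to it; only the driver reads the total.
val unknown = sc.accumulator(0)

val codes = sc.parallelize(Seq("us", "de", "fr"))
val names = codes.map { c =>
  if (countries.value.contains(c)) countries.value(c)
  else { unknown += 1; "unknown" }
}
names.collect()
println("unknown codes: " + unknown.value)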
This Talk • Spark introduction & use cases • Implementation • Other stack projects • The power of unification
Shark: Hive on Spark • Columnar SQL analytics engine for Spark • Supports both SQL and complex analytics • Up to 100× faster than Apache Hive • Compatible with Apache Hive • HiveQL, UDF/UDAF, SerDes, Scripts • Runs on existing Hive warehouses • In use at Yahoo! for fast in-memory OLAP
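From the Scala side, the Shark paper describes a sql2rdd call that runs a HiveQL string and returns the result as an RDD, so SQL output can flow into regular Spark code. A hedged sketch; the context variable, the `logs` table, and its columns are hypothetical:

// Assuming an already-constructed SharkContext `sharkCtx`.
val rows = sharkCtx.sql2rdd("SELECT status, COUNT(*) FROM logs GROUP BY status")
println(rows.count())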
Hive Architecture [Diagram: clients (CLI, JDBC) connect to a driver containing SQL parser, query optimizer, and physical plan; execution runs on MapReduce over HDFS, with a metastore alongside the driver]
Shark Architecture [Diagram: same structure as Hive, with a cache manager added to the driver and execution running on Spark instead of MapReduce] [Engle et al, SIGMOD 2012]
Performance [Chart: query runtimes on 1.7 TB of warehouse data on 100 EC2 nodes]
Spark Integration • Unified system for SQL, graph processing, machine learning • All share the same set of workers and caches
Other Stack Projects • Spark Streaming: stateful, fault-tolerant stream processing (out since Spark 0.7)

sc.twitterStream(...)
  .flatMap(_.getText.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

• MLlib: library of high-quality machine learning algorithms (out since 0.8)
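MLlib's entry points at the time were static train methods; a sketch assuming the 0.8-era API, where LabeledPoint took a plain Array[Double] (later releases switched to a Vector type), with a made-up input path:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// Parse "label f1 f2 ..." lines into labeled points.
val training = sc.textFile("hdfs://.../train.txt").map { line =>
  val nums = line.split(" ").map(_.toDouble)
  LabeledPoint(nums.head, nums.tail)
}

// 20 iterations of SGD; the returned model scores new feature vectors.
val model = LogisticRegressionWithSGD.train(training, 20)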
This Talk • Spark introduction & use cases • Implementation • Other stack projects • The power of unification
The Power of Unification • Spark’s speed and programmability are great, but what makes the project unique? • Unification: multiple programming models (SQL, stream, graph, …) on the same engine • This has powerful benefits: • For the engine • For users [Diagram: Shark, Streaming, GraphX, MLbase, … all on Spark]
Code Size [Chart: non-test, non-example source lines for Shark, Streaming, and GraphX built on top of Spark]
Performance [Charts: performance comparisons for SQL, streaming, and graph workloads]
Performance • One benefit of unification is that optimizations we make for one project affect all the others! • E.g. scheduler optimizations in Spark Streaming (0.6) resulted in 2× faster Shark queries
What it Means for Users • Separate frameworks: each step (ETL, train, query) does its own HDFS read and HDFS write • Spark: one HDFS read, then ETL, training, and queries share data in memory
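In code, the unified pipeline is one program over one cached dataset; a sketch where the record type and parser are hypothetical stand-ins:

// Hypothetical record type and parser, just to make the pipeline concrete.
case class LogRecord(user: String, isError: Boolean)
def parse(line: String): LogRecord = {
  val parts = line.split("\t")
  LogRecord(parts(0), parts(1) == "ERROR")
}

// ETL once and cache: one HDFS read instead of a read/write per framework.
val records = sc.textFile("hdfs://.../raw").map(parse).cache()

// Training input and ad-hoc queries then share the same in-memory data.
val errorRate = records.filter(_.isError).count().toDouble / records.count()
val topUsers  = records.map(r => (r.user, 1)).reduceByKey(_ + _).collect()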
Getting Started • Visit spark-project.org for videos & tutorials • Easy to run in local mode, private clusters, EC2 • Built-in ML library since Spark 0.8 • Conference December 2nd:www.spark-summit.org
Conclusion • Big data analytics is evolving to include: • More complex analytics (e.g. machine learning) • More interactive ad-hoc queries • More real-time stream processing • Spark is a fast platform that unifies these apps • More info: spark-project.org