Introduction to Apache Spark
Matei Zaharia, Pat McDonough
spark.apache.org
What is Apache Spark?
• Fast and general cluster computing system, interoperable with Hadoop
• Improves efficiency through:
  • In-memory computing primitives
  • General computation graphs
  • Up to 100× faster (2-10× on disk)
• Improves usability through:
  • Rich APIs in Scala, Java, Python
  • Interactive shell
  • 2-5× less code
Project History
• Started in 2009, open sourced in 2010
• 30+ companies now contributing code
  • Databricks, Yahoo!, Intel, Adobe, Cloudera, Bizo, …
• One of the largest communities in big data
A General Stack
Spark is the common engine under a set of higher-level libraries:
• Shark (SQL)
• Spark Streaming (real-time)
• GraphX (graph)
• MLlib (machine learning)
• …
This Talk
• Spark introduction & use cases
• Other stack projects
• The power of unification
• Demo
Why a New Programming Model?
• MapReduce greatly simplified big data analysis
• But once started, users wanted more:
  • More complex, multi-pass analytics (e.g. ML, graph)
  • More interactive ad-hoc queries
  • More real-time stream processing
• All 3 need faster data sharing in parallel apps
Data Sharing in MapReduce
• Iterative jobs: each iteration reads its input from HDFS and writes its output back to HDFS (HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → …)
• Interactive queries: every query re-reads the same input from HDFS to produce its result
• Slow due to replication, serialization, and disk IO
What We’d Like
• Iterative jobs: read the input once, keep intermediate results (iter. 1, iter. 2, …) in distributed memory
• Interactive queries: one-time processing loads the input into distributed memory, then queries 1, 2, 3 run against it
• 10-100× faster than network and disk
Spark Model
• Write programs in terms of transformations on distributed datasets
• Resilient Distributed Datasets (RDDs)
  • Collections of objects that can be stored in memory or on disk across a cluster
  • Built via parallel transformations (map, filter, …)
  • Automatically rebuilt on failure
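A minimal Scala sketch of this model, assuming a SparkContext named sc (as in the Spark shell); the HDFS path is left elided as on the slides:

val lines  = sc.textFile("hdfs://...")           // base RDD: one element per line
val errors = lines.filter(_.contains("ERROR"))   // transformation (lazy)
val fields = errors.map(_.split("\t"))           // another transformation

fields.persist()          // keep the computed RDD in cluster memory
val n = fields.count()    // action: actually runs the computation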
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda x: x.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()

messages.filter(lambda x: "foo" in x).count()             # action
messages.filter(lambda x: "bar" in x).count()
. . .

(Diagram: the driver ships tasks to workers; each worker reads its input block once and keeps the resulting partition in its cache.)

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Fault Tolerance
RDDs track lineage info to rebuild lost data:

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda (type, count): count > 10)

Lineage: input file → map → reduceByKey → filter; a lost partition is recomputed by replaying these steps on its input.
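A small Scala sketch of inspecting lineage; the line format (tab-separated with a type field) is an assumption for illustration. RDD.toDebugString prints the chain of transformations Spark would replay to rebuild lost partitions:

val counts = sc.textFile("hdfs://...")
  .map(line => (line.split("\t")(0), 1))      // (type, 1) pairs
  .reduceByKey(_ + _)                         // count per type
  .filter { case (_, count) => count > 10 }

// Shows the lineage graph: textFile -> map -> reduceByKey -> filter
println(counts.toDebugString)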
Example: Logistic Regression
Running-time chart: 110 s / iteration for the baseline; with Spark, first iteration 80 s, further iterations 1 s.
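A hedged sketch of why iterations get cheap: a simplified gradient-descent logistic regression in Scala. The input path, line format ("label f1 f2 ..."), feature count, and step size are assumptions for this sketch, not the benchmark code. The data is parsed once and cached, so every iteration after the first reads memory rather than HDFS:

val numFeatures = 10
val stepSize = 0.1

val points = sc.textFile("hdfs://...").map { line =>
  val parts = line.split(" ").map(_.toDouble)
  (parts(0), parts.drop(1))          // (label, feature array)
}.cache()                            // parsed once, kept in memory

var w = Array.fill(numFeatures)(0.0)
for (_ <- 1 to 10) {
  // Iterations after the first reuse the cached RDD instead of re-reading HDFS.
  val gradient = points.map { case (y, x) =>
    val margin = y * x.zip(w).map { case (xi, wi) => xi * wi }.sum
    val scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
    x.map(_ * scale)
  }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
  w = w.zip(gradient).map { case (wi, gi) => wi - stepSize * gi }
}
println(w.mkString(" "))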
Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();
Supported Operators
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
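A small example chaining a few of these operators in Scala; the datasets (user and order pairs) are made up for illustration:

// Hypothetical datasets: (userId, name) and (userId, amount) pairs.
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
val orders = sc.parallelize(Seq((1, 10.0), (1, 25.0), (3, 5.0)))

val totals = orders.reduceByKey(_ + _)       // total spend per user
val report = users.leftOuterJoin(totals)     // keep users with no orders
  .map { case (id, (name, total)) => (name, total.getOrElse(0.0)) }
  .sortByKey()                               // alphabetical by name

report.take(10).foreach(println)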
Spark Community
• One of the largest open source projects in big data
• 150+ developers contributing
• 30+ companies contributing
(Chart: contributors in the past year)
Community Growth
• Spark 0.6 (Oct ’12): 17 contributors
• Spark 0.7 (Feb ’13): 31 contributors
• Spark 0.8 (Sept ’13): 67 contributors
• Spark 0.9 (Feb ’14): 83 contributors
This Talk
• Spark introduction & use cases
• Other stack projects
• The power of unification
• Demo
Shark: Hive on Spark
• Columnar SQL analytics engine
  • Both SQL and complex analytics
  • Up to 100× faster than Hive
• Compatible with Apache Hive
  • HiveQL, UDFs, SerDes, scripts
  • Existing Hive warehouses
• In use at Yahoo! for BI
Spark Integration
• Unified system for SQL, graphs, machine learning
• All share the same set of workers and caches
Spark Streaming
• Stateful, fault-tolerant stream processing with the same API as batch jobs

sc.twitterStream(...)
  .flatMap(tweet => tweet.text.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)
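The snippet above is slide shorthand (twitterStream and a "5s" window string). A minimal sketch with the concrete Spark Streaming API, using a socket source instead of Twitter; the host, port, and batch interval are assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))       // 1-second batches

val words = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// Count words over a sliding 5-second window; same code shape as a batch job.
val counts = words.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(5))
counts.print()

ssc.start()
ssc.awaitTermination()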
MLlib
• Built-in library of machine learning algorithms
  • K-means clustering
  • Alternating least squares
  • Linear regression (with L1 / L2 reg.)
  • Logistic regression (with L1 / L2 reg.)
  • Naïve Bayes

val points = sc.textFile(...).map(parsePoint)
val model = KMeans.train(points, 10)
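A slightly fuller sketch around the KMeans.train call above, assuming a later MLlib (1.x) where points are mllib.linalg Vectors; the input path and space-separated line format are assumptions:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each line "f1 f2 f3 ..." into a dense feature vector.
val points = sc.textFile("hdfs://...")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()

// Cluster into 10 groups with up to 20 iterations.
val model = KMeans.train(points, 10, 20)

println(model.clusterCenters.mkString("\n"))
val assignments = points.map(p => model.predict(p))   // nearest center per point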
Others
• GraphX: Pregel-like graph processing and algorithm library, integrated directly in Spark
• BlinkDB: approximate queries for Shark
• SparkR: R API and library
This Talk
• Spark introduction & use cases
• Other stack projects
• The power of unification
• Demo
Big Data Systems Today
• MapReduce: general batch processing
• Specialized systems for iterative, interactive and streaming apps: Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4, …
Spark’s Approach
• Instead of specializing, generalize MapReduce to support new apps in the same engine
• Two changes (general task DAG & data sharing) are enough to express previous models!
• Unification has big benefits
  • For the engine
  • For users
(Diagram: Shark, Streaming, GraphX, MLbase, … all running on Spark)
Code Size
(Chart: non-test, non-example source lines for Spark Streaming, Shark*, and GraphX compared with standalone systems; * Shark also calls into Hive)
Performance
(Charts: performance comparisons for streaming, SQL, and graph workloads)
What it Means for Users
• Separate frameworks: each stage (ETL, train, query) is its own job, with an HDFS read and HDFS write around every step
• Spark: one HDFS read, then ETL, train, and query run in the same engine, with interactive analysis on the in-memory data
Combining Processing Types

val points = sc.runSql[Double, Double](
  "select latitude, longitude from historic_tweets")
val model = KMeans.train(points, 10)
sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)
Get Started
• Visit spark.apache.org for videos & tutorials
• Download the Spark bundle for CDH
• Easy to run on just your laptop
• Free training talks and hands-on exercises: spark-summit.org
Conclusion
• Big data analytics is evolving to include:
  • More complex analytics (e.g. machine learning)
  • More interactive ad-hoc queries
  • More real-time stream processing
• Spark is a fast platform that unifies these apps
• Join us at Spark Summit 2014! June 30 - July 2, San Francisco