Introduction to Apache Spark
Matei Zaharia, Pat McDonough
spark.apache.org
What is Apache Spark?
• Fast and general cluster computing system, interoperable with Hadoop
• Improves efficiency through:
  • In-memory computing primitives
  • General computation graphs
  • Up to 100× faster (2-10× on disk)
• Improves usability through:
  • Rich APIs in Scala, Java, Python
  • Interactive shell
  • 2-5× less code
Project History
• Started in 2009, open sourced in 2010
• 30+ companies now contributing code
  • Databricks, Yahoo!, Intel, Adobe, Cloudera, Bizo, …
• One of the largest communities in big data
A General Stack
Spark is the common engine under a set of higher-level libraries:
• Shark (SQL)
• Spark Streaming (real-time)
• GraphX (graph)
• MLlib (machine learning)
• …
This Talk
• Spark introduction & use cases
• Other stack projects
• The power of unification
• Demo
Why a New Programming Model?
• MapReduce greatly simplified big data analysis
• But once started, users wanted more:
  • More complex, multi-pass analytics (e.g. ML, graph)
  • More interactive ad-hoc queries
  • More real-time stream processing
• All 3 need faster data sharing in parallel apps
Data Sharing in MapReduce
• Iterative jobs: each iteration reads its input from HDFS and writes its output back to HDFS (HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → …)
• Interactive queries: every query re-reads the same input from HDFS to produce its result
• Slow due to replication, serialization, and disk IO
What We’d Like
• Iterative jobs: read the input once, keep intermediate results (iter. 1, iter. 2, …) in distributed memory
• Interactive queries: one-time processing loads the input into distributed memory, then queries 1, 2, 3 run against it
• 10-100× faster than network and disk
Spark Model
• Write programs in terms of transformations on distributed datasets
• Resilient Distributed Datasets (RDDs)
  • Collections of objects that can be stored in memory or on disk across a cluster
  • Built via parallel transformations (map, filter, …)
  • Automatically rebuilt on failure
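A minimal Scala sketch of this model, assuming a SparkContext named sc (as in the Spark shell); the HDFS path is left elided as on the slides:

val lines  = sc.textFile("hdfs://...")           // base RDD: one element per line
val errors = lines.filter(_.contains("ERROR"))   // transformation (lazy)
val fields = errors.map(_.split("\t"))           // another transformation

fields.persist()          // keep the computed RDD in cluster memory
val n = fields.count()    // action: actually runs the computation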
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda x: x.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()

messages.filter(lambda x: "foo" in x).count()             # action
messages.filter(lambda x: "bar" in x).count()
. . .

(Diagram: the driver ships tasks to workers; each worker reads its input block once and keeps the resulting partition in its cache.)

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Fault Tolerance
RDDs track lineage info to rebuild lost data:

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda (type, count): count > 10)

Lineage: input file → map → reduceByKey → filter; a lost partition is recomputed by replaying these steps on its input.
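A small Scala sketch of inspecting lineage; the line format (tab-separated with a type field) is an assumption for illustration. RDD.toDebugString prints the chain of transformations Spark would replay to rebuild lost partitions:

val counts = sc.textFile("hdfs://...")
  .map(line => (line.split("\t")(0), 1))      // (type, 1) pairs
  .reduceByKey(_ + _)                         // count per type
  .filter { case (_, count) => count > 10 }

// Shows the lineage graph: textFile -> map -> reduceByKey -> filter
println(counts.toDebugString)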
Example: Logistic Regression
Running-time chart: 110 s / iteration for the baseline; with Spark, first iteration 80 s, further iterations 1 s.
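A hedged sketch of why iterations get cheap: a simplified gradient-descent logistic regression in Scala. The input path, line format ("label f1 f2 ..."), feature count, and step size are assumptions for this sketch, not the benchmark code. The data is parsed once and cached, so every iteration after the first reads memory rather than HDFS:

val numFeatures = 10
val stepSize = 0.1

val points = sc.textFile("hdfs://...").map { line =>
  val parts = line.split(" ").map(_.toDouble)
  (parts(0), parts.drop(1))          // (label, feature array)
}.cache()                            // parsed once, kept in memory

var w = Array.fill(numFeatures)(0.0)
for (_ <- 1 to 10) {
  // Iterations after the first reuse the cached RDD instead of re-reading HDFS.
  val gradient = points.map { case (y, x) =>
    val margin = y * x.zip(w).map { case (xi, wi) => xi * wi }.sum
    val scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
    x.map(_ * scale)
  }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
  w = w.zip(gradient).map { case (wi, gi) => wi - stepSize * gi }
}
println(w.mkString(" "))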
Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();
Supported Operators
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
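A small example chaining a few of these operators in Scala; the datasets (user and order pairs) are made up for illustration:

// Hypothetical datasets: (userId, name) and (userId, amount) pairs.
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
val orders = sc.parallelize(Seq((1, 10.0), (1, 25.0), (3, 5.0)))

val totals = orders.reduceByKey(_ + _)       // total spend per user
val report = users.leftOuterJoin(totals)     // keep users with no orders
  .map { case (id, (name, total)) => (name, total.getOrElse(0.0)) }
  .sortByKey()                               // alphabetical by name

report.take(10).foreach(println)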
Spark Community
• One of the largest open source projects in big data
• 150+ developers contributing
• 30+ companies contributing
(Chart: contributors in the past year)
Community Growth
• Spark 0.6 (Oct ’12): 17 contributors
• Spark 0.7 (Feb ’13): 31 contributors
• Spark 0.8 (Sept ’13): 67 contributors
• Spark 0.9 (Feb ’14): 83 contributors
This Talk
• Spark introduction & use cases
• Other stack projects
• The power of unification
• Demo
Shark: Hive on Spark
• Columnar SQL analytics engine
  • Both SQL and complex analytics
  • Up to 100× faster than Hive
• Compatible with Apache Hive
  • HiveQL, UDFs, SerDes, scripts
  • Existing Hive warehouses
• In use at Yahoo! for BI
Spark Integration
• Unified system for SQL, graphs, machine learning
• All share the same set of workers and caches
Spark Streaming
• Stateful, fault-tolerant stream processing with the same API as batch jobs

sc.twitterStream(...)
  .flatMap(tweet => tweet.text.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)
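The snippet above is slide shorthand (twitterStream and a "5s" window string). A minimal sketch with the concrete Spark Streaming API, using a socket source instead of Twitter; the host, port, and batch interval are assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))       // 1-second batches

val words = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// Count words over a sliding 5-second window; same code shape as a batch job.
val counts = words.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(5))
counts.print()

ssc.start()
ssc.awaitTermination()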
MLlib
• Built-in library of machine learning algorithms
  • K-means clustering
  • Alternating least squares
  • Linear regression (with L1 / L2 reg.)
  • Logistic regression (with L1 / L2 reg.)
  • Naïve Bayes

val points = sc.textFile(...).map(parsePoint)
val model = KMeans.train(points, 10)
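A slightly fuller sketch around the KMeans.train call above, assuming a later MLlib (1.x) where points are mllib.linalg Vectors; the input path and space-separated line format are assumptions:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each line "f1 f2 f3 ..." into a dense feature vector.
val points = sc.textFile("hdfs://...")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()

// Cluster into 10 groups with up to 20 iterations.
val model = KMeans.train(points, 10, 20)

println(model.clusterCenters.mkString("\n"))
val assignments = points.map(p => model.predict(p))   // nearest center per point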
Others
• GraphX: Pregel-like graph processing and algorithm library, integrated directly in Spark
• BlinkDB: approximate queries for Shark
• SparkR: R API and library
This Talk
• Spark introduction & use cases
• Other stack projects
• The power of unification
• Demo
Big Data Systems Today
• MapReduce: general batch processing
• Specialized systems for iterative, interactive and streaming apps: Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4, …
Spark’s Approach
• Instead of specializing, generalize MapReduce to support new apps in the same engine
• Two changes (general task DAG & data sharing) are enough to express previous models!
• Unification has big benefits
  • For the engine
  • For users
(Diagram: Shark, Streaming, GraphX, MLbase, … all running on Spark)
Code Size
(Chart: non-test, non-example source lines for Spark Streaming, Shark*, and GraphX compared with standalone systems; * Shark also calls into Hive)
Performance
(Charts: performance comparisons for streaming, SQL, and graph workloads)
What it Means for Users
• Separate frameworks: each stage (ETL, train, query) is its own job, with an HDFS read and HDFS write around every step
• Spark: one HDFS read, then ETL, train, and query run in the same engine, with interactive analysis on the in-memory data
Combining Processing Types

val points = sc.runSql[Double, Double](
  "select latitude, longitude from historic_tweets")
val model = KMeans.train(points, 10)
sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)
Get Started
• Visit spark.apache.org for videos & tutorials
• Download the Spark bundle for CDH
• Easy to run on just your laptop
• Free training talks and hands-on exercises: spark-summit.org
Conclusion
• Big data analytics is evolving to include:
  • More complex analytics (e.g. machine learning)
  • More interactive ad-hoc queries
  • More real-time stream processing
• Spark is a fast platform that unifies these apps
• Join us at Spark Summit 2014! June 30 - July 2, San Francisco