Introduction to Spark • Matei Zaharia • spark-project.org
What is Spark? • Fast and expressive cluster computing system interoperable with Apache Hadoop • Improves efficiency through: • In-memory computing primitives • General computation graphs (up to 100× faster; 2-10× on disk) • Improves usability through: • Rich APIs in Scala, Java, Python • Interactive shell (often 5× less code)
Project History • Started in 2009, open sourced 2010 • 25 companies now contributing code • Yahoo!, Intel, Adobe, Quantifind, Conviva, Bizo, … • Entered Apache incubator in June
A General Stack • Spark is the basis of a wide set of projects in the Berkeley Data Analytics Stack (BDAS): Shark (SQL), Spark Streaming (real-time), GraphX (graph), MLbase (machine learning), and more, all running on Spark • More details: amplab.berkeley.edu
This Talk • Spark introduction & use cases • Implementation • Other stack projects • The power of unification
Why a New Programming Model? • MapReduce greatly simplified big data analysis • But as soon as it got popular, users wanted more: • More complex, multi-pass analytics (e.g. ML, graph) • More interactive ad-hoc queries • More real-time stream processing • All 3 need faster data sharing across parallel jobs
Data Sharing in MapReduce [Diagram: each iteration and each ad-hoc query does its own HDFS read and HDFS write between steps] • Slow due to replication, serialization, and disk I/O
Data Sharing in Spark [Diagram: after one-time processing, iterations and queries share the input through distributed memory] • 10-100× faster than network and disk
Spark Programming Model • Key idea: resilient distributed datasets (RDDs) • Distributed collections of objects that can be cached in memory across cluster • Manipulated through parallel operators • Automatically recomputed on failure • Programming interface • Functional APIs in Scala, Java, Python • Interactive use from Scala shell
Example: Log Mining • Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda x: x.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()

messages.filter(lambda x: "foo" in x).count()   # action: triggers computation
messages.filter(lambda x: "bar" in x).count()   # served from the cache

[Diagram: the driver ships tasks to workers; each worker reads its block of the file once and caches its partition of messages in memory]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Fault Tolerance • RDDs track lineage info to rebuild lost data:

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda pair: pair[1] > 10)

[Lineage graph: input file → map → reduceByKey → filter; a lost partition is rebuilt by rerunning these steps on its inputs]
Example: Logistic Regression [Chart: Hadoop, 110 s per iteration; Spark, 80 s for the first iteration and 1 s for each further iteration once the data is cached]
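The pattern behind those numbers: parse and cache the dataset once, then run each gradient step as a parallel map-and-reduce over memory. A minimal sketch of that pattern; the input path, feature count, and iteration count are placeholders, and the input is assumed to be space-separated "label f1 f2 ..." lines:

import org.apache.spark.SparkContext
import scala.math.exp

object LogisticRegression {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "LogisticRegression")

    // Parse once and cache: every later iteration reads from memory, not HDFS.
    val points = sc.textFile("hdfs://.../points.txt").map { line =>
      val nums = line.split(" ").map(_.toDouble)
      (nums.head, nums.tail) // (label, feature vector)
    }.cache()

    val numFeatures = 10 // assumed dimensionality
    var w = Array.fill(numFeatures)(0.0)
    for (_ <- 1 to 10) {
      // One iteration = one parallel map + reduce over the cached RDD.
      val gradient = points.map { case (y, x) =>
        val dot = w.zip(x).map { case (wi, xi) => wi * xi }.sum
        val scale = (1.0 / (1.0 + exp(-y * dot)) - 1.0) * y
        x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }
    println("Final weights: " + w.mkString(", "))
    sc.stop()
  }
}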
Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();
Supported Operators • map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin • reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip • sample • take • first • partitionBy • mapWith • pipe • save • ...
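A few of these in action, assuming an existing SparkContext `sc` (the data is made up):

val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val visits = sc.parallelize(Seq((1, "home"), (1, "cart"), (3, "search")))

users.join(visits).collect()          // inner join: only key 1 survives
users.leftOuterJoin(visits).collect() // key 2 kept, with None on the visits side
visits.map { case (_, page) => (page, 1) }
      .reduceByKey(_ + _)
      .collect()                      // per-page counts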
Spark Community • 1200+ meetup members • 90+ devs contributing • 25 companies contributing
This Talk • Spark introduction & use cases • Implementation • Other stack projects • The power of unification
Software Components • Spark client is a library in your program (one instance per app) • Runs tasks locally or on a cluster • Mesos, YARN, or standalone mode • Accesses storage systems via the Hadoop InputFormat API • Can use HBase, HDFS, S3, … [Diagram: your application holds a SparkContext that talks to a cluster manager or local threads; workers run Spark executors against HDFS or other storage]
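For example, a driver program just instantiates the client library; a minimal sketch, where the master URL and paths are placeholders:

import org.apache.spark.SparkContext

// The SparkContext is the client inside your application: it runs tasks
// locally or hands them to a cluster manager depending on the master URL
// ("local[*]", "spark://host:7077", or a Mesos/YARN URL).
val sc = new SparkContext("local[*]", "MyApp")

// Storage goes through the Hadoop InputFormat API, so any Hadoop-readable
// source works here (HDFS, S3, HBase, ...).
val data = sc.textFile("hdfs://namenode/path/to/data")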
Task Scheduler • General task graphs • Automatically pipelines functions • Data locality aware • Partitioning aware to avoid shuffles [Diagram: an RDD graph split into stages at shuffle boundaries (groupBy, join); narrow operations like map and filter are pipelined within a stage, and stages over already-cached partitions are skipped]
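To make the stage boundaries concrete, a small sketch (paths and names invented): narrow operations are pipelined inside one stage, while a key-based operation forces a shuffle into a new stage.

// Stage 1: flatMap and filter are pipelined, record at a time,
// with no intermediate materialization.
val words = sc.textFile("hdfs://.../input.txt")
  .flatMap(_.split(" "))
  .filter(_.nonEmpty)

// reduceByKey needs all values for a key on one node, so the scheduler
// inserts a shuffle here and starts a second stage.
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// If `counts` is cached, later jobs that reuse it skip the earlier stages.
counts.cache()
counts.filter { case (_, n) => n > 100 }.collect()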
Advanced Features • Controllable partitioning • Speed up joins against a dataset • Controllable storage formats • Keep data serialized for efficiency, replicate to multiple nodes, cache on disk • Shared variables: broadcasts, accumulators • See online docs for details!
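A sketch of the shared variables, assuming an existing SparkContext `sc`; the data is invented and the accumulator call is the original 0.x-era API:

// Broadcast: ship a read-only lookup table to every worker once,
// instead of once per task.
val countries = sc.broadcast(Map("us" -> "United States", "de" -> "Germany"))

// Accumulator: workers add to it; only the driver reads the total.
val unknown = sc.accumulator(0)

val codes = sc.parallelize(Seq("us", "de", "fr"))
val names = codes.map { c =>
  if (countries.value.contains(c)) countries.value(c)
  else { unknown += 1; "unknown" }
}
names.collect()
println("unknown codes: " + unknown.value)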
This Talk • Spark introduction & use cases • Implementation • Other stack projects • The power of unification
Shark: Hive on Spark • Columnar SQL analytics engine for Spark • Supports both SQL and complex analytics • Up to 100× faster than Apache Hive • Compatible with Apache Hive • HiveQL, UDF/UDAF, SerDes, Scripts • Runs on existing Hive warehouses • In use at Yahoo! for fast in-memory OLAP
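From the Scala side, the Shark paper describes a sql2rdd call that runs a HiveQL string and returns the result as an RDD, so SQL output can flow into regular Spark code. A hedged sketch; the context variable, the `logs` table, and its columns are hypothetical:

// Assuming an already-constructed SharkContext `sharkCtx`.
val rows = sharkCtx.sql2rdd("SELECT status, COUNT(*) FROM logs GROUP BY status")
println(rows.count())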
Hive Architecture [Diagram: clients (CLI, JDBC) connect to a driver containing SQL parser, query optimizer, and physical plan; execution runs on MapReduce over HDFS, with a metastore alongside the driver]
Shark Architecture [Diagram: same structure as Hive, with a cache manager added to the driver and execution running on Spark instead of MapReduce] [Engle et al, SIGMOD 2012]
Performance [Chart: query runtimes on 1.7 TB of warehouse data on 100 EC2 nodes]
Spark Integration • Unified system for SQL, graph processing, machine learning • All share the same set of workers and caches
Other Stack Projects • Spark Streaming: stateful, fault-tolerant stream processing (out since Spark 0.7)

sc.twitterStream(...)
  .flatMap(_.getText.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

• MLlib: library of high-quality machine learning algorithms (out since 0.8)
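MLlib's entry points at the time were static train methods; a sketch assuming the 0.8-era API, where LabeledPoint took a plain Array[Double] (later releases switched to a Vector type), with a made-up input path:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// Parse "label f1 f2 ..." lines into labeled points.
val training = sc.textFile("hdfs://.../train.txt").map { line =>
  val nums = line.split(" ").map(_.toDouble)
  LabeledPoint(nums.head, nums.tail)
}

// 20 iterations of SGD; the returned model scores new feature vectors.
val model = LogisticRegressionWithSGD.train(training, 20)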
This Talk • Spark introduction & use cases • Implementation • Other stack projects • The power of unification
The Power of Unification • Spark’s speed and programmability are great, but what makes the project unique? • Unification: multiple programming models (SQL, stream, graph, …) on the same engine • This has powerful benefits: • For the engine • For users [Diagram: Shark, Streaming, GraphX, MLbase, … all on Spark]
Code Size [Chart: non-test, non-example source lines for Shark, Streaming, and GraphX built on top of Spark]
Performance [Charts: performance comparisons for SQL, streaming, and graph workloads]
Performance • One benefit of unification is that optimizations we make for one project affect all the others! • E.g. scheduler optimizations in Spark Streaming (0.6) resulted in 2× faster Shark queries
What it Means for Users • Separate frameworks: each step (ETL, train, query) does its own HDFS read and HDFS write • Spark: one HDFS read, then ETL, training, and queries share data in memory
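In code, the unified pipeline is one program over one cached dataset; a sketch where the record type and parser are hypothetical stand-ins:

// Hypothetical record type and parser, just to make the pipeline concrete.
case class LogRecord(user: String, isError: Boolean)
def parse(line: String): LogRecord = {
  val parts = line.split("\t")
  LogRecord(parts(0), parts(1) == "ERROR")
}

// ETL once and cache: one HDFS read instead of a read/write per framework.
val records = sc.textFile("hdfs://.../raw").map(parse).cache()

// Training input and ad-hoc queries then share the same in-memory data.
val errorRate = records.filter(_.isError).count().toDouble / records.count()
val topUsers  = records.map(r => (r.user, 1)).reduceByKey(_ + _).collect()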
Getting Started • Visit spark-project.org for videos & tutorials • Easy to run in local mode, private clusters, EC2 • Built-in ML library since Spark 0.8 • Conference December 2nd:www.spark-summit.org
Conclusion • Big data analytics is evolving to include: • More complex analytics (e.g. machine learning) • More interactive ad-hoc queries • More real-time stream processing • Spark is a fast platform that unifies these apps • More info: spark-project.org