Introduction to Spark Tangzk 2014/4/8
Outline • BDAS(the Berkeley Data Analytics Stack) • Spark • Other Components based on Spark • Introduction to Scala • Spark Programming Practices • Debugging and Testing Spark Programs • Learning Spark • Why are Previous MapReduce-Based Systems Slow? • Conclusion
Spark • Master/slave architecture • In-memory computing platform • Resilient Distributed Datasets (RDDs) abstraction • DAG execution engine • Fault recovery using lineage • Supports interactive data mining • Coexists with the Hadoop ecosystem
Spark – In-Memory Computing • Hadoop: two-stage MapReduce topology • Spark: DAG execution topology
Spark – RDD (Resilient Distributed Datasets) • Data computing and storage abstraction • Records organized by partition • Immutable (read-only): can only be created through transformations • Moves computation to the data instead of moving the data • Coarse-grained programming interfaces • Partition reconstruction using lineage
Spark-RDD: Coarse-grained Programming Interfaces • Transformations: define one or more new RDDs from existing ones. • Actions: return a value to the driver program. • Transformations are computed lazily; nothing runs until an action is called.
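The transformation/action split is easiest to see in code. A minimal stand-in using a plain Scala lazy `view` (no Spark dependency; a real program would use RDD `map`/`filter` and `collect`) shows that nothing evaluates until the pipeline is forced:

```scala
object LazyDemo {
  def main(args: Array[String]): Unit = {
    var evaluations = 0

    // "Transformations": only a pipeline is described, nothing is computed yet.
    val pipeline = (1 to 10).view
      .map { x => evaluations += 1; x * 2 } // like rdd.map(...)
      .filter(_ % 4 == 0)                   // like rdd.filter(...)

    assert(evaluations == 0)                // still lazy, like an RDD before an action

    // "Action": forcing the view triggers the whole chain, like collect().
    val result = pipeline.toList
    assert(result == List(4, 8, 12, 16, 20))
    assert(evaluations == 10)               // every input passed through the map
    println(result.mkString(","))
  }
}
```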
Spark-RDD: Lineage graph • Spark example: return the time fields of web GET accesses by "66.220.128.123" (the time field is number 3 in a tab-separated format) val logs = spark.textFile("hdfs://...") val accessesByIp = logs.filter(_.startsWith("66.220.128.123")) accessesByIp.persist() accessesByIp.filter(_.contains("GET")) .map(_.split('\t')(3)) .collect()
Spark – Installation • Running Spark • Local mode • Standalone mode • Cluster mode: on YARN/Mesos • Supported programming languages • Scala • Java • Python • Spark interactive shell
Other Components(1) – Shark • SQL and rich analytics at scale • Partial DAG Execution (PDE) optimizes query plans at runtime • Runs SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop
Other Components(2) – Spark Streaming • Scalable, fault-tolerant stream processing framework • Discretized streams abstraction: splits a continuous dataflow into batches of input data
Other Components(3) – MLBase • Distributed machine learning library
Other Components(4) – GraphX • Distributed graph computing framework • RDG (Resilient Distributed Graph) abstraction • Supports GraphLab's Gather-Apply-Scatter (GAS) model
GAS model - GraphLab • [Figure: GAS execution across four machines. Gather: mirror vertices collect information from neighbors and send partial sums Σ1…Σ4 to the master; Apply: the master runs the vertex update Y → Y'; Scatter: Y' is pushed back to the mirrors.] From Jiang Wenrui's thesis defense
Introduction to Scala(1) • Runs on the JVM (and .NET) • Full interoperability with Java • Statically typed • Object-oriented • Functional programming
Introduction to Scala(2) • Declare a list of integers • val ints = List(1,2,4,5,7,3) • Declare a function, cube, that computes the cube of an Int • def cube(a: Int): Int = a * a * a • Apply the cube function to the list • val cubes = ints.map(x => cube(x)) • Sum the cubes of the integers • cubes.reduce((x1,x2) => x1+x2) • Define a factorial function that computes n! • def fact(n: Int): Int = { if (n == 0) 1 else n * fact(n-1) }
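The fragments on this slide, assembled into one small runnable program (same values as above):

```scala
object ScalaIntro {
  // Cube of an Int.
  def cube(a: Int): Int = a * a * a

  // Recursive factorial: n!, with 0! == 1.
  def fact(n: Int): Int = if (n == 0) 1 else n * fact(n - 1)

  def main(args: Array[String]): Unit = {
    val ints = List(1, 2, 4, 5, 7, 3)
    val cubes = ints.map(x => cube(x))
    val sumOfCubes = cubes.reduce((x1, x2) => x1 + x2)

    assert(cubes == List(1, 8, 64, 125, 343, 27))
    assert(sumOfCubes == 568)   // 1 + 8 + 64 + 125 + 343 + 27
    assert(fact(5) == 120)
    println(sumOfCubes)
  }
}
```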
Spark in Practices(1) • "Hello World" word count (interactive shell) • val textFile = sc.textFile("hdfs://…") textFile.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey((a, b) => a + b) • "Hello World" word count (standalone app)
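Without a cluster, the same word-count logic can be checked with plain Scala collections; `groupBy` plus a per-group sum plays the role of `reduceByKey` here (a local stand-in, not the Spark API, and the input lines are made up):

```scala
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or", "not to be")

    // Same shape as the Spark pipeline: flatMap -> map -> reduce by key.
    val counts: Map[String, Int] = lines
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

    assert(counts("to") == 2)
    assert(counts("be") == 2)
    assert(counts("or") == 1)
    assert(counts("not") == 1)
    println(counts)
  }
}
```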
Spark in Practices(2) • PageRank in Spark
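A sketch of the algorithm the Spark PageRank example implements, written over plain in-memory Scala collections so it runs without a cluster. The link graph and the 0.85 damping factor are illustrative assumptions; in Spark the contribution step would be a `flatMap` followed by `reduceByKey`:

```scala
object PageRankSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical link graph: page -> pages it links to.
    val links: Map[String, Seq[String]] = Map(
      "a" -> Seq("b", "c"),
      "b" -> Seq("c"),
      "c" -> Seq("a")
    )
    var ranks: Map[String, Double] = links.map { case (p, _) => (p, 1.0) }

    for (_ <- 1 to 20) {
      // Each page sends rank/outDegree to each of its neighbours.
      val contribs = links.toSeq.flatMap { case (page, outs) =>
        outs.map(dest => (dest, ranks(page) / outs.size))
      }
      // Sum contributions per page, then apply the damping formula.
      val summed = contribs.groupBy(_._1).map { case (p, cs) => (p, cs.map(_._2).sum) }
      ranks = ranks.map { case (p, _) => (p, 0.15 + 0.85 * summed.getOrElse(p, 0.0)) }
    }

    // With no dangling pages, total rank stays equal to the number of pages.
    assert(math.abs(ranks.values.sum - 3.0) < 1e-6)
    ranks.toSeq.sortBy(-_._2).foreach(println)
  }
}
```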
Debugging and Testing Spark Programs(1) • Running in local mode • sc = new SparkContext("local", name) • Debug in the IDE • Running in standalone/cluster mode • Job web GUI on ports 8080/4040 • log4j • jstack/jmap • dstat/iostat/lsof -p • Unit tests • Test in local mode
Debugging and Testing Spark Programs(2) • Sample join output: • RDDJoin2: (2,4) (1,2) • RDDJoin3: (1,(1,3)) (1,(2,3))
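Join output like the pairs above is easiest to sanity-check in local mode. The semantics of `rdd1.join(rdd2)` (inner join on the key, one output pair per matching combination) can be mimicked with plain Scala collections; the inputs here are hypothetical, not the slide's:

```scala
object JoinDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical (key, value) datasets; real code would call rdd1.join(rdd2).
    val left  = Seq((1, 1), (1, 2), (2, 4))
    val right = Seq((1, 3), (3, 9))

    // Inner join: emit (k, (v1, v2)) for every pair of records sharing key k.
    val joined = for {
      (k1, v1) <- left
      (k2, v2) <- right
      if k1 == k2
    } yield (k1, (v1, v2))

    assert(joined == Seq((1, (1, 3)), (1, (2, 3))))  // keys 2 and 3 have no partner
    println(joined.mkString(" "))
  }
}
```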
Debugging and Testing Spark Programs(3) • Tuning Spark • Replace large objects captured in lambda operators with broadcast variables. • Coalesce partitions to avoid large numbers of empty tasks after filtering operations. • Make good use of partitioning for data locality (mapPartitions). • Choose the partitioning key well to balance data. • Set spark.local.dir to a set of disks. • Take care with the number of reduce tasks. • Don't collect data to the driver; write to HDFS directly.
Learning Spark • Spark Quick Start, http://spark.apache.org/docs/latest/quick-start.html • Holden Karau, Fast Data Processing with Spark • Spark Docs, http://spark.apache.org/docs/latest/ • Spark Source code, https://github.com/apache/spark • Spark User Mailing list, http://spark.apache.org/mailing-lists.html
Why are Previous MapReduce-Based Systems Slow? • Conventional explanations: • expensive data materialization for fault tolerance, • inferior data layout (e.g., lack of indices), • costlier execution strategies. • Shark alleviates these by: • in-memory computing and storage, • partial DAG execution. • Factors examined in Shark's experiments: • intermediate outputs • data format and layout, via co-partitioning • execution strategies, optimized using PDE • task scheduling cost
Conclusion • Spark • In-memory computing platform for iterative and interactive tasks • RDD abstraction • Lineage reconstruction for fault recovery • A large number of components built on top of it • Spark programming • Think of an RDD like a vector • Functional programming • The Scala IDE is not yet strong enough • Lack of good tools for debugging and testing