Introduction to Spark Tangzk 2014/4/8
Outline • BDAS(the Berkeley Data Analytics Stack) • Spark • Other Components based on Spark • Introduction to Scala • Spark Programming Practices • Debugging and Testing Spark Programs • Learning Spark • Why are Previous MapReduce-Based Systems Slow? • Conclusion
Spark • Master/slave architecture • In-memory computing platform • Resilient Distributed Datasets (RDDs) abstraction • DAG execution engine • Fault recovery using lineage • Supports interactive data mining • Coexists with the Hadoop ecosystem
Spark – In-Memory Computing • Hadoop: two-stage MapReduce topology • Spark: DAG execution topology
Spark – RDD (Resilient Distributed Datasets) • Data computing and storage abstraction • Records organized by partition • Immutable (read-only): can only be created through transformations • Moves computation to the data instead of moving the data • Coarse-grained programming interfaces • Partition reconstruction using lineage
Spark-RDD: Coarse-grained Programming Interfaces • Transformations: define one or more new RDDs from existing ones. • Actions: return a value to the driver program. • Transformations are computed lazily; nothing runs until an action is called.
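The transformation/action split is easiest to see in code. A minimal stand-in using a plain Scala lazy `view` (no Spark dependency; a real program would use RDD `map`/`filter` and `collect`) shows that nothing evaluates until the pipeline is forced:

```scala
object LazyDemo {
  def main(args: Array[String]): Unit = {
    var evaluations = 0

    // "Transformations": only a pipeline is described, nothing is computed yet.
    val pipeline = (1 to 10).view
      .map { x => evaluations += 1; x * 2 } // like rdd.map(...)
      .filter(_ % 4 == 0)                   // like rdd.filter(...)

    assert(evaluations == 0)                // still lazy, like an RDD before an action

    // "Action": forcing the view triggers the whole chain, like collect().
    val result = pipeline.toList
    assert(result == List(4, 8, 12, 16, 20))
    assert(evaluations == 10)               // every input passed through the map
    println(result.mkString(","))
  }
}
```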
Spark-RDD: Lineage graph • Spark example: return the time fields of web GET accesses by "66.220.128.123" (the time field is number 3 in a tab-separated format) val logs = spark.textFile("hdfs://...") val accessesByIp = logs.filter(_.startsWith("66.220.128.123")) accessesByIp.persist() accessesByIp.filter(_.contains("GET")) .map(_.split('\t')(3)) .collect()
Spark – Installation • Running Spark • Local mode • Standalone mode • Cluster mode: on YARN/Mesos • Supported programming languages • Scala • Java • Python • Spark interactive shell
Other Components(1) – Shark • SQL and rich analytics at scale • Partial DAG Execution (PDE) optimizes query plans at runtime • Runs SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop
Other Components(2) – Spark Streaming • Scalable, fault-tolerant stream processing framework • Discretized streams abstraction: splits a continuous dataflow into batches of input data
Other Components(3) – MLBase • Distributed machine learning library
Other Components(4) – GraphX • Distributed graph computing framework • RDG (Resilient Distributed Graph) abstraction • Supports GraphLab's Gather-Apply-Scatter (GAS) model
GAS model - GraphLab • [Figure: GAS execution across four machines. Gather: mirror vertices collect information from neighbors and send partial sums Σ1…Σ4 to the master; Apply: the master runs the vertex update Y → Y'; Scatter: Y' is pushed back to the mirrors.] From Jiang Wenrui's thesis defense
Introduction to Scala(1) • Runs on the JVM (and .NET) • Full interoperability with Java • Statically typed • Object-oriented • Functional programming
Introduction to Scala(2) • Declare a list of integers • val ints = List(1,2,4,5,7,3) • Declare a function, cube, that computes the cube of an Int • def cube(a: Int): Int = a * a * a • Apply the cube function to the list • val cubes = ints.map(x => cube(x)) • Sum the cubes of the integers • cubes.reduce((x1,x2) => x1+x2) • Define a factorial function that computes n! • def fact(n: Int): Int = { if (n == 0) 1 else n * fact(n-1) }
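The fragments on this slide, assembled into one small runnable program (same values as above):

```scala
object ScalaIntro {
  // Cube of an Int.
  def cube(a: Int): Int = a * a * a

  // Recursive factorial: n!, with 0! == 1.
  def fact(n: Int): Int = if (n == 0) 1 else n * fact(n - 1)

  def main(args: Array[String]): Unit = {
    val ints = List(1, 2, 4, 5, 7, 3)
    val cubes = ints.map(x => cube(x))
    val sumOfCubes = cubes.reduce((x1, x2) => x1 + x2)

    assert(cubes == List(1, 8, 64, 125, 343, 27))
    assert(sumOfCubes == 568)   // 1 + 8 + 64 + 125 + 343 + 27
    assert(fact(5) == 120)
    println(sumOfCubes)
  }
}
```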
Spark in Practices(1) • "Hello World" word count (interactive shell) • val textFile = sc.textFile("hdfs://…") textFile.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey((a, b) => a + b) • "Hello World" word count (standalone app)
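Without a cluster, the same word-count logic can be checked with plain Scala collections; `groupBy` plus a per-group sum plays the role of `reduceByKey` here (a local stand-in, not the Spark API, and the input lines are made up):

```scala
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or", "not to be")

    // Same shape as the Spark pipeline: flatMap -> map -> reduce by key.
    val counts: Map[String, Int] = lines
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

    assert(counts("to") == 2)
    assert(counts("be") == 2)
    assert(counts("or") == 1)
    assert(counts("not") == 1)
    println(counts)
  }
}
```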
Spark in Practices(2) • PageRank in Spark
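A sketch of the algorithm the Spark PageRank example implements, written over plain in-memory Scala collections so it runs without a cluster. The link graph and the 0.85 damping factor are illustrative assumptions; in Spark the contribution step would be a `flatMap` followed by `reduceByKey`:

```scala
object PageRankSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical link graph: page -> pages it links to.
    val links: Map[String, Seq[String]] = Map(
      "a" -> Seq("b", "c"),
      "b" -> Seq("c"),
      "c" -> Seq("a")
    )
    var ranks: Map[String, Double] = links.map { case (p, _) => (p, 1.0) }

    for (_ <- 1 to 20) {
      // Each page sends rank/outDegree to each of its neighbours.
      val contribs = links.toSeq.flatMap { case (page, outs) =>
        outs.map(dest => (dest, ranks(page) / outs.size))
      }
      // Sum contributions per page, then apply the damping formula.
      val summed = contribs.groupBy(_._1).map { case (p, cs) => (p, cs.map(_._2).sum) }
      ranks = ranks.map { case (p, _) => (p, 0.15 + 0.85 * summed.getOrElse(p, 0.0)) }
    }

    // With no dangling pages, total rank stays equal to the number of pages.
    assert(math.abs(ranks.values.sum - 3.0) < 1e-6)
    ranks.toSeq.sortBy(-_._2).foreach(println)
  }
}
```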
Debugging and Testing Spark Programs(1) • Running in local mode • sc = new SparkContext("local", name) • Debug in the IDE • Running in standalone/cluster mode • Job web GUI on ports 8080/4040 • log4j • jstack/jmap • dstat/iostat/lsof -p • Unit tests • Test in local mode
Debugging and Testing Spark Programs(2) • Sample join output: • RDDJoin2: (2,4) (1,2) • RDDJoin3: (1,(1,3)) (1,(2,3))
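Join output like the pairs above is easiest to sanity-check in local mode. The semantics of `rdd1.join(rdd2)` (inner join on the key, one output pair per matching combination) can be mimicked with plain Scala collections; the inputs here are hypothetical, not the slide's:

```scala
object JoinDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical (key, value) datasets; real code would call rdd1.join(rdd2).
    val left  = Seq((1, 1), (1, 2), (2, 4))
    val right = Seq((1, 3), (3, 9))

    // Inner join: emit (k, (v1, v2)) for every pair of records sharing key k.
    val joined = for {
      (k1, v1) <- left
      (k2, v2) <- right
      if k1 == k2
    } yield (k1, (v1, v2))

    assert(joined == Seq((1, (1, 3)), (1, (2, 3))))  // keys 2 and 3 have no partner
    println(joined.mkString(" "))
  }
}
```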
Debugging and Testing Spark Programs(3) • Tuning Spark • Replace large objects captured in lambda operators with broadcast variables. • Coalesce partitions to avoid large numbers of empty tasks after filtering operations. • Make good use of partitioning for data locality (mapPartitions). • Choose the partitioning key well to balance data. • Set spark.local.dir to a set of disks. • Take care with the number of reduce tasks. • Don't collect data to the driver; write to HDFS directly.
Learning Spark • Spark Quick Start, http://spark.apache.org/docs/latest/quick-start.html • Holden Karau, Fast Data Processing with Spark • Spark Docs, http://spark.apache.org/docs/latest/ • Spark Source code, https://github.com/apache/spark • Spark User Mailing list, http://spark.apache.org/mailing-lists.html
Why are Previous MapReduce-Based Systems Slow? • Conventional explanations: • expensive data materialization for fault tolerance, • inferior data layout (e.g., lack of indices), • costlier execution strategies. • Shark alleviates these by: • in-memory computing and storage, • partial DAG execution. • Factors examined in Shark's experiments: • intermediate outputs • data format and layout, via co-partitioning • execution strategies, optimized using PDE • task scheduling cost
Conclusion • Spark • In-memory computing platform for iterative and interactive tasks • RDD abstraction • Lineage reconstruction for fault recovery • A large number of components built on top of it • Spark programming • Think of an RDD like a vector • Functional programming • The Scala IDE is not yet strong enough • Lack of good tools for debugging and testing