Introduction to Spark



  1. Introduction to Spark Tangzk 2014/4/8

  2. Outline • BDAS (the Berkeley Data Analytics Stack) • Spark • Other Components Based on Spark • Introduction to Scala • Spark Programming Practices • Debugging and Testing Spark Programs • Learning Spark • Why are Previous MapReduce-Based Systems Slow? • Conclusion

  3. BDAS (the Berkeley Data Analytics Stack)

  4. Spark • Master/slave architecture • In-memory computing platform • Resilient Distributed Datasets (RDDs) abstraction • DAG execution engine • Fault recovery using lineage • Supports interactive data mining • Integrates with the Hadoop ecosystem

  5. Spark – In-Memory Computing • Hadoop: two-stage MapReduce topology • Spark: DAG execution topology

  6. Spark – RDD (Resilient Distributed Datasets) • Data computing and storage abstraction • Records organized by partition • Immutable (read-only); created only through transformations • Moves computation to the data rather than data to the computation • Coarse-grained programming interfaces • Partition reconstruction using lineage

  7. Spark-RDD: Coarse-grained Programming Interfaces • Transformations: define one or more new RDDs from existing ones • Actions: compute and return a value to the driver • Lazy evaluation: transformations only build the lineage graph; nothing executes until an action is called (see the sketch below)
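
  A minimal sketch of the transformation/action split, with illustrative data (not from the slides):

      import org.apache.spark.SparkContext

      val sc = new SparkContext("local", "rdd-demo")
      val nums = sc.parallelize(1 to 10)          // create an RDD

      // Transformations are lazy: they only record lineage
      val evens   = nums.filter(_ % 2 == 0)
      val squares = evens.map(n => n * n)

      // The action triggers the actual computation
      val total = squares.reduce(_ + _)           // 4+16+36+64+100 = 220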

  8. Spark-RDD: Lineage graph • Spark example: return the time fields of web GET accesses by "66.220.128.123" (the time field is number 3 in a tab-separated format) logs = spark.textFile("hdfs://...") accessesByIp = logs.filter(_.startsWith("66.220.128.123")) accessesByIp.persist() accessesByIp.filter(_.contains("GET")) .map(_.split('\t')(3)) .collect()

  9. Spark – Installation • Running Spark • Local mode • Standalone mode • Cluster mode: on YARN/Mesos (see the master-URL sketch below) • Supported programming languages • Scala • Java • Python • Spark interactive shell
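
  A minimal sketch of selecting a deployment mode through the master URL (host and app names are illustrative):

      import org.apache.spark.SparkContext

      // Local mode: four worker threads in a single JVM
      val sc = new SparkContext("local[4]", "MyApp")

      // Standalone mode: connect to a cluster master instead
      // val sc = new SparkContext("spark://master:7077", "MyApp")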

  10. Other Components (1) – Shark • SQL and rich analytics at scale • Partial DAG Execution (PDE) optimizes query plans at runtime • Runs SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop

  11. Other Components (2) – Spark Streaming • Scalable, fault-tolerant stream computing framework • Discretized streams (DStreams) abstraction: separate the continuous dataflow into small batches of input data (see the sketch below)
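
  A minimal sketch of a discretized-stream word count in the 2014-era streaming API (host, port, and batch interval are assumptions):

      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.StreamingContext._

      // Chop a socket text stream into one-second batches
      val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
      val lines = ssc.socketTextStream("localhost", 9999)

      val counts = lines.flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)                       // word counts per batch

      counts.print()
      ssc.start()
      ssc.awaitTermination()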

  12. Other Components (3) – MLBase • Distributed machine learning library

  13. Other Components (4) – GraphX • Distributed graph computing framework • RDG (Resilient Distributed Graph) abstraction • Supports the Gather-Apply-Scatter (GAS) model from GraphLab

  14. GAS model – GraphLab [Figure: the Gather-Apply-Scatter model. Mirror vertices on machines 1–4 gather information from their neighbors into partial sums Σ1–Σ4; the master vertex combines them and applies the vertex update; the updated value Y′ is then scattered back to the mirrors. From Jiang Wenrui's thesis defense.]

  15. Introduction to Scala (1) • Runs on the JVM (and .NET) • Full interoperability with Java • Statically typed • Object-oriented • Functional programming

  16. Introduction to Scala (2) • Declare a list of integers: val ints = List(1,2,4,5,7,3) • Declare a function, cube, that computes the cube of an Int: def cube(a: Int): Int = a * a * a • Apply the cube function to the list: val cubes = ints.map(x => cube(x)) • Sum the cubes of the integers: cubes.reduce((x1,x2) => x1+x2) • Define a factorial function that computes n!: def fact(n: Int): Int = { if (n == 0) 1 else n * fact(n-1) }

  17. Spark in Practice (1) • Word count, the "Hello World" of Spark (interactive shell): val textFile = sc.textFile("hdfs://...") textFile.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey((a, b) => a + b) • Word count as a standalone app (see the sketch below)
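
  A minimal standalone version of the same word count, modeled on the quick-start skeleton (paths and names are placeholders):

      import org.apache.spark.SparkContext
      import org.apache.spark.SparkContext._

      object WordCount {
        def main(args: Array[String]) {
          val sc = new SparkContext("local", "WordCount")

          val counts = sc.textFile("hdfs://...")  // placeholder input path
            .flatMap(line => line.split(" "))
            .map(word => (word, 1))
            .reduceByKey((a, b) => a + b)

          counts.saveAsTextFile("hdfs://...")     // write out instead of collect()
          sc.stop()
        }
      }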

  18. Spark in Practice (2) • PageRank in Spark (sketch below)
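
  The PageRank code on this slide is an image in the original deck; the following sketch follows the well-known Spark PageRank example (input format and iteration count are assumptions):

      import org.apache.spark.SparkContext
      import org.apache.spark.SparkContext._

      val sc = new SparkContext("local", "PageRank")

      // Assumed input: one "srcUrl dstUrl" pair per line
      val links = sc.textFile("hdfs://...")
        .map { line => val p = line.split("\\s+"); (p(0), p(1)) }
        .distinct()
        .groupByKey()
        .cache()                                  // reused on every iteration

      var ranks = links.mapValues(_ => 1.0)

      for (i <- 1 to 10) {                        // iteration count chosen arbitrarily
        val contribs = links.join(ranks).values.flatMap {
          case (urls, rank) => urls.map(url => (url, rank / urls.size))
        }
        ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
      }

      ranks.collect().foreach(println)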

  19. Debugging and Testing Spark Programs (1) • Run in local mode: sc = new SparkContext("local", name) • Debug in the IDE • Run in standalone/cluster mode • Job web GUI on ports 8080/4040 • Log4j • jstack/jmap • dstat/iostat/lsof -p • Unit tests: run in local mode (see the sketch below)
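
  A minimal sketch of a local-mode unit test, assuming ScalaTest is on the classpath (suite name and data are illustrative):

      import org.apache.spark.SparkContext
      import org.apache.spark.SparkContext._
      import org.scalatest.FunSuite

      class WordCountSuite extends FunSuite {
        test("word count in local mode") {
          val sc = new SparkContext("local", "test")
          try {
            val counts = sc.parallelize(Seq("a b", "b c", "b"))
              .flatMap(_.split(" "))
              .map(word => (word, 1))
              .reduceByKey(_ + _)
              .collectAsMap()
            assert(counts("b") === 3)
          } finally {
            sc.stop()                             // free the context for other tests
          }
        }
      }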

  20. Debugging and Testing Spark Programs (2) [The slide's join example survives only as its printed output: RDDJoin2: (2,4) (1,2); RDDJoin3: (1,(1,3)) (1,(2,3)). A reconstruction follows below.]
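
  A reconstruction consistent with that output; the input RDDs are inferred, not taken from the deck:

      // Inputs chosen so the join reproduces the slide's RDDJoin3 output
      val rddJoin1 = sc.parallelize(Seq((1, 1), (1, 2)))
      val rddJoin2 = sc.parallelize(Seq((2, 4), (1, 3)))

      // join matches pairs by key; key 2 appears on one side only and is dropped
      val rddJoin3 = rddJoin1.join(rddJoin2)
      rddJoin3.collect().foreach(println)         // (1,(1,3)) and (1,(2,3))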

  21. Debugging and Testing Spark Programs (3) • Tuning Spark • Replace large objects captured in lambda operators with broadcast variables (see the sketch below) • Coalesce partitions to avoid large numbers of empty tasks after filtering operations • Make good use of partitioning for data locality (mapPartitions) • Choose the partitioning key well to balance the data • Set spark.local.dir to a set of disks • Take care with the number of reduce tasks • Don't collect data to the driver; write to HDFS directly
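
  A minimal sketch of the broadcast-variable tip (the lookup table is a small stand-in for a genuinely large object):

      // Without broadcast, this map would be serialized into every task closure
      val lookup = Map("a" -> 1, "b" -> 2)

      val bc = sc.broadcast(lookup)               // shipped to each worker once
      val keys = sc.parallelize(Seq("a", "b", "a", "c"))

      val resolved = keys.map(k => bc.value.getOrElse(k, -1))
      resolved.collect()                          // Array(1, 2, 1, -1)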

  22. Learning Spark • Spark Quick Start, http://spark.apache.org/docs/latest/quick-start.html • Holden Karau, Fast Data Processing with Spark • Spark Docs, http://spark.apache.org/docs/latest/ • Spark Source code, https://github.com/apache/spark • Spark User Mailing list, http://spark.apache.org/mailing-lists.html

  23. Why are Previous MapReduce-Based Systems Slow? • Conventional explanations: • expensive data materialization for fault tolerance, • inferior data layout (e.g., lack of indices), • costlier execution strategies. • But Shark alleviates these with: • in-memory computing and storage, • Partial DAG Execution. • Shark's experiments instead examine: • intermediate outputs, • data format and layout (co-partitioning), • execution strategies (optimized using PDE), • task scheduling cost.

  24. Conclusion • Spark • In-memory computing platform for iterative and interactive tasks • RDD abstraction • Lineage-based reconstruction for fault recovery • A large number of components are built on top of it • Spark programming • Think of an RDD as a distributed vector • Functional programming • The Scala IDE is not yet strong enough • Good tools for debugging and testing are still lacking
