
An Overview of Apache Spark


Presentation Transcript


  1. An Overview of Apache Spark Jim Scott, Director, Enterprise Strategy and Architecture July 10, 2014

  2. Agenda • MapReduce Refresher • What is Spark? • The Difference with Spark • Preexisting MapReduce • Examples and Resources

  3. MapReduce Refresher

  4. MapReduce Basics • Foundational model is based on a distributed file system • Scalability and fault-tolerance • Map • Loads the data and defines a set of keys • Many use cases do not utilize a reduce task • Reduce • Collects the organized key-based data to process and output • Performance can be tuned based on known details of your source files and cluster shape (node sizes, total node count)

  5. Languages and Frameworks • Languages • Java, Scala, Clojure • Python, Ruby • Higher Level Languages • Hive • Pig • Frameworks • Cascading, Crunch • DSLs • Scalding, Scrunch, Scoobi, Cascalog

  6. MapReduce Processing Model • Define mappers • Shuffling is automatic • Define reducers • For complex work, chain jobs together • Use a higher level language or DSL that does this for you

  7. What is Spark?

  8. Apache Spark spark.apache.org github.com/apache/spark user@spark.apache.org • Originally developed in 2009 in UC Berkeley’s AMP Lab • Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation

  9. The Spark Community

  10. Spark is the Most Active Open Source Project in Big Data (chart: project contributors in the past year)

  11. Unified Platform • Spark (general execution engine) • Spark SQL (SQL) • Spark Streaming (streaming) • MLlib (machine learning) • GraphX (graph computation) • Continued innovation is bringing new functionality, e.g.: • Java 8 (closures, lambda expressions) • BlinkDB (approximate queries) • SparkR (R wrapper for Spark)

  12. Machine Learning - MLlib • K-Means • L1- and L2-regularized Linear Regression • L1- and L2-regularized Logistic Regression • Alternating Least Squares • Naive Bayes • Stochastic Gradient Descent • … ** Mahout is no longer accepting new MapReduce algorithm implementations, in favor of Spark

  13. Data Sources • Local Files • file:///opt/httpd/logs/access_log • S3 • Hadoop Distributed Filesystem • Regular files, sequence files, any other Hadoop InputFormat • HBase
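  A minimal sketch of reading from these sources, assuming a SparkContext named sc as in the shell (all paths, the bucket name, and the sequence-file types are placeholder assumptions):

      import org.apache.hadoop.io.Text

      val localLog = sc.textFile("file:///opt/httpd/logs/access_log")   // local file
      val s3Data   = sc.textFile("s3n://example-bucket/logs/")          // S3 (bucket name is an assumption)
      val hdfsData = sc.textFile("hdfs://namenode:9000/data/input")     // HDFS regular files
      val seqData  = sc.sequenceFile("hdfs://namenode:9000/data/seq",   // sequence file of Text key/value pairs
                                     classOf[Text], classOf[Text])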

  14. Deploying Spark – Cluster Manager Types • Mesos • EC2 • GCE • Standalone mode • YARN

  15. Supported Languages • Java • Scala • Python • Hive?

  16. The Spark Stack from 100,000 ft • Layer 4: Spark ecosystem • Layer 3: Spark core engine • Layer 2: Execution environment • Layer 1: Data platform

  17. The Difference with Spark

  18. Easy and Fast Big Data • Easy to Develop • Rich APIs in Java, Scala, Python • Interactive shell • 2-5× less code • Fast to Run • General execution graphs • In-memory storage • Up to 10× faster than MapReduce on disk, 100× in memory

  19. Resilient Distributed Datasets (RDD) • Spark revolves around RDDs • Fault-tolerant collection of elements that can be operated on in parallel • Parallelized Collection: Scala collection which is run in parallel • Hadoop Dataset: records of files supported by Hadoop http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
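  A brief sketch of the two kinds of RDD (the file path is an assumption):

      val parallelized = sc.parallelize(Seq(1, 2, 3, 4, 5))    // Parallelized Collection: a Scala collection sliced across the cluster
      val hadoopData   = sc.textFile("hdfs:///data/input.txt") // Hadoop Dataset: records of a file supported by Hadoop
      val doubled      = parallelized.map(_ * 2)               // either kind can be operated on in parallel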

  20. Directed Acyclic Graph (DAG) • Directed • Only in a single direction • Acyclic • No looping • Why does this matter? • Because the graph never cycles, any lost partition can be recomputed by replaying its lineage, which is what supports fault-tolerance

  21. RDD Fault Recovery • RDDs track lineage information that can be used to efficiently recompute lost data: msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2]) • Lineage: HDFS File → filter(startswith(...)) → Filtered RDD → map(split(...)) → Mapped RDD

  22. RDD Operations • Transformations • Creation of a new dataset from an existing one • map, filter, distinct, union, sample, groupByKey, join, etc… • Actions • Return a value after running a computation • collect, count, first, takeSample, foreach, etc… Check the documentation for a complete list http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
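  A short sketch of the lazy-transformation / eager-action split (the log path and record format are assumptions):

      val lines  = sc.textFile("hdfs:///logs/access_log")   // path is an assumption
      val errors = lines.filter(_.contains("ERROR"))        // transformation: lazy, defines a new dataset
      val hosts  = errors.map(_.split(" ")(0)).distinct()   // still lazy: nothing has executed yet
      val total  = hosts.count()                            // action: triggers the computation and returns a value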

  23. RDD Persistence / Caching • Variety of storage levels • memory_only (default), memory_and_disk, etc… • API Calls • persist(StorageLevel) • cache() – shorthand for persist(StorageLevel.MEMORY_ONLY) • Considerations • Read from disk vs. recompute (memory_and_disk) • Total memory storage size (memory_only_ser) • Replicate to second node for faster fault recovery (memory_only_2) • Think about this option if supporting a web application http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
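  A minimal sketch of the API calls above (the path is an assumption):

      import org.apache.spark.storage.StorageLevel

      val errors = sc.textFile("hdfs:///logs/access_log").filter(_.contains("ERROR"))
      errors.persist(StorageLevel.MEMORY_AND_DISK)   // spill to disk rather than recompute; cache() would mean MEMORY_ONLY
      errors.count()                                 // the first action materializes the persisted data
      errors.count()                                 // later actions reuse it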

  24. Cache Scaling Matters

  25. Comparison to Storm • Higher throughput than Storm • Spark Streaming: 670k records/sec/node • Storm: 115k records/sec/node • Commercial systems: 100-500k records/sec/node

  26. Interactive Shell • Iterative Development • Cache those RDDs • Open the shell and ask questions • We have all wished we could do this with MapReduce • Compile / save your code for scheduled jobs later • Scala – spark-shell • Python – pyspark
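  An illustrative spark-shell session (the file path is an assumption):

      scala> val logs = sc.textFile("hdfs:///logs/access_log")   // sc is created for you by the shell
      scala> logs.cache()                                        // cache the RDD before asking questions
      scala> logs.filter(_.contains("ERROR")).count()
      scala> logs.filter(_.contains("WARN")).count()             // reuses the cached data, so it returns quickly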

  27. Preexisting MapReduce

  28. Existing Jobs • Java MapReduce • Port them over if you need better performance • Be sure to share the results and learnings • Pig Scripts • Port them over • Try SPORK! • Hive Queries… • See Spark SQL (next slide)

  29. Spark SQL • Shark is officially dead; long live Spark SQL • Hive-compatible (HiveQL, UDFs, metadata) • Works in existing Hive warehouses without changing queries or data! • Augments Hive • In-memory tables and columnar memory store • Fast execution engine • Uses Spark as the underlying execution engine • Low-latency, interactive queries • Scale-out and tolerates worker failures
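  A hedged sketch using the Spark 1.0-era API (the table and column names are assumptions; Spark SQL was alpha at the time, requires a Hive-enabled build, and the API has since evolved):

      import org.apache.spark.sql.hive.HiveContext

      val hiveContext = new HiveContext(sc)
      // Run HiveQL against an existing Hive warehouse without changing queries or data.
      val topPages = hiveContext.hql("SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")
      topPages.collect().foreach(println)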

  30. Examples and Resources

  31. Word Count • Java MapReduce (~15 lines of code) • Java Spark (~7 lines of code) • Scala and Python (4 lines of code) • Interactive shell: skip line 1 and replace the last line with counts.collect() • Java 8 (4 lines of code)
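  The slide's code samples were images; this is a plausible Scala reconstruction (the app name and paths are assumptions). In the interactive shell, sc already exists, so line 1 is skipped and the last line becomes counts.collect():

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.SparkContext._   // implicit pair-RDD functions such as reduceByKey

      val sc = new SparkContext(new SparkConf().setAppName("WordCount"))   // line 1: skip in the shell
      val counts = sc.textFile("hdfs:///data/input")
        .flatMap(line => line.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs:///data/output")   // in the shell, replace with counts.collect()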

  32. Network Word Count – Streaming
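  The slide's code was an image; this sketch follows the lines of the classic NetworkWordCount example (the app name, host, port, and batch interval are assumptions):

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.StreamingContext._   // implicit pair-DStream functions

      val sparkConf = new SparkConf().setAppName("NetworkWordCount")
      val ssc = new StreamingContext(sparkConf, Seconds(1))   // 1-second micro-batches
      val lines = ssc.socketTextStream("localhost", 9999)     // read text arriving on a TCP socket
      val counts = lines.flatMap(_.split(" "))
                        .map(word => (word, 1))
                        .reduceByKey(_ + _)
      counts.print()          // print the first counts of each batch
      ssc.start()             // start the computation
      ssc.awaitTermination()  // wait for it to finish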

  33. Remember • If you want to use a new technology, you must learn that technology • Those who have been using Hadoop for a while once had to learn all about MapReduce and how to manage and tune it • Getting the most out of a new technology includes learning how to tune it • There are switches you can use to optimize your work

  34. Configuration http://spark.apache.org/docs/latest/ Most Important • Application Configuration: http://spark.apache.org/docs/latest/configuration.html • Standalone Cluster Configuration: http://spark.apache.org/docs/latest/spark-standalone.html • Tuning Guide: http://spark.apache.org/docs/latest/tuning.html
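  A minimal sketch of setting some of those switches programmatically (the property values are illustrative assumptions):

      import org.apache.spark.{SparkConf, SparkContext}

      val conf = new SparkConf()
        .setAppName("MyApp")
        .set("spark.executor.memory", "4g")                                      // per-executor memory
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")   // a common tuning-guide suggestion
      val sc = new SparkContext(conf)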

  35. Resources • Pig on Spark • http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html • https://github.com/aniket486/pig • https://github.com/twitter/pig/tree/spork • http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1 • https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix • Latest on Spark • http://databricks.com/categories/spark/ • http://www.spark-stack.org/

  36. Q & A Engage with us! maprtech @kingmesal MapR mapr-technologies maprtech jscott@mapr.com
