An Overview of Apache Spark Jim Scott, Director, Enterprise Strategy and Architecture July 10, 2014
Agenda • MapReduce Refresher • What is Spark? • The Difference with Spark • Preexisting MapReduce • Examples and Resources
MapReduce Basics • Foundational model is based on a distributed file system • Scalability and fault-tolerance • Map • Loading of the data and defining a set of keys • Many use cases do not utilize a reduce task • Reduce • Collects the organized key-based data to process and output • Performance can be tweaked based on known details of your source files (size, total number) and cluster shape
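As an illustrative aside (not from the original deck), here is a minimal word-count mapper and reducer in Hadoop Streaming style, written in Python; the tab-delimited key/value protocol and the helper names are assumptions of this sketch:

from itertools import groupby

def mapper(lines):
    # Map: emit one (word, 1) pair per word; the framework shuffles/sorts by key
    for line in lines:
        for word in line.split():
            print("%s\t%d" % (word, 1))

def reducer(sorted_pairs):
    # Reduce: pairs arrive sorted by key after the shuffle; sum counts per word
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        print("%s\t%d" % (word, sum(count for _, count in group)))

mapper(["to be or not to be"])                      # emits: to 1, be 1, or 1, not 1, to 1, be 1
reducer(sorted([("be", 1), ("be", 1), ("or", 1)]))  # emits: be 2, or 1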
Languages and Frameworks • Languages • Java, Scala, Clojure • Python, Ruby • Higher Level Languages • Hive • Pig • Frameworks • Cascading, Crunch • DSLs • Scalding, Scrunch, Scoobi, Cascalog
MapReduce Processing Model • Define mappers • Shuffling is automatic • Define reducers • For complex work, chain jobs together • Use a higher level language or DSL that does this for you
Apache Spark spark.apache.org github.com/apache/spark user@spark.apache.org • Originally developed in 2009 in UC Berkeley’s AMP Lab • Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation
Spark is the Most Active Open Source Project in Big Data • [Chart: project contributors in the past year]
Unified Platform • Spark SQL (SQL) • Spark Streaming (Streaming) • MLlib (Machine learning) • GraphX (Graph computation) • Spark (General execution engine) • Continued innovation bringing new functionality, e.g.: • Java 8 (Closures, Lambda Expressions) • BlinkDB (Approximate Queries) • SparkR (R wrapper for Spark)
Machine Learning - MLlib • K-Means • L1 and L2-regularized Linear Regression • L1 and L2-regularized Logistic Regression • Alternating Least Squares • Naive Bayes • Stochastic Gradient Descent • … ** Mahout is no longer accepting new MapReduce algorithm implementations; new algorithm work targets Spark
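A hedged sketch of what calling one of these from PySpark looks like; the toy data and parameter values are made up for illustration (assumes a live SparkContext sc, e.g. from the shell):

from pyspark.mllib.clustering import KMeans

# Toy 2-D points; a real job would load features from HDFS/S3
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans.train(points, k=2, maxIterations=10)   # cluster into 2 groups
print(model.clusterCenters)                           # two centers, one per cluster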
Data Sources • Local Files • file:///opt/httpd/logs/access_log • S3 • Hadoop Distributed Filesystem • Regular files, sequence files, any other Hadoop InputFormat • HBase
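All of these go through the same call; a sketch with placeholder paths (the bucket and file names are hypothetical, and sc is assumed from the shell):

local = sc.textFile("file:///opt/httpd/logs/access_log")   # local file
hdfs  = sc.textFile("hdfs:///user/jim/data/part-*")        # HDFS, glob patterns work
s3    = sc.textFile("s3n://my-bucket/logs/2014/07/")       # S3 (s3n was the common scheme at the time)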
Deploying Spark – Cluster Manager Types • Mesos • EC2 • GCE • Standalone mode • YARN
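From the application's point of view the main difference between managers is the master URL. A minimal sketch, with placeholder host names:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("demo").setMaster("spark://master-host:7077")  # standalone mode
# .setMaster("mesos://mesos-host:5050")   # Mesos
# .setMaster("local[*]")                  # all local cores, handy for testing
sc = SparkContext(conf=conf)

(On YARN the master is typically supplied to spark-submit rather than hard-coded.)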
Supported Languages • Java • Scala • Python • Hive?
The Spark Stack from 100,000 ft • Layer 4: Spark ecosystem • Layer 3: Spark core engine • Layer 2: Execution environment • Layer 1: Data platform
Easy and Fast Big Data • Easy to Develop • Rich APIs in Java, Scala, Python • Interactive shell • Fast to Run • General execution graphs • In-memory storage Up to 10× faster on disk, 100× in memory 2-5× less code
Resilient Distributed Datasets (RDD) • Spark revolves around RDDs • Fault-tolerant collection of elements that can be operated on in parallel • Parallelized Collection: Scala collection which is run in parallel • Hadoop Dataset: records of files supported by Hadoop http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
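Both creation paths in PySpark, as a quick sketch (the HDFS path is a placeholder, sc assumed from the shell):

nums  = sc.parallelize([1, 2, 3, 4, 5])          # parallelized collection
lines = sc.textFile("hdfs:///data/input.txt")    # Hadoop dataset
nums.map(lambda x: x * 2).collect()              # operated on in parallel -> [2, 4, 6, 8, 10]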
Directed Acyclic Graph (DAG) • Directed • Only in a single direction • Acyclic • No looping • Why does this matter? • This supports fault-tolerance: because there are no cycles, any lost partition can be recomputed deterministically by replaying its lineage
RDD Fault Recovery RDDs track lineage information that can be used to efficiently recompute lost data:
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
Lineage: HDFS File → filter(func = startswith(…)) → Filtered RDD → map(func = split(…)) → Mapped RDD
RDD Operations • Transformations • Creation of a new dataset from an existing one • map, filter, distinct, union, sample, groupByKey, join, etc… • Actions • Return a value after running a computation • collect, count, first, takeSample, foreach, etc… Check the documentation for a complete list http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
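The key point is that transformations are lazy: nothing executes until an action runs. A small illustration with made-up data:

words  = sc.parallelize(["spark", "hadoop", "spark", "hive"])
pairs  = words.map(lambda w: (w, 1))             # transformation: nothing computed yet
counts = pairs.reduceByKey(lambda a, b: a + b)   # still lazy
counts.collect()                                 # action: triggers the actual job
# -> [('spark', 2), ('hadoop', 1), ('hive', 1)]  (ordering may vary)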
RDD Persistence / Caching • Variety of storage levels • memory_only (default), memory_and_disk, etc… • API Calls • persist(StorageLevel) • cache() – shorthand for persist(StorageLevel.MEMORY_ONLY) • Considerations • Read from disk vs. recompute (memory_and_disk) • Total memory storage size (memory_only_ser) • Replicate to second node for faster fault recovery (memory_only_2) • Think about this option if supporting a web application http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
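A minimal caching sketch, assuming a shell SparkContext and a placeholder log path:

from pyspark import StorageLevel

logs   = sc.textFile("hdfs:///logs/access_log")
errors = logs.filter(lambda line: "ERROR" in line)
errors.persist(StorageLevel.MEMORY_AND_DISK)     # spill to disk instead of recomputing
errors.count()    # first action materializes and caches the RDD
errors.first()    # subsequent actions are served from the cache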
Comparison to Storm • Higher throughput than Storm • Spark Streaming: 670k records/sec/node • Storm: 115k records/sec/node • Commercial systems: 100-500k records/sec/node
Interactive Shell • Iterative Development • Cache those RDDs • Open the shell and ask questions • We have all wished we could do this with MapReduce • Compile / save your code for scheduled jobs later • Scala – spark-shell • Python – pyspark
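A typical iterative session might look like this (the log path is hypothetical):

$ pyspark                                # `sc` is created for you
>>> logs = sc.textFile("/opt/httpd/logs/access_log")
>>> logs.cache()                         # keep it hot across questions
>>> logs.filter(lambda l: " 500 " in l).count()   # ask a question
>>> logs.filter(lambda l: " 404 " in l).count()   # ask another; now served from memory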
Existing Jobs • Java MapReduce • Port them over if you need better performance • Be sure to share the results and lessons learned • Pig Scripts • Port them over • Try SPORK! • Hive Queries… (see Spark SQL, next)
Spark SQL • Shark is officially dead; long live Spark SQL • Hive-compatible (HiveQL, UDFs, metadata) • Works in existing Hive warehouses without changing queries or data! • Augments Hive • In-memory tables and columnar memory store • Fast execution engine • Uses Spark as the underlying execution engine • Low-latency, interactive queries • Scale-out and tolerates worker failures
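A minimal sketch of querying an existing Hive table from PySpark; the table name is hypothetical, and the exact entry-point class varies by Spark version:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)             # Hive-compatible entry point
top = sqlContext.sql(
    "SELECT page, COUNT(*) AS hits FROM access_logs "
    "GROUP BY page ORDER BY hits DESC LIMIT 10")
top.collect()                            # runs on Spark, not MapReduce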
Word Count • Java MapReduce (~15 lines of code) • Java Spark (~7 lines of code) • Scala and Python (4 lines of code) • Interactive shell: skip line 1 and replace the last line with counts.collect() • Java 8 (4 lines of code)
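For reference, the 4-line Python version might look like this (paths are placeholders); in the shell, sc already exists, so you would end with counts.collect() instead of saving:

text   = sc.textFile("hdfs:///data/input.txt")
counts = (text.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///data/output")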
Remember • Adopting a new technology means learning that technology • Those who have used Hadoop for a while once had to learn all about MapReduce and how to manage and tune it • Spark is no different: getting the most out of it includes learning how to tune it • There are switches you can use to optimize your work
Configuration http://spark.apache.org/docs/latest/ Most Important • Application Configuration: http://spark.apache.org/docs/latest/configuration.html • Standalone Cluster Configuration: http://spark.apache.org/docs/latest/spark-standalone.html • Tuning Guide: http://spark.apache.org/docs/latest/tuning.html
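As a sketch of where such switches live, a few commonly tuned settings set programmatically (the values are illustrative only, not recommendations):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("tuned-app")
        .set("spark.executor.memory", "4g")       # per-executor heap
        .set("spark.serializer",
             "org.apache.spark.serializer.KryoSerializer")  # faster serialization
        .set("spark.default.parallelism", "64"))  # default partition count for shuffles
sc = SparkContext(conf=conf)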
Resources • Pig on Spark • http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html • https://github.com/aniket486/pig • https://github.com/twitter/pig/tree/spork • http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1 • https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix • Latest on Spark • http://databricks.com/categories/spark/ • http://www.spark-stack.org/
Q & A • Engage with us! • maprtech • @kingmesal • MapR • mapr-technologies • maprtech • jscott@mapr.com