Apache Spark Overview: Big Data Processing Made Easy

An Overview of Apache Spark Jim Scott, Director, Enterprise Strategy and Architecture July 10, 2014

Agenda • MapReduce Refresher • What is Spark? • The Difference with Spark • Preexisting MapReduce • Examples and Resources

MapReduce Refresher

MapReduce Basics • Foundational model is based on a distributed file system • Scalability and fault-tolerance • Map • Loading of the data and defining a set of keys • Many use cases do not utilize a reduce task • Reduce • Collects the organized key-based data to process and output • Performance can be tweaked based on known details of your source files and cluster shape (size, total number)

Languages and Frameworks • Languages • Java, Scala, Clojure • Python, Ruby • Higher Level Languages • Hive • Pig • Frameworks • Cascading, Crunch • DSLs • Scalding, Scrunch, Scoobi, Cascalog

MapReduce Processing Model • Define mappers • Shuffling is automatic • Define reducers • For complex work, chain jobs together • Use a higher level language or DSL that does this for you
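The processing model above can be sketched in a few lines of plain Python. This is only an illustration, not the Hadoop API: the names `mapper`, `shuffle`, `reducer`, and `run_job` are hypothetical, and everything runs in one process on an in-memory list.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (key, value) pair for every word in a line."""
    for word in line.split():
        yield (word.lower(), 1)


def shuffle(pairs):
    """Shuffle phase: group values by key (a real framework does this for you)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce phase: collapse all values for one key into a single result."""
    return (key, sum(values))


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


counts = run_job(["to be or not to be", "to do"])
print(counts)
```

Chaining jobs for complex work corresponds to feeding the output of one `run_job` call into the mapper of the next, which is the bookkeeping that higher level languages and DSLs take over.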

What is Spark?

Apache Spark spark.apache.org github.com/apache/spark user@spark.apache.org • Originally developed in 2009 in UC Berkeley’s AMP Lab • Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation

The Spark Community

Spark is the Most Active Open Source Project in Big Data • [Chart: project contributors in past year]

Unified Platform • Spark SQL (SQL) • Spark Streaming (Streaming) • MLlib (Machine learning) • GraphX (Graph computation) • Spark (General execution engine) • Continued innovation bringing new functionality, e.g.: • Java 8 (Closures, Lambda Expressions) • BlinkDB (Approximate Queries) • SparkR (R wrapper for Spark)

Machine Learning - MLlib • K-Means • L1 and L2-regularized Linear Regression • L1 and L2-regularized Logistic Regression • Alternating Least Squares • Naive Bayes • Stochastic Gradient Descent • … ** Mahout is no longer accepting MapReduce algorithm submissions, in favor of Spark

Data Sources • Local Files • file:///opt/httpd/logs/access_log • S3 • Hadoop Distributed Filesystem • Regular files, sequence files, any other Hadoop InputFormat • HBase

Deploying Spark – Cluster Manager Types • Mesos • EC2 • GCE • Standalone mode • YARN

Supported Languages • Java • Scala • Python • Hive?

The Spark Stack from 100,000 ft • 4. Spark ecosystem • 3. Spark core engine • 2. Execution environment • 1. Data platform

The Difference with Spark

Easy and Fast Big Data • Easy to Develop • Rich APIs in Java, Scala, Python • Interactive shell • Fast to Run • General execution graphs • In-memory storage • Up to 10× faster on disk, 100× in memory • 2-5× less code

Resilient Distributed Datasets (RDD) • Spark revolves around RDDs • Fault-tolerant collection of elements that can be operated on in parallel • Parallelized Collection: Scala collection which is run in parallel • Hadoop Dataset: records of files supported by Hadoop • http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Directed Acyclic Graph (DAG) • Directed • Only in a single direction • Acyclic • No looping • Why does this matter? • This supports fault-tolerance

RDD Fault Recovery • RDDs track lineage information that can be used to efficiently recompute lost data

msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])

Lineage: HDFS File → filter(func = startswith(…)) → Filtered RDD → map(func = split(...)) → Mapped RDD
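To make the lineage idea concrete, here is a minimal sketch in plain Python. It is not Spark code: `ToyRDD`, `load_partition`, and the sample log lines are hypothetical stand-ins. It only shows why recording the chain of transformations is enough to rebuild a lost partition from the source.

```python
class ToyRDD:
    """A toy stand-in for an RDD: partitions are derived from lineage on demand."""

    def __init__(self, load_partition, num_partitions, steps=()):
        self.load_partition = load_partition  # reads one source partition
        self.num_partitions = num_partitions
        self.steps = tuple(steps)             # lineage: ordered transformations

    def filter(self, predicate):
        def step(rows):
            return [row for row in rows if predicate(row)]
        return ToyRDD(self.load_partition, self.num_partitions, self.steps + (step,))

    def map(self, func):
        def step(rows):
            return [func(row) for row in rows]
        return ToyRDD(self.load_partition, self.num_partitions, self.steps + (step,))

    def compute(self, index):
        """Rebuild one partition by replaying the lineage from the source."""
        rows = self.load_partition(index)
        for step in self.steps:
            rows = step(rows)
        return rows


# Hypothetical tab-separated log data, split into two partitions.
SOURCE = [
    ["ERROR\t10:01\tdisk full", "INFO\t10:02\tok"],
    ["ERROR\t10:05\ttimeout"],
]

text_file = ToyRDD(lambda i: SOURCE[i], num_partitions=len(SOURCE))
msgs = (text_file
        .filter(lambda s: s.startswith("ERROR"))
        .map(lambda s: s.split("\t")[2]))

# Materialize every partition, then pretend the node holding partition 1 died.
materialized = {i: msgs.compute(i) for i in range(msgs.num_partitions)}
del materialized[1]

# Only the lost partition is recomputed; partition 0 is left untouched.
materialized[1] = msgs.compute(1)
print(materialized)
```

Because the lineage is an acyclic chain, replaying it always terminates and always yields the same partition, so no replica of the derived data is needed for recovery.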

RDD Operations • Transformations • Creation of a new dataset from an existing one • map, filter, distinct, union, sample, groupByKey, join, etc… • Actions • Return a value after running a computation • collect, count, first, takeSample, foreach, etc… • Check the documentation for a complete list • http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
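In Spark, transformations are lazy and only an action triggers the computation. As a rough analogy (plain Python iterators, not the Spark API), the built-in `map` and `filter` behave the same way; the `traced` helper and the `log` list below are hypothetical and exist only to show when work actually happens.

```python
log = []


def traced(x):
    """Double a number and record that the element was actually processed."""
    log.append(x)
    return x * 2


numbers = range(1, 6)

# "Transformations": these only describe the pipeline, nothing runs yet.
doubled = map(traced, numbers)
over_four = filter(lambda x: x > 4, doubled)
processed_before_action = list(log)

# "Action": asking for the results forces the whole pipeline to run.
result = list(over_four)

print(processed_before_action, result, log)
```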

RDD Persistence / Caching • Variety of storage levels • MEMORY_ONLY (default), MEMORY_AND_DISK, etc… • API Calls • persist(StorageLevel) • cache() – shorthand for persist(StorageLevel.MEMORY_ONLY) • Considerations • Read from disk vs. recompute (MEMORY_AND_DISK) • Total memory storage size (MEMORY_ONLY_SER) • Replicate to second node for faster fault recovery (MEMORY_ONLY_2) • Think about this option if supporting a web application • http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
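The trade-off behind caching can be sketched without Spark. In the plain-Python toy below, `CountingSource` and `ToyDataset` are hypothetical classes, and `cache()` only mimics the idea of keeping results in memory: it counts how often the underlying source is read when the same dataset is collected twice.

```python
class CountingSource:
    """Hypothetical data source that counts how often it is read."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self.rows)


class ToyDataset:
    """Toy dataset: recomputes on every collect() unless cache() was called."""

    def __init__(self, source, func):
        self.source = source
        self.func = func
        self._use_cache = False
        self._cached = None

    def cache(self):
        """Ask for the result to be kept in memory after the first computation."""
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return list(self._cached)
        result = [self.func(row) for row in self.source.read()]
        if self._use_cache:
            self._cached = result
        return list(result)


uncached_source = CountingSource([1, 2, 3])
uncached = ToyDataset(uncached_source, lambda x: x * x)
uncached.collect()
uncached.collect()

cached_source = CountingSource([1, 2, 3])
cached = ToyDataset(cached_source, lambda x: x * x).cache()
cached.collect()
cached.collect()

print(uncached_source.reads, cached_source.reads)
```

The uncached dataset goes back to its source on every call, while the cached one pays that cost once and then serves from memory. That is the same choice the storage levels above express at cluster scale.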

Cache Scaling Matters

Comparison to Storm • Higher throughput than Storm • Spark Streaming: 670k records/sec/node • Storm: 115k records/sec/node • Commercial systems: 100-500k records/sec/node

Interactive Shell • Iterative Development • Cache those RDDs • Open the shell and ask questions • We have all wished we could do this with MapReduce • Compile / save your code for scheduled jobs later • Scala – spark-shell • Python – pyspark

Preexisting MapReduce

Existing Jobs • Java MapReduce • Port them over if you need better performance • Be sure to share the results and learnings • Pig Scripts • Port them over • Try SPORK! • Hive Queries…

Spark SQL • Shark is officially dead, long live Spark SQL • Hive-compatible (HiveQL, UDFs, metadata) • Works in existing Hive warehouses without changing queries or data! • Augments Hive • In-memory tables and columnar memory store • Fast execution engine • Uses Spark as the underlying execution engine • Low-latency, interactive queries • Scales out and tolerates worker failures

Examples and Resources

Word Count • Java MapReduce (~15 lines of code) • Java Spark (~7 lines of code) • Scala and Python (4 lines of code) • interactive shell: skip line 1 and replace the last line with counts.collect() • Java 8 (4 lines of code)

Network Word Count – Streaming

Remember • If you want to use a new technology you must learn that new technology • For those who have been using Hadoop for a while, at one time you had to learn all about MapReduce and how to manage and tune it • To get the most out of a new technology you need to learn that technology; this includes tuning • There are switches you can use to optimize your work

Configuration • http://spark.apache.org/docs/latest/ • Most Important • Application Configuration: http://spark.apache.org/docs/latest/configuration.html • Standalone Cluster Configuration: http://spark.apache.org/docs/latest/spark-standalone.html • Tuning Guide: http://spark.apache.org/docs/latest/tuning.html

Resources • Pig on Spark • http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html • https://github.com/aniket486/pig • https://github.com/twitter/pig/tree/spork • http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1 • https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix • Latest on Spark • http://databricks.com/categories/spark/ • http://www.spark-stack.org/

Q & A Engage with us! maprtech @kingmesal MapR mapr-technologies maprtech jscott@mapr.com
