Apache Spark September 25th, 2017 Kyung Eun Park, D.Sc. kpark@towson.edu
Contents • Apache Spark • Key Ideas • Features
Spark as a full-featured processing engine • Apache Spark: a framework for distributed, in-memory data processing • Spark is not a modified version of Hadoop • Has its own cluster management, so it does not depend on Hadoop • Can use Hadoop for storage and as a processing back-end • Apache Spark • Fast cluster computing technology, designed for fast computation • Builds on the Hadoop MapReduce model and extends it with • Interactive queries • Stream processing • Main IDEA: in-memory cluster computing • Covers batch applications, iterative algorithms, interactive queries, and streaming • Interactive data processing with support for Scala or Python: effective for pre-processing data
Spark's Evolution • In 2009, developed as one of Hadoop's sub-projects in UC Berkeley's AMPLab by Matei Zaharia • Open-sourced in 2010 under a BSD license • Donated to the Apache Software Foundation in 2013 • Became a top-level Apache project in Feb. 2014
Features of Apache Spark • Speed • Up to 100 times faster when running in memory • Up to 10 times faster when running on disk • Achieved by reducing the number of read/write operations to disk • Keeps intermediate processing data in memory • Supports multiple languages • Built-in APIs in Java, Scala, and Python • 80 high-level operators for interactive querying • Advanced analytics • Supports not only 'Map' and 'Reduce' • But also SQL queries, streaming data, machine learning (ML), and graph algorithms • Higher-level abstraction than the Hadoop/MapReduce APIs
Spark on Hadoop https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
Spark Deployment • Standalone • Spark sits directly on top of HDFS • Space for HDFS is allocated explicitly • Spark and MapReduce run side by side to cover all Spark jobs on the cluster • Hadoop YARN • Spark runs on YARN without any pre-installation or root access required • Spark runs within the Hadoop ecosystem or Hadoop stack • Spark in MapReduce (SIMR) • Used to launch Spark jobs in addition to standalone deployment • Users can start Spark and use its shell without any administrative access
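As a concrete sketch of the YARN deployment mode above, a job submission might look like the following; the application JAR, class name, and resource sizes are placeholders, not part of the original slides:

# Submit a (hypothetical) application to a YARN cluster; with --deploy-mode cluster the driver runs inside the cluster.
$ spark-submit \
    --class edu.towson.example.WordCount \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 4 \
    --executor-memory 2g \
    wordcount-assembly.jar hdfs:///data/input.txt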
Apache Spark Architectural Concepts and Key Terms (1) • Spark Cluster: a collection of machines or nodes in the public cloud or in a private data center/cloud on which Spark is installed. It includes Spark workers, a Spark master (also the cluster manager in Standalone mode), and at least one Spark driver. • Spark Master: a Spark master JVM that acts as the cluster manager in Standalone deployment mode, and with which Spark workers register. It acts as a resource manager and decides how many executors to launch, and on which Spark workers in the cluster. • Spark Worker: upon receiving instructions from the Spark master, the Spark worker JVM launches Executors on the worker on behalf of the Spark driver. Spark applications, decomposed into units of tasks, are executed on each worker's Executor. In short, the worker's job is to launch an Executor on behalf of the master
Apache Spark Architectural Concepts and Key Terms (2) • Spark Executor: a JVM container with an allocated amount of cores and memory on which Spark runs its tasks. Also stores and caches all data partitions in its memory. • Spark Driver: Using information from the Spark Master in the cluster, the driver program distributes Spark tasks to each worker's Executor. [Diagram: a Spark Driver JVM distributing tasks to several Spark Executor JVMs, each with multiple task slots]
Apache Spark Architectural Concepts and Key Terms (3) • SparkContext: a conduit to access all Spark functionality. A single SparkContext per JVM
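In the interactive shells the SparkContext is already created as sc; in a standalone program it must be constructed explicitly. A minimal sketch (the app name and master URL are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// In spark-shell/pyspark "sc" already exists; in an application we build it ourselves.
val conf = new SparkConf()
  .setAppName("ExampleApp")   // placeholder application name
  .setMaster("local[*]")      // placeholder master; e.g. "yarn" on a real cluster
val sc = new SparkContext(conf)

println(sc.parallelize(1 to 100).sum())   // quick sanity check through the context
sc.stop()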
Spark Architecture • Spark provides many more capabilities than its original RDD-based processing engine • Spark SQL: traditional SQL queries + the DataFrames API for relational-algebra processing of large datasets (SQL results without writing to disk) • Spark MLlib: a machine-learning library with many machine-learning algorithms • Spark GraphX: a library for graphs and graph-parallel computation with common algorithms • Spark Streaming: enables building scalable, fault-tolerant streaming applications, similar to Apache Storm [Diagram: Spark SQL, Spark Streaming, MLlib (Machine Learning), and GraphX (Graph) layered on top of Apache Spark Core]
Spark Components • Apache Spark Core • General execution engine for the Spark platform • Provides in-memory computing and referencing of datasets in external storage systems • Spark SQL • Introduces a new data abstraction called SchemaRDD for structured and semi-structured data • MLlib (Machine Learning Library) • A distributed machine-learning framework on top of Spark • Includes common algorithms: linear and logistic regression, support vector machines (SVM), decision trees and random forests, k-means clustering, singular value decomposition (SVD), etc. • About 9 times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface) • GraphX • Distributed graph-processing framework • API for expressing graph computations that model user-defined graphs • Common algorithms: PageRank, label propagation, triangle count, etc. • Spark Streaming • Leverages Spark Core's fast scheduling capability to perform streaming analytics • Ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches of data
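As one concrete illustration of the GraphX algorithms listed above, a PageRank run over an edge-list file might look like this sketch (the input path and tolerance are placeholders; a Spark shell with sc defined is assumed):

import org.apache.spark.graphx.GraphLoader

// Load a graph from a "srcId dstId" edge list and run the built-in PageRank.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
val ranks = graph.pageRank(0.0001).vertices   // (vertexId, rank) pairs
ranks.take(5).foreach(println)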
Spark Streaming • An extension of the core Spark API • Processes data in small time slices • A discretized stream (DStream) is the Spark Streaming dataset abstraction • Processing is performed on these discretized segments of the stream [Diagram: an input data stream is divided by Spark Streaming into batches of input data, which the Spark engine turns into batches of processed data]
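A minimal DStream sketch of this micro-batch model, assuming a Spark shell where sc is defined and a text source on localhost:9999 (both host and port are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Slice the input stream into 1-second micro-batches and word-count each batch.
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()

ssc.start()              // start receiving and processing batches
ssc.awaitTermination()   // block until the streaming job is stopped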
Iterative Operations on Spark RDD • Store intermediate results in distributed memory instead of stable storage (disk) [Diagram: with MapReduce, each step (MR1, MR2, MR3) reads tuples from HDFS and writes them back to disk; with Spark RDDs, iterations 1 through n read input from stable storage once, pass intermediate results through distributed memory, and write output to stable storage only at the end]
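A small sketch of this iterative pattern (the input path and the computation itself are made up for illustration): the input is cached once, and every pass scans the in-memory partitions instead of re-reading HDFS.

// Cache the working set once; each iteration reuses the in-memory partitions.
val values = sc.textFile("hdfs:///data/values.txt")   // placeholder path, one number per line
               .map(_.toDouble)
               .cache()

var threshold = 0.0
for (i <- 1 to 5) {
  // repeatedly raise the threshold to the mean of the values above it
  threshold = values.filter(_ > threshold).mean()
}
println(s"final threshold: $threshold")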
Interactive Operations on Spark RDD • If different queries are run repeatedly on the same set of data, the dataset can be kept in memory [Diagram: data on disk is read from HDFS once (one-time processing) into distributed memory; queries 1, 2, and 3 then each produce their result from the in-memory dataset]
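The same idea for interactive use, as a sketch (the log file path and the query predicates are invented): the file is read and cached once, then several ad-hoc queries are answered from memory.

// One-time read into memory, then multiple queries against the cached RDD.
val logs = sc.textFile("hdfs:///data/web.log").cache()   // placeholder path

val errors   = logs.filter(_.contains("ERROR")).count()   // Query 1
val warnings = logs.filter(_.contains("WARN")).count()    // Query 2
val logins   = logs.filter(_.contains("login")).count()   // Query 3
println(s"errors=$errors warnings=$warnings logins=$logins")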
MapReduce vs. Spark • MapReduce • Data sharing is slow in MapReduce due to replication, serialization, and disk I/O • Most Hadoop applications spend more than 90% of their time doing HDFS read-write operations • Works well for one-pass computation (first map(), then reduce()), but is inefficient for multi-pass algorithms • Spark • Uses Resilient Distributed Datasets (RDDs) • Not tied to a map phase followed by a reduce phase: a job can be a DAG (directed acyclic graph) of many map and/or reduce/shuffle phases • Supports in-memory processing • Stores the state of memory as an object across jobs • The object is sharable between those jobs • Data sharing in memory is 10 to 100 times faster than over the network or from disk
Resilient Distributed Dataset (RDD) • Spark's main abstraction, with various relational-algebra operators and transformation logic • An immutable, distributed collection of items • Created from Hadoop input formats (e.g., HDFS files) or by transforming other RDDs • Each RDD is divided into logical partitions, transparently computed on different nodes of the cluster • Relational-algebra operators: • Select • Filter • Join • Group by • Transformation logic in Scala or Python on RDDs • Spark DataFrames as the other data abstraction • Built on top of an RDD (Note: compare the slicing-and-dicing APIs in Python's pandas library) • Data are organized into named columns, similar to a relational database table
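A sketch of the relational-style operators above expressed on pair RDDs (the tiny inline datasets are made up):

// select / filter / join / group-by expressed with RDD operations.
val employees = sc.parallelize(Seq((1, "Alice"), (2, "Bob"), (3, "Carol")))   // (id, name)
val titles    = sc.parallelize(Seq((1, "Engineer"), (2, "Analyst")))          // (id, title)

val names    = employees.map { case (_, name) => name }             // "select"
val filtered = employees.filter { case (id, _) => id > 1 }           // "filter"
val joined   = employees.join(titles)                                // "join" on the key
val grouped  = joined.map { case (_, (_, title)) => (title, 1) }
                     .reduceByKey(_ + _)                             // "group by" + count
grouped.collect().foreach(println)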
Spark DataFrames • Spark DataFrames can be created from • Existing RDDs • Structured data files • JSON datasets • Hive tables • External databases • This makes Spark useful as a data-ingestion tool into Hadoop, e.g., from CSV files, external SQL databases, and NoSQL data stores [Diagram: these data sources flowing through Apache Spark into Hadoop]
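A sketch of creating DataFrames from a few of these sources, assuming a Spark 2.x shell where spark (a SparkSession) is predefined; all file paths are placeholders:

import spark.implicits._

// Structured files go through the spark.read entry point ...
val csvDF  = spark.read.option("header", "true").csv("hdfs:///data/equi.csv")
val jsonDF = spark.read.json("hdfs:///data/people.json")

// ... and an existing RDD can be converted with toDF().
val rddDF = sc.parallelize(Seq(("Alice", 30), ("Bob", 25))).toDF("name", "age")

csvDF.printSchema()
jsonDF.select("name").show()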
RDD to Dataset • Resilient Distributed Dataset (RDD): • Before Spark 2.0, the main programming interface of Spark • Dataset • Since Spark 2.0, the RDD is superseded by the Dataset • Strongly typed like an RDD, but with richer optimizations • The RDD interface is still supported (see the RDD programming guide for a more complete reference) • Switching to Dataset is highly recommended: • Better performance than RDDs • See the SQL programming guide for more information about Datasets
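A minimal strongly-typed Dataset sketch (a Spark 2.x shell is assumed; the case class and data are made up):

// A typed Dataset built from a case class; operations are checked at compile time.
case class Employee(name: String, age: Int)

import spark.implicits._
val ds = Seq(Employee("Alice", 30), Employee("Bob", 25)).toDS()

ds.filter(_.age > 26).map(_.name).show()   // typed filter/map, optimized by Catalyst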
Spark APIs: RDD Transformations • An RDD transformation returns a pointer to a new RDD and lets you create dependencies between RDDs • A transformation is a step in a program telling Spark how to get data and what to do with it • RDD transformations • map(func): returns a new distributed dataset, formed by passing each element of the source through the function func • filter(func): returns a new dataset formed by selecting those elements of the source on which func returns true • flatMap(func): similar to map, but each input item can be mapped to 0 or more output items, so func should return a Seq rather than a single item • groupByKey([numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs, similar to a shuffle/sort • reduceByKey(func, [numTasks]): returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V; as in groupByKey, the number of reduce tasks is configurable via [numTasks] … https://www.tutorialspoint.com/apache_spark/apache_spark_core_programming.htm
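A short sketch of the transformations listed above (the sample text is invented); note that nothing is executed yet, since each call only returns a new RDD and records the lineage:

val lines  = sc.parallelize(Seq("to be or not to be", "that is the question"))
val words  = lines.flatMap(_.split(" "))    // one line -> many words
val pairs  = words.map(word => (word, 1))   // (K, V) pairs
val counts = pairs.reduceByKey(_ + _)       // aggregate the values per key
val longer = words.filter(_.length > 3)     // keep only the longer words
// counts and longer are still just execution plans; an action (next slide) runs them.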
Spark APIs: Actions • Actions • reduce(func): aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel • collect(): returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data • count(): returns the number of elements in the dataset • first(): returns the first element of the dataset, similar to take(1) • take(n): returns an array with the first n elements of the dataset • saveAsTextFile(path): writes the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS, or any other Hadoop-supported file system • countByKey(): only available on RDDs of type (K, V); returns a hashmap of (K, Int) pairs with the count of each key • foreach(func): runs a function func on each element of the dataset https://www.tutorialspoint.com/apache_spark/apache_spark_core_programming.htm
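A sketch of the actions above on a small made-up RDD; unlike transformations, each of these calls triggers a job and returns a result to the driver (the output directory is a placeholder):

val nums = sc.parallelize(1 to 10)

nums.reduce(_ + _)                        // 55
nums.count()                              // 10
nums.first()                              // 1
nums.take(3)                              // Array(1, 2, 3)
nums.map(x => (x % 2, x)).countByKey()    // Map(0 -> 5, 1 -> 5)
nums.saveAsTextFile("hdfs:///tmp/nums")   // writes part files to the given directory
nums.foreach(println)                     // runs on the executors, not the driver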
Introduction to using Spark • API through Spark’s interactive shell (in Python or Scala) • How to write applications in Java, Scala, and Python https://spark.apache.org/docs/latest/quick-start.html
Interactive Analysis with the Spark Shell • Spark shell • Provides a simple way to learn the API • A powerful tool to analyze data interactively $ ./bin/spark-shell • Dataset • Spark's primary abstraction • A distributed collection of items
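A first interactive session in the spirit of the quick-start guide, assuming a Spark 2.x shell (where spark is the predefined SparkSession) started from the Spark installation directory so that README.md is present:

scala> val textFile = spark.read.textFile("README.md")            // build a Dataset[String]
scala> textFile.count()                                           // number of lines in the file
scala> textFile.filter(line => line.contains("Spark")).count()    // lines mentioning Spark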
Apache Spark Install • Java: Java 8 • Scala: http://www.scala-lang.org/download/ $ sudo apt install scala • Spark: https://spark.apache.org/docs/latest/ • Spark Shell • Provides an interactive shell as a tool to analyze data interactively • Available in either Scala or Python • Open the Spark shell $ spark-shell scala> val inputfile = sc.textFile("README.md") scala> https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm
Lab II-1: Word Count with Spark in Scala "input.txt" people are not as beautiful as they look, as they walk or as they talk. they are only as beautiful as they love, as they care as they share. • Word Count in Scala via the Spark Scala command shell: $ spark-shell • input.txt • Count the words in a file • Create a flat map that splits each line into words using flatMap(): flatMap(line => line.split(" ")) • Next, read each word as a key with the value 1 (<key, value> = <word, 1>) using the map function: map(word => (word, 1)) • Finally, reduce those keys by adding the values of identical keys using reduceByKey(): reduceByKey(_ + _) • Create a simple RDD from the text file scala> val infile = sc.textFile("./input.txt") scala> val counts = infile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _); scala> counts.saveAsTextFile("./wordcount") • Check the output $ cd wordcount $ cat part-00000
Lab II-2: Import CSV Files into HIVE Using Spark (Python) • Run pyspark $ ./bin/pyspark • In Spark with Python, first import the functions necessary for Spark DataFrame operations >>> from pyspark.sql import HiveContext >>> from pyspark.sql.types import * >>> from pyspark.sql import Row • Import raw data into a Spark RDD (Note: USE AN ABSOLUTE PATH) >>> csv_data = sc.textFile("file:///home/tiger/sparktest/equi.csv") • Confirm the RDD using the type() command >>> type(csv_data) <class 'pyspark.rdd.RDD'> • Split the CSV data using the map() function >>> csv_data = csv_data.map(lambda p: p.split(",")) • Remove the header from the RDD >>> header = csv_data.first() >>> csv_data = csv_data.filter(lambda p: p != header) • Put the data in the csv_data RDD into a Spark SQL DataFrame using the toDF() function >>> df_csv = csv_data.map(lambda p: Row(EmpID = int(p[0]), Fname = p[1], Title = p[2], Laptop = p[3])).toDF() or >>> df_csv = csv_data.map(lambda p: Row(EmpID = p[0], Fname = p[1], Title = p[2], Laptop = p[3])).toDF() • Show the structure and data of the first five rows of the df_csv DataFrame: >>> df_csv.show(5) • Display the DataFrame schema >>> df_csv.printSchema() • Store the DataFrame in a Hive table: create a HiveContext, used to store the DataFrame into a Hive table (in ORC format), using the saveAsTable() command >>> from pyspark.sql import HiveContext >>> hc = HiveContext(sc) >>> df_csv.write.format("orc").saveAsTable("employees")
Apache Spark – Core Programming • Applications in Scala: use a compatible Scala version • Linking with Spark • Import Spark classes and implicit conversions into your program • Initializing Spark • RDDs • RDD Operations • Transformations • Actions • RDD Persistence (cache) in memory import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ new SparkContext(master, appName, [sparkHome], [jars]) https://www.tutorialspoint.com/apache_spark/apache_spark_core_programming.htm
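Putting the pieces of this slide together, a minimal self-contained Scala application might look like the sketch below (the object name and input file are placeholders; newer code would normally pass a SparkConf instead of the positional master/appName constructor shown on the slide):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // master and appName as on the slide; sparkHome/jars are optional and omitted here
    val sc = new SparkContext("local[2]", "SimpleApp")
    val logData = sc.textFile("README.md").cache()                // RDD persistence in memory
    val numSpark = logData.filter(_.contains("Spark")).count()    // transformation + action
    println(s"Lines with Spark: $numSpark")
    sc.stop()
  }
}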
Scala Programming Guide • Spark application • A driver program: runs the user’s main function and executes various parallel operations on the Spark cluster with many executors https://spark.apache.org/docs/0.9.0/scala-programming-guide.html
Reference • Jimmy Lin (Univ. of Waterloo): https://lintool.github.io/bigdata-2017w/index.html • Spark Programming Guide: https://spark.apache.org/docs/0.9.0/scala-programming-guide.html • Python Programming Guide: https://spark.apache.org/docs/0.9.0/python-programming-guide.html • Apache Spark – Core Programming: https://www.tutorialspoint.com/apache_spark/apache_spark_core_programming.htm • Apache Spark Installation: https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm • Apache Spark Quick Start: https://spark.apache.org/docs/latest/quick-start.html • 7 Steps for a Developer to Learn Apache Spark: http://go.databricks.com/hubfs/Landing_pages/blog-books/7-steps-for-a-developer-to-learn-apache-spark.pdf?t=1505973148714&utm_source=hs_automation&utm_medium=email&utm_content=43570218&_hsenc=p2ANqtz-_v6Jpz7Lp5CCEAsw5WnPM3pswpW3q1UHOf0keHGgiXp78HG-HuCn1kcgmLcuI6YmEgxE7dxZ5k2CKJJN6Q3PqzoJuA1A&_hsmi=43570218