
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training | Edureka

This Edureka Spark Streaming tutorial will help you understand how to use Spark Streaming to stream data from Twitter in real time and then process it for sentiment analysis. It is ideal for both beginners and professionals who want to learn or brush up their Apache Spark concepts. Topics covered in this tutorial:

1) What is Streaming?
2) Spark Ecosystem
3) Why Spark Streaming?
4) Spark Streaming Overview
5) DStreams
6) DStream Transformations
7) Caching/Persistence
8) Accumulators, Broadcast Variables and Checkpoints
9) Use Case – Twitter Sentiment Analysis


Presentation Transcript


  1. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  2. What to expect?
  - What is Streaming?
  - Spark Ecosystem
  - Why Spark Streaming?
  - Spark Streaming Overview
  - DStreams
  - DStream Transformations
  - Caching/Persistence
  - Accumulators, Broadcast Variables and Checkpoints
  - Use Case – Twitter Sentiment Analysis

  3. What is Streaming?

  4. What is Streaming?
  - Data Streaming is a technique for transferring data so that it can be processed as a steady and continuous stream.
  - "Without stream processing there's no big data and no Internet of Things" – Dana Sandu, SQLstream
  - Streaming technologies are becoming increasingly important with the growth of the Internet.
  Figure: Streaming sources delivering live stream data to the user

  5. Spark Ecosystem

  6. Spark Ecosystem
  - Spark Core Engine: the core engine for the entire Spark framework; provides utilities and architecture for the other components
  - Spark SQL (SQL): used for structured data; can run unmodified Hive queries on an existing Hadoop deployment
  - Spark Streaming (Streaming): enables analytical and interactive apps for live streaming data
  - MLlib (Machine Learning): machine learning libraries being built on top of Spark
  - GraphX (Graph Computation): graph computation engine (similar to Giraph); combines data-parallel and graph-parallel concepts
  - SparkR (R on Spark): package for the R language to enable R users to leverage Spark power from the R shell


  8. Why Spark Streaming?

  9. Why Spark Streaming? Spark Streaming is used to stream real-time data from various sources like Twitter, stock markets and geographical systems, and to perform powerful analytics that help businesses. We will be using Spark Streaming to perform Twitter Sentiment Analysis, which is used by companies around the world. We will explore it after we learn all the concepts of Spark Streaming.

  10. Spark Streaming Overview

  11. Spark Streaming Overview
  - Spark Streaming is used for processing real-time streaming data.
  - It is a useful addition to the core Spark API.
  - Spark Streaming enables high-throughput and fault-tolerant stream processing of live data streams.
  - The fundamental stream unit is the DStream, which is basically a series of RDDs used to process the real-time data.
  Figure: Streams in Spark Streaming

  12. Spark Streaming Features
  - Fault Tolerance: efficiently recovers from failures
  - Integration: integrates with batch and real-time processing
  - Speed: achieves low latency
  - Business Analysis: used to track the behaviour of customers
  - Scaling: scales to hundreds of nodes

  13. Spark Streaming Workflow
  Figure: Overview of Spark Streaming. Streaming data sources and static data sources feed Spark Streaming and Spark SQL (SQL + DataFrames), combined with MLlib (machine learning), and results are written to data storage systems.

  14. Spark Streaming Workflow
  Figure: Data flows from a variety of sources (Kafka, Flume, Kinesis, Twitter, HDFS/S3) through Spark Streaming to various storage systems (HDFS, databases, dashboards).

  15. Spark Streaming Workflow
  Figure: Incoming streams of data are divided by the streaming engine into batches of input data, which are processed into batches of processed data.

  16. Spark Streaming Workflow
  Figure: The input data stream is divided into discrete chunks: data from time 0 to 1 becomes the RDD @ time 1, data from time 1 to 2 becomes the RDD @ time 2, and so on, together forming a DStream.

  17. Spark Streaming Workflow
  Figure: Extracting words from an input stream. A flatMap operation turns each batch of the lines DStream (data from time 0 to 1, 1 to 2, ...) into the corresponding batch of the words DStream (words from time 0 to 1, 1 to 2, ...).

  18. Streaming Fundamentals

  19. Streaming Fundamentals The following gives a flow of the fundamentals of Spark Streaming that we will discuss in the coming slides:
  1. Streaming Context
  2. DStream
     2.1 Input DStream
     2.2 DStream Transformations
     2.3 Output DStream
  3. Caching
  4. Accumulators, Broadcast Variables and Checkpoints


  21. Streaming Context
  - A StreamingContext consumes a stream of data in Spark.
  - It registers an InputDStream to produce a Receiver object.
  - It is the main entry point for Spark Streaming functionality.
  - Spark provides a number of default implementations of sources, like Twitter, Akka Actor and ZeroMQ, that are accessible from the context.
  Figure: Spark Streaming Context (an input data stream is divided into batches of input data)
  Figure: Default Implementation Sources

  22. Streaming Context – Initialization
  - A StreamingContext object can be created from a SparkContext object.
  - A SparkContext represents the connection to a Spark cluster and can be used to create RDDs, accumulators and broadcast variables on that cluster.

  import org.apache.spark._
  import org.apache.spark.streaming._
  val ssc = new StreamingContext(sc, Seconds(1))
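To make the 1-second batch interval concrete, here is a minimal plain-Scala sketch, with no Spark dependency, of how a context with `Seconds(1)` groups arriving records into micro-batches. The `Event` class and the timings are our own illustrative assumptions, not Spark API.

```scala
// Simulate micro-batching: events tagged with an arrival time (in ms)
// are grouped into 1-second batches, as a StreamingContext created
// with Seconds(1) would do. Plain Scala only - no Spark required.
case class Event(timeMs: Long, payload: String)

val events = Seq(
  Event(100, "a"), Event(900, "b"),   // batch 0: 0-999 ms
  Event(1200, "c"),                   // batch 1: 1000-1999 ms
  Event(2500, "d"), Event(2600, "e")  // batch 2: 2000-2999 ms
)

val batchIntervalMs = 1000L
// Map each event to its batch index, then group payloads per batch.
val batches: Map[Long, Seq[String]] =
  events.groupBy(_.timeMs / batchIntervalMs)
        .map { case (idx, es) => idx -> es.map(_.payload) }
```

Each value of `batches` is the stand-in for one RDD of the resulting DStream.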


  24. DStream
  - Discretized Stream (DStream) is the basic abstraction provided by Spark Streaming.
  - It is a continuous stream of data.
  - It is received from a source or from a processed data stream generated by transforming the input stream.
  - Internally, a DStream is represented by a continuous series of RDDs, and each RDD contains data from a certain interval.
  Figure: Input data stream divided into discrete chunks of data (data from time 0 to 1 forms the RDD @ time 1, and so on)

  25. DStream Operation
  - Any operation applied on a DStream translates to operations on the underlying RDDs.
  - For example, when converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream.
  Figure: Extracting words from an input stream (lines DStream, via flatMap, becomes the words DStream, batch by batch)
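The per-batch nature of DStream operations can be sketched in plain Scala, with no Spark needed; the nested sequences below stand in for a DStream's underlying RDDs:

```scala
// A lines "DStream" simulated as a sequence of per-interval batches.
val linesDStream: Seq[Seq[String]] = Seq(
  Seq("spark streaming", "is fast"), // RDD @ time 1
  Seq("hello spark")                 // RDD @ time 2
)

// flatMap on the DStream means flatMap applied to every underlying RDD:
// each batch of lines becomes the corresponding batch of words.
val wordsDStream: Seq[Seq[String]] =
  linesDStream.map(batch => batch.flatMap(_.split(" ")))
```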


  27. Input DStreams
  Input DStreams are DStreams representing the stream of input data received from streaming sources. They come from two kinds of sources:
  - Basic sources: file systems, socket connections
  - Advanced sources: Kafka, Flume, Kinesis
  Figure: Input data stream divided into discrete chunks of data

  28. Receiver Every input DStream is associated with a Receiver object which receives the data from a source and stores it in Spark’s memory for processing. Figure: The Receiver sends data onto the DStream where each Batch contains RDDs


  30. Transformations on DStreams
  Transformations allow the data from the input DStream to be modified, similar to RDDs. DStreams support many of the transformations available on normal Spark RDDs. The most popular Spark Streaming transformations are map, flatMap, filter, reduce and groupBy.
  Figure: DStream Transformations (DStream 1 is transformed into DStream 2)

  31. Transformations on DStreams – map(func)
  map(func) returns a new DStream by passing each element of the source DStream through a function func.
  Figure: Input DStream being converted through map(func)
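The semantics of map are the same as on an ordinary Scala collection, which makes for a quick self-contained illustration (plain Scala, no Spark; the data is made up):

```scala
// map(func): apply func to every element of each batch.
// Nested sequences stand in for the DStream's per-interval RDDs.
val numberBatches = Seq(Seq(1, 2), Seq(3, 4))
val doubledBatches = numberBatches.map(batch => batch.map(_ * 2))
```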

  32. Transformations on DStreams – flatMap(func)
  flatMap(func) is similar to map(func), but each input item can be mapped to 0 or more output items; it returns a new DStream by passing each source element through a function func.
  Figure: Input DStream being converted through flatMap(func)
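A plain-Scala illustration of the "0 or more outputs per input" behaviour, using made-up sentence data:

```scala
// flatMap(func): each input item may produce zero or more outputs,
// which are flattened into a single result.
val sentences = Seq("to be", "or not to be")
val words = sentences.flatMap(_.split(" "))
```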

  33. Transformations on DStreams – filter(func)
  filter(func) returns a new DStream by selecting only the records of the source DStream on which func returns true.
  Figure: Input DStream being converted through filter(func), filtering for even numbers
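The slide's even-number example in plain Scala (no Spark required):

```scala
// filter(func): keep only the records for which func returns true -
// here, the even numbers, as in the slide's figure.
val nums = Seq(1, 2, 3, 4, 5, 6)
val evens = nums.filter(_ % 2 == 0)
```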

  34. Transformations on DStreams – reduce(func)
  reduce(func) returns a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func.
  Figure: Input DStream being converted through reduce(func), computing a cumulative sum
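Reducing one batch down to a single element, sketched on a plain Scala collection (the batch data is made up):

```scala
// reduce(func): aggregate all elements of a batch into one value
// using an associative function - here, a sum.
val batch = Seq(1, 2, 3, 4)
val batchSum = batch.reduce(_ + _)
```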

  35. Transformations on DStreams – groupBy(func)
  groupBy(func) returns a new RDD made up of keys and the corresponding list of items in each group.
  Figure: Input DStream being converted through groupBy(func), grouping words by their first letter
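The slide's first-letter grouping in plain Scala (no Spark; the word list is illustrative):

```scala
// groupBy(func): key each record by func, collecting the items of
// each group into a list - here, words keyed by their first letter.
val animals = Seq("cat", "cow", "dog", "duck")
val byFirstLetter: Map[Char, Seq[String]] = animals.groupBy(_.head)
```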

  36. DStream Window

  37. DStream Window
  - Spark Streaming also provides windowed computations, which allow us to apply transformations over a sliding window of data.
  - The following figure illustrates this sliding window:
  Figure: DStream Window Transformation
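A hedged plain-Scala sketch of the sliding-window idea: a window covering 3 batch intervals that slides forward by 1 interval. Scala's `sliding(size, step)` on a sequence of per-batch values mimics the effect; the window/slide lengths and data are our own assumptions.

```scala
// One aggregated value per batch interval of the source DStream.
val perBatchValues = Seq(1, 2, 3, 4, 5)

// Window length = 3 batches, sliding interval = 1 batch:
// each windowed result aggregates the last 3 batches.
val windowedSums = perBatchValues.sliding(3, 1).map(_.sum).toList
```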


  39. Output Operations on DStreams
  - Output operations allow a DStream’s data to be pushed out to external systems like databases or file systems.
  - Output operations trigger the actual execution of all the DStream transformations.
  Figure: Output operations push the transformed DStream out to external systems (databases, file systems)

  40. Output Operations on DStreams Currently, the following output operations are defined:
  - print()
  - saveAsTextFiles(prefix, [suffix])
  - saveAsObjectFiles(prefix, [suffix])
  - saveAsHadoopFiles(prefix, [suffix])
  - foreachRDD(func)

  41. Output Operations Example – foreachRDD
  - dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems.
  - The lazy evaluation achieves the most efficient transfer of data.

  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { partitionOfRecords =>
      // ConnectionPool is a static, lazily initialized pool of connections
      val connection = ConnectionPool.getConnection()
      partitionOfRecords.foreach(record => connection.send(record))
      // Return to the pool for future reuse
      ConnectionPool.returnConnection(connection)
    }
  }


  43. Caching/Persistence
  - DStreams allow developers to cache/persist the stream’s data in memory. This is useful if the data in the DStream will be computed multiple times.
  - This can be done using the persist() method on a DStream.
  - For input streams that receive data over the network (such as Kafka, Flume, sockets, etc.), the default persistence level is set to replicate the data to two nodes for fault tolerance.
  Figure: Caching into 2 nodes
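Why persisting helps can be shown with a tiny memoization stand-in, in plain Scala with no Spark: a cached batch result is computed once and reused, rather than recomputed on every access. The names and the "expensive" computation are illustrative only.

```scala
import scala.collection.mutable

// Count how many times the underlying computation actually runs.
var computeCount = 0
val cache = mutable.Map.empty[Int, Int] // the "persisted" results

def expensive(x: Int): Int = { computeCount += 1; x * x }

// getOrElseUpdate only evaluates expensive(x) on a cache miss,
// analogous to reusing a persisted DStream instead of recomputing it.
def cached(x: Int): Int = cache.getOrElseUpdate(x, expensive(x))

val first = cached(7)  // computed
val second = cached(7) // served from the cache
```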


  45. Accumulators
  - Accumulators are variables that are only added through an associative and commutative operation.
  - They are used to implement counters or sums.
  - Tracking accumulators in the UI can be useful for understanding the progress of running stages.
  - Spark natively supports numeric accumulators. We can create named or unnamed accumulators.
  Figure: Accumulators in Spark Streaming
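A plain-Scala sketch of the accumulator idea: a counter that is only ever added to, summed across simulated partitions. This is not the Spark accumulator API, just the concept on local collections.

```scala
// Records spread across three simulated partitions.
val partitions = Seq(Seq(1, 2), Seq(3), Seq(4, 5))

// The "accumulator": only additions are applied to it, so the result
// is the same regardless of partition processing order.
var recordCount = 0L
partitions.foreach(partition => partition.foreach(_ => recordCount += 1))
```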

  46. Broadcast Variables
  - Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
  - They can be used to give every node a copy of a large input dataset in an efficient manner.
  - Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
  Figure: SparkContext and broadcasting (SparkContext calls broadcast(value), the BroadcastManager creates it via newBroadcast[T](value, isLocal), and it is registered for cleanup with the ContextCleaner)
  Figure: Broadcasting a value to executors
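The intent of a broadcast variable can be sketched in plain Scala: one shared read-only lookup table used by every simulated task, instead of a copy shipped with each one. The lookup data is made up for illustration.

```scala
// The "broadcast" value: a read-only lookup table shared by all tasks.
val countryLookup = Map("IN" -> "India", "US" -> "United States")

// Each record is processed by a task that reads the shared table
// rather than carrying its own copy of it.
val records = Seq("IN", "US", "IN")
val resolved = records.map(code => countryLookup.getOrElse(code, "Unknown"))
```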

  47. Checkpoints
  Checkpoints are similar to checkpoints in gaming: they let a streaming application run 24/7 and make it resilient to failures unrelated to the application logic. There are two types:
  - Metadata checkpoints: saving the information defining the streaming computation
  - Data checkpoints: saving the generated RDDs to reliable storage
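A minimal plain-Scala sketch of the recovery idea behind data checkpoints: running state is saved periodically, so after a simulated failure the job resumes from the checkpoint instead of starting over. The word-count state and failure simulation are our own illustration, not Spark's checkpointing API.

```scala
// Running state: a word count updated batch by batch.
var state = Map.empty[String, Int]
def update(batch: Seq[String]): Unit =
  batch.foreach(w => state = state.updated(w, state.getOrElse(w, 0) + 1))

update(Seq("spark", "spark", "flink"))
val checkpoint = state // in real Spark, written to reliable storage

state = Map.empty      // simulate a failure losing the in-memory state
state = checkpoint     // recover from the checkpoint
update(Seq("spark"))   // continue processing where we left off
```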

  48. Use Case – Twitter Sentiment Analysis

  49. Use Case – Twitter Sentiment Analysis
  - Sentiment refers to the emotion behind a social media mention online.
  - Sentiment Analysis is categorising the tweets related to a particular topic and performing data mining using sentiment automation analytics tools.
  - Trending topics can be used to create campaigns and attract a larger audience. Sentiment analytics helps in crisis management, service adjusting and target marketing.
  - We will be performing Twitter Sentiment Analysis as our use case for Spark Streaming.
  Figure: Facebook and Twitter trending topics

  50. Use Case – Problem Statement
  Problem Statement: to design a Twitter Sentiment Analysis system where we populate real-time sentiments for crisis management, service adjusting and target marketing.
  Sentiment Analysis is used to:
  - Predict the success of a movie
  - Predict political campaign success
  - Decide whether to invest in a certain company
  - Target advertising
  - Review products and services
  Figure: Twitter Sentiment Analysis for Nike
  Figure: Twitter Sentiment Analysis for Adidas
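As a deliberately tiny stand-in for the analytics stage of this use case, here is a lexicon-based sentiment scorer in plain Scala. The word lists and scoring rule are our own illustrative assumptions; real systems use far richer tools.

```scala
// Minimal lexicon-based sentiment scoring for a tweet's text.
val positive = Set("love", "great", "good", "win")
val negative = Set("hate", "bad", "fail", "worst")

def sentiment(tweet: String): String = {
  // Split on non-word characters and score: +1 per positive word,
  // -1 per negative word; the sign of the total decides the label.
  val words = tweet.toLowerCase.split("\\W+").toSeq
  val score = words.count(positive) - words.count(negative)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}
```

In a Spark Streaming job, a function like this would be applied inside a map over the tweet DStream before counting sentiments per batch.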
