
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training | Edureka

This Edureka Spark Streaming tutorial will help you understand how to use Spark Streaming to stream data from Twitter in real time and then process it for sentiment analysis. It is ideal for both beginners and professionals who want to learn or brush up their Apache Spark concepts. Topics covered in this tutorial:

1) What is Streaming?
2) Spark Ecosystem
3) Why Spark Streaming?
4) Spark Streaming Overview
5) DStreams
6) DStream Transformations
7) Caching/Persistence
8) Accumulators, Broadcast Variables and Checkpoints
9) Use Case – Twitter Sentiment Analysis


Presentation Transcript


  1. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  2. What to expect?
  - What is Streaming?
  - Spark Ecosystem
  - Why Spark Streaming?
  - Spark Streaming Overview
  - DStreams
  - DStream Transformations
  - Caching/Persistence
  - Accumulators, Broadcast Variables and Checkpoints
  - Use Case – Twitter Sentiment Analysis

  3. What is Streaming?

  4. What is Streaming?
  - Data Streaming is a technique for transferring data so that it can be processed as a steady and continuous stream.
  - "Without stream processing there's no big data and no Internet of Things" – Dana Sandu, SQLstream
  - Streaming technologies are becoming increasingly important with the growth of the Internet.
  Figure: Streaming sources delivering live stream data to the user

  5. Spark Ecosystem

  6. Spark Ecosystem
  - Spark Core Engine: the core engine for the entire Spark framework; provides utilities and architecture for the other components
  - Spark SQL (SQL): used for structured data; can run unmodified Hive queries on an existing Hadoop deployment
  - Spark Streaming (Streaming): enables analytical and interactive apps for live streaming data
  - MLlib (Machine Learning): machine learning libraries being built on top of Spark
  - GraphX (Graph Computation): graph computation engine (similar to Giraph); combines data-parallel and graph-parallel concepts
  - SparkR (R on Spark): package for the R language to enable R users to leverage Spark power from the R shell


  8. Why Spark Streaming?

  9. Why Spark Streaming? Spark Streaming is used to stream real-time data from various sources like Twitter, stock markets and geographical systems, and to perform powerful analytics that help businesses. We will be using Spark Streaming to perform Twitter Sentiment Analysis, which is used by companies around the world. We will explore it after we learn all the concepts of Spark Streaming.

  10. Spark Streaming Overview

  11. Spark Streaming Overview
  - Spark Streaming is used for processing real-time streaming data.
  - It is a useful addition to the core Spark API.
  - Spark Streaming enables high-throughput and fault-tolerant stream processing of live data streams.
  - The fundamental stream unit is the DStream, which is basically a series of RDDs used to process the real-time data.
  Figure: Streams in Spark Streaming

  12. Spark Streaming Features
  - Fault Tolerance: efficiently recovers from failures
  - Integration: integrates with batch and real-time processing
  - Speed: achieves low latency
  - Business Analysis: used to track the behaviour of customers
  - Scaling: scales to hundreds of nodes

  13. Spark Streaming Workflow
  Figure: Overview of Spark Streaming. Streaming data sources and static data sources feed Spark Streaming and Spark SQL (SQL + DataFrames), combined with MLlib (machine learning), and results are written to data storage systems.

  14. Spark Streaming Workflow
  Figure: Data flows from a variety of sources (Kafka, Flume, Kinesis, Twitter, HDFS/S3) through Spark Streaming to various storage systems (HDFS, databases, dashboards).

  15. Spark Streaming Workflow
  Figure: Incoming streams of data are divided by the streaming engine into batches of input data, which are processed into batches of processed data.

  16. Spark Streaming Workflow
  Figure: The input data stream is divided into discrete chunks: data from time 0 to 1 becomes the RDD @ time 1, data from time 1 to 2 becomes the RDD @ time 2, and so on, together forming a DStream.

  17. Spark Streaming Workflow
  Figure: Extracting words from an input stream. A flatMap operation turns each batch of the lines DStream (data from time 0 to 1, 1 to 2, ...) into the corresponding batch of the words DStream (words from time 0 to 1, 1 to 2, ...).

  18. Streaming Fundamentals

  19. Streaming Fundamentals The following gives a flow of the fundamentals of Spark Streaming that we will discuss in the coming slides:
  1. Streaming Context
  2. DStream
     2.1 Input DStream
     2.2 DStream Transformations
     2.3 Output DStream
  3. Caching
  4. Accumulators, Broadcast Variables and Checkpoints


  21. Streaming Context
  - A StreamingContext consumes a stream of data in Spark.
  - It registers an InputDStream to produce a Receiver object.
  - It is the main entry point for Spark Streaming functionality.
  - Spark provides a number of default implementations of sources, like Twitter, Akka Actor and ZeroMQ, that are accessible from the context.
  Figure: Spark Streaming Context (an input data stream is divided into batches of input data)
  Figure: Default Implementation Sources

  22. Streaming Context – Initialization
  - A StreamingContext object can be created from a SparkContext object.
  - A SparkContext represents the connection to a Spark cluster and can be used to create RDDs, accumulators and broadcast variables on that cluster.

  import org.apache.spark._
  import org.apache.spark.streaming._
  val ssc = new StreamingContext(sc, Seconds(1))
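To make the 1-second batch interval concrete, here is a minimal plain-Scala sketch, with no Spark dependency, of how a context with `Seconds(1)` groups arriving records into micro-batches. The `Event` class and the timings are our own illustrative assumptions, not Spark API.

```scala
// Simulate micro-batching: events tagged with an arrival time (in ms)
// are grouped into 1-second batches, as a StreamingContext created
// with Seconds(1) would do. Plain Scala only - no Spark required.
case class Event(timeMs: Long, payload: String)

val events = Seq(
  Event(100, "a"), Event(900, "b"),   // batch 0: 0-999 ms
  Event(1200, "c"),                   // batch 1: 1000-1999 ms
  Event(2500, "d"), Event(2600, "e")  // batch 2: 2000-2999 ms
)

val batchIntervalMs = 1000L
// Map each event to its batch index, then group payloads per batch.
val batches: Map[Long, Seq[String]] =
  events.groupBy(_.timeMs / batchIntervalMs)
        .map { case (idx, es) => idx -> es.map(_.payload) }
```

Each value of `batches` is the stand-in for one RDD of the resulting DStream.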


  24. DStream
  - Discretized Stream (DStream) is the basic abstraction provided by Spark Streaming.
  - It is a continuous stream of data.
  - It is received from a source or from a processed data stream generated by transforming the input stream.
  - Internally, a DStream is represented by a continuous series of RDDs, and each RDD contains data from a certain interval.
  Figure: Input data stream divided into discrete chunks of data (data from time 0 to 1 forms the RDD @ time 1, and so on)

  25. DStream Operation
  - Any operation applied on a DStream translates to operations on the underlying RDDs.
  - For example, when converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream.
  Figure: Extracting words from an input stream (lines DStream, via flatMap, becomes the words DStream, batch by batch)
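The per-batch nature of DStream operations can be sketched in plain Scala, with no Spark needed; the nested sequences below stand in for a DStream's underlying RDDs:

```scala
// A lines "DStream" simulated as a sequence of per-interval batches.
val linesDStream: Seq[Seq[String]] = Seq(
  Seq("spark streaming", "is fast"), // RDD @ time 1
  Seq("hello spark")                 // RDD @ time 2
)

// flatMap on the DStream means flatMap applied to every underlying RDD:
// each batch of lines becomes the corresponding batch of words.
val wordsDStream: Seq[Seq[String]] =
  linesDStream.map(batch => batch.flatMap(_.split(" ")))
```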


  27. Input DStreams
  Input DStreams are DStreams representing the stream of input data received from streaming sources. They come from two kinds of sources:
  - Basic sources: file systems, socket connections
  - Advanced sources: Kafka, Flume, Kinesis
  Figure: Input data stream divided into discrete chunks of data

  28. Receiver Every input DStream is associated with a Receiver object which receives the data from a source and stores it in Spark’s memory for processing. Figure: The Receiver sends data onto the DStream where each Batch contains RDDs


  30. Transformations on DStreams
  Transformations allow the data from the input DStream to be modified, similar to RDDs. DStreams support many of the transformations available on normal Spark RDDs. The most popular Spark Streaming transformations are map, flatMap, filter, reduce and groupBy.
  Figure: DStream Transformations (DStream 1 is transformed into DStream 2)

  31. Transformations on DStreams – map(func)
  map(func) returns a new DStream by passing each element of the source DStream through a function func.
  Figure: Input DStream being converted through map(func)
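The semantics of map are the same as on an ordinary Scala collection, which makes for a quick self-contained illustration (plain Scala, no Spark; the data is made up):

```scala
// map(func): apply func to every element of each batch.
// Nested sequences stand in for the DStream's per-interval RDDs.
val numberBatches = Seq(Seq(1, 2), Seq(3, 4))
val doubledBatches = numberBatches.map(batch => batch.map(_ * 2))
```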

  32. Transformations on DStreams – flatMap(func)
  flatMap(func) is similar to map(func), but each input item can be mapped to 0 or more output items; it returns a new DStream by passing each source element through a function func.
  Figure: Input DStream being converted through flatMap(func)
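A plain-Scala illustration of the "0 or more outputs per input" behaviour, using made-up sentence data:

```scala
// flatMap(func): each input item may produce zero or more outputs,
// which are flattened into a single result.
val sentences = Seq("to be", "or not to be")
val words = sentences.flatMap(_.split(" "))
```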

  33. Transformations on DStreams – filter(func)
  filter(func) returns a new DStream by selecting only the records of the source DStream on which func returns true.
  Figure: Input DStream being converted through filter(func), filtering for even numbers
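The slide's even-number example in plain Scala (no Spark required):

```scala
// filter(func): keep only the records for which func returns true -
// here, the even numbers, as in the slide's figure.
val nums = Seq(1, 2, 3, 4, 5, 6)
val evens = nums.filter(_ % 2 == 0)
```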

  34. Transformations on DStreams – reduce(func)
  reduce(func) returns a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func.
  Figure: Input DStream being converted through reduce(func), computing a cumulative sum
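Reducing one batch down to a single element, sketched on a plain Scala collection (the batch data is made up):

```scala
// reduce(func): aggregate all elements of a batch into one value
// using an associative function - here, a sum.
val batch = Seq(1, 2, 3, 4)
val batchSum = batch.reduce(_ + _)
```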

  35. Transformations on DStreams – groupBy(func)
  groupBy(func) returns a new RDD made up of keys and the corresponding list of items in each group.
  Figure: Input DStream being converted through groupBy(func), grouping words by their first letter
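The slide's first-letter grouping in plain Scala (no Spark; the word list is illustrative):

```scala
// groupBy(func): key each record by func, collecting the items of
// each group into a list - here, words keyed by their first letter.
val animals = Seq("cat", "cow", "dog", "duck")
val byFirstLetter: Map[Char, Seq[String]] = animals.groupBy(_.head)
```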

  36. DStream Window

  37. DStream Window
  - Spark Streaming also provides windowed computations, which allow us to apply transformations over a sliding window of data.
  - The following figure illustrates this sliding window:
  Figure: DStream Window Transformation
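A hedged plain-Scala sketch of the sliding-window idea: a window covering 3 batch intervals that slides forward by 1 interval. Scala's `sliding(size, step)` on a sequence of per-batch values mimics the effect; the window/slide lengths and data are our own assumptions.

```scala
// One aggregated value per batch interval of the source DStream.
val perBatchValues = Seq(1, 2, 3, 4, 5)

// Window length = 3 batches, sliding interval = 1 batch:
// each windowed result aggregates the last 3 batches.
val windowedSums = perBatchValues.sliding(3, 1).map(_.sum).toList
```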


  39. Output Operations on DStreams
  - Output operations allow a DStream’s data to be pushed out to external systems like databases or file systems.
  - Output operations trigger the actual execution of all the DStream transformations.
  Figure: Output operations push the transformed DStream out to external systems (databases, file systems)

  40. Output Operations on DStreams Currently, the following output operations are defined:
  - print()
  - saveAsTextFiles(prefix, [suffix])
  - saveAsObjectFiles(prefix, [suffix])
  - saveAsHadoopFiles(prefix, [suffix])
  - foreachRDD(func)

  41. Output Operations Example – foreachRDD
  - dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems.
  - The lazy evaluation achieves the most efficient transfer of data.

  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { partitionOfRecords =>
      // ConnectionPool is a static, lazily initialized pool of connections
      val connection = ConnectionPool.getConnection()
      partitionOfRecords.foreach(record => connection.send(record))
      // Return to the pool for future reuse
      ConnectionPool.returnConnection(connection)
    }
  }


  43. Caching/Persistence
  - DStreams allow developers to cache/persist the stream’s data in memory. This is useful if the data in the DStream will be computed multiple times.
  - This can be done using the persist() method on a DStream.
  - For input streams that receive data over the network (such as Kafka, Flume, sockets, etc.), the default persistence level is set to replicate the data to two nodes for fault tolerance.
  Figure: Caching into 2 nodes
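Why persisting helps can be shown with a tiny memoization stand-in, in plain Scala with no Spark: a cached batch result is computed once and reused, rather than recomputed on every access. The names and the "expensive" computation are illustrative only.

```scala
import scala.collection.mutable

// Count how many times the underlying computation actually runs.
var computeCount = 0
val cache = mutable.Map.empty[Int, Int] // the "persisted" results

def expensive(x: Int): Int = { computeCount += 1; x * x }

// getOrElseUpdate only evaluates expensive(x) on a cache miss,
// analogous to reusing a persisted DStream instead of recomputing it.
def cached(x: Int): Int = cache.getOrElseUpdate(x, expensive(x))

val first = cached(7)  // computed
val second = cached(7) // served from the cache
```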


  45. Accumulators
  - Accumulators are variables that are only added through an associative and commutative operation.
  - They are used to implement counters or sums.
  - Tracking accumulators in the UI can be useful for understanding the progress of running stages.
  - Spark natively supports numeric accumulators. We can create named or unnamed accumulators.
  Figure: Accumulators in Spark Streaming
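A plain-Scala sketch of the accumulator idea: a counter that is only ever added to, summed across simulated partitions. This is not the Spark accumulator API, just the concept on local collections.

```scala
// Records spread across three simulated partitions.
val partitions = Seq(Seq(1, 2), Seq(3), Seq(4, 5))

// The "accumulator": only additions are applied to it, so the result
// is the same regardless of partition processing order.
var recordCount = 0L
partitions.foreach(partition => partition.foreach(_ => recordCount += 1))
```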

  46. Broadcast Variables
  - Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
  - They can be used to give every node a copy of a large input dataset in an efficient manner.
  - Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
  Figure: SparkContext and broadcasting (SparkContext calls broadcast(value), the BroadcastManager creates it via newBroadcast[T](value, isLocal), and it is registered for cleanup with the ContextCleaner)
  Figure: Broadcasting a value to executors
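The intent of a broadcast variable can be sketched in plain Scala: one shared read-only lookup table used by every simulated task, instead of a copy shipped with each one. The lookup data is made up for illustration.

```scala
// The "broadcast" value: a read-only lookup table shared by all tasks.
val countryLookup = Map("IN" -> "India", "US" -> "United States")

// Each record is processed by a task that reads the shared table
// rather than carrying its own copy of it.
val records = Seq("IN", "US", "IN")
val resolved = records.map(code => countryLookup.getOrElse(code, "Unknown"))
```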

  47. Checkpoints
  Checkpoints are similar to checkpoints in gaming: they let a streaming application run 24/7 and make it resilient to failures unrelated to the application logic. There are two types:
  - Metadata checkpoints: saving the information defining the streaming computation
  - Data checkpoints: saving the generated RDDs to reliable storage
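A minimal plain-Scala sketch of the recovery idea behind data checkpoints: running state is saved periodically, so after a simulated failure the job resumes from the checkpoint instead of starting over. The word-count state and failure simulation are our own illustration, not Spark's checkpointing API.

```scala
// Running state: a word count updated batch by batch.
var state = Map.empty[String, Int]
def update(batch: Seq[String]): Unit =
  batch.foreach(w => state = state.updated(w, state.getOrElse(w, 0) + 1))

update(Seq("spark", "spark", "flink"))
val checkpoint = state // in real Spark, written to reliable storage

state = Map.empty      // simulate a failure losing the in-memory state
state = checkpoint     // recover from the checkpoint
update(Seq("spark"))   // continue processing where we left off
```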

  48. Use Case – Twitter Sentiment Analysis

  49. Use Case – Twitter Sentiment Analysis
  - Sentiment refers to the emotion behind a social media mention online.
  - Sentiment Analysis is categorising the tweets related to a particular topic and performing data mining using sentiment automation analytics tools.
  - Trending topics can be used to create campaigns and attract a larger audience. Sentiment analytics helps in crisis management, service adjusting and target marketing.
  - We will be performing Twitter Sentiment Analysis as our use case for Spark Streaming.
  Figure: Facebook and Twitter trending topics

  50. Use Case – Problem Statement
  Problem Statement: to design a Twitter Sentiment Analysis system where we populate real-time sentiments for crisis management, service adjusting and target marketing.
  Sentiment Analysis is used to:
  - Predict the success of a movie
  - Predict political campaign success
  - Decide whether to invest in a certain company
  - Target advertising
  - Review products and services
  Figure: Twitter Sentiment Analysis for Nike
  Figure: Twitter Sentiment Analysis for Adidas
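As a deliberately tiny stand-in for the analytics stage of this use case, here is a lexicon-based sentiment scorer in plain Scala. The word lists and scoring rule are our own illustrative assumptions; real systems use far richer tools.

```scala
// Minimal lexicon-based sentiment scoring for a tweet's text.
val positive = Set("love", "great", "good", "win")
val negative = Set("hate", "bad", "fail", "worst")

def sentiment(tweet: String): String = {
  // Split on non-word characters and score: +1 per positive word,
  // -1 per negative word; the sign of the total decides the label.
  val words = tweet.toLowerCase.split("\\W+").toSeq
  val score = words.count(positive) - words.count(negative)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}
```

In a Spark Streaming job, a function like this would be applied inside a map over the tweet DStream before counting sentiments per batch.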
