
Spark Streaming



Presentation Transcript


  1. Spark Streaming Large-scale near-real-time stream processing Tathagata Das (TD) UC BERKELEY

  2. Motivation • Many important applications must process large data streams at second-scale latencies • Check-ins, status updates, site statistics, spam filtering, … • Require large clusters to handle workloads • Require latencies of a few seconds

  3. Case study: Conviva, Inc. • Real-time monitoring of online video metadata • Custom-built distributed streaming system • 1000s of complex metrics on millions of video sessions • Requires many dozens of nodes for processing • Hadoop backend for offline analysis • Generating daily and monthly reports • Similar computation as the streaming system Painful to maintain two stacks

  4. Goals • Framework for large-scale stream processing • Scalable to large clusters (~ 100 nodes) with near-real-time latency (~ 1 second) • Efficiently recovers from faults and stragglers • Simple programming model that integrates well with batch & interactive queries Existing systems do not achieve all of these

  5. Existing Streaming Systems • Record-at-a-time processing model • Each node has mutable state • For each record, update state & send new records [Diagram: input records pushed into nodes 1–3; each node holds mutable state]

  6. Existing Streaming Systems • Storm • Replays records if not processed due to failure • Processes each record at least once • May update mutable state twice! • Mutable state can be lost due to failure! • Trident – uses transactions to update state • Processes each record exactly once • Per-state transaction updates are slow No integration with batch processing & cannot handle stragglers

  7. Spark Streaming

  8. Discretized Stream Processing • Run a streaming computation as a series of very small, deterministic batch jobs • Batch processing models, like MapReduce, recover from faults and stragglers efficiently • Divide job into deterministic tasks • Rerun failed/slow tasks in parallel on other nodes • Same recovery techniques at lower time scales

  9. Spark Streaming • State between batches kept in memory as immutable, fault-tolerant dataset • Specifically as Spark’s Resilient Distributed Dataset • Batch sizes can be reduced to as low as 1/2 second to achieve ~ 1 second latency • Potentially combine streaming and batch workloads to build a single unified stack
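To make the batch-interval idea concrete, here is a minimal setup sketch, assuming the standard StreamingContext API (package paths and implicits follow later Spark releases; the 0.7-era layout differed slightly, and the app/master settings are illustrative):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Every 1-second batch interval becomes one small deterministic Spark job
  val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(1))
  // ... define DStreams and output operations here ...
  ssc.start()               // begin receiving data and scheduling batch jobs
  ssc.awaitTermination()    // block until the computation is stopped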

  10. Discretized Stream Processing [Diagram: the input stream is divided into intervals (time = 0–1, time = 1–2, …); each interval's input is an immutable distributed dataset replicated in memory; batch operations on it produce state / output, also stored in memory as immutable RDDs, forming a state stream alongside the input stream]

  11. Fault Recovery • State stored as Resilient Distributed Dataset (RDD) • Deterministically re-computable parallel collection • Remembers lineage of operations used to create it • Fault / straggler recovery is done in parallel on other nodes [Diagram: an operation maps the input dataset (replicated and fault-tolerant) to a state RDD (not replicated)] Fast recovery from faults without full data replication

  12. Programming Model • A Discretized Stream or DStream is a series of RDDs representing a stream of data • API very similar to RDDs • DStreams can be created… • Either from live streaming data • Or by transforming other DStreams

  13. DStream Data Sources • Many sources out of the box • HDFS • Kafka • Flume • Twitter • TCP sockets • Akka actors • ZeroMQ (several contributed by external developers) • Easy to add your own
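For example, a text stream can be created from a raw TCP socket; the host and port here are placeholders (a sketch reusing the ssc context from the setup above):

  // One RDD of text lines is produced per batch interval
  val lines = ssc.socketTextStream("localhost", 9999)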

  14. Transformations Build new streams from existing streams • RDD-like operations • map, flatMap, filter, count, reduce • groupByKey, reduceByKey, sortByKey, join • etc. • New window and stateful operations • window, countByWindow, reduceByWindow • countByValueAndWindow, reduceByKeyAndWindow • updateStateByKey • etc.
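A small word-count sketch showing how these RDD-like transformations chain on a DStream (building on the lines stream above; the splitting logic and window sizes are illustrative):

  val words  = lines.flatMap(_.split(" "))    // split each line into words
  val pairs  = words.map(word => (word, 1))   // key each word with a count of 1
  val counts = pairs.reduceByKey(_ + _)       // per-batch word counts
  // Same counts, but over a 30-second window sliding every second
  val windowedCounts =
    pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(1))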

  15. Output Operations Send data to the outside world • saveAsHadoopFiles • print – prints on the driver's screen • foreach – arbitrary operation on every RDD
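And the corresponding output side, sketched with the operations listed above (foreach on a DStream was later renamed foreachRDD; both take an arbitrary function over each batch's RDD):

  counts.print()                      // print the first elements of each batch on the driver
  counts.saveAsTextFiles("counts")    // write one directory of files per batch interval
  counts.foreachRDD { rdd =>          // arbitrary operation on every RDD
    rdd.take(10).foreach(println)
  }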

  16. Example Process a stream of Tweets to find the 20 most popular hashtags in the last 10 mins • Get the stream of Tweets and isolate the hashtags • Count the hashtags over a 10 minute window • Sort the hashtags by their counts • Get the top 20 hashtags

  17. 1. Get the stream of hashtags
  val tweets = ssc.twitterStream(<username>, <password>)
  val hashTags = tweets.flatMap(status => getTags(status))
  [Diagram: each tweets RDD (intervals t-1 through t+4) is mapped by flatMap to the corresponding hashTags RDD; a DStream transformation creates a new RDD for every batch]

  18. 2. Count the hashtags over a 10 min window
  val tweets = ssc.twitterStream(<username>, <password>)
  val hashTags = tweets.flatMap(status => getTags(status))
  val tagCounts = hashTags.window(Minutes(10), Seconds(1))   // sliding window operation
    .map(tag => (tag, 1)).reduceByKey(_ + _)
  [Diagram: the window operation gathers the hashTags RDDs of the last 10 minutes into each tagCounts RDD]

  19. 2. Count the hashtags over 10 min (incremental version)
  val tweets = ssc.twitterStream(<username>, <password>)
  val hashTags = tweets.flatMap(status => getTags(status))
  val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
  [Diagram: each new tagCounts RDD is computed incrementally, adding (+) counts from the interval entering the window and subtracting (–) counts from the interval leaving it]

  20. Smart window-based reduce • Technique with count generalizes to reduce • Need a function to "subtract" • Applies to invertible reduce functions • Could have implemented counting as: hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), …)
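Spelled out with both functions and durations, the invertible-reduce form of the hashtag count might look like this (a sketch reusing hashTags from the example; the subtract function removes the counts of the interval that just slid out of the window, which requires checkpointing to be enabled in practice):

  val tagCounts = hashTags
    .map(tag => (tag, 1))
    .reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,   // add counts entering the window
      (a: Int, b: Int) => a - b,   // subtract counts leaving the window
      Minutes(10),                 // window length
      Seconds(1))                  // slide interval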

  21. 3. Sort the hashtags by their counts
  val tweets = ssc.twitterStream(<username>, <password>)
  val hashTags = tweets.flatMap(status => getTags(status))
  val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
  val sortedTags = tagCounts.map { case (tag, cnt) => (cnt, tag) }
    .transform(_.sortByKey(false))
  (transform allows arbitrary RDD operations to be used to create a new DStream)

  22. 4. Get the top 20 hashtags
  val tweets = ssc.twitterStream(<username>, <password>)
  val hashTags = tweets.flatMap(status => getTags(status))
  val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
  val sortedTags = tagCounts.map { case (tag, cnt) => (cnt, tag) }
    .transform(_.sortByKey(false))
  sortedTags.foreach(showTopTags(20) _)   // output operation

  23. Top 10 hashtags in the last 10 min
  // Create the stream of tweets
  val tweets = ssc.twitterStream(<username>, <password>)
  // Count the tags over a 10 minute window
  val tagCounts = tweets.flatMap(status => getTags(status))
    .countByValueAndWindow(Minutes(10), Seconds(1))
  // Sort the tags by counts
  val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }
    .transform(_.sortByKey(false))
  // Show the top 10 tags
  sortedTags.foreach(showTopTags(10) _)

  24. Demo

  25. Other Operations • Maintaining arbitrary state, tracking sessions: tweets.updateStateByKey(tweet => updateMood(tweet)) • Selecting data directly from a DStream: tagCounts.slice(<from Time>, <to Time>).sortByKey()
  [Diagram: tweets RDDs across intervals t-1 through t+4 update a per-user mood state]
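The slide abbreviates updateStateByKey; its actual update function receives all new values for a key in the current batch plus the key's previous state. A minimal running-count sketch (updateMood itself is application-specific and not shown):

  // Requires a checkpoint directory so state can be rebuilt after a failure
  ssc.checkpoint("checkpoints")
  val runningTagCounts = hashTags
    .map(tag => (tag, 1))
    .updateStateByKey((newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0)))   // carry a running total across batches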

  26. Performance Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency

  27. Comparison with others Higher throughput than Storm • Spark Streaming: 670k records/second/node • Storm: 115k records/second/node • Apache S4: 7.5k records/second/node

  28. Fast Fault Recovery Recovers from faults/stragglers within 1 sec

  29. Real Applications: Conviva Real-time monitoring of video metadata • Implemented Shadoop – a wrapper to run Hadoop jobs over Spark / Spark Streaming • Ported parts of Conviva's Hadoop stack to run on Spark Streaming
  val shJob = new SparkHadoopJob[…](<Hadoop job>)
  shJob.run(<Spark context>)
  [Diagram: Shadoop wraps an existing Hadoop job so it runs on Spark / Spark Streaming]

  30. Real Applications: Conviva Real-time monitoring of video metadata • Achieved 1–2 second latency • Millions of video sessions processed • Scales linearly with cluster size

  31. Real Applications: Mobile Millennium Project Traffic estimation using online machine learning • Markov chain Monte Carlo simulations on GPS observations • Very CPU intensive, requires 10s of machines for useful computation • Scales linearly with cluster size

  32. Failure Semantics • Input data replicated by the system • Lineage of deterministic ops used to recompute RDDs from input data if worker nodes fail • Transformations – exactly once • Output operations – at least once

  33. Java API for Streaming • Developed by Patrick Wendell • Similar to Spark Java API • Don't need to know Scala to try streaming!

  34. Contributors • 5 contributors from UCB, 3 external contributors (*) • Matei Zaharia, Haoyuan Li • Patrick Wendell • Denny Britz • Sean McNamara* • Prashant Sharma* • Nick Pentreath* • Tathagata Das

  35. Vision – one stack to rule them all Spark + Spark Streaming

  36. Conclusion • Alpha to be released with Spark 0.7 by the weekend • Look at the new Streaming Programming Guide • More about the Spark Streaming system in our paper: http://tinyurl.com/dstreams • Join us at Strata on Feb 26 in Santa Clara
