Spark Streaming

Spark Streaming: Large-scale near-real-time stream processing
Tathagata Das (TD), UC Berkeley

Motivation
• Many important applications must process large data streams at second-scale latencies
  • Check-ins, status updates, site statistics, spam filtering, …
• Require large clusters to handle workloads
• Require latencies of a few seconds

Case study: Conviva, Inc.
• Real-time monitoring of online video metadata
• Custom-built distributed streaming system
  • 1000s of complex metrics on millions of video sessions
  • Requires many dozens of nodes for processing
• Hadoop backend for offline analysis
  • Generating daily and monthly reports
  • Similar computation to the streaming system
Painful to maintain two stacks

Goals
• Framework for large-scale stream processing
• Scalable to large clusters (~100 nodes) with near-real-time latency (~1 second)
• Efficiently recovers from faults and stragglers
• Simple programming model that integrates well with batch & interactive queries
Existing systems do not achieve all of these goals

Existing Streaming Systems
• Record-at-a-time processing model
  • Each node has mutable state
  • For each record, update state & send new records
[Diagram: input records are pushed through nodes 1, 2 and 3, each holding mutable state]

Existing Streaming Systems
• Storm
  • Replays records if not processed due to failure
  • Processes each record at least once
  • May update mutable state twice!
  • Mutable state can be lost due to failure!
• Trident: uses transactions to update state
  • Processes each record exactly once
  • Per-state transaction updates are slow
No integration with batch processing, and cannot handle stragglers

Spark Streaming

Discretized Stream Processing
• Run a streaming computation as a series of very small, deterministic batch jobs
• Batch processing models, like MapReduce, recover from faults and stragglers efficiently
  • Divide job into deterministic tasks
  • Rerun failed/slow tasks in parallel on other nodes
• Same recovery techniques at lower time scales
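
Sketch (not from the talk): the idea of cutting a stream into small deterministic batch jobs can be illustrated with plain Scala collections. The `Record` type, the `discretize`, `countWords` and `run` helpers, and the millisecond timestamps are all hypothetical; no Spark API is used, and a real system would run each batch as parallel tasks on a cluster.

```scala
// Hypothetical illustration only: plain collections stand in for a cluster.
object MicroBatchSketch {
  // A record tagged with its arrival time in milliseconds (assumed shape).
  final case class Record(timeMs: Long, value: String)

  // Cut the stream into consecutive intervals of `batchMs` milliseconds,
  // keyed by interval index and ordered by time.
  def discretize(records: Seq[Record], batchMs: Long): Seq[(Long, Seq[Record])] =
    records.groupBy(_.timeMs / batchMs).toSeq.sortBy(_._1)

  // A deterministic batch job: count occurrences of each value in one batch.
  def countWords(batch: Seq[Record]): Map[String, Int] =
    batch.groupBy(_.value).map { case (word, rs) => (word, rs.size) }

  // Run the same batch job on every interval. Because the job is
  // deterministic, any interval can be rerun after a failure.
  def run(records: Seq[Record], batchMs: Long): Seq[(Long, Map[String, Int])] =
    discretize(records, batchMs).map { case (i, batch) => (i, countWords(batch)) }
}
```

Calling `MicroBatchSketch.run(records, 1000)` yields one word-count result per one-second interval.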

Spark Streaming
• State between batches kept in memory as an immutable, fault-tolerant dataset
  • Specifically as Spark’s Resilient Distributed Dataset (RDD)
• Batch sizes can be reduced to as low as 1/2 second to achieve ~1 second latency
• Potentially combine streaming and batch workloads to build a single unified stack

Discretized Stream Processing
[Diagram: for each interval (time = 0 - 1, time = 1 - 2, …), batch operations turn that interval's input into state / output. Input stream: immutable distributed datasets, replicated in memory. State stream: immutable distributed datasets, stored in memory as RDDs.]

Fault Recovery
• State stored as Resilient Distributed Dataset (RDD)
  • Deterministically re-computable parallel collection
  • Remembers lineage of operations used to create it
• Fault / straggler recovery is done in parallel on other nodes
[Diagram: an operation derives the state RDD (not replicated) from the input dataset (replicated and fault-tolerant)]
Fast recovery from faults without full data replication
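
Sketch (not from the talk): lineage-based recovery can be illustrated with a toy dataset type that remembers its parent and the deterministic function used to derive it. The names `Dataset`, `Input`, `Mapped`, `lose` and `isCached` are hypothetical and are not Spark's RDD API; the point is only that a lost, unreplicated partition can be rebuilt from replicated input.

```scala
// Hypothetical illustration only: not Spark's RDD implementation.
object LineageSketch {
  // A dataset can produce the contents of any of its partitions.
  sealed trait Dataset { def compute(partition: Int): Seq[Int] }

  // Input data: assumed to be replicated, so it is always available.
  final case class Input(partitions: Vector[Seq[Int]]) extends Dataset {
    def compute(partition: Int): Seq[Int] = partitions(partition)
  }

  // Derived state: kept only in an in-memory cache (not replicated),
  // but it remembers its lineage (parent + deterministic function).
  final class Mapped(parent: Dataset, f: Int => Int) extends Dataset {
    private val cache = scala.collection.mutable.Map.empty[Int, Seq[Int]]

    // Serve from the cache, or recompute from the parent if missing.
    def compute(partition: Int): Seq[Int] =
      cache.getOrElseUpdate(partition, parent.compute(partition).map(f))

    // Simulate losing a cached partition when a worker fails.
    def lose(partition: Int): Unit = { cache.remove(partition); () }

    def isCached(partition: Int): Boolean = cache.contains(partition)
  }
}
```

Only the lost partition is recomputed, and different lost partitions could be recomputed independently, which is what allows recovery to proceed in parallel.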

Programming Model
• A Discretized Stream, or DStream, is a series of RDDs representing a stream of data
  • API very similar to RDDs
• DStreams can be created…
  • Either from live streaming data
  • Or by transforming other DStreams

DStream Data Sources
• Many sources out of the box
  • HDFS
  • Kafka
  • Flume
  • Twitter
  • TCP sockets
  • Akka actor
  • ZeroMQ
• Easy to add your own
Contributed by external developers

Transformations
Build new streams from existing streams
• RDD-like operations
  • map, flatMap, filter, count, reduce
  • groupByKey, reduceByKey, sortByKey, join
  • etc.
• New window and stateful operations
  • window, countByWindow, reduceByWindow
  • countByValueAndWindow, reduceByKeyAndWindow
  • updateStateByKey
  • etc.

Output Operations
Send data to outside world
• saveAsHadoopFiles
• print: prints on the driver’s screen
• foreach: arbitrary operation on every RDD

Example
Process a stream of Tweets to find the 20 most popular hashtags in the last 10 mins
1. Get the stream of Tweets and isolate the hashtags
2. Count the hashtags over a 10 minute window
3. Sort the hashtags by their counts
4. Get the top 20 hashtags

1. Get the stream of hashtags
    val tweets = ssc.twitterStream(<username>, <password>)    // DStream
    val hashTags = tweets.flatMap(status => getTags(status))  // transformation
[Diagram: for each batch (t-1 … t+4), flatMap turns that batch's tweets RDD into a hashTags RDD; a DStream is a sequence of RDDs]

2. Count the hashtags over 10 min
    val tweets = ssc.twitterStream(<username>, <password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.window(Minutes(10), Seconds(1))  // sliding window operation
                            .map(tag => (tag, 1)).reduceByKey(_ + _)
[Diagram: each tagCounts RDD is computed from the hashTags batches (t-1 … t+4) that fall inside the sliding window]

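2. Count the hashtags over 10 min
    val tweets = ssc.twitterStream(<username>, <password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
[Diagram: each new tagCounts RDD is derived from the previous one by subtracting (–) the counts of the hashTags batch leaving the window and adding (+) those of the batch entering it]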

Smart window-based reduce
• Technique with count generalizes to reduce
  • Need a function to “subtract”
  • Applies to invertible reduce functions
• Could have implemented counting as:
    hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(1), …)
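
Sketch (not from the talk): the difference between re-reducing the whole window and the incremental "add new, subtract old" approach can be shown on a sequence of per-batch counts. `naive` and `incremental` are hypothetical helpers over plain Scala collections, the window is measured in batches rather than in time, and the accumulator starts at 0, which assumes 0 is the identity of the reduce function (true for addition).

```scala
// Hypothetical illustration only: windows are measured in number of batches.
object WindowReduceSketch {
  // Naive: re-reduce every batch in the window each time the window slides.
  def naive(batches: Seq[Int], window: Int): Seq[Int] =
    batches.indices.map { i =>
      batches.slice(math.max(0, i - window + 1), i + 1).sum
    }

  // Incremental: fold in the batch entering the window and use the inverse
  // function to remove the batch leaving it, so each slide costs O(1).
  def incremental(batches: Seq[Int], window: Int)(
      reduce: (Int, Int) => Int,
      invReduce: (Int, Int) => Int): Seq[Int] = {
    val out = Vector.newBuilder[Int]
    var current = 0 // assumes 0 is the identity of `reduce`
    for (i <- batches.indices) {
      current = reduce(current, batches(i))                  // batch entering
      if (i >= window)
        current = invReduce(current, batches(i - window))    // batch leaving
      out += current
    }
    out.result()
  }
}
```

With `_ + _` and `_ - _` the two functions agree on every window; a non-invertible reduce such as `max` has no valid `invReduce`, which is why the slide restricts the technique to invertible functions.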

3. Sort the hashtags by their counts
    val tweets = ssc.twitterStream(<username>, <password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
    val sortedTags = tagCounts.map { case (tag, cnt) => (cnt, tag) }
                              .transform(_.sortByKey(false))
transform allows arbitrary RDD operations to create a new DStream

4. Get the top 20 hashtags
    val tweets = ssc.twitterStream(<username>, <password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
    val sortedTags = tagCounts.map { case (tag, cnt) => (cnt, tag) }
                              .transform(_.sortByKey(false))
    sortedTags.foreach(showTopTags(20) _)  // output operation

10 popular hashtags in last 10 min
    // Create the stream of tweets
    val tweets = ssc.twitterStream(<username>, <password>)
    // Count the tags over a 10 minute window
    val tagCounts = tweets.flatMap(status => getTags(status))
                          .countByValueAndWindow(Minutes(10), Seconds(1))
    // Sort the tags by counts
    val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }
                              .transform(_.sortByKey(false))
    // Show the top 10 tags
    sortedTags.foreach(showTopTags(10) _)
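
Sketch (not from the talk): the per-window logic of this pipeline (extract tags, count, sort, take the top k) can be tried on plain strings without a cluster. The slides do not show `getTags`, so the whitespace-splitting version below is a hypothetical stand-in that works on raw tweet text, and `topTags` is an invented helper that processes the tweets of a single window.

```scala
// Hypothetical illustration only: one window of tweets as plain strings.
object TopTagsSketch {
  // Stand-in for the slides' getTags: keeps whitespace-separated
  // tokens that start with '#'.
  def getTags(text: String): Seq[String] =
    text.split("\\s+").toSeq.filter(t => t.startsWith("#") && t.length > 1)

  // Count each tag in the window, sort by descending count
  // (ties broken alphabetically), and keep the k most frequent.
  def topTags(windowTweets: Seq[String], k: Int): Seq[(Int, String)] =
    windowTweets
      .flatMap(getTags)
      .groupBy(identity)
      .toSeq
      .map { case (tag, occurrences) => (occurrences.size, tag) }
      .sortBy { case (count, tag) => (-count, tag) }
      .take(k)
}
```

The `(count, tag)` ordering mirrors the slide's `map { case (tag, count) => (count, tag) }` step, which puts the count in key position so the stream can be sorted by it.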

Demo

Other Operations
• Maintaining arbitrary state, tracking sessions
    tweets.updateStateByKey(tweet => updateMood(tweet))
• Selecting data directly from a DStream
    tagCounts.slice(<from Time>, <to Time>).sortByKey()
[Diagram: tweets batches (t-1 … t+4) feeding a per-user mood state]
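
Sketch (not from the talk): per-key state that is carried from one batch to the next can be modelled as a function from the previous state map and a batch of key/value pairs to a new state map. The update-function shape `(new values, previous state) => new state` is modelled on Spark Streaming's `updateStateByKey`; the surrounding object and the use of plain `Map`/`Seq` instead of DStreams are assumptions made for illustration.

```scala
// Hypothetical illustration only: plain Maps stand in for state DStreams.
object StateSketch {
  // Apply one batch of (key, value) pairs to the previous per-key state.
  // Returning None from `update` drops the key from the state.
  def updateStateByKey[K, V, S](state: Map[K, S], batch: Seq[(K, V)])(
      update: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
    // Group this batch's values by key.
    val newValues: Map[K, Seq[V]] =
      batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    // Every key with old state or new values gets its update function called.
    val keys = state.keySet ++ newValues.keySet
    keys.flatMap { k =>
      update(newValues.getOrElse(k, Seq.empty), state.get(k)).map(s => (k, s))
    }.toMap
  }
}
```

For a running count per user, `update` could be `(vs, old) => Some(old.getOrElse(0) + vs.sum)`; the state map returned for one batch is passed back in as `state` for the next.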

Performance
Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency

Comparison with others
Higher throughput than Storm
• Spark Streaming: 670k records/second/node
• Storm: 115k records/second/node
• Apache S4: 7.5k records/second/node

Fast Fault Recovery
Recovers from faults/stragglers within 1 sec

Real Applications: Conviva
Real-time monitoring of video metadata
• Implemented Shadoop, a wrapper for Hadoop jobs to run over Spark / Spark Streaming
• Ported parts of Conviva’s Hadoop stack to run on Spark Streaming
    val shJob = new SparkHadoopJob[…]( <Hadoop job> )
    shJob.run( <Spark context> )
[Diagram: a Hadoop job wrapped by Shadoop runs on Spark Streaming]

Real Applications: Conviva
Real-time monitoring of video metadata
• Achieved 1-2 second latency
• Number of video sessions processed (in the millions) scales linearly with cluster size

Real Applications: Mobile Millennium Project
Traffic estimation using online machine learning
• Markov chain Monte Carlo simulations on GPS observations
• Very CPU intensive, requires 10s of machines for useful computation
• Scales linearly with cluster size

Failure Semantics
• Input data replicated by the system
• Lineage of deterministic ops used to recompute RDD from input data if a worker node fails
• Transformations: exactly once
• Output operations: at least once
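
Sketch (not from the talk): because output operations may run more than once for the same batch, one common way to cope is to make the external write idempotent, for example by keying it on the batch time so that a replay overwrites instead of appending. The `Sink` class below is a hypothetical in-memory stand-in for an external store; it is not part of Spark Streaming.

```scala
// Hypothetical illustration only: an in-memory stand-in for an external store.
object IdempotentSinkSketch {
  final class Sink {
    private val store = scala.collection.mutable.Map.empty[Long, Seq[String]]

    // Keyed by batch time: writing the same batch again replaces the
    // earlier copy, so at-least-once delivery does not create duplicates.
    def write(batchTime: Long, rows: Seq[String]): Unit =
      store(batchTime) = rows

    // All rows currently stored, in batch-time order.
    def rows: Seq[String] = store.toSeq.sortBy(_._1).flatMap(_._2)
  }
}
```

Replaying a batch after a failure then leaves the sink in the same state as a single successful write.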

Java API for Streaming
• Developed by Patrick Wendell
• Similar to Spark Java API
• Don’t need to know Scala to try streaming!

Contributors
• 5 contributors from UCB, 3 external contributors (marked *)
  • Matei Zaharia, Haoyuan Li
  • Patrick Wendell
  • Denny Britz
  • Sean McNamara*
  • Prashant Sharma*
  • Nick Pentreath*
  • Tathagata Das

Vision: one stack to rule them all
Spark + Spark Streaming

Conclusion
Alpha to be released with Spark 0.7 by the weekend
Look at the new Streaming Programming Guide
More about the Spark Streaming system in our paper: http://tinyurl.com/dstreams
Join us at Strata on Feb 26 in Santa Clara
