Spark Streaming Preview Fault-Tolerant Stream Processing at Scale Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica UC BERKELEY
Motivation • Many important applications need to process large data streams arriving in real time • User activity statistics (e.g. Facebook’s Puma) • Spam detection • Traffic estimation • Network intrusion detection • Our target: large-scale apps that need to run on tens to hundreds of nodes with O(1 sec) latency
System Goals • Simple programming interface • Automatic fault recovery (including state) • Automatic straggler recovery • Integration with batch & ad-hoc queries (want one API for all your data analysis)
Traditional Streaming Systems • “Record-at-a-time” processing model • Each node has mutable state • Event-driven API: for each record, update state and send out new records [Diagram: input records pushed through nodes 1–3, each node holding mutable state]
Challenges with Traditional Systems • Fault tolerance • Either replicate the whole system (costly) or use upstream backup (slow to recover) • Stragglers (typically not handled) • Consistency (few guarantees across nodes) • Hard to unify with batch processing
Our Model: “Discretized Streams” • Run each streaming computation as a series of very small, deterministic batch jobs • E.g. a MapReduce every second to count tweets • Keep state in memory across jobs • New Spark operators allow “stateful” processing • Recover from faults/stragglers in the same way as MapReduce (by rerunning tasks in parallel)
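The model above can be illustrated with a minimal sketch in plain Scala (no Spark): the stream is a sequence of small, immutable per-interval batches, each processed by a deterministic job, with the running state carried from one interval to the next. All names here are illustrative, not Spark's API.

```scala
// Minimal sketch of the D-Stream model: each element of `batches` is one
// interval's immutable input dataset; the running count is the state
// carried across the per-interval batch jobs.
object MicroBatchSketch {
  def runningCounts(batches: Seq[Seq[String]]): Seq[Int] =
    batches.scanLeft(0)((count, batch) => count + batch.size).tail
}
```

Because each per-interval job is deterministic, re-running it on the same input batch always reproduces the same state, which is what makes MapReduce-style recovery possible.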
Discretized Streams in Action [Diagram: at t = 1, each input (an immutable dataset, stored reliably) feeds a batch operation whose output or state is an immutable dataset stored in memory as a Spark RDD; the pattern repeats at t = 2 and beyond for each input stream]
Example: View Count • Keep a running count of views to each webpage views = readStream("http://...", "1s") ones = views.map(ev => (ev.url, 1)) counts = ones.runningReduce(_ + _) [Diagram: at each interval t, views is mapped to ones, which is reduced together with the previous counts; each dataset is split into partitions]
Fault Recovery • Checkpoint state datasets periodically • If a node fails or straggles, rebuild its data in parallel on other nodes using the dependency graph [Diagram: a lost partition of the output dataset is recomputed by re-running map on the corresponding partitions of the input dataset] Fast recovery without the cost of full replication
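The dependency-graph recovery above can be sketched as follows: each dataset remembers the function and parent it was derived from, so a lost result can be rebuilt by recomputing from the parent. This is a toy illustration in plain Scala; the class and method names are invented here and are not Spark's actual RDD API.

```scala
// Toy lineage sketch: a derived dataset records its parent and the
// function that produced it, so recomputation replaces replication.
sealed trait Lineage[A] { def compute(): Vector[A] }

// A source dataset, assumed stored reliably.
case class Source[A](data: Vector[A]) extends Lineage[A] {
  def compute(): Vector[A] = data
}

// A derived dataset: if its cached output is lost, re-derive it
// from the parent by re-applying the recorded function.
case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
  def compute(): Vector[B] = parent.compute().map(f)
}
```

In the real system, recomputation happens per partition and in parallel across nodes, which is why recovery is fast without replicating every dataset.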
How Fast Can It Go? • Currently handles 4 GB/s of data (42 million records/s) on 100 nodes at sub-second latency • Recovers from failures/stragglers within 1 sec
Outline • Introduction • Programming interface • Implementation • Early results • Future development
D-Streams • A discretized stream is a sequence of immutable, partitioned datasets • Specifically, each dataset is an RDD (resilient distributed dataset), the storage abstraction in Spark • Each RDD remembers how it was created, and can recover if any part of the data is lost
D-Streams • D-Streams can be created… • either from live streaming data • or by transforming other D-streams • Programming with D-Streams is very similar to programming with RDDs in Spark
D-Stream Operators • Transformations • Build new streams from existing streams • Include existing Spark operators, which act on each interval in isolation, plus new “stateful” operators • Output operators • Send data to the outside world (save results to external storage, print to screen, etc.)
Example 1 Count the words received every second words = readStream("http://...", Seconds(1)) counts = words.count() [Diagram: the transformation count is applied to each interval of the words D-Stream, producing one count RDD per interval]
Demo • Setup • 10 EC2 m1.xlarge instances • Each instance receiving a stream of sentences at a rate of 1 MB/s, for a total of 10 MB/s • Spark Streaming receives the sentences and processes them
Example 2 Count frequency of words received every second words = readStream("http://...", Seconds(1)) ones = words.map(w => (w, 1)) freqs = ones.reduceByKey(_ + _) (the arguments to map and reduceByKey are Scala function literals) [Diagram: in each interval, words is mapped to ones, then reduced to freqs]
Example 3 Count frequency of words received in the last minute ones = words.map(w => (w, 1)) freqs = ones.reduceByKey(_ + _) freqs_60s = freqs.window(Seconds(60), Seconds(1)).reduceByKey(_ + _) window is a sliding window operator: the first argument is the window length, the second is how far the window moves each step [Diagram: freqs_60s reduces over a sliding window of freqs RDDs]
Simpler running reduce freqs = ones.reduceByKey(_ + _) freqs_60s = freqs.window(Seconds(60), Seconds(1)).reduceByKey(_ + _) can be written more simply as freqs = ones.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1))
“Incremental” window operators [Diagram: with a plain aggregation function, each window result re-reduces all the freqs RDDs the window covers; with an invertible aggregation function, the next window result is computed incrementally as previous window + interval entering − interval leaving]
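The incremental update for invertible aggregation functions can be sketched in plain Scala: instead of re-summing every interval in the window, add the per-interval value that just entered and subtract the one that just left. The function and parameter names below are illustrative, not Spark's API.

```scala
// Incremental sliding-window sum, assuming the aggregation (+) has an
// inverse (-). perInterval(i) holds the count for interval i; the window
// covers `length` intervals and slides by one interval per step.
object IncrementalWindow {
  def slidingSums(perInterval: Vector[Int], length: Int): Vector[Int] = {
    val first = perInterval.take(length).sum          // full reduce, once
    perInterval.drop(length).zip(perInterval)          // (entering, leaving)
      .scanLeft(first) { case (acc, (entering, leaving)) =>
        acc + entering - leaving                       // O(1) per step
      }
  }
}
```

Each step costs O(1) regardless of the window length, which is why the invertible form of reduceByKeyAndWindow is called the "smarter" running reduce.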
Smarter running reduce freqs = ones.reduceByKey(_ + _) freqs_60s = freqs.window(Seconds(60), Seconds(1)).reduceByKey(_ + _) freqs = ones.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1)) freqs = ones.reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(1))
Output Operators • save: write results to any Hadoop-compatible storage system (e.g. HDFS, HBase) • foreachRDD: run a Spark function on each RDD freqs.save("hdfs://...") words.foreachRDD(wordsRDD => { // any Spark/Scala processing, maybe save to a database })
Live + Batch + Interactive • Combining D-Streams with historical datasets pageViews.join(historicCounts).map(...) • Interactive queries on stream state from the Spark interpreter pageViews.slice("21:00", "21:05").topK(10)
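Because stream state is just a series of interval-keyed datasets, interactive queries like the slice/topK example above reduce to a range lookup plus a merge. A toy sketch in plain Scala (intervals keyed by index rather than wall-clock time; slice and topK here are invented stand-ins, not the real Spark operators):

```scala
// Toy sketch of interactive queries over per-interval stream state.
object InteractiveSketch {
  type Counts = Map[String, Int]

  // slice: merge the per-interval counts for intervals start..end
  def slice(byInterval: Map[Int, Counts], start: Int, end: Int): Counts =
    (start to end).flatMap(byInterval.get).flatten
      .groupMapReduce(_._1)(_._2)(_ + _)

  // topK: the k keys with the highest merged counts
  def topK(counts: Counts, k: Int): List[(String, Int)] =
    counts.toList.sortBy(-_._2).take(k)
}
```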
Outline • Introduction • Programming interface • Implementation • Early results • Future development
System Architecture Built on an optimized version of Spark [Diagram: a Master running the task scheduler, block tracker, and D-Stream lineage; Workers each running an input receiver, task execution, and a block manager; Clients sending input; input and checkpoint RDDs replicated across workers]
Implementation Optimizations on current Spark: • New block store • APIs: Put(key, value, storage level), Get(key) • Optimized scheduling for <100ms tasks • Bypass Mesos cluster scheduler (tens of ms) • Fast NIO communication library • Pipelining of jobs from different time intervals
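The Put/Get block-store interface listed above can be sketched as a minimal thread-safe in-memory store. This is an illustration of the interface shape only: the class name is invented here, and the storage level is reduced to a plain string.

```scala
import scala.collection.concurrent.TrieMap

// Minimal sketch of the block store's Put(key, value, storage level) /
// Get(key) interface, backed by a concurrent in-memory map.
class BlockStore {
  private val blocks = TrieMap.empty[String, (Array[Byte], String)]

  def put(key: String, value: Array[Byte], storageLevel: String): Unit =
    blocks.put(key, (value, storageLevel))

  def get(key: String): Option[Array[Byte]] =
    blocks.get(key).map(_._1)
}
```

Keeping blocks addressable by key is what lets the master's block tracker locate input and checkpoint data on any worker.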
Evaluation • Ran on up to 100 “m1.xlarge” machines on EC2 • 4 cores, 15 GB RAM each • Three applications: • Grep: count lines matching a pattern • Sliding word count • Sliding top K words
Scalability [Graphs: maximum throughput possible with 1 s or 2 s latency, using 100-byte records (100K–500K records/s/node)]
Performance vs Storm and S4 • Storm limited to 10,000 records/s/node • Also tried S4: 7,000 records/s/node • Commercial systems report on the order of 100K records/s in aggregate
Fault Recovery • Recovers from failures within 1 second [Graph: Sliding WordCount on 10 nodes with a 30 s checkpoint interval]
Fault Recovery [Graphs: recovery time under failures and under stragglers]
Outline • Introduction • Programming interface • Implementation • Early results • Future development
Future Development • An alpha of discretized streams will go into Spark by the end of the summer • Engine improvements from the Spark Streaming project are already there (“dev” branch) • Together, these make Spark a powerful platform for both batch and near-real-time analytics
Future Development • Other things we’re working on/thinking of: • Easier deployment options (standalone & YARN) • Hadoop-based deployment (run as Hadoop job)? • Run Hadoop mappers/reducers on Spark? • Java API? • Need your feedback to prioritize these!
More Details • You can find more about Spark Streaming in our paper: http://tinyurl.com/dstreams
Related Work • Bulk incremental processing (CBP, Comet) • Periodic (~5 min) batch jobs on Hadoop/Dryad • On-disk, replicated FS for storage instead of RDDs • Hadoop Online • Does not recover stateful ops or allow multi-stage jobs • Streaming databases • Record-at-a-time processing, generally use replication for fault tolerance • Approximate query processing, load shedding • Do not support the loss of arbitrary nodes • Different math because the drop rate is known exactly • Parallel recovery (MapReduce, GFS, RAMCloud, etc.)
Timing Considerations • D-streams group input into intervals based on when records arrive at the system • For apps that need to group by an “external” time and tolerate network delays, support: • Slack time: delay starting a batch for a short fixed time to give records a chance to arrive • Application-level correction: e.g. give a result for time t at time t+1, then use later records to update incrementally at time t+5
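The slack-time idea above can be sketched in plain Scala: records are bucketed by their own "external" timestamp, but an interval's batch is only considered ready once the system clock has moved a fixed slack past the interval's end, giving delayed records a chance to arrive. All names and the millisecond units here are illustrative, not the real Spark API.

```scala
// Sketch of slack-time batching: group records by external timestamp,
// but only release intervals whose end time is at least `slackMs` old.
object SlackBucketing {
  case class Record(externalTime: Long, value: String)

  def readyIntervals(records: Seq[Record], intervalMs: Long,
                     slackMs: Long, nowMs: Long): Map[Long, Seq[Record]] =
    records
      .groupBy(_.externalTime / intervalMs)            // bucket by external time
      .filter { case (interval, _) =>                  // interval i covers
        (interval + 1) * intervalMs + slackMs <= nowMs // [i*T, (i+1)*T)
      }
}
```

Records that still miss the slack window would fall to the second mechanism on this slide: emit a provisional result on time, then apply an application-level correction when the stragglers arrive.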