MapReduce Online Tyson Condie UC Berkeley Slides by Kaixiang MO kxmo@cse.ust.hk
Outline • Background • Motivation: Block vs pipeline • Hadoop Online Prototype Model • Pipeline within job • Online Aggregation • Continuous Queries • Conclusion
Background • MapReduce system • Massive data parallelism, batch-oriented, high throughput • Fault tolerance by materializing results on HDFS • However, this fits poorly with: • Stream processing: analyzing continuous streams of data • Online aggregation: results used interactively
Motivation: Batch vs online • Batch: • Reduce begins only after all map tasks finish • High throughput, high latency • Online: • Stream processing is usually not fault tolerant • Lower latency • Blocking does not fit online/stream data: • Final answers only • No support for infinite streams • Fault tolerance is important; how do we keep it?
MapReduce Job (word count) • Map step • Parse input into words • For each word, output <word, 1> • Reduce step • For each word, receive the list of counts • Sum the counts and output <word, sum> • Combine step (optional) • Pre-aggregate map output • Same function as reduce
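For concreteness, here is roughly what this word-count job looks like in the standard Hadoop Java API; the class names are illustrative, and the reducer doubles as the optional combiner.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Standard word-count job: map emits <word, 1>, reduce sums the counts.
public class WordCount {
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // emit <word, 1>
      }
    }
  }

  // Also usable as the combiner: it pre-aggregates map output locally.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);          // emit <word, sum>
    }
  }
}
```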
MapReduce steps • Client submits the job; the master schedules map and reduce tasks • Map step • Group (sort) step • Combine step (optional) • Commit step • Map finishes
MapReduce steps • Master tells reducers where the map output is located • Shuffle (pull data) step • Group (merge-sort) step • Does the reducer start too late? • Reduce step • Reduce finishes • Job finishes
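A minimal driver that submits such a job in stock Hadoop might look like the following; it is illustrative and reuses the WordCount classes sketched above. Note the blocking behavior the next slides address: in stock Hadoop, reducers only begin pulling data after map tasks commit their output.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Submits the word-count job; the master (JobTracker) schedules the tasks.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class); // optional pre-aggregation
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // In stock Hadoop, reducers only start fetching after maps commit their output.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```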
Hadoop Online Prototype (HOP) • Key idea: pipelining between operators • Data is pushed from mappers to reducers • Data transfer overlaps with map/reduce computation • Still fault tolerant • Benefits • Lower latency • Higher utilization • Smoother network traffic
Performance at a Glance • In some cases, HOP can reduce job completion time by 25%.
Pipeline within a Job • Naive design: pipeline each record • Prevents the map side from grouping and combining • Heavy network I/O load • Mappers can flood and overwhelm reducers • Revised design: pipeline small sorted runs (spills), as sketched below • Task thread: applies the map/reduce function and buffers output • Spill thread: sorts & combines the buffer, spills it to a file • TaskTracker: serves spill files to consumers on request
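The following is a hypothetical sketch (not actual HOP code) of the spill-based pipelining described above: a task thread buffers map output, while a spill thread periodically sorts and combines the buffer into a run that the TaskTracker can serve to reducers. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch of pipelining small sorted runs ("spills").
public class SpillPipelineSketch {
  // A spill is a sorted, combined run of <word, count> pairs awaiting transfer.
  private final BlockingQueue<TreeMap<String, Integer>> spills =
      new LinkedBlockingQueue<>();
  private final List<Map.Entry<String, Integer>> buffer = new ArrayList<>();

  // Task thread: apply the map function and buffer its raw output.
  public void collect(String word) {
    synchronized (buffer) {
      buffer.add(Map.entry(word, 1));
    }
  }

  // Spill thread: sort & combine the buffer, then publish it as one spill.
  public void spill() throws InterruptedException {
    TreeMap<String, Integer> run = new TreeMap<>(); // sorted by key
    synchronized (buffer) {
      for (Map.Entry<String, Integer> e : buffer) {
        run.merge(e.getKey(), e.getValue(), Integer::sum); // combiner step
      }
      buffer.clear();
    }
    if (!run.isEmpty()) {
      spills.put(run); // the TaskTracker serves queued spills to reducers on request
    }
  }
}
```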
Utility balance control • Mappers send early results: computation (group & combine) moves from the mapper to the reducer • If the reducer is fast: the mapper pipelines more aggressively and does less sort & spill work • If the reducer is slow: the mapper pipelines less aggressively and does more sort & spill work • Halt pipelining when spills back up or when the combiner is effective (so combining map-side pays off) • Resume pipelining by merging & combining the accumulated spill files
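A minimal sketch of this balancing heuristic, assuming a simple backlog counter of unsent spill files; the threshold and method names are illustrative, not HOP's actual implementation.

```java
// Illustrative throttle: pipeline eagerly while reducers keep up, otherwise
// halt, let spills accumulate, and merge & combine them before resuming.
public class PipelineThrottleSketch {
  private static final int BACKLOG_LIMIT = 4; // unsent spill files (illustrative)

  private int unsentSpills = 0;
  private boolean pipelining = true;

  public void onSpillCreated()       { unsentSpills++; adjust(); }
  public void onSpillSentToReducer() { unsentSpills--; adjust(); }

  private void adjust() {
    if (unsentSpills > BACKLOG_LIMIT) {
      // Reducers are slow: stop pushing, do more work (sort & combine) map-side.
      pipelining = false;
      mergeAndCombineAccumulatedSpills();
    } else if (unsentSpills == 0) {
      // Reducers are keeping up: push small spills aggressively.
      pipelining = true;
    }
  }

  private void mergeAndCombineAccumulatedSpills() {
    // Merge the accumulated runs into one larger combined run (the combiner's
    // job), then resume pipelining with the merged spill.
  }

  public boolean shouldPipeline() { return pipelining; }
}
```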
Pipelined fault tolerance (PFT) • Simple PFT design (coarse-grained): • Reducers treat in-progress map output as tentative • If the map succeeds, accept its output • If the map dies, discard its output • Revised PFT design (finer-grained): • Record each mapper's progress and recover from the latest checkpoint • Correctness: the reduce task verifies that spill files are complete • Map tasks recover from the latest checkpoint, producing no redundant spill files • The master is busier here: • It must record progress for each map task • It must record whether each map output has been sent
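The reduce-side bookkeeping for pipelined fault tolerance might look roughly like the sketch below; the data structures and method names are assumptions for illustration, not HOP's actual code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: spills from in-progress maps are held as tentative and
// only become authoritative reduce input once the map commits; on failure the
// restarted attempt resumes from the last checkpointed spill.
public class TentativeSpillTrackerSketch {
  private final Map<String, Set<Integer>> tentativeSpills = new HashMap<>();
  private final Map<String, Integer> committedCheckpoint = new HashMap<>();

  // Record a spill received from a map task that has not committed yet.
  public void onSpillReceived(String mapTaskId, int spillIndex) {
    tentativeSpills.computeIfAbsent(mapTaskId, k -> new HashSet<>()).add(spillIndex);
  }

  // Map committed: its spills become part of the authoritative reduce input.
  public void onMapCommit(String mapTaskId, int lastSpillIndex) {
    committedCheckpoint.put(mapTaskId, lastSpillIndex);
    tentativeSpills.remove(mapTaskId);
  }

  // Map died: discard its tentative spills; the new attempt restarts from the
  // last checkpoint so no committed spill is produced twice.
  public int onMapFailure(String mapTaskId) {
    tentativeSpills.remove(mapTaskId);
    return committedCheckpoint.getOrDefault(mapTaskId, -1) + 1; // next spill to send
  }
}
```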
System fault tolerance • Mapper fails • A new mapper starts from the checkpoint and its output is sent to the reducers • Reducer fails • All mappers resend their intermediate results; mappers still store intermediate results on local disk, but reducers no longer have to block • Master fails • The system cannot survive a master failure
Online aggregation • Show snapshots of the reduce output from time to time • Report progress (the percentage of map output the reducer has received)
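A hypothetical sketch of online aggregation: the reducer periodically applies the reduce function to whatever map output has arrived and publishes it as a snapshot together with a progress estimate. The names and the progress metric (fraction of finished maps) are illustrative.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: snapshots are the reduce function applied to partial input.
public class OnlineAggregationSketch {
  private final TreeMap<String, Integer> received = new TreeMap<>();
  private final int totalMaps;
  private int mapsFinished = 0;

  public OnlineAggregationSketch(int totalMaps) { this.totalMaps = totalMaps; }

  // Fold an incoming spill into the running aggregate.
  public void onSpill(Map<String, Integer> spill, boolean mapCommitted) {
    spill.forEach((k, v) -> received.merge(k, v, Integer::sum));
    if (mapCommitted) mapsFinished++;
  }

  // Snapshot = reduce over the data seen so far; progress = fraction of maps done.
  public Map<String, Integer> snapshot() {
    System.out.printf("snapshot at %.0f%% of map input%n",
        100.0 * mapsFinished / totalMaps);
    return new TreeMap<>(received);
  }
}
```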
Pipeline between jobs • Assume we run job1 and job2, where job2 needs job1's output • Snapshot job1's output from time to time and pipeline it to job2 • Fault tolerance: • Job1 fails: recover as before • Job2 fails: restart the failed tasks • Both fail: job2 restarts from the latest snapshot
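Inter-job pipelining could be sketched as below, with job1's reducer periodically publishing a versioned snapshot that job2's mappers consume; this is an illustrative sketch under those assumptions, not the HOP implementation.

```java
import java.util.Map;

// Illustrative sketch: job1 publishes periodic snapshots, job2 consumes the latest.
public class InterJobPipelineSketch {
  private volatile Map<String, Integer> latestSnapshot = Map.of();
  private volatile int snapshotVersion = 0;

  // Called by job1's reducer from time to time with its partial output.
  public synchronized void publishSnapshot(Map<String, Integer> partialOutput) {
    latestSnapshot = Map.copyOf(partialOutput);
    snapshotVersion++;
  }

  // Called by job2's map tasks; if both jobs fail, job2 restarts from the
  // latest published snapshot rather than from scratch.
  public Map<String, Integer> latestSnapshot() { return latestSnapshot; }
  public int version() { return snapshotVersion; }
}
```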
Continuous Queries • Mapper: add a flush API; buffer output locally if the reducer is unavailable • Reducer: runs periodically • Triggered by wall-clock time, logical time, number of input rows, etc. • Fixed number of mappers and reducers • Fault tolerance: • Mappers cannot retain unbounded intermediate results • Reducers save checkpoints to HDFS
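A rough sketch of a continuous query's reduce side, assuming wall-clock and row-count triggers; the window size, thresholds, and checkpoint step are illustrative placeholders, not part of HOP's API.

```java
import java.util.TreeMap;

// Illustrative sketch: the reducer runs periodically over a bounded window of
// the stream, emits the aggregate, and checkpoints so it never needs to replay
// the whole (unbounded) input.
public class ContinuousReducerSketch {
  private static final long WINDOW_MILLIS = 60_000; // trigger: wall-clock time
  private static final int MAX_ROWS = 100_000;      // trigger: #input rows

  private final TreeMap<String, Integer> window = new TreeMap<>();
  private long windowStart = System.currentTimeMillis();
  private int rowsInWindow = 0;

  public void onRecord(String key, int value) {
    window.merge(key, value, Integer::sum);
    rowsInWindow++;
    if (rowsInWindow >= MAX_ROWS
        || System.currentTimeMillis() - windowStart >= WINDOW_MILLIS) {
      emitAndCheckpoint();
    }
  }

  private void emitAndCheckpoint() {
    // Emit the aggregate for this window and checkpoint state (HOP stores
    // reducer checkpoints on HDFS), then reset for the next window.
    System.out.println("window result: " + window);
    window.clear();
    rowsInWindow = 0;
    windowStart = System.currentTimeMillis();
  }
}
```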
Impact of #Reducers • With enough reducers, HOP is faster • With too few reducers, HOP is slower • HOP cannot rebalance workload between mappers and reducers
Small vs large blocks • With large blocks, HOP is faster because reducers do not sit idle waiting for complete map output.
Small vs large blocks • With small blocks, HOP is still faster, but the advantage is smaller.
Discussion • HOP improves Hadoop for real-time/stream processing and is most useful when there are few concurrent jobs. • Finer-grained control makes the master busier, which can hurt scalability. • With many jobs, it may increase computation and reduce throughput (busier network, more per-job overhead for the master).