HaLoop: Efficient Iterative Data Processing On Large Scale Clusters Presentation by Amr Swafta
Outline • Introduction / Motivation • Iterative Application Example • HaLoop Architecture • Task Scheduling • Caching and Indexing • Experiments & Results • Conclusion
Introduction / Motivation • HaLoop is a modified version of the Hadoop MapReduce framework designed to serve iterative applications. • The MapReduce framework cannot directly support recursion/iteration. • Many data analysis techniques require iterative computations: • PageRank • Clustering • Neural-network analysis • Social network analysis
Iterative Application Example • PageRank algorithm: a system for ranking web pages. PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) • Where: - PR(A): the PageRank of page A. - PR(Ti): the PageRank of each page Ti that links to page A. - C(Ti): the number of outbound links on page Ti. - d: a damping factor, which can be set between 0 and 1.
Consider a small web consisting of three pages A, B, and C with d = 0.5, where C links to A, A links to B and C, and B links to C. • The PageRank values are then calculated as follows: PR(A) = 0.5 + 0.5 PR(C) PR(B) = 0.5 + 0.5 (PR(A) / 2) PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
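As a quick sanity check (not part of the original slides), a minimal Python sketch that iterates these three equations until the ranks stop changing; the link structure A -> {B, C}, B -> {C}, C -> {A} is the one implied by the equations above:

```python
# Minimal sketch: iterate the three PageRank equations until convergence.
# Assumed link structure: A -> {B, C}, B -> {C}, C -> {A}; damping d = 0.5.
d = 0.5
pr = {"A": 1.0, "B": 1.0, "C": 1.0}  # initial guesses

for _ in range(100):
    new_pr = {
        "A": (1 - d) + d * pr["C"],                  # C links to A
        "B": (1 - d) + d * (pr["A"] / 2),            # A links to B (A has 2 out-links)
        "C": (1 - d) + d * (pr["A"] / 2 + pr["B"]),  # A and B link to C
    }
    # Fixpoint test: stop when no rank changes by more than a small epsilon.
    if all(abs(new_pr[p] - pr[p]) < 1e-6 for p in pr):
        pr = new_pr
        break
    pr = new_pr

print(pr)  # converges to roughly A = 1.077, B = 0.769, C = 1.154
```

This recompute-and-test loop is exactly the pattern of workload HaLoop is designed to run efficiently.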
HaLoop Architecture • HaLoop’s master node contains a new loop control module that repeatedly starts new map-reduce steps that compose the loop body. • HaLoop uses a modified task scheduler for iterative applications. • HaLoop caches and indexes application data on slave nodes.
Difference between Hadoop and HaLoop for iterative applications. • Note: The loop control is pushed from the application into the infrastructure.
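To make that note concrete, here is a purely hypothetical Python sketch of the two driver styles; none of these function or method names come from Hadoop's or HaLoop's real APIs:

```python
# Hypothetical sketch only: contrasts who drives the loop.

# Hadoop style: the *application* drives the loop, submitting one MapReduce
# job per iteration and checking convergence itself.
def hadoop_style_driver(run_mapreduce_job, converged, max_iters=50):
    ranks = "ranks_iter_0"                          # path to initial ranks
    for i in range(max_iters):
        new_ranks = run_mapreduce_job(input_path=ranks,
                                      output_path=f"ranks_iter_{i + 1}")
        if converged(ranks, new_ranks):             # client-side fixpoint check
            return new_ranks
        ranks = new_ranks
    return ranks

# HaLoop style: the application registers the loop body and termination test
# once; the framework's loop control module re-runs the body and evaluates
# the fixpoint itself, reusing its caches between iterations.
def haloop_style_driver(job):
    job.add_loop_body(mapper="rank_map", reducer="rank_reduce")
    job.set_fixed_point(distance="l1", threshold=1e-6)
    job.set_max_iterations(50)
    job.submit()    # the infrastructure, not the application, drives the loop
```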
Task Scheduling • Inter-iteration locality: place map and reduce tasks that occur in different iterations but access the same data on the same physical machines. • This allows cached data to be reused between iterations. • The schedule exhibits inter-iteration locality if: for all i > 1, Ti(d) and Ti−1(d) are assigned to the same physical node, where d is a mapper/reducer input partition and Ti(d) is the task that consumes d in iteration i.
- Scheduling of the first iteration is the same in Hadoop and HaLoop. - Subsequent iterations put tasks that access the same data on the same physical node.
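A rough Python sketch of the inter-iteration locality rule, assuming a simple table that remembers which node processed each partition in the previous iteration (the data structures are illustrative, not HaLoop's implementation):

```python
# Illustrative scheduler: after iteration 1, a task that consumes partition d
# is sent back to the node that already holds d's cached data.

def schedule(iteration, partitions, nodes, prev_placement):
    """Return a dict mapping partition -> node for this iteration."""
    assignment = {}
    for p in partitions:
        if iteration == 1 or p not in prev_placement:
            # First iteration (or no history): ordinary Hadoop-style placement,
            # here simply hashed over the available nodes.
            assignment[p] = nodes[hash(p) % len(nodes)]
        else:
            # Later iterations: reuse the node that cached p last time.
            assignment[p] = prev_placement[p]
    return assignment

nodes = ["node1", "node2", "node3"]
prev = {"part0": "node1", "part1": "node2", "part2": "node3"}
print(schedule(2, ["part0", "part1", "part2"], nodes, prev))
# -> every partition stays on its iteration-1 node
```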
Caching and Indexing • To reduce I/O cost, HaLoop caches loop-invariant data partitions on a physical node's local disk for subsequent reuse. • To further accelerate processing, it indexes the cached data. - Keys and values are stored in separate local files. • Types of caches: - Reducer Input Cache - Reducer Output Cache - Mapper Input Cache
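A toy sketch of the "keys and values in separate local files" idea: the key file doubles as an index, recording the byte offset of each key's values in the value file so a later iteration can seek straight to them (the file layout is illustrative, not HaLoop's on-disk format):

```python
import json

# Toy cache: keys (with offsets) in one local file, values in another.

def write_cache(pairs, key_path="cache.keys", val_path="cache.vals"):
    index = {}
    with open(val_path, "w") as vf:
        for key, values in pairs:
            index[key] = vf.tell()              # offset of this key's values
            vf.write(json.dumps(values) + "\n")
    with open(key_path, "w") as kf:
        json.dump(index, kf)

def read_values(key, key_path="cache.keys", val_path="cache.vals"):
    with open(key_path) as kf:
        index = json.load(kf)
    with open(val_path) as vf:
        vf.seek(index[key])                     # jump straight to the values
        return json.loads(vf.readline())

write_cache([("url_A", [0.5, 0.25]), ("url_B", [0.125])])
print(read_values("url_B"))                     # -> [0.125]
```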
Reducer Input Cache … • Provides access to loop-invariant data without the map/shuffle phases. • RI cached data is used by the reducer function. • Assumes: • Mapper output is constant across iterations. • Static partitioning (implies: no new nodes). • In HaLoop, the number of reducer tasks is unchanged across iterations.
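A sketch of how a reducer might combine the two streams in later iterations, assuming (as in a PageRank-style job) that the loop-invariant link structure comes from the local reducer input cache and only the small, loop-variant rank contributions are shuffled; the function below is illustrative, not HaLoop code:

```python
# Illustrative reducer: invariant out-link lists come from the local cache,
# loop-variant rank contributions arrive via the shuffle.

def reduce_rank(page, shuffled_contributions, cached_links, d=0.5):
    """shuffled_contributions: rank mass sent to `page` this iteration.
    cached_links: loop-invariant adjacency read from the local cache."""
    new_rank = (1 - d) + d * sum(shuffled_contributions)
    # Emit (target, contribution) pairs for the next iteration using the
    # cached link structure instead of re-shuffling it every time.
    out_links = cached_links.get(page, [])
    share = new_rank / len(out_links) if out_links else 0.0
    return new_rank, [(target, share) for target in out_links]

cached_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(reduce_rank("A", [0.3, 0.2], cached_links))
# -> (0.75, [('B', 0.375), ('C', 0.375)])
```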
Reducer Output Cache … • Stores and indexes the most recent local output on each reducer node. • Provides distributed access to the output of previous iterations. • RO cached data is used by fixpoint evaluation. • It is particularly efficient when fixpoint evaluation must be conducted after each iteration.
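A sketch of the local fixpoint test the reducer output cache enables: each reducer compares its fresh output against its cached output from the previous iteration, so no extra distributed job is needed (the distance metric and threshold here are illustrative):

```python
# Illustrative fixpoint check using the previous iteration's cached output.

def local_fixpoint_reached(current, previous, threshold=1e-4):
    """current, previous: dicts of key -> value produced by this reducer."""
    distance = sum(abs(current[k] - previous.get(k, 0.0)) for k in current)
    return distance < threshold

prev_out = {"A": 1.07, "B": 0.77, "C": 1.15}
curr_out = {"A": 1.0769, "B": 0.7692, "C": 1.1538}
print(local_fixpoint_reached(curr_out, prev_out, threshold=0.05))  # True
```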
Mapper Input Cache … • In the first iteration, if a mapper performs a non-local read on an input split, the split is cached on the local disk of the mapper's physical node. • In later iterations, all mappers read data only from local disks. • MI cached data is used during the scheduling of map tasks.
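A toy sketch of that first-iteration behavior, using ordinary file copies to stand in for remote HDFS reads (paths and helper names are hypothetical):

```python
import os
import shutil

# Toy mapper input cache: a split read in iteration 1 is copied to the mapper
# node's local disk; later iterations open the local copy directly.

def open_split(split_path, iteration, cache_dir="/tmp/mi_cache"):
    os.makedirs(cache_dir, exist_ok=True)
    local_copy = os.path.join(cache_dir, os.path.basename(split_path))
    if iteration == 1 and not os.path.exists(local_copy):
        shutil.copy(split_path, local_copy)   # the one-time "remote" read
    return open(local_copy)                    # later iterations: local only
```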
Cache Reloading • Cached data must be reloaded in two cases: 1- The hosting node fails. 2- The hosting node has a full load and a map or reduce task must be scheduled on a different, substitution node.
Experiments & Results • HaLoop is evaluated on real queries and real datasets. • Compared with Hadoop, on average, HaLoop reduces query runtimes by 1.85, and shuffles only 4% of the data between mappers and reducers.
Evaluation of Reducer Input Cache • Overall runtime.
Evaluation of Reducer Output Cache • Time spent on fixpoint evaluation (in seconds) in each iteration: LiveJournal dataset on 50 nodes and Freebase dataset on 90 nodes.
Evaluation of Mapper Input Cache • Overall runtime: Cosmo-dark dataset on 8 nodes and Cosmo-gas dataset on 8 nodes.
Conclusion • The authors present the design, implementation, and evaluation of HaLoop, a novel parallel and distributed system that supports large-scale iterative data analysis applications. • HaLoop is built on top of Hadoop and extends it with several important optimizations, including: - A loop-aware task scheduler - Loop-invariant data caching - Caching for efficient fixpoint verification.