On Availability of Intermediate Data in Cloud Computations Steven Y. Ko, Imranul Hoque, Brian Cho, and Indranil Gupta Distributed Protocols Research Group (DPRG) University of Illinois at Urbana-Champaign
Our Position • Intermediate data as a first-class citizen for dataflow programming frameworks in clouds • Dataflow programming frameworks • The importance of intermediate data • Outline of a solution • This talk • Builds up the case • Emphasizes the need, not the solution
Dataflow Programming Frameworks • Runtime systems that execute dataflow programs • MapReduce (Hadoop), Pig, Hive, etc. • Gaining popularity for massive-scale data processing • Distributed and parallel execution on clusters • A dataflow program consists of • Multi-stage computation • Communication patterns between stages
Example 1: MapReduce • Two-stage computation with all-to-all communication • Introduced by Google, open-sourced by Yahoo! as Hadoop • Two functions – Map and Reduce – supplied by the programmer (sketched below) • Massively parallel execution of Map and Reduce
[Diagram: Stage 1 (Map) → Shuffle (all-to-all) → Stage 2 (Reduce)]
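To make the programming model concrete, here is a minimal sketch of the two programmer-supplied functions, written against the Hadoop MapReduce API; the word-count logic is an illustrative example, not part of the original slides.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: runs in parallel over input splits; emits (word, 1) pairs.
// Its output is the intermediate data, written to local disk.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      ctx.write(word, ONE); // one intermediate (key, value) pair
    }
  }
}

// Reduce: runs after the all-to-all shuffle; sums the counts per word.
// Its output is the final result, written to HDFS.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();
    ctx.write(word, new IntWritable(sum));
  }
}
```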
Example 2: Pig and Hive • Pig from Yahoo! & Hive from Facebook • Built atop MapReduce • Declarative, SQL-style languages • Automatic generation & execution of multiple MapReduce jobs
Example 2: Pig and Hive • Multi-stage computation with either all-to-all or 1-to-1 communication between stages (see the chaining sketch below)
[Diagram: Stage 1 (Map) → Shuffle (all-to-all) → Stage 2 (Reduce) → 1-to-1 → Stage 3 (Map) → Stage 4 (Reduce)]
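The multi-stage plans that Pig and Hive generate boil down to chains of MapReduce jobs, each consuming the previous job's output. The sketch below shows such a chain hand-coded with the Hadoop Job API; the paths and job names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Two chained MapReduce jobs: stage 2 reads stage 1's output 1-to-1.
// Everything written under /tmp/stage1-out is intermediate data.
public class TwoStageChain {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job stage1 = Job.getInstance(conf, "stage-1");
    // ... set mapper/reducer classes for stage 1 here ...
    FileInputFormat.addInputPath(stage1, new Path("/data/input"));
    FileOutputFormat.setOutputPath(stage1, new Path("/tmp/stage1-out"));
    if (!stage1.waitForCompletion(true)) System.exit(1);

    Job stage2 = Job.getInstance(conf, "stage-2");
    // ... set mapper/reducer classes for stage 2 here ...
    FileInputFormat.addInputPath(stage2, new Path("/tmp/stage1-out"));
    FileOutputFormat.setOutputPath(stage2, new Path("/data/output"));
    System.exit(stage2.waitForCompletion(true) ? 0 : 1);
  }
}
```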
Usage • Google (MapReduce) • Indexing: a chain of 24 MapReduce jobs • ~200K jobs processing 50PB/month (in 2006) • Yahoo! (Hadoop + Pig) • WebMap: a chain of 100 MapReduce jobs • Facebook (Hadoop + Hive) • ~300TB total, adding 2TB/day (in 2008) • 3K jobs processing 55TB/day • Amazon • Elastic MapReduce service (pay-as-you-go) • Academic clouds • Google-IBM Cluster at UW (Hadoop service) • CCT at UIUC (Hadoop & Pig service)
One Common Characteristic • Intermediate data: the data passed between stages • Similar to traditional intermediate data, e.g., .o files • Critical to producing the final output • Short-lived, written once, read once, and used immediately
One Common Characteristic • Intermediate data • Written locally & read remotely • Potentially very large, depending on the workload • Forms a computational barrier between stages
[Diagram: Stage 1 (Map) → computational barrier → Stage 2 (Reduce)]
Computational Barrier + Failures • Availability becomes critical • Loss of intermediate data before or during the execution of a task => the task cannot proceed
[Diagram: Stage 1 (Map) → Stage 2 (Reduce)]
Current Solution • Store locally & re-generate when lost • Re-run the affected Map & Reduce tasks (sketched below) • No support from a storage system • Assumption: re-generation is cheap and easy
[Diagram: Stage 1 (Map) → Stage 2 (Reduce)]
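In outline, the store-locally-and-re-generate policy behaves like the recursive sketch below. This is our own illustration: Task, Output, and LocalDisk are hypothetical types, not Hadoop APIs, and the recursion is exactly what makes multi-stage losses cascade.

```java
import java.util.List;

// Hypothetical types for illustration only; not Hadoop APIs.
interface Output {}
interface Task {
  String outputId();
  List<Task> upstreamTasks();
  Output run();
}
interface LocalDisk { Output read(String id); }

class ReGeneration {
  private final LocalDisk localDisk;
  ReGeneration(LocalDisk localDisk) { this.localDisk = localDisk; }

  // "Store locally, re-generate when lost", as current frameworks do it.
  Output ensureOutput(Task task) {
    Output out = localDisk.read(task.outputId());
    if (out != null) return out;   // intermediate data still present
    // Output lost (e.g., the machine failed): first re-generate every
    // missing upstream input...
    for (Task producer : task.upstreamTasks()) {
      ensureOutput(producer);      // may recurse across many stages
    }
    // ...then re-run the task itself. Cheap for a single stage, but the
    // recursion above is the cascaded re-execution shown later.
    return task.run();
  }
}
```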
Hadoop Experiment • Emulab setting (for all plots in this talk) • 20 machines sorting 36GB • 4 LANs and a core switch (all 100 Mbps) • Normal execution: Map, Shuffle, then Reduce
[Plot: timeline of a normal run through the Map, Shuffle, and Reduce phases]
Hadoop Experiment • One failure injected after the Map phase • Re-execution of Map, Shuffle, and Reduce • ~33% increase in completion time
[Plot: timeline of the failure run, with the Map–Shuffle–Reduce sequence executed twice]
Re-Generation for Multi-Stage • Cascaded re-execution: expensive • A loss at a late stage can force re-running every stage back to stage 1, since each stage needs its predecessor's intermediate data
[Diagram: Stage 1 (Map) → Stage 2 (Reduce) → Stage 3 (Map) → Stage 4 (Reduce)]
Importance of Intermediate Data • Why? • Critical for execution (barrier) • Very costly when lost • Current frameworks handle it themselves • Re-generate when lost: can lead to expensive cascaded re-execution • No support from the storage system • We believe storage is the right abstraction, not the dataflow frameworks
Our Position • Intermediate data as a first-class citizen for dataflow programming frameworks in clouds • Dataflow programming frameworks • The importance of intermediate data • Outline of a solution • Why is storage the right abstraction? • Challenges • Research directions
Why is Storage the Right Abstraction? • Replication stops cascaded re-execution: a lost output is re-fetched from a replica instead of re-computed
[Diagram: Stage 1 (Map) → Stage 2 (Reduce) → Stage 3 (Map) → Stage 4 (Reduce)]
So, Are We Done? • No! • Challenge: minimal interference • The network is heavily utilized during Shuffle • Replication requires network transmission too • Minimizing interference is critical for the overall job completion time • Any existing approaches? • HDFS (Hadoop's default file system): heavy interference (next slide) • Background replication with TCP-Nice: not designed for network utilization & control (not discussed further; see our paper)
Modified HDFS Interference • Unmodified HDFS: heavy overhead from synchronous replication • We modified HDFS to replicate asynchronously and measured increasing levels of interference • Four levels of interference • Hadoop: original, no replication, no interference • Read: disk read only, no network transfer, no actual replication • Read-Send: disk read & network send, no actual replication • Rep.: full replication
Modified HDFS Interference • Even asynchronous replication interferes: network utilization makes the difference • Both Map & Shuffle are affected • Some Map tasks must read their input remotely
Our Position • Intermediate data as a first-class citizen for dataflow programming frameworks in clouds • Dataflow programming frameworks • The importance of intermediate data • Outline of a new storage system design • Why is storage the right abstraction? • Challenges • Research directions
Research Directions • Two requirements • Intermediate data availability to stop cascaded re-execution • Interference minimization focusing on network interference • Solution • Replication with minimal interference
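One way to meet the interference-minimization requirement is to make the replicator police its own send rate, so it consumes only bandwidth the Shuffle is not using. The sketch below is our own illustration of such rate capping; the class, the 64 KB chunk size, and the rate parameter are assumptions, not part of any existing system.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical rate-capped replicator: copies intermediate data in the
// background while keeping its network usage under a configured ceiling,
// so foreground Shuffle traffic keeps priority.
class ThrottledReplicator {
  private static final int CHUNK = 64 * 1024; // bytes per write
  private final long maxBytesPerSec;          // bandwidth ceiling

  ThrottledReplicator(long maxBytesPerSec) {
    this.maxBytesPerSec = maxBytesPerSec;
  }

  void replicate(InputStream local, OutputStream remote)
      throws IOException, InterruptedException {
    byte[] buf = new byte[CHUNK];
    long start = System.nanoTime();
    long sent = 0;
    int n;
    while ((n = local.read(buf)) != -1) {
      remote.write(buf, 0, n);
      sent += n;
      // If we are ahead of the allowed average rate, sleep it off.
      double elapsedSec = (System.nanoTime() - start) / 1e9;
      double aheadBytes = sent - maxBytesPerSec * elapsedSec;
      if (aheadBytes > 0) {
        Thread.sleep((long) (aheadBytes * 1000 / maxBytesPerSec) + 1);
      }
    }
    remote.flush();
  }
}
```

A real system would measure spare bandwidth continuously rather than fixing a ceiling in advance; the fixed cap here just keeps the sketch short.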
Research Directions • Replication using spare bandwidth • Not much network activity during Map & Reduce computation • Tight B/W monitoring & control • Deadline-based replication • Replicate every N stages • Replication based on a cost model • Replicate only when re-execution is more expensive
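The cost-model direction can be made concrete with a back-of-the-envelope rule: replicate a stage's output only when the expected time lost to re-execution exceeds the time spent replicating. The sketch below is purely illustrative; every parameter is an assumption we introduce, not a model from the paper.

```java
// Hypothetical cost model for "replicate only when re-execution is
// more expensive". All inputs are illustrative assumptions.
final class ReplicationPolicy {
  private ReplicationPolicy() {}

  static boolean shouldReplicate(
      double stageRuntimeSec,    // time to re-run this stage alone
      double upstreamRuntimeSec, // time to re-run lost upstream stages
      double failureProb,        // P(data is lost before it is consumed)
      long bytesToReplicate,
      double spareBandwidthBps)  // bandwidth left over by Shuffle
  {
    // Expected cost of not replicating: a failure triggers cascaded
    // re-execution of this stage plus everything upstream.
    double expectedReExecSec =
        failureProb * (stageRuntimeSec + upstreamRuntimeSec);
    // Cost of replicating: time to push the data over spare bandwidth.
    double replicationSec = bytesToReplicate / spareBandwidthBps;
    return expectedReExecSec > replicationSec;
  }
}
```

For example, replicating 1 GB over 50 Mbps of spare bandwidth takes about 160 seconds, so under this rule it pays off whenever the expected cascaded re-execution time exceeds that.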
Summary • Our position • Intermediate data as a first-class citizen for dataflow programming frameworks in clouds • Problem: cascaded re-execution • Requirements • Intermediate data availability • Interference minimization • Further research needed
Default HDFS Interference • Replication of Map and Reduce outputs • Replication policy: local, then remote-rack • Synchronous replication