Hadoop's Overload Tolerant Design Exacerbates Failure Detection and Recovery
Florin Dinu, T. S. Eugene Ng
Rice University
Hadoop is Widely Used*
Image Processing, Web Indexing, Machine Learning, Protein Sequencing, Advertising Analytics, Log Storage and Analysis, Recent research work (2010)
* Source: http://wiki.apache.org/hadoop/PoweredBy
Compute-Node Failures Are Common
“... typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours” Jeff Dean – Google I/O 2008
“5.0 average worker deaths per job” Jeff Dean – Keynote I – PACT 2006
At stake: revenue, reputation, user experience.
Hadoop vs Compute-Node Failures
Compute-node failures are common and damaging. Hadoop is widely used.
How does Hadoop behave under compute-node failures? Inflated, variable and unpredictable job running times. Sluggish failure detection.
What are the design decisions responsible? Answered in this work.
Focus of This Work
• Task Tracker failures: loss of intermediate data, loss of running tasks (Data Nodes not failed)
• Types of failures: Task Tracker process fail-stop failures, Task Tracker node fail-stop failures
• Single failures: expose the mechanisms and their interactions; findings also apply to multiple failures
(Architecture diagram: Job Tracker, Name Node, Task Tracker, Mapper, Reducer, Data Node.)
Declaring a Task Tracker Dead
• Heartbeats from the Task Tracker to the Job Tracker, usually every 3s
• Every 200s the Job Tracker checks whether heartbeats have not been sent for at least 600s
• Once the Task Tracker is declared dead: restart its running tasks and restart its completed maps
Conservative design
Declaring a Task Tracker Dead
(Timelines: depending on where the failure falls between two 200s Job Tracker checks, the heartbeat silence first exceeds 600s either at the next check or one check later.)
Detection time ~ 600s in one case, ~ 800s in the other.
Variable failure detection time
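As a minimal sketch of the detection-time arithmetic above (illustrative names and structure, not Hadoop's actual code, and the up-to-3s heartbeat interval is ignored): the Job Tracker only inspects heartbeat ages every 200s, so depending on when the failure lands between two checks, the expiry is noticed anywhere from ~600s to ~800s after the failure.

```java
// Illustrative sketch (not Hadoop source): detection latency for a Task
// Tracker that stops sending heartbeats at failureTime seconds.
public class DetectionTimeSketch {
    static final long CHECK_INTERVAL = 200; // Job Tracker expiry-check period (s)
    static final long EXPIRY_TIMEOUT = 600; // heartbeats missing for at least 600s

    // Returns how long after the failure the Task Tracker is declared dead.
    static long detectionTime(long failureTime) {
        for (long t = CHECK_INTERVAL; ; t += CHECK_INTERVAL) { // checks at 200s, 400s, ...
            long silence = t - failureTime; // approx. time since last heartbeat
            if (silence >= EXPIRY_TIMEOUT) {
                return silence;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(detectionTime(199)); // failure just before a check: ~601s
        System.out.println(detectionTime(201)); // failure just after a check:  ~799s
    }
}
```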
Declaring Map Output Lost
• Uses notifications from running reducers to the Job Tracker: a message that a specific map output is unavailable
• The output of map M is declared lost when #notif(M) > 0.5 * #running reducers AND #notif(M) > 3
• Map M is then restarted to re-compute its lost output
Conservative design. Static parameters.
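A minimal sketch of the threshold check just described; the class, method and parameter names are illustrative assumptions, not Hadoop's actual identifiers.

```java
// Illustrative sketch of the "map output lost" rule described above.
public class MapOutputLostSketch {
    // Map M is re-executed only once a majority of running reducers have
    // complained and more than 3 notifications have arrived.
    static boolean mapOutputLost(int notificationsForM, int runningReducers) {
        return notificationsForM > 0.5 * runningReducers
            && notificationsForM > 3;
    }
}
```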
Reducer Notifications
• A notification signals that a specific map output is unavailable
• On connection error (R1): re-attempt the connection; send a notification when nr of attempts % 10 == 0; exponential wait between attempts: wait = 10 * (1.3)^(nr_failed_attempts); usually ~416s are needed for 10 attempts
• On read error (R2): send a notification immediately
Conservative design. Static parameters.
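A small sketch of the back-off arithmetic above; the exact attempt indexing in Hadoop may differ, so treat the loop bounds as an assumption that reproduces the ~416s figure quoted on the slide.

```java
// Illustrative: cumulative wait across the first 10 failed connection attempts,
// with a wait of 10 * 1.3^k seconds after the k-th failed attempt.
public class BackoffSketch {
    public static void main(String[] args) {
        double totalWait = 0;
        for (int k = 1; k <= 9; k++) { // waits following failed attempts 1..9
            totalWait += 10 * Math.pow(1.3, k);
        }
        // ~416 seconds elapse before the 10th attempt triggers a notification.
        System.out.printf("total wait: ~%.0f s%n", totalWait);
    }
}
```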
Declaring a Reducer Faulty
A reducer is considered faulty if (simplified version):
#shuffles failed > 0.5 * #shuffles attempted
AND
( #shuffles succeeded < 0.5 * #shuffles necessary OR the reducer has stalled for too long )
Ignores the cause of the failed shuffles. Static parameters.
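A hedged sketch of the simplified condition above; the counter names are illustrative, not Hadoop's.

```java
// Illustrative sketch of the simplified "reducer faulty" rule described above.
public class ReducerFaultySketch {
    static boolean reducerFaulty(int shufflesFailed, int shufflesAttempted,
                                 int shufflesSucceeded, int shufflesNecessary,
                                 boolean stalledTooLong) {
        boolean mostlyFailing  = shufflesFailed > 0.5 * shufflesAttempted;
        boolean littleProgress = shufflesSucceeded < 0.5 * shufflesNecessary;
        // The rule only counts shuffles; it never asks *why* they failed
        // (dead node vs. transient congestion).
        return mostlyFailing && (littleProgress || stalledTooLong);
    }
}
```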
Experiment: Methodology
• 15-node, 4-rack testbed in the OpenCirrus* cluster: 14 compute nodes, 1 node reserved for the Job Tracker and Name Node
• Sort job, 10GB input, 160 maps, 14 reducers, 200 runs per experiment
• The job takes 220s in the absence of failures
• Inject a single Task Tracker process failure at a random time between 0 and 220s
* https://opencirrus.org/ , the HP/Intel/Yahoo! Open Cloud Computing Research Testbed
Experiment: Results
(Plot of job running times under a single Task Tracker failure; the runs fall into seven distinct groups, G1 through G7.)
Large variability in job running times
Group G1 – few reducers impacted
M1 was copied by all reducers before the failure. After the failure, only the re-executed reducer R1_1 cannot access M1.
R1_1 needs to send 3 notifications: ~1250s (roughly 3 notification rounds of ~416s each).
The Task Tracker itself is declared dead after only 600-800s.
(Diagram: maps M1-M3, reducers R1-R3, and the re-executed reducer R1_1 notifying the Job Tracker about M1.)
Slow recovery when few reducers are impacted
Group G2 – timing of failure
G2 jobs run about 200s longer than G1 jobs.
(Timelines for G1 and G2: the failure occurs at ~170s in both, but falls differently relative to the Job Tracker's periodic checks, so in G2 the Task Tracker is declared dead one 200s check interval later.)
The timing of the failure relative to the Job Tracker's checks impacts job running time
Group G3 – early notifications
• In G1, notifications are sent after ~416s
• In G3, early notifications (sent before 416s) cause map outputs to be declared lost sooner
• Causes: code-level race conditions; the timing of a reducer's shuffle attempts
(Timelines: a regular notification at 416s vs. an early notification before 416s.)
Early notifications increase job running time variability
Group G4 & G5 – many reducers impacted
• G4: many reducers send notifications after ~416s; the map output is declared lost before the Task Tracker is declared dead
• G5: same as G4, but early notifications are sent
(Diagram: reducers notify the Job Tracker that outputs M1-M5 are unavailable.)
Job running time under failure varies with the number of reducers impacted
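To make the G4 case concrete, here is a hedged numeric sketch reusing the illustrative map-output-lost rule from earlier; the counts (14 running reducers, 13 of them notifying) are assumptions chosen to match the experimental setup, not measured values.

```java
// Illustrative: with many impacted reducers, the notification threshold is
// crossed at ~416s, well before the 600-800s Task Tracker expiry.
public class ManyReducersSketch {
    static boolean mapOutputLost(int notifications, int runningReducers) {
        return notifications > 0.5 * runningReducers && notifications > 3;
    }

    public static void main(String[] args) {
        int runningReducers = 14; // as in the Sort experiment
        int notifications   = 13; // assumed: most reducers still need the lost output
        // After ~416s of failed fetch attempts, the impacted reducers all notify.
        System.out.println(mapOutputLost(notifications, runningReducers)); // true
    }
}
```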
Induced Reducer Death
A reducer is considered faulty if (simplified version):
#shuffles failed / #shuffles attempted > 0.5
AND
( #shuffles succeeded / #shuffles necessary < 0.5 OR the reducer has stalled for too long )
• If the failed Task Tracker is contacted among the first Task Trackers => the reducer dies
• If the failed Task Tracker is attempted too many times => the reducer dies
A failure can induce other failures in healthy reducers. CPU time and network bandwidth are unnecessarily wasted.
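A hedged numeric illustration of induced reducer death, reusing the illustrative reducer-faulty rule from earlier; the counts describe an assumed healthy reducer early in its shuffle phase (160 map outputs needed, as in the experiment), not measurements.

```java
// Illustrative: a healthy reducer that happens to contact the failed Task
// Tracker among its first fetch attempts can itself be declared faulty.
public class InducedDeathSketch {
    static boolean reducerFaulty(int failed, int attempted,
                                 int succeeded, int necessary,
                                 boolean stalledTooLong) {
        return failed > 0.5 * attempted
            && (succeeded < 0.5 * necessary || stalledTooLong);
    }

    public static void main(String[] args) {
        // 4 fetch attempts so far, 3 against the dead Task Tracker (assumed),
        // 1 successful fetch out of 160 needed map outputs.
        System.out.println(reducerFaulty(3, 4, 1, 160, false)); // true -> reducer dies
    }
}
```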
56 vs 14 Reducers
(CDF of job running times with 56 reducers.)
Job running times are spread out even more: increased chance of induced reducer death or early notifications
Simulating Node Failure
(CDF of job running times under a simulated node failure.)
Without TCP RST packets, all affected tasks wait for the Task Tracker to be declared dead.
Lack of Adaptivity
• Recall: a notification is sent only after 10 attempts
• Inefficiency: a static, one-size-fits-all solution cannot handle all situations; its efficiency varies with the number of reducers
• A way forward: use more detailed information about the current job state
Conservative Design
• Recall: declare a Task Tracker dead only after at least 600s; send a notification only after 10 attempts and ~416 seconds
• Inefficiency: assumes most problems are transient; sluggish response to permanent compute-node failures
• A way forward: leverage additional information, such as network state information and historical information about compute-node behavior [OSDI ‘10]
Simplistic Failure Semantics
• Lack of TCP connectivity is treated as a problem with the tasks
• Inefficiency: cannot distinguish between the multiple causes for a lack of connectivity: transient congestion vs. compute-node failure
• A way forward: decouple failure recovery from overload recovery; use AQM/ECN to provide extra congestion information; allow direct communication between the application and the infrastructure
Thank you
Company and product logos are from the respective companies' websites; conference logos are from the conference websites.
Group G3 – early notifications
• In G1, notifications are sent after ~416s
• In G3, early notifications cause map outputs to be declared lost sooner
• Causes: code-level race conditions; the timing of a reducer's shuffle attempts
(Timelines of reducer R2's interleaved shuffle attempts for maps M5 and M6, illustrating how the ordering of attempts can produce a notification before 416s.)
Early notifications increase job running time variability
Task Tracker Failure-Related Mechanisms
• Declaring a Task Tracker dead
• Declaring a map output lost
• Declaring a reducer faulty