210 likes | 340 Views
Fault Tolerant Parallel Data-Intensive Algorithms. †. Mucahid Kutlu Gagan Agrawal Oguz Kurt Department of Computer Science and Engineering The Ohio State University Department of Mathematics The Ohio State University. †. Outline. Motivation Related Work Our Goal
E N D
Fault Tolerant Parallel Data-Intensive Algorithms † Mucahid Kutlu Gagan Agrawal Oguz Kurt Department of Computer Science and Engineering The Ohio State University Department of Mathematics The Ohio State University † HIPC'12 Pune, India
Outline HIPC'12 Pune, India • Motivation • Related Work • Our Goal • Data Intensive Algorithms • Our Approach • Data Distribution • Fault Tolerant Algorithms • Recovery • Experiments • Conclusion
Why Is Fault Tolerant So Important? • Typical first year for a new cluster* • 1000 individual machine failures • 1 PDU failure (~500-1000 machines suddenly disappear) • 20 rack failures (40-80 machines disappear,1-6 hours to get back) • Other failures because of overheating, maintenance, … • If you have a code which runs on 1000 computers more than one day, it is high probability that you will get a failure before your code finishes. *taken from Jeff Dean’s talk in Google IO(http://perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx) HIPC'12 Pune,India
Fault Tolerance So Far • MPI fault-tolerance, which focus on checkpointing [1], [2], [3]. • High overhead • MapReduce[4] • Good at fault tolerance for data-intensive applications • Algorithm-based fault-tolerance • Most of them are for scientific computation like linear algebra routines [5], [6], iterative computations [7], including conjugate gradient [8]. HIPC'12 Pune,India
Our Goal HIPC'12 Pune,India • Our main goal is to develop algorithm-based fault-tolerance solution for data intensive algorithms. • Target Failure : • We focused on hardware failures. • We lose everything on the failed node. • We do not recover that failed node and continue process without it. • In an iterative algorithm, when a failure occurs: • The system shouldn’t start the failed iteration from the beginning • The amount of data lost should be small as much as possible.
Data Intensive Algorithms HIPC'12 Pune,India We focused on two algorithms: K-Means & Apriori But it can be generalized to all algorithms that have the following reduction processing structure.
Our Approach HIPC'12 Pune,India • We used Master-Slave approach. • Replication : Divide the data to be replicated in parts, and distribute them among different processors. • The amount of lost data will be smaller. • Summarization : The slaves can send the results of parts of their data before they processed all the data. • We won’t need to re-process the data that we already got the results.
Data Distribution D1 D2 D3 D4 D1 D2 D3 D4 Replication D5 D6 D7 D8 D5 D6 D7 D8 Master P1 P2 P3 P4
K-means with No Failure 1- 14 Master 1 2 Send the summary of data portion 1 Broadcast new centroids P1 P2 P3 P4 P5 P7 P6 1 6 5 2 3 4 7 13 14 1 2 3 4 5 6 7 8 11 12 Primary Data 9 10 7 3 2 1 1 2 6 5 6 3 4 7 8 1 2 11 12 1 2 3 4 Replicas 7 5 5 6 7 3 4 13 14 9 10 11 12 9 10 13 14 5 6 7 8
Single Node Failure Case 1: P1 Fails after sending the result of D1 Recovery -Master node notifies P3 to process D2 at this iteration and P3 to process D1 starting from next iteration. Master P1 P2 P3 P4 P5 P7 P6 13 14 1 2 3 4 5 6 7 8 11 12 Primary Data 9 10 5 6 3 4 7 8 1 2 11 12 1 2 3 4 Replicas 13 14 9 10 11 12 9 10 13 14 5 6 7 8 HIPC'12 Pune,India
Multiple Node Failure Case 2: P1, P2 and P3 Fail, D1 and D2 are lost totally! Recovery -Master node notifies P7 to read first data block(D1 and D2) from storage cluster. -Master Node notifies P6 to processD5 and D6, P4 to process D3, P5 to process D4. -P7 reads D1 and D2 Storage Cluster Master P1 P2 P3 P4 P5 P7 P6 13 14 1 2 3 4 5 6 7 8 11 12 Primary Data 9 10 5 6 3 4 7 8 1 2 11 12 1 2 3 4 Replicas 13 14 9 10 11 12 9 10 13 14 5 6 7 8 1 2 HIPC'12 Pune,India
Experiments HIPC'12 Pune,India • We used Glenn at OSC for our experiments. • The nodes have : • Dual socket, quad core 2.5 GHz Opterons • 24 GB RAM • Allocated 16 slave nodes with 1 core per each. • Implemented in C programming language by using MPI library. • Generated Datasets: • K-Means • Size : 4.87 GB • Coordinate Number : 20 • Maximum Iteration : 50 • Apriori • Size : 4.79 GB • Item Number : 10 • Support Vector :%1 • Maximum Rule Size :6 (847 rules)
Effect of Summarization Changing Message Numbers with Different Percentages Changing Data Block Numbers with Different Number of Failures HIPC'12 Pune,India
Effect of Replication & Scalability Scalability Test for Apriori Effect of Replication for Apriori HIPC'12 Pune,India
Fault Tolerance in Map Reduce HIPC'12 Pune,India • Replication data in file system • We replicate the data in processors, not in file system • If a task fails, re-execute it. • Completed map tasks also need to be re-executed since the results are stored in local disks. • Because of dynamics scheduling, it can have better parallelism after a failure.
Experiments with Hadoop HIPC'12 Pune,India • Experimental Setup • Allocated 17 nodes and 1 core per each. • Used one of nodes as master and the rest as slaves • Used default chunk size • No backup nodes are used • Replication is set to 3 • Each test executed 5 times and took the average of 3 after eliminating the maximum and minimum results. • For our system, we set R=3 and S=4 and M=2 • Calculated total time, including I/O operation
Experiments with Hadoop(Cont’d) Single Failure Occurring at different Percentages Multiple Failure Test for Apriori HIPC'12 Pune,India
Conclusion HIPC'12 Pune,India • Summarization has good effect in fault tolerance • We recover faster when we divide the data into more parts • Dividing data into small parts and distributing them with minimum intersection decrease the amount of data to be recovered in multiple failures. • Our system behaves much better Hadoop
References HIPC'12 Pune,India J. Hursey, J.M. Squyres, T.I. Mattox, and A. Lumsdaine. The design and implementation of checkpoint/restart process fault tolerance for open mpi. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1 –8, march 2007 G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov. Mpich-v: Toward a scalable fault tolerant mpi for volatile nodes. In Supercomputing, ACM/IEEE 2002 Conference, page 29, nov. 2002. Camille Coti, Thomas Herault, Pierre Lemarinier, Laurence Pilard, Ala Rezmerita, Eric Rodriguez, and Franck Cappello. Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant mpi. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing, SC ’06 ACM, 2006. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137–150, 2004. J.S. Plank, Youngbae Kim, and J.J. Dongarra. Algorithm-based diskless checkpointing for fault tolerant matrix operations. In Fault-Tolerant Computing, 1995. FTCS-25. Digest of Papers., Twenty-Fifth International Symposium on, pages 351 –360, jun 1995. Teresa Davies, Christer Karlsson, Hui Liu, Chong Ding, and Zizhong Chen. High performance linpack benchmark: a fault tolerant implementation without checkpointing. In Proceedings of the international conference on Supercomputing, ICS ’11, pages 162–171. ACM, 2011. Zizhong Chen. Algorithm-based recovery for iterative methods without checkpointing. In Proceedings of the 20th ACM International Symposium on High Performance Distributed Computing, HPDC 2011, San Jose, CA, USA, June 8-11, 2011, pages 73–84, 2011. Zizhong Chen and J. Dongarra. A scalable checkpoint encoding algorithm for diskless checkpointing. In High Assurance Systems Engineering Symposium, 2008. HASE 2008. 11th IEEE, pages 71 –79, dec. 2008.
Thank you for listening.. HIPC'12 Pune,India Any questions?