Using failure injection mechanisms to experiment and evaluate a grid failure detector




  1. Using failure injection mechanisms to experiment and evaluate a grid failure detector • Sébastien Monnet and Marin Bertier • IRISA / INRIA, PARIS project-team • WCGC 2006 - Rio de Janeiro

  2. Systems evaluation • Simulations • Fast/easy • System model • Formal proofs • Reliable • System model • Experiments on real testbeds • Real system code / real environment • Hard!

  3. Running experiments • Find resources • Deploy the system • Launch the test • Control the test • Get and analyze results

  4. Experimenting with fault tolerance • Evaluate fault-tolerance mechanisms • Fault-free runs • With failures • Fault prevention cost • Resilience to failures • Overhead due to failures (recovery, adaptation, etc.)

  5. Volatility control - needs • Assumption: a stable testbed • Injecting failures • At large scale • Accurately • Reproducibly • Using failure scenarios

  6. JXTA Distributed Framework (JDF) • A tool to automate the tests of JXTA-based systems (Sun Microsystems, Paris research team) • Test description • Nodes file • File listing the files to deploy • XML file describing the node profiles • Set of scripts to deploy, launch and fetch results

  7. Adding a specific XML tag for failure injection • <failure grp="groupName"> • <failure dep="profileName"> • Single failure • Correlated failures • Description language extension:
       <network analyze-class="test.Analyze">
         <profile name="manager" replicas="1">
           <!-- peer information -->
           <peer base-name="peerA"/>
           ...
           <bootstrap class="test.MyClass1"/>
           <!-- argument -->
           <arg value="x"/>
         </profile>
         <profile name="non-manager" replicas="20">
           <peer base-name="peerB"/>
           ...
           <bootstrap class="test.MyClass2"/>
         </profile>
       </network>

  8. Injecting failures - when? • Active research field • A failure schedule generator • Input • The failure tags in the XML description file • Probabilistic parameters (MTBF) • Output • A new configuration file for JDF • Format: peerID=uptime
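The schedule generator on this slide can be illustrated with a minimal sketch. The slides only state that the input is a set of failure tags plus probabilistic parameters (MTBF) and that the output uses the `peerID=uptime` format; everything else here is an assumption made for illustration: uptimes are drawn independently from an exponential distribution whose mean is the MTBF, uptimes are expressed in seconds, and the function names and peer identifiers (`generate_schedule`, `format_schedule`, `peerB1`...) are hypothetical rather than part of JDF.

```python
import random


def generate_schedule(peer_ids, mtbf_seconds, seed=None):
    """Draw one uptime per peer (assumption: exponential law, mean = MTBF)."""
    rng = random.Random(seed)  # a fixed seed makes the schedule reproducible
    return {peer: rng.expovariate(1.0 / mtbf_seconds) for peer in peer_ids}


def format_schedule(schedule):
    """Render the schedule as 'peerID=uptime' lines, one per peer."""
    return "\n".join(f"{peer}={uptime:.1f}" for peer, uptime in schedule.items())


# Hypothetical peer identifiers; the real ones come from the JDF description.
peers = [f"peerB{i}" for i in range(1, 4)]
schedule = generate_schedule(peers, mtbf_seconds=60.0, seed=42)
print(format_schedule(schedule))
```

Seeding the random generator is one simple way to obtain the reproducibility the talk asks for: the same seed and MTBF always yield the same failure schedule.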

  9. Injecting failures - how?

  10. Using failure injectors (1) • Launching a simple test • Correlated failures

  11. Using failure injectors (2) • Refining the failure schedule

  12. Failure detectors • Basic building block for fault-tolerance mechanisms • Basic principle • Periodic heartbeat exchanges (all-to-all) • On each node, a suspect list is updated according to heartbeat arrivals
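The basic principle above (periodic heartbeats, a per-node suspect list) can be sketched as follows. This is an illustrative simplification, not the detector evaluated in the talk: it assumes a fixed timeout and explicit timestamps passed in by the caller, and the class and method names are hypothetical.

```python
class HeartbeatDetector:
    """Per-node suspect list driven by heartbeat arrivals (fixed timeout)."""

    def __init__(self, peers, timeout):
        self.timeout = timeout                    # assumption: fixed, in seconds
        self.last_seen = {p: 0.0 for p in peers}  # time of the last heartbeat
        self.suspects = set()

    def on_heartbeat(self, peer, now):
        """Record a heartbeat; a suspected peer that speaks again is cleared."""
        self.last_seen[peer] = now
        self.suspects.discard(peer)

    def check(self, now):
        """Suspect every peer whose last heartbeat is older than the timeout."""
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self.suspects.add(peer)
        return set(self.suspects)


detector = HeartbeatDetector(["a", "b"], timeout=3.0)
detector.on_heartbeat("a", now=1.0)
detector.on_heartbeat("b", now=1.0)
detector.on_heartbeat("a", now=4.0)
print(detector.check(now=5.0))  # "b" has been silent for 4 s, so it is suspected
```

A suspicion is not final: if a heartbeat from a suspected peer arrives later, the peer is removed from the list, which is how false suspicions caused by slow links get corrected.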

  13. Grid failure detectors (GFD) • Adaptability • Network load • Quality of service • Scalability • Hierarchical failure detectors • All-to-all within clusters • Leader-to-leader among clusters
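The scalability argument for the hierarchical organization can be made concrete by counting heartbeats sent per period. The sketch below compares a flat all-to-all detector with the two-level scheme (all-to-all within clusters, leader-to-leader among clusters). The figures are illustrative: 64 nodes in 4 clusters matches the experimental setup of the talk, but the equal split into 16 nodes per cluster is an assumption, as is counting one heartbeat per ordered pair of nodes per period.

```python
def all_to_all_messages(n):
    """Heartbeats per period when each of n nodes sends to every other node."""
    return n * (n - 1)


def hierarchical_messages(cluster_sizes):
    """Heartbeats per period: all-to-all inside clusters plus leader-to-leader."""
    intra = sum(size * (size - 1) for size in cluster_sizes)
    leaders = len(cluster_sizes)  # one leader per cluster
    inter = leaders * (leaders - 1)
    return intra + inter


flat = all_to_all_messages(64)
hier = hierarchical_messages([16, 16, 16, 16])  # assumed equal cluster sizes
print(flat, hier)  # 4032 972
```

Under these assumptions the hierarchical scheme sends roughly a quarter of the heartbeats, and only 12 of them cross the wide-area links between sites.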

  14. Experimental testbed • Grid5000 grid platform • 9 cities interconnected by Renater • Bandwidth: 1 Gb/s (10 Gb/s soon) • Latency: from 4 to ~30 ms • In each city, clusters provide high-performance networks • Bandwidth: 1 Gb/s • Latency: a few microseconds • http://www.grid5000.fr/

  15. Experimental setup • 64 nodes partitioned across 4 different cities • (figure: Cluster 1, Cluster 2, Cluster 3, Cluster 4)

  16. Failure injector - alone • MTBF = 1 minute • No failure dependencies

  17. Correlated failures • Adding a failure dependency in cluster 1: <failure dep="cluster1-leader"> • Cluster 1 leader crashes
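One way to read the `dep` attribute is that a peer carrying it crashes no later than the peer it depends on. The sketch below applies that reading to a `peerID=uptime` schedule. The semantics are an assumption inferred from the slides, not the documented behaviour of the tool; the function name, the peer identifiers and the single-level (non-transitive) handling of dependencies are all hypothetical.

```python
def apply_dependencies(schedule, depends_on):
    """Cap each dependent peer's uptime at the uptime of the peer it depends on.

    schedule:   {peer_id: uptime_in_seconds}
    depends_on: {dependent_peer_id: peer_id_it_depends_on}
    Assumption: dependencies are one level deep (no chains).
    """
    result = dict(schedule)
    for peer, target in depends_on.items():
        if target in schedule:
            own = schedule.get(peer, float("inf"))
            result[peer] = min(own, schedule[target])
    return result


# Hypothetical schedule: the cluster 1 leader crashes after 30 s.
schedule = {"cluster1-leader": 30.0, "peerA": 50.0, "peerB": 10.0}
deps = {"peerA": "cluster1-leader", "peerB": "cluster1-leader"}
print(apply_dependencies(schedule, deps))
```

Here `peerA` is brought forward to crash with the leader at 30 s, while `peerB` keeps its own earlier crash at 10 s.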

  18. Failure detection in subgroups • No leader failures • No failure dependencies

  19. Between groups • Failure dependency in each cluster to avoid new leader selection

  20. Conclusion • Evaluating a distributed system is complex • Running experiments provides the ability to • Evaluate a new concept or software • Debug during the implementation phase • Failure-injection mechanisms provide the ability to experiment with fault-tolerance mechanisms • We have designed a failure injection tool that allows the tester to run large-scale experiments • With various volatility conditions • In a reproducible manner
