Using failure injection mechanisms to experiment and evaluate a grid failure detector

Sébastien Monnet and Marin Bertier
IRISA / INRIA, PARIS project-team
WCGC 2006 - Rio de Janeiro

Systems evaluation
• Simulations
  • Fast/easy
  • System model
• Formal proofs
  • Reliable
  • System model
• Experiments on real testbeds
  • Real system code / real environment
  • Hard!

Running experiments
• Find resources
• Deploy the system
• Launch the test
• Control the test
• Get and analyze results

Experimenting with fault tolerance
• Evaluate fault-tolerance mechanisms
• Fault-free runs
• With failures
• Fault prevention cost
• Resilience to failures
• Overhead due to failures (recovery, adaptation, etc.)

Volatility control - needs
• Assumption: a stable testbed
• Injecting failures
  • At large scale
  • Accurately
  • Reproducibly
  • Using failure scenarios

JXTA Distributed Framework (JDF)
• A tool to automate the tests of JXTA-based systems (Sun Microsystems, Paris research team)
• Test description
  • Nodes file
  • File listing the files to deploy
  • XML file describing node profiles
• Set of scripts to deploy, launch and fetch results

Description language extension
• Adding a specific XML tag for failure injection
  • Single failure: <failure grp="groupName">
  • Correlated failures: <failure dep="profileName">

<network analyze-class="test.Analyze">
  <profile name="manager" replicas="1">
    <peer base-name="peerA"/>
    ...
    <bootstrap class="test.MyClass1"/>
    <arg value="x"/>
  </profile>
  <profile name="non-manager" replicas="20">
    <peer base-name="peerB"/>
    ...
    <bootstrap class="test.MyClass2"/>
  </profile>
</network>

Injecting failures - when?
• Active research field
• A failure schedule generator
  • Input
    • The failure tags in the XML description file
    • Probabilistic parameters (MTBF)
  • Output
    • A new configuration file for JDF
    • Format: peerID=uptime
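
The generator described above can be sketched as follows. This is a minimal illustration, not the authors' tool: it assumes exponentially distributed uptimes with the given MTBF, and it reads a `dep` dependency as "the dependent peer fails no later than the peer it depends on". The function names, the seed parameter and the uptime unit (seconds) are hypothetical.

```python
import random


def generate_schedule(peers, mtbf_seconds, dependencies=None, seed=0):
    """Return a dict {peer_id: uptime_seconds}.

    peers: ids of the peers subject to failures.
    mtbf_seconds: mean time between failures (exponential law assumed).
    dependencies: {dependent_peer: peer_it_depends_on}; an assumed reading
        of correlated failures, where the dependent peer crashes at the
        latest when its target crashes. Chains are not resolved.
    seed: fixed seed so that the same schedule can be replayed.
    """
    rng = random.Random(seed)
    # Independent failures: one exponential draw per peer.
    schedule = {p: rng.expovariate(1.0 / mtbf_seconds) for p in peers}
    # Correlated failures: cap the dependent's uptime by its target's uptime.
    for dependent, target in (dependencies or {}).items():
        own = schedule.get(dependent, float("inf"))
        schedule[dependent] = min(own, schedule[target])
    return schedule


def to_config_lines(schedule):
    """Render the schedule in the peerID=uptime format, one peer per line."""
    return [f"{peer}={uptime:.0f}" for peer, uptime in sorted(schedule.items())]
```

Calling `generate_schedule` twice with the same seed yields the same schedule, which is what makes a failure scenario replayable across runs.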

Injecting failures - how?

Using failure injectors (1)
• Launching a simple test
• Correlated failures

Using failure injectors (2)
• Refining the failure schedule

Failure detectors
• Basic building block for fault-tolerance mechanisms
• Basic principle
  • Periodic heartbeat exchanges (all-to-all)
  • On each node, a suspect list is updated according to heartbeat arrivals
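
The basic principle can be illustrated with a small sketch of the per-node state. It assumes a fixed timeout and explicit timestamps passed by the caller; the class and method names are hypothetical, and an adaptive detector such as the one evaluated here would adjust the timeout rather than keep it constant.

```python
class HeartbeatDetector:
    """Per-node suspect list driven by heartbeat arrivals (fixed timeout)."""

    def __init__(self, peers, timeout):
        self.timeout = timeout
        # Time of the last heartbeat received from each monitored peer.
        self.last_seen = {p: 0.0 for p in peers}
        self.suspects = set()

    def on_heartbeat(self, peer, now):
        """Record a heartbeat and stop suspecting its sender."""
        self.last_seen[peer] = now
        self.suspects.discard(peer)

    def check(self, now):
        """Suspect every peer whose last heartbeat is older than the timeout."""
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self.suspects.add(peer)
        return set(self.suspects)
```

A suspected peer is removed from the list as soon as one of its heartbeats arrives again, so a late heartbeat corrects a false suspicion.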

Grid failure detectors (GFD)
• Adaptability
  • Network load
  • Quality of service
• Scalability
  • Hierarchical failure detectors
    • All-to-all within clusters
    • Leader-to-leader among clusters
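
To see why the hierarchy helps scalability, the number of heartbeat messages per period can be compared for the two organizations. The sketch below is a simple count under the assumption that every monitored pair exchanges one heartbeat in each direction per period; the function names are hypothetical.

```python
def all_to_all_messages(n):
    """Heartbeats per period when each of n nodes sends to the n - 1 others."""
    return n * (n - 1)


def hierarchical_messages(cluster_sizes):
    """Heartbeats per period with all-to-all inside each cluster,
    plus all-to-all among the cluster leaders (one leader per cluster)."""
    intra = sum(all_to_all_messages(size) for size in cluster_sizes)
    inter = all_to_all_messages(len(cluster_sizes))
    return intra + inter
```

For instance, with 64 nodes split evenly into four clusters of 16 (the even split is an assumption made for illustration), `all_to_all_messages(64)` gives 4032 messages per period, while `hierarchical_messages([16, 16, 16, 16])` gives 972, of which only 12 cross cluster boundaries.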

Experimental testbed
• Grid5000 grid platform
  • 9 cities interconnected by Renater
    • Bandwidth: 1 Gb/s (10 Gb/s soon)
    • Latency: from 4 to ~30 ms
  • In each city, clusters provide high-performance networks
    • Bandwidth: 1 Gb/s
    • Latency: a few microseconds
• http://www.grid5000.fr/

Experimental setup
• 64 nodes partitioned across 4 different cities
[Figure: Cluster 1, Cluster 2, Cluster 3, Cluster 4]

Failure injector - alone
• MTBF = 1 minute
• No failure dependencies

Correlated failures
• Adding a failure dependency in cluster 1: <failure dep="cluster1-leader">
[Figure annotation: Cluster1 leader crashes]

Failure detection in subgroups
• No leader failures
• No failure dependencies

Between groups
• Failure dependency in each cluster to avoid new leader selection

Conclusion
• Evaluating a distributed system is complex
• Running experiments provides the ability to
  • Evaluate a new concept or software
  • Debug during the implementation phase
• Failure-injection mechanisms provide the ability to experiment with fault-tolerance mechanisms
• We have designed a failure injection tool that allows the tester to run large-scale experiments
  • With various volatility conditions
  • In a reproducible manner
