CIFTS: Coordinated Infrastructure for Fault Tolerant Systems
Agenda • The Problem and the purpose • The CIFTS framework • The CIFTS team • Call for Action
The Problem
• MPI detects a “communication failure” with node X
• MPI aborts! The application aborts!
• No knowledge of this failure is shared between system resources
• JS/RM continues to schedule jobs on the same resources
• More failures
• Cluster system software is agnostic of this MPI job failure and of the reason for it
• Very little fault information is shared!
• No mechanism for global system knowledge!
The Purpose
• MPI: detects a “communication failure” with node X and shares this failure knowledge with the system
• JS/RM: does not launch jobs on node X until further diagnosis
• Diagnostics utility: runs scripts for root-causing the node X problem
• Applications: checkpoint themselves
The Purpose
• PVFS: I/O node failure, PVFS down; PVFS shares this information
• JS/RM: launches jobs with the NFS file system and migrates existing jobs
• MPI-IO: prints a coherent error message
• Applications: checkpoint themselves
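The node X scenario above can be sketched as a small piece of scheduler logic. This is only an illustration of the intended behaviour, not CIFTS or JS/RM code: the `toy_sched_*` names, the fixed node count, and the "first healthy node" policy are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NUM_NODES 4  /* assumption: a tiny fixed-size cluster */

/* Hypothetical scheduler state: one "suspect" flag per node. */
typedef struct {
    bool suspect[TOY_NUM_NODES];
} toy_scheduler_t;

/* Start with every node considered healthy. */
void toy_sched_init(toy_scheduler_t *s)
{
    for (int i = 0; i < TOY_NUM_NODES; i++)
        s->suspect[i] = false;
}

/* Reaction to a shared "communication failure" event for a node:
   stop placing jobs there until it has been diagnosed. */
void toy_sched_on_comm_failure(toy_scheduler_t *s, int node)
{
    assert(node >= 0 && node < TOY_NUM_NODES);
    s->suspect[node] = true;
}

/* Reaction to a diagnostics result saying the node is fine again. */
void toy_sched_on_diagnosis_ok(toy_scheduler_t *s, int node)
{
    assert(node >= 0 && node < TOY_NUM_NODES);
    s->suspect[node] = false;
}

/* Pick the lowest-numbered node that is not suspect, or -1 if none. */
int toy_sched_pick_node(const toy_scheduler_t *s)
{
    for (int i = 0; i < TOY_NUM_NODES; i++)
        if (!s->suspect[i])
            return i;
    return -1;
}
```

Without shared fault information the scheduler would never receive the failure event and would keep returning the failed node; with it, the node is skipped until diagnostics clear it.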
The CIFTS Framework
• [Diagram] Fault Tolerant Backplane at the center, connected to: PVFS; Universal Logger; Checkpoint Restart System; Resource Manager/JS; Event Analysis; System Components; Automatic Actions; Diagnostics Tools; Middleware like MPI; MPI-IO; Linear Algebra Libraries; Autonomics; Libraries and Applications
A little deeper…
• Each component instance interacts with the distributed Fault Tolerant Backplane in three steps:
1. Register with FTB
2. Subscribe to events
3. Publish events
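The register, subscribe, and publish steps can be illustrated with a toy, in-process stand-in for the backplane. This is a minimal sketch, not the FTB implementation: every `toy_*` name is hypothetical, delivery is a direct function call rather than a distributed message, and matching is by exact event name only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_SUBS 8  /* assumption: small fixed subscription table */

/* Hypothetical callback type: receives the event name and a user argument. */
typedef void (*toy_callback_t)(const char *event_name, void *arg);

typedef struct {
    const char *event_name;  /* name to match; NULL means "all events" */
    toy_callback_t callback;
    void *arg;
} toy_subscription_t;

typedef struct {
    toy_subscription_t subs[TOY_MAX_SUBS];
    int num_subs;
} toy_backplane_t;

/* Step 1 (register): here this just prepares an empty subscription table. */
void toy_init(toy_backplane_t *bp)
{
    bp->num_subs = 0;
}

/* Step 2 (subscribe): remember who wants which event. Returns 0 or -1. */
int toy_subscribe(toy_backplane_t *bp, const char *event_name,
                  toy_callback_t cb, void *arg)
{
    assert(bp != NULL && cb != NULL);
    if (bp->num_subs >= TOY_MAX_SUBS)
        return -1;
    bp->subs[bp->num_subs].event_name = event_name;
    bp->subs[bp->num_subs].callback = cb;
    bp->subs[bp->num_subs].arg = arg;
    bp->num_subs++;
    return 0;
}

/* Step 3 (publish): call every matching subscriber; returns how many ran. */
int toy_publish(toy_backplane_t *bp, const char *event_name)
{
    int delivered = 0;
    for (int i = 0; i < bp->num_subs; i++) {
        const toy_subscription_t *s = &bp->subs[i];
        if (s->event_name == NULL || strcmp(s->event_name, event_name) == 0) {
            s->callback(event_name, s->arg);
            delivered++;
        }
    }
    return delivered;
}

/* Example callback: counts the events it has seen in the int behind arg. */
void toy_count_events(const char *event_name, void *arg)
{
    (void)event_name;
    (*(int *)arg)++;
}
```

A logger-like component would subscribe with a NULL name to see everything, while a scheduler-like component would subscribe only to the failure events it can act on.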
CIFTS API - some primitives
• FTB_Init (IN FTB_comp_info_t *comp_info, OUT FTB_client_handle_t *client_handle, OUT char *error_msg)
• FTB_Publish_event (IN FTB_client_handle_t handle, IN char *event_name, IN FTB_event_data_t *datadetails, OUT char *error_msg)
• FTB_Create_mask (INOUT FTB_event_mask_t *evt_mask, IN char *field_name, IN char *field_val, OUT char *error_msg)
• FTB_Subscribe (IN FTB_client_handle_t chandle, IN FTB_event_mask_t *event_mask, OUT FTB_subscribe_handle_t *shandle, OUT char *error_msg, IN int (*callback)(OUT FTB_catch_event_info_t *, OUT void*), IN void *arg)
• FTB_Poll_for_event (IN FTB_subscribe_handle_t shandle, OUT FTB_catch_event_info_t *catch_event, OUT char *error_msg)
• FTB_Finalize (IN FTB_client_handle_t handle)
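The primitives above allow events to be caught either through a callback passed to FTB_Subscribe or by calling FTB_Poll_for_event. The polling style can be pictured as a small queue that delivery fills and the component drains when convenient. The sketch below is a toy model of that idea only: the `toy_*` names, the queue capacity, and the drop-when-full behaviour are assumptions, not properties of FTB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TOY_QUEUE_CAP 4   /* assumption: tiny fixed-capacity queue */
#define TOY_NAME_LEN 64   /* assumption: maximum event-name length */

/* Hypothetical per-subscription queue of pending event names. */
typedef struct {
    char names[TOY_QUEUE_CAP][TOY_NAME_LEN];
    int head;   /* index of the oldest pending event */
    int count;  /* number of pending events */
} toy_event_queue_t;

void toy_queue_init(toy_event_queue_t *q)
{
    q->head = 0;
    q->count = 0;
}

/* Delivery side: enqueue an event. Returns -1 if the queue is full or
   the name does not fit, in which case the event is dropped. */
int toy_queue_push(toy_event_queue_t *q, const char *event_name)
{
    assert(q != NULL && event_name != NULL);
    if (q->count == TOY_QUEUE_CAP || strlen(event_name) >= TOY_NAME_LEN)
        return -1;
    int tail = (q->head + q->count) % TOY_QUEUE_CAP;
    snprintf(q->names[tail], TOY_NAME_LEN, "%s", event_name);
    q->count++;
    return 0;
}

/* Component side: non-blocking poll. Copies the oldest event name into
   out and returns 0, or returns -1 when nothing is pending. */
int toy_poll_for_event(toy_event_queue_t *q, char *out, size_t out_len)
{
    assert(q != NULL && out != NULL && out_len > 0);
    if (q->count == 0)
        return -1;
    snprintf(out, out_len, "%s", q->names[q->head]);
    q->head = (q->head + 1) % TOY_QUEUE_CAP;
    q->count--;
    return 0;
}
```

Polling suits components such as MPI libraries that prefer to check for events at well-defined points; a callback suits components that must react as soon as an event arrives.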
Supported components
• [Diagram] Connected to the Fault Tolerant Backplane: BLCR, FT-LA, PVFS, ROMIO, ScaLAPACK, Cobalt, MVAPICH2, MPICH2, ZeptoOS, OpenMPI, LAMMPS, LAM/MPI, NWChem, SWIM IPS
Status Quo
• Alpha version in the works
• Demos available on the SC exhibit floor
• Client API to be finalized by Q1 FY08
• First release target: ?
• Platforms supported: Linux clusters, IBM BGL
• Discuss more with Pete
CIFTS team
• Argonne National Laboratory: Pete Beckman, Rinku Gupta, Ewing Lusk, Rob Ross, Rajeev Thakur
• Indiana University: Andrew Lumsdaine & team
• Lawrence Berkeley National Laboratory: Paul Hargrove
• Oak Ridge National Laboratory: Al Geist, David Bernholdt
• Ohio State University: D.K. Panda
• University of Tennessee, Knoxville: Jack Dongarra
Call for Action
• [Diagram] Software shown around the Fault Tolerant Backplane: FT-LA, Lustre, GPFS, Intel MKL, BLCR, IBRIX, GFS, ScaLAPACK, Polyserv, Panasas, PVFS, MAUI, Cobalt, ROMIO, Condor, MVAPICH2, MPICH-MX, SGE, LSF, MPICH2, LAMMPS, ZeptoOS, Intel MPI, PBS/Pro, Global Arrays, OpenMPI, SWIM IPS, NWChem, SLURM, Linux, LAM/MPI, Scali MPI, Star-CD, LS-Dyna, MM5, BLAST, Eclipse, Fluent, other applications
Need more information? • SC’07 Exhibit floor • Demos and/or talks at ANL, ORNL and LBNL booth • CIFTS website • http://www.mcs.anl.gov/research/cifts/ • CIFTS wiki • http://wiki.mcs.anl.gov/cifts • CIFTS mailing list • cifts_discuss@googlegroups.com
Do we need a slide on timeline? • Do we need to go into more details on the design?
CIFTS - The working view
• [Diagram] Components shown with a Bootstrap Server: PVFS; Universal Logger; Checkpoint Restart System; Resource Manager/JS; Event Analysis; System Components; Automatic Actions; Diagnostics Tools; Middleware like MPI; MPI-IO; Linear Algebra Libraries; Autonomics; Libraries and Applications
FTB Internal Architecture Layers
• Component software stack: Component 1 … Component n; FTB Client API; Client Library (Linux, BGL, CRAY); FTB Manager API; Manager Library; Network (Network Module1, Network Module2)
• FTB Agent software stack: FTB Agent; FTB Manager API; Manager Library; Network (Network Module1, Network Module2)
What you need to know!
• Just the FTB Client API
• Component software stack: Component 1 … Component n; FTB Client API; Client Library (Linux, BGL, CRAY); FTB Manager API; Manager Library; Network (Network Module1, Network Module2)
• FTB Agent software stack: FTB Agent; FTB Manager API; Manager Library; Network (Network Module1, Network Module2)
Building an FTB-enabled sample component
• List the events you may want to publish in an XML file (for convenience)
• Use the API to make the component FTB-enabled
• Publish and subscribe to events
FTB-Enabled Component Development (Step 1)
STEP 1: Create an XML file outlining the publishable events
<ftb_component_details>
  <namespace>ftb.ftb_examples.watchdog</namespace>
  <publish_event>
    <event_name>WATCH_DOG_EVENT</event_name>
    <event_severity>Info</event_severity>
    <event_desc>This event is used by watchdog</event_desc>
  </publish_event>
  <publish_event> … </publish_event>
</ftb_component_details>
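Developing an FTB-enabled component (Step 2)
STEP 2: Enabling your FTB component!
#include <string.h>   /* strcpy */
#include "libftb.h"
#include "ftb_event_def.h"
#include "ftb_throw_events.h"

int main (int argc, char *argv[])
{
    /* Declarations are omitted on the slide; the types follow the API
       primitives. The err_msg size and the NULL event payload are
       assumptions. */
    FTB_comp_info_t cinfo;
    FTB_client_handle_t handle;
    FTB_subscribe_handle_t shandle;
    FTB_event_mask_t mask;
    FTB_catch_event_info_t caught_event;
    FTB_event_data_t *publish_event_data = NULL;
    char err_msg[1024];

    /* Describe this component instance. */
    strcpy(cinfo.comp_namespace, "FTB.FTB_EXAMPLES.Watchdog");
    strcpy(cinfo.schema_ver, "0.5");
    strcpy(cinfo.inst_name, "watchdog");
    strcpy(cinfo.jobid, "watchdog-111");
    strcpy(cinfo.catch_style, "FTB_POLLING_CATCH");

    /* Register with FTB and declare the events this component publishes. */
    FTB_Init(&cinfo, &handle, err_msg);
    FTB_Register_publishable_events(handle, ftb_ftb_examples_watchdog_events,
                                    FTB_FTB_EXAMPLES_WATCHDOG_TOTAL_EVENTS, err_msg);

    /* Subscribe to all events; no callback, because we poll. */
    FTB_Create_mask(&mask, "all", "init", err_msg);
    FTB_Subscribe(handle, &mask, &shandle, err_msg, NULL, NULL);

    /* Publish one event, then poll for a caught event. */
    FTB_Publish_event(handle, "WATCH_DOG_EVENT", publish_event_data, err_msg);
    FTB_Poll_for_event(shandle, &caught_event, err_msg);

    FTB_Finalize(handle);
    return 0;
}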
Developing an FTB-enabled component (Step 2, contd.)
STEP 2: Enabling your FTB component! Creating your subscribe event mask
• Create a mask to catch all events
1. FTB_Create_mask(&mask, "all", "init", err_msg);
• Create a mask to catch "WATCH_DOG_EVENT"
1. FTB_Create_mask(&mask, "all", "init", err_msg);
2. FTB_Create_mask(&mask, "event_name", "WATCH_DOG_EVENT", err_msg);
• Create a mask to catch events of severity fatal
1. FTB_Create_mask(&mask, "all", "init", err_msg);
2. FTB_Create_mask(&mask, "severity", "FTB_FATAL", err_msg);
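The mask recipes above follow one pattern: initialise the mask to match everything, then narrow it one field at a time. A toy model of that pattern might look like the sketch below. It is not how FTB stores or evaluates masks; the `toy_*` names, the two-field mask layout, and the empty-string wildcard are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_FIELD_LEN 64  /* assumption: maximum length of a field value */

/* Hypothetical mask: an empty string in a field means "match anything". */
typedef struct {
    char event_name[TOY_FIELD_LEN];
    char severity[TOY_FIELD_LEN];
} toy_mask_t;

/* Mimics the two-step usage on the slide: ("all", "init") resets the mask,
   then a named field narrows it. Returns 0 on success, -1 on a field name
   this toy does not know. */
int toy_create_mask(toy_mask_t *mask, const char *field_name,
                    const char *field_val)
{
    assert(mask != NULL && field_name != NULL && field_val != NULL);
    if (strcmp(field_name, "all") == 0) {
        mask->event_name[0] = '\0';
        mask->severity[0] = '\0';
        return 0;
    }
    if (strcmp(field_name, "event_name") == 0) {
        snprintf(mask->event_name, TOY_FIELD_LEN, "%s", field_val);
        return 0;
    }
    if (strcmp(field_name, "severity") == 0) {
        snprintf(mask->severity, TOY_FIELD_LEN, "%s", field_val);
        return 0;
    }
    return -1;
}

/* An event matches when every non-empty mask field equals the event's. */
int toy_mask_matches(const toy_mask_t *mask, const char *event_name,
                     const char *severity)
{
    int name_ok = mask->event_name[0] == '\0' ||
                  strcmp(mask->event_name, event_name) == 0;
    int sev_ok = mask->severity[0] == '\0' ||
                 strcmp(mask->severity, severity) == 0;
    return name_ok && sev_ok;
}
```

In this model, calling ("all", "init") first matters because it clears any constraint left over from earlier use of the same mask variable.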
Developing a FTB-enabled component (Step 3) STEP 3: Provide options to end user to compile your code with FTB • Modify configure.in and makefiles, so that you can compile your code • ./configure --with-ftb=<PATH to FTB install directory>
Setting up FTB environment Compiling FTB • Download FTB • ./configure --with-platform=linux --with-bstrap-name=bucco • make • make install
Using FTB
Starting FTB
• ./ftb_database_server
• ./ftb_agent on all Linux nodes
• Run your component executables
Connection Topology
• Each FTB agent contacts the bootstrap DB server
• The bootstrap server provides the agent with its parent's address
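The connection step, in which an agent contacts the bootstrap server and receives a parent address, can be sketched as follows. The slides do not say how the server chooses a parent, so this example assumes a simple fixed-fanout tree in registration order; the `toy_*` names, the fanout of 2, and the string addresses are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_AGENTS 16  /* assumption: small fixed agent table */
#define TOY_ADDR_LEN 64    /* assumption: maximum address length */
#define TOY_FANOUT 2       /* assumption: each agent has at most 2 children */

/* Hypothetical bootstrap server state: agents in registration order. */
typedef struct {
    char addrs[TOY_MAX_AGENTS][TOY_ADDR_LEN];
    int num_agents;
} toy_bootstrap_t;

void toy_bootstrap_init(toy_bootstrap_t *bs)
{
    bs->num_agents = 0;
}

/* An agent contacts the server with its own address. On success returns 0
   and stores the parent's address in *parent_out (NULL for the first
   agent, which becomes the root). Returns -1 if the table is full or the
   address is too long. */
int toy_bootstrap_register(toy_bootstrap_t *bs, const char *agent_addr,
                           const char **parent_out)
{
    assert(bs != NULL && agent_addr != NULL && parent_out != NULL);
    if (bs->num_agents >= TOY_MAX_AGENTS ||
        strlen(agent_addr) >= TOY_ADDR_LEN)
        return -1;

    int idx = bs->num_agents++;
    snprintf(bs->addrs[idx], TOY_ADDR_LEN, "%s", agent_addr);

    /* Agent idx hangs under agent (idx - 1) / TOY_FANOUT. */
    *parent_out = (idx == 0) ? NULL : bs->addrs[(idx - 1) / TOY_FANOUT];
    return 0;
}
```

Under this assumed policy the agents form a tree rooted at the first agent to register, so the bootstrap server is needed only at connection time.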
Open Issues
• Policy management
• Global knowledge of component prioritization for handling events
• How can components announce their FT capabilities?
• How can components request action from other components?
• How do we establish scoping of events?