WP1 demo Grid “logical” checkpointing

Fabrizio Pacini (Datamat SpA, WP1), fabrizio.pacini@datamat.it

Middleware Demo Roadmap
• Workload Management (WP1)
• Data Management (WP2)
• Networking (WP7)
• Storage Element (WP5)
• Information Service (WP3)
• Fabric Management (WP4)

Job checkpointing
• Checkpointing: saving the job state from time to time
  • Useful to prevent data loss due to unexpected failures
  • Allows job preemption
  • Also exploited in the job partitioning framework (see D1.4 for details)
• Approach: provide users with a “trivial” logical job checkpointing service
  • The user can save the state of the job (as defined by the application) from time to time
  • A job can be restarted from an intermediate (i.e. previously saved) job state
• Different from “classical” checkpointing (i.e. saving all the information related to a process: the process’s data and stack segments, open files, etc.)
  • Classical checkpointing is very difficult to apply (e.g. it is hard to save the state of open network connections)
  • It is not necessary for all the DataGrid reference applications
    • Sequential processing cases
    • The state of the application is represented by a small amount of information defined by the application itself

Job checkpointing example
Example of application (e.g. HEP Monte Carlo simulation)

    int main () {
      …
      for (int i=event; i < EVMAX; i++) {
        < process event i>;
      }
      ...
      exit(0);
    }

Job checkpointing example
User code must be easily instrumented in order to exploit the checkpointing framework …

    #include "checkpointing.h"

    int main () {
      JobState state(JobState::job);
      event = state.getIntValue("first_event");
      PFN_of_file_on_SE = state.getStringValue("filename");
      ….
      var_n = state.getBoolValue("var_n");
      < copy file_on_SE locally>;
      …
      for (int i=event; i < EVMAX; i++) {
        < process event i>;
        ...
        state.saveValue("first_event", i+1);
        < save intermediate file on a SE>;
        state.saveValue("filename", PFN_of_file_on_SE);
        ...
        state.saveValue("var_n", value_n);
        state.saveState();
      }
      …
      exit(0);
    }

Job checkpointing example (same instrumented code as on the previous slide)
• The user defines what a state is
  • Defined as <var, value> pairs
  • Must be “enough” to restart a computation from a previously saved state

Job checkpointing example (same instrumented code)
• The user can save the state of the job from time to time (the state.saveValue(...) and state.saveState() calls inside the loop)

Job checkpointing example (same instrumented code)
• Retrieval of the last saved state (the state.getIntValue(...), state.getStringValue(...) and state.getBoolValue(...) calls before the loop)
• The job can restart from that point
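To experiment with the instrumented loop without the WP1 library, the calls used on the slides can be mimicked by a small in-memory class. This is only a sketch under stated assumptions: it is not the content of checkpointing.h, it keeps the state in memory instead of sending it to a server, it is default-constructed rather than built from JobState::job, and the values returned for missing variables (0, empty string, false) are arbitrary choices.

```cpp
#include <map>
#include <string>

// Hypothetical in-memory stand-in for the JobState class shown on the slides.
// It only mimics the method names used there; storage and defaults are assumptions.
class JobState {
public:
    // Stage a <var, value> pair; nothing is committed until saveState().
    void saveValue(const std::string& name, int value) { staged_[name] = std::to_string(value); }
    void saveValue(const std::string& name, bool value) { staged_[name] = value ? "true" : "false"; }
    void saveValue(const std::string& name, const std::string& value) { staged_[name] = value; }
    // Without this overload a string literal would select the bool overload.
    void saveValue(const std::string& name, const char* value) { staged_[name] = value; }

    // Commit all staged pairs as the last saved state.
    void saveState() { committed_ = staged_; }

    // Getters read only the last committed state.
    int getIntValue(const std::string& name) const {
        auto it = committed_.find(name);
        return it == committed_.end() ? 0 : std::stoi(it->second);
    }
    std::string getStringValue(const std::string& name) const {
        auto it = committed_.find(name);
        return it == committed_.end() ? std::string() : it->second;
    }
    bool getBoolValue(const std::string& name) const {
        auto it = committed_.find(name);
        return it != committed_.end() && it->second == "true";
    }

private:
    std::map<std::string, std::string> staged_;     // pairs set since the last commit
    std::map<std::string, std::string> committed_;  // last saved state
};
```

Separating staged from committed values mirrors the slide code, where several saveValue calls are followed by a single saveState: a job that dies between the two would restart from the previous complete state rather than from a half-written one.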

Job checkpointing
• Job checkpoint states are saved in the Logging & Bookkeeping (LB) server
  • Also used (even in rel. 1) as the repository of job status info
  • Already proved to be robust and reliable
  • The load can be distributed between multiple LB servers, to address scalability problems
[Diagram: the Job saves its checkpoint state to the Logging & Bookkeeping Server with state.saveState() and retrieves the job checkpoint from it.]

Demo
• Purpose
  • To show how job checkpointing helps address and manage failures
• Application used for the demo
  • HEP application which fills a histogram
  • Application instrumented with the WP1 checkpointing library
    • To save the intermediate state from time to time (number of events processed so far and pathname of the intermediate histogram file)
    • To be able to restart its computation from a previously saved state
• Scenario
  • Job submitted to a CE
  • While the job runs, it saves its state from time to time
  • Job failure (triggered by manually simulating a CE problem)
  • Job resubmitted by the WMS, possibly to a different CE
  • Job restarts its computation from the last saved state
    → No need to restart from the beginning
    → The computation done up to that moment is not lost
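The scenario above can be rehearsed with a toy job. The following is a minimal sketch, not the demo application: the names Checkpoint, CheckpointStore and run_job are hypothetical, the histogram is kept in memory rather than in a file on a Storage Element, and the failure is simulated by returning early at a chosen event.

```cpp
#include <vector>

// Application-defined state: the next event to process and the partial histogram.
struct Checkpoint {
    int first_event;
    std::vector<int> histogram;
};

// Hypothetical stand-in for the place where the last saved state is kept.
struct CheckpointStore {
    bool has_state = false;
    Checkpoint last;
};

// Outcome of one run attempt.
struct RunResult {
    bool completed;
    int events_processed;
    std::vector<int> histogram;
};

// Process events [first_event, evmax), saving a checkpoint every `interval` events.
// `fail_at` simulates a CE failure just before that event (negative: no failure).
RunResult run_job(CheckpointStore& store, int evmax, int bins, int interval, int fail_at) {
    // Retrieve the last saved state if there is one, otherwise start from scratch.
    Checkpoint state = store.has_state ? store.last
                                       : Checkpoint{0, std::vector<int>(bins, 0)};
    int processed = 0;
    for (int i = state.first_event; i < evmax; ++i) {
        if (i == fail_at) return {false, processed, state.histogram};
        state.histogram[i % bins] += 1;  // "process event i": fill one bin
        ++processed;
        state.first_event = i + 1;
        if ((i + 1) % interval == 0) {   // save the intermediate state
            store.last = state;
            store.has_state = true;
        }
    }
    return {true, processed, state.histogram};
}
```

With 100 events, 10 bins, a checkpoint every 10 events and a failure at event 37, the first attempt processes 37 events and the second attempt resumes at event 30 and processes only the remaining 70. The 7 events between the last checkpoint and the failure are the only work repeated.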

Testbed for this demo
• UI
  • Running on a notebook here (Linux 6.2)
• Other WMS services (NS, WM, JC, LB)
  • Running on a machine at INFN-CNAF, Bologna (Linux RH 6.2)
• CEs
  • A notebook here (Linux RH 6.2): the one which will have a problem …
  • An LSF farm at INFN-Padova (Linux RH 6.2)
  • A PBS farm at INFN-Milano (Linux RH 7.3)
  • A PBS farm at CESNET-Prague (Debian 2.2)

Job checkpointing scenario
[Diagram components: UI; RB node; Network Server; Workload Manager; Job Contr. - CondorG; Logging & Bookkeeping Server; Computing Element X; Computing Element Y.]

Job checkpointing scenario
edg-job-submit jobchkpt.jdl

jobchkpt.jdl:

    [
      JobType = "Checkpointable";
      Executable = "hsum.exe";
      StdOutput = "Outfile";
      InputSandbox = "/home/user/hsum.exe";
      OutputSandbox = "Outfile";
      Requirements = member("ROOT", other.GlueHostApplicationSoftwareRunTimeEnvironment) && member("CHKPT", other.GlueHostApplicationSoftwareRunTimeEnvironment);
      Rank = -other.GlueCEStateEstimatedResponseTime;
    ]

• Job Description Language (JDL): used to specify job characteristics and requirements
• UI: allows users to access the functionalities of the WMS
Job Status: submitted
[Diagram components: UI; RB node; Network Server; Workload Manager; Job Contr. - CondorG; Logging & Bookkeeping Server; Replica Catalog; Computing Element X; Computing Element Y.]

Job checkpointing scenario
Job Status: submitted → waiting → ready → scheduled → running
[Diagram: numbered steps 1 to 6 trace the Job and its Input Sandbox files across the UI, Network Server, Workload Manager, Match-maker, RB storage, Job Adapter, Job Contr. - CondorG and a Computing Element; the Logging & Bookkeeping Server is also shown.]

Job checkpointing scenario
Job Status: submitted → waiting → ready → scheduled → running
From time to time the user’s job asks to save the intermediate state:

    …
    <save intermediate files>;
    state.saveValue("var1", value1);
    …
    state.saveValue("varn", valuen);
    state.saveState();
    …

[Diagram components: UI; RB node; Network Server; Workload Manager; Job Contr. - CondorG; RB storage; Logging & Bookkeeping Server; Computing Element X; Computing Element Y; Job.]

Job checkpointing scenario
Job Status: submitted → waiting → ready → scheduled → running
• Saving of intermediate files
• Saving of job state
[Diagram components: UI; RB node; Network Server; Workload Manager; Job Contr. - CondorG; RB storage; Logging & Bookkeeping Server; Computing Element X; Computing Element Y; Job.]

Job checkpointing scenario
Job Status: submitted → waiting → ready → scheduled → running → done (failed)
• Job fails (e.g. because of a CE problem)
[Diagram components: UI; RB node; Network Server; Workload Manager; Job Contr. - CondorG; RB storage; Logging & Bookkeeping Server; Computing Element X; Computing Element Y; Job.]

Job checkpointing scenario
Job Status: submitted → waiting → ready → scheduled → running → done (failed) → waiting
• Reschedule and resubmit the job
• Where must this job be executed? Possibly on a CE different from the one where the job was previously submitted …
[Diagram components: UI; RB node; Network Server; Match-maker; Workload Manager; Job Contr. - CondorG; RB storage; Logging & Bookkeeping Server; Computing Element X; Computing Element Y; Job.]

Job checkpointing scenario
Job Status: submitted → waiting → ready → scheduled → running → done (failed) → waiting
• CE choice: CEy
[Diagram components: UI; RB node; Network Server; Match-maker; Workload Manager; Job Contr. - CondorG; RB storage; Logging & Bookkeeping Server; Computing Element X; Computing Element Y.]
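The CE choice made during resubmission can be illustrated with a toy matchmaker. This is a sketch under assumptions, not the WMS Match-maker: the struct ComputingElement and the function choose_ce are hypothetical, the requirements (runtime environment tags ROOT and CHKPT) and the rank (negated estimated response time) are taken from the jobchkpt.jdl slide, and the CE where the job failed is always excluded, whereas the slides only say the job is "possibly" sent to a different CE.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical view of the CE attributes referenced in the JDL.
struct ComputingElement {
    std::string id;
    std::vector<std::string> runtime_env;  // GlueHostApplicationSoftwareRunTimeEnvironment
    double estimated_response_time;        // GlueCEStateEstimatedResponseTime
};

// Same meaning as member(tag, list) in the JDL Requirements expression.
bool member(const std::string& tag, const std::vector<std::string>& list) {
    return std::find(list.begin(), list.end(), tag) != list.end();
}

// Return the matching CE with the highest rank, or nullptr if none matches.
// `failed_ce` is skipped (assumption: a failed CE is never reused).
const ComputingElement* choose_ce(const std::vector<ComputingElement>& ces,
                                  const std::string& failed_ce) {
    const ComputingElement* best = nullptr;
    for (const auto& ce : ces) {
        bool satisfies = member("ROOT", ce.runtime_env) && member("CHKPT", ce.runtime_env);
        if (!satisfies || ce.id == failed_ce) continue;
        // Rank = -EstimatedResponseTime, so the highest rank is the lowest time.
        if (!best || ce.estimated_response_time < best->estimated_response_time) best = &ce;
    }
    return best;
}
```

For a first submission, pass an empty string as `failed_ce`; for the resubmission, pass the identifier of the CE where the job failed so that the next best match is selected.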

Job checkpointing scenario
Job Status: ready → scheduled → running → done (failed) → waiting → ready
• CE characteristics & status
[Diagram components: UI; RB node; Network Server; Workload Manager; Job Adapter; Job Contr. - CondorG; RB storage; Logging & Bookkeeping Server; Computing Element X; Computing Element Y; Job.]

Job checkpointing scenario
Job Status: ready → scheduled → running → done (failed) → waiting → ready → scheduled
[Diagram: the Job and its Input Sandbox files are sent to a Computing Element. Components: UI; RB node; Network Server; Workload Manager; Job Contr. - CondorG; RB storage; Logging & Bookkeeping Server; Computing Element X; Computing Element Y.]

Job checkpointing scenario
Job Status: scheduled → running → done (failed) → waiting → ready → scheduled → running
• Retrieval of the last saved state when the job starts
• Retrieval of the intermediate files (previously saved)
[Diagram components: UI; RB node; Network Server; Workload Manager; Job Contr. - CondorG; RB storage; Logging & Bookkeeping Server; Computing Element X; Computing Element Y; Job.]

Job checkpointing scenario
Job Status: scheduled → running → done (failed) → waiting → ready → scheduled → running
• The job keeps running, starting from the point corresponding to the retrieved state (it doesn’t need to start from the beginning)
[Diagram components: UI; RB node; Network Server; Workload Manager; Job Contr. - CondorG; RB storage; Logging & Bookkeeping Server; Computing Element X; Computing Element Y; Job.]
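The status sequence walked through in the scenario slides can be written down as a small transition table. This sketch encodes only the transitions that appear on these slides; the names JobStatus, can_transition and is_valid_history are hypothetical, and the actual Logging & Bookkeeping state machine may define more states and transitions.

```cpp
#include <cstddef>
#include <vector>

// Job statuses as they appear on the scenario slides.
enum class JobStatus { Submitted, Waiting, Ready, Scheduled, Running, DoneFailed };

// Transitions observed in the demo scenario only (assumption: no others are allowed).
bool can_transition(JobStatus from, JobStatus to) {
    switch (from) {
        case JobStatus::Submitted:  return to == JobStatus::Waiting;
        case JobStatus::Waiting:    return to == JobStatus::Ready;
        case JobStatus::Ready:      return to == JobStatus::Scheduled;
        case JobStatus::Scheduled:  return to == JobStatus::Running;
        case JobStatus::Running:    return to == JobStatus::DoneFailed;
        case JobStatus::DoneFailed: return to == JobStatus::Waiting;  // resubmission
    }
    return false;
}

// A history is valid if it starts at Submitted and every step is an observed transition.
bool is_valid_history(const std::vector<JobStatus>& history) {
    if (history.empty() || history.front() != JobStatus::Submitted) return false;
    for (std::size_t i = 1; i < history.size(); ++i) {
        if (!can_transition(history[i - 1], history[i])) return false;
    }
    return true;
}
```

The step from done (failed) back to waiting is what makes the resubmission visible in the job status: the job re-enters matchmaking instead of ending.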

Summary
• The Workload Management System was re-factored to streamline the flow of job information, thereby addressing problems and shortcomings found with release 1.x.
• The re-factored components also provide hooks and features to support new functionality.
• Among these, we chose to demonstrate Grid “logical” checkpointing, as it gives applications an important degree of freedom on the Grid and is minimally intrusive.
• The implemented Checkpointing API has been discussed with the DataGrid reference applications since June 2002, and was presented to and well received by the GGF Grid Checkpointing WG.
