WP1 demo Grid “logical” checkpointing. Fabrizio Pacini (Datamat SpA, WP1 ) fabrizio.pacini@datamat.it. Workload Management (WP1). Middleware Demo Roadmap. Data Management (WP2). Networking (WP7). Storage Element (WP5). Information Service (WP3). Fabric Management (WP4).
Job checkpointing
• Checkpointing: saving the job state from time to time
  • Useful to prevent data loss due to unexpected failures
  • To allow job preemption
  • Also exploited in the job partitioning framework (see D1.4 for details)
• Approach: provide users with a “trivial” logical job checkpointing service
  • User can save from time to time the state of the job (defined by the application)
  • A job can be restarted from an intermediate (i.e. “previously” saved) job state
• Different from “classical” checkpointing (i.e. saving all the information related to a process: process’s data and stack segments, open files, etc.)
  • Very difficult to apply (e.g. problems saving the state of open network connections)
  • Not necessary for all the DataGrid reference applications
    • Sequential processing cases
    • The state of the application is represented by a small amount of information defined by the application itself
Job checkpointing example
Example of application (e.g. HEP MonteCarlo simulation)

    int main () {
      …
      for (int i=event; i < EVMAX; i++) {
        <process event i>;
      }
      ...
      exit(0);
    }
Job checkpointing example
User code must be easily instrumented in order to exploit the checkpointing framework …

    #include "checkpointing.h"

    int main () {
      JobState state(JobState::job);
      // Retrieve the last saved state (if any)
      event = state.getIntValue("first_event");
      PFN_of_file_on_SE = state.getStringValue("filename");
      ….
      var_n = state.getBoolValue("var_n");
      <copy file_on_SE locally>;
      …
      for (int i=event; i < EVMAX; i++) {
        <process event i>;
        ...
        // Record the values that define the state, then save it
        state.saveValue("first_event", i+1);
        <save intermediate file on a SE>;
        state.saveValue("filename", PFN_of_file_on_SE);
        ...
        state.saveValue("var_n", value_n);
        state.saveState();
      }
      …
      exit(0);
    }
Job checkpointing example: notes on the instrumented code
• User defines what is a state
  • Defined as <var, value> pairs
  • Must be “enough” to restart a computation from a previously saved state
• User can save from time to time the state of the job (the state.saveValue(...) calls followed by state.saveState() inside the loop)
• Retrieval of the last saved state (the state.getIntValue(...), state.getStringValue(...) and state.getBoolValue(...) calls at the start of main)
  • The job can restart from that point
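The <var, value> idea above can be illustrated with a small in-memory stand-in. This is only a sketch under stated assumptions: MockJobState is a hypothetical class, not the WP1 library's JobState; the static map takes the place of the remote store; and the optional fallback argument of the getters is an addition made for the example.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the checkpointing library's JobState.
// A static map plays the role of the persistent store, so a "restarted"
// job is simply a newly constructed object.
class MockJobState {
public:
    // On construction, load the last committed state (empty on a first run).
    MockJobState() : pending_(store()) {}

    // Record a <var, value> pair; nothing is committed until saveState().
    void saveValue(const std::string& var, int value) {
        pending_[var] = std::to_string(value);
    }
    void saveValue(const std::string& var, const std::string& value) {
        pending_[var] = value;
    }

    // Return the stored value, or the fallback if the variable was never saved.
    int getIntValue(const std::string& var, int fallback = 0) const {
        auto it = pending_.find(var);
        return it == pending_.end() ? fallback : std::stoi(it->second);
    }
    std::string getStringValue(const std::string& var,
                               const std::string& fallback = "") const {
        auto it = pending_.find(var);
        return it == pending_.end() ? fallback : it->second;
    }

    // Commit all recorded pairs as the new "last saved state".
    void saveState() { store() = pending_; }

private:
    // The committed state shared by all instances (assumption of this sketch).
    static std::map<std::string, std::string>& store() {
        static std::map<std::string, std::string> committed;
        return committed;
    }
    std::map<std::string, std::string> pending_;
};
```

In this sketch values recorded with saveValue() become visible to a restarted job only after saveState(), which mirrors the save-then-commit order used in the instrumented loop.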
Job checkpointing: job checkpoint states saved in the LB server
[Diagram: the Job saves its checkpoint state to the Logging & Bookkeeping Server with state.saveState(), and retrieves the job checkpoint from it]
Logging & Bookkeeping Server
• Also used (even in rel. 1) as repository of job status info
• Already proved to be robust and reliable
• The load can be distributed between multiple LB servers, to address scalability problems
Demo
• Purpose
  • To show how job checkpointing helps address and manage failures
• Application used for demo
  • HEP application which fills a histogram
  • Application instrumented with WP1 checkpointing library
    • To save from time to time the intermediate state (number of events processed so far and pathname of intermediate histogram file)
    • To be able to restart its computation from a previously saved state
• Scenario
  • Job submitted to a CE
  • When the job runs it saves its state from time to time
  • Job failure (triggered by simulating a CE problem by hand)
  • Job resubmitted by the WMS, possibly to a different CE
  • Job restarts its computation from the last saved state
    • No need to restart from the beginning
    • The computation done up to that moment is not lost
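The scenario can be mimicked locally with a short simulation. This is a sketch under stated assumptions, not the demo application itself: Checkpoint and run_job are hypothetical names, an in-memory vector stands in for the intermediate histogram file, and the failure is injected through a parameter instead of a real CE problem.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical checkpoint: the state the application chooses to save.
struct Checkpoint {
    int next_event;              // first event still to be processed
    std::vector<int> histogram;  // stands in for the intermediate histogram file

    explicit Checkpoint(std::size_t bins) : next_event(0), histogram(bins, 0) {}
};

// Process events [saved.next_event, total_events), saving the state every
// `interval` events. If fail_at >= 0 the run aborts just before that event,
// as if the CE had failed; work done since the last save is then lost.
// Returns true when the job completes. `events_processed` counts the events
// processed across all attempts.
bool run_job(Checkpoint& saved, int total_events, int interval,
             int fail_at, int& events_processed) {
    Checkpoint local = saved;  // retrieval of the last saved state
    for (int i = local.next_event; i < total_events; ++i) {
        if (fail_at >= 0 && i == fail_at) {
            return false;      // simulated CE problem
        }
        // "Process event i": fill one histogram bin.
        local.histogram[static_cast<std::size_t>(i) % local.histogram.size()] += 1;
        ++events_processed;
        local.next_event = i + 1;
        if (local.next_event % interval == 0) {
            saved = local;     // plays the role of state.saveState()
        }
    }
    saved = local;             // final save on normal completion
    return true;
}
```

With 100 events, a save every 10 events and a failure injected at event 37, the second attempt resumes at event 30, so only 7 events are reprocessed instead of 37.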
Testbed for this demo
• UI
  • Running on a notebook here (Linux 6.2)
• Other WMS services (NS, WM, JC, LB)
  • Running on a machine at INFN-CNAF, Bologna (Linux RH 6.2)
• CEs
  • A notebook here (Linux RH 6.2): the one which will have a problem …
  • An LSF farm at INFN-Padova (Linux RH 6.2)
  • A PBS farm at INFN-Milano (Linux RH 7.3)
  • A PBS farm at CESNET-Prague (Debian 2.2)
Job checkpointing scenario
[Diagram: UI; RB node with Network Server, Workload Manager and Job Contr. - CondorG; Logging & Bookkeeping Server; Computing Element X; Computing Element Y]
Job checkpointing scenario
Job status: submitted

edg-job-submit jobchkpt.jdl

jobchkpt.jdl:

    [
      JobType = "Checkpointable";
      Executable = "hsum.exe";
      StdOutput = "Outfile";
      InputSandbox = "/home/user/hsum.exe";
      OutputSandbox = "Outfile";
      Requirements = member("ROOT", other.GlueHostApplicationSoftwareRunTimeEnvironment) && member("CHKPT", other.GlueHostApplicationSoftwareRunTimeEnvironment);
      Rank = -other.GlueCEStateEstimatedResponseTime;
    ]

• Job Description Language (JDL): used to specify job characteristics and requirements
• UI: allows users to access the functionalities of the WMS
[Diagram: UI; RB node with Network Server, Workload Manager and Job Contr. - CondorG; Logging & Bookkeeping Server; Replica Catalog; Computing Element X; Computing Element Y]
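The effect of the Requirements and Rank expressions in this JDL can be sketched as a simple selection over candidate CEs. This is a simplified illustration under stated assumptions, not the WMS matchmaker: ComputingElement and choose_ce are hypothetical names, the two fields stand for the Glue attributes named in the JDL, and the sketch assumes that the CE with the highest Rank is preferred.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical, simplified view of a Computing Element.
struct ComputingElement {
    std::string id;
    std::vector<std::string> runtime_env;  // GlueHostApplicationSoftwareRunTimeEnvironment
    double estimated_response_time;        // GlueCEStateEstimatedResponseTime
};

// Counterpart of member(tag, list) in the Requirements expression.
bool member(const std::string& tag, const std::vector<std::string>& list) {
    return std::find(list.begin(), list.end(), tag) != list.end();
}

// Return the CE that satisfies the Requirements and has the highest Rank,
// or nullptr when no CE matches.
const ComputingElement* choose_ce(const std::vector<ComputingElement>& ces) {
    const ComputingElement* best = nullptr;
    for (const auto& ce : ces) {
        // Requirements: both "ROOT" and "CHKPT" must be published by the CE.
        if (!member("ROOT", ce.runtime_env) || !member("CHKPT", ce.runtime_env)) {
            continue;
        }
        // Rank = -EstimatedResponseTime: a shorter response time ranks higher.
        double rank = -ce.estimated_response_time;
        if (best == nullptr || rank > -best->estimated_response_time) {
            best = &ce;
        }
    }
    return best;
}
```

Negating the estimated response time means that, among the CEs publishing both tags, the one expected to start the job soonest is selected.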
Job checkpointing scenario
Job status: submitted -> waiting -> ready -> scheduled -> running
[Diagram: numbered steps 1 to 6 carry the Job and its Input Sandbox files from the UI through the RB node components (Network Server, Match-maker, Workload Manager, RB storage, Job Adapter, Job Contr. - CondorG) to a Computing Element; the Logging & Bookkeeping Server is also shown]
Job checkpointing scenario
Job status: submitted -> waiting -> ready -> scheduled -> running
From time to time the user’s job asks to save the intermediate state:

    …
    <save intermediate files>;
    state.saveValue("var1", value1);
    …
    state.saveValue("varn", valuen);
    state.saveState();
    …

[Diagram: UI; RB node with Network Server, Workload Manager, RB storage and Job Contr. - CondorG; Logging & Bookkeeping Server; Job on a Computing Element]
Job checkpointing scenario
Job status: submitted -> waiting -> ready -> scheduled -> running
• Saving of intermediate files
• Saving of job state
[Diagram: UI; RB node with Network Server, Workload Manager, RB storage and Job Contr. - CondorG; Logging & Bookkeeping Server; Job on a Computing Element]
Job checkpointing scenario
Job status: submitted -> waiting -> ready -> scheduled -> running -> done (failed)
Job fails (e.g. for a CE problem)
[Diagram: UI; RB node with Network Server, Workload Manager, RB storage and Job Contr. - CondorG; Logging & Bookkeeping Server; Computing Element X; Computing Element Y]
Job checkpointing scenario
Job status: submitted -> waiting -> ready -> scheduled -> running -> done (failed) -> waiting
Reschedule and resubmit job
Match-maker: where must this job be executed? Possibly on a CE different from the one where the job was previously submitted …
[Diagram: UI; RB node with Network Server, Match-maker, Workload Manager, RB storage and Job Contr. - CondorG; Logging & Bookkeeping Server; Computing Element X; Computing Element Y]
Job checkpointing scenario
Job status: submitted -> waiting -> ready -> scheduled -> running -> done (failed) -> waiting
CE choice: CEy
[Diagram: UI; RB node with Network Server, Match-maker, Workload Manager, RB storage and Job Contr. - CondorG; Logging & Bookkeeping Server; Computing Element X; Computing Element Y]
Job checkpointing scenario
Job status: ready -> scheduled -> running -> done (failed) -> waiting -> ready
[Diagram: the Job passes through the Job Adapter to Job Contr. - CondorG; CE characteristics & status are shown as an input; other components: UI, Network Server, Workload Manager, RB storage, Logging & Bookkeeping Server, Computing Element X, Computing Element Y]
Job checkpointing scenario
Job status: ready -> scheduled -> running -> done (failed) -> waiting -> ready -> scheduled
[Diagram: the Job and its Input Sandbox files are transferred to a Computing Element; other components: UI, Network Server, Workload Manager, RB storage, Job Contr. - CondorG, Logging & Bookkeeping Server]
Job checkpointing scenario
Job status: scheduled -> running -> done (failed) -> waiting -> ready -> scheduled -> running
• Retrieval of last saved state when job starts
• Retrieval of intermediate files (previously saved)
[Diagram: UI; RB node with Network Server, Workload Manager, RB storage and Job Contr. - CondorG; Logging & Bookkeeping Server; Job on a Computing Element]
Job checkpointing scenario
Job status: scheduled -> running -> done (failed) -> waiting -> ready -> scheduled -> running
Job keeps running, starting from the point corresponding to the retrieved state (it doesn’t need to start from the beginning)
[Diagram: UI; RB node with Network Server, Workload Manager, RB storage and Job Contr. - CondorG; Logging & Bookkeeping Server; Job on a Computing Element]
Summary
• The Workload Management System was re-factored to streamline the flow of job information, thereby addressing problems and shortcomings found with release 1.x.
• The re-factored components also provide hooks and features to support new functionality.
• Among these, we chose to demonstrate Grid “logical” checkpointing, as it allows applications to achieve one very important degree of freedom over the Grid and is minimally intrusive.
• The implemented Checkpointing API has been discussed with the DataGrid reference applications since June 2002, and was presented and well received in the GGF Grid Checkpointing WG.