Grid Checkpointing Architecture
Radosław Januszewski
CoreGrid Summer School 2007
motivation
• Grids are complex and therefore prone to errors.
• The distributed nature of the Grid makes it hard to schedule system maintenance.
• Each uncoordinated power-down or failure results in the loss of the applications running at that moment.
• Lost computation time means additional cost!
European Research Network on Foundations, Software Infrastructures and Applications for large scale distributed, GRID and Peer-to-Peer Technologies
goal
• To enhance the reliability, fault tolerance and robustness of the Grid computing environment.
the solution
• Grid Checkpoint Architecture (GCA): a proposal for the placement, functionality and interaction schemes of a checkpointing service in the Grid environment
grid - model
GCA in the Grid
Proof of concept – the goals
• check whether the GCA survives contact with reality
• prepare a PoC based on a real-life installation
• the Grid with the GCA should provide additional value compared with the "traditional" approach
GCA proof of concept installation
involved elements
• GUI: command line, GridSphere, Migrating Desktop
• Broker: GRMS
• Local Resource Manager: Globus + TORQUE
• Core service: SGIckpt
Bottom-up approach
• How can the checkpointer be made to work with the local resource manager?
PBS/TORQUE special features
• action checkpoint
• action restart
• action checkpoint_abort
config
• $action checkpoint 0 !/usr/pbs/bin/pbs-mom-checkpoint.sh %globid %jobid %sid %taskid %path
• $action restart 0 !/usr/pbs/bin/pbs_restart_test.sh %path %taskid
• $restart_transmogrify true
• $action checkpoint_abort 0 !/usr/pbs/bin/pbs-mom-checkpoint-and-stop.sh %globid %jobid %sid %taskid %path
• A detailed description is available at http://checkpointing.psnc.pl
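To make the shape of these `$action` lines easier to follow, here is a minimal Python sketch that splits one line into its parts and substitutes the `%name` placeholders. It is only a simplified model for illustration, not how pbs_mom itself parses its configuration. The function names `parse_action` and `expand`, the field name `number`, and the sample values are all assumptions introduced here.

```python
import re


def parse_action(line):
    """Split a '$action <name> <number> !<script> <args...>' line into parts.

    Simplified illustration only; the real pbs_mom parser is not shown in the slides.
    """
    keyword, name, number, command = line.split(None, 3)
    if keyword != "$action" or not command.startswith("!"):
        raise ValueError("not an $action line with a !script: " + line)
    script, *args = command[1:].split()
    return {"name": name, "number": int(number), "script": script, "args": args}


def expand(args, values):
    """Replace each %placeholder in args with the matching entry of values."""
    return [re.sub(r"%(\w+)", lambda m: str(values[m.group(1)]), a) for a in args]


if __name__ == "__main__":
    action = parse_action(
        "$action restart 0 !/usr/pbs/bin/pbs_restart_test.sh %path %taskid"
    )
    # Hypothetical values, chosen only to show the substitution.
    argv = [action["script"]] + expand(action["args"], {"path": "/tmp/ckpt", "taskid": 1})
    print(argv)
```

The sketch treats every placeholder uniformly, so a missing value raises a `KeyError` instead of silently passing an empty argument to the script.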
Broker – local RM connectivity
problem
• The checkpointer: a service or a resource?
job description with checkpointing
<grmsJob appid="matrix_demo_submit">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource>
      <localrmname>pbs</localrmname>
    </resource>
    <executable type="multiple" count="1">
      <execfile name="matrixi">
        <url>gsiftp://xxx.xxx.xxx.xxxl//home/user/povray</url>
      </execfile>
    </executable>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>
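As an illustration, a job description of this shape could be generated programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` module and mirrors only the elements visible on the slide. The function name `build_job_description`, its parameters, and the example URL are assumptions made here, and the sketch does not validate against any GRMS schema.

```python
import xml.etree.ElementTree as ET


def build_job_description(appid, taskid, exec_name, exec_url, period=1):
    """Build a checkpointable job description mirroring the slide's structure."""
    job = ET.Element("grmsJob", appid=appid)
    task = ET.SubElement(job, "task", taskid=taskid, persistent="true", crucial="true")

    resource = ET.SubElement(task, "resource")
    ET.SubElement(resource, "localrmname").text = "pbs"

    executable = ET.SubElement(task, "executable", type="multiple", count="1")
    execfile = ET.SubElement(executable, "execfile", name=exec_name)
    ET.SubElement(execfile, "url").text = exec_url

    # The <other> section carries the checkpoint-related settings.
    other = ET.SubElement(task, "other")
    ET.SubElement(other, "grms_id").text = "${JOB_ID}"  # kept literally, as on the slide
    ET.SubElement(other, "checkpointable").text = "true"
    ET.SubElement(other, "period").text = str(period)
    return job


if __name__ == "__main__":
    # The URL is a placeholder, not a real endpoint.
    job = build_job_description(
        "matrix_demo_submit", "matrix", "matrixi", "gsiftp://example.invalid//path/to/app"
    )
    print(ET.tostring(job, encoding="unicode"))
```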
the end-user point of view
manual scenario
manual scenario - restart
<grmsJob appid="matrix_demo_resume">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource>
      <hostname>node-03.checkpointing.psnc.pl</hostname>
      <localrmname>pbs</localrmname>
    </resource>
    <executable type="multiple" count="1">
      <execfile name="matrix_long">
        <url>gsiftp://xxx.xxx.xxx.xxx//home/xxxxxx/test_apps/matrix_long</url>
      </execfile>
    </executable>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <recovery>true</recovery>
      <ckpt_id>1179315947518_matrix_demo_submit_0459</ckpt_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>
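Compared with a submit description, the restart description adds a `<hostname>` under `<resource>` and the `<recovery>` and `<ckpt_id>` elements under `<other>`. The Python sketch below derives one from the other under that reading. The helper name `make_resume_description` and the shortened `SUBMIT_XML` sample are assumptions made for illustration, and whether GRMS cares about element order is not stated in the slides.

```python
import xml.etree.ElementTree as ET

# Shortened submit description (the <executable> section is omitted for brevity).
SUBMIT_XML = """<grmsJob appid="matrix_demo_submit">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource>
      <localrmname>pbs</localrmname>
    </resource>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>"""


def make_resume_description(submit_xml, ckpt_id, hostname=None, appid=None):
    """Return a restart job description derived from a submit description."""
    job = ET.fromstring(submit_xml)
    if appid is not None:
        job.set("appid", appid)
    task = job.find("task")

    if hostname is not None:
        host = ET.Element("hostname")
        host.text = hostname
        task.find("resource").insert(0, host)  # pin the restart to a given node

    other = task.find("other")
    grms_id = other.find("grms_id")
    # Insert right after <grms_id> to mirror the element order on the slide.
    pos = list(other).index(grms_id) + 1 if grms_id is not None else 0
    for offset, (tag, text) in enumerate([("recovery", "true"), ("ckpt_id", ckpt_id)]):
        elem = ET.Element(tag)
        elem.text = text
        other.insert(pos + offset, elem)
    return job


if __name__ == "__main__":
    resume = make_resume_description(
        SUBMIT_XML,
        "1179315947518_matrix_demo_submit_0459",
        hostname="node-03.checkpointing.psnc.pl",
        appid="matrix_demo_resume",
    )
    print(ET.tostring(resume, encoding="unicode"))
```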
failure – end-user view
problem
• This semi-automatic solution is not optimal.
• How can automatic job-failure handling be introduced without adding new functionality to the Broker?
• Use workflows!
the workflow
Problem: this broker cannot model loops in a workflow.
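One common way to work around a workflow engine without loops is to unroll the retry loop into a fixed chain of tasks, where each restart task runs only if its predecessor failed. The slides do not show the actual workflow definition, so the Python sketch below is only a hypothetical model of that idea. The task fields (`id`, `recovery`, `run_if_failed`) and both function names are invented for illustration and are not GRMS syntax.

```python
def unroll_retries(max_restarts):
    """Expand 'retry until success' into a fixed-length chain of tasks."""
    tasks = [{"id": "run_0", "recovery": False, "run_if_failed": None}]
    for attempt in range(1, max_restarts + 1):
        tasks.append({
            "id": f"run_{attempt}",
            "recovery": True,                        # restart from the last checkpoint
            "run_if_failed": f"run_{attempt - 1}",   # triggered only by a failure
        })
    return tasks


def simulate(tasks, fails):
    """Return the ids of the tasks that would execute, given the set of failing ids."""
    executed, failed = [], set()
    for task in tasks:
        parent = task["run_if_failed"]
        if parent is not None and parent not in failed:
            break  # predecessor did not fail, so the chain stops here
        executed.append(task["id"])
        if task["id"] in fails:
            failed.add(task["id"])
        else:
            break  # task succeeded, no further restarts are needed
    return executed


if __name__ == "__main__":
    chain = unroll_retries(3)
    print(simulate(chain, {"run_0", "run_1"}))
```

The cost of unrolling is a hard upper bound on the number of restarts: if every task in the chain fails, the job fails for good.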
automatic scenario
end-user point of view
the benefits
• user: a more robust and fault-tolerant Grid environment
• sysadmin: much easier system management thanks to the automatic checkpoint and recovery mechanism
Thank you!