Process Management. Meeting at Argonne, February 24-25, 2003.
Process Management Meeting at Argonne February 24-25, 2003
Schematic of Process Management Component in Context
[Figure: schematic diagram. Labels: SSS Components (NSM, SD, Sched, EM, QM, PM); MPD's; application processes; SSS XML; simple scripts using SSS XML; Brett's job submission language; mpdrun; XML file; mpiexec (MPI Standard args); interactive. The diagram is divided into an "Official" SSS side and a Prototype MPD-based implementation side.]
MPD Progress
• MPD-2
  • In Python
  • Distributed as part of MPICH-2 (implements PMI)
• Supports requirements of SSS
  • Separate executables for each process
  • Separate arguments for each process
  • Separate environment variables for each process
• Supports MPI Standard mpiexec as job start command
  • Includes some of the SSS requirements
New XML for Process Manager
<create-process-group pgid='job23' submitter='lusk' totalprocs='10' output='discard'>
  <process-spec range='1' exec='cpi_master' user='ell' cwd='/home/ell/rundir'
                path='/home/ell/progs' coprocess='tvdebuggersrv'>
    <arg idx='1' val='-loops' />
    <arg idx='2' val='1000' />
    <env name='TV_LICENSE' val='23416784' />
  </process-spec>
  <process-spec range='2-10' exec='cpi_slave' user='ell' cwd='/home/ell/rundir'
                path='/home/ell/progs' coprocess='tvdebuggersrv'>
    <env name='TV_LICENSE' val='23416784' />
  </process-spec>
  <host-spec idx='1' val='ccn%s:64-68' />
  <host-spec idx='2' val='ccn%s:70-74' />
</create-process-group>
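A request of this shape could be assembled programmatically rather than written by hand. The following is a minimal sketch using only Python's standard `xml.dom.minidom`; the helper name `build_create_request` and its argument layout are hypothetical and are not part of any SSS library. Only a subset of the attributes on the slide is set, and the values are copied from the slide.

```python
from xml.dom.minidom import Document


def build_create_request(pgid, submitter, totalprocs, specs, hosts):
    """Return a <create-process-group> element (hypothetical helper).

    specs: list of dicts with 'range', 'exec', and optional 'args' / 'env'
    hosts: list of host-spec strings, numbered from 1 in the order given
    """
    doc = Document()
    root = doc.createElement('create-process-group')
    root.setAttribute('pgid', pgid)
    root.setAttribute('submitter', submitter)
    root.setAttribute('totalprocs', str(totalprocs))

    for spec in specs:
        ps = doc.createElement('process-spec')
        ps.setAttribute('range', spec['range'])
        ps.setAttribute('exec', spec['exec'])
        # Arguments are numbered from 1, as on the slide.
        for idx, val in enumerate(spec.get('args', []), 1):
            arg = doc.createElement('arg')
            arg.setAttribute('idx', str(idx))
            arg.setAttribute('val', val)
            ps.appendChild(arg)
        for name, val in sorted(spec.get('env', {}).items()):
            env = doc.createElement('env')
            env.setAttribute('name', name)
            env.setAttribute('val', val)
            ps.appendChild(env)
        root.appendChild(ps)

    for idx, val in enumerate(hosts, 1):
        hs = doc.createElement('host-spec')
        hs.setAttribute('idx', str(idx))
        hs.setAttribute('val', val)
        root.appendChild(hs)
    return root


request = build_create_request(
    'job23', 'lusk', 10,
    [{'range': '1', 'exec': 'cpi_master', 'args': ['-loops', '1000'],
      'env': {'TV_LICENSE': '23416784'}},
     {'range': '2-10', 'exec': 'cpi_slave',
      'env': {'TV_LICENSE': '23416784'}}],
    ['ccn%s:64-68', 'ccn%s:70-74'])
print(request.toprettyxml())
```

Building the message through a DOM keeps every `process-spec` properly closed around its `arg` and `env` children, which is easy to get wrong when typing the XML directly.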
Querying the PM
The following example retrieves the pgids of process groups that were submitted by lusk or desai and, in lusk's case, returns only the process groups that have processes running on two specific hosts (ccn70 and ccn230). The restrictions apply to process groups; all the processes in a matching process group are always returned.
<get-process-groups>
  <process-group submitter='lusk' pgid='*' totalprocs='*'>
    <process-group-restriction pid='*' exec='*' host='ccn70' />
    <process-group-restriction pid='*' exec='*' host='ccn230' />
  </process-group>
  <process-group submitter='desai' pgid='*'>
  </process-group>
</get-process-groups>
Response to a Query
The message returned by such a query is a set of process groups, with details on their processes filled in as requested by the query.
<process-groups>
  <process-group submitter='lusk' pgid='4521' totalprocs='10'>
    <process pid='3456' exec='cpi_master' host='ccn64' />
    <process pid='1324' exec='cpi_slave' host='ccn65' />
    <process pid='7654' exec='cpi_slave' host='ccn66' />
    <process pid='6758' exec='cpi_slave' host='ccn67' />
    <process pid='9601' exec='cpi_slave' host='ccn68' />
    <process pid='7865' exec='cpi_slave' host='ccn70' />
    <process pid='9876' exec='cpi_slave' host='ccn71' />
    <process pid='6524' exec='cpi_slave' host='ccn72' />
    <process pid='3452' exec='cpi_slave' host='ccn73' />
    <process pid='5634' exec='cpi_slave' host='ccn74' />
  </process-group>
  <process-group submitter='lusk' pgid='23' totalprocs='1'>
    <process pid='5554' exec='mpd' host='ccn230' />
  </process-group>
  <process-group submitter='desai' pgid='244'>
  </process-group>
</process-groups>
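A client receiving such a response might reduce it to a simple table of hosts per process group. The sketch below does this with the standard `xml.dom.minidom`; the function name `hosts_by_group` is hypothetical, and the sample message is a shortened copy of the response on the slide, not real output.

```python
from xml.dom.minidom import parseString

# Shortened copy of the response on the slide (two of the ten processes kept).
RESPONSE = """
<process-groups>
  <process-group submitter='lusk' pgid='4521' totalprocs='10'>
    <process pid='3456' exec='cpi_master' host='ccn64' />
    <process pid='7865' exec='cpi_slave' host='ccn70' />
  </process-group>
  <process-group submitter='desai' pgid='244'>
  </process-group>
</process-groups>
"""


def hosts_by_group(xml_text):
    """Map each pgid in a <process-groups> message to its list of hosts."""
    result = {}
    dom = parseString(xml_text.strip())
    for pg in dom.getElementsByTagName('process-group'):
        hosts = [p.getAttribute('host')
                 for p in pg.getElementsByTagName('process')]
        result[pg.getAttribute('pgid')] = hosts
    return result


groups = hosts_by_group(RESPONSE)
for pgid in sorted(groups):
    print('%s: %s' % (pgid, ', '.join(groups[pgid])))
```

A process group with no `process` children, like desai's here, simply maps to an empty list.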
Using the Wildcard Syntax for More
The following command sends signal 3 to all the processes of all jobs submitted by lusk, and returns the details of which process groups they were.
<signal-process-group signal='3'>
  <process-group submitter='lusk' pgid='*' />
</signal-process-group>
The following command kills all process groups with processes running on ccn56, and returns their submitters, so that they can be told the sad news.
<kill-process-group>
  <process-group submitter='*'>
    <process host='ccn56' />
  </process-group>
</kill-process-group>
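The selection logic implied by these commands can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the slides do not specify the process manager's matching rules, so it assumes that '*' matches any value and that a host restriction selects a group when at least one of its processes runs on that host. The table `GROUPS` and the function names are hypothetical.

```python
# Hypothetical in-memory view of the process manager's state.
GROUPS = [
    {'pgid': '4521', 'submitter': 'lusk',
     'procs': [{'host': 'ccn64'}, {'host': 'ccn56'}]},
    {'pgid': '23', 'submitter': 'lusk',
     'procs': [{'host': 'ccn230'}]},
    {'pgid': '244', 'submitter': 'desai',
     'procs': [{'host': 'ccn56'}]},
]


def attr_matches(pattern, value):
    """Assumed wildcard rule: '*' matches anything, otherwise exact match."""
    return pattern == '*' or pattern == value


def select_groups(groups, submitter='*', host='*'):
    """Return the process groups selected by a submitter and host pattern."""
    selected = []
    for group in groups:
        if not attr_matches(submitter, group['submitter']):
            continue
        # Assumption: one matching process is enough to select the group.
        if host != '*' and not any(attr_matches(host, p['host'])
                                   for p in group['procs']):
            continue
        selected.append(group)
    return selected


# Groups a signal for submitter='lusk' would reach.
print([g['pgid'] for g in select_groups(GROUPS, submitter='lusk')])
# Submitters to notify after killing everything on ccn56.
print([g['submitter'] for g in select_groups(GROUPS, host='ccn56')])
```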
Starting to Have Fun
• The combination of
  • Published interfaces
  • XML technology
  • XML libraries built into scripting languages
  • The SSS communication library
  enables simple programs for simple tasks.
• Reminiscent of Unix pipe-based command-line programs
• We use Python with built-in SAX-based library
• Easy to connect other tools (CIT daemons)
Submitting a Job Directly to PM
#! /usr/bin/env python
# Build a create-process-group request and send it directly to the process manager.
from xml.dom.minidom import Document, parseString
from ssslib import comm_lib

executable = raw_input( 'executable? ' )
numprocs = raw_input( 'numprocs? ' )
pgid = raw_input( 'process group id? ' )

# One Document creates both elements, so they share the same owner.
doc = Document()
msg = doc.createElement( 'create-process-group' )
msg.setAttribute( 'totalprocs', numprocs )
msg.setAttribute( 'pgid', pgid )
ps = doc.createElement( 'process-spec' )
ps.setAttribute( 'exec', executable )
msg.appendChild( ps )
print msg.toprettyxml()

# Send the request and print the acknowledgement that comes back.
comm = comm_lib( debug=0 )
process_manager = comm.ClientInit( 'process-manager' )
comm.SendMessage( process_manager, msg.toxml() )
ack = comm.RecvMessage( process_manager )
comm.ClientClose( process_manager )
ack_dom = parseString( ack )
print ack_dom.toxml()
Registering for Notification of PM Events
#! /usr/bin/env python
# Register with the process manager and print each event as it arrives.
from sss import event_receiver
from xml.dom.minidom import Document

class printjobevent:
    def __init__(self):
        # Handlers keyed by message name.
        self.dispatch = { 'event' : self.HandleXMLEvent }

    def HandleXMLEvent( self, xe, ( peer, port ) ):
        print ' %s, jobid = %s, at %s' % ( xe.getAttribute( 'msg' ),
                                           xe.getAttribute( 'data' ),
                                           xe.getAttribute( 'time' ))
        # Reply with an <event-ok/> element.
        return Document().createElement('event-ok')

if __name__ == '__main__':
    job_monitor = printjobevent()
    loop = event_receiver( 'process-manager', '*', '*', 'many', job_monitor )
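The dispatch-table pattern in this script can be tried without the SSS libraries. The sketch below is a stand-in only: the `deliver` function is hypothetical and does not reproduce what `event_receiver` actually does, and the sample event's attribute values are invented for illustration. It parses one message, looks up a handler by the root element's name, and returns the handler's reply.

```python
from xml.dom.minidom import Document, parseString


class PrintJobEvent(object):
    """Handler object with a dispatch table, as in the script above."""

    def __init__(self):
        self.dispatch = {'event': self.handle_event}
        self.seen = []  # kept so the result can be inspected afterwards

    def handle_event(self, xe):
        line = '%s, jobid = %s, at %s' % (xe.getAttribute('msg'),
                                          xe.getAttribute('data'),
                                          xe.getAttribute('time'))
        self.seen.append(line)
        print(line)
        return Document().createElement('event-ok')


def deliver(handler, xml_text):
    """Hypothetical stand-in for the receiver loop: route one message."""
    root = parseString(xml_text).documentElement
    callback = handler.dispatch.get(root.tagName)
    if callback is None:
        return None  # no handler registered for this message name
    return callback(root)


monitor = PrintJobEvent()
# Sample event with invented attribute values.
reply = deliver(monitor, "<event msg='job-start' data='23' time='12:00' />")
ignored = deliver(monitor, "<unknown-message />")
```

Keeping the handlers in a dictionary means a new message type needs only a new entry and method, with no change to the routing code.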