Supporting MPI Applications on EGEE Grids
Zoltán Farkas, MTA SZTAKI
Budapest, 5 July 2006
Contents
• MPI
  • Standards
  • Implementations
• EGEE and MPI
  • History
  • Current status
  • Working/research groups in EGEE
  • Future work
• P-GRADE Grid Portal
  • Workflow execution, file handling
  • Direct job submission
  • Brokered job submission
MPI
• MPI stands for Message Passing Interface
• Standards 1.1 and 2.0
• MPI Standard features:
  • Collective communication (1.1+)
  • Point-to-point communication (1.1+)
  • Group management (1.1+)
  • Dynamic processes (2.0)
  • Programming language APIs
  • …
MPI Implementations
• MPICH
  • Freely available implementation of MPI
  • Runs on many architectures (even on Windows)
  • Implements Standard 1.1 (MPICH) and Standard 2.0 (MPICH2)
  • Supports Globus (MPICH-G2)
  • Nodes are allocated upon application execution
• LAM/MPI
  • Open-source implementation of MPI
  • Implements Standard 1.1 and parts of 2.0
  • Many interesting features (e.g. checkpointing)
  • Nodes are allocated before application execution
• Open MPI
  • Implements Standard 2.0
  • Builds on technologies of other MPI projects
MPICH execution on x86 clusters
• The application is started…
  • …using 'mpirun'
  • …specifying:
    • the number of requested nodes (-np <nodenumber>),
    • a file containing the nodes to be allocated (-machinefile <file>) [OPTIONAL],
    • the executable,
    • the executable's arguments.
• $ mpirun -np 7 ./cummu -N -M -p 32
• Processes are spawned using 'rsh' or 'ssh', depending on the configuration
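For illustration, a machinefile is simply a plain-text list of host names, one per line; the file name and hosts below are made up, and the spawning behaviour depends on the MPICH configuration.

    $ cat machines.txt          # hypothetical machinefile
    node01
    node02
    node03
    $ mpirun -np 7 -machinefile machines.txt ./cummu -N -M -p 32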
MPICH x86 execution: requirements
• The executable (and input files) must be present on the worker nodes:
  • via a shared filesystem, or
  • the user distributes the files before invoking 'mpirun'.
• The worker nodes must be accessible from the host running 'mpirun':
  • using 'rsh' or 'ssh'
  • without user interaction (host-based authentication)
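A minimal sketch of checking and satisfying these requirements by hand, assuming password-less ssh and a made-up host name; on a cluster with a shared home directory the manual copy step is unnecessary.

    # must succeed without a password prompt (host-based or key-based auth)
    ssh node01 /bin/true
    # if there is no shared filesystem, stage the files manually first
    ssh node01 "mkdir -p $PWD"
    scp ./cummu input.dat node01:"$PWD"/
    mpirun -np 2 ./cummu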
EGEE and MPI
• MPI became important at the end of 2005 / beginning of 2006:
  • Instructions about CE/jobmanager/WN configuration
  • The user has to start a wrapper script, because the input sandbox is not distributed to the worker nodes
  • A sample wrapper script is provided; it works for PBS and LSF and assumes 'ssh'
• Current status (according to experiments):
  • No need to use wrapper scripts
  • MPI jobs fail if there is no shared filesystem
  • Remote file handling is not supported, so the user has to take care of it
EGEE and MPI II.
• Research/working groups formed:
  • MPI TCG WG:
    • User requirements:
      • "Shared" filesystem: distribute the executable and input files
      • Storage Element handling
    • Site requirements:
      • The solution must be compatible with a large number of jobmanagers
      • Information system extensions (maximum number of concurrent CPUs used by a job, …)
  • MSc research group (1-month project, 2 students):
    • Created wrapper scripts for MPICH, LAM/MPI and Open MPI
    • The application source is compiled before execution
    • The executable and input files are distributed to the allocated worker nodes; 'ssh' is assumed
    • No remote file support
EGEE and MPI: Future work
• Add support for:
  • all possible jobmanagers
  • all possible MPI implementations
  • Storage Element handling for legacy applications
  • input sandbox distribution before application execution when there is no shared filesystem
  • output file collection after the application has finished when there is no shared filesystem
P-GRADE Grid Portal
• Workflow execution:
  • DAGMan is used as the workflow scheduler
  • pre and post scripts perform tasks around job execution
• Direct job execution using GT-2:
  • GridFTP, GRAM
  • pre: create a temporary storage directory, copy input files
  • job: Condor-G executes a wrapper script
  • post: download the results
• Job execution using the EGEE broker (both LCG and gLite):
  • pre: create the application context as the input sandbox
  • job: a Scheduler universe Condor job executes a script, which performs job submission, status polling and output downloading; a wrapper script is submitted to the broker
  • post: error checking
Portal: File handling
• "Local" files:
  • The user has access to these files through the Portal
  • Local input files are uploaded from the user's machine
  • Local output files are downloaded to the user's machine
• "Remote" files:
  • Files reside on EGEE Storage Elements or are accessible using GridFTP
  • EGEE SE files:
    • lfn:/…
    • guid:…
  • GridFTP files: gsiftp://…
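For illustration, remote references of these kinds are typically resolved on an EGEE UI or worker node with the LCG data-management and GridFTP client tools; the VO name, LFN, GUID and hosts below are made up, and exact option syntax varies between middleware releases.

    # Storage Element file by logical file name
    lcg-cp --vo myvo lfn:/grid/myvo/inputs/input.dat file://$PWD/input.dat
    # Storage Element file by GUID
    lcg-cp --vo myvo guid:91b9b930-0000-1111-2222-333344445555 file://$PWD/input2.dat
    # GridFTP file
    globus-url-copy gsiftp://gridftp.example.org/data/input.dat file://$PWD/input.dat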
Portal: Direct job execution
• The resource to be used is known before job execution
• The user must have a valid, accepted certificate
• Local files are supported
• Remote GridFTP files are supported, even for grid-unaware applications
• Jobs may be sequential or MPI applications
Direct exec: step-by-step I.
• Pre script:
  • creates a storage directory on the selected site's front-end node, using the 'fork' jobmanager
  • local input files are copied into this directory from the Portal machine using GridFTP
  • remote input files are copied using GridFTP (in case of errors, a two-phase copy through the Portal machine is tried)
• Condor-G job:
  • a wrapper script (wrapperp) is specified as the real executable
  • a single job is submitted to the requested jobmanager; for MPI jobs the 'hostcount' RSL attribute is used to specify the number of requested nodes
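A rough sketch of the pre-script steps using standard GT-2 client tools, not the Portal's actual code; the CE host, SE host, directory and file names are illustrative.

    #!/bin/bash
    CE=ce.example.org                    # selected site's front-end node
    WORKDIR=/tmp/pgrade_job_12345        # temporary storage directory

    # 1. create the working directory through the 'fork' jobmanager
    globus-job-run "$CE/jobmanager-fork" /bin/mkdir -p "$WORKDIR"

    # 2. copy local input files from the Portal machine with GridFTP
    globus-url-copy file://$PWD/input.dat gsiftp://$CE$WORKDIR/input.dat

    # 3. copy remote input files SE -> CE directly; on error, fall back to a
    #    two-phase copy through the Portal machine
    globus-url-copy gsiftp://se.example.org/data/big.dat gsiftp://$CE$WORKDIR/big.dat || {
        globus-url-copy gsiftp://se.example.org/data/big.dat file://$PWD/big.dat &&
        globus-url-copy file://$PWD/big.dat gsiftp://$CE$WORKDIR/big.dat
    }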
Direct exec: step-by-step II.
• LRMS:
  • allocates the requested number of nodes (if needed)
  • starts wrapperp on one of the allocated nodes (the master worker node)
• Wrapperp (running on the master worker node):
  • copies the executable and input files from the front-end node ('scp' or 'rcp')
  • in the case of PBS jobmanagers, the executable and input files are copied to the allocated nodes (listed in PBS_NODEFILE); for non-PBS jobmanagers a shared filesystem is required, as the host names of the allocated nodes cannot be determined
  • searches for 'mpirun'
  • starts the real executable using the 'mpirun' found
  • in the case of PBS jobmanagers, output files are copied from the allocated worker nodes back to the master worker node
  • output files are copied to the front-end node
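The following is a hedged reconstruction of wrapperp's main steps in shell, not the Portal's actual script; host names, paths, the process count and the output file pattern are all made up.

    #!/bin/bash
    FRONTEND=ce.example.org              # front-end node (illustrative)
    WORKDIR=/tmp/pgrade_job_12345        # directory created by the pre script
    EXE=cummu
    NP=7

    # pull the executable and input files from the front-end node
    scp "$FRONTEND:$WORKDIR/*" .

    # PBS lists the allocated hosts in $PBS_NODEFILE; push the files to them
    # (without PBS this list is unknown, hence the shared-filesystem requirement)
    if [ -n "$PBS_NODEFILE" ]; then
        for n in $(sort -u "$PBS_NODEFILE"); do
            ssh "$n" "mkdir -p $PWD" && scp "$EXE" input.dat "$n:$PWD/"
        done
    fi

    # locate mpirun and start the real executable (PBS case shown)
    MPIRUN=$(which mpirun)
    "$MPIRUN" -np "$NP" -machinefile "$PBS_NODEFILE" ./"$EXE"

    # gather outputs from the slave nodes, then push them to the front-end node
    if [ -n "$PBS_NODEFILE" ]; then
        for n in $(sort -u "$PBS_NODEFILE"); do
            scp "$n:$PWD/output*" . 2>/dev/null
        done
    fi
    scp output* "$FRONTEND:$WORKDIR/"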
Direct exec: step-by-step III.
• Post script:
  • local output files are copied from the temporary working directory created by the pre script to the Portal machine using GridFTP
  • remote output files are copied using GridFTP (in case of errors, a two-phase copy through the Portal machine is tried)
• DAGMan: schedules the next jobs…
Direct execution: animated
[Diagram: the Portal machine creates the temporary storage directory on the front-end node via the fork jobmanager and stages input files with GridFTP (pulling remote files from remote file storage); PBS starts wrapperp on the master worker node, which copies the executable and inputs to the slave worker nodes and launches the executable with mpirun; output files travel back along the same path.]
Direct Submission Summary
• Pros:
  • Users can add remote file support to legacy applications
  • Works for both sequential and MPI(CH) applications
  • For PBS jobmanagers there is no need for a shared filesystem (support for other jobmanagers can be added, depending on the information the jobmanagers provide)
  • Works with jobmanagers that do not support MPI
  • Faster than submitting through the broker
• Cons:
  • does not integrate into the EGEE middleware
  • the user needs to specify the execution resource
  • currently does not work with non-PBS jobmanagers without a shared filesystem
Portal: Brokered job submission
• The resource to be used is unknown before job execution
• The user must have a valid, accepted certificate
• Local files are supported
• Remote files residing on Storage Elements are supported, even for grid-unaware applications
• Jobs may be sequential or MPI applications
Broker exec: step-by-step I.
• Pre script:
  • creates the Scheduler universe Condor submit file
• Scheduler universe Condor job:
  • the job is a shell script
  • the script is responsible for:
    • job submission: a wrapper script (wrapperrb) is specified as the real executable in the JDL file
    • job status polling
    • job output downloading
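A minimal sketch of what such a submit/poll/download script might look like with the LCG-2 UI command-line tools (the gLite UI provides glite-wms-job-* equivalents); the JDL file name, polling interval and output directory are made up, and option details vary between releases.

    #!/bin/bash
    # submit the JDL whose Executable is wrapperrb and remember the job ID
    JOBID=$(edg-job-submit job.jdl | grep -o 'https://[^[:space:]]*')

    # poll the job status until it reaches a terminal state
    while true; do
        edg-job-status "$JOBID" | grep -qE 'Done|Aborted|Cancelled' && break
        sleep 300
    done

    # download the output sandbox
    edg-job-get-output --dir ./output "$JOBID"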
Broker exec: step-by-step II.
• Resource Broker:
  • handles the requests of the Scheduler universe Condor job
  • sends the job to a CE
  • watches its execution
  • reports errors
  • …
• LRMS on the CE:
  • allocates the requested number of nodes
  • starts wrapperrb on the master worker node using 'mpirun'
Broker exec: step-by-step III.
• Wrapperrb:
  • the script is started by 'mpirun', so it starts on every allocated worker node like an MPICH process
  • checks whether the remote input files are already present; if not, they are downloaded from the Storage Element
  • if the user specified any remote output files, those are removed from the storage
  • the real executable is started with the arguments passed to the script (these already contain the MPICH-specific ones)
  • after the executable has finished, remote output files are uploaded to the Storage Element (only in the gLite case)
• Post script:
  • nothing special…
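A hedged sketch of wrapperrb along the lines described above, not the Portal's actual script; the VO, LFNs and file names are made up, and the exact lcg-* options may differ between releases.

    #!/bin/bash
    # started by mpirun on every allocated worker node in place of the real
    # executable, so "$@" already carries the MPICH-specific arguments
    VO=myvo
    EXE=./cummu

    # download remote input files from the Storage Element if not present yet
    [ -f input.dat ] || \
        lcg-cp --vo "$VO" lfn:/grid/$VO/inputs/input.dat file://$PWD/input.dat

    # remove previously registered remote output files so they can be re-created
    lcg-del --vo "$VO" -a lfn:/grid/$VO/outputs/result.dat 2>/dev/null

    # run the real executable with the arguments passed to this script
    "$EXE" "$@"
    RET=$?

    # upload remote output files (done only in the gLite case in the Portal's setup)
    lcg-cr --vo "$VO" -l lfn:/grid/$VO/outputs/result.dat file://$PWD/result.dat
    exit $RET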
Broker execution: animated
[Diagram: the Portal machine submits the job to the Resource Broker, which forwards it through Globus to the front-end node of a CE; PBS starts wrapperrb via mpirun on the master and slave worker nodes, and each instance stages files from the Storage Element and runs the real executable.]
Broker Submission Summary
• Pros:
  • adds support for remote file handling for legacy applications
  • extends the functionality of the EGEE broker
  • one solution supports both sequential and MPI applications
• Cons:
  • slow application execution
  • status polling generates a high load with 500+ jobs
Experimental results
• Selected SEEGRID CEs were tested with a job requesting 3 nodes, using the broker from the command line and direct job submission from the P-GRADE Portal
Thank you for your attention!