Getting Started With The NPACI Grid & NPACKage
Shannon Whitmore
swhitmor@sdsc.edu
http://npacigrid.npaci.edu
http://npackage.npaci.edu
Overview • Introduction • Getting Started on the NPACI Grid • Tutorial
Defined • NPACI Grid • Hardware, software, network, and data resources at • San Diego Supercomputer Center • Texas Advanced Computing Center • University of Michigan, Ann Arbor • California Institute of Technology – coming soon • NPACKage • An integrated set of grid middleware and advanced NPACI applications
Grid Resources • Blue Horizon (SDSC) • IBM POWER3-based clustered SMP system • 1,152 processors, 576 GB main memory • Longhorn (TACC) • IBM POWER4 system • 224 processors, 512 GB aggregate memory • Hypnos & Morpheus (UMichigan) • AMD-based Linux clusters • Hypnos: 128 nodes; Morpheus: 50 nodes • Each SMP node: two CPUs & 1 GB memory per processor
Why use the NPACI Grid? • Simplifies job submission • Globus: common scripting language for job submission • Condor-G: launch and monitor jobs from one site • Combines local resources with supercomputer-center resources • Run small jobs locally, large jobs remotely • Enables portal development • Single point of access for tools • Simplifies complex interfaces for users
Why use the NPACI Grid? (cont’d) • Provides tools for distributed data management and analysis • SRB • DataCutter • Provides single sign-on capabilities
Caveats • Resources are intended for large jobs • Try not to run small jobs on the batch queues • Must plan in advance for large runs • Request machine allocations • Cannot run distributed jobs on batch resources concurrently
Why use NPACKage? • Easier to port applications • Components tested before release • Consulting support available • Consistent packaging • Simplified installation/configuration process • Single web site for all documentation • Install NPACKage on your system today!
Accessing The Grid
http://npacigrid.npaci.edu/user_getting_started.html
Accounts • Need an NPACI account? http://npacigrid.npaci.edu/expedited_account.html • Need an account extension? http://npacigrid.npaci.edu/account_extension_request.html • Username does not start with “ux”? consult@npaci.edu
Login Nodes • SDSC (Blue Horizon) • tf004i.sdsc.edu & tf005i.sdsc.edu (batch) • b80n01.sdsc.edu - b80n13.sdsc.edu (interactive) • TACC • longhorn.tacc.utexas.edu (batch & interactive) • Michigan • hypnos.engin.umich.edu (batch & interactive) • morpheus.engin.umich.edu (batch & interactive)
Set up your environment • Add the following to your shell initialization file on all NPACI Grid hosts • For csh-based shells:
if ( ! $?NPACI_GRID_CURRENT ) then
  alias . source
  setenv NPACI_GRID_CURRENT /usr/npaci-grid-1.1
  . $NPACI_GRID_CURRENT/setup.csh
endif
• For Bourne-based shells:
if [ "x$NPACI_GRID_CURRENT" = "x" ]; then
  export NPACI_GRID_CURRENT=/usr/npaci-grid-1.1
  . $NPACI_GRID_CURRENT/setup.sh
fi
Certificates • Required to use the NPACI Grid • Used for authentication and encryption • Enables single sign-on capabilities • On cert.npaci.edu, run /usr/local/apps/pki_apps/cacl • Generates an X.509 certificate • Creates your Distinguished Name (DN) – a globally unique ID identifying you as an individual
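Once cacl has finished, you can inspect the certificate and the DN it created with the standard Globus grid-cert-info client (a minimal sketch; run it on the host that holds your ~/.globus directory):
  grid-cert-info -subject      # print your Distinguished Name
  grid-cert-info -enddate      # check when the certificate expires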
Certificates (cont’d) • Copy your .globus directory to all sites • Script: http://npacigrid.npaci.edu/Examples/copycert.sh • Wait for your DN to propagate into each site's grid-mapfile • The grid-mapfile maps your DN to your local username on that host
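If you prefer to copy the directory by hand instead of running the script, plain scp works; a sketch (the username ux444444 and target hosts are illustrative):
  scp -r ~/.globus ux444444@longhorn.tacc.utexas.edu:
  scp -r ~/.globus ux444444@hypnos.engin.umich.edu: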
Verify Grid Access • Connect to any login node • ssh longhorn.tacc.utexas.edu -l <username> • grid-proxy-init • generates a proxy certificate • provides single sign-on capability • proxies are valid for one day • grid-proxy-destroy • Removes the proxy
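A first session typically looks like the sketch below (the passphrase prompt wording varies slightly between Globus versions):
  ssh longhorn.tacc.utexas.edu -l ux444444
  grid-proxy-init        # prompts for your certificate passphrase
  grid-proxy-info        # show the proxy subject and remaining lifetime
  grid-proxy-destroy     # remove the proxy when you are done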
Verify Grid Access (cont’d) • Authenticate your certificate at each site • globusrun -a -r hypnos.engin.umich.edu • globusrun -a -r morpheus.engin.umich.edu • globusrun -a -r longhorn.tacc.utexas.edu • globusrun -a -r tf004i.sdsc.edu • Problems? Contact us: • http://npacigrid.npaci.edu/contacts.html
Tutorial: Clients and Services
http://npacigrid.npaci.edu/tutorial.html
Overview • Running Jobs • Using Globus • Using Condor-G • Transferring Files • Resource and Monitoring Services • MDS/Ganglia • NWS
Gatekeeper • Handles globus job requests at a remote site • Manages authentication and security • Routes job requests to a jobmanager • Exists on all login nodes
Jobmanager • Manages job launching • Two jobmanagers on each gatekeeper host • Interactive • jobmanager-fork – default • Batch – interface to local schedulers • jobmanager-loadleveler - longhorn & horizon • jobmanager-pbs - hypnos & morpheus
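The gatekeeper and jobmanager are selected through the GRAM resource contact string, written <host>[:<port>]/<jobmanager>; a couple of illustrative examples (2119 is the conventional gatekeeper port):
  hypnos.engin.umich.edu/jobmanager-pbs
  longhorn.tacc.utexas.edu:2119/jobmanager-loadleveler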
Globus clients Three commands for remote job submission • globus-job-submit • globus-job-run • globusrun
globus-job-submit • Runs in background • Returns a contact string • Output from each job stored locally • $(HOME)/.globus/.gass_cache/….. • Example: • globus-job-submit morpheus.engin.umich.edu /bin/date
globus-job-submit (cont’d) Supporting commands • globus-job-status <contact-string> • globus-job-getoutput <contact-string> • globus-job-clean <contact-string>
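Putting these together, a complete cycle looks roughly like this (the contact string shown is illustrative; the real one is printed by globus-job-submit):
  globus-job-submit morpheus.engin.umich.edu /bin/date
  # prints a contact string such as https://morpheus.engin.umich.edu:39281/12345/1047500000/
  globus-job-status <contact-string>       # e.g. PENDING, ACTIVE, DONE, or FAILED
  globus-job-getoutput <contact-string>    # retrieve the stored output
  globus-job-clean <contact-string>        # remove the cached output when finished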
globus-job-run • Runs in foreground • Provides executable staging • Output delivered directly • Example • globus-job-run hypnos.engin.umich.edu/jobmanager-pbs /bin/hostname
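If the executable lives on your local machine rather than the remote host, executable staging copies it over for the run; a sketch, assuming your globus-job-run build supports the -stage option (check globus-job-run -help) and using a hypothetical local script:
  globus-job-run hypnos.engin.umich.edu/jobmanager-pbs -stage ./myjob.sh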
globusrun • Main command for submitting globus jobs • Uses the Resource Specification Language for specifying job options • Examples: • globusrun -f b80.rsl • globusrun -r hypnos.engin.umich.edu -f myjob.rsl
Sample RSL File
+
( &(resourceManagerContact="longhorn.tacc.utexas.edu/jobmanager-loadleveler")
   (max_wall_time=45)
   (queue=normal)
   (max_memory=10)
   (directory=/paci/sdsc/ux444444/JobOutput)
   (executable=/bin/date)
   (stdout=longhorn-output)
   (stderr=longhorn-error)
)
Required RSL Parameters
• LoadLeveler at Texas (longhorn): (queue=normal) (max_wall_time=45) (max_memory=10)
• LoadLeveler at SDSC
  • b80's: (queue=interactive) (max_wall_time=45) (environment=(MP_EUIDEVICE en0))
  • tf004i/tf005i: (queue=normal) (max_wall_time=45)
• PBS at Michigan
  • hypnos: (queue=route) (max_wall_time=45) (email_address=your@email)
  • morpheus: (queue=npaci) (max_wall_time=45) (email_address=your@email)
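For example, a minimal RSL for hypnos that combines the required PBS parameters with a simple job, mirroring the longhorn sample above (the email address, directory, and output names are illustrative):
+
( &(resourceManagerContact="hypnos.engin.umich.edu/jobmanager-pbs")
   (queue=route)
   (max_wall_time=45)
   (email_address=your@email)
   (directory=/paci/sdsc/ux444444)
   (executable=/bin/date)
   (stdout=hypnos-output)
   (stderr=hypnos-error)
)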
Condor-G • Provides job submission & monitoring at a single site: griddle.sdsc.edu • Handles file transfers & job I/O • Uses Globus to launch jobs • Provides a tool (DAGMan) for handling job dependencies
Condor Submit Description File
# path to executable on remote host
executable = /paci/sdsc/ux444444/hello.sh
# do not stage executable from local to remote host
transfer_executable = false
# host and jobmanager where job is to be submitted
globusscheduler = longhorn.tacc.utexas.edu/jobmanager-fork
# condor-g always uses the globus universe
universe = globus
# local files where output, error, and logs will be placed
output = hello.out
error = hello.error
log = hello.log
# submit the job
Queue
Condor-G Commands • condor_submit to launch a job • condor_submit <description_file> • condor_q to monitor jobs • condor_rm to remove jobs • condor_rm <id> • condor_rm -all
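A typical cycle on griddle.sdsc.edu might look like this (hello.condor is the description file from the previous slide saved locally; the cluster number is illustrative):
  condor_submit hello.condor     # reports something like "1 job(s) submitted to cluster 42"
  condor_q                       # watch the job progress through the queue
  condor_rm 42                   # remove the job by cluster number if needed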
DAGMan • Metascheduler for Condor-G jobs • Directed acyclic graph (DAG) represents jobs & dependencies • Nodes (vertices) are jobs in the graph • Edges (arcs) identify dependencies • Commands • condor_submit_dag <DAGMan Input File> • condor_q -dag
DAGMan Input File • Required • Jobs names and their corresponding Condor submit description files for each node in the DAG • Dependency description • Optional • Preprocessing and postprocessing before or after job submission • Number of times to retry if a node within the DAG fails
Example DAGMan Input File
Job A longhorn.condor
Job B morpheus.condor
Job C hypnos.condor
Job D horizon.condor
PARENT A CHILD B C
PARENT B C CHILD D
Retry C 3
Description Files
• longhorn.condor:
universe = globus
executable = /bin/hostname
transfer_executable = false
globusscheduler = longhorn.tacc.utexas.edu/jobmanager-fork
output = job.$(cluster).out
error = job.$(cluster).err
log = job.$(cluster).log
Queue
• For the hypnos, morpheus, and tf004i files, replace the globusscheduler value appropriately
File Transfer GridFTP • Defines a protocol • Provides GSI authentication, partial file and parallel transfers, etc • Programs • Server: gsiftp - extends FTP with GSI authentication • Client: globus-url-copy
globus-url-copy • globus-url-copy <fromURL> <toURL> • Accepted URLs • For local files: file:<full path> • For remote files: gsiftp://<hostname><path> • Also accepts http://, ftp://, https:// • Example:
globus-url-copy file:/paci/sdsc/ux444444/myfile \
  gsiftp://longhorn.tacc.utexas.edu/~/newfile
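Because both endpoints may be gsiftp URLs, globus-url-copy can also drive a third-party transfer directly between two GridFTP servers (a sketch; the file names are illustrative):
  globus-url-copy gsiftp://longhorn.tacc.utexas.edu/~/results.dat \
    gsiftp://hypnos.engin.umich.edu/~/results.dat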
gsiscp • Not GridFTP-based • Uses GSI authentication • Specify GSISSH server port for single sign-on capabilities • Example:
gsiscp -P 1022 setup* \
  ux444444@morpheus.engin.umich.edu:.
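The companion gsissh client gives an interactive GSI-authenticated login in the same way (a sketch, assuming the same GSISSH port 1022 as in the gsiscp example; ssh-style clients take a lowercase -p for the port):
  gsissh -p 1022 ux444444@morpheus.engin.umich.edu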
Resource & Discovery Services • Publishes system and application data • Components • Globus MDS – Monitoring and Discovery Services • Ganglia – for clusters • NWS – network monitoring • Useful for grid middleware • Resource discovery & selection • Useful for grid applications • Configuration and real-time adaptation
Graphical MDS Views • On the web: • https://hotpage.npaci.edu/ • https://hotpage.npaci.edu/cgi-bin/grid_view.cgi • http://npackage.cs.ucsb.edu/ldapbrowser/login.php • Download and run your own LDAP browser • e.g. http://www.iit.edu/~gawojar/ldap/ • NPACI Grid MDS Info • LDAP Host: giis.npaci.edu • Port: 2132 • Base DN: Mds-Vo-name=npaci,o=Grid
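Any LDAP client can also query the GIIS directly with those settings; a minimal sketch using the standard OpenLDAP ldapsearch client with an anonymous bind (the wildcard filter simply dumps the tree and can be narrowed as needed):
  ldapsearch -x -h giis.npaci.edu -p 2132 \
    -b "Mds-Vo-name=npaci,o=Grid" "(objectclass=*)"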
Future Work • New NPACKage components • GridPort (ready for next release) • Netsolve (in progress) • NPACI Alpha Project Integration • MCELL, Telesciences, Geosciences, Protein Folding are all in progress • Scalable Viz, PDB Data Resource, Computational Electromicroscopy coming soon
Grid Consulting • Services • Assist with troubleshooting • Evaluate your application for use on the grid • Assist with porting • We are actively looking for applications! Contact us: http://npacigrid.npaci.edu/contacts.html