NEEShub Simulation Capabilities
February 17, 2012 Webinar
http://nees.org/resources/4079
George E. Brown, Jr. Network for Earthquake Engineering Simulation
Gregory Rodgers, Ph.D., NEESComm IT, Purdue University, West Lafayette, IN
Post-webinar updates
Webinar Introduction
Audience
Simulation tool developers and NEES power users who:
• have very large simulations, or many simulations that exceed 30 minutes of run time
• need to script parameter sweeps
• run a structural analysis with a large suite of ground motions
Prerequisite
An understanding of command line interfaces such as Linux bash
Summary
• This webinar introduces advanced users to new NEEShub capabilities for simulation and batch processing. Power users often write a script to orchestrate a set of simulation runs that cover many different test cases. Batch processing services were recently added to NEEShub to make this easy and to provide access to large scratch space. After completing this webinar, a user will be able to write scripts that submit one or more jobs to multiple execution venues and so use the high performance computing resources available to NEES.
Agenda
HOUR 1
• How simulation fits into the NEES CyberInfrastructure
• Introduction to the Linux workspace tool on NEEShub
• Manual execution (command line) of applications
• Manual execution (command line) of the opensees simulator
• Use of the new batchsubmit command to run opensees
• Use of batchsubmit to run other applications
• The batchstatus command
• Demonstration of how HOME directory space is linked to scratch space
• Advanced batchsubmit options and scripting the execution of batchsubmit
HOUR 2 (advanced)
• How to build a bash command file, including editors available on Linux
• Simple parallel execution (the --ncpus argument to batchsubmit)
• Parallel opensees (how to modify sequential input to be parallel input)
• How to use batchsubmit for other venues
• Overview of the NEES High Performance Computing (HPC) execution venues: local hub execution, osg, hansen, steele, kraken, and ranger
• How the openseeslab user interface uses batchsubmit
• Advanced batchsubmit options review
• Scratch cleanup algorithm
NEES Cyber Infrastructure
[Diagram labels: NE Experiment Data; site/personal data; Scratch Space; Site Operations tools (Synchronees, PEN, custom WS tools, web browser); NEES Web Services Server; Personal Space; Group Space; B. NEEShub Web Server (hub tool sessions, project editor, resources, collaboration); C. Cloud / Simulation Environment (NSF XSEDE, Open Science Grid, Purdue Hansen); D. The NEES Project Warehouse; E. EOT; F. Spreadsheet DBs]
Introduction to the Linux workspace tool on NEEShub
• Start a workspace from this page: http://nees.org/resources/workspace and click “Launch”.
• You must be part of a special group. If you are not in this group, open a ticket stating that you need workspace access and provide justification, and we will add you to the group.
• The window can be resized and popped out of the browser.
• Multiple terminals can be opened in the same window.
• A workspace session is persistent. You can leave the browser and return to an existing workspace from the myneeshub page at any time.
• The session can also be shared with other users or administrators.
Execution of Applications from the command line
• Simple utilities
  date              Print the date
  env               List environment variables
  ls (and ls -l)    Show list of files (long list)
  cd                Change working directory
  pwd               Show working directory
  mkdir             Make a directory
  rm (rmdir)        Remove a file (directory)
  cat               Write contents of a file to the screen
  cp                Copy a file
  man <command>     Show help about a command (man pages)
  exit              Terminate your session
• Use the arrow keys to recall previous commands
Putting commands in a script file
• A list of commands can be put into a script
  • Avoid retyping
  • Loop through commands
• To make the script executable, use the command: chmod 755 <filename>
• 3 important scripting languages to consider
  bash      Linux commands, also csh
  Tcl/Tk    The language for opensees
  Python    Advanced high performance language
Manual Execution of OpenSees
• OpenSees Tcl prompt versus Linux command prompt
  • Start opensees with a Tcl prompt (no argument)
  • Start opensees to execute a file of Tcl commands (one argument)
• The binary OpenSees versus the wrapper shell script called opensees
  opensees <input TCL file>
  The binary is spelled OpenSees; opensees is a wrapper that sets up the environment correctly and then calls OpenSees.
High Volume Batch on NEEShub
• Consistent and asynchronous submission to multiple venues: local, osg, steele, hansen, kraken. The last three are part of the new XSEDE system that replaces TeraGrid.
• Asynchronous: the job is submitted and control returns to the submitter without waiting for the job to complete.
• Your run directories in $HOME/scratch will be symbolic links to a large (>30TB) shared space. Runs will be compressed or purged by a cleanup algorithm as needed.
• Only the submitting user will have access to the run directories.
batchsubmit
• The batchsubmit command is a wrapper around any command; it executes that command as an asynchronous batch job.
  batchsubmit <batchsubmitoptions> command <command options>
• batchsubmitoptions begin with a double dash.
• batchsubmit prints one line of output: the name of the newly created directory where BOTH the job input is located and the job output will be found.
• The help for batchsubmit gives an example of how to run opensees
  batchsubmit --h
  batchsubmit --h | more
  batchsubmit date
  batchsubmit opensees /apps/demo/sine/sine.tcl
  batchsubmit --appdir /apps/openseesbuild/osg OpenSees /apps/demo/sine/sine.tcl
  batchsubmit --jobtype sine --onlyinfile opensees /apps/demo/sine/sine.tcl
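The command layout above can also be driven from Python. The following is a minimal sketch, not part of the webinar: the helper names build_command and submit are hypothetical, and it assumes that the single line batchsubmit prints arrives on standard output.

```python
import subprocess


def build_command(app_command, batch_options=()):
    """Assemble: batchsubmit <batchsubmit options> command <command options>."""
    return ["batchsubmit", *batch_options, *app_command]


def submit(app_command, batch_options=()):
    """Submit a job and return the run directory named in batchsubmit's output.

    Assumption: the one line that batchsubmit prints is written to stdout.
    """
    result = subprocess.run(
        build_command(app_command, batch_options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Only print the command here, so the sketch can be tried off the hub.
    command = build_command(["opensees", "/apps/demo/sine/sine.tcl"], ["--onlyinfile"])
    print(" ".join(command))
```

On a NEEShub workspace, submit(["date"]) would then be expected to return the new scratch run directory for that job.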
Input Processing
• Default: The first argument after the application command is considered an input file. All files from the directory containing it are copied to the scratch run directory. Two other options:
  --onlyinfile    Only copy the input file
  --rcopyindir    Recursively copy all files and directories from the same directory as the input file
• Note: the input file may not sit directly in your home directory unless --onlyinfile is specified, because by default every file next to it would be copied. Create a directory for your opensees Tcl file; a separate directory for each simulation is recommended.
• What if you have an application where the first argument is NOT the input file (unlike opensees)?
  --infilearg     Indicates which argument is the input file
  --infile        Use this file as the input file when it is implied by the application command and therefore is not one of its arguments
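To see what each copy mode would stage before submitting, the three rules above can be mimicked locally. This is only a sketch of the described behavior, not batchsubmit's own code: the function name files_to_stage is hypothetical, and it assumes the default mode copies plain files only (no subdirectories).

```python
import os


def files_to_stage(input_file, mode="default"):
    """List (relative to the input directory) what would be copied to scratch.

    mode "default":    every plain file next to the input file (assumed non-recursive)
    mode "onlyinfile": just the input file
    mode "rcopyindir": every file in the input directory and below
    """
    input_dir = os.path.dirname(os.path.abspath(input_file))
    if mode == "onlyinfile":
        return [os.path.basename(input_file)]
    if mode == "default":
        return sorted(
            name for name in os.listdir(input_dir)
            if os.path.isfile(os.path.join(input_dir, name))
        )
    if mode == "rcopyindir":
        found = []
        for root, _dirs, names in os.walk(input_dir):
            for name in names:
                found.append(os.path.relpath(os.path.join(root, name), input_dir))
        return sorted(found)
    raise ValueError("unknown mode: " + mode)
```

Running it on a candidate input directory shows quickly whether large, unrelated files would be dragged into every run directory.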
batchsubmit files/dirs
• Job input exists in a new scratch directory as soon as the batchsubmit command completes. There is one scratch directory for each batchsubmit command (each job). The directory name follows this template:
  $HOME/scratch/<jobtype>/<jobname>/
• Job output exists when the job is completed.
• You will get an email when the job starts and when it completes, unless you specify --nonotify
• Review the various output files generated in a job run directory
  <jobname>.stdout     Standard output. What would be printed to the screen
  <jobname>.stderr     Standard error
  The run directory    Same directory name as the one where the input file was found. Note: your input file is in this directory.
  @STATUS              The job status
  joblog               Interesting info about the environment in which the job was run
  .log                 Statistics recorded about this job
  .born_on_date        Used for scratch cleanup
Job lifecycle
• The system uses the file @STATUS to store the job status.
• States:
  Presubmit    Only for remote venues.
  Submitted    Waiting to start. Only for remote venues.
  Started      Application is running. Remote venues will actually update this file.
  Completed    All results are returned.
  Deleted      Job has been removed from the shared scratch space, but your scratch directory still shows it.
  Saved        Job was moved to your HOME directory. The symbolic link to shared scratch space is gone. A saved job counts against your quota.
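A script that must wait for an asynchronous job can poll the @STATUS file. The sketch below is illustrative only: the slides do not show the file's layout, so it assumes the state name is the first word in the file, and the function names read_state and wait_until_completed are hypothetical.

```python
import os
import time

# State names as listed on the "Job lifecycle" slide.
STATES = ("Presubmit", "Submitted", "Started", "Completed", "Deleted", "Saved")


def read_state(run_dir):
    """Return the state found in <run_dir>/@STATUS, or None if it cannot be read.

    Assumption: the state name is the first word of the file.
    """
    try:
        with open(os.path.join(run_dir, "@STATUS")) as handle:
            words = handle.read().split()
    except OSError:
        return None
    if not words:
        return None
    first = words[0].capitalize()
    return first if first in STATES else None


def wait_until_completed(run_dir, poll_seconds=30.0, timeout_seconds=3600.0):
    """Poll until the job reports Completed; return False if the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if read_state(run_dir) == "Completed":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

For the local and osg venues the --wait option makes batchsubmit itself block until completion, so polling like this is mainly of interest for the other venues.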
batchstatus and batchcancel
• Other batchsubmit utilities
  batchstatus    Shows the status of each of your jobs.
  batchcancel    Cancels a job. This command is not released yet.
  batchsave      Removes a job from scratch space and saves it to your HOME directory space.
NEEShub Disk Space
• NEEShub data locations:
  HOME space       /home/neeshub/<youruserid>
  Group space      /data/groups/<groupname>
  Scratch space    $HOME/scratch
  Warehouse        /nees/home/<PROJ-DIR>
• Use synchronees to upload and download between your workstation and NEEShub spaces
• Advice: Use relative names for input and output files so your job can run on venues other than “local”
[Diagram labels: NE workstation data, webdav, Synchronees, HOME Space, Group Space, batchsubmit, Scratch Space]
Advanced batchsubmit options
  --wait      Only for venues local and osg. This option holds the completion of batchsubmit until the job is COMPLETED. Standard output and standard error are printed on the screen.
  --appdir    A directory containing the application, with bin and lib subdirectories. The app_command must be in the appdir/bin subdirectory. Both bin and lib are sent to the execution machine for every run, so be careful not to specify a large installed application directory. This option eliminates the need to install apps on venues other than local. See your application provider.
  --envars    List of environment variable names separated by commas. Only specify names here; the values must be set before calling the batchsubmit command, which allows values with special characters. For local execution, all environment variables are passed to the job.
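Because --envars carries names only, a script has to put the values into the environment before it calls batchsubmit. A minimal sketch of that split follows; the helper name envars_option and the variable names in the usage note are hypothetical.

```python
import os


def envars_option(variables):
    """Split {name: value} into the --envars argument and the calling environment.

    Returns (options, environment): options holds only the comma-separated
    names, while environment is a copy of os.environ with the values set.
    """
    for name in variables:
        if "," in name:
            raise ValueError("variable names must not contain commas: " + name)
    options = ["--envars", ",".join(variables)]
    environment = dict(os.environ)
    environment.update(variables)
    return options, environment
```

For example, envars_option({"GM_DIR": "GM", "SCALE": "0.5"}) yields the options ["--envars", "GM_DIR,SCALE"], and the returned environment can be passed as env= to subprocess.run when launching batchsubmit.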
HOUR 2 Agenda
• Simple parallel execution (the --ncpus argument to batchsubmit)
• Parallel opensees (how to modify sequential input to be parallel input)
• How to use batchsubmit for other venues
• Overview of the NEES High Performance Computing (HPC) execution venues:
  local     Use for testing small jobs less than 4 hours    ncpus<16
  osg       Use for many moderate size jobs                 ncpus=1
  hansen    Use for large parallel jobs                     ncpus<=48
  steele    Use for many parallel jobs                      ncpus<=8
  kraken and ranger (pending)
• Advanced batchsubmit options and scripting the execution of batchsubmit
• Building bash scripts to save typing
• Scratch cleanup algorithm
Simple Parallel Execution
  --ncpus <value>
• The above option causes your application command to execute <value> times in parallel. Example:
  batchsubmit --ncpus 4 date
• What good is it to run the same thing ncpus times? None, unless your application is aware that it is running in parallel.
• A parallel-aware application running on N processors does only 1/N of the work, knowing that the other processors will do the other parts.
• It is not hard to make your application parallel aware, especially with a scripting language like Tcl.
Simple Parallel Execution
Example: Run the same model through 27 ground motions. We want to divide the ground motions among 8 processors, PID = 0..7
  P0  P1  P2  P3  P4  P5  P6  P7
   0   1   2   3   4   5   6   7
   8   9  10  11  12  13  14  15
  16  17  18  19  20  21  22  23
  24  25  26
/* If PID is the processor number, then this can be run on all 8 processors */
For count = 0 to 26
    if (count % 8) == PID then    /* % gets the remainder from division */
        Do analysis for ground motion #count
    else
        skip
end
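The same round-robin rule can be checked in a few lines of Python. This is only an illustration of the count % N test; the function name motions_for is hypothetical.

```python
def motions_for(pid, num_procs, num_motions):
    """Ground-motion indices that processor `pid` analyses under count % num_procs == pid."""
    return [count for count in range(num_motions) if count % num_procs == pid]


if __name__ == "__main__":
    # Reproduce the table: 27 ground motions spread over 8 processors.
    for pid in range(8):
        print("P%d: %s" % (pid, motions_for(pid, 8, 27)))
```

Every index from 0 to 26 lands on exactly one processor, and no processor receives more than four ground motions.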
# OpenSeesMP only: processor number and processor count
set pid [getPID]
set numP [getNP]
set count 0;
source ReadRecord.tcl
set g 384.4
foreach scaleFactor {0.25 0.5 0.75 1.0} {
    foreach gMotion [glob -nocomplain -directory GM *.AT2] {
        # OpenSeesMP only: each processor takes every numP-th case
        if {[expr $count % $numP] == $pid} {
            source model.tcl
            source analysis.tcl
            set ok [doGravity]
            loadConst -time 0.0
            if {$ok == 0} {
                set gMotionName [string range $gMotion 0 end-4 ]
                ReadRecord ./$gMotionName.AT2 ./$gMotionName$scaleFactor.dat dT nPts
                timeSeries Path 1 -filePath $gMotionName$scaleFactor.dat -dt $dT -factor [expr $g*$scaleFactor]
                if {$nPts != 0} {
                    recorder EnvelopeNode -file $gMotionName$scaleFactor.out -node 3 4 -dof 1 2 3 disp
                    # Assumes doDynamic (from analysis.tcl) returns 0 on success, like doGravity
                    set ok [doDynamic $dT $nPts]
                    file delete $gMotionName$scaleFactor.dat
                    if {$ok == 0} {
                        puts "$gMotionName with factor: $scaleFactor OK"
                    } else {
                        puts "$gMotionName with factor: $scaleFactor FAILED"
                    }
                } else {
                    puts "$gMotion - NO RECORD"
                }
            }
            wipe
        }
        incr count 1;
    }
}
The code highlighted in yellow on the original slide (marked "OpenSeesMP only" here) is possible in OpenSeesMP. You can remove it and run in OpenSees, but the run will take much longer. The value of numP will be the --ncpus value provided to batchsubmit.
How to use batchsubmit for other execution venues
  --venue hansen | steele | osg
  Future values will include kraken and ranger.
  Note: The batchsubmit options --nn and --ppn are not yet functional. In the future, they will allow extremely large values of --ncpus; --ncpus will be the product of --nn and --ppn.
  --mpiargs    Specifies additional arguments to mpirun. Wrap these arguments in single quotes. Typically no additional MPI arguments are needed.
Sample parallel jobs
To save typing, I created the following scripts
  /apps/demo/bin/ex1
  /apps/demo/bin/ex2
The above just print the batchsubmit examples but do not run them. The following scripts print and run the commands
  /apps/demo/bin/ex1.sh
  /apps/demo/bin/ex2.sh
Let's take time to study these examples.
Venue guidelines
  Venue     Guidance                                        --ncpus
  ------    --------------------------------------------    -----------
  local     Use for testing small jobs less than 4 hours    --ncpus<16
  osg       Use for many moderate size jobs                 --ncpus=1
  hansen    Use for large parallel jobs                     --ncpus<=48
  steele    Use for many parallel jobs                      --ncpus<=8
• Future venues to include kraken and ranger.
• XSEDE (formerly TeraGrid) venues are steele, kraken, and ranger. XSEDE and hansen use PBS for job submission. PBS job submission is automated by batchsubmit.
• This batchsubmit option can change the PBS queue:
  --xdqueue    The default queue for steele is "standby". The default queue for hansen is "nees".
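A submission script can check a requested --ncpus value against this table before calling batchsubmit. The sketch below simply encodes the four rows; the names VENUE_RULES and check_ncpus are hypothetical, and the limits would need updating if the guidance changes.

```python
# One rule per row of the venue guidelines table.
VENUE_RULES = {
    "local": lambda n: 1 <= n < 16,
    "osg": lambda n: n == 1,
    "hansen": lambda n: 1 <= n <= 48,
    "steele": lambda n: 1 <= n <= 8,
}


def check_ncpus(venue, ncpus):
    """Return True if `ncpus` fits the guidance for `venue`."""
    if venue not in VENUE_RULES:
        raise ValueError("no guidance recorded for venue: " + venue)
    return VENUE_RULES[venue](ncpus)
```

For example, check_ncpus("steele", 8) is accepted while check_ncpus("osg", 2) is rejected.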
Advanced batchsubmit options
  --jnpref     Job name prefix; the default is "job". Blanks are not allowed. The environment variable JNPREF will also override the default. Try export JNPREF="run_" before batchsubmit.
  --jobname    Specify the jobname, overriding the auto-incremented generated jobname. Using this is not recommended, to avoid jobname collisions. However, if a collision occurs with an existing scratch directory, batchsubmit will create a new directory.
  --xdqueue    Queue for XSEDE machines (steele or hansen). The default queue for steele is "standby". The default queue for hansen is "nees".
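Putting the options together, a parameter sweep reduces to generating one batchsubmit command per input file. The sketch below only builds the command lists; the function name sweep_commands and the input paths in the usage note are hypothetical, and it uses only options named in these slides (--venue, --ncpus, --jnpref).

```python
def sweep_commands(input_files, venue=None, ncpus=None, prefix=None):
    """Build one batchsubmit argv list per OpenSees input file."""
    options = []
    if venue is not None:
        options += ["--venue", venue]
    if ncpus is not None:
        options += ["--ncpus", str(ncpus)]
    if prefix is not None:
        if " " in prefix:
            raise ValueError("blanks are not allowed in the job name prefix")
        options += ["--jnpref", prefix]
    return [["batchsubmit", *options, "opensees", path] for path in input_files]
```

For instance, sweep_commands(["runs/gm01/model.tcl", "runs/gm02/model.tcl"], venue="osg", ncpus=1, prefix="run_") gives two commands, one per simulation directory, which follows the earlier advice to keep each simulation's input in its own directory.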
Building bash scripts
• Commands can be stored in a file, and these files can be executed
• A file can be “sourced” or executed
• Recommendation: store your personal scripts in $HOME/bin
• Text editors available on NEEShub
  gedit
  nano
  vi
Scratch Cleanup Algorithm
IF used < 75% THEN EXIT, report no activity required
Delete all jobs > 1yr old, log action
FOR ACTION = compress, archive   (phase 1, phase 2)
| FOR T = 6m, 5m, 4m, 3m, 2m, 4w, 3w   (pass 1, pass 2, …)
| | FOR X = 5, 10, 20, 40, ALL
| | | Calculate set of top X users of scratch space
| | | FOR SIZE = 128G, 32G, 8G, 2G, 512M, 128M, 32M
| | | | FOREACH rundirectory
| | | | | IF rundirectory size > SIZE AND
| | | | |    rundirectory is owned by X AND
| | | | |    rundirectory lifetime > T
| | | | | THEN ACTION rundirectory, log action
| | | | IF used < 50% THEN
| | | |    EXIT, report SIZE, X, T, A thresholds
IF used > 50% THEN report policy failure and revise policy
Values in red are policy parameters that can be revised by management as needed
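The nested loops are easier to follow when run against a toy set of run directories. The following is a simulation sketch, not the production cleanup code: the RunDir class, a 30-day month, the 50% compression ratio, and treating the second phase as deletion are all assumptions made for illustration.

```python
from dataclasses import dataclass

AGES_DAYS = [180, 150, 120, 90, 60, 28, 21]            # 6m..2m, 4w, 3w (month = 30 days)
TOP_X = [5, 10, 20, 40, None]                          # None stands for ALL users
SIZES_MB = [131072, 32768, 8192, 2048, 512, 128, 32]   # 128G down to 32M


@dataclass
class RunDir:
    owner: str
    size_mb: float
    age_days: int
    compressed: bool = False
    deleted: bool = False


def used_fraction(jobs, capacity_mb):
    return sum(j.size_mb for j in jobs if not j.deleted) / capacity_mb


def top_users(jobs, x):
    """Owners of the x largest scratch footprints (all owners when x is None)."""
    usage = {}
    for j in jobs:
        if not j.deleted:
            usage[j.owner] = usage.get(j.owner, 0) + j.size_mb
    ranked = sorted(usage, key=usage.get, reverse=True)
    return set(ranked if x is None else ranked[:x])


def apply_action(action, job, compress_ratio):
    if action == "compress":
        if not job.compressed:          # assumed: a run is compressed only once
            job.size_mb *= compress_ratio
            job.compressed = True
    else:
        job.deleted = True


def cleanup(jobs, capacity_mb, compress_ratio=0.5):
    """Return (status, thresholds); thresholds is (action, age, top_x, size) on exit."""
    if used_fraction(jobs, capacity_mb) < 0.75:
        return ("no activity", None)
    for job in jobs:
        if job.age_days > 365:
            job.deleted = True
    for action in ("compress", "delete"):               # phase 1, phase 2
        for age in AGES_DAYS:                           # pass 1, pass 2, ...
            for x in TOP_X:
                owners = top_users(jobs, x)
                for size in SIZES_MB:
                    for job in jobs:
                        if (not job.deleted and job.size_mb > size
                                and job.owner in owners and job.age_days > age):
                            apply_action(action, job, compress_ratio)
                    if used_fraction(jobs, capacity_mb) < 0.50:
                        return ("done", (action, age, x, size))
    return ("policy failure", None)
```

With a 100,000 MB scratch holding a 50,000 MB run that is 200 days old and a 30,000 MB run that is 10 days old, the sketch compresses the old run first, then deletes it in phase 2, and never touches the 10-day-old run.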
GC Algorithm Lemmas
• No jobs < 3 weeks old will ever be deleted or compressed without a policy change.
• Very small jobs (< 32MB compressed) will never be deleted by the system.
  • Worst case: 500,000 32MB jobs (16TB) would consume 50% of a 32TB scratch.
• Inner loop: process largest to smallest jobs for a fixed set of users and older than a specific age
• Middle loop: process sets of large users with jobs older than a specific age
• Outer loop
  • Pass 1: process jobs > 6 months
  • Pass 2: process jobs > 5 months …
• No jobs are deleted until all jobs > 3 weeks old and > 32MB have been compressed. Compression is phase 1, deletion is phase 2.
• Example report stream:
  • Day 1: No activity, 74% used
  • Day 2: >2GB, Top 10 users, >3 months old, compressed, 50% used
  • Day 3: >32MB, All users, >3 weeks old, compressed, 50% used (closest call to deletion)
  • Day 4: >32GB, Top 5 users, >2 months old, deleted, 45% used
  • Day 5: No activity, 65% used
  • Day X-1: >2GB, All users, >3 weeks old, deleted, 50% used (close to policy failure)
  • Day X: >32MB, All users, >3 weeks old, deleted, 60% used, POLICY FAILURE. Policy parameters need adjustment.
Topics not covered in this webinar
• Use of batchsubmit to build a user interface
• Use of Pegasus for workflow management
  • This is in development and test. A single Pegasus job can submit many jobs that have inter-job dependencies.
• Creation of an appdir for portable applications. The only functional appdir today is /apps/openseesbuild/osg
• Modification of OpenSees source to create a personal copy of OpenSees with custom materials and models
  • Process in development with Prof. Elwood’s graduate student.