Never Lose a SAS Job
Not Again!!
• Unexpected reboot or system failure
• Long-running job didn’t complete
• Must manually restart the job from step 1
It can drive you crazy!!!
SAS Grid Gets the Stars Aligned...
SAS checkpoint-restart features
+ LSF requeue capabilities
+ SASGSUB batch submission utility
---------------------------------------------------
Completion of SAS Jobs in Minimal Time
Ideal for critical long-running SAS jobs
SAS Checkpoint/Restart
• Checkpoint mode
  • Records information about DATA and PROC steps in the checkpoint library
• Restart mode
  • Global statements and macros are re-executed
  • SAS reads the data in the checkpoint library to determine which steps completed
  • Program execution resumes with the step that was executing when the failure occurred
  • DATA and PROC steps that completed successfully are not re-executed
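The checkpoint and restart behavior described above can be illustrated with a small Python sketch. This is only a model of the idea, not how SAS implements it: the function name `run_with_checkpoints`, the JSON checkpoint file, and the step names are all hypothetical stand-ins for the checkpoint library and for DATA/PROC steps.

```python
import json
import os


def run_with_checkpoints(steps, checkpoint_path):
    """Run (name, callable) steps in order, skipping recorded ones.

    Hypothetical model: a JSON file stands in for the checkpoint library.
    Returns the names of the steps executed during this invocation.
    """
    # Restart mode: read the checkpoint record if one exists.
    completed = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            completed = json.load(fh)

    executed = []
    for name, func in steps:
        if name in completed:
            continue  # completed in an earlier run, so not re-executed
        func()  # an exception here models a failure in mid-step
        executed.append(name)
        completed.append(name)
        # Checkpoint mode: record the step only after it succeeds.
        with open(checkpoint_path, "w") as fh:
            json.dump(completed, fh)
    return executed
```

Calling the function a second time with the same checkpoint file after a failure resumes at the step that failed, which mirrors the restart behavior on the slide.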
To Set Up for Checkpoint-Restart
• Specify the following options on the batch SAS invocation:
  • STEPCHKPT – enables checkpoint mode
  • STEPRESTART – causes SAS to use checkpoint-restart data
  • NOWORKINIT – does not initialize the WORK library when SAS starts
  • NOWORKTERM – saves the WORK library when SAS exits
  • ERRORCHECK STRICT – puts SAS in syntax check mode when an error occurs in LIBNAME, FILENAME, %INCLUDE, and LOCK statements
  • ERRORABEND – causes SAS to terminate for most errors
The WORK Directory
• WORK is the default location for the checkpoint library
• Can use STEPCHKPTLIB to point to a permanent library
  • Must include the LIBNAME statement for that library as the first statement in the batch program
• WORK directory must be on shared storage
• Example:
  sas92 -noworkinit -noworkterm -work abc
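As a rough illustration of how the options from the two slides above fit on one batch command line, the Python sketch below assembles an argument list. The helper name `build_sas_command`, the executable name `sas`, the use of `-sysin` to name the program, and the example paths are assumptions for illustration; adjust them to the local installation.

```python
def build_sas_command(program, work_dir, sas_exe="sas"):
    """Return an argv list for a checkpoint-restart batch run.

    Hypothetical helper: sas_exe, program and work_dir are placeholders.
    """
    return [
        sas_exe,
        "-sysin", program,        # batch program to run (assumed option)
        "-stepchkpt",             # enable checkpoint mode
        "-steprestart",           # use checkpoint data if it exists
        "-noworkinit",            # do not initialize WORK at startup
        "-noworkterm",            # keep WORK when SAS exits
        "-errorcheck", "strict",  # syntax check mode on LIBNAME etc. errors
        "-errorabend",            # terminate on most errors
        "-work", work_dir,        # WORK must be on shared storage
    ]


if __name__ == "__main__":
    # Placeholder paths, shown only to display the resulting command line.
    print(" ".join(build_sas_command("/shared/jobs/nightly.sas", "/shared/saswork/nightly")))
```

Building the list in one place keeps the six options together, so a job cannot accidentally be submitted with only some of them.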
Use of Both STEPCHKPT and STEPRESTART
• Initial invocation
  • Results in checkpoint mode only
  • No data in the checkpoint library
• Subsequent invocations
  • Use data from the checkpoint library
  • Continue checkpoint mode for the remainder of the program
SAS Grid Manager – Queues
[Diagram: SAS Application, SAS Grid Manager, Normal Queue, HOST A, HOST B, HOST C]
Automatic Job Requeue
• Configure the queue to automatically requeue jobs that end with specific exit values
  • REQUEUE_EXIT_VALUES=all ~0 ~1
    Jobs ending with any exit code other than 0 or 1 (success and warnings) are requeued
  • REQUEUE_EXIT_VALUES=EXCLUDE(all ~0 ~1)
    Runs the requeued job on a different host
• Jobs are requeued 5 times by default
• MAX_JOB_REQUEUE sets the requeue limit; it can be specified globally for all queues or per queue
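To make the exit-value rule concrete, here is a minimal Python sketch of the decision, under simplifying assumptions. It handles only the `all`, plain-number, and `~N` forms shown above (not `EXCLUDE(...)`), it is not LSF code, and the function names are hypothetical. The default limit of 5 simply reuses the figure stated on the slide.

```python
def parse_requeue_values(spec):
    """Parse a simplified REQUEUE_EXIT_VALUES string.

    Supports only 'all', plain codes such as '99', and '~N' exclusions.
    """
    match_all = False
    included, excluded = set(), set()
    for token in spec.split():
        if token.lower() == "all":
            match_all = True
        elif token.startswith("~"):
            excluded.add(int(token[1:]))
        else:
            included.add(int(token))
    return match_all, included, excluded


def should_requeue(exit_code, spec, requeue_count, max_requeue=5):
    """Decide whether a finished job goes back on the queue."""
    match_all, included, excluded = parse_requeue_values(spec)
    if requeue_count >= max_requeue:
        return False  # requeue limit reached
    if exit_code in excluded:
        return False  # e.g. 0 (success) or 1 (warnings)
    return match_all or exit_code in included
```

With the spec `all ~0 ~1`, an exit code of 2 is requeued while 0 and 1 are not, and nothing is requeued once the count reaches the limit.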
Automatic Job Rerun
• A job is automatically rerun when
  • The execution host becomes unavailable while the job is running
  • The system fails while the job is running
• RERUNNABLE=yes
LSF Queue Definition
• Jobs dispatched from this queue will be rerun after system failures
• Jobs with a fatal exit code will be requeued

Begin Queue
QUEUE_NAME = sas_rerun
PRIORITY = 40
NICE = 10
RERUNNABLE = YES
REQUEUE_EXIT_VALUES = all ~0 ~1
DESCRIPTION = Jobs submitted to this queue will be requeued automatically and also rerunnable.
End Queue
SASGSUB Capabilities
Standalone utility that allows a user to
• Submit a SAS program to the grid for processing
• Display the status of the user’s jobs on the grid
• Retrieve output from the user’s jobs to a local directory
• Kill jobs
Using SASGSUB
Advantages
• Submit and forget
• View job output while the job is running
• Eliminate the need for a full SAS install on the client
• Make use of the SAS checkpoint/restart capability
NOTE: requires a shared file system between the client and the grid
Submitting a Job
• Command line interface
  sasgsub -gridsubmitpgm <sas_pgm>
• Example output
  Job ID: 6772
  Job directory: "/CNT/sasgsub/gridwork/sascnn1/SASGSUB-2009-03-17_14.09.52.847_testPgm"
  Job log file: "/CNT/sasgsub/gridwork/sascnn1/SASGSUB-2009-03-17_14.09.52.847_testPgm/testPgm.log"
Submitting a Job for Checkpoint-Restart
• GRIDRESTARTOK
  • Automatically adds the following options to the batch SAS invocation:
    STEPCHKPT, STEPRESTART, ERRORCHECK STRICT, ERRORABEND, NOWORKINIT, NOWORKTERM
  • Sets the RERUNNABLE parameter on the job
• Command line interface
  sasgsub -gridsubmitpgm <sas_pgm> -gridrestartok
Getting Job Status
• Command line interface
  sasgsub -gridgetstatus <job_id | _ALL_>
• Example output
  Current Job Information
  Job 1917 (testPgm) is Finished: Submitted: 08Dec2008:10:28:57, Started: 08Dec2008:10:28:57 on Host d15003, Ended: 08Dec2008:10:28:57
  Job 1918 (testPgm) is Finished: Submitted: 08Dec2008:10:28:57, Started: 08Dec2008:10:28:57 on Host d15003, Ended: 08Dec2008:10:28:57
  Job 1925 (testPgm) is Submitted: Submitted: 08Dec2008:10:28:57
Retrieving Results
• Command line interface
  sasgsub -gridgetresults <job_id | _ALL_>
• Example output
  Current Job Information
  Job 1917 (testPgm) is Finished: Submitted: 08Dec2008:10:53:33, Started: 08Dec2008:10:53:33 on Host d15003, Ended: 08Dec2008:10:53:33
    Moved job information to .\SASGSUB-2008-11-21_21.52.57.130_testPgm
  Job 1918 (testPgm) is Finished: Submitted: 08Dec2008:10:53:33, Started: 08Dec2008:10:53:33 on Host d15003, Ended: 08Dec2008:10:53:33
    Moved job information to .\SASGSUB-2008-11-24_13.13.39.167_testPgm
  Job 1925 (testPgm) is Submitted: Submitted: 08Dec2008:10:53:34
Putting It All Together
[Diagram: SAS Application, SAS Grid Manager, normal queue, sas_rerun queue, HOST A, HOST B, HOST C]
A simple solution
• Record a checkpoint number, save it in WORK
• If restarting, skip PROC / DATA steps to there
• Tokenize everything
• Execute all global statements
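The four bullets above can be sketched in Python as a toy restart planner. This is a simplified illustration, not the SAS implementation: `plan_restart` is a hypothetical name, the "tokenizer" just splits on semicolons (so it ignores quoted semicolons, comments, macros, and in-stream data), and it assumes every DATA or PROC step ends with an explicit RUN or QUIT.

```python
def plan_restart(program_text, checkpoint):
    """Return (kind, text) units to execute when restarting.

    checkpoint is the number of DATA/PROC steps already completed.
    Global statements are always kept; completed steps are skipped.
    """
    # Naive tokenizing: one statement per semicolon-terminated chunk.
    statements = [s.strip() for s in program_text.split(";") if s.strip()]
    plan, current, step_number = [], [], 0
    for stmt in statements:
        first_word = stmt.split()[0].lower()
        if current:
            # Inside a step: collect statements until RUN or QUIT.
            current.append(stmt)
            if first_word in ("run", "quit"):
                step_number += 1
                if step_number > checkpoint:
                    plan.append(("step", "; ".join(current) + ";"))
                current = []
        elif first_word in ("data", "proc"):
            current = [stmt]  # a new step begins
        else:
            plan.append(("global", stmt + ";"))  # always re-executed
    if current:
        # Trailing step without RUN/QUIT still counts as a step.
        step_number += 1
        if step_number > checkpoint:
            plan.append(("step", "; ".join(current) + ";"))
    return plan
```

Given a checkpoint number of 2, the plan keeps every global statement (LIBNAME, OPTIONS, TITLE and so on) and only the steps from the third one onward, which is the skip-ahead behavior the slide describes.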