SAM Job Submission. Rod Walker, 10th May, GridPP, Manchester. What is SAM? sam submit …… Data Management. Details. Conclusions. What is SAM? SAM is Sequential data Access via Meta-data. Project started in 1997 to handle D0’s needs for the Run II data system.
SAM Job Submission
Rod Walker, 10th May, GridPP, Manchester.
• What is SAM?
• sam submit ……
• Data Management
• Details
• Conclusions
What is SAM?
• SAM is Sequential data Access via Meta-data.
• Project started in 1997 to handle D0’s needs for the Run II data system.
• Current SAM team includes: Andrew Baranovski, Lauri Loebel-Carpenter, Gabriele Garzoglio, Chris Jozwiak, Lee Lueking*, Carmenita Moore, Igor Terekhov, Julie Trumbo, Sinisa Veseli, Matthew Vranicar, Stephen P. White, Victoria White*. (*project leaders)
• http://d0db.fnal.gov/sam
SAM is a Distributed System
[Diagram: arrows indicate control and data flow]
• Shared globally: Name Server, Database Server(s) (Central Database), Global Resource Manager(s), Log server
• Local: Station 1 Servers, Station 2 Servers, Station 3 Servers, … Station n Servers
• Shared locally: Mass Storage System(s)
Job Submission
• Executable
  • Runtime environment
  • Executable & associated files (user specific).
  • Experiment environment.
• Data
  • Dataset definition
  • Select by metadata.
  • Converted to LFNs at submit time, i.e. the file list of a dataset can change between submissions.
  • Build SQL query … then … execute query.
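The "build SQL query, then execute query" step can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the table name `files`, the metadata columns, and the use of an in-memory sqlite3 database in place of SAM's Oracle database. It only shows why a dataset definition resolved at submit time can yield a different LFN list on a later submission.

```python
import sqlite3

# Hypothetical metadata columns; the real SAM schema is not given in the slides.
ALLOWED_COLUMNS = {"data_tier", "run_number", "version"}

def build_query(constraints):
    """Turn {column: value} metadata constraints into a parameterised SELECT."""
    unknown = set(constraints) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unknown metadata fields: {sorted(unknown)}")
    columns = sorted(constraints)
    where = " AND ".join(f"{c} = ?" for c in columns)
    sql = "SELECT lfn FROM files"
    if where:
        sql += f" WHERE {where}"
    sql += " ORDER BY lfn"
    return sql, [constraints[c] for c in columns]

def resolve_dataset(conn, constraints):
    """Execute the query now, so the LFN list reflects the catalogue at submit time."""
    sql, params = build_query(constraints)
    return [row[0] for row in conn.execute(sql, params)]

def demo_catalogue():
    """Small in-memory stand-in for the metadata database (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files (lfn TEXT, data_tier TEXT, run_number INTEGER, version TEXT)"
    )
    conn.executemany(
        "INSERT INTO files VALUES (?, ?, ?, ?)",
        [
            ("mydata1.dat", "reco", 1, "v1"),
            ("mydata2.dat", "reco", 2, "v1"),
            ("mydata3.dat", "raw", 2, "v1"),
        ],
    )
    return conn
```

Calling `resolve_dataset(conn, {"data_tier": "reco"})` twice, with a new matching file inserted in between, returns two different lists for the same definition.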
Job Running & Job Control (Run this exe | on this data)
[Diagram; components: Client, Local SM (Station Master), Job Manager (Project Master), Process Manager (SAM wrapper script), Batch System, User Task. Numbered arrows, with a further arrow labelled jobEnd:]
1. sam submit –defname=mydata –script=myexe
2. submit to SM
3. invoke
4. submit to BS
5. Submission ok
6. start job
7. Started
8. invoke
9. setJobCount/stop
10. resubmit
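[Diagram: file delivery to running jobs]
• Components: Stager; Replica Catalogue (LFN, PFN); Job control; BS; several User exe instances (physics & wrapper).
• Arrow labels, numbered 1 to 4 in the diagram: Fetch, Wait, Finished, Release.
• The user exe calls getNextFile() and gets the reply: Here's the path to a local file: /sam/cache1/boo/mydata1.dat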
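From the user exe's side, the getNextFile() exchange might look like the loop below. This is a sketch only: `FakeProject`, `get_next_file` and `release` are hypothetical names standing in for the job-control interface, and the files are assumed to be in the local cache already, so no waiting on the stager is modelled.

```python
from collections import deque

class FakeProject:
    """Hypothetical stand-in for the job-control side of getNextFile()."""

    def __init__(self, local_paths):
        self._pending = deque(local_paths)
        self.released = []

    def get_next_file(self):
        # A real station would block here until the stager has delivered a file.
        # This sketch assumes every file is already cached locally.
        return self._pending.popleft() if self._pending else None

    def release(self, path):
        # Tell job control the file is finished with, so its cache slot can be reused.
        self.released.append(path)

def run_user_exe(project, process):
    """Ask for files one at a time, process each, and always release it afterwards."""
    results = []
    while True:
        path = project.get_next_file()
        if path is None:  # no more files in the dataset
            break
        try:
            results.append(process(path))
        finally:
            project.release(path)
    return results
```

The user code only ever sees a local path; where the file came from and how it got there stay on the job-control side.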
Data Management
• Replica Catalogue
• Replication
• Cache Management
Replica Catalogue
• Combined with Metadata in an Oracle database, although logically distinct.
• Query on metadata to create a dataset
  • list of LFNs
  • Experiment specific (D0/CDF).
• Query on LFN to locate physical file.
  • Generic replica catalogue.
  • node:/path/to/cache/myfile.dat
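The LFN-to-physical-file query can be pictured as a simple mapping. The class below is an illustrative in-memory sketch, not SAM's catalogue: the names `ReplicaCatalogue`, `add_replica`, `remove_replica` and `locate` are assumptions, and only the `node:/path/to/cache/myfile.dat` location format is taken from the slide.

```python
from collections import defaultdict

class ReplicaCatalogue:
    """Toy replica catalogue: one logical file name maps to many physical locations."""

    def __init__(self):
        self._locations = defaultdict(set)  # lfn -> {(node, directory)}

    def add_replica(self, lfn, node, directory):
        self._locations[lfn].add((node, directory.rstrip("/")))

    def remove_replica(self, lfn, node, directory):
        self._locations[lfn].discard((node, directory.rstrip("/")))

    def locate(self, lfn):
        """Return every known physical file name for this LFN, as node:/dir/lfn."""
        replicas = self._locations.get(lfn, ())
        return sorted(f"{node}:{directory}/{lfn}" for node, directory in replicas)
```

Because the lookup key is just the LFN, nothing in this part is experiment specific, which is the sense in which the slide calls it generic.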
Replica Catalogue
• 600,000 files, increasing at 3000/day; 120TB.
• 150,000 in cache.
• 5000 files per day replicated, 5000 destroyed.
• ½ million queries per day (90% SELECT).
Cache Management
• 13.6TB, in several hundred individually managed caches.
• 1TB in and out per day (10k files).
• Cache lifetime ~10 days.
• Various prescriptions for cache replacement, e.g. first in, first out; last use.
• 70% hit rate (~6000 files/day).
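To make the difference between replacement prescriptions concrete, here is a small sketch of a size-limited file cache with two policies. It is not SAM's cache manager: `FileCache` and its methods are made-up names, and "lru" (evict the least recently used file) is one reading of the slide's "last use".

```python
from collections import OrderedDict

class FileCache:
    """Size-limited cache with first-in-first-out or least-recently-used eviction."""

    def __init__(self, capacity, policy="lru"):
        if policy not in ("fifo", "lru"):
            raise ValueError("policy must be 'fifo' or 'lru'")
        self.capacity = capacity
        self.policy = policy
        self._files = OrderedDict()  # lfn -> size; the front is the next to be evicted
        self.hits = 0
        self.misses = 0

    def request(self, lfn, size):
        """Return True on a hit; on a miss, evict until the file fits and store it."""
        if lfn in self._files:
            self.hits += 1
            if self.policy == "lru":
                self._files.move_to_end(lfn)  # a use makes the file "young" again
            return True
        if size > self.capacity:
            raise ValueError("file is larger than the whole cache")
        self.misses += 1
        while sum(self._files.values()) + size > self.capacity:
            self._files.popitem(last=False)
        self._files[lfn] = size
        return False

    def contents(self):
        return list(self._files)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

With the request sequence a, b, a, c on a two-file cache, the FIFO policy evicts `a` (the oldest arrival) while the LRU policy evicts `b` (the file used longest ago).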
Replication
• Easy – use your favourite ftp.
• BUT …… what could go wrong?
  • Cache space – Cache Management.
  • Network, dead node, corrupted file – retries.
  • Dead disk, uncached – fail-over.
  • Sluggish robot, slow delivery – hold job.
• A stroll through my log file.
05/07/02 16:01:52 imperial-test.SM.imperial-test 11698: Delivery status: Simple Status:
  Code: delivery error (Category SAM Internal)
  Severity level: ERROR
  Generated on 07 May 16:01:51 by eworker
  In the context: executed process samcp cab:d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result:
  EXIT CODE:256
  STDOUT: Executing Kerberos rcp: /usr/krb5/bin/rcp d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 /sam/cache20/lancs/boo
  STDERR: kshd: Logins currently disabled. trying normal rcp (/usr/bsd/rcp) WARNING: NO ENCRYPTION! d0cs015.fnal.gov: Connection refused, method name: samcp
  Recommended action: Please contact sam-admin@fnal.gov
05/07/02 16:01:52 imperial-test.SM.imperial-test 11698: Delivery failed, scheduling retry in 3 seconds
Retry
05/07/02 16:02:35 imperial-test.SM.imperial-test 11698: Delivery status: Simple Status:
  Code: delivery error (Category SAM Internal)
  Severity level: ERROR
  Generated on 07 May 16:02:35 by eworker
  In the context: executed process samcp cab:d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result:
  EXIT CODE:256
  STDOUT: Executing Kerberos rcp: /usr/krb5/bin/rcp d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 /sam/cache20/lancs/boo
  STDERR: kshd: Logins currently disabled. trying normal rcp (/usr/bsd/rcp) WARNING: NO ENCRYPTION! d0cs015.fnal.gov: Connection refused, method name: samcp
  Recommended action: Please contact sam-admin@fnal.gov
05/07/02 16:02:35 imperial-test.SM.imperial-test 11698: Maximum number of retrials exceeded. Will not retry again from this source!
05/07/02 16:02:35 imperial-test.SM.Repler 11698: Will avoid locations: (cab:d0cs015.fnal.gov:/sam/cache/boo)
05/07/02 16:02:35 imperial-test.SM.Repler 11698: No loc is preferred, selecting enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all (prl733.24)
Give up on this source. Avoid this location. Get another location from RC, and retry.
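The retry-then-fail-over behaviour visible in this log can be sketched as a pair of nested loops. This is an illustration under assumptions, not the station's code: `deliver` and its parameters are hypothetical, `copy` stands in for a transfer such as samcp, and the defaults (one retry, three seconds apart) are simply read off the excerpt rather than taken from any known configuration.

```python
import time

def deliver(lfn, locations, copy, max_retries=1, retry_delay=3.0, sleep=time.sleep):
    """Try each replica location in turn; retry a few times, then avoid it.

    `copy(location, lfn)` returns True when the transfer succeeded.
    Returns (location_used, avoided_locations).
    """
    avoided = []
    for location in locations:
        for attempt in range(1 + max_retries):
            if copy(location, lfn):
                return location, avoided
            if attempt < max_retries:
                sleep(retry_delay)  # "scheduling retry in 3 seconds"
        # Retries exhausted: give up on this source and move to the next one.
        avoided.append(location)
    raise RuntimeError(f"no usable location for {lfn}; avoided {avoided}")
```

In the log, the first location is a disk cache on another node and the fallback is the tape store; the sketch treats both as opaque strings handed over by the replica catalogue.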
05/07/02 16:10:53 imperial-test.SM.imperial-test 11698: Delivery status: Simple Status:
  Code: OK (Category Enstore)
  Severity level: SUCCESS
  Generated on 07 May 16:10:53 by eworker
  In the context: executed process samcp enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_021.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result:
  EXIT CODE:0
  STDOUT:
    INFILE=/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_021.raw_p10.15.01_000
    OUTFILE=/sam/cache20/lancs/boo
    FILESIZE=1369320147
    LABEL=PRL859
    LOCATION=0000_000000000_0000067
    DRIVE=d0enmvr9a:/dev/rmt/tps0d1n
    DRIVE_SN=4560020042
    TRANSFER_TIME=160.38
    SEEK_TIME=73.47
    MOUNT_TIME=25.36
    QWAIT_TIME=65.79
    TIME2NOW=329.78
    STATUS=ok
  STDERR: Completed transferring 1369320147 bytes in 1 files in 329.720216036 sec. Overall rate = 3.96 MB/sec. Drive rate = 8.14 MB/sec. Network rate = 8.13 MB/sec. Exit status
Got it
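The rates quoted in this log entry can be checked from its own numbers. The short calculation below assumes that "MB" in the log means 2^20 bytes, that the overall rate divides the file size by the total elapsed time, and that the drive rate divides it by TRANSFER_TIME; those readings are inferred from the arithmetic, not documented in the slides.

```python
MIB = 1024 * 1024  # assumption: the log's "MB" is 2**20 bytes

def rate_mib_per_s(size_bytes, seconds):
    """Average throughput in MiB per second."""
    return size_bytes / MIB / seconds

# Values copied from the log entry above.
filesize = 1369320147          # FILESIZE, bytes
elapsed = 329.720216036        # total time reported on STDERR, seconds
transfer_time = 160.38         # TRANSFER_TIME, seconds
phases = {"transfer": 160.38, "seek": 73.47, "mount": 25.36, "queue wait": 65.79}

overall = rate_mib_per_s(filesize, elapsed)
drive = rate_mib_per_s(filesize, transfer_time)
timed_total = round(sum(phases.values()), 2)

print(f"overall rate: {overall:.2f} MB/sec")
print(f"drive rate:   {drive:.2f} MB/sec")
print(f"timed phases: {timed_total} s of {elapsed:.2f} s elapsed")
```

Under those assumptions the overall rate is about half the drive rate because queue wait, tape mount and seek together take roughly as long as the transfer itself.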
05/07/02 15:46:09 imperial-test.SM.PBS BS Adapter 11698: Remembering that job 1760.gw39.hep.ph.ic.ac.uk for project 61983_sam_ is held
--------------------------
05/07/02 16:00:56 imperial-test.SM.imperial-test 11698: Delivery status: Simple Status:
  Code: OK (Category Enstore)
  Severity level: SUCCESS
  Generated on 07 May 16:00:56 by eworker
  In the context: executed process samcp enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_012.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result:
  EXIT CODE:0
  STDOUT:
    INFILE=/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_012.raw_p10.15.01_000
    OUTFILE=/sam/cache20/lancs/boo
    FILESIZE=788805399
    LABEL=PRL829
    LOCATION=0000_000000000_0000025
    DRIVE=d0enmvr9a:/dev/rmt/tps0d1n
    DRIVE_SN=4560020042
    TRANSFER_TIME=90.08
    SEEK_TIME=45.05
    MOUNT_TIME=27.14
    QWAIT_TIME=225.50
    TIME2NOW=392.28
    STATUS=ok
  STDERR: Completed transferring 788805399 bytes in 1 files in 392.221878052 sec. Overall rate = 1.92 MB/sec. Drive rate = 8.35 MB/sec. Network rate = 8.35 MB/sec. Exit status = 0., method name: samcp
  Recommended action: Please contact sam-admin@fnal.gov
---------------------------
05/07/02 1
05/07/02 16:00:57 imperial-test.SM.PBS BS Adapter 11698: Will execute: qrls 1760.gw39.hep.ph.ic.ac.uk
Hold in queue until 1st file delivered. File arrives. Release.
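The hold-then-release pattern in this excerpt amounts to a little bookkeeping on the batch-system adapter's side. The sketch below is hypothetical: `HoldUntilFirstFile` and its methods are invented names, and it only builds the `qrls` command line (the PBS command shown in the log) rather than running it.

```python
class HoldUntilFirstFile:
    """Remember which batch jobs are held, and release each on its first delivery."""

    def __init__(self):
        self._held = {}  # project name -> batch job id

    def remember_held(self, project, job_id):
        # Corresponds to "Remembering that job ... for project ... is held".
        self._held[project] = job_id

    def file_delivered(self, project):
        """Return the release command on the first delivery, otherwise None."""
        job_id = self._held.pop(project, None)
        if job_id is None:
            return None  # job already released, or never held
        return ["qrls", job_id]
```

A caller would pass the returned list to something like `subprocess.run`; holding the job this way keeps a batch slot from sitting idle while the first file is still being fetched from tape.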
Conclusions
• Executable is stupid: it has no knowledge of data transfer. The job manager does the clever stuff.
• SAM has a fully featured, tried and tested data management system.
• No GSI, GridFTP, or CondorG as yet … but you need more than Gs to make a grid!