Operational Scripts for Reconstructing Run7 Min Bias PRDFs at Vanderbilt’s ACCRE Farm

Charles Maguire et (beaucoup d’) al.
http://www.hep.vanderbilt.edu/~maguirc/Run7/phenixVanderbiltMinBias.html
(web site updated last week for the main control script documentation, cautoRun7)

Local Group Meeting
Overview of the Five Script Functions

• FDT transfer of PRDFs from firebird to ACCRE
  • All firebird PRDF scripts are in the /mnt/eon0/bnl/control directory (or eon1)
  • All ACCRE PRDF scripts are in the /home/phnxreco/prdfTransfering directory
• Database updating (new as of May 30)
• Submission of PBS jobs for reconstructing PRDFs
• FDT/gridFTP transfer of outputs to RCF
  • All firebird nanoDST scripts are in the /mnt/eon0/bnl/control directory (or eon1)
  • All ACCRE nanoDST scripts are in /home/phnxreco/nanoDstTransfering
• Scripts for monitoring the above functions
• All scripts are committed in CVS
  • http://www.phenix.bnl.gov/viewcvs/online/vanderbilt/run7AuAu/
  • 113 entries (4 added since last week for gridFTP monitoring)
FDT Transfer of PRDFs to ACCRE

• Computer nodes involved
  • vpac15 does hourly checks of the firebird eon0 and eon1 disks (postgres account)
  • firebird acts as the FDT server node (bnl account)
  • vmps01 acts as the FDT client (phnxreco account)
• CRON scripts on vpac15 (vpac15 chosen because vupac nodes can send e-mails)
  • 15 * * * * /var/lib/pgsql/monitoring/inputFireBirdOccupancyEon0.pl >& /var/lib/pgsql/monitoring/inputFireBirdOccupancyPerlEon0.log &
    • checks the percent occupancy of the eon0 disk every hour
    • a corresponding cron job at 20 minutes after the hour checks eon1
  • 25 * * * * /var/lib/pgsql/monitoring/inputFireBirdNewFilesEon0.csh >& /var/lib/pgsql/monitoring/inputFireBirdNewFilesEon0.log &
    • checks whether there are new PRDF files to be transferred from eon0
    • the script calls /var/lib/pgsql/monitoring/inputFireBirdNewFilesEon0.pl on vpac15
    • a corresponding cron job at 30 minutes after the hour checks eon1
  • All four of these scripts will issue e-mails on what they have found
  • Exact details of the scripts will be posted on the WWW, just as for cautoRun7
• Scripts on the firebird node (a sketch of the occupancy check follows this slide)
  • inputFireBirdOccupancyEon0.csh is called by /var/lib/pgsql/monitoring/inputFireBirdOccupancyEon0.pl
    • the script returns the % occupancy of the eon0 buffer disk, which the perl script uses to send an e-mail to me
  • inputStatusCheckEon0.csh is called by vpac15 ../inputFireBirdNewFilesEon0.pl
    • this script calls the inputStatusCheckFilesEon0.pl script in the /mnt/eon0/bnl/control directory
    • inputStatusCheckFilesEon0.pl determines whether there are PRDFs to be transferred
    • if there are PRDFs to be transferred, an fdtPrdfServerEon0.csh script starts on firebird, and an fdtPrdfClientEon0.csh client starts on vmps01 at ACCRE
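A minimal sketch of the kind of hourly occupancy check described above, assuming passwordless ssh from vpac15 to firebird and a working mail command; the threshold, recipient address, and df parsing are illustrative guesses, not the production inputFireBirdOccupancyEon0.pl.

  #!/usr/bin/perl -w
  # Hypothetical sketch of an hourly buffer-disk occupancy check.
  use strict;

  my $host      = "firebird";
  my $disk      = "/mnt/eon0";
  my $limit     = 90;                      # assumed alert threshold in percent
  my $recipient = 'someone@example.edu';   # placeholder address

  # Ask the remote node for the occupancy of the buffer disk.
  my $dfLine = `ssh $host "df -P $disk | tail -1"`;
  my ($pct) = $dfLine =~ /(\d+)%/;
  die "could not parse df output: $dfLine" unless defined $pct;

  # Always report the occupancy; flag it loudly when over threshold.
  my $subject = ($pct >= $limit) ? "WARNING: $disk at ${pct}%" : "$disk at ${pct}%";
  open my $mail, "| mail -s '$subject' $recipient" or die "mail failed: $!";
  print $mail "Hourly occupancy check on $host:$disk reports ${pct}% full.\n";
  close $mail;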
FDT Transfer of PRDFs to ACCRE

• Computer nodes involved (repeated from previous slide)
  • vpac15 does hourly checks of the firebird eon0 and eon1 disks (postgres account)
  • firebird acts as the FDT server node (bnl account)
  • vmps01 acts as the FDT client (phnxreco account)
• Scripts on the firebird node (repeated from previous slide)
  • inputFireBirdOccupancyEon0.csh is called by /var/lib/pgsql/monitoring/inputFireBirdOccupancyEon0.pl
    • the script returns the % occupancy of the eon0 buffer disk; the perl script sends an e-mail to me
  • inputStatusCheckEon0.csh is called by vpac15 ../inputFireBirdNewFilesEon0.pl
    • this script calls the inputStatusCheckFilesEon0.pl script in the /mnt/eon0/bnl/control directory
    • inputStatusCheckFilesEon0.pl determines whether there are PRDFs to be transferred
    • if there are PRDFs to be transferred, an fdtPrdfServerEon0.csh script starts on firebird, and this script calls the fdtStartPrdfClientEon0.csh script with parameters on vmps01 at ACCRE
• Scripts on the vmps01 node (a sketch of the confirm-then-erase step follows this slide)
  • /home/phnxreco/prdfTransfering/fdtPrdfClientEon0.csh actually copies the PRDF files to the /gpfs3 area
    • this script is started by the fdtStartPrdfClientEon0.csh script, using its input, after a 15 second delay
  • After fdtPrdfClientEon0.csh finishes copying, it exits by calling the confirmTransferAndThenEraseEon0.pl script
  • confirmTransferAndThenEraseEon0.pl verifies that all the PRDF files have been copied correctly to the /gpfs3 area; if so, the PRDF files are deleted from the eon0 area on firebird
    • the confirmTransferAndThenEraseEon0.pl script on vmps01 calls the /mnt/eon0/bnl/control/haveBeenTransferredList.csh script on the firebird node to get a list of the files which were supposed to have been transferred
    • the haveBeenTransferredEraseEon0.csh script is then called to do the actual file deletion on firebird
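The key safety property of the vmps01 scripts is that nothing is erased on firebird until every file has been verified at ACCRE. A minimal sketch of that confirm-then-erase logic, assuming the server-side list script prints "name size" pairs (an assumed output format; the real scripts may differ):

  #!/usr/bin/perl -w
  # Hypothetical sketch of the confirm-then-erase step run on vmps01.
  use strict;

  my $destDir = "/gpfs3/RUN7PRDF/auauMinBias200GeV";

  # Ask firebird which files it believes were transferred ("name size" pairs assumed).
  my @sent = `ssh bnl\@firebird /mnt/eon0/bnl/control/haveBeenTransferredList.csh`;

  my $allGood = 1;
  foreach my $line (@sent) {
      next if $line =~ /^\s*$/;
      my ($name, $size) = split ' ', $line;
      my $local = "$destDir/$name";
      unless (defined $size && -e $local && -s $local == $size) {
          warn "mismatch or missing: $name\n";
          $allGood = 0;
      }
  }

  # Only when every file has been verified do we erase the originals on firebird.
  system("ssh", 'bnl@firebird',
         "/mnt/eon0/bnl/control/haveBeenTransferredEraseEon0.csh") if $allGood;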
FDT Transfer of PRDFs to ACCRE: Inventory and Location of PRDF Files at ACCRE

• Computer nodes involved (repeated from previous slides)
  • vpac15 does hourly checks of the firebird eon0 and eon1 disks (postgres account)
  • firebird acts as the FDT server node (bnl account)
  • vmps01 acts as the FDT client (phnxreco account)
• Three locations at ACCRE for files copied from firebird (important fact)
  • 17 TBytes at /blue/phenix/RUN7PRDF/auauMinBias200GeV (ITS Blue-Arc platform)
    • now 95% full, no more to be added
    • the /blue/phenix area is the current input file source area for the cautoRun7 scripts
  • 20 TBytes at /gpfs3/RUN7PRDF/auauMinBias200GeV (ACCRE owned disks)
    • now 70% full, current destination area firebird -> ACCRE
    • the /gpfs3/RUN7PRDF area is also the top directory for the output files
  • 8 TBytes at /gpfs2/RUN7PRDF/auauMinBias200GeV (ACCRE owned disks)
    • now 80% full, the original destination area firebird -> ACCRE before /gpfs3
    • all the RUN7 PRDF files on /gpfs2 have been copied to /blue/phenix
    • the /gpfs2 area is used for simulation project output as well
• Major action item to be done
  • We don’t yet monitor the /gpfs3 occupancy (a sketch of such a monitor follows this slide)
  • At some point we will have to delete PRDFs to make room for new PRDFs, or ask for more disk space
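Since /gpfs3 occupancy is not yet monitored (the action item above), here is a minimal sketch of what such a monitor could look like: alert when the area passes a threshold and list the oldest PRDFs as deletion candidates, without deleting anything automatically. The threshold and file-name pattern are assumptions.

  #!/usr/bin/perl -w
  # Hypothetical sketch of a /gpfs3 occupancy monitor.
  use strict;
  use File::Find;

  my $area  = "/gpfs3/RUN7PRDF/auauMinBias200GeV";
  my $limit = 85;   # assumed alert threshold in percent

  my ($pct) = `df -P $area | tail -1` =~ /(\d+)%/;
  exit 0 if !defined $pct || $pct < $limit;

  # Collect PRDF files (assumed naming pattern) sorted oldest first.
  my @prdfs;
  find(sub { push @prdfs, $File::Find::name if -f && /PRDF/ }, $area);
  @prdfs = sort { (stat $a)[9] <=> (stat $b)[9] } @prdfs;

  print "WARNING: $area is ${pct}% full; oldest candidates for deletion:\n";
  print "$_\n" for @prdfs[0 .. ($#prdfs < 9 ? $#prdfs : 9)];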
Database Updating

• Sequence of database updating
  • a cron job on rftpexp01 under the maguire account runs at 00:05 EDT
    • uses gridFTP to deliver 3 restore files to firebird /mnt/eon0/rhic/databaseUpdating
    • a second cron job at 01:05 confirms that the transfers were successful
  • a cron job on the firebird rhic account acts as the FDT server process to vpac04 at 11:35 CDT
  • a cron job on the vpac04 postgres account acts as the FDT client process 30 seconds later (a sketch of the signal-file handshake follows this slide)
    • the client job first removes any restoreFilesAlreadyUsed signal file from /rhic2/pgsql/dbRun7
    • the client job makes a /rhic2/pgsql/dbRun7/restoreFilesTransferInProgress signal file
    • FDT copies the 3 restore files to /rhic2/pgsql/dbRun7
    • after the copy is completed, a restoreFilesTransferCompleted signal file is created
    • the client job deletes the /rhic2/pgsql/dbRun7/restoreFilesTransferInProgress signal file
  • a vpac15 cron job runs startRestoreFromDumpsABC.csh every hour (except 6 - midnight)
    • startRestoreFromDumpsABC checks to see if it is OK to start a restore job
    • if it is OK to start a restore job, then the RestoreFromDumpsABC.csh script is run
• Operation of the RestoreFromDumpsABC.csh script
  • checks which cycle among A, B, C is to be updated (e.g., daq_A or daq_B or daq_C)
  • after the update is complete, a new .odbc.ini is copied to the phnxreco account on ACCRE
  • similarly, new versions of the checkcalib and checkrun files are also copied for cautoRun7
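The in-progress/completed marker files above form a simple handshake, so each side can tell whether the other finished cleanly. A minimal sketch of the client side of that handshake, with the FDT invocation shown only schematically (the real command line and server address live in the cron scripts):

  #!/usr/bin/perl -w
  # Hypothetical sketch of the signal-file handshake on the vpac04 FDT client.
  use strict;

  my $dir = "/rhic2/pgsql/dbRun7";

  # Clear any stale "already used" marker from the previous cycle.
  unlink "$dir/restoreFilesAlreadyUsed";

  # Announce that a transfer is underway before starting the copy.
  open my $f, ">", "$dir/restoreFilesTransferInProgress" or die $!;
  close $f;

  # Run the FDT client (schematic invocation, not the production one).
  my $status = system("java -jar fdt.jar -c firebird -d $dir");

  if ($status == 0) {
      # Mark success so the restore scripts know the files are complete.
      open my $g, ">", "$dir/restoreFilesTransferCompleted" or die $!;
      close $g;
  }

  # Either way, the in-progress marker must not be left behind.
  unlink "$dir/restoreFilesTransferInProgress";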
Submission of PBS Jobs to do Reconstruction

• Submission of PBS jobs for reconstructing PRDFs
  • Controlled by the cautoRun7 master script, e.g. submit 200 jobs
  • http://www.hep.vanderbilt.edu/~maguirc/Run7/cautoOperations.html
  • The script is launched every 30 minutes, at 5 and 35 minutes after the hour, using a phnxreco cron job on vmps18 (nothing important about vmps18)
• General outline of the operations for cautoRun7 (a sketch of the launch gate follows this slide)
  • Check if OK to launch 200 new jobs: will not run if jobs are already running
  • Will not run if the output from the previous cycle is not yet at RCF
  • Checks that the database is accessible
  • Harvests the current production output into a “newstore” area
  • Checks which jobs succeeded and which failed, removes temporary work areas
  • Makes a list of run numbers for the next production cycle (complex logic)
  • Submits a new set of 200 jobs (the number 200 is set in the submit.pl script)
  • After all new jobs are submitted, the transfer process to RCF is begun
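A minimal sketch of the launch gate at the top of that outline; the helper script names here are invented stand-ins for the real cautoRun7 internals, and the signal-file name is an assumption based on the monitoring slide later in this talk:

  #!/usr/bin/perl -w
  # Hypothetical sketch of cautoRun7's "is it safe to start a cycle" checks.
  use strict;

  my $top = "/gpfs3/RUN7PRDF/prod/run7";

  # Never overlap production cycles: any queued/running jobs block a new cycle.
  my $running = `qstat | grep -c phnxreco`;
  chomp $running;
  exit 0 if $running > 0;

  # The previous cycle's output must already be safely at RCF.
  exit 0 if -e "$top/gridFtpInProgress";     # assumed signal-file location

  # The calibration database must be reachable before jobs are submitted.
  exit 1 if system("$top/checkcalib") != 0;  # checkcalib is refreshed by the DB update

  # Safe to proceed: harvest the previous output, then submit the next 200 jobs.
  system("$top/harvestToNewstore.csh");      # hypothetical helper name
  system("perl -w $top/submit.pl");          # the job count (200) lives in submit.pl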
Submission of PBS Jobs to do Reconstruction

• Submission of PBS jobs for reconstructing PRDFs (previous slide)
  • Controlled by the cautoRun7 master script, e.g. submit 200 jobs
• Scripts for manually monitoring the reconstruction jobs (covered last week)
  • Look in the .cshrc file of the /home/phnxreco account for the definitions
• Special alias commands (always uppercase letters; a sketch of the waiting/running breakdown follows this slide)
  • STATPBS (means qstat | grep phnxreco) lists queued jobs (running and waiting)
  • SHOWPBS (means showq | grep phnxreco) lists queued jobs in a different format
  • WAITING (means /home/phnxreco/prdfTransfering/checkPBSWaiting.csh)
    • shows jobs which are actually running and those which are waiting
    • jobs can be waiting as idle or deferred (lowered priority, complex scheduling priorities)
  • RUNNINGPBS (means showq | grep phnxreco | grep -c Running ; showq | grep phnxreco | grep -c phnxreco)
    • produces three lines of output: completed, running and total jobs in the queue
  • JOBSPBS (means perl -w /gpfs3/RUN7PRDF/prod/run7/jobStatisticsPBS.pl)
    • produces a detailed summary of the last major job submission
  • SCANDBFAIL (means perl -w $CVSRUN7/scanForDBFailures.pl)
    • used immediately after a 200 job submission
    • determines which jobs had initial DB access failures (the problem should be fixed now)
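For reference, a minimal sketch of the kind of breakdown the WAITING alias produces, assuming showq prints one line per job containing the owner name and a state word (Running/Idle/Deferred); the real checkPBSWaiting.csh may parse the output differently:

  #!/usr/bin/perl -w
  # Hypothetical sketch of a running/idle/deferred breakdown from showq output.
  use strict;

  my ($run, $idle, $defer) = (0, 0, 0);
  foreach my $line (`showq`) {
      next unless $line =~ /phnxreco/;
      $run++   if $line =~ /Running/;
      $idle++  if $line =~ /Idle/;
      $defer++ if $line =~ /Deferred/;
  }
  print "running=$run  idle=$idle  deferred=$defer  total=", $run + $idle + $defer, "\n";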
FDT/gridFTP Transfer of Output to RCF

• FDT/gridFTP transfer of outputs to RCF
  • http://www.hep.vanderbilt.edu/~maguirc/Run7/nanoDSTTransferToRCF.html
  • The transfer of the output to RCF proceeds in two stages
    • FDT transfer from ACCRE to the firebird disks (eon0 or eon1)
    • gridFTP transfer from firebird to RCF (using the maguire gridFTP certificate)
  • The process is started at the end of the cautoRun7 script
  • The gridFTP transfer to RCF of all the output files must succeed before any new production jobs are submitted by the next cautoRun7 cycle
• FDT transfer ACCRE -> firebird
  • the vmps02 node is used as the server, the firebird node is used as the client
  • the FDT can go to either eon0 or eon1, whichever is less busy/full
  • ~770 GBytes of output for 200 jobs, FDT at 45 MBytes/second ===> ~5 hours
• gridFTP transfer firebird -> RCF (the transfer-time arithmetic is checked in the sketch after this slide)
  • slower transfer rate of ~20 MBytes/second ===> ~11 hours ===> 16 hours total transfer
  • 16 hours is well matched to the ~20 hour cycle of the compute jobs
  • a maguire cron job on rftpexp01 monitors for the successful transfer of all files
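As a sanity check on the quoted times, a two-line calculation reproduces them from the 770 GBytes of output and the two measured rates:

  #!/usr/bin/perl -w
  # Back-of-the-envelope check of the transfer times quoted on this slide.
  use strict;

  my $gbytes = 770;
  printf "FDT at 45 MBytes/s:     %.1f hours\n", $gbytes * 1024 / 45 / 3600;  # ~4.9
  printf "gridFTP at 20 MBytes/s: %.1f hours\n", $gbytes * 1024 / 20 / 3600;  # ~11.0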
Issues for gridFTP Transfer of Output to RCF

• FDT/gridFTP transfer of outputs to RCF (previous slide)
  • http://www.hep.vanderbilt.edu/~maguirc/Run7/nanoDSTTransferToRCF.html
  • The transfer of the output to RCF proceeds in two stages: FDT and gridFTP
  • The gridFTP transfer to RCF of all the output files must succeed before any new production jobs are submitted by the next cautoRun7 cycle
• gridFTP transfer firebird -> RCF (previous slide)
  • slower transfer rate of ~20 MBytes/second ===> ~11 hours ===> 16 hours total transfer
  • 16 hours is well matched to the ~20 hour cycle of the compute jobs
  • a maguire cron job on rftpexp01 monitors for the successful transfer of all files
• Unsettled issues for gridFTP transfer to RCF (a throughput-watchdog sketch follows this slide)
  • No fixed, large volume disk area is available at RCF for this Run7 reco output
  • The output and monitoring scripts have to be manually edited for a new RCF disk area
  • We need to automate the process of selecting an available disk area at RCF
  • A disastrous slowdown (~1 MByte/second) to data59 was luckily caught on Saturday morning; the slowdown was not present for other RCF disks, and we were able to switch to data58 @ 20 MBytes/sec
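The data59 incident was caught by eye; a minimal sketch of a watchdog that could catch it automatically by timing a small probe transfer. The probe file, destination URL, and alert threshold are all assumptions:

  #!/usr/bin/perl -w
  # Hypothetical sketch of a gridFTP throughput watchdog.
  use strict;
  use Time::HiRes qw(time);

  my $probeFile = "/mnt/eon0/bnl/control/probe_100MB";                # hypothetical 100 MB test file
  my $dest      = "gsiftp://rftpexp01.rhic.bnl.gov/phenix/data58/";   # assumed destination URL
  my $minMBps   = 5;   # anything below this deserves attention (~20 MBytes/s is normal)

  my $t0 = time;
  system("globus-url-copy", "file://$probeFile", $dest) == 0
      or die "probe transfer failed\n";
  my $rate = 100 / (time - $t0);   # probe file is 100 MBytes

  warn sprintf("SLOW: gridFTP running at %.1f MBytes/s (expect ~20)\n", $rate)
      if $rate < $minMBps;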
Scripts for Monitoring Other Scripts (to be written)

• Need scripts to check if the signal files have become too old (a sketch follows this slide)
  • fdtInProgress on either the eon0 or eon1 control areas (part of PRDF transferring)
  • fdtInProgress on /home/phnxreco/nanoDstTransfering (< 8 hours)
  • gridFtpInProgress on /home/phnxreco/nanoDstTransfering (< 15 hours)
  • gridFtpInProgress on the eon0 or eon1 control areas (< 15 hours)
  • cautoRun7InProgress on /gpfs3/RUN7PRDF/prod/run7 (< 1 hour)
• Need a script to check if PBS jobs have crashed
  • if the crash was early, possibly resubmit
  • identify the node where the job crashed, and notify the ACCRE people
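A minimal sketch of the stale-signal-file checker described above, pairing each signal file with its maximum allowed age from this slide (the eon0/eon1 entries would be added with the same pattern):

  #!/usr/bin/perl -w
  # Hypothetical sketch of the not-yet-written stale-signal-file monitor.
  use strict;

  my %maxAgeHours = (
      "/home/phnxreco/nanoDstTransfering/fdtInProgress"     => 8,
      "/home/phnxreco/nanoDstTransfering/gridFtpInProgress" => 15,
      "/gpfs3/RUN7PRDF/prod/run7/cautoRun7InProgress"       => 1,
  );

  while (my ($file, $limit) = each %maxAgeHours) {
      next unless -e $file;   # no signal file means nothing is in progress
      my $ageHours = (time - (stat $file)[9]) / 3600;
      printf "STALE: %s is %.1f hours old (limit %d)\n", $file, $ageHours, $limit
          if $ageHours > $limit;
  }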
Major Action Items to be Done by the VU Crew

• Write adaptive software for knowing which RCF disk to use for output (a sketch follows this slide)
  • Catalog the output locations at RCF into the FROG database (Irina)
  • Also catalog locally what we have already done
• Prepare to switch to /gpfs3 as the new source input area
  • /gpfs3 is 65% full now with PRDFs and reconstructed output
  • We should delete 90% of the reconstructed output from /gpfs3 and save 10%
    • must be careful to maintain empty files for the makelist script used by cautoRun7
    • must write a new “safeDelete” script for this purpose
• Delete the already reconstructed PRDFs from /blue/phenix
  • Can we write these PRDFs to tape/backup (ITS contact)? How much would that cost?
  • Use /blue/phenix as the destination area for new PRDFs
• Develop a WWW site which provides a snapshot of the project status
  • disk space used on eon0, eon1, /gpfs3, /blue/phenix
  • date and size of the last PRDF transfer from 1008
  • number of reco jobs already done, status of current jobs, gridFTP transfer rate
  • disk situation at RCF, current output destination, next output destination
  • the www site will be looked at by SA2 to determine if there is a problem
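A minimal sketch of the adaptive disk chooser named in the first action item: probe the candidate RCF data areas and pick the one with the most free space above a floor. The candidate paths, the floor, and the use of df are all assumptions; the real solution may instead query the FROG catalog.

  #!/usr/bin/perl -w
  # Hypothetical sketch of an adaptive RCF output-disk chooser.
  use strict;

  my @candidates = map { "/phenix/data$_" } (55 .. 59);   # assumed path pattern
  my $floorGB    = 800;   # room for one production cycle (~770 GBytes of output)

  my ($best, $bestFree) = (undef, 0);
  foreach my $disk (@candidates) {
      # df -P prints the available KBytes just before the percent-used column.
      my ($freeKB) = `df -P $disk 2>/dev/null | tail -1` =~ /(\d+)\s+\d+%/;
      next unless defined $freeKB;
      my $freeGB = $freeKB / 1024 / 1024;
      ($best, $bestFree) = ($disk, $freeGB) if $freeGB > $bestFree;
  }

  die "no RCF disk has ${floorGB} GBytes free\n" unless $best && $bestFree >= $floorGB;
  print "next output destination: $best (", int($bestFree), " GBytes free)\n";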