150 likes | 256 Views
McFarm Improvements and Re-processing Integration. D. Meyer for The UTA Team DØ SAR Workshop Oklahoma University 9/26 - 9/27/2003. http://www-hep.uta.edu/~d0race/McFarm/McFarm.html. Reasons for Using McFarm. McFarm is a DØ MC Control Software developed at UTA and used in six farms
E N D
McFarm Improvements and Re-processing Integration D. Meyer for The UTA Team DØ SAR Workshop Oklahoma University 9/26 - 9/27/2003 http://www-hep.uta.edu/~d0race/McFarm/McFarm.html
Reasons for Using McFarm • McFarm is a DØ MC Control Software developed at UTA and used in six farms • Simplifies Monte Carlo Production • Manages the Cluster with Minimum Labor • Manages the Cluster Efficiently • Minimizes Impact of Changes to SAM, mc_runjob, other DØ software • User-Oriented McFarm Improvements; D. Meyer, UTA DØ SAR Workshop, OU
McFarm Software Integration • DØ Binaries - minitars or full release • SAM - declaration, storage, retrieval • mc_runjob - job and metadata construction • NFS - access to binaries, minbias database • NIS - account management • ssh - intra-cluster monitoring and control • Batch Queues - PBS and Condor McFarm Improvements; D. Meyer, UTA DØ SAR Workshop, OU
Improvements: Procedural Changes • Request monitoring now provided by McFarm monitor to close requests • “check_sam” now obsolete, replaced by archive daemon & store-verification • Mechanism to handle too-large reco tasks: do just the pythia/d0g/sim (PDS jobs) and let requestor do reco on CAB McFarm Improvements; D. Meyer, UTA DØ SAR Workshop, OU
Bug Fixes • event-count now correct when not all events done • available-space correct on NFS-mounted disks (df command) • No longer attempting to patch metadata for bad key-words. • Others McFarm Improvements; D. Meyer, UTA DØ SAR Workshop, OU
Enhancements • Two-day grace period for final tmb merge now configurable: FARM_MERGE_GRACE_HOURS • Also FARM_MERGE_MAX_EVENTS and FARM_MERGE_MAX_FILES • Monitor reassurance can be turned off: FARM_MONITOR_REASSURE=‘NO’ McFarm Improvements; D. Meyer, UTA DØ SAR Workshop, OU
Enhancements - 2 • FARM_SAM_VERIFY_STORE_HOURS makes storing and purging separate events in McFarm. “Archive” daemon. • SAM store retry improved - will undeclare, cancel-store as necessary • bin/onetime/re-store full-job-dir-name • SAM gather will get to merger files periodically even if busy with regular stores McFarm Improvements; D. Meyer, UTA DØ SAR Workshop, OU
Enhancements - 3 • Request life-cycle monitoring handled by McFarm monitor to improve turn-around. • Number of events now in gather.log • execute daemon will detect reco stall due to over-swapping and will kill job. • Execute daemon retains job hist even when stopped/restarted (job.hist file) McFarm Improvements; D. Meyer, UTA DØ SAR Workshop, OU
Enhancements - 4 • launch_request accepts PDS (and S) jobs to handle unwieldy reco requests - it stores sim files. • purge_job accepts “--d0phase=mcpNN” argument to purge archives by D0 phase, including merger archives. McFarm Improvements; D. Meyer, UTA DØ SAR Workshop, OU
Re-Processing Real Data • Basic approach is to do d0reco binary only, using reprocessing Framework rcp, running on raw or reco file as input. • Joel Snow traced the rcp usage in UMICH job • Mark Sosebee has done sample by hand and analyzed histograms - so far so good • Dave Evans has included re-processing support in version 06-00-02 of mc_runjob McFarm Improvements; D. Meyer, UTA DØ SAR Workshop, OU
Re-Processing Real Data - 2 • I have used mc_runjob v06 to manually re-process both MC reco files and raw files. Testing is continuing - presently some problems with metadata declaration. • The bad news: mc_runjob v06 contains substantial changes to job structure and execution that will require days of work to integrate into McFarm and test all code McFarm Improvements; D. Meyer, UTA DØ SAR Workshop, OU
Re-Processing Real Data - 3 • Our approach is to have Request_NNNN.py file include a sam dataset definition of files to be re-processed, feed into McFarm just like a Monte Carlo request • Some of McFarm is ready (SAM acquire), some is not (launch_request RT, v06 adaptation, switch from events to files) • Dave Evans is leaving McFarm Improvements; D. Meyer, UTA DØ SAR Workshop, OU
Re-Processing Real Data - 4 • Key contents of Request_REPROC02.py: 'Reconstructed':{'datasetdefname':’ reco_14.02.00_raw_2_files’, 'frameworkrcpname':'runD0recoSAM_data_reprocess_p13dst.rcp',}, • launch_request REPROC02 /home/mctest 0 RT • job UTATEST-RT-ReqREPROC02-03265220712 • It runs under mc_runjob v05 / McFarm v10.04, but no proper metadata yet. McFarm Improvements; D. Meyer, UTA DØ SAR Workshop, OU
Re-Processing Real DataTo Be Done • mc_runjob v06 must be debugged and released • McFarm must be adapted to v06 and debugged • Metadata must be stabilized and accepted by SAM • Re-processing authority should use MC-like Request_NNNN.py to invoke re-reco McFarm Improvements; D. Meyer, UTA DØ SAR Workshop, OU
Conclusions • McFarm has morphed significantly since its creation to accommodate • Enhanced error handling • Enhanced monitoring • Other improvements • Re-processing capability in the works, despite some worries on schedule and support • IAC’s use and comments prompted McFarm improvements (Thank you everyone!!) • Comments always appreciated McFarm Improvements; D. Meyer, UTA DØ SAR Workshop, OU