DØ MC and Data Processing on the Grid Brad Abbott University of Oklahoma D0SAR Sept 21, 2006
Computing at DØ • Provide the necessary resources for primary processing of data, reprocessing, fixing, skimming, data analysis, MC production, data handling, data verification… • Provide these resources in a timely manner so that researchers can analyze the data efficiently.
Challenges • Collecting data at ~50 events/sec • Processing time is ~70 GHz-sec/event • ~900 CPUs on the DØ farm running 24/7 to keep up with the data (a rough sizing check follows below) • Need millions of Monte Carlo events • Store data to tape and allow easy access (SAM) • Have the ability to reprocess and fix data in a timely manner • Provide computing resources to analyzers
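As a back-of-envelope check of the farm sizing above, the sketch below combines the quoted trigger rate and per-event cost; the 4 GHz-per-CPU figure and the 100% duty cycle are illustrative assumptions, not numbers from the talk.

```python
# Back-of-envelope check of the DØ farm sizing quoted on the slide.
# The 4 GHz-per-CPU speed and the 100% duty cycle are assumptions.

event_rate_hz = 50      # events collected per second (from slide)
cost_ghz_sec = 70       # processing cost per event in GHz-sec (from slide)
cpu_clock_ghz = 4.0     # ASSUMPTION: effective speed of one farm CPU
duty_cycle = 1.0        # ASSUMPTION: farm runs 24/7 with no dead time

required_ghz = event_rate_hz * cost_ghz_sec            # sustained compute needed
cpus_needed = required_ghz / (cpu_clock_ghz * duty_cycle)

print(f"Sustained compute: {required_ghz:.0f} GHz")
print(f"CPUs needed:       {cpus_needed:.0f}")         # ~875, consistent with ~900
```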
Local Facilities • 70 TB of project disk (CluedØ/CAB) • CAB: 2.2 THz of CPU (comparable to the FNAL production farm), 235 TB of SAM cache, more CPU/disk on order • CluedØ: an incredible resource, by the people for the people! 1+ THz, SAM cache of 70 TB (nodes) + 160 TB (servers)
Monday Report, August 14, 2006 (typical week) — what does a typical week of usage look like?
Analysis stations (data analyzed / events / projects):
• clued0: 15.09 TB, 402M events, 646 projects
• fnal-cabsrv2: 115.51 TB, 2685M events, 1611 projects
• fnal-cabsrv1: 85.56 TB, 2358M events, 985 projects
• DØ total: 216.16 TB, 5446M events, 3242 projects
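For scale, a quick reduction of that weekly summary (a sketch; it assumes a 7-day reporting window and that the "T"/"M" figures are terabytes and millions of events):

```python
# Quick reduction of the Monday-report numbers above. Assumes a 7-day window
# and that "T"/"M" mean terabytes / millions of events.

stations = {
    "clued0":       (15.09, 402),
    "fnal-cabsrv2": (115.51, 2685),
    "fnal-cabsrv1": (85.56, 2358),
}

total_tb = sum(tb for tb, _ in stations.values())
total_mevents = sum(ev for _, ev in stations.values())   # matches the DØ total up to rounding

print(f"Total analyzed:  {total_tb:.2f} TB, {total_mevents} M events")
print(f"Average rate:    {total_tb / 7:.1f} TB/day")
print(f"Avg event size:  {total_tb * 1e12 / (total_mevents * 1e6) / 1e3:.0f} kB/event")
```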
Analysis over time • Events consumed by station since “the beginning of SAM time” • Integrates to 300B events consumed • [Plot: events consumed vs. time by station; cabsrv in blue/red, CluedØ in grey]
Current Computing Status • Overall very good. • Reconstruction keeping up with data taking. • Data handling working well • Remote sites for MC, reprocessing, processing, fixing • Significant analysis CPU
Future challenges • Larger data sets • Luminosities > 200E30 • Increased sharing of manpower with the LHC • Reduced manpower for DØ • Tight budgets • Need to use shared resources
Higher-luminosity data take significantly longer to process; computing resources need to deal with this • Previously planned for lower luminosities; now need to plan on luminosities of 400E30
DØ computing model • Distributed computing, moving toward automated use of common tools on the grid • Scalable • Work with the LHC, not against it, for increased resources • Need to conform to standards • DØ is a running experiment and is taking data, so we need to take a prudent approach to computing • SAMGrid
SAMGrid • SAM: data handling • Over 7 PB consumed last year • Up to 1 PB/month • JIM: job submission and monitoring • SAM + JIM = SAMGrid • 20 native execution sites • Automated submission to other grids
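To put the SAM consumption figures in bandwidth terms, a minimal sketch (it assumes decimal prefixes, 1 PB = 10¹⁵ bytes, and perfectly uniform consumption; actual SAM accounting may differ):

```python
# Rough sustained-bandwidth equivalent of the SAM consumption figures above.
# Assumes decimal prefixes and uniform consumption over the period.

SECONDS_PER_MONTH = 30 * 24 * 3600
SECONDS_PER_YEAR = 365 * 24 * 3600

peak_month_pb = 1.0   # up to 1 PB consumed in a month (from slide)
year_pb = 7.0         # over 7 PB consumed in the last year (from slide)

print(f"1 PB/month ~ {peak_month_pb * 1e15 / SECONDS_PER_MONTH / 1e6:.0f} MB/s sustained")
print(f"7 PB/year  ~ {year_pb * 1e15 / SECONDS_PER_YEAR / 1e6:.0f} MB/s sustained")
```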
Progression on Remote Farms • Workload: MC → data reprocessing → processing → skimming* → analysis* • Facilities: dedicated farms → shared farms → OSG/LCG • Automation: expert → regional farmer → any user* (* not yet implemented)
Data Reprocessing on Grid • Reprocessing of data: 1 billion events (250 TB from raw) • SAMGrid as default, using shared resources • 3.5 THz for 6 months, the largest such effort in HEP (see the sketch below) • Refixing: 1.4 B events in 6 weeks • Used SAMGrid, automated use of LCG, OSG • Finished on time. Very successful
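A sanity check of those reprocessing numbers, under the optimistic assumption that the 3.5 THz of shared capacity was fully and continuously used for the whole 6 months:

```python
# Implied average per-event cost of the reprocessing quoted on the slide:
# 1 billion events in ~6 months on ~3.5 THz of shared CPU.
# ASSUMPTION: the capacity was fully and continuously used.

events = 1.0e9
capacity_thz = 3.5
months = 6
seconds = months * 30 * 24 * 3600          # ~1.56e7 s

ghz_seconds_delivered = capacity_thz * 1e3 * seconds
print(f"Implied cost: ~{ghz_seconds_delivered / events:.0f} GHz-sec/event")
# ~54 GHz-sec/event, plausibly below the ~70 GHz-sec quoted for current data,
# since the reprocessed sample was recorded at lower luminosity.
```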
Processing on Grid • Prefer not to do primary processing on the Grid • Can do processing at a few select sites that have been well certified (this has been demonstrated: cable-swap data processed at OU) • Certification of Grid sites is problematic • Dedicated sites avoid worries about fair-share, availability of nodes, etc.
Cable swap data at OU • First time that primary processing was performed at a remote site for DØ • Processed 9463 files • Total of 3421.6 GB • Events: 18,391,876 • Took ~3 months, partly because we only had ~70 of the available 270 CPUs
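For a rough feel of the throughput achieved, a sketch using the numbers above; it assumes the ~70 CPUs ran continuously for the full ~3 months, which overstates the real duty cycle:

```python
# Rough throughput of the OU cable-swap processing exercise.
# ASSUMPTION: ~70 CPUs running continuously for ~3 months.

events = 18_391_876
data_gb = 3421.6
cpus = 70
months = 3
seconds = months * 30 * 24 * 3600

print(f"Average rate:        {events / seconds:.1f} events/sec overall")
print(f"Per-CPU time/event:  {cpus * seconds / events:.0f} CPU-seconds/event")
print(f"Average event size:  {data_gb * 1e9 / events / 1e6:.2f} MB/event")
```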
MC Production resources • All produced offsite • MC requirements are less stringent, i.e. we can always make more • Native SAMGrid producers: CMS-FNAL, GridKa, LTU, LU, MSU, OU (2), SPRACE, TATA, Westgrid, Wuppertal, FZU • Non-SAMGrid: Lyon and NIKHEF • LCG: 21 CEs (10 UK, 6 FR, 3 NL, 1 CZ, 1 DE) • OSG: 8 CEs (UNL, IU, Purdue, SPGRID, OCHEP, TOPDAWG, UWM, CMS-FNAL)
Monte Carlo • More than 250 million events produced • Up to 10 million events/week • LCG and OSG used • 59% via SAMGrid • By region: 80.4% Europe, 15.7% N. America, 3.5% S. America, 0.3% Asia
Current plans • Reprocessing of Run IIb data needed • 300 million events • Takes ~80 GHz-sec/event to process • Expect to need ~2000 CPUs for 4 months to reprocess the data (a capacity check follows below) • Utilize OSG sites much more extensively • SAM v7 (one version of SAM) • Plan on beginning in November
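A quick capacity check of that estimate; the reading of the result as headroom for grid inefficiencies is my interpretation, not a statement from the talk:

```python
# Sanity check of the Run IIb reprocessing estimate on the slide:
# 300 million events at ~80 GHz-sec/event with ~2000 CPUs for 4 months.

events = 300e6
cost_ghz_sec = 80
cpus = 2000
months = 4
cpu_seconds = cpus * months * 30 * 24 * 3600

work_ghz_sec = events * cost_ghz_sec
print(f"Total work:            {work_ghz_sec:.2e} GHz-sec")
print(f"CPU-seconds available: {cpu_seconds:.2e}")
print(f"Implied effective speed per CPU: {work_ghz_sec / cpu_seconds:.1f} GHz")
# ~1.2 GHz effective per CPU, i.e. the request leaves headroom for grid
# inefficiencies and resubmissions on faster real hardware.
```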
Current plans (cont) • Overall priority is to reduce manpower needs for the midterm and long term by ensuring that additional functionality is quickly developed: first in SAMGrid mode, with rapid transfer to automated forwarding nodes • CAB running as part of FermiGrid • Moving full functionality to the forwarding mechanisms • Automated production of MC with OSG • SAM shifters take over responsibility for submitting jobs • Automated submission to use the full power of interoperability/grid resources
Conclusions • DØ computing model very successful • MC and data processing are continuing to move toward using Grid resources • LCG has been used more heavily in the past, but soon OSG will be more heavily utilized • Remote computing is critical for the continued success of DØ