Data reprocessing for DZero on the SAM-Grid Gabriele Garzoglio for the SAM-Grid Team Fermilab, Computing Division
Overview • The DZero experiment at Fermilab • Data reprocessing • Motivation • The computing challenges • Current deployment • The SAM-Grid system • Condor-G and Global Job Management • Local Job Management • Getting more resources: submitting to LCG
Fermilab and DZero
Data size for the D0 Experiment • Detector data • 1,000,000 channels • Event size: 250 kB • Event rate: ~50 Hz • 0.75 PB of data since 1998 • 0.5 PB of that in the past year alone • Expect 10 – 20 PB overall • This means • Moving 10s of TB / day • Processing PBs / year • 25% – 50% remote computing
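As a rough cross-check (my arithmetic, not stated on the slide), the raw detector rate alone accounts for about a terabyte per day:

```latex
250~\text{kB/event} \times 50~\text{Hz} = 12.5~\text{MB/s} \approx 1.1~\text{TB/day}
```

so the "10s of TB / day" being moved presumably also counts derived data and replication to the remote sites.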
Overview • The DZero experiment at Fermilab • Data reprocessing • Motivation • The computing challenges • Current deployment • The SAM-Grid system • Condor-G and Global Job Management • Local Job Management • Getting more resources: submitting to LCG
Motivation for the Reprocessing • Processing: changing the data format from something close to the detector to something close to the physics • As the understanding of the detector improves, the processing algorithms change • Sometimes it is worthwhile to "reprocess" all the data to obtain "better" analysis results • Our understanding of the DZero calorimeter calibration is now based on reality rather than design/plans: we want to reprocess
The computing task • Events: 1 billion • Input: 250 TB (250 kB/event) • Output: 70 TB (70 kB/event) • Time: 50 s/event ⇒ 20,000 CPU-months • Ideally 3,400 CPUs (1 GHz PIII) for 6 months (~2 days/file) • Remote processing: 100% • A stack of CDs as tall as the Eiffel Tower
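The CPU estimate follows directly from the slide's own figures (a worked check):

```latex
10^9~\text{events} \times 50~\text{s/event} = 5\times10^{10}~\text{CPU-s}
  \approx 20{,}000~\text{CPU-months}, \qquad
\frac{20{,}000~\text{CPU-months}}{6~\text{months}} \approx 3{,}400~\text{CPUs (1 GHz PIII)}
```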
Data processing model (diagram): an input dataset of n files (n ~ 100, the files produced in one day) is split into n batch jobs (Job 1 … Job n) distributed over m sites (Site 1 … Site m), with n batch processes per site; each job's output (Out 1 … Out n) is stored locally at the site, then merged (at any site) and written to permanent storage. A minimal sketch of this model is shown below.
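The sketch below illustrates the model in Python (the names are invented for illustration; this is not SAM-Grid code): one batch job per input file, jobs spread over the available sites, outputs merged afterwards.

```python
# One batch job per input file, distributed round-robin over the sites;
# each output is stored locally, then merged into permanent storage.

def plan_jobs(input_files, sites):
    """Assign one batch job per input file, round-robin across sites."""
    return [
        {"job_id": i, "input": f, "site": sites[i % len(sites)]}
        for i, f in enumerate(input_files)
    ]

def merge(outputs):
    """Stand-in for the merging step that feeds permanent storage."""
    return {"merged_jobs": sorted(o["job_id"] for o in outputs)}

if __name__ == "__main__":
    files = [f"raw_{i:03d}.evt" for i in range(100)]     # n ~ 100 files = one day of data
    jobs = plan_jobs(files, ["Site 1", "Site 2", "Site 3"])
    outputs = [{"job_id": j["job_id"]} for j in jobs]    # pretend every job succeeded
    print(len(jobs), "jobs;", merge(outputs)["merged_jobs"][:5], "...")
```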
Challenges: Overall scale • A dozen computing clusters in the US and EU • common meta-computing framework: SAM-Grid • administrative independence • Need to submit 1,700 batch jobs / day to meet the deadline (without counting failures) • Each site needs to be kept full at all times: locally, scale up to 1,000 batch nodes • Time to completion of the unit of bookkeeping (~100 files): if it is too long (days), things are bound to fail • Handle 250+ TB of data
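The 1,700 jobs/day target is consistent with the earlier capacity estimate (my reading of the slides, not stated explicitly): with roughly 3,400 CPUs and about 2 days per file on a 1 GHz node, keeping every slot busy requires

```latex
\frac{3{,}400~\text{CPUs}}{2~\text{days/job}} \approx 1{,}700~\text{job submissions/day}
```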
Challenges: Error Handling / Recovery • Design for random failures • unrecoverable application errors, network outages, file delivery failures, batch system crashes and hangs, worker-node crashes, filesystem corruption... • Book-keeping of succeeded jobs/files: needed to assure completion without duplicated events • Book-keeping of failed jobs/files: needed for recovery AND to trace problems, in order to fix bugs and assure efficiency • Simple error recovery to foster smooth operations
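The book-keeping requirement boils down to being able to compute a recovery dataset exactly, so a recovery pass resubmits only what is missing and never duplicates events. A toy sketch of that set logic (illustrative only; the real book-keeping lives in the SAM database):

```python
# Recovery set = files that never ran plus files that ran and failed,
# excluding anything that already succeeded (no duplicated events).

def recovery_dataset(all_files, succeeded, failed):
    """Files that still need (re)processing: never ran, or ran and failed."""
    done = set(succeeded)
    return sorted((set(all_files) - done) | (set(failed) - done))

print(recovery_dataset(
    all_files=["f1", "f2", "f3", "f4"],
    succeeded=["f1", "f3"],
    failed=["f2"],            # f2 failed, f4 never ran: both belong in the recovery set
))
# -> ['f2', 'f4']
```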
Available Resources

Site        CPUs (1 GHz PIII equiv.)   Status
FNAL Farm   1000                       used for data-taking
Westgrid    600                        ready
Lyon        400                        ready
SAR (UTA)   230                        ready
Wisconsin   30                         ready
GridKa      500                        ready
Prague      200                        ready
CMS/OSG     100                        under test
UK          750                        4 sites being deployed
Total       ~2800 CPUs (1 GHz PIII equivalent)
Overview • The DZero experiment at Fermilab • Data reprocessing • Motivation • The computing challenges • Current deployment • The SAM-Grid system • Condor-G and Global Job Management • Local Job Management • Getting more resources: submitting to LCG
The SAM-Grid • SAM-Grid is an integrated job, data, and information management system • Grid-level job management is based on Condor-G and Globus • Data handling and book-keeping are based on SAM (Sequential Access via Metadata): transparent data transport, processing history, and book-keeping • …plus a lot of work to achieve scalability at the execution clusters
SAM-Grid Diagram: [architecture diagram showing the flows of data, meta-data, and jobs between the grid client (user interface, submission, user tools), the global services (global job queue, match making / resource selector, info gatherer and collector, resource optimizer, SAM naming server, SAM log server, SAM DB server, RC metadata catalog, MDS, web server), and each execution site, whose grid gateway (grid/fabric interface, JIM advertise, XML DB server, bookkeeping service, grid monitoring, info manager and providers, site configuration, global/local JID map, AAA) fronts the local job handling, the SAM station and stagers, the site cache and MSS, the worker nodes, and a distributed file system]
Job Management Diagram: [diagram showing a job flowing from the user interface through the submission service to the match making service, which performs resource selection with an external algorithm using data from the information collector; the chosen execution site (site #1 … #n) receives the job through its grid/fabric interface, behind which sit the computing element, grid sensors, and other generic services]
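To make the resource-selection box concrete, here is a toy ranking step in the spirit of the diagram (the real system uses Condor-G match making with an external ranking algorithm, as the diagram indicates; the scoring below is an invented illustration, not the actual algorithm):

```python
# Toy resource selection: prefer sites that already cache the job's input
# dataset, then break ties by the number of free CPU slots.

def rank_site(site, job):
    data_bonus = 1000 if job["dataset"] in site["cached_datasets"] else 0
    return data_bonus + site["free_cpus"]

def select_site(sites, job):
    return max(sites, key=lambda s: rank_site(s, job))

sites = [
    {"name": "Lyon",     "free_cpus": 120, "cached_datasets": {"reco_p17"}},
    {"name": "Westgrid", "free_cpus": 400, "cached_datasets": set()},
]
job = {"dataset": "reco_p17"}
print(select_site(sites, job)["name"])   # -> Lyon: data locality outweighs free CPUs
```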
Fabric-Level Job Management (components: grid/fabric interface, sandbox facility, batch-system adapter, XML monitoring database, SAM station, batch system, worker nodes) • Job enters the site through the grid/fabric interface • Local sandbox created for the job (user input, configuration, SAM client, GridFTP client, user credentials) • Local services notified of the job • Batch-job submission details requested • Job submitted to the batch system • Push of monitoring information starts • Job starts on a worker node • Job fetches the sandbox • Job gets dependent products and input data • Framework passes control to the application • Job stores the output from the application • Stdout, stderr, and logs handed back to the Grid • Throughout, the Grid monitors the status of the job and the user can request it
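A schematic, runnable walk-through of these steps, assuming invented function names as stand-ins for the slide's components (sandbox facility, batch-system adapter, SAM station, XML monitoring database):

```python
def create_sandbox(job):
    # user input, configuration, SAM client, GridFTP client, user credentials
    return {"job": job, "contents": ["config", "sam_client", "gridftp_client", "credentials"]}

def submit_to_batch(job, sandbox):
    print(f"[adapter]  submitting {job} with sandbox {sandbox['contents']}")
    return f"batch-id-for-{job}"

def worker_node_run(job, sandbox):
    print(f"[worker]   fetching sandbox {sandbox['contents']}")
    print(f"[worker]   staging dependent products and input data via SAM for {job}")
    print("[worker]   framework passes control to the application; output stored locally")

def run_job_at_site(job):
    sandbox = create_sandbox(job)                  # local sandbox created
    print(f"[gateway]  local services notified of {job}")
    batch_id = submit_to_batch(job, sandbox)       # via the batch-system adapter
    print(f"[gateway]  monitoring push started for {batch_id}")
    worker_node_run(job, sandbox)                  # steps executed on the worker node
    print(f"[gateway]  stdout, stderr and logs handed back to the Grid for {batch_id}")

run_job_at_site("reco_job_001")
```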
How do we get more resources? • We are working on forwarding jobs to the LCG Grid • A "forwarding node" is the advertised gateway to LCG • LCG becomes yet another batch system… well, not quite a batch system • Need to get rid of the assumptions on the locality of the network (diagram: SAM-Grid → Fwd-node / VO-srv → LCG)
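A sketch of the "LCG as yet another batch system" idea: the forwarding node can expose the same adapter interface used for local batch systems, but its implementation must not assume the worker nodes share a local network or filesystem with the gateway. Class and method names below are illustrative, not the actual SAM-Grid interfaces.

```python
from abc import ABC, abstractmethod

class BatchAdapter(ABC):
    """Common interface the grid gateway uses for any 'batch system'."""

    @abstractmethod
    def submit(self, job_description: dict) -> str:
        ...

    @abstractmethod
    def status(self, job_id: str) -> str:
        ...

class LocalClusterAdapter(BatchAdapter):
    def submit(self, job_description):
        # A local cluster can assume a shared filesystem for the sandbox.
        return "local-0001"

    def status(self, job_id):
        return "running"

class LCGForwardingAdapter(BatchAdapter):
    def submit(self, job_description):
        # A forwarded job cannot assume network locality: inputs must travel
        # with the job or be pulled remotely by the LCG worker node.
        job_description = dict(job_description, stage_in="remote")
        return "lcg-0001"

    def status(self, job_id):
        return "scheduled"

for adapter in (LocalClusterAdapter(), LCGForwardingAdapter()):
    print(type(adapter).__name__, adapter.submit({"executable": "d0reco"}))
```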
Conclusions • DZero needs to reprocess 250 TB of data in the next 6-8 months • It will produce 70 TB of output, processing the data at a dozen computing centers on ~3,000 CPUs • The SAM-Grid system provides the meta-computing infrastructure to handle data, jobs, and information
More info at… • http://www-d0.fnal.gov/computing/reprocessing/ • http://www-d0.fnal.gov/computing/grid/ • http://samgrid.fnal.gov:8080/ • http://d0db.fnal.gov/sam/