
DØ Data Handling Operational Experience



  1. DØ Data Handling Operational Experience: Roadmap of Talk
  GridPP8, Sep 22-23, 2003. Rod Walker, Imperial College London.
  • Computing Architecture
  • Operational Statistics
  • Challenges and Future Plans
  • Regional Analysis Centres
  • Computing activities
  • Summary

  2. DØ computing/data handling/database architecture
  [Architecture diagram; recoverable labels:]
  • Remote sites (all Monte Carlo production): Netherlands (50), France (100), Great Britain (200), Texas (64), Czech Republic (32); a mix of Linux and UNIX hosts (SUN 4500, quad Linux), connected to fnal.gov via STARTAP Chicago.
  • Robotic tape storage: STK 9310 Powderhorn and ADIC AML/2, with Enstore movers.
  • LINUX farm: 300+ dual PIII/IV nodes.
  • Central Analysis Backend (CAB): 160 dual 2 GHz Linux nodes, 35 GB cache each.
  • SGI Origin2000: 128 R12000 processors, 27 TB fibre-channel disks.
  • Server hosts: three DEC4000s (d0ola, b, c), d0dbsrv1, d0lxac1, d0ora1 (keyed a/b/c in the diagram; a: production, c: development); CISCO switches throughout.
  • Experimental hall/office complex: fibre to the experiment, data logger ("RIP"), collector/router, L3 nodes; ClueDØ Linux desktop user cluster, 227 nodes.

  3. SAM Data Management System
  • SAM is Sequential data Access via Meta-data. Est. 1997. http://d0db.fnal.gov/sam
  • Flexible and scalable distributed model
  • Field-hardened code; reliable and fault tolerant
  • Adapters for mass storage systems: Enstore (HPSS and others planned)
  • Adapters for transfer protocols: cp, rcp, scp, encp, bbftp, GridFTP
  • Useful in many cluster computing environments: SMP with compute servers, desktop, private network (PN), NFS shared disk, …
  • Ubiquitous for DØ users
  SAM Station: (1) the collection of SAM servers which manage data delivery and caching for a node or cluster; (2) the node or cluster hardware itself. (A minimal sketch of the caching behaviour follows.)
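To make the station concept concrete, here is a minimal sketch of the cache-then-fetch behaviour a station provides. The names (StationCache, copy_adapter) are hypothetical, not SAM's actual API; the point is only that the station checks its local cache and falls back to a storage/transfer adapter on a miss.

```python
# Minimal illustration of a SAM-station-style cache-then-fetch lookup.
# All names (StationCache, copy_adapter) are invented for this sketch.
import os
import shutil


class StationCache:
    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def path_for(self, filename):
        return os.path.join(self.cache_dir, filename)

    def deliver(self, filename, fetch_via_adapter):
        """Return a local path for `filename`, fetching it on a cache miss."""
        local = self.path_for(filename)
        if os.path.exists(local):
            return local                      # cache hit: no transfer needed
        tmp = local + ".part"
        fetch_via_adapter(filename, tmp)      # e.g. an encp/bbftp/GridFTP adapter
        shutil.move(tmp, local)               # rename the finished transfer into the cache
        return local


def copy_adapter(filename, dest):
    """Stand-in adapter: 'fetch' from a local source directory with cp semantics."""
    shutil.copy(os.path.join("/tmp/mss", filename), dest)


if __name__ == "__main__":
    station = StationCache("/tmp/sam_cache")
    # print(station.deliver("raw_000123.dat", copy_adapter))
```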

  4. Overview of DØ Data Handling
  [Map of sites; legend: Regional Center, Analysis site.]
  [Plots: Integrated Files Consumed vs Month (DØ) and Integrated GB Consumed vs Month (DØ), Mar 2002 to Mar 2003.]
  Summary of DØ Data Handling: 4.0 M files consumed; 1.2 PB consumed.

  5. Data In and Out of Enstore (robotic tape storage)
  [Daily transfer plot, Aug 16 to Sep 20; annotations: 1 TB incoming, 5 TB outgoing, shutdown starts.]

  6. Consumption
  180 TB consumed per month; 1.5 PB consumed in one year.
  • Applications "consume" data.
  • In the DH system, consumers can be hungry or satisfied.
  • Allowing for the consumption rate, the next course is delivered before it is asked for (prefetching; see the sketch below).
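A rough sketch of the "next course delivered before asking" idea, with invented names and a simple moving-average estimate of the consumption rate; this only illustrates rate-aware prefetching, not the DH system's actual logic.

```python
# Sketch of rate-based prefetching: start fetching the next file early enough
# that a hungry consumer never has to wait. Names are illustrative only.
import time


class Prefetcher:
    def __init__(self, fetch_seconds_estimate):
        self.fetch_seconds_estimate = fetch_seconds_estimate
        self.last_request_time = None
        self.avg_gap = None  # running estimate of seconds between file requests

    def note_request(self):
        """Call each time the consumer asks for the next file."""
        now = time.time()
        if self.last_request_time is not None:
            gap = now - self.last_request_time
            # exponential moving average of the request interval
            self.avg_gap = gap if self.avg_gap is None else 0.7 * self.avg_gap + 0.3 * gap
        self.last_request_time = now

    def should_prefetch(self, buffered_files):
        """Prefetch when buffered work would run out before a new fetch could land."""
        if self.avg_gap is None:
            return buffered_files == 0
        time_until_empty = buffered_files * self.avg_gap
        return time_until_empty <= self.fetch_seconds_estimate
```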

  7. Challenges
  Getting SAM to meet the needs of DØ in its many configurations is, and has been, an enormous challenge. Some examples:
  • File corruption issues: solved with CRC checks (see the sketch below).
  • Preemptive distributed caching is prone to race conditions and log jams (gridlock). These have been solved.
  • Private networks sometimes require "border" services. This is understood.
  • NFS shared-cache configuration provides additional simplicity and generality, at the price of scalability (star configuration). This works.
  • Global routing completed.
  • Installation procedures for the station servers have been quite complex. They are improving, and we plan to soon have "push button" and even "opportunistic deployment" installs.
  • Lots of details with opening ports on firewalls, OS configurations, registration of new hardware, and so on.
  • Username clashing issues: moving to GSI and Grid certificates.
  • Interoperability with many MSS.
  • Network-attached files: sometimes the file does not need to move to the user.
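For the file-corruption bullet, the fix amounts to recording a checksum when the file is stored and re-verifying it after every transfer. A generic sketch follows, with zlib.crc32 standing in for whatever checksum the data handling system actually records.

```python
# Generic CRC verification after a transfer; zlib.crc32 is a stand-in
# for the checksum stored in the file's metadata.
import zlib


def crc32_of(path, chunk_size=1 << 20):
    """Compute a CRC32 over the file in 1 MB chunks (files can be GB-sized)."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def verify_transfer(path, expected_crc):
    """Reject (and re-queue) a delivered file whose checksum does not match."""
    actual = crc32_of(path)
    if actual != expected_crc:
        raise IOError(f"CRC mismatch for {path}: got {actual:#010x}, expected {expected_crc:#010x}")
```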

  8. RAC: Why Regions are Important
  • Opportunistic use of ALL computing resources within the region
  • Management of resources within the region
  • Coordination of all processing efforts is easier
  • Security issues within the region are similar: CAs, policies, …
  • Increases the technical support base
  • Speak the same language
  • Share the same time zone
  • Frequent face-to-face meetings among players within the region
  • Physics collaboration at a regional level contributes to results at the global level
  • A little spirited competition among regions is good

  9. Summary of Current & Soon-to-be RACs
  [Table of current and soon-to-be Regional Analysis Centres; not reproduced in the transcript.]
  *Numbers in parentheses represent totals for the center or region; other numbers are DØ's current allocation.

  10. Global File Routing
  [Diagram: FNAL MSS (25 TB) routing files to the UK RAC at RAL (3.6 TB), which in turn serves Manchester, Lancaster, LeSC and Imperial (CMS).]
  • FNAL throttles transfers
  • Direct access unnecessary
  • Firewalls, policies, …
  • Configurable, with fail-overs (sketched below)
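A minimal sketch of configurable routing with fail-over, using an invented route table and a stand-in transfer function; it only illustrates ordered upstream sources with fallback, not the actual SAM station configuration.

```python
# Illustrative file routing with fail-over: try the preferred upstream source
# first, then fall back. The route table and transfer() are stand-ins only.
ROUTES = {
    # site          ordered list of upstream sources to pull files from
    "manchester": ["ral", "fnal"],
    "lancaster":  ["ral", "fnal"],
    "imperial":   ["ral", "fnal"],
    "ral":        ["fnal"],
}


def transfer(source, filename):
    """Stand-in for a real bbftp/GridFTP transfer; raises IOError on failure."""
    print(f"transferring {filename} from {source}")


def fetch(site, filename):
    """Try each configured upstream source in order until one succeeds."""
    errors = []
    for source in ROUTES.get(site, ["fnal"]):
        try:
            transfer(source, filename)
            return source
        except IOError as err:
            errors.append(f"{source}: {err}")
    raise IOError(f"all routes failed for {filename}: {'; '.join(errors)}")
```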

  11. From RACs to Riches: Summary and Future
  • We feel that the RAC approach is important to use remote resources more effectively.
  • Management and organization in each region is as important as the hardware.
  • However…
  • Physics group collaboration will transcend regional boundaries.
  • Resources within each region will be used by the experiment at large (Grid computing model).
  • Our models of usage will be revisited frequently. Experience already indicates that the use of thumbnails differs from that of our RAC model (skims).
  • No RAC will be completely formed at birth.
  • There are many challenges ahead. We are still learning…

  12. Stay Tuned for SAM-Grid: The best is yet to come…

  13. CPU-intensive activities
  • Primary reconstruction: on-site, with local help to keep up.
  • MC production: anywhere; no input data needed.
  • Re-reconstruction (reprocessing): must be fast to be useful; use all resources.
  • Thumbnail skims: 1 per physics group; the common skim is the OR of the group skims. If the triggers are good you end up with all events, which defeats the object, i.e. small datasets (see the toy sketch below).
  • User analysis: not a priority (CAB can satisfy demand).
  (First on SAMGrid.)
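To illustrate why the common skim "ends up with all events", here is a toy sketch with invented event fields and per-group predicates; OR-ing even a few loose selections keeps nearly every event, which is exactly the "defeats the object" problem.

```python
# Illustration of why OR-ing per-group skim selections tends toward the full
# dataset. The event fields and predicates are invented for the example.
import random

GROUP_SKIMS = {
    "top":      lambda ev: ev["n_jets"] >= 4,
    "higgs":    lambda ev: ev["n_leptons"] >= 2,
    "qcd":      lambda ev: ev["leading_jet_pt"] > 40.0,
    "new_phen": lambda ev: ev["missing_et"] > 30.0,
}


def common_skim(event):
    """The common skim keeps an event if ANY group's skim would keep it."""
    return any(select(event) for select in GROUP_SKIMS.values())


def toy_event(rng):
    return {
        "n_jets": rng.randint(0, 6),
        "n_leptons": rng.randint(0, 3),
        "leading_jet_pt": rng.uniform(0, 100),
        "missing_et": rng.uniform(0, 80),
    }


if __name__ == "__main__":
    rng = random.Random(42)
    events = [toy_event(rng) for _ in range(100_000)]
    kept = sum(common_skim(ev) for ev in events)
    print(f"common skim keeps {kept / len(events):.0%} of events")  # well over 90%
```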

  14. Current Reprocessing of DØ Run II
  • Why now, and why fast?
  • Improved tracking for the Spring conferences.
  • Tevatron shutdown: the reconstruction farm can be included.
  • Reprocess all Run II data: 40 TB of DST data, 40k files (the basic unit of data handling), 80 million events.
  • How?
  • Many sites in the US and Europe, including the UK RAC.
  • qsub initially, but the UK will lead the move to SAMGrid.
  • NIKHEF (LCG).
  • Will gather statistics and report.

  15. Runjob and SAMGrid
  • Runjob workflow manager
  • Maintained by Lancaster. Mainstay of DØ MC production.
  • No difference between MC production and data (re)processing.
  • SAMGrid integration
  • Was done for GT 2.0, e.g. Tier-1A via an EDG 1.4 CE.
  • Job bomb: 1 grid job maps to many local batch-system jobs, i.e. the job has structure (see the sketch after this slide).
  • Options: request a 2.0 gatekeeper (0 months), write custom Perl jobmanagers (2 months), or use DAGMan to absorb the structure (3 months).
  • Pressure to use grid-submit: want 2.0 for now.
  • 4 UK sites, 0.5 FTEs: need to use SAMGrid.
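A minimal sketch of the "job bomb" fan-out, with invented command-line options and a print in place of a real batch submission; it only illustrates a single incoming grid job expanding into one local batch job per input file.

```python
# Illustration of one grid job fanning out into many local batch jobs,
# one per input file. submit_local() stands in for a real qsub/batch call,
# and the d0reco options shown are invented for this sketch.
def submit_local(command):
    """Stand-in for a local batch submission (e.g. writing a job script and calling qsub)."""
    print(f"would submit: {command}")


def expand_grid_job(dataset_files, executable="d0reco"):
    """Expand a single incoming grid job into one batch job per input file."""
    for input_file in dataset_files:
        submit_local(f"{executable} --input {input_file} --output {input_file}.reco")


if __name__ == "__main__":
    expand_grid_job([f"dst_{n:06d}.raw" for n in range(5)])
```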

  16. Conclusions
  • SAM enables PB-scale HEP computing today.
  • Details are important in a production system: PNs, NFS, scaling, cache management (free space = zero, always), gridlock, …
  • Official & semi-official tasks dominate the CPU requirements: reconstruction, reprocessing, MC production, skims. By definition these are structured and repeatable, which is good for the grid.
  • User analysis runs locally (it still needs DH), or centrally. (Still a project goal, just not mine.)
  • SAM experience is valuable; see the report on reprocessing. Have LCG seen how good it is?
