Distributed Computing and Data Analysis for CMS in view of the LHC startup

Peter Kreuzer, RWTH-Aachen IIIa. International Symposium on Grid Computing (ISGC), Taipei, April 9, 2008.

Presentation Transcript


  1. Distributed Computing and Data Analysis for CMS in view of the LHC startup
     Peter Kreuzer, RWTH-Aachen IIIa
     International Symposium on Grid Computing (ISGC), Taipei, April 9, 2008

  2. Outline
     • Brief overview of Worldwide LHC Grid: WLCG
     • Distributed Computing Challenges at CMS
       • Simulation
       • Reconstruction
       • Analysis
     • The physicist view
     • The road to the LHC startup
     Peter Kreuzer - CMS Computing & Analysis

  3. From local to distributed Analysis
     • Over the last 20 years the amount of data and the number of physicists per experiment have grown drastically (figure labels: × 10, × 10)
     • Before: centrally organised analysis
     • Example CMS: 4-6 PBytes of data per year, 2900 scientists, 40 countries, 184 institutes!
     • Solution: "Tiered" Computing Model

  4. Worldwide LHC Computing GRID
     • Level of distribution motivated by the desire to leverage and empower resources + share load, infrastructure and funding
     • Tier-0 at CERN (diagram label: 1.0 GByte/s)
       • Prompt Reconstruction
       • Calibration and low-latency work
       • Archiving
     • Tier-1s at large national labs or universities
       • Re-Reconstruction
       • Physics "skimming"
       • Data Serving
       • Archiving
     • Aggregate rate from CERN to Tier-1s: > 1.0 GByte/s
     • Transfer rate to Tier-2: 50-500 MBytes/s
     • Tier-2s primarily at universities
       • Simulation
       • User Analysis
     • Tier-3s at institutes with modest infrastructure
       • Local User Analysis
       • Opportunistic Simulation
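
The tier roles listed on this slide can be written down as a small lookup structure. The following is only an illustrative sketch: the `Tier` class and the `tiers_for` helper are hypothetical names introduced here, not part of any WLCG or CMS software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One level of the tiered computing model (hypothetical helper type)."""
    name: str
    location: str
    roles: tuple


# Roles copied from the slide; nothing else is assumed.
TIERS = (
    Tier("Tier-0", "CERN",
         ("Prompt Reconstruction", "Calibration and low-latency work", "Archiving")),
    Tier("Tier-1", "large national labs or universities",
         ("Re-Reconstruction", "Physics skimming", "Data Serving", "Archiving")),
    Tier("Tier-2", "primarily universities",
         ("Simulation", "User Analysis")),
    Tier("Tier-3", "institutes with modest infrastructure",
         ("Local User Analysis", "Opportunistic Simulation")),
)


def tiers_for(role):
    """Return the names of all tiers with a role containing `role` (case-insensitive)."""
    wanted = role.lower()
    return [t.name for t in TIERS if any(wanted in r.lower() for r in t.roles)]


print(tiers_for("archiving"))
print(tiers_for("simulation"))
```

Querying by role makes the division of labour explicit: archiving sits only at Tier-0 and Tier-1, while simulation is pushed down to Tier-2 and, opportunistically, Tier-3.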

  5. WLCG Infrastructure
     • EGEE: Enabling Grids for E-sciencE
     • OSG: Open Science Grid
     • Total: 1 Tier-0 + 11 Tier-1 + 67 Tier-2
     • CMS: 1 Tier-0 + 7 Tier-1 + 35 Tier-2
     • Tier-0 to Tier-1: dedicated 10 Gbit/s optical network

  6. Examples of Sites
     • T2 RWTH (Aachen)
       • CPU: 540 kSI2k = 360 cores
       • Disc: 100 TB
       • Network (WAN): 2 Gbit/s
       • (2009: 450 cores & 150 TB)
     • T1 ASGC
       • CPU: 2.4 MSI2k, ~1800 cores
       • Disc: 930 TB → 1.5 PB
       • Tape: 586 TB → 800 TB
       • Network: 10 Gbit/s
     • T2 Taiwan
       • CPU: 150 kSI2k
       • Disc: 19 TB → 62 TB
       • Network: up to 10 Gbit/s

  7. Pledged WLCG Resources
     • [Plots: CPU in MSI2k and Disc Storage in PetaBytes, split into CERN, Tier-1 and Tier-2]
     • CPU: 1 MSI2k = 670 cores; 2008: 66,000 cores; plot label: 250,000 cores
     • Disc storage: 2008: 40 PetaBytes
     • (Tape storage = 33 PBytes in 2008)
     (Reference: LCG Project Planning, 1.3.08)
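
The conversion quoted on this slide (1 MSI2k = 670 cores) can be applied to the core counts it mentions. A minimal sketch; the function names are illustrative, and the conversion factor is the slide's own rule of thumb rather than a fixed constant.

```python
CORES_PER_MSI2K = 670  # from the slide: 1 MSI2k = 670 cores


def cores_to_msi2k(cores):
    """Convert a number of cores to MSI2k using the slide's conversion factor."""
    return cores / CORES_PER_MSI2K


def msi2k_to_cores(msi2k):
    """Convert MSI2k to an approximate number of cores."""
    return msi2k * CORES_PER_MSI2K


for cores in (66_000, 250_000):  # the two core counts quoted on the slide
    print(f"{cores} cores ~ {cores_to_msi2k(cores):.1f} MSI2k")
```

With this factor, the 66,000 cores pledged for 2008 correspond to roughly 98.5 MSI2k.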

  8. Challenges for Experiments: Example CMS
     • Scale-up and test distributed Computing Infrastructure
       • Mass Storage Systems and Computing Elements
       • Data Transfer
       • Calibration and Reconstruction
       • Event "skimming"
       • Simulation
       • Distributed Data Analysis
     • Test CMS Software Analysis Framework
     • Operate in quasi-real data taking conditions and simultaneously at various Tier levels
     → Computing & Software Analysis (CSA) Challenge

  9. CMS Computing and Software Analysis Challenges
     • CMS scaling-up in the last 4 years

       Test (year)   Goal: Jobs/day   Scale
       DC04          15,000           5%
       2005 - 2006: New Data Model and New Software Framework
       CSA06         50,000           25%
       CSA07         100,000          50%
       CSA08         150,000          100%

     • Requires 100s M simulated events input ?

  10. The CSA07 data Challenge
      [Data-flow diagram; recoverable labels:]
      • Input: 100M simulated data; HLT; CASTOR
      • TIER-0: Reconstruction at 100 Hz
      • CAF: Calibration & Express Analysis
      • Label between TIER-0 and the TIER-1s: 300 MB/s
      • TIER-1s: Re-Reconstruction and Skims, 25k jobs/day
      • Labels between the TIER-1s and TIER-2s: 20-200 MB/s and ~10 MB/s
      • TIER-2s: Analysis 75k jobs/day; Simulation 50M evt/month

  11. In this presentation
      • Mainly covering CMS Simulation, Reconstruction and Analysis challenges
      • Data transfer challenges are covered in the talk by Daniele Bonacorsi during this session

  12. CMS Simulation System
      [Diagram; recoverable elements:]
      • CMS Physicist: << Please simulate new physics >>, << Where are my data ? >>
      • ProdRequest, Production Manager, several ProdAgent instances
      • Global Data Bookkeeping (DBS)
      • Tier-1 and Tier-2 sites on the GRID

  13. ProdAgent workflows
      [Diagram; recoverable elements: ProdAgent instances, each with a Local DBS, run Processing and Merging through the Grid WMS on Tier-1 and Tier-2 SEs; PhEDEx]
      • 1) Processing: small output file from Processing job
      • 2) Merging: large output file from Merge job
      • Data processing / bookkeeping / tracking / monitoring in local scope
      • Output promoted to global-scope DBS & data transfer system PhEDEx
      • Scaling achieved by running multiple ProdAgent instances in parallel
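
The two-step workflow on this slide (many small Processing outputs, then Merge jobs producing large files) can be sketched as a simple grouping step. This is not ProdAgent code: `plan_merges`, the file names and the sizes are all hypothetical, and a greedy in-order grouping is assumed purely for illustration.

```python
def plan_merges(small_files, target_size):
    """Greedily group (name, size) pairs, in order, into merge jobs.

    A new group is started whenever adding the next file would push the
    current group above target_size. Sizes are in arbitrary units.
    """
    groups, current, current_size = [], [], 0
    for name, size in small_files:
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical outputs of five Processing jobs (name, size in MB).
outputs = [
    ("proc_1.root", 400),
    ("proc_2.root", 700),
    ("proc_3.root", 500),
    ("proc_4.root", 900),
    ("proc_5.root", 300),
]

for i, group in enumerate(plan_merges(outputs, target_size=1500), start=1):
    print(f"merge job {i}: {group}")
```

Each resulting group would become one Merge job whose single large output is the file that gets registered and transferred.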

  14. CMS Simulation Performance
      • June – November 2007: ~250M events in 5 months (overall 07-08: 450M)
      • Tier-2 alone ~72%
      • OSG alone ~50%
      • 20k jobs/day reached
      • <Job efficiency> ~75%
      • [Plot: production rate in M events / month versus month, annotated "x 1.8"]
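
As a quick cross-check, the average monthly rate implied by "~250M events in 5 months" can be compared with the "Simulation 50M evt/month" label on the CSA07 data-flow slide. A minimal sketch; the variable names are illustrative, and the inputs are the rounded figures quoted in the talk.

```python
EVENTS_PRODUCED = 250e6      # "~250M events in 5 months"
MONTHS = 5
REFERENCE_PER_MONTH = 50e6   # "Simulation 50M evt/month" on the CSA07 data-flow slide


def monthly_rate(events, months):
    """Average number of events produced per month."""
    return events / months


rate = monthly_rate(EVENTS_PRODUCED, MONTHS)
print(f"{rate / 1e6:.0f}M events/month, {rate / REFERENCE_PER_MONTH:.0%} of the quoted rate")
```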

  15. Utilization of CMS Resources (June – November 2007)
      • Average ~50%
      • In best production periods 75%
      • [Plot labels: 5000 job slots; "Missing Requests"]

  16. CSA07 Simulation lessons
      • Major boost in scale and reliability of production machinery
      • Still too many manual operations. From 2008 on:
        • Deploy ProdManager component (in CSA07 this was "human"!)
        • Deploy Resource Monitor
        • Deploy CleanUpSchedule component
      • Further improvements in scale and reliability
        • gLite WMS bulk submission: 20k jobs/day with 1 WMS server
        • Condor-G JobRouter + bulk submission: 100k jobs/day, and can saturate all OSG resources in ~1 hour
        • Threaded JobTracking and Central Job Log Archival
      • Introduced task force for CMS Site Commissioning
        • help detect site issues via stress-test tool (enforce metrics)
        • couple site state to production and analysis machinery
      • Regular CMS Site Availability Monitoring (SAM) checks

  17. CMS Site Availability Monitoring
      • [Plot: Availability Ranking (ARDA "Dashboard"), 03/22/08 to 04/03/08, scale 0% to 100%]
      • Important tool to protect CMS use cases at sites

  18. CSA07 Reconstruction & Skimming
      0) Preparation of "Primary Datasets": mimics real CMS Detector + Trigger data
      1) Archive and Reconstruction at CERN T0
      2) Archive and Re-Reconstruction at T1s
      3) Skimming at T1s
      4) Express analysis & Calibration at CERN Analysis Facility
      → 3 different calibrations: 10 pb⁻¹, 100 pb⁻¹, 0 pb⁻¹

  19. Produced CSA07 Data Volumes
      • [Plot: DIGI-RAW-HLT-RECO events (x1e+8), 10/'07 to 02/'08]
      • Total CSA07 event counts:

          80M  GEN-SIM
          80M  DIGI-RAW
          80M  HLT
         330M  RECO (3 diff. calibrations)
         250M  AOD
         100M  skims
         ----------------
         920M  events

      • Total data volume: ~2 PB
      • Corresponds to expected 2008 volume!
      • CMS data in CASTOR@CERN: 3.7 PB
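
The event-count total on this slide can be verified by summing the per-format numbers. A minimal sketch; the dictionary simply restates the slide's figures in millions of events.

```python
# CSA07 event counts from the slide, in millions of events.
event_counts_millions = {
    "GEN-SIM": 80,
    "DIGI-RAW": 80,
    "HLT": 80,
    "RECO (3 diff. calibrations)": 330,
    "AOD": 250,
    "skims": 100,
}

total_millions = sum(event_counts_millions.values())
print(f"Total: {total_millions}M events")
```

The sum reproduces the 920M events quoted on the slide.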

  20. CSA07 Reconstruction lessons
      • [Plot: T0 and T1 processing, 2k running jobs]
      • T0 reconstruction at 100 Hz only in bursts, mainly due to stream-splitting activity
      • Heavy load on CASTOR
      • Useful feedback to ProdAgent developers to prepare 2008 data taking (repacker, …)
      • T1 processing: submission rate was the main limitation. Now based on gLite bulk submission and reaching 12-14k jobs/day with 1 ProdAgent instance
      • Further rate improvement to be expected with T1 resource up-scaling

  21. CMS Analysis System
      • CRAB = CMS Remote Analysis Builder: an interface to the GRID for CMS physicists
      • Challenge: match processing resources with large quantities of data = "chaotic" processing
      • [Diagram elements: CMS Physicist (<< Please analyse datasets X/Y >>, << Where are my jobs ? >>), CRAB, CRAB Server, Global Data Bookkeeping (DBS), Tier-1 and Tier-2 sites on the GRID]

  22. CRAB Architecture
      • Easy and transparent means for CMS users to submit analysis jobs via the GRID (LCG RB, gLite WMS, Condor-G)
      • CSA07 analysis: direct submission by user to GRID. Simple, but lacking automation and scalability
      • → 2008: CRAB server
      • Other new feature: local DBS for "private" users

  23. CSA07 Analysis
      • 100k jobs/day not achieved
        • mainly due to lack of data during the challenge
        • still limited by data distribution: 55% of jobs at the 3 largest Tier-1s
        • and failure rate too high
      • Job outcomes: 53% successful, 20% failed, 27% unknown
      • Main causes: data access, remote stage-out, manual user settings
      • 20k jobs/day achieved + regularly ~30k/day JobRobot submissions
      • [Plot: number of jobs]

  24. CMS Grid Users over the past year
      • [Plot: distinct users per month; label: CRAB Server]
      • 300 users during February 2008
      • The 20 most active users carry 1/3 of the jobs

  25. The Physicist View
      • SUSY search in di-lepton + jets + MET
      • Goal: simulate excess over Standard Model ("LM1" at 1 fb⁻¹)
      • Infrastructure
        • 1 desktop PC
        • CMS Software Environment ("CMSSW", "CRAB", "Discovery" GUI, …)
        • GRID certificate + member of a Virtual Organisation (CMS)
      • Input data (CSA07 simulation/production), ~1.1 TB
        • Signal (RECO): 120k events = 360 GB
        • Skimmed background (AOD): 3.3 M events = 721 GB
          • WW / WZ / ZZ / single top
          • ttbar / Z / W + jets
        • Unskimmed background: 27 M events = 4 TB (for detailed studies only)
      • Location of input data
        • T0/T1: CERN (CH), FNAL (US), FZK (Germany)
        • T2: Legnaro (Italy), UCSD (US), IFCA (Spain)
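
The sample sizes on this slide imply very different per-event sizes for RECO and AOD data. The sketch below works these out, assuming decimal units (1 GB = 10⁹ bytes, 1 TB = 10¹² bytes); the names are illustrative, and the inputs are the rounded figures from the slide.

```python
GB = 1e9  # assumption: decimal gigabytes

# (number of events, total size in bytes), as quoted on the slide.
samples = {
    "Signal (RECO)": (120_000, 360 * GB),
    "Skimmed background (AOD)": (3_300_000, 721 * GB),
    "Unskimmed background": (27_000_000, 4000 * GB),
}


def bytes_per_event(events, size_bytes):
    """Average size of one event in bytes."""
    return size_bytes / events


for name, (events, size) in samples.items():
    print(f"{name}: {bytes_per_event(events, size) / 1e6:.2f} MB/event")

# Signal plus skimmed background, i.e. the inputs of the main analysis.
main_input_tb = (360 + 721) * GB / 1e12
print(f"Signal + skimmed background: {main_input_tb:.2f} TB")
```

Under these assumptions, signal plus skimmed background add up to about 1.08 TB, consistent with the "~1.1 TB" label on the slide.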

  26. GRID Analysis Result
      • [Plot labels: End-Point Signal; Z peak from SUSY cascades; axis in GeV. Georgia Karapostoli, Athens Univ.]
      • Analysis latency
        • Signal + Bgd = 322 jobs → 22 h to produce this result!
        • Detailed studies = 1300 jobs → ~3.5 days
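
The two latency figures on this slide can be turned into an effective throughput in jobs per hour of elapsed time. A minimal sketch using only the slide's numbers; it says nothing about how many jobs ran concurrently, and `jobs_per_hour` is an illustrative name.

```python
def jobs_per_hour(n_jobs, elapsed_hours):
    """Effective throughput: completed jobs per hour of wall-clock time."""
    return n_jobs / elapsed_hours


main_analysis = jobs_per_hour(322, 22)            # 322 jobs in 22 h
detailed_studies = jobs_per_hour(1300, 3.5 * 24)  # 1300 jobs in ~3.5 days

print(f"Signal + Bgd:     {main_analysis:.1f} jobs/hour")
print(f"Detailed studies: {detailed_studies:.1f} jobs/hour")
```

Both cases come out at roughly 15 jobs per hour, so with these numbers the turnaround scales about linearly with the number of jobs.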

  27. CSA07 Analysis lessons
      • Improve Analysis scalability, automation and reliability
        • CRAB-Server
        • Automate job re-submission
        • Optimize job distribution
        • Decrease failure rate
      • Move Analysis to Tier-2s
        • To protect Tier-0/1 LSF and storage systems
        • To make use of all available GRID resources
      • Encourage Tier-2_to_Physics_group association
        • In close collaboration with sites
        • With solid overall Data Management strategy
        • Assess local-scope DM for Physics groups & storage of user data
      • Aim for 500 users by June and exceed capacity of several gLite WMS
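
The "automate job re-submission" item can be illustrated with a small retry loop. This is a generic sketch, not the CRAB server's implementation: `run_with_resubmission`, `make_flaky_submit` and the `submit` callable are hypothetical stand-ins for whatever actually submits a job to the Grid.

```python
def run_with_resubmission(job_id, submit, max_attempts=3):
    """Call submit(job_id) until it returns True or max_attempts is reached.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if submit(job_id):
            return attempt
    return None


def make_flaky_submit(failures_before_success):
    """Build a fake submit function that fails a fixed number of times first."""
    remaining = {"failures": failures_before_success}

    def submit(job_id):
        if remaining["failures"] > 0:
            remaining["failures"] -= 1
            return False  # simulated failure, e.g. a stage-out error
        return True

    return submit


print(run_with_resubmission("job-1", make_flaky_submit(2)))  # succeeds on a retry
print(run_with_resubmission("job-2", make_flaky_submit(5)))  # gives up
```

Capping the number of attempts keeps a persistently failing job from being resubmitted forever; a real system would presumably also record the failure reason before retrying.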

  28. Goals for CSA08 (May '08)
      • "Play through" first 3 months of data taking
      • Simulation
        • 150M events at 1 pb⁻¹ ("S43")
        • 150M events at 10 pb⁻¹ ("S156")
      • Tier-0: Prompt reconstruction
        • S43 with startup calibration
        • S156 with improved calibration
      • CERN Analysis Facility (CAF)
        • Demonstrate low turn-around Alignment & Calibration workflows
        • Coordinated and time-critical physics analyses
        • Proof-of-principle of CAF Data and Workflow Management Systems
      • Tier-1: Re-Reconstruction with new calibration constants
        • S43: with improved constants based on 1 pb⁻¹
        • S156: with improved constants based on 10 pb⁻¹
      • Tier-2:
        • iCSA08 simulation (GEN-SIM-DIGI-RAW-HLT)
        • repeat CAF-based Physics analyses with Re-Reco data?

  29. 2007 - 2008 schedule
      [Timeline chart with two tracks: "Detector installation, commissioning and operation" and "Preparation of Software, Computing and Physics analysis". Recoverable entries, in the order they appear:]
      • 2007 Physics Analyses results
      • Cooldown of magnet
      • Private global runs (2 days/week) & private mini-daq
      • CCRC'08-1
      • GRUMM
      • CMSSW 1.8.0 sample production
      • CMSSW 2.0 release [production start-up MC samples]
      • Low i test
      • 2 weeks of 2.0 testing
      • Beam-pipe baked-out, pixels installed
      • CR 0T ("CROT"); iCSA08 sample generation
      • iCSA08 / CCRC'08-2
      • CMS closed
      • pre CR 4T
      • CR 4T ("CRAFT")
      • CMSSW 2.1 release [all basic sw components ready for LHC, new T0 prod tools]
      • Initial CMS ready for run
      • fCSA08 or beam!
      • Must keep exercises mostly non-overlapped
      • CCRC = Common-VO Computing Readiness Challenge; CR = Commissioning Run

  30. Where do we stand?
      • WLCG: major up-scaling over the last 2 years!
      • CMS: impressive results and valuable lessons from CSA07
        • Major boost in Simulation
        • Produced ~2 PBytes of data in T0/T1 Reconstruction and Skimming
        • Analysis: number of CMS Grid users ramping up fast!
        • Software: addressed memory footprint and data size issues
      • Further challenges for CMS: scale from 50% to 100%
        • Simultaneous and continuous operations at all Tier levels
        • Analysis distribution and automation
        • Transfer rates (see talk by D. Bonacorsi)
        • Upscale and commission the CERN Analysis Facility (CAF)
      • CSA08, CCRC08, Commissioning Runs
        • Challenging and motivating goals in view of Day-1 LHC!
