T1 visit to IN2P3 Computing
Topics here:
• Resources
• User Support (-> questionnaire)
• CSA lessons (briefly)
• PADA
• CCRC’08 = CSA08
Matthias Kasemann, November 2007
CMS Computing Organization Chart (organization chart flattened in extraction; recoverable roles and names)
• Offline / Computing: Matthias Kasemann, Patricia McBride
• Computing Resource Board: chair Dave Newbold
• Common Coordination: Integration/CSA07: Ian Fisk / Neil Geddes; Resource Coordination: Lucas Taylor
• Facilities / Infrastructure Operations: Daniele Bonacorsi / Peter Kreuzer (2nd convener identified, awaiting CMS approval)
• Computing Commissioning: Stefano Belforte / Frank Wuerthwein
• Data Operations: Christoph Paus / interim Lothar Bauerdick (looking for a person at FNAL)
• User Support: Kati Lassila-Perini / Akram Khan
CMS Computing Resource Requirements
Resource planning (2008 and beyond):
• Resources (CPU, disk, tape) need to be adjusted where possible to match the CMS requirements.
• Adjustments seem feasible, but details have to be optimized and negotiated.
• Currently: 30% deficit in tape resources for 2008.
• Resource estimate recently updated based on Data Model and Software Performance.
[Table: promised resources for 2008]
CMS Computing Resources (2008 pledged)
• CMS needs all the T1 and T2 resources for successful data analysis.
• Total T2 is: 18500 MSI2k, 4700 TB
  • T2(F): about 4% of CPU and disk
  • T2(Be): about 6% of CPU and disk
  • T2(China): about 3% of CPU (4% of disk)
• For the CMS planning we work with resource numbers (pledges) from the WLCG MoU.
• CMS recently increased the estimates for the storage required (disk and tape).
• CMS is short of resources at T1 centres, especially for storage; this risks impacting performance significantly.
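As a rough illustration of how the fractional pledges translate into absolute numbers, the sketch below applies the percentages quoted on this slide to the total T2 figures (units as quoted; the per-site values are back-of-the-envelope estimates, not official pledges):

```python
# Back-of-the-envelope per-site shares of the 2008 T2 totals quoted on the slide.
# Illustrative only; not official pledge numbers.
total_t2_cpu = 18500     # CPU total, in the SI2k units quoted on the slide
total_t2_disk_tb = 4700  # disk total in TB

shares = {
    "T2(F)": (0.04, 0.04),      # ~4% CPU, ~4% disk
    "T2(Be)": (0.06, 0.06),     # ~6% CPU, ~6% disk
    "T2(China)": (0.03, 0.04),  # ~3% CPU, ~4% disk
}

for site, (cpu_frac, disk_frac) in shares.items():
    print(f"{site}: ~{cpu_frac * total_t2_cpu:.0f} CPU units, "
          f"~{disk_frac * total_t2_disk_tb:.0f} TB disk")
```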
Tier-1 Resources Outlook
• Main shortfall will be in disk storage (even in 2008).
• Have to search for flexibility in the model here.
• Substantial increase required for 2010/2011 (with high-luminosity LHC running).
[Chart: Tier-1 resources required by CMS vs. pledged for CMS]
Tier-2 Resources Outlook
• Big increase required for high-luminosity analysis starting in 2010.
• 2010 numbers not final yet.
• Some T2 pledges known to be missing or still changing.
[Chart: Tier-2 resources required by CMS vs. pledged for CMS]
MoAs for Computing and Offline - status
• Detailed project plan for MoAs completed July 2007
• Breakdown of all tasks to Level 4
• Resource-loaded with 165 named people and their FTE fractions
(Lucas Taylor, CMS-FB, 19 Sep 07)
User Support (2)
• Kati developed a short questionnaire and asks each T1 centre to fill it in; see: http://kati.web.cern.ch/kati/t1_quest.html
• All T1 centres are asked to fill this out to give an overview of the user-support situation at the remote centres.
CSA07 Goals
• Test and validate the components of the CMS Computing Model in a simultaneous exercise:
  • the Tier-0, Tier-1 and Tier-2 workflows
• Test the CMS software, particularly the reconstruction and HLT packages.
• Test the CMS production systems at 50% of the expected 2008 operation scale:
  • workflow management, data management, facilities, transfers
• Test the computing facilities and mass storage systems.
• Demonstrate that data will transfer between production and analysis sites in a timely way.
• Test the Alignment and Calibration stream (AlCaReco).
• Produce, deliver and store AODs + skims for analysis by physics groups.
CSA07 Workflows (data-flow diagram flattened in extraction; recoverable content)
• Tier-0: HLT output, prompt reconstruction, CASTOR; CAF for calibration and express-stream analysis
• Tier-0 -> Tier-1: 300 MB/s; re-reconstruction and skims at the Tier-1 centres
• Tier-1 -> Tier-2: 20-200 MB/s; Tier-2 -> Tier-1: ~10 MB/s
• Tier-2: simulation and analysis
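To get a feel for these rates, the quoted figures can be converted into daily volumes; a minimal arithmetic sketch (my own conversion, the slide itself only quotes rates):

```python
# Convert the sustained transfer rates quoted in the CSA07 workflow diagram
# into approximate daily volumes. Simple illustrative arithmetic only.
SECONDS_PER_DAY = 86_400

rates_mb_per_s = {
    "Tier-0 -> Tier-1 (aggregate)": 300,
    "Tier-1 -> Tier-2 (upper end)": 200,
    "Tier-2 -> Tier-1": 10,
}

for link, rate in rates_mb_per_s.items():
    tb_per_day = rate * SECONDS_PER_DAY / 1e6  # MB/s * s -> MB, then MB -> TB
    print(f"{link}: {rate} MB/s ≈ {tb_per_day:.1f} TB/day")
```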
Preparing for CSA07 (Jul-Sep)
• CMSSW software releases organized by the offline team:
  • Releases are tested by the data operations teams.
  • Distributed and installed at the sites (this is not an easy process).
• Steps for preparing data for physics (pre-challenge workflows):
  • Generation and simulation with Geant4 (at the Tier-2 centres)
  • Digitization
  • Digi2RAW: format change to look like data input to the HLT
  • HLT processing
  • Data are split into 7 Primary Data Sets (PDS) based on the HLT information.
• This was a big addition in CSA07: the data samples more accurately reflect what will come from the detector, but are harder to produce.
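To make the chain concrete, here is a minimal sketch of the pre-challenge production steps as a simple pipeline; the step names come from the slide, but the function bodies and the splitting rule are placeholders rather than CMS production code:

```python
# Sketch of the CSA07 pre-challenge production chain. Step names follow the
# slide; the bodies are placeholders, not the real CMS production system.

def generate_and_simulate(n_events):          # generation + Geant4 simulation (Tier-2)
    return [{"id": i, "stage": "SIM"} for i in range(n_events)]

def digitize(events):                          # digitization
    return [dict(ev, stage="DIGI") for ev in events]

def digi2raw(events):                          # format change to look like detector data
    return [dict(ev, stage="RAW") for ev in events]

def run_hlt(events):                           # HLT processing attaches trigger information
    return [dict(ev, stage="HLT", trigger_bit=ev["id"] % 7) for ev in events]

def split_primary_datasets(events, n_pds=7):   # split into 7 Primary Data Sets on HLT info
    pds = {i: [] for i in range(n_pds)}
    for ev in events:
        pds[ev["trigger_bit"]].append(ev)
    return pds

pds = split_primary_datasets(run_hlt(digi2raw(digitize(generate_and_simulate(70)))))
print({pd: len(events) for pd, events in pds.items()})
```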
Preparing for CSA07 (Jul-Sep)
• Planned workflows for the challenge:
  • Reconstruction: HLT + RECO output (~1 MB)
  • AOD production (~200 kB)
  • Skims for physics analysis at the Tier-1 centres
  • Re-reconstruction (and redoing AOD production/skims) at the Tier-1 centres
  • Analysis at the Tier-2 centres
• Lesson from the CSA07 preparations: there was insufficient time for testing the components, since some of them arrived at the last moment.
• CSA08: we have to devote more time to testing the components.
MC Production Summary
• Substantially more resources used (…)
[Production summary table from the slide not recovered]
CSA07 Issues and Lessons
• There are clearly areas that are going to need development.
• Need to work on the CMSSW application:
  • Reduce the number of workflows (in releases 1_7_0, 1_8_0 and 2_0_0).
  • Reduce the memory footprint to increase the number of events we can run and the available resources.
  • Goal: CMSSW applications should stay within 1 GB of memory.
• Several areas should be improved:
  • Access and manipulation of IOV constants (over Xmas)
  • HLT data model (ongoing)
  • Large new increase in memory seen in 1_7_0, to be addressed immediately (mainly in DPG code)
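The 1 GB target can be checked with a simple wrapper around a job; the sketch below is a generic illustration using Python's resource module, not part of the CMSSW tooling:

```python
# Generic illustration of checking a job's peak memory against a ~1 GB budget.
# Not CMSSW tooling; it only shows the kind of check the memory goal implies.
import resource
import sys

MEMORY_BUDGET_KB = 1_000_000  # ~1 GB expressed in kB

def peak_rss_kb():
    # ru_maxrss is reported in kB on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def run_job(process_event, events):
    for ev in events:
        process_event(ev)
    peak = peak_rss_kb()
    if peak > MEMORY_BUDGET_KB:
        print(f"WARNING: peak RSS {peak} kB exceeds budget {MEMORY_BUDGET_KB} kB")
        sys.exit(1)
    print(f"OK: peak RSS {peak} kB within budget")

run_job(lambda ev: None, range(1000))
```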
CSA07 Issues and Lessons
• Increase the speed of I/O on mass storage:
  • test using the new ROOT version
• Improve our testing and validation procedures for the applications and workflows.
• Reduce event sizes:
  • RAW/DIGI and RECO size
  • AOD size
• Mini-workshop with the physics and DPG groups on February 5-7 (CERN):
  • Two task forces have been created in order to prepare this workshop:
    • RECO Task Force (chair: Shahram Rahatlou)
    • Analysis Task Force (chair: Roberto Tenchini)
• Framework support for handling of RAW and RECO versus FEVT (foreseen for version 2_0_0).
CSA07 Issues and Lessons
• Need to work on the CMS tools:
  • Augment the production tools to better handle continuous operations:
    • roll back to known good points
    • modify workflows more simply
  • Increase the speed of the bookkeeping system under specific load conditions.
  • Optimize the data transfers in PhEDEx for data availability.
  • Improve the analysis tool (CRAB).
• Planning a workshop on January 21-25, 2008 (Lyon?); will be announced soon. Goals:
  • Review the data and workload management components.
  • Improve integration (communication) between the operations and development teams.
  • Will also include Tier-0 components.
  • Define the work plan for 2008.
CSA07 Issues and Lessons
• Facility lessons:
  • We learned a lot about operating Castor and dCache under load:
    • Need to improve the rate of file opens.
    • Need to decrease the rate of errors.
    • Need to improve the scalability of some components.
  • Need to work on the stability of services at CERN and the Tier-1 centres.
  • Need to work on the transfer quality when the farms are under heavy processing load.
• General lessons:
  • Much work is needed to achieve simultaneous, sustainable and stable operations.
PADA: Processing And Data Access Task Force
Draft mandate: integrate developments and services to bring our centres and services to production quality for processing and analysis.
The Processing And Data Access Task Force is an initiative in the Integration Program:
• Designed to transition services developed in Offline to Operations:
  • elements of integration and testing for production, analysis and data management tools
• Designed to ensure that services and sites used in operations are production quality:
  • elements in the commissioning program for links and sites
  • verify that items identified in CSA07 are solved
• Development work is primarily in Offline, but verification is in Integration.
The plan is:
• To build on the expertise of the distributed MC production teams and extend their scope; we need expertise close to the centres to help us here.
• For 2008 we want to make this a recognized service contribution in the MoA scheme.
• Initial time frame: 1 year, until we have seen the first data.
• We need to define steps and milestones, and recruit people; hope for MC-OPS, DDT, ....
Final Check before Data Taking Starts: CCRC’08 = CSA08
A combined challenge by all experiments must be used to demonstrate the readiness of the WLCG computing infrastructure before the start of data taking, at a scale comparable to the data taking in 2008. CMS fully supports the plan to execute this CCRC in two phases:
• a set of functional tests in February 2008
• the final challenge in May 2008 at 100% scale, starting with the readout of the experiment
We must do this challenge as a WLCG collaboration: centres and experiments together.
Combined planning has started:
• Mailing list created: wlcg-ccrc08@cern.ch
• Agenda pages:
• Phone conference every Monday afternoon (difficult time for APR…)
• Monthly session in the pre-GDB meeting
CCRC’08 Schedule
• Phase 1 - February 2008:
  • Possible scenario: blocks of functional tests; try to reach 2008 scale for the tests at…
• Phase 2 - May 2008:
  • Full workflows at all centres, executed simultaneously by all 4 LHC experiments
  • Use data from the cosmics run, add artificial load to reach 100%
  • Duration of the challenge: 1 week setup, 4 weeks challenge
V36 Schedule (Nov’07) (timeline slide flattened in extraction; recoverable milestones)
• 1) Detector installation, commissioning & operation: magnet cooldown and test, tracker insertion, last heavy element lowered, magnet test at low current, beam-pipe closed and baked out, one EE endcap and pixels installed, 2nd ECAL endcap ready for installation end Jun’08, master contingency
• 2) Preparation of software, computing & physics analysis: s/w release 1_6 (CSA07), 1_7 (CCR_0T, HLT validation), 1_8 (lessons of ’07), 2_0 (CCR_4T, production startup MC samples)
• CMS cosmic runs: CCR_0T (several short periods, Dec-Mar) and CCR_4T; 2007 physics analyses, first results out
• Computing: CSA07, functional tests, MC production for startup, CSA08 (CCRC) = Combined Computing Readiness Challenge
CCRC’08 Phase 1: February 2008
• Possible scenario: blocks of functional tests; try to reach 2008 scale for tests at:
  • CERN: data recording, processing, CAF, data export
  • Tier-1s: data handling (import, mass storage, export), processing, analysis
  • Tier-2s: data analysis, Monte Carlo, data import and export
Proposed goals for CMS: verify (not simultaneously) solutions to the CSA07 issues and lessons, and attempt to reach ’08 scale on individual tests.
• Computing & software challenge; no physics delivery attached to the CCRC’08/1 tests.
• Cosmics run and MC production have priority if possible.
• Tests should be as independent of each other as possible; tests can be done in parallel.
• An individual test is successful if sustained for n days.
• If full ’08 scale is not possible (hardware), scale down to the hardware limit.
CCRC’08/1: Proposed Scope
CERN: data recording, processing, CAF, data export
• Data recording: 250 Hz from P5, HLT, streams, SM to T0, repacking, CASTOR
• Processing: 250 Hz at T0: CASTOR, CMSSW.x.x, 20 output streams, CASTOR
• CAF: to be defined
• CERN data export: 600 MB/s aggregate to all T1 MSS
Tier-1s: data handling (import, mass storage, export), processing, analysis
• Data import: T0 to T1 to MSS at full ’08 scale, to tape
• Data handling: I/O for processing and skimming at full ’08 scale, from tape
• Processing: re-reconstruction (incl. output streams) at full ’08 scale, from tape
• Skimming: develop an executable able to run with >20 skims, run it at the T1s
• Data export: T1 to all T1s at full ’08 scale, from tape/disk
• Data export: T1 to >5 T2s at full ’08 scale, from tape/disk
• Jobs: handle 50k jobs/day
• Data import: >5 T2s to T1 at 20 MB/s, to tape
Tier-2s: data analysis, Monte Carlo, data import and export
• Links commissioned: have 40 T2s with at least 1 commissioned up- and downlink; have 30 T2s with at least 3 (or 5) commissioned up- and downlinks
• Data transfer: import data from 3 T1s at 20 MB/s
• Data transfer: export data to 2 T1s at 10 MB/s
• Data analysis: handle 150k jobs/day (… hard to reach)
Reminder: the IN2P3 T1 is ~15% of the CMS T1 capacity.
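Since IN2P3 represents roughly 15% of the CMS Tier-1 capacity, the aggregate targets above can be scaled down to a rough single-site estimate; a minimal sketch of that arithmetic (illustrative only, not official per-site targets):

```python
# Scale aggregate CCRC'08 targets by IN2P3's ~15% share of CMS Tier-1 capacity.
# Purely illustrative arithmetic; not official per-site targets.
IN2P3_SHARE = 0.15

aggregate_targets = {
    "T0 -> T1 export (MB/s)": 600,
    "T1 processing jobs per day": 50_000,
}

for name, value in aggregate_targets.items():
    print(f"{name}: aggregate {value:,} -> IN2P3 share ~{value * IN2P3_SHARE:,.0f}")
```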
Summary (1/2)
• In CSA07 a lot was learned and a lot was achieved.
  • We hit most of the metrics, but separately and intermittently.
  • Several steps were accomplished simultaneously.
  • Many workflow steps hit their metrics routinely.
  • Now we work on accomplishing all steps simultaneously, and on providing stability in a sustainable way.
• Global connectivity between T1 and T2 sites is still an important issue.
  • The DDT task force has been successful in increasing the number of working links.
  • This effort must continue, and work must be done to automate the process of testing/commissioning the links.
• We still have to increase the number of people involved in facilities, commissioning and operations. Some recent actions:
  • New (2nd) L2 appointed to lead facility operations (based at CERN).
  • New Production And Data Access (PADA) Task Force starting; will include some of the people from the DDT task force and the MC production teams.
Summary (2/2)
• ~200M events processed and re-processed.
• Calibration, MC production, reconstruction, skimming and merging all tested successfully.
• Still need time to test the analysis model.
• The CSA07 goals for providing data for physics will be accomplished, albeit delayed due to schedule slips.
  • Processing continues to complete the data samples for physics and detector studies.
• We are keeping the challenge infrastructure alive and trying to keep it stable, going forward.
  • Continue to support global detector commissioning and physics studies.
• We have to prepare for the Combined Computing Readiness Challenge, CCRC = CSA08.
  • Without testing the software and infrastructure we are not prepared…
We depend on the support of France, the success of IN2P3, and the French, Belgian and Chinese T2s for the success of CMS computing!