This overview provides information on the goals and milestones of the LCG service challenges, including the demonstration of high data throughput and the building of a production GRID infrastructure.
LCG Service Challenges Overview Victor Zhiltsov, JINR Based on SC Conferences and meetings: http://agenda.cern.ch/displayLevel.php?fid=3l181
SC3 GOALS • Service Challenge 1 (end of 2004): Demonstrate the possibility of a throughput of 500 MByte/s to Tier1 in the LCG environment. • Service Challenge 2 (spring 2005): Maintain a cumulative throughput of 500 MByte/s across all Tier1s for a prolonged time, and evaluate the data transfer environment at Tier0 and the Tier1s. • Service Challenge 3 (summer to end of 2005): Show reliable and stable data transfer to each Tier1: to disk - 150 MByte/s, to tape - 60 MByte/s. All Tier1s and some Tier2s involved. • Service Challenge 4 (spring 2006): Prove that the GRID infrastructure can handle LHC data at the proposed rate (from raw data transfer up to final analysis) with all Tier1s and the majority of Tier2s. • Final Goal: Build the production GRID infrastructure across Tier0, Tier1 and Tier2 according to the specific requirements of the LHC experiments.
Summary of Tier0/1/2 Roles • Tier0 (CERN): safekeeping of RAW data (first copy); first-pass reconstruction; distribution of RAW data and reconstruction output to the Tier1s; reprocessing of data during LHC down-times; • Tier1: safekeeping of a proportional share of RAW and reconstructed data; large-scale reprocessing and safekeeping of the corresponding output; distribution of data products to Tier2s and safekeeping of a share of the simulated data produced at those Tier2s; • Tier2: handling analysis requirements and a proportional share of simulated event production and reconstruction; no long-term data storage. N.B. there are differences in roles by experiment - essential to test using the complete production chain of each!
SC2 met its throughput targets • >600MB/s daily average for 10 days was achieved - midday 23rd March to midday 2nd April • Not without outages, but the system showed it could recover the rate again after outages • Load reasonably evenly divided over sites (given the network bandwidth constraints of the Tier-1 sites)
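For scale, sustaining that daily average for the full ten days corresponds to roughly half a petabyte moved; a quick back-of-the-envelope check (the rounding and decimal unit convention are mine):

```python
# Rough volume implied by the SC2 run quoted above:
# ~600 MB/s daily average sustained for 10 days.
rate_mb_s = 600
seconds = 10 * 24 * 3600
total_tb = rate_mb_s * seconds / 1e6   # MB -> TB, decimal units
print(f"~{total_tb:.0f} TB moved over the 10-day window")   # ~518 TB
```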
Storage and Software used • Most sites ran Globus gridftp servers • CCIN2P3, CNAF, GridKa, SARA • The rest of the sites ran dCache • BNL, FNAL, RAL • Most sites used local or system-attached disk • FZK used SAN via GPFS • FNAL used production CMS dCache, including tape • Load-balancing for gridftp sites was done by the RADIANT software running at CERN in push mode
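The slides do not show how RADIANT's push-mode scheduling worked internally; purely as an illustrative sketch (the site list, weights and round-robin policy below are my assumptions, not the actual RADIANT implementation), a central push-mode balancer can be pictured along these lines:

```python
from itertools import cycle

# Hypothetical destination endpoints, weighted to stand in for the
# differing network bandwidth available at each Tier-1.
SITES = {
    "gsiftp://gridftp.ccin2p3.example": 2,
    "gsiftp://gridftp.cnaf.example": 2,
    "gsiftp://dcache.fnal.example": 3,
    "gsiftp://dcache.ral.example": 1,
}

def push_schedule(files):
    """Assign files to destinations round-robin, repeating sites by weight."""
    ring = cycle([site for site, w in SITES.items() for _ in range(w)])
    return [(f, next(ring)) for f in files]

for name, dest in push_schedule([f"file{i:04d}" for i in range(8)]):
    print(f"push {name} -> {dest}")
```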
Tier0/1 Network Topology (April-July changes) [diagram: the CERN Tier-0 connected via GEANT, NetherLight, ESNet, StarLight and UKLight to the Tier-1 sites ASCC, BNL, CNAF, FNAL, GridKa, IN2P3, Nordic, PIC, RAL, SARA and Triumf, with per-site links ranging from 1G shared to 10G]
GridPP Estimates of T2 Networking • 1 kSI2k corresponds to 1 Intel Xeon 2.8 GHz processor • The CMS figure of 1Gb/s into a T2 comes from the following: • Each T2 has ~10% of the current RECO data and 1/2 of the AOD (real + MC sample) • These data are refreshed every 3 weeks • compatible with the frequency of major selection passes at the T1s • See CMS Computing Model S-30 for more details
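As a rough illustration of the relationship between link speed and the 3-week refresh cycle (the arithmetic below is mine and assumes the link could be kept busy for the whole cycle, which is an idealisation; the resulting volume is not a number taken from the CMS Computing Model):

```python
# Volume a T2 could refresh over one 3-week cycle on a dedicated 1 Gb/s link.
link_gbps = 1.0
refresh_days = 21
bytes_per_cycle = link_gbps * 1e9 / 8 * refresh_days * 24 * 3600
print(f"~{bytes_per_cycle / 1e12:.0f} TB refreshable per cycle")   # ~227 TB
```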
SC3 – Milestone Decomposition • File transfer goals: • Build up disk – disk transfer speeds to 150MB/s • SC2 was 100MB/s – agreed per site • Include tape – transfer speeds of 60MB/s • Tier1 goals: • Bring in additional Tier1 sites wrt SC2 • PIC and Nordic most likely added later: SC4? US-ALICE T1? Others? • Tier2 goals: • Start to bring Tier2 sites into the challenge • Agree the services T2s offer / require • On-going plan (more later) to address this via GridPP, INFN etc. • Experiment goals: • Address the main offline use cases except those related to analysis • i.e. real data flow out of T0-T1-T2; simulation in from T2-T1 • Service goals: • Include CPU (to generate files) and storage • Start to add additional components • Catalogs, VOs, experiment-specific solutions etc., 3D involvement, … • Choice of software components, validation, fallback, …
Key dates for Connectivity • June05 - Technical Design Report, Credibility Review by LHCC • Sep05 - SC3 Service – 8-9 Tier-1s sustain 1 Gbps at the Tier-1s, 5 Gbps at CERN; extended peaks at 10 Gbps at CERN and some Tier-1s • Jan06 - SC4 Setup – all Tier-1s; 10 Gbps at >5 Tier-1s, 35 Gbps at CERN • July06 - LHC Service – all Tier-1s; 10 Gbps at the Tier-1s, 70 Gbps at CERN • [timeline 2005-2008: SC2, SC3, SC4, LHC Service Operation; cosmics, first beams, first physics, full physics run]
Historical slides from Les / Ian: 2005 Sep-Dec - SC4 preparation • In parallel with the SC3 model validation period, in preparation for the first 2006 service challenge (SC4) – using the 500 MByte/s test facility • test the PIC and Nordic T1s • and T2s that are ready (Prague, LAL, UK, INFN, ..) • Build up the production facility at CERN to 3.6 GBytes/s • Expand the capability at all Tier-1s to the full nominal data rate • [timeline 2005-2008: SC2, SC3, SC4; cosmics, first beams, first physics, full physics run]
Historical slides from Les / Ian: 2006 Jan-Aug - SC4 • SC4 – full computing model services - Tier-0, ALL Tier-1s and all major Tier-2s operational at full target data rates (~2 GB/sec at Tier-0) - acquisition - reconstruction - recording – distribution, PLUS ESD skimming and servicing Tier-2s • Goal – stable test service for one month – April 2006 • 100% Computing Model Validation Period (May-August 2006): Tier-0/1/2 full model test - all experiments - 100% nominal data rate, with processing load scaled to 2006 CPUs • [timeline 2005-2008: SC2, SC3, SC4; cosmics, first beams, first physics, full physics run]
Historical slides from Les / Ian: 2006 Sep – LHC service available • The SC4 service becomes the permanent LHC service – available for experiments' testing, commissioning, processing of cosmic data, etc. • All centres ramp up to the capacity needed at LHC startup • TWICE nominal performance • Milestone to demonstrate this 3 months before first physics data, April 2007 • [timeline 2005-2008: SC2, SC3, SC4, LHC Service Operation; cosmics, first beams, first physics, full physics run]
Tier2 Roles • Tier2 roles vary by experiment, but include: • Production of simulated data; • Production of calibration constants; • Active role in [end-user] analysis • Must also consider services offered to T2s by T1s • e.g. safe-guarding of simulation output; • Delivery of analysis input. • No fixed dependency between a given T2 and T1 • But ‘infinite flexibility’ has a cost…
Tier2 Functionality (At least) two distinct cases: • Simulation output • This is relatively straightforward to handle • Simplest case: associate a T2 with a given T1 • Can be reconfigured • Logical unavailability of a T1 could eventually mean that T2 MC production might stall • More complex scenarios are possible • But why? Make it as simple as possible, but no simpler… • Analysis • Much less well understood and likely much harder…
Tier2s are assumed to offer, in addition to the basic Grid functionality: • Client services whereby reliable file transfers may be initiated to / from Tier1/0 sites, currently based on the gLite File Transfer software (gLite FTS); • Managed disk storage with an agreed SRM interface, such as dCache or the LCG DPM.
To participate in the Service Challenge, Tier2 sites are required to: • install the gLite FTS client and a disk storage manager (dCache?); • for the throughput phase, no long-term storage of the transferred data is required, but they nevertheless need to agree with the corresponding Tier1 that the necessary storage area to which they upload data (analysis is not included in Service Challenge 3) and the gLite FTS backend service are provided.
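As an illustration of the client side only (the endpoints, SURLs and the exact command-line options below are assumptions of mine modelled on the gLite FTS client of that period, not something given in the slides), a T2 upload could be driven by a thin wrapper such as:

```python
import subprocess

# Hypothetical endpoints; a real T2 would use the FTS service hosted at its
# associated Tier1 and SRM SURLs for its own and the Tier1's storage elements.
FTS_ENDPOINT = "https://fts.t1.example:8443/glite-data-transfer-fts/services/FileTransfer"
SRC = "srm://se.t2.example/data/mc/file0001"
DST = "srm://se.t1.example/data/mc/file0001"

def submit_upload(src, dst):
    """Submit one file transfer; the FTS job id is returned on stdout."""
    result = subprocess.run(
        ["glite-transfer-submit", "-s", FTS_ENDPOINT, src, dst],
        check=True, capture_output=True, text=True)
    return result.stdout.strip()

job_id = submit_upload(SRC, DST)
print("submitted FTS job", job_id)
# Progress would then be polled, e.g. with `glite-transfer-status <job_id>`.
```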
Tier2 Model • As Tier2s do not typically provide archival storage, this is a primary service that must be provided to them, assumed to be via a Tier1. • Although no fixed relationship between a Tier2 and a Tier1 should be assumed, a pragmatic approach for Monte Carlo data is nevertheless to associate each Tier2 with a ‘preferred’ Tier1 that is responsible for long-term storage of the Monte Carlo data produced at the Tier2. • By default, it is assumed that data upload from the Tier2 will stall should the Tier1 be logically unavailable. This in turn could imply that Monte Carlo production will eventually stall if local storage becomes exhausted; however, such events are assumed to be relatively rare, and the production manager of the experiment concerned may in any case reconfigure the transfers to an alternate site in case of a prolonged outage.
A Simple T2 Model (1/2) N.B. this may vary from region to region • Each T2 is configured to upload MC data to, and download data via, a given T1 • In case the T1 is logically unavailable, wait and retry • MC production might eventually stall • For data download, retrieve via an alternate route / T1 • Which may well be at lower speed, but hopefully this is rare • Data residing at a T1 other than the ‘preferred’ T1 is transparently delivered through an appropriate network route • T1s are expected to have interconnectivity at least as good as that to the T0
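A minimal sketch of this upload/download policy, assuming placeholder site names and a simulated transfer call (real transfers would go through the gLite FTS as described above):

```python
import random
import time

PREFERRED_T1 = "t1-preferred.example"
FALLBACK_T1S = ["t1-alt-a.example", "t1-alt-b.example"]   # hypothetical names

def transfer(path, site):
    """Stand-in for a real FTS/SRM transfer; simulates an occasionally
    (logically) unavailable site."""
    return random.random() > 0.3

def upload_mc(path, retries=5, wait_s=1):
    """MC upload: wait and retry against the preferred T1 only;
    if it stays unavailable, MC production may eventually stall."""
    for _ in range(retries):
        if transfer(path, PREFERRED_T1):
            return True
        time.sleep(wait_s)
    return False

def download(path):
    """Data download: fall back to an alternate T1 / route (likely slower)."""
    for site in (PREFERRED_T1, *FALLBACK_T1S):
        if transfer(path, site):
            return site
    return None

print("upload ok:", upload_mc("mc/file0001"))
print("download served by:", download("reco/file0002"))
```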
A Simple T2 Model (2/2) • Each Tier-2 is associated with a Tier-1 that is responsible for getting them set up • Services at the T2 are managed storage and reliable file transfer • FTS: DB component at the T1, user agent also at the T2; DB for storage at the T2 • 1GBit network connectivity – shared (less will suffice to start with, more may be needed!) • Tier1 responsibilities: • Provide archival storage for (MC) data that is uploaded from T2s • Host the DB and (gLite) File Transfer Server • (Later): also data download (eventually from 3rd party) to T2s • Tier2 responsibilities: • Install / run dCache / DPM (managed storage s/w with agreed SRM i/f) • Install the gLite FTS client • (batch service to generate & process MC data) • (batch analysis service – SC4 and beyond) • Tier2s do not offer persistent (archival) storage!
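Purely as an illustration of how this split of responsibilities might be captured in a site description (every name and value below is invented for the example):

```python
# Hypothetical description of one T2 and its associated T1 for SC3.
T2_SITE = {
    "name": "T2-Example",
    "associated_t1": "T1-Example",       # responsible for setup and archival storage
    "storage": {"software": "DPM", "srm_endpoint": "srm://se.t2-example.org"},
    "fts": {
        "server": "https://fts.t1-example.org:8443",   # FTS server + DB hosted at the T1
        "client_installed": True,                      # gLite FTS client at the T2
    },
    "network_gbit": 1,             # shared 1 Gbit to start with
    "archival_storage": False,     # T2s do not offer persistent storage
}
print(T2_SITE["associated_t1"])
```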
Service Challenge in Russia [diagram: T2 – T1 – T0 data paths for the Russian sites]
Summary • The first T2 sites need to be actively involved in Service Challenges from Summer 2005 • ~All T2 sites need to be successfully integrated just over one year later • Adding the T2s and integrating the experiments’ software in the SCs will be a massive effort! • Initial T2s for SC3 have been identified • A longer term plan is being executed
Conclusions • To be ready to fully exploit the LHC, significant resources need to be allocated to a series of Service Challenges by all parties concerned • These challenges should be seen as an essential, on-going and long-term commitment to achieving a production LCG • The countdown has started – we are already in (pre-)production mode • Next stop: 2020