LHC Network Issues Joint Tech Meeting February 14, 2007 Ian Fisk
Large Hadron Collider • The LHC is in the final year of construction at CERN in Geneva, Switzerland
LHC Schedule • The LHC will have a pilot run in December of 2007 • Beam energy will be only the injector energy, 900 GeV (~1/15th of the design energy) • Machine luminosity will be low, but data volumes may be high for the run as the detectors try to take as much calibration data as possible • There is no discovery potential in the pilot run, but the data will be critical for calibration and preparations • High-energy running begins in the early summer of 2008 • 14 TeV collisions for the remainder of the year • Enormous commissioning work in the presence of the new energy frontier • Luminosity and data volume increase in 2009 and 2010 • Before the pilot run both ATLAS and CMS have preparation challenges • ATLAS has a “dress rehearsal” starting in the summer • CMS has a 50% computing challenge for offline computing
Data Volume and Distribution • Each detector is capable of producing a raw data sample of 2-3 PB in a nominal year of data taking • A similarly sized sample of simulated events will be produced at 25-50 Tier-2 computing centers • The collaborations are the largest ever attempted, with ~2000 scientists each • Large, dynamic samples of data for analysis need to get into a lot of hands in a lot of places • Only ~20% of the computing capacity is located at CERN • The detector and the distributed computing facility need to be commissioned concurrently • Nearly all running high-energy physics experiments have some distributed computing (some have even reached a majority offsite) • Most, however, started with a large dedicated local computing center for the start of data taking
Computing Models • Both ATLAS and CMS chose distributed computing models from early on • Variety of motivating factors (infrastructure, funding, leverage) • [Diagram of data flows among the Tier-0, Tier-1, and Tier-2 centers: the Tier-0 to Tier-1 flow is predictable and high priority; Tier-1 to Tier-1 traffic bursts with re-reconstruction load balancing; Tier-1 to Tier-2 traffic bursts with user needs (from one Tier-1 in the case of ATLAS); Tier-2 to Tier-1 traffic is predictable simulation data]
Responsibilities of the Tiers • Tier-0 • Primary reconstruction • Partial Reprocessing • First archive copy of the raw data • Tier-1s • Share of raw data for custodial storage • Data Reprocessing • Analysis Tasks • Data Serving to Tier-2 centers for analysis • Archive Simulation From Tier-2 • Tier-2s • Monte Carlo Production • Primary Analysis Facilities
Network Estimates • From the CMS Computing Model (ATLAS is similar, though slightly higher): • The network requirements for Tier-0 to Tier-1 transfers are driven by the trigger rate and the event size • Estimates are ~2.5 Gb/s for a nominal Tier-1 center • This is the Tier-1 event share with a factor of 2 for recovery and a factor of 2 for provisioning • The Tier-1 to Tier-1 transfers are driven by the desire to synchronize re-reconstruction samples within a short period of time • Replicating the newly created reconstruction and AOD samples between Tier-1 centers within a week requires ~1 Gb/s, before the safety and provisioning factors • The Tier-1 to Tier-2 transfers are less predictable • Driven by user activities • The CMS model estimates these at 50-500 MB/s (includes safety factors) • Tier-2 to Tier-1 transfers are predictable in rate and low in volume • Averaged over the entire year, this is ~1 TB per day
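The structure of these estimates is simple arithmetic: a rate times an event size times a site share and safety factors, or a sample volume divided by a replication window. The Python sketch below reproduces that structure; the trigger rate, event size, Tier-1 share, and re-reconstruction volume are illustrative assumptions for this example rather than the official computing model inputs, so the printed numbers only approximate the figures quoted above.

```python
# Minimal sketch of the bandwidth arithmetic behind the Tier-0 -> Tier-1 and
# Tier-1 -> Tier-1 estimates. All inputs marked "assumed" are illustrative
# placeholders, not the official CMS computing model parameters.

GBIT_PER_S = 1e9 / 8.0          # bytes per second in one Gb/s

trigger_rate_hz = 150           # assumed HLT output rate (events/s)
event_size_mb = 1.5             # assumed raw + reconstructed event size (MB)
tier1_share = 1.0 / 7.0         # assumed data share of one nominal Tier-1
recovery_factor = 2.0           # factor of 2 to catch up after downtime
provisioning_factor = 2.0       # factor of 2 of provisioning head room

export_mb_s = trigger_rate_hz * event_size_mb
tier1_gb_s = (export_mb_s * 1e6 * tier1_share
              * recovery_factor * provisioning_factor) / GBIT_PER_S
print(f"Tier-0 export rate:            {export_mb_s:.0f} MB/s")
print(f"Tier-0 -> nominal Tier-1 link: {tier1_gb_s:.1f} Gb/s (with factors)")

# Tier-1 -> Tier-1: replicate one re-reconstruction pass (reco + AOD) in a week.
rereco_volume_tb = 75           # assumed reco + AOD volume of one pass (TB)
week_s = 7 * 24 * 3600
t1_t1_gb_s = rereco_volume_tb * 1e12 / week_s / GBIT_PER_S
print(f"Tier-1 -> Tier-1 in one week:  {t1_t1_gb_s:.1f} Gb/s before safety factors")
```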
CERN to Tier-1 Connectivity • Connectivity from CERN to the Tier-1 centers is provided by the LHC Optical Private Network (OPN) • The OPN also provides Tier-1 to Tier-1 connectivity
Tier-1 to Tier-2 Connectivity • In order to satisfy their mission as the primary resource for experiment analysis, the Tier-2 centers need good connectivity to the Tier-1 centers • Data is served from the Tier-1 computing centers • In CMS, each Tier-1 is assigned a share of the data • In ATLAS, each Tier-1 holds a complete analysis dataset • The connectivity between the Tier-1 and Tier-2 centers can be substantially higher than the Tier-0 to Tier-1 rates • Already in the computing challenges, the incoming rate to FNAL is half the outgoing rate to Tier-2 centers • The network that carries the Tier-2 traffic is going to be instrumental to the experiments' success • The Tier-2 traffic is a more difficult networking problem • The number of connections is large • There is a diverse set of locations and setups
Tier-2 and Tier-3 Centers • A Tier-2 center in ATLAS or CMS provides approximately 1 MSI2k of computing • Tier-3 centers belong to university groups and can be of comparable size • A Tier-2 center in ATLAS or CMS has ~200 TB of disk • Currently procuring and managing this volume of storage is expensive and operationally challenging • Requires a reasonable storage virtualization layer • A Tier-2 center has between 2.5 Gb/s and 10 Gb/s of connectivity in the US • This is similar between Tier-2 and Tier-3 centers • The speed of connection to the local sites has increased rapidly • In the US-CMS planning, a Tier-2 supports 40 physicists performing analysis • This is a primary difference between a Tier-2 and a Tier-3 • Tier-2 centers are a funded effort of the experiment • The central project has expectations of them
Surviving the first years • The computing for either experiment is hardest while the detector is still being understood • The analysis object data (AOD) for CMS is estimated at 0.05 MB per event • An entire year's data and reconstruction are only 300 TB • Data is divided into ~10 trigger streams and ~50 offline streams • A physics analysis should rely on one trigger stream • A Tier-2 could potentially maintain all the analysis objects for the majority of the analysis streams • Unfortunately, until the detector and reconstruction are completely understood, the AOD is not useful for most analyses and access to the raw data will be more frequent • The full raw data is 35 times bigger • Given the experience of the previous generation of detectors, we should expect about 3 years to stabilize • People working at Tier-2 centers can place substantial, but bursty, demands on data transfers
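A back-of-the-envelope version of this comparison is sketched below; the yearly event count is an assumed illustration (trigger rate times live time), while the per-event AOD size and the factor of 35 come from the figures above.

```python
# Rough arithmetic behind the "small AOD vs. large raw data" argument.
# The yearly event count is an assumed illustration; the per-event AOD size
# and the x35 raw factor are the figures quoted on the slide.

events_per_year = 150 * 1e7     # assumed 150 Hz trigger rate x ~1e7 live seconds
aod_event_mb = 0.05             # AOD size per event
raw_factor = 35                 # raw data is ~35x larger per event

aod_tb = events_per_year * aod_event_mb / 1e6
raw_pb = aod_tb * raw_factor / 1e3

print(f"One year of AOD:      ~{aod_tb:.0f} TB (hostable at a single Tier-2)")
print(f"One year of raw data: ~{raw_pb:.1f} PB (stays at the Tier-1 centers)")
```

With these assumed inputs the raw sample comes out near the 2-3 PB per year quoted earlier, which is why early analyses that fall back on raw data pull much harder on the network than AOD-based ones.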
Analysis Selections • When going back to the raw data and complete simulation, analysis selections run over complete trigger streams • A 1% selection on data and MC would be 4 TB; a 10% selection would be 40 TB • Smaller by a factor of 5 if only the offline stream can be used • There are an estimated 40 people working at a Tier-2 • If half the people perform the small selections roughly twice a month, this is already ~50 MB/s on average, with everyone working asynchronously • The original analysis estimates were once a week • 10% selections will happen • 100 MB/s x 7 Tier-2s would be ~6 Gb/s from a Tier-1 • The size of selections, the number of active people, and the frequency of selections all have significant impact on the total network requirements • One can easily arrive at 500 MB/s for bursts
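The average-rate figure follows directly from the selection size, the number of active users, and how often they refresh their selections. The sketch below reproduces that arithmetic with the slide's inputs; only the seconds-per-month conversion is an added assumption.

```python
# How an average Tier-1 -> Tier-2 rate of order 50 MB/s arises from user
# selections. Inputs follow the slide; only the month length is assumed here.

active_users = 40            # physicists supported at one Tier-2
fraction_selecting = 0.5     # half of them run the small (1%) selections
selections_per_month = 2     # each refreshes a selection about twice a month
selection_tb = 4             # a 1% selection of data + MC is ~4 TB

month_s = 30 * 24 * 3600
monthly_tb = active_users * fraction_selecting * selections_per_month * selection_tb
avg_mb_s = monthly_tb * 1e6 / month_s

print(f"Volume pulled into one Tier-2 per month: {monthly_tb:.0f} TB")
print(f"Average incoming rate:                   ~{avg_mb_s:.0f} MB/s")
```

With these inputs the average comes out around 60 MB/s, the same order as the ~50 MB/s quoted above; weekly selections or 10% selections scale the requirement up correspondingly, which is how the 100 MB/s per Tier-2 and ~6 Gb/s per Tier-1 figures arise.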
Tier-3 Connectivity • Tier-2s are a resource for the physics community • Even people with significant university clusters at home have the opportunity to use a Tier-2 • The use of Tier-3s for analysis is foreseen in the model • These are not resources for the whole experiment and can have lower priority for access to common resources • The number of active physicists supported at a Tier-3 center is potentially much smaller than at a Tier-2 • 4-8 people • This leads to smaller sustained network use • but similar requirements to Tier-2s, to enable similar turn-around times/latencies for physics datasets copied to Tier-3 sites for analysis • For CMS, the Tier-3s are similar to Tier-2s in analysis functionality but not in capacity • Data connectivity is expected to all Tier-1 sites
Tier-2 and Tier-3 Connectivity to Sites • The network from CERN to the Tier-1s is a dedicated resource • The Tier-2 and Tier-3 connections to Tier-1 centers are expected to go over the normal backbones (Internet2, ESnet) • In the ATLAS model the Tier-2 centers connect primarily to one Tier-1; in the case of the US centers, this is Brookhaven • In the CMS model, the Tier-2 centers connect to the Tier-1 hosting the data • About 50% of the data will be resident at FNAL • The other 50% is hosted at 5 European Tier-1s and 1 Asian Tier-1 • The European Tier-2s will also connect to FNAL for roughly half their connections • A number of the US Tier-2s either own connectivity to StarLight or participate in projects like UltraLight • Connections to Europe will be over shared resources and peerings with European providers
Examples of Transfers • Tier-1 to Tier-2 transfers within the US with SRM disk-to-disk
Examples of Transfers (Cont.) • Transfers from FNAL to European Tier-2s
Outlook • The volume of data and the level of distribution present new challenges for LHC computing • Distributed computing is an integral part of the experiments' success • Making efficient use of the network to move large quantities of data is critical to the success of distributed computing • The Tier-1 centers have custodial responsibility for the raw data • They are a natural extension of the online system and the network rates are predictable • The network from CERN to the Tier-1s is dedicated • The Tier-2 centers are resources for analysis • Both experiments are learning to perform efficient data transfers • Transfers are driven by user needs and demands can be high • There is a lot of work to do in the last year