LHC Network Issues Joint Tech Meeting February 14, 2007 Ian Fisk
Large Hadron Collider • The LHC is in the final year of construction at CERN in Geneva, Switzerland
LHC Schedule • LHC will have a pilot run in December of 2007 • Beam energy will be only the injector energy of 900 GeV (~1/15th of the design energy) • Machine luminosity will be low, but data volumes may be high for the run as the detectors try to take as much calibration data as possible • There is no discovery potential in the pilot run, but the data will be critical for calibration and preparations • High-energy running begins in the early summer of 2008 • 14 TeV collisions for the remainder of the year • Enormous commissioning work in the presence of the new energy frontier • Luminosity and data volume increase in 2009 and 2010 • Before the pilot run both ATLAS and CMS have preparation challenges • ATLAS has a “dress rehearsal” starting in the summer • CMS has a 50% computing challenge for offline computing
Data Volume and Distribution • Each detector is capable of producing a raw data sample of 2PB-3PB in a nominal year of data taking • A similarly sized sample of simulated events will be produced at 25-50 Tier-2 computing centers • The collaborations are the largest ever attempted, with ~2000 scientists each • Potentially large, dynamic samples of data for analysis need to get into a lot of hands in a lot of places • Only ~20% of the computing capacity is located at CERN • The detector and the distributed computing facility need to be commissioned concurrently • Nearly all running high energy physics experiments have some distributed computing (some have even reached a majority offsite) • Most, however, started with a large, dedicated local computing center
• Both ATLAS and CMS chose distributed computing models from early on • Variety of motivating factors (infrastructure, funding, leverage) • [Diagram: Tier-0 → Tier-1 → Tier-2 hierarchy. Tier-0 to Tier-1 flow: predictable, high priority. Tier-1 to Tier-1: bursts with re-reconstruction load balancing. Tier-1 to Tier-2: bursts with user needs (from one Tier-1 in the case of ATLAS). Tier-2 to Tier-1: predictable simulation data.]
Responsibilities of the Tiers • Tier-0 • Primary reconstruction • Partial Reprocessing • First archive copy of the raw data • Tier-1s • Share of raw data for custodial storage • Data Reprocessing • Analysis Tasks • Data Serving to Tier-2 centers for analysis • Archive Simulation From Tier-2 • Tier-2s • Monte Carlo Production • Primary Analysis Facilities
Network Estimates • From the CMS Computing Model (ATLAS is similar though slightly higher): • The network requirements for Tier-0 to Tier-1 transfers are driven by the trigger rate and the event size • Estimates are ~2.5Gb/s for a nominal Tier-1 center • Based on the Tier-1 event share, with a factor of 2 for recovery and a factor of 2 for provisioning • The Tier-1 to Tier-1 transfers are driven by the desire to synchronize re-reconstruction samples within a short period of time • Replicating the newly created reconstructed data and AOD between Tier-1 centers within a week requires ~1Gb/s, before the safety and provisioning factors • The Tier-1 to Tier-2 transfers are less predictable • Driven by user activities • The CMS model estimates this at 50-500MB/s (includes safety factors) • Tier-2 to Tier-1 transfers are predictable in rate and low in volume • Averaged over the entire year, it's ~1TB per day
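These estimates all reduce to "data volume divided by time window, times safety factors." The sketch below redoes that arithmetic; the trigger rate, event sizes, seconds of live time, and number of Tier-1 centers are illustrative assumptions of mine, not values from the slides, and only the factor-of-2 recovery and provisioning factors are quoted above.

```python
# Back-of-envelope LHC data-rate arithmetic. All detector parameters below
# are illustrative assumptions; only the factor-of-2 recovery and factor-of-2
# provisioning factors come from the computing-model estimate above.

TRIGGER_RATE_HZ = 150                      # assumed HLT output rate (events/s)
RAW_MB, RECO_MB, AOD_MB = 1.5, 0.25, 0.05  # assumed per-event sizes (MB)
SECONDS_PER_YEAR = 1.0e7                   # typical accelerator live time
N_TIER1 = 7                                # assumed number of nominal Tier-1s

def mb_per_s_to_gb_per_s(mb_per_s: float) -> float:
    """Convert MB/s to Gb/s (1 MB = 8 Mb)."""
    return mb_per_s * 8.0 / 1000.0

# Tier-0 -> Tier-1: one Tier-1's event share of raw + reconstructed data,
# times a factor of 2 for recovery and a factor of 2 for provisioning.
t0_t1_mb_s = TRIGGER_RATE_HZ * (RAW_MB + RECO_MB) / N_TIER1 * 2 * 2
print(f"Tier-0 -> Tier-1: ~{mb_per_s_to_gb_per_s(t0_t1_mb_s):.1f} Gb/s per nominal Tier-1")

# Tier-1 -> Tier-1: replicate one Tier-1's share of a re-reconstruction pass
# (new RECO + AOD for a year of data) to the other Tier-1s within one week.
rereco_share_mb = TRIGGER_RATE_HZ * SECONDS_PER_YEAR * (RECO_MB + AOD_MB) / N_TIER1
t1_t1_mb_s = rereco_share_mb / (7 * 86400)
print(f"Tier-1 -> Tier-1: ~{mb_per_s_to_gb_per_s(t1_t1_mb_s):.1f} Gb/s for re-reco replication")
```

With these inputs the Tier-1 to Tier-1 figure comes out near the ~1Gb/s quoted above, while the Tier-0 to Tier-1 figure lands somewhat below the ~2.5Gb/s estimate; the computing model presumably assumes a larger event share or additional data products than this sketch includes.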
CERN to Tier-1 Connectivity • Connectivity from CERN to Tier-1 centers provided by the OPN • Also provides Tier-1 to Tier-1 Connectivity
Tier-1 to Tier-2 Connectivity • In order to satisfy their mission as the primary resource for experiment analysis, the Tier-2 centers need good connectivity to the Tier-1 centers • Data is served from the Tier-1 computing centers • In CMS, each Tier-1 is assigned a share of the data • In ATLAS, each Tier-1 hosts a complete analysis dataset • The connectivity between the Tier-1 and Tier-2 centers can be substantially higher than the Tier-0 to Tier-1 rates • Already in the computing challenge, the incoming rate to FNAL is half the outgoing rate to Tier-2 centers • The network that carries the Tier-2 traffic is going to be instrumental to the experiment's success • The Tier-2 traffic is a more difficult networking problem • The number of connections is large • There is a diverse set of locations and setups
Tier-2 and Tier-3 Centers • A Tier-2 center in ATLAS and CMS is approximately 1 MSI2k of computing • Tier-3 centers belong to university groups and can be of comparable size • A Tier-2 center in ATLAS and CMS has ~200TB of disk • Currently procuring and managing this volume of storage is expensive and operationally challenging • It requires a reasonable virtualization layer • A Tier-2 center has between 2.5Gb/s and 10Gb/s of connectivity in the US • This is similar between Tier-2 and Tier-3 centers • The speed of connection to the local sites has increased rapidly • In the US-CMS planning, a Tier-2 supports 40 physicists performing analysis • This is a primary difference between a Tier-2 and a Tier-3 • Tier-2 centers are a funded effort of the experiment • The central project has expectations of them
Surviving the first years • The computing for either experiment is hardest while the detector is still being understood • The analysis object data (AOD) for CMS is estimated at 0.05MB per event • An entire year's data and reconstruction are only 300TB • Data is divided into ~10 trigger streams and ~50 offline streams • A physics analysis should rely on 1 trigger stream • A Tier-2 could potentially maintain all the analysis objects for the majority of the analysis streams • Unfortunately, until the detector and reconstruction are completely understood, the AOD is not useful for most analyses and access to the raw data will be more frequent • The full raw data is 35 times bigger • Given the experience of the previous generation of detectors, we should expect about 3 years to stabilize • People working at Tier-2 centers can place substantial but bursty demands on the data transfers
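To make the stream arithmetic concrete, the short sketch below divides the ~300TB analysis-object sample over the ~50 offline streams quoted above and compares one stream, and its 35-times-larger raw equivalent, against the 200TB of Tier-2 disk from the previous slide. The equal-size-streams assumption is mine, purely for illustration.

```python
# Rough stream-size arithmetic from the figures on these slides,
# assuming (for illustration only) that all streams are equal in size.

AOD_YEAR_TB   = 300    # a year of data + reconstruction at the AOD level
N_OFFLINE     = 50     # approximate number of offline streams
RAW_FACTOR    = 35     # full raw data is ~35x larger than the AOD
TIER2_DISK_TB = 200    # disk at a nominal Tier-2 (from the earlier slide)

aod_per_stream_tb = AOD_YEAR_TB / N_OFFLINE
raw_per_stream_tb = aod_per_stream_tb * RAW_FACTOR

print(f"AOD per offline stream: ~{aod_per_stream_tb:.0f} TB")        # ~6 TB
print(f"Raw per offline stream: ~{raw_per_stream_tb:.0f} TB")        # ~210 TB
print(f"Streams a 200 TB Tier-2 can hold at AOD level: "
      f"~{TIER2_DISK_TB / aod_per_stream_tb:.0f}")                   # ~33
```

Under these assumptions a single Tier-2 could indeed hold the AOD for most streams, but the raw data for even one offline stream exceeds a nominal Tier-2's disk, which is why early, raw-data-heavy analysis shows up as network transfers rather than local hosting.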
Analysis Selections • When going back to the raw data and complete simulation, analysis selections run over complete trigger streams • A 1% selection on data and MC would be 4TB, a 10% selection would be 40TB • Smaller by a factor of 5 if only the offline stream can be used • There are an estimated 40 people working at a Tier-2 • If half the people perform the small selections twice a month • This is already ~50MB/s on average, with everyone working asynchronously • The original analysis estimates were once a week • 10% selections will happen • 100MB/s x 7 Tier-2s would be ~6Gb/s from a Tier-1 • The size of selections, the number of active people, and the frequency of selections all have a significant impact on the total network requirements • Can easily arrive at 500MB/s for bursts (see the sketch below)
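A minimal sketch of the aggregate-rate estimate above, using the slide's own inputs (40 people per Tier-2, half of them active, two 4TB selections per month each); the 30-day month is my assumption.

```python
# Average Tier-1 -> Tier-2 rate implied by user analysis selections,
# using the figures quoted on this slide (30-day month assumed).

PEOPLE_PER_TIER2  = 40
FRACTION_ACTIVE   = 0.5
SELECTIONS_PER_MO = 2
SELECTION_TB      = 4.0            # a 1% selection of data + MC
SECONDS_PER_MONTH = 30 * 86400

tb_per_month = PEOPLE_PER_TIER2 * FRACTION_ACTIVE * SELECTIONS_PER_MO * SELECTION_TB
avg_mb_s = tb_per_month * 1.0e6 / SECONDS_PER_MONTH   # 1 TB = 1e6 MB
print(f"Average into one Tier-2: ~{avg_mb_s:.0f} MB/s")             # ~60 MB/s

# Scaling up: seven Tier-2s bursting at ~100 MB/s from the same Tier-1.
print(f"Seven Tier-2s at 100 MB/s: ~{100 * 7 * 8 / 1000:.1f} Gb/s")  # ~5.6 Gb/s
```

The ~60MB/s this gives is in line with the ~50MB/s quoted above, and it scales roughly linearly with selection size, selection frequency, and the number of active users, which is how the 500MB/s burst figure arises.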
Tier-3 Connectivity • Tier-2s are a resource for the whole physics community • Even people with significant university clusters at home have the opportunity to use a Tier-2 • The use of Tier-3s for analysis is foreseen in the model • These are not resources for the whole experiment and can have lower priority for access to common resources • The number of active physicists supported at a Tier-3 center is potentially much smaller than at a Tier-2 • 4-8 people • This leads to smaller sustained network use • but similar requirements to Tier-2s, to enable similar turn-around times/latencies for physics datasets copied to Tier-3 sites for analysis • For CMS, the Tier-3s are similar in analysis functionality but not in capacity • Data connectivity is expected to all Tier-1 sites
Tier-2 and Tier-3 Connectivity to Sites • The network from CERN to the Tier-1s is a dedicated resource • The Tier-2 and Tier-3 connections to Tier-1 centers are expected to go over the normal backbones (Internet2, ESnet) • In the ATLAS model the Tier-2 centers connect primarily to one Tier-1; in the case of the US centers, this is Brookhaven • In the CMS model, the Tier-2 centers connect to the Tier-1 hosting the data • About 50% of the data will be resident at FNAL • The other 50% is hosted at 5 European Tier-1s and 1 Asian Tier-1 • The European Tier-2s will also connect to FNAL for roughly half of their connections • A number of the US Tier-2s either own connectivity to StarLight or participate in projects like UltraLight • Connections to Europe will be over shared resources and peerings with European providers
Examples of Transfers • Tier-1 to Tier-2 transfers within the US with SRM disk-to-disk
Examples of Transfers (Cont.) • Transfers from FNAL to European Tier-2s
Outlook • The volume of data and the level of distribution present new challenges for LHC computing • Distributed computing is an integral part of the experiments' success • Making efficient use of the network to move large quantities of data is critical to the success of distributed computing • The Tier-1 centers have custodial responsibility for the raw data • They are a natural extension of the online system, and their network rates are predictable • The network from CERN to the Tier-1s is dedicated • The Tier-2 centers are resources for analysis • Both experiments are learning to perform efficient data transfers • Transfers are driven by user needs, and demands can be high • There is a lot of work to do in the last year