LHCnet: Proposal for LHC Network infrastructure extending globally to Tier2 and Tier3 sites Artur Barczyk, Harvey Newman California Institute of Technology / US LHCNet LHCT2S Meeting CERN, January 13th, 2011
LHC Computing Infrastructure • WLCG in brief: • 1 Tier-0 (CERN) • 11 Tier-1s; 3 continents • 164 Tier-2s; 5 (6) continents • Plus O(300) Tier-3s worldwide
CMS Data Movements (All Sites and Tier1-Tier2) • 120 days, June-October: daily average T1-T2 rates reach 1-1.8 GBytes/s; daily average total rates reach over 2 GBytes/s • Tier2-Tier2 traffic is ~25% of Tier1-Tier2 traffic, rising to ~50% during dataset reprocessing & repopulation • 132 hours, last week: 1-hour averages up to 3.5 GBytes/s [Plots: throughput in GBy/s vs. date]
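The rates on this slide are quoted in GBytes/s, while the requirements later in the deck are quoted in Gbps. The following minimal sketch only performs that unit conversion for the figures quoted above; the function name and dictionary labels are illustrative and not part of the original slides.

```python
def gbytes_to_gbits(rate_gbytes_per_s: float) -> float:
    """Convert a throughput in GBytes/s to Gbit/s (8 bits per byte)."""
    return rate_gbytes_per_s * 8.0


# Figures quoted on the slide, in GBytes/s (labels are illustrative).
quoted_rates = {
    "T1-T2 daily average (low end)": 1.0,
    "T1-T2 daily average (high end)": 1.8,
    "Total daily average": 2.0,
    "1-hour average peak": 3.5,
}

for label, rate in quoted_rates.items():
    print(f"{label}: {rate} GBytes/s = {gbytes_to_gbits(rate):.1f} Gbit/s")
```

Read this way, the quoted 1-hour peak corresponds to several 10 Gbps links' worth of aggregate traffic.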
Worldwide data distribution and analysis (F. Gianotti) • Total throughput of ATLAS data through the Grid, 1st January to November [Plot: MB/s per day; labels: 6 GB/s, ~2 GB/s (design)] • Peaks of 10 GB/s reached • Grid-based analysis in Summer 2010: >1000 different users; >15M analysis jobs • The excellent Grid performance has been crucial for fast release of physics results • E.g. ICHEP: the full data sample taken until Monday was shown at the conference on Friday
Changing LHC Data Models • 3 recurring themes: • Flat(ter) hierarchy: Any site might in the future pull data from any other site hosting it. • Data caching: Analysis sites will pull datasets from other sites “on demand”, including from Tier2s in other regions • Possibly in combination with strategic pre-placement of data sets • Remote data access: jobs executing locally, using data cached at a remote site in quasi-real time • Possibly in combination with local caching • Expect variations by experiment
Remote Data Access and Local Processing with Xrootd (CMS) • Useful for smaller sites with less (or even no) data storage • Only selected objects are read (with object read-ahead); no transfer of entire data sets • CMS demonstrator: Omaha diskless Tier3, served data from Caltech and Nebraska (Xrootd) • Strategic decisions: remote access vs. data transfers • Similar operations in ALICE for years (Brian Bockelman, September 2010)
Requirements summary (from Kors’ document) • Bandwidth: • Ranging from 1 Gbps (Minimal site) to 5-10 Gbps (Nominal) to N x 10 Gbps (Leadership) • No need for a full mesh at full rate, but several full-rate connections between Leadership sites • Scalability is important: • Sites are expected to migrate Minimal → Nominal → Leadership • Bandwidth growth: Minimal = 2x/yr, Nominal & Leadership = 2x/2yr • “Staging”: • Facilitate good connectivity to so far (network-wise) underserved sites • Flexibility: • Should be able to include or remove sites at any time • Budget Neutrality: • Solution should be cost neutral [or at least affordable, A/N]
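The growth figures above (2x per year for Minimal sites, 2x per two years for Nominal and Leadership sites) can be turned into a simple projection. This is a minimal sketch: the doubling periods come from the slide, while the starting values chosen for Nominal (10 Gbps, the upper end of the quoted range) and Leadership (N = 2) are assumptions made only for illustration.

```python
def projected_bandwidth(start_gbps: float, years: float,
                        doubling_period_years: float) -> float:
    """Bandwidth after `years`, doubling once every `doubling_period_years`."""
    return start_gbps * 2 ** (years / doubling_period_years)


# Doubling periods taken from the requirements summary.
DOUBLING_PERIOD_YEARS = {"Minimal": 1.0, "Nominal": 2.0, "Leadership": 2.0}

# Starting points: Minimal is from the slide; the other two are assumed
# values picked from within the quoted ranges (10 Gbps, and N = 2).
START_GBPS = {"Minimal": 1.0, "Nominal": 10.0, "Leadership": 20.0}

for site_class, start in START_GBPS.items():
    period = DOUBLING_PERIOD_YEARS[site_class]
    row = [projected_bandwidth(start, year, period) for year in range(5)]
    print(site_class, [f"{gbps:.1f}" for gbps in row])
```

Under these assumptions a Minimal site grows faster in relative terms than a Nominal one, consistent with the expected migration between site classes.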
Lessons learned • The LHC OPN has proven itself; we should learn from it • Simple architecture • Point-to-point Layer 2 circuits • Flexible and scalable topology • Grew organically • From star to partial mesh • Open to several technology choices • each of which satisfies requirements • Federated governance model • Coordination between stakeholders • No single administrative body required • Made extensions and funding straightforward • Remaining challenge: monitoring and reporting • More of a systems approach
Design Inputs • By the scale, geographical distribution and diversity of the sites as well as funding, only a federated solution is feasible • The current LHC OPN is not modified • OPN will become part of a larger whole • Some purely Tier2/Tier3 operations • Architecture has to be Open and Scalable • Scalability in bandwidth, extent and scope • Resiliency in the core, allow resilient connections at the edge • Bandwidth guarantees determinism • Reward effective use • End-to-end systems approach • Operation at Layer 2 and below • Advantage in performance, costs, power consumption
Design Inputs, cont. • Most/all R&E networks (technically) can offer Layer 2 services • Where not, commercial carriers can • Some advanced ones offer dynamic (user controlled) allocation • Leverage existing infrastructures and collaborations as much as possible • GLIF, DICE, GLORIAD, … • Last but not least: • This would be the perfect occasion to start using IPv6; we should therefore (at least) encourage IPv6, but support IPv4 • Admittedly the challenge is above Layer 3
Design Proposal • A design satisfying all requirements: Switched Core with Routed Edge • Sites interconnected through Lightpaths • Site-to-site Layer 2 connections, static or dynamic • Switching is far more robust and cost-effective for high-capacity interconnects • Routing (from end-site viewpoint) is deemed necessary
Switched Core • Strategically placed core exchange points • E.g. start with 2-3 in Europe, 2 in NA, 1 in SA, 1-2 in Asia • E.g. existing devices at Tier1s, GOLEs, GEANT nodes, … • Interconnected through high capacity trunks • 10-40 Gbps today, soon 100 Gbps • Trunk links can be CBF, multi-domain Layer 1 / Layer 2 links, … • E.g. Layer 1 circuits with virtualised sub-rate channels, sub-dividing 100G links in early stages • Resiliency, where needed, provided at Layer 1 / Layer 2 • E.g. SONET/SDH Automated Protection Switching, Virtual Concatenation • At a later stage, automated Lightpath exchanges will enable a flexible “stitching” of dynamic circuits • See demonstration (proof of principle) at last GLIF meeting and SC10
One Possible Core Technology: Carrier Ethernet • IEEE standard 802.1Qay (PBB-TE) • Separation of backbone and customer network through MAC-in-MAC • No flooding, no Spanning Tree • Scalable to 16 M services • Provides OAM comparable to SONET/SDH • 802.1ag, end-to-end service OAM • Continuity Check Message, loopback, linktrace • 802.3ah, link OAM • Remote loopback, loopback control, remote failure indication • Cost effective • e.g. NSP study indicates TCO ~43% lower for COE (PBB-TE) vs MPLS-TE • 802.1Qay and the ITU-T G.8031 Ethernet Linear Protection standard provide 1+1 and 1:1 protection switching • Similar to SONET/SDH APS • Works by Y.1731 message exchange (ITU-T standard)
Routed Edge • End sites (might) require Layer 3 connectivity in the LAN • Otherwise a true Layer 2 solution might be adequate • Lightpaths terminate on a site’s router • Site’s border router, or, preferably, • Router closest to the storage elements • All IP peerings are p2p, site-to-site • Reduces convergence time, avoids issues with flapping links • Each site decides and negotiates with which remote site it desires to peer (e.g. based on experiment’s connectivity design) • Router (BGP) advertises only the SE subnet(s) through the configured Lightpath
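The last bullet, advertising only the SE subnet(s) over the Lightpath, amounts to an export filter on the site's prefixes. The sketch below illustrates that filtering logic with Python's standard `ipaddress` module; it is not a router configuration, and the function name and all prefixes (taken from the documentation address ranges) are hypothetical.

```python
import ipaddress


def prefixes_to_advertise(local_prefixes, se_subnets):
    """Return only those local prefixes that fall inside a storage-element subnet."""
    se_networks = [ipaddress.ip_network(subnet) for subnet in se_subnets]
    allowed = []
    for prefix in local_prefixes:
        network = ipaddress.ip_network(prefix)
        # Compare only networks of the same IP version; subnet_of() raises
        # TypeError when an IPv4 network is compared with an IPv6 one.
        if any(network.version == se.version and network.subnet_of(se)
               for se in se_networks):
            allowed.append(str(network))
    return allowed


# Hypothetical site: one SE subnet per IP version, plus an unrelated campus prefix.
se_subnets = ["192.0.2.0/24", "2001:db8:10::/48"]
local_prefixes = ["192.0.2.0/25", "198.51.100.0/24", "2001:db8:10::/48"]

print(prefixes_to_advertise(local_prefixes, se_subnets))
```

Keeping the campus prefix out of the advertisement means only storage traffic is attracted onto the Lightpath, while everything else continues to use the general IP path.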
Lightpath termination • Avoid LAN connectivity issues when terminating the lightpath at the campus edge • Lightpath should be terminated as close as possible to the Storage Elements, but this can be challenging if not impossible (support a dedicated border router?) • Or, provide a “local lightpath” (e.g. a VLAN with proper bandwidth, or a dedicated link where possible); the border router does the “stitching”
IP backup • Foresee IP routed paths as backup • End-site’s BR is configured for both default IP connectivity, and direct peering through Lightpath • Direct peering takes precedence • Works also for dynamic Lightpaths • For full dynamic Lightpath setup, dynamic end-site configuration through e.g. LambdaStation or TeraPaths will be used
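The precedence rule on this slide (direct peering over the Lightpath first, default IP connectivity as backup) can be sketched as a small selection function. This is an illustration of the decision logic only, not of how a border router implements it; the `Path` class, the kind labels and the path names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Path:
    name: str
    kind: str   # "lightpath" (direct peering) or "default-ip" (routed backup)
    up: bool    # whether the path is currently operational


# Lower value means higher precedence; direct peering wins when available.
PREFERENCE = {"lightpath": 0, "default-ip": 1}


def select_path(paths: List[Path]) -> Optional[Path]:
    """Pick the most preferred operational path, or None if all are down."""
    candidates = [path for path in paths if path.up]
    if not candidates:
        return None
    return min(candidates, key=lambda path: PREFERENCE[path.kind])


paths = [
    Path("routed via NREN", "default-ip", up=True),
    Path("lightpath to remote Tier2", "lightpath", up=True),
]
print(select_path(paths).name)   # lightpath is preferred while it is up

paths[1].up = False              # lightpath fails or has not been set up yet
print(select_path(paths).name)   # traffic falls back to the routed path
```

The same logic covers dynamic Lightpaths: while no circuit is provisioned, the lightpath entry is simply down and the routed path carries the traffic.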
Resiliency • Resiliency in the core is provided by protection switching depending on technology used between core nodes • SONET/SDH or OTN protection switching (Layer 1) • MPLS failover • PBB-TE protection switching • Ethernet LAG • Sites can opt for additional resiliency (e.g. where protected trunk links are not available) by forming transit agreements with other sites • akin to the current LHC OPN use of CBF
Scalability • Assuming Layer 2 point-to-point operations, a natural scalability limitation is the 4k VLAN IDs • This problem is naturally resolved in • PBB-TE (802.1Qay), through MAC-in-MAC encapsulation • Dynamic bandwidth allocation with re-use of VLAN IDs • The only constraint is that no two connections through the same network element may use the same VLAN [Figure: PBB-TE frame layout: B-DA | B-SA | Ethertype 0x88A8 | B-VID | Ethertype 0x88E7 | I-SID | Customer Frame incl. Header+FCS | B-FCS]
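The frame layout at the end of this slide explains where the two scaling numbers come from: a 12-bit VLAN ID gives the "4k" limit, while the 24-bit I-SID gives the "16 M services" quoted for PBB-TE. The sketch below packs a simplified backbone header in that field order. It is a hedged illustration: priority and flag bits are left at zero, the customer frame and B-FCS are omitted, and the function name, MAC addresses and IDs are hypothetical.

```python
import struct

VLAN_ID_BITS = 12   # B-VID / VLAN ID width -> 4096 values
I_SID_BITS = 24     # service instance ID width -> 16,777,216 values


def pbb_header(b_da: bytes, b_sa: bytes, b_vid: int, i_sid: int) -> bytes:
    """Pack B-DA, B-SA, B-TAG (0x88A8) and I-TAG (0x88E7) in wire order."""
    if len(b_da) != 6 or len(b_sa) != 6:
        raise ValueError("backbone MAC addresses must be 6 bytes each")
    if not 0 <= b_vid < 2 ** VLAN_ID_BITS:
        raise ValueError("B-VID must fit in 12 bits")
    if not 0 <= i_sid < 2 ** I_SID_BITS:
        raise ValueError("I-SID must fit in 24 bits")
    # Simplification: priority/drop-eligibility bits are all left at zero.
    b_tag = struct.pack("!HH", 0x88A8, b_vid)
    # Simplification: I-TAG flag bits are zero; the I-SID is the low 24 bits.
    i_tag = struct.pack("!HI", 0x88E7, i_sid)
    return b_da + b_sa + b_tag + i_tag


header = pbb_header(bytes.fromhex("020000000001"),
                    bytes.fromhex("020000000002"),
                    b_vid=100, i_sid=70000)
print(len(header), header.hex())
print(2 ** VLAN_ID_BITS, 2 ** I_SID_BITS)
```

Because the customer frame travels unmodified behind this header, customer VLAN IDs do not consume the backbone's VLAN space.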
How do End-Sites Connect? A Simple Example • A Tier2 in Asia needs 1 Gbps connectivity (each) to 2 sites in Europe, 2 in US and the ASGC Tier1 • 5 x 1G intercontinental circuits is cost-prohibitive • The Tier2 could however afford a 1-2 Gbps (e.g. EoMPLS) circuit to the next GOLE (e.g. HKOP, KRLight, TaiwanLight, T-LEX) • Through NREN(s) or commercial circuits • The GOLE connects to StarLight, NetherLight (trunks) and has a connection to ASGC (example) • Static bandwidth allocation (first stage): • The end-site has a 1 Gbps link, with 5 VLANs, each one terminating at one of the desired remote sites • Bandwidth is allocated by the exchange points to fit the needs • Dynamic allocation (early adopter + later stage): • The end-site has a 1 Gbps link, with configurable remote end-points and bandwidth allocation
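For the static first stage in this example, five VLANs share one 1 Gbps access link. The slide does not say how the exchange points divide the bandwidth, so the sketch below assumes one simple policy, proportional scaling of the requested rates down to the link capacity. The function name and the remote-site labels are hypothetical.

```python
def allocate_static(link_capacity_mbps: float, requested_mbps: dict) -> dict:
    """Scale per-VLAN requests proportionally so they fit on the access link.

    Assumed policy (not from the slides): if the requests fit, grant them
    as-is; otherwise shrink every request by the same factor.
    """
    total_requested = sum(requested_mbps.values())
    if total_requested <= link_capacity_mbps:
        return dict(requested_mbps)
    scale = link_capacity_mbps / total_requested
    return {site: rate * scale for site, rate in requested_mbps.items()}


# The Tier2 from the example wants 1 Gbps to each of five remote sites,
# but has a single 1 Gbps circuit to its GOLE (one VLAN per remote site).
requests = {
    "Europe site A": 1000.0,
    "Europe site B": 1000.0,
    "US site A": 1000.0,
    "US site B": 1000.0,
    "ASGC Tier1": 1000.0,
}

for site, rate in allocate_static(1000.0, requests).items():
    print(f"{site}: {rate:.0f} Mbps")
```

With equal requests each VLAN ends up with a fifth of the link, which is the limitation the dynamic allocation stage is meant to remove by giving bandwidth to whichever remote end-point is in use.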
Monitoring and Reporting • Pervasive monitoring of status and utilisation is a must! • Robust (100% monitoring up-time) • Resilient • Reliable • Real-time • End-to-end • Candidate 1: MonALISA monitoring system, used in US LHCNet, and at large scale e.g. in the ALICE experiment • From US LHCNet experience: it has all the components, and is proven to be scalable to satisfy the requirements • See e.g. LHC OPN presentation on MonALISA in US LHCNet: http://indico.cern.ch/getFile.py/access?subContId=1&contribId=15&resId=0&materialId=slides&confId=80755 • Candidate 2: PerfSONAR, building on a set of community developed tools
Dynamic Lightpaths - Intro • Kors’ requirements document: “[…] the backbone does not need to support all possible connections at full speed all the time. The backbone does need to support several full speed connections between the leadership Tier2s simultaneously.” • Dynamic Lightpaths provide temporary bandwidth allocation on an as-needed basis • Connection reservation between any pair of sites for the requested amount of time (only) • Deployed in several R&E networks (ESnet, Internet2, SURFnet, US LHCNet) • Pilots being prepared in others (GEANT + selected NRENs) • DYNES instrument, interconnecting ~40 US campuses, will start deployment in early 2011
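Reserving a connection "for the requested amount of time (only)" implies some form of admission control on the shared trunks. The sketch below shows one minimal way to model that: a single trunk that accepts a time-windowed reservation only if capacity is never exceeded during the window. It is an assumption-laden illustration, not the behaviour of any of the deployed services named above; all class names, site names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    src: str
    dst: str
    gbps: float
    start: int   # start hour (inclusive)
    end: int     # end hour (exclusive)


class Trunk:
    """A single trunk with fixed capacity and time-windowed reservations."""

    def __init__(self, capacity_gbps: float):
        self.capacity_gbps = capacity_gbps
        self.reservations = []

    def peak_load(self, start: int, end: int) -> float:
        """Highest already-reserved bandwidth at any time in [start, end)."""
        # Load can only increase at a reservation's start time, so it is
        # enough to check the window start and each start inside the window.
        points = {start} | {r.start for r in self.reservations
                            if start < r.start < end}
        return max(sum(r.gbps for r in self.reservations
                       if r.start <= t < r.end)
                   for t in points)

    def reserve(self, request: Reservation) -> bool:
        """Admit the request only if it fits for its whole duration."""
        if request.start >= request.end:
            raise ValueError("reservation must have a positive duration")
        load = self.peak_load(request.start, request.end)
        if load + request.gbps > self.capacity_gbps:
            return False
        self.reservations.append(request)
        return True


trunk = Trunk(capacity_gbps=10.0)
print(trunk.reserve(Reservation("Tier2-A", "Tier2-B", 6.0, start=0, end=4)))
print(trunk.reserve(Reservation("Tier2-C", "Tier2-D", 6.0, start=2, end=6)))
print(trunk.reserve(Reservation("Tier2-C", "Tier2-D", 6.0, start=4, end=8)))
```

The second request is refused because it overlaps the first, while the same request shifted to start after the first one ends is accepted, which is how time-limited reservations let several full-rate transfers share a trunk without a permanent full mesh.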
Dynamic Lightpaths in the proposed architecture • Dynamic Network Resource Allocation is a powerful tool to avoid permanent full-mesh topology, while providing flexible connectivity and resource guarantees between end-systems • Requires integration in the experiments’ software stack • We foresee to include dynamic allocation in the final design, complementing static Lightpaths between Leadership sites • Starting with early adopters, including DYNES-connected sites
DYNES Overview • What is DYNES? • A U.S.-wide dynamic network “cyber-instrument” spanning ~40 US universities and ~14 Internet2 connectors • Extends Internet2’s dynamic network service “ION” into U.S. regional networks and campuses; aims to support LHC traffic (also internationally) • Based on the implementation of the Inter-Domain Circuit protocol developed by ESnet and Internet2; cooperative development also with GEANT, GLIF • Who is it? • Collaborative team: Internet2, Caltech, Univ. of Michigan, Vanderbilt • The LHC experiments, astrophysics community, WLCG, OSG, other VOs • The community of US regional networks and campuses • What are the goals? • Support large, long-distance scientific data flows in the LHC, other programs (e.g. LIGO, Virtual Observatory), & the broader scientific community • Build a distributed virtual instrument at sites of interest to the LHC but available to the R&E community generally
DYNES Team • Internet2, Caltech, Vanderbilt, Univ. of Michigan • PI: Eric Boyd (Internet2) • Co-PIs: • Harvey Newman (Caltech) • Paul Sheldon (Vanderbilt) • Shawn McKee (Univ. of Michigan) http://www.internet2.edu/dynes
DYNES System Description • AIM: extend hybrid & dynamic capabilities to campus & regional networks. • A DYNES instrument must provide two basic capabilities at the Tier2s, Tier3s and regional networks: • Network resource allocation such as bandwidth to ensure transfer performance • Monitoring of the network and data transfer performance • All networks in the path require the ability to allocate network resources and monitor the transfer. This capability currently exists on backbone networks such as Internet2 and ESnet, but is not widespread at the campus and regional level. • In addition Tier2 & Tier3 sites require: • Hardware at the end sites capable of making optimal use of the available network resources • Two typical transfers that DYNES supports: one Tier2-Tier3 and another Tier1-Tier2. • The clouds represent the network domains involved in such a transfer.
DYNES: Regional Network - Instrument Design • Regional networks require • An Ethernet switch • An Inter-domain Controller (IDC) • The configuration of the IDC consists of OSCARS, DRAGON, and perfSONAR. This allows the regional network to provision resources on-demand through interaction with the other instruments • A regional network does not require a disk array or FDT server because it provides transport for the Tier 2 and Tier 3 data transfers rather than initiating them. At the network level, each regional connects the incoming campus connection to the Ethernet switch provided. Optionally, if a regional network already has a qualified switch compatible with the dynamic software and prefers to use it, it may do so instead of, or in addition to, the provided equipment. The Ethernet switch provides a VLAN dynamically allocated by OSCARS & DRAGON. The VLAN has quality of service (QoS) parameters set to guarantee the bandwidth requirements of the connection as defined in the VLAN. These parameters are determined by the original circuit request from the researcher / application. Through this VLAN, the regional provides transit between the campus IDCs connected in the same region or to the global IDC infrastructure.
DYNES: Tier2 and Tier3 Instrument Design • Each DYNES (sub-)instrument at a Tier2 or Tier3 site consists of the following hardware, combining low cost & high performance: • An Inter-domain Controller (IDC) • An Ethernet switch • A Fast Data Transfer (FDT) server. Sites with 10GE throughput capability will have a dual-port Myricom 10GE network interface in the server. • An optional attached disk array with a Serial Attached SCSI (SAS) controller capable of several hundred MBytes/sec to local storage. The Fast Data Transfer (FDT) server connects to the disk array via the SAS controller and runs FDT software developed by Caltech. FDT is an asynchronous multithreaded system that automatically adjusts I/O and network buffers to achieve maximum network utilization. The disk array stores datasets to be transferred among the sites in some cases. The FDT server serves as an aggregator / throughput optimizer in this case, feeding smooth flows over the networks directly to the Tier2 or Tier3 clusters. The IDC server handles the allocation of network resources on the switch, interactions with other DYNES instruments related to network provisioning, and network performance monitoring. The IDC creates virtual LANs (VLANs) as needed.
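The slide says FDT adjusts I/O and network buffers to reach maximum utilization but does not describe how. As background only, the sketch below computes the bandwidth-delay product, the general rule of thumb for how much data must be in flight to keep a long-distance path full. It is not FDT's algorithm; the function name and the example link speeds and round-trip times are assumptions.

```python
def bandwidth_delay_product_bytes(bandwidth_gbps: float, rtt_ms: float) -> int:
    """Bytes that must be in flight to fill a path of given rate and RTT."""
    bytes_per_second = bandwidth_gbps * 1e9 / 8
    return round(bytes_per_second * rtt_ms / 1e3)


# Hypothetical paths: a 10GE server over a long intercontinental route,
# and a 1 Gbps site over a shorter one.
for gbps, rtt_ms in [(10.0, 150.0), (1.0, 100.0)]:
    bdp = bandwidth_delay_product_bytes(gbps, rtt_ms)
    print(f"{gbps:g} Gbps x {rtt_ms:g} ms RTT -> {bdp / 1e6:.1f} MB in flight")
```

Numbers of this size, far above typical default socket buffers, are one reason dedicated, tuned transfer servers are placed at the end sites.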
How can DYNES be leveraged? • The Internet2 ION service currently has end-points at two GOLEs in the US: MANLAN and StarLight • A static Lightpath from any end-site to one of these two Lightpath Exchanges can be extended through ION to any of the DYNES sites (LHC Tier2 or Tier3)
Governance structure • The global scale of the LHC network basically excludes a single administrative/management unit • Needs to be under LHC community’s control • Capacity planning • Exchange point placement • Open, federated governance • Stakeholders in LHC computing shall be able to participate and contribute • LHC computing sites (Tier0/1/2/3) (directly? through WLCG? GDB?) • R&E networks • One coordinating body (open participation) • Meet regularly • Define and oversee service levels • Perform planning functions • MoUs with exchange point operators
Funding • Each site is responsible for assuring funding for its own • End-site equipment (possibly a router or port costs on campus BR) • Layer 2 connection to the next Lightpath exchange point • Monitoring device • Core network will necessitate some shared funding • Centrally organised? • Defining exchange point placement and core trunk capacities • On regional basis? • By end-sites connecting to same exchange point
Summary • We propose a robust, scalable and comparatively low-cost solution based on a switched core with routed edge architecture • Core consists of a sufficient number of strategically placed exchange points interconnected by properly sized trunk circuits • Scaling rapidly with time as in requirements document • IP routing is implemented at the end-sites • Sites are responsible for securing proper funding for their connectivity to the core • Initial deployment to use predominantly static Lightpaths, later predominantly using dynamic resource allocation • A federated governance model has to be used due to global geographical extent and diversity of funding sources
Questions? Artur.Barczyk@cern.ch