Presentation Transcript


1. Caltech Computing and Network Systems for CMS: Network-Integrated Computing Model for LHC and HL-LHC, and DOE Exascale HPC Systems in HEP’s Data Intensive Ecosystem + INDUSTRY. New Windows On the Universe. Harvey Newman and Maria Spiropulu, for the Caltech CMS Group. Meeting with Jason Roos (KAUST), September 6, 2017

2. Grid Systems for High Energy Physics
• Our team has continued to be the creative force behind many of the key network-related computing concepts and system deployments underpinning the LHC program, and the preceding LEP program (1984-2000)
• The LEP Computing Model, 1984-1994
• The LHC Computing Model, 1998 to the present
• The first purely software-defined global web collaborative systems: VRVS (1996) → EVO (2006) → Seevogh (2012- ) → ViewMe (2015- )
• Led the MONARC project (Models Of Networked Analysis at Regional Centres), which defined the Computing Model of the LHC experiments
• Developed the MONARC Simulation System; this led to the MonALISA system to monitor and control real global-scale systems, and to Grid Enabled Data Analysis Environments

3. LHC Data Grid Hierarchy: A Worldwide System Invented and Developed at Caltech (1999)
[Diagram: the tiered LHC data grid. The Online System of each experiment feeds the CERN Tier 0+1 center (PBs of disk; tape robot) at ~PByte/sec; Tier 1 centers (e.g. Chicago, Paris, Taipei, London) receive ~300-1500 MBytes/sec over 10-40 to 100 Gbps links; Tier 2 centers (e.g. CACR) connect at 10 to N x 10 Gbps; Tier 3 institute clusters with physics data caches connect at 1 to 10 Gbps; Tier 4 workstations sit at the edge. Today: 11 Tier1, 160 Tier2 and ~300 Tier3 centers. Vision: a global real-time system, in synergy with US LHCNet and ESnet state-of-the-art data networks. A rough transfer-time sketch follows this slide.]
The Culture of Innovation and Partnership continues, fostering a new generation of networks: LHCONE, ANSE
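The bandwidth labels above are easier to interpret as transfer times. The snippet below is a minimal back-of-the-envelope sketch: the link speeds are taken from the diagram labels, while the 80% sustained-efficiency factor and the 100 TB dataset size are illustrative assumptions, not figures from the slide.

```python
# Back-of-the-envelope transfer-time estimate for the tier link speeds
# quoted on the slide (illustrative values and efficiency only).

TIER_LINKS_GBPS = {
    "CERN Tier 0/1 -> Tier 1": 100,   # 10-40 to 100 Gbps; take the upper end
    "Tier 1 -> Tier 2":        40,    # 10 to N x 10 Gbps
    "Tier 2 -> Tier 3":        10,    # 1 to 10 Gbps
}

def transfer_hours(dataset_tb, link_gbps, efficiency=0.8):
    """Hours to move dataset_tb terabytes over a link_gbps link,
    assuming the link sustains `efficiency` of its nominal rate."""
    bits = dataset_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600.0

for link, gbps in TIER_LINKS_GBPS.items():
    print(f"{link}: 100 TB in ~{transfer_hours(100, gbps):.1f} h at {gbps} Gbps")
```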

4. The Caltech Network Team: 35 Years Working Closely with Many Partners
• 1982: Caltech initiated transatlantic networking for HEP
• 1985-86: Networks for Science: NSFNET, IETF, National Academy Panel
• 1986-88: Assigned by DOE to build and operate LEP3Net, the first US-CERN leased line; a multiprotocol network for HEP (9.6-64 kbps)
• 1987-88: Hosted IBM, which provided the first T1 transatlantic US-CERN link ($3M/yr)
• 1989-1995: Upgrades to LEP3Net (X.25, DECnet, TCP/IP): 64-512 kbps
• 1996-2005: USLIC Consortium (Caltech, CERN, IN2P3, WHO, UNICC), based on 2 Mbps links, then ATM, then IP optical links
• 1997: Hosted the Internet2 CEO; CERN became Internet2’s first international member
• 1996-2000: Created the LHC Computing Model (MONARC) and the Tier2 concept
• 2002-Present: HN chairs the Standing Committee on Inter-regional Connectivity: network roadmaps, advanced technology, Digital Divide
• 2006-2015: US LHCNet, co-managed by Caltech and CERN; a resilient virtualized transatlantic network service, with Caltech holding primary operations and management responsibility
• 2012-15: A second LA-campus fiber pair donated by Level3 for advanced R&D
• 2015-17: CHOPIN (HEP + IMSS): 100G link to campus with the help of CENIC
Our team has continued to push the frontiers of network development and has fostered successful collaborations in the US and worldwide

5. US LHCNet 1995-2015: High Performance in Challenging Environments
High-availability transoceanic solutions with multiple links:
• Intercontinental links are more complex than terrestrial ones
• More fiber spans, more equipment; multiple owners
• Hostile submarine environment
• A week to months to repair
[Chart: US LHCNet link availability, shown alongside EEX and ESnet5.] US LHCNet target service availability: 99.95% (a quick downtime calculation follows this slide)
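For context, a quick calculation (not from the slide) of how much downtime a 99.95% availability target allows per year:

```python
# What a 99.95% service-availability target allows in downtime.
target = 0.9995
hours_per_year = 365 * 24                       # 8760 hours
allowed_downtime_h = hours_per_year * (1 - target)
print(f"Allowed downtime: ~{allowed_downtime_h:.1f} hours/year "
      f"(~{allowed_downtime_h * 60 / 12:.0f} minutes/month)")
```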

6. Caltech CMS Group’s Hybrid Physics/Engineering Team
• A team with expertise in multiple areas, focused on R&D leading to production improvements in data and network operations
• For the LHC experiments now and in future LHC Runs
• Feeding into advanced planning for the next Runs, into the HL-LHC era
• New concepts: intelligent network-integrated systems; new modes of operation for DOE’s Leadership HPC facilities with petabyte datasets
Areas of Expertise:
• Integration of network awareness and control into the experiments’ data and workflow management, in Caltech’s OliMPS, ANSE, SDN NGenIA and SENSE projects
• SANDIE: exploration of Named Data Networking (NDN) as a possible architecture replacing the current Internet, with leading NDN groups (NEU, Colorado State, UCLA), including an NDN testbed for climate science and HEP
• Development (as in US LHCNet) of software-driven multilayer dynamic circuits and programmable optical patch panels: a virtualized connection service applicable to mega-datacenters and cloud computing
• State-of-the-art long-distance high-throughput data transfers
• Pervasive real-time monitoring of networks and end-systems
• Real-time monitoring of hundreds of XRootD data servers used by the LHC experiments, supported by the MonALISA system
• Autonomous steering and control of large distributed systems using robust agent-based architectures
• Software-defined networks with end-to-end flow control

7. Key Developments: Ongoing
Creating and developing the concepts and projects that define networking for the LHC program and beyond:
• The evolving LHC Computing Model; the Tier2 and Tier3 roles
• The LHC Open Network Environment (LHCONE)
• State-of-the-art high-throughput methods, together with first use of emerging server and network technologies
• Scaling from 1G to 10G to 100G+ sets of flows, from 1999 onward
• Global-scale real-time monitoring systems (MonALISA)
• Dynamic circuits to Tier2s and Tier3s at more than 40 US campuses (DYNES)
• Network-aware integrated data management (ANSE; PanDA, PhEDEx)
• Software-defined networking: OpenFlow (OliMPS, ANSE); advanced forwarding methods with OpenDaylight (Cisco Research)
• Next-generation SDN-integrated architectures for HEP and Exascale science: SDN NGenIA, SENSE

8. Vision: Next-Gen Integrated Systems for Exascale Science: a Major Opportunity
Opportunity: exploit the synergy among:
1. Global operations, data and workflow management systems developed by the HEP programs (WLCG), built to respond to both steady-state and peak demands, and evolving to work with increasingly diverse (HPC) and elastic (Cloud) resources
2. Deeply programmable, agile software-defined networks (SDN), emerging as multidomain network operating systems (e.g. SENSE and SDN NGenIA; multidomain, multicontroller SDN)
3. Machine learning, modeling and game theory: extract key variables; optimize; move to real-time self-optimizing workflows with Reinforcement Learning (see the sketch below)
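The sketch below is a toy illustration of the third ingredient, not the group’s actual system: an epsilon-greedy reinforcement-learning agent that learns which of several candidate WAN paths gives the best observed throughput. The path names and "true" rates are invented for the example.

```python
# Toy illustration (not the production system) of reinforcement learning
# applied to flow steering: an epsilon-greedy agent learns which of several
# candidate WAN paths yields the best observed throughput.
import random

PATHS = ["path_A", "path_B", "path_C"]          # hypothetical candidate routes
estimates = {p: 0.0 for p in PATHS}             # running throughput estimates (Gbps)
counts = {p: 0 for p in PATHS}
EPSILON = 0.1                                   # exploration rate

def measure_throughput(path):
    """Stand-in for a real transfer measurement (e.g. from monitoring)."""
    true_rates = {"path_A": 60, "path_B": 85, "path_C": 40}   # hidden "truth"
    return random.gauss(true_rates[path], 10)

for step in range(500):
    # Explore occasionally, otherwise exploit the best-known path.
    if random.random() < EPSILON:
        path = random.choice(PATHS)
    else:
        path = max(estimates, key=estimates.get)
    observed = measure_throughput(path)
    counts[path] += 1
    # Incremental mean update of the throughput estimate.
    estimates[path] += (observed - estimates[path]) / counts[path]

print({p: round(v, 1) for p, v in estimates.items()})
```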

9. A New “Consistent Operations” Paradigm for HEP and Exascale Science
METHOD: construct autonomous network-resident services that dynamically interact with site-resident services, and with the experiments’ principal data distribution and management tools, to coordinate the use of network, storage and compute resources, using:
• Smart middleware to interface to SDN-orchestrated data flows over network paths with allocated bandwidth levels, all the way to a set of high-performance end-host data transfer nodes (DTNs)
• Protocol-agnostic traffic shaping services at the site edges and in the network core, coupled to high-throughput transfer applications providing stable, predictable data transfer rates (a shaping sketch follows this slide)
• Machine learning, system modeling and pervasive end-to-end monitoring, to track, diagnose and optimize system operations on the fly
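As one concrete (and hypothetical) illustration of protocol-agnostic shaping at a site edge, a Linux token-bucket filter can cap a DTN’s egress at a negotiated rate. The interface name and rate below are placeholders, not values from the slide, and this is only one possible shaping mechanism.

```python
# Hypothetical illustration of protocol-agnostic traffic shaping at a site
# edge: cap a DTN's egress to a negotiated rate with a Linux token-bucket
# filter (tbf). Interface name and rate are placeholders, not from the slide.
import subprocess

def shape_egress(interface="eth0", rate="20gbit", burst="128mb", latency="400ms"):
    """Apply (or replace) a tbf root qdisc on `interface`, capping egress at `rate`."""
    cmd = ["tc", "qdisc", "replace", "dev", interface, "root",
           "tbf", "rate", rate, "burst", burst, "latency", latency]
    subprocess.run(cmd, check=True)            # requires root privileges

def clear_shaping(interface="eth0"):
    """Remove the shaping qdisc, returning the interface to its default."""
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=True)

if __name__ == "__main__":
    shape_egress()                             # e.g. after an allocation is granted
```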

  10. Service Diagram: LHC Pilot

11. SENSE: SDN for End-to-end Networked Science at the Exascale
Partners: ESnet, Caltech, Fermilab, LBNL/NERSC, Argonne, Maryland
Mission Goals:
• Improve end-to-end performance of science workflows
• Enable new paradigms: e.g. creating dynamic distributed ‘Superfacilities’
Comprehensive Approach: an end-to-end SDN Operating System, with:
• Intent-based interfaces, providing intuitive access to intelligent SDN services (a hedged example follows this slide)
• Policy-guided end-to-end orchestration of resources
• Auto-provisioning of network devices and Data Transfer Nodes
• Network measurement, analytics and feedback to build resilience
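The SENSE interfaces themselves are not reproduced here; the snippet below is only a hypothetical sketch of what an intent-based path request might look like. The endpoint URL, DTN host names and field names are invented for illustration and are not the actual SENSE API.

```python
# Hypothetical sketch of an intent-based request to an SDN orchestration
# service: the endpoint URL and field names are invented for illustration
# and are not the actual SENSE API.
import requests

intent = {
    "source":          "dtn01.tier2.example.edu",   # placeholder DTN names
    "destination":     "dtn07.lab.example.gov",
    "bandwidth_gbps":  100,
    "start":           "2017-09-06T18:00:00Z",
    "duration_hours":  6,
    "policy":          "guaranteed-minimum",
}

resp = requests.post("https://orchestrator.example.org/api/intents",
                     json=intent, timeout=30)
resp.raise_for_status()
print("Provisioned path id:", resp.json().get("path_id"))
```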

12. SC15-17: SDN Next-Generation Terabit/sec Ecosystem for Exascale Science (supercomputing.caltech.edu)
• SDN-driven flow steering, load balancing and site orchestration over Terabit/sec global networks
• SC16+: Consistent Operations with agile feedback; major science flow classes up to high water marks
• Preview of petabyte transfers to and from the site edges of Exascale facilities, with 100G-1000G DTNs (a transfer-time estimate follows this slide)
• LHC at SC15: Asynchronous Stageout (ASO) with Caltech’s SDN controller
• Tbps ring for SC17: Caltech, Ciena, SCinet, OCC/StarLight + many HEP, network and vendor partners at SC16
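To put petabyte transfers with 100G-1000G DTNs in perspective, a quick calculation (illustrative, not from the slide); the efficiency factors are assumptions:

```python
# How long a 1 PB transfer takes at DTN line rates of 100 Gbps and 1 Tbps,
# assuming the stated fraction of the nominal rate is sustained end to end.
PETABYTE_BITS = 1e15 * 8

for gbps in (100, 1000):
    for efficiency in (1.0, 0.8):
        hours = PETABYTE_BITS / (gbps * 1e9 * efficiency) / 3600
        print(f"1 PB at {gbps} Gbps, {efficiency:.0%} efficiency: ~{hours:.1f} h")
```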

13. SDN State-of-the-Art Development Testbed
Caltech, Fermilab, StarLight, Michigan, UNESP; + CERN, Amsterdam, Korea
• 13+ OpenFlow switches: Dell, Pica8, Inventec, Brocade, Arista, Huawei (a minimal OpenFlow example follows this slide)
• Many 40G, N x 40G and 100G servers: Dell, Supermicro, 2CRSI, Echostreams; and 40G and 100G network interfaces: Mellanox, QLogic
• Caltech equipment funded through the NSF DYNES, ANSE and CHOPIN projects, and vendor donations
https://sdnlab.hep.caltech.edu
[Figure: real-time auto-discovered SDN testbed topology, spanning UMich and StarLight among other sites.]
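For readers unfamiliar with OpenFlow, the minimal controller application below installs a table-miss rule so that unmatched packets are sent to the controller. It is written against the open-source Ryu framework as an illustrative choice; it is not necessarily the controller used on this testbed (the slides mention OpenDaylight and OliMPS).

```python
# Minimal OpenFlow 1.3 controller app (Ryu framework, illustrative choice):
# on switch connect, install a table-miss flow entry that forwards
# unmatched packets to the controller.
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3

class MinimalSwitch(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def switch_features_handler(self, ev):
        datapath = ev.msg.datapath
        ofproto = datapath.ofproto
        parser = datapath.ofproto_parser
        # Match everything; send unmatched packets to the controller.
        match = parser.OFPMatch()
        actions = [parser.OFPActionOutput(ofproto.OFPP_CONTROLLER,
                                          ofproto.OFPCML_NO_BUFFER)]
        inst = [parser.OFPInstructionActions(ofproto.OFPIT_APPLY_ACTIONS, actions)]
        mod = parser.OFPFlowMod(datapath=datapath, priority=0,
                                match=match, instructions=inst)
        datapath.send_msg(mod)
```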

14. New Paradigms for Exascale Science
• Tbps SDN intercontinental steered flows for LHC and LSST
• Kytos controller: LSST
• Multicontroller, multi-domain SDN
• SDN NGenIA and SENSE
• LHC orchestrators
• HEPCloud with caching; dynamic paths
• Steering with real-time analytics + ML (with Tofino white boxes)
• Machine learning platforms
• Immersive VR with an AI guide for HEP
SC16: thanks to Echostreams and Orange Labs Silicon Valley
Partners: Fermilab, PRP, UCSD, Argonne, StarLight, NRL, FIU, UNESP, NERSC/LBL, CERN, AT&T; SCinet, Ciena, ESnet, Internet2, CENIC, AmLight, RNP, ANSP; Dell, 2CRSI, Arista
[Photo: view of the Caltech booths at SC16; larger booths at SC17.]

  15. Caltech + OCC Booths and SCinet at SC17

16. Caltech, OCC and Partners at SC17
• ~3 Tbps each at the Caltech and OCC booths
• 1+ Tbps between the booths and ~3 Tbps to the WAN
• Caltech booth: 200G dedicated to the Caltech campus; 300G to the PRP (UCSD, Stanford, UCSC, et al.); 300G to Brazil + Chile via Miami; 200G to ESnet
• Waveserver Ai + other DCIs in the booths: N x 100GE to 400G, 200G waves
→ Microcosm: Creating the Future. SCinet and the Future of Networks for Science

17. Computing and Networking at Caltech: LHC Data Production and Analysis + Next-Gen Developments
• The Caltech CMS Group, which designed and developed the first (MONARC) LHC Computing Model and the Tier2 Center in 1998-2001
• Operates, maintains and develops several production- and R&D-oriented computing, storage and network facilities on campus:
• The Caltech Tier2 facility at Powell-Booth and Lauritsen Lab (the first of 160 today) provides substantial and reliable computing resources to US CMS and CMS (67 kHS06 CPU, 4.8 petabytes of storage) in the worldwide LHC Grid
• The group oversees and develops the campus’ 100G connections to national and international networks through the NSF CHOPIN project, together with IMSS
• Leading-edge software-defined network (SDN) and Data Transfer Node testbeds linked to CENIC, ESnet, Internet2, StarLight and key LHC sites
• The Caltech Tier2 was the first to deploy a 100G uplink and to meet the 20G+ throughput milestone, and helped other US CMS groups achieve the milestone
[Photo: the Caltech Tier2+3 at Powell-Booth; connections include a 100GE link from the CHOPIN project.]

  18. Backup Slides Follow

19. Tier2 Testbed: Targeting Improved Support for Data Federations and Data Ops
• We have developed and tested the use of HTTP data federations analogous to today’s AAA production infrastructure
• In collaboration with the AAA project, HTTP support has been enabled at the US redirector and the Caltech testbed
• As part of the project, the Caltech group has developed a plugin for CMSSW based on CERN/IT’s libdavix library, and is working on HTTP access with much higher performance, as well as better integration of HTTP-over-XRootD in the OSG software stack (a hedged access sketch follows this slide)
• The group is also targeting improved file-system performance, using well-engineered instances of the CEPH File System: known for high performance; engineered with SSDs + SAS3 storage arrays + 40G and 100G network interfaces
• Partnering with Michigan (OSIRIS) and UCSD; Samsung, HGST, 2CRSI
• We have set up multi-GPU systems for machine learning with support from NVIDIA, Orange Labs Silicon Valley and Echostreams
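As a rough illustration of HTTP access to a federated file, a client can ask a redirector for a logical file name and simply follow HTTP redirects to whichever site holds a replica. The redirector host and file path below are placeholders, not the actual AAA or US redirector endpoints, and this sketch uses plain HTTP rather than the CMSSW/libdavix plugin described above.

```python
# Rough illustration of HTTP access to a federated dataset: the client asks
# a redirector for a file and follows HTTP redirects to a site that holds a
# replica. Host name and file path are placeholders, not real endpoints.
import requests

REDIRECTOR = "https://redirector.example.org:1094"        # placeholder host
LFN = "/store/data/Run2017/EXAMPLE/file.root"             # placeholder LFN

with requests.get(REDIRECTOR + LFN, stream=True, allow_redirects=True,
                  timeout=60) as resp:
    resp.raise_for_status()
    print("Served by:", resp.url)              # site chosen by the federation
    with open("file.root", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):   # 1 MiB chunks
            out.write(chunk)
```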
