Advanced Network Services for Experiments (ANSE) Introduction: Progress, Outlook and Issues. Harvey B Newman, California Institute of Technology. ANSE Annual Meeting, Vanderbilt, May 14, 2014. ANSE: Advanced Network Services for (LHC) Experiments.
ANSE: Advanced Network Services for (LHC) Experiments
• NSF CC-NIE funded, 4 US institutes: Caltech, Vanderbilt, Michigan, UT Arlington. A US ATLAS / US CMS collaboration
• Goal: provide more efficient, deterministic workflows
• Method: interface advanced network services, including dynamic circuits, with the LHC data management systems:
  • PanDA in (US) ATLAS
  • PhEDEx in (US) CMS
• Includes leading personnel for the data production systems:
  • Kaushik De (PanDA Lead)
  • Tony Wildish (PhEDEx Lead)
Performance measurements with PhEDEx and FDT for CMS
• FDT sustained rates: ~1500 MB/sec
• Average over 24 hrs: ~1360 MB/sec
  • Difference due to delay in starting jobs
  • Bumpy plot due to binning and job size
• PhEDEx testbed in ANSE: T2_ANSE_Geneva & T2_ANSE_Amsterdam
  • High-capacity link with dynamic circuit creation between storage nodes
  • PhEDEx and storage nodes separate
  • 4x4 SSD RAID 0 arrays, 16 physical CPU cores / machine
[Figure: 24-hour throughput reported by PhEDEx (MBytes/sec) and throughput as reported by MonALISA (1h moving average), both plotted against time of day]
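The gap between the sustained rate and the 24-hour average, and the 1h moving average used in the MonALISA plot, both come down to simple arithmetic. The following is a minimal Python sketch, assuming the whole shortfall comes from idle gaps between jobs (as the slide suggests). The function names are illustrative and are not part of PhEDEx, FDT or MonALISA.

```python
from collections import deque


def moving_average(samples, window):
    """Trailing moving average over the last `window` samples."""
    if window <= 0:
        raise ValueError("window must be positive")
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for s in samples:
        if len(buf) == window:
            total -= buf[0]      # oldest sample is about to be dropped
        buf.append(s)
        total += s
        out.append(total / len(buf))
    return out


def idle_fraction(sustained_rate, average_rate):
    """Fraction of time with no transfer, assuming (hypothetically) that
    the link runs at `sustained_rate` whenever a job is active."""
    return 1.0 - average_rate / sustained_rate


# Rates quoted on the slide, in MB/sec.
frac = idle_fraction(1500.0, 1360.0)
print(f"idle fraction: {frac:.3f}, idle hours per day: {24 * frac:.1f}")
```

Under that assumption, the quoted numbers correspond to the link being idle a bit over 9% of the time, or roughly 2.2 hours out of 24.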
CMS: PhEDEx and Dynamic Circuits (Vlad Lapadatescu, Tony Wildish)
Using dynamic circuits in PhEDEx allows for more deterministic workflows, useful for co-scheduling CPU with data movement.
Latest efforts: integrating circuit awareness into the FileDownload agent:
• Prototype is backend agnostic; no modifications to the PhEDEx DB
• All control logic is in the FileDownload agent
• Transparent for all other PhEDEx instances
[Figure: Testing circuit integration into the Download agent. PhEDEx transfer rates (in MB/sec, 1h moving average) against time of day, showing PhEDEx throughput on a dedicated path and on a shared path (with 5 Gbps of UDP cross traffic). Seamless switchover, no interruption of service.]
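The design described above (backend-agnostic, all control logic in the download agent, fallback that is invisible to everything else) can be sketched as follows. This is a hypothetical Python illustration, not the PhEDEx FileDownload agent: `CircuitBackend`, `CircuitAwareDownloader` and the `transfer` callback are names invented for the example.

```python
from typing import Callable, Iterable, Optional, Protocol


class CircuitBackend(Protocol):
    """Hypothetical interface; any circuit provisioning system could sit behind it."""

    def request(self, src: str, dst: str) -> Optional[str]:
        """Return a circuit id, or None if no circuit could be set up."""
        ...

    def release(self, circuit_id: str) -> None:
        ...


class CircuitAwareDownloader:
    """Keeps all circuit logic in one place and falls back to the shared path."""

    def __init__(self, backend: CircuitBackend,
                 transfer: Callable[[str, str], None]) -> None:
        self.backend = backend
        self.transfer = transfer  # transfer(file_name, path_kind)

    def download(self, src: str, dst: str, files: Iterable[str]) -> str:
        circuit_id = None
        try:
            circuit_id = self.backend.request(src, dst)
        except Exception:
            circuit_id = None  # a failed request must never stop the transfers
        path_kind = "dedicated" if circuit_id is not None else "shared"
        try:
            for name in files:
                self.transfer(name, path_kind)
        finally:
            if circuit_id is not None:
                self.backend.release(circuit_id)  # always tear the circuit down
        return path_kind
```

Because the caller only sees `download()`, the rest of the system does not need to know whether a circuit was used, which is the sense in which the change is transparent.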
Production and Distributed Analysis (Kaushik De)
25M jobs at > 100 sites now completed each month. 6X growth in 3 years (2010-13): a new plateau
• STEP 1: Import network information into PanDA
• STEP 2: Use network information directly to optimize workflow for data transfer/access, at a higher level than individual transfers alone
• Start with simple use cases leading to measurable improvements in workflow/user experience
USE CASES (Kaushik De)
• Faster User Analysis
  • Analysis jobs normally go to sites with local data: this sometimes leads to long wait times due to queuing
  • Could use network information to assign work to ‘nearby’ sites with idle CPUs and good connectivity
• Optimal Cloud Selection
  • Tier2s are connected to Tier1 “Clouds” manually by the ops team (may be attached to multiple Tier1s)
  • To be automated using network info: algorithm under test
• PD2P = PanDA Dynamic Data Placement: asynchronous, usage-based
  • Repeated use of data or backlog in processing → make additional copies
  • Rebrokerage of queues → new data locations
• PD2P is perfect for network integration
  • Use network for strategic replication + site selection: to be tested soon
  • Try SDN provisioning since this usually involves large datasets
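The "Faster User Analysis" use case amounts to comparing the queue wait at a site that already holds the data with the transfer time to a nearby idle site. The sketch below is a hypothetical Python illustration of that trade-off, not PanDA's brokerage algorithm. The `Site` fields and the cost model (queue wait plus dataset size divided by measured throughput) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    has_data: bool            # dataset already resident at the site
    idle_cpus: int
    queue_wait_s: float       # expected wait before a job starts
    throughput_mb_s: float    # measured rate from the data source (e.g. from perfSONAR)


def estimated_start_delay(site: Site, dataset_mb: float) -> float:
    """Seconds until a job could start processing at `site`."""
    transfer_s = 0.0 if site.has_data else dataset_mb / site.throughput_mb_s
    return site.queue_wait_s + transfer_s


def pick_site(sites, dataset_mb: float) -> Site:
    """Choose the site with the smallest estimated delay.

    Sites without the data qualify only if they have idle CPUs and a
    usable network measurement.
    """
    candidates = [
        s for s in sites
        if s.has_data or (s.idle_cpus > 0 and s.throughput_mb_s > 0)
    ]
    if not candidates:
        raise ValueError("no eligible site")
    return min(candidates, key=lambda s: estimated_start_delay(s, dataset_mb))
```

With a one-hour queue at the data-holding site and an idle site reachable at 100 MB/sec, a 10 GB dataset favors the remote site, while a 1 TB dataset favors waiting in the local queue.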
DYNES (NSF MRI-R2): Dynamic Circuits Nationwide: Led by Internet2 with Caltech
• DYNES is extending circuit capabilities to ~50 US campuses. This turns out to be nontrivial
• Partners: I2, Caltech, Michigan, Vanderbilt
• Working with I2 and ESnet on dynamic circuit issues and software
• Extended the OSCARS scope; transition: DRAGON to PSS, OESS
• Will be an integral part of the point-to-point service in LHCONE
http://internet2.edu/dynes
Challenges Encountered
• perfSONAR deployment status
  • For meaningful results, we need most LHC computing sites equipped with perfSONAR nodes. This is work in progress.
  • Easy-to-use perfSONAR API: was missing, but a REST API has been made available recently
• Inter-domain Dynamic Circuits
  • Intra-domain systems have been in production for some time
  • E.g. ESnet has used OSCARS as a production tool for several years
  • OESS (OpenFlow-based) is also in production, for a single domain
  • Inter-domain circuit provisioning continues to be hard
  • Implementations are fragile; error recovery tends to require manual intervention
  • Holistic approach needed: pervasive monitoring + tracking of configuration state changes; intelligent clean-up and timeout handling
  • NSI framework needs faster standardization, adoption and implementation among the major networks, or
  • Future SDN-based solution: for example OpenFlow and OpenDaylight
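The "intelligent clean-up and timeout handling" point can be illustrated with a small wrapper that bounds how long a circuit request may stay pending and always tears down partial state on failure. This is a hypothetical Python sketch: `setup`, `poll` and `teardown` stand in for whatever a provisioning system exposes, and the state names "ACTIVE" and "FAILED" are assumptions, not OSCARS, OESS or NSI calls.

```python
import time


class ProvisioningError(Exception):
    pass


def provision_with_cleanup(setup, poll, teardown, timeout_s,
                           interval_s=1.0,
                           clock=time.monotonic, sleep=time.sleep):
    """Request a circuit, wait for it to become active, clean up otherwise.

    setup()          -> handle for the pending request
    poll(handle)     -> "ACTIVE", "FAILED", or any other value while pending
    teardown(handle) -> removes whatever configuration was left behind
    """
    handle = setup()
    deadline = clock() + timeout_s
    try:
        while clock() < deadline:
            state = poll(handle)
            if state == "ACTIVE":
                return handle
            if state == "FAILED":
                raise ProvisioningError("circuit request failed")
            sleep(interval_s)
        raise ProvisioningError("circuit request timed out")
    except BaseException:
        # Never leave stale configuration behind for the next request.
        teardown(handle)
        raise
```

The explicit deadline keeps one stuck request from blocking a serialized queue of requests, and the unconditional teardown addresses leftover switch configuration after a failure.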
Some of the DYNES Challenges Encountered; Approaches to a Solution
• Some of the issues encountered in both the control and data planes came from immaturity of the implementation at the time
  • Failed requests left configuration on switches, causing subsequent failures
  • Failure notifications took too long to arrive, blocking serialized requests
  • Error messages were often erratic, making it hard to find the root cause of a problem
  • End-to-end VLAN translation did not always result in a functional data plane
  • Static data plane configuration needs changes upon upgrades
• Grid certificate validity (1 year) across 40+ sites led to frequent expiration issues (not DYNES specific!)
  • Solution: we use Nagios to monitor certificate states at all DYNES sites, generating early warnings to the local administrators
  • An alternate solution would be to create a DYNES CA and administer certificates in a coordinated way. This requires a responsible party.
• DYNES path forward:
  • Working with a selected subset of sites on getting automated tests failure free
  • Taking input from these: propagate changes to other sites, and/or deploy NSI
  • If funding allows (future proposal): an SDN-based multidomain solution
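The certificate monitoring described above boils down to comparing each certificate's expiry date with the current time and raising an early warning. A minimal Python sketch of such a check follows. It uses the standard Nagios plugin return codes (0 = OK, 1 = WARNING, 2 = CRITICAL), but the function name and the 30-day and 7-day thresholds are assumptions, not the DYNES configuration.

```python
from datetime import datetime, timezone

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_certificate(not_after, now, warn_days=30.0, crit_days=7.0):
    """Return (status_code, message) for a certificate expiring at `not_after`.

    Both arguments are timezone-aware datetimes. The thresholds are
    illustrative defaults.
    """
    days_left = (not_after - now).total_seconds() / 86400.0
    if days_left <= 0:
        return CRITICAL, "CRITICAL: certificate has expired"
    if days_left <= crit_days:
        return CRITICAL, f"CRITICAL: certificate expires in {days_left:.0f} days"
    if days_left <= warn_days:
        return WARNING, f"WARNING: certificate expires in {days_left:.0f} days"
    return OK, f"OK: certificate valid for {days_left:.0f} more days"


if __name__ == "__main__":
    # Hypothetical example: a certificate expiring 20 days from a fixed "now".
    now = datetime(2014, 5, 14, tzinfo=timezone.utc)
    status, message = check_certificate(
        datetime(2014, 6, 3, tzinfo=timezone.utc), now)
    print(status, message)
```

A real plugin would read the expiry date from the certificate file and exit with the status code; that part is left out here to keep the sketch self-contained.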
Efficient Long Distance End to End Throughput from the Campus Over 100G Networks
Harvey Newman, Artur Barczyk, Azher Mughal, California Institute of Technology
NSF CC-NIE Meeting, Washington DC, April 30, 2014
SC06 BWC: Fast Data Transfer (I. Legrand)
http://monalisa.cern.ch/FDT
• SC06 BWC: stable disk-to-disk flows Tampa-Caltech: 10-to-10 and 8-to-8 1U server pairs for 9 + 7 = 16 Gbps; then solid overnight, using one 10G link
• 17.77 Gbps BWC peak; + 8.6 Gbps to and from Korea
• An easy-to-use open source Java application that runs on all major platforms
• Uses an asynchronous multithreaded system to achieve smooth, linear data flow:
  • Streams a dataset (list of files) continuously through an open TCP socket
  • No protocol start/stops between files
  • Sends buffers at a rate matched to the monitored capability of the end-to-end path
  • Uses independent threads to read & write on each physical device
• By SC07: ~70-100 Gbps per rack of low-cost 1U servers
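The streaming idea above (a list of files pushed back to back through one open socket, with disk reads and network writes on separate threads) can be sketched in a few lines. FDT itself is a Java application with its own protocol. The Python below is only a hypothetical illustration of the technique: it sends raw bytes with no framing or file metadata, and the bounded queue provides simple back-pressure rather than FDT's rate matching.

```python
import queue
import socket
import threading

CHUNK = 64 * 1024  # read size in bytes (illustrative)


def _reader(paths, q):
    """Disk side: read every file in order and hand chunks to the sender."""
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK):
                q.put(chunk)
    q.put(None)  # end-of-dataset marker


def stream_files(paths, sock, depth=16):
    """Send all files through one already-connected socket, back to back.

    The connection is never closed or renegotiated between files, so the
    TCP flow does not have to ramp up again for each file.
    """
    q = queue.Queue(maxsize=depth)  # bounded: the reader waits if the network is slower
    t = threading.Thread(target=_reader, args=(list(paths), q))
    t.start()
    while (chunk := q.get()) is not None:
        sock.sendall(chunk)
    t.join()
```

A caller would open one TCP connection with `socket.create_connection(...)`, call `stream_files` once for the whole dataset, and close the socket afterwards. Using one reader thread per physical disk, as the slide describes, would be a natural extension.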
Forward to 2014: Long-distance Wide Area 100G Data Transfers
• It’s increasingly easy to saturate 100G infrastructure with well-prepared demonstration equipment, using aggregated traffic of several hosts
• Caltech SC’13 demo: solid 99-100G throughput on one 100G wave; up to 325G WAN traffic
• BUT: using 100G infrastructure efficiently for production revealed several challenges
• 70-74 Gbps Caltech – Internet2 – ANA100 – CERN (note: single server, multiple TCP streams using the FDT tool)
• Mostly end-system related:
  • Need IRQ affinity tuning
  • Multi-core support (multi-threaded applications)
  • Storage controller limitations, mainly SW driver
  • + CPU-controller-NIC flow control?
Network Path Layout: Caltech (CHOPIN: CC-NIE) – CENIC – Internet2 – ANA100 – Amsterdam (SURFnet) – CERN (US LHCNet)
• CHOPIN: 100G Advanced Networking + Science Driver Targets for 2014-15
  • LIGO Scientific Collab.
  • Astro Sky Surveys; VOs
  • Geodetic + Seismic Nets
  • Genomics: On-Chip Gene Sequencing
Azher Mughal, Ramiro Voicu (Ramiro.Voicu@cern.ch)
100G TCP Tests CERN-Caltech This Week (Cont’d)
• Peaks of ~83 Gbps on some AL2S segments
• You need strong and willing partners: Caltech Campus, CENIC, Internet2, ANA-100, SURFnet, CERN + engagement
100G TCP Tests CERN-Caltech; An Issue
• Server 1: ~58 Gbps
• Server 2: only ~12 Gbps
  • Server 2 is a newer generation (E5-2690 v2, Ivy Bridge) in the same chassis as Server 1; there is an issue with the newer CPUs and the Mellanox 40GE NICs
  • Engaged with the vendors (Mellanox, Intel, LSI)
• Expect further improvements once this issue is resolved
• Lessons learned: need a strong team with the right talents, a systems approach, and especially strong partnerships: regional, national, global; manufacturers
CHOPIN Network Layout (CC-NIE Grant)
• 100GE backbone capacity, operational
• External connectivity to major carriers including CENIC, ESnet, Internet2 and PacWave
• LIGO and IPAC are in the process of joining using 10G and 40G links
CHOPIN WAN Connections
• External connectivity to CENIC, ESnet, Internet2 and PacWave
• Able to create Layer 2 paths using the Internet2 OESS portal over the AL2S US footprint
• Dynamic circuits through Internet2 ION over the 100GE path
CHOPIN – CMS Tier2 Integration
• Caltech CMS fully integrated with the 100GE backbone
• IP peering with Internet2 and UFL at 100GE … ready for next LHC run
• Current peaks are around 8 Gbps
Key Issue and Approach to a Solution: Next Generation System for LHC + Other Fields
• Present solutions will not scale beyond LHC Run 2
• We need: an agile architecture exploiting globally distributed grid, cloud, specialized (e.g. GPU) & opportunistic computing resources
• A services system that moves the data flexibly and dynamically, and behaves coherently
• Examples do exist, with smaller but still very large scope
• A pervasive, agile autonomous agent architecture that deals with complexity
• Developed by talented system developers with a deep appreciation of networks
[Figures: MonALISA ALICE Grid; Grid Topology; MonALISA Automated Transfers on Dynamic Networks; Grid Job Lifelines]
Key Issues for ANSE
• ANSE is in its second year. We should develop the timeline and milestones for the remainder of the project
• We should identify and clearly delineate a strong set of deliverables to improve data management and processing during LHC Run 2
• We should communicate and discuss these with the experiments, to get their feedback and engagement
• We need to deal with a few key issues:
  • Dynamic circuits: DYNES. We need a clear path forward; can we solve some of the problems with SDN?
  • perfSONAR, and filling in our monitoring needs (e.g. with ML?)
  • How to integrate high throughput methods in PanDA, and get high throughput (via PhEDEx) into production in CMS
• We have some bigger issues arising, and we need to discuss these among the PIs and project leaders during this meeting