Advances in Integrated Storage, Transfer and Network Management

Presentation Transcript


  1. Advances in Integrated Storage, Transfer and Network Management <x> Email: <x>@<y> On behalf of the Ultralight Collaboration

  2. Networking and HEP • The network is a key element that HEP needs to work on, as much as its Tier1 and Tier2 facilities. • Network bandwidth is going to be a scarce, strategic resource; the ability to co-schedule network and other resources will be crucial • US CMS Tier1/Tier2 experience has shown this (tip of the iceberg) • HEP needs to work with US LHCNet network providers/operators/managers, and experts on the upcoming circuit-oriented management paradigm • Joint planning is needed, for a consistent outcome • The provision of managed circuits is also a direction for ESnet and Internet2, but LHC networking will require it first • We also need to exploit now-widely-available technologies: reliable data transfers, in-depth monitoring, end-to-end workflow information (including every network flow, end-system state and load) • Risk: lack of efficient workflow and readiness leads to lack of competitiveness • Lack of engagement has made this a DOE-funding issue as well; even the planned bandwidth may not be there

  3. Transatlantic Networking: Bandwidth Management with Channels • The planned bandwidth for 2008-2010 is already well below the capabilities of current end-systems, appropriately tuned • Already evident in intra-US production flows (Nebraska, UCSD, Caltech) • The "current capability" limit of nodes is still an order of magnitude greater than the best results seen in production so far • This can be extended to transatlantic flows, using a series of straightforward technical steps: configuration, diagnosis, monitoring • Involving the end-hosts and storage systems, as well as the networks • The current lack of transfer volume from non-US Tier1 sites is not, and cannot be, an indicator of future performance • Upcoming paradigm: managed bandwidth channels • Scheduled: multiple classes of "work", with a settable number of simultaneous flows per class, like computing-facility queues (see the sketch below) • Monitored and managed: bandwidth will be matched to throughput capability [schedule coherently; free up excess bandwidth resources as needed] • Respond to: errors; schedule changes; progress-rate deviations
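
For illustration only, here is a minimal Java sketch of "multiple classes of work, with a settable number of simultaneous flows per class", modeled on batch-queue admission. The class names, limits and types are hypothetical, not part of any UltraLight interface.

```java
// Sketch of per-class managed channels with a settable number of
// simultaneous flows per class, analogous to computing-facility queues.
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class ChannelClassScheduler {

    /** A transfer request queued under one class of work. */
    record TransferRequest(String id, String workClass, long bytes) {}

    private final Map<String, Integer> maxFlowsPerClass = new HashMap<>();
    private final Map<String, Integer> activeFlows = new HashMap<>();
    private final Map<String, Queue<TransferRequest>> queues = new HashMap<>();

    /** Configure a class of work with its settable number of simultaneous flows. */
    public void defineClass(String workClass, int maxSimultaneousFlows) {
        maxFlowsPerClass.put(workClass, maxSimultaneousFlows);
        activeFlows.put(workClass, 0);
        queues.put(workClass, new ArrayDeque<>());
    }

    /** Submit a request; it starts immediately if its class has a free slot. */
    public synchronized void submit(TransferRequest req) {
        if (activeFlows.get(req.workClass()) < maxFlowsPerClass.get(req.workClass())) {
            start(req);
        } else {
            queues.get(req.workClass()).add(req);   // wait, like a batch queue
        }
    }

    /** Called when a flow finishes (or is cancelled): free the slot, start the next. */
    public synchronized void completed(TransferRequest req) {
        activeFlows.merge(req.workClass(), -1, Integer::sum);
        TransferRequest next = queues.get(req.workClass()).poll();
        if (next != null) start(next);
    }

    private void start(TransferRequest req) {
        activeFlows.merge(req.workClass(), 1, Integer::sum);
        System.out.printf("starting %s (%d bytes) in class %s%n",
                req.id(), req.bytes(), req.workClass());
        // In a real system this is where allocated bandwidth would be matched
        // to the measured throughput capability of the end hosts.
    }

    public static void main(String[] args) {
        ChannelClassScheduler s = new ChannelClassScheduler();
        s.defineClass("production", 4);   // hypothetical classes and limits
        s.defineClass("analysis", 2);
        s.submit(new TransferRequest("t1", "production", 1_000_000_000L));
    }
}
```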

  4. Computing, Offline and CSA07 (CMS): US Tier1 – Tier2 flows reach 10 Gbps (1/2007 snapshot)

  5. Computing, Offline and CSA07: one well-configured site (Nebraska). What happens if there are 10 such sites?

  6. Transatlantic Networking: Bandwidth Management with Channels • The planned bandwidth for 2008-2010 is already well below the capabilities of current end-systems, appropriately tuned • Already evident in intra-US production flows (Nebraska, UCSD, Caltech) • The "current capability" limit of nodes is still an order of magnitude greater than the best results seen in production • This can be extended to transatlantic flows, using a series of straightforward technical steps: configuration, diagnosis, monitoring • Involving the end-hosts and storage systems, as well as the networks • The current lack of transfer volume from non-US Tier1 sites is not, and cannot be, an indicator of future performance • Upcoming paradigm: managed bandwidth channels • Scheduled: multiple classes of "work", with a settable number of simultaneous flows per class, like computing-facility queues • Monitored and managed: bandwidth will be matched to throughput capability [schedule coherently; free up excess bandwidth resources as needed] • Respond to: errors; schedule changes; progress-rate deviations

  7. Internet2's New Backbone: Level(3) footprint; Infinera 10 x 10G core; CIENA optical muxes • Initial deployment – 10 x 10 Gbps wavelengths over the footprint • First-round maximum capacity – 80 x 10 Gbps wavelengths; expandable • Scalability – potential migration to 40 Gbps or 100 Gbps capability • Transition to NewNet underway now (since 10/2006), until the end of 2007

  8. Network Services for Robust Data Transfers: Motivation • Today's situation • End-hosts / applications are unaware of the network topology, the available paths, and the bandwidth available (with other flows) on each • The network is unaware of transfer schedules and end-host capabilities • High-performance networks are not a ubiquitous resource • Applications might get full bandwidth only on the access part of the network • Usually only to the next router • Mostly not beyond the network under the site's control • Better utilization needs joint management; work together • To schedule and allocate network resources along with storage and CPU: using queues, and quotas where needed • Data transfer applications can no longer treat the network as a passive resource. They can only be efficient if they are "network-aware": communicating with and delegating many of their functions to network-resident services (a minimal sketch follows)
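
The following is a hedged sketch, in Java, of what a "network-aware" transfer application could look like: it queries hypothetical path-discovery and scheduling services before moving data, rather than treating the network as passive. Every interface and method name here is invented for the example.

```java
// Sketch: the application delegates path selection and scheduling to
// network-resident services instead of blindly opening a socket.
import java.util.List;

interface PathDiscoveryService {
    List<String> pathsBetween(String sourceHost, String targetHost);
    double availableGbps(String path);
}

interface TransferSchedulingService {
    /** Ask the network to schedule the transfer; returns a channel handle. */
    String scheduleTransfer(String path, long bytes, int priority);
}

class NetworkAwareTransfer {
    void copy(PathDiscoveryService pds, TransferSchedulingService dtss,
              String src, String dst, long bytes) {
        // 1. Learn the topology options instead of assuming a single default route.
        String bestPath = pds.pathsBetween(src, dst).stream()
                .max((a, b) -> Double.compare(pds.availableGbps(a), pds.availableGbps(b)))
                .orElseThrow();
        // 2. Delegate scheduling to the network-resident service.
        String channel = dtss.scheduleTransfer(bestPath, bytes, /* priority */ 3);
        // 3. Only now start moving data over the granted channel.
        System.out.println("transferring over channel " + channel);
    }
}
```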

  9. Data Transfer Experience • Transfers are generally much slower than expected, or slow down, or even stop. Many potential causes remain undiagnosed: • Configuration problem? Loading? Queuing? Errors? etc. • End-host problem, network problem, or application failure? • The application (blindly) retries, and perhaps sends emails through HyperNews to request "expert" troubleshooting • Insufficient information, too slow to diagnose and correlate at the time the error occurs • Result: lower transfer rates, and people spending more and more time troubleshooting • Can a manual, non-integrated approach scale to a collaboration the size of CMS/ATLAS when data transfer needs increase, and when we will compete for (network) resources with other experiments (and sciences)? • PhEDEx (CMS) and FTS are not the only end-user applications that face transfer challenges (ask ATLAS, for example)

  10. Managed Network Services Integration • The network alone is not enough • Proper integration and interaction are needed with the end-hosts and storage systems • Multiple applications will want to use the network resource • ATLAS, ALICE and LHCb • Private transfer tools; analysis applications streaming data • Real-time streams (e.g. videoconferencing) • There is no need for different domains to develop their own monitoring and troubleshooting tools for a resource that they share • Manpower-intensive; duplication of effort; "reinvention" • Lack of sufficient operational expertise, lack of access to sufficient information, and lack of control • So these systems will (and do) lack the needed functionality, performance, and reliability

  11. Managed Network Services Strategy (1): analogous to real-time operation with internal monitoring and diagnosis • Network services integrated with end systems • Need to use a real-time, end-to-end view of the network and end-systems, based on in-depth end-to-end monitoring • Need to extract and correlate information (e.g. network state, end-host state, transfer-queue state) • Solution: network and end-hosts exchange information via real-time services and "end-host agents" (EHAs) • Provide sufficient information for decision support • EHAs and network services can cooperate to automate some operational decisions, based on accumulated experience • Especially where they become reliable enough • Automation will become essential as the usage and number of users scale up, and competition increases

  12. Managed Network Services Strategy (2): real-time control and adaptation as an optimizable system • [Semi-]deterministic scheduling (see the sketch below) • Receive a request: "Transfer N bytes from site A to B with throughput R1"; authenticate/authorize/prioritize • Verify the end-host (hardware, configuration) capability R2; schedule bandwidth B > R2; estimate the time to complete T(0) • Schedule a path, with priorities P(i) on each segment S(i) • Check periodically: compare the rate R(t) to R2, and compare and update the time to complete T(i) against T(i-1) • Triggers: errors (e.g. segment failure); variance beyond thresholds (e.g. progress too slow, channel underutilized, wait in queue too long); state changes (e.g. a new high-priority transfer submitted) • Dynamic actions: build an alternative path; change the channel size; create a new channel and squeeze others in its class, etc.
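
A minimal Java sketch of the scheduling loop just described, assuming illustrative thresholds and method names; it only shows the shape of the periodic check and the three trigger families (error, variance, state change), not a real implementation.

```java
// Sketch of the [semi-]deterministic scheduling loop: periodic checks of the
// measured rate against the verified end-host capability and the scheduled
// bandwidth, with the triggers and dynamic actions named on the slide.
public class ManagedTransfer {
    long totalBytes;          // N bytes requested from site A to B
    double requestedRate;     // R1, requested throughput (bytes/s)
    double endHostRate;       // R2, verified end-host capability (bytes/s)
    double scheduledBw;       // B > R2, bandwidth scheduled on the path
    long bytesDone;
    double estCompletionSec;  // T(i), updated at every check

    void periodicCheck(double measuredRate, boolean segmentError,
                       boolean higherPriorityArrived) {
        double previousEstimate = estCompletionSec;
        estCompletionSec = (totalBytes - bytesDone) / Math.max(measuredRate, 1.0);

        if (segmentError) {
            reroute();                                   // error trigger
        } else if (measuredRate < 0.5 * endHostRate) {
            growChannel();                               // progress too slow
        } else if (measuredRate < 0.5 * scheduledBw) {
            shrinkChannel();                             // channel underutilized:
                                                         // free excess bandwidth
        } else if (higherPriorityArrived) {
            squeezeForHigherPriority();                  // state-change trigger
        }

        if (estCompletionSec > 1.2 * previousEstimate) {
            notifyEndHostAgents();                       // deviation beyond threshold
        }
    }

    // Dynamic actions named on the slide; bodies elided in this sketch.
    void reroute() {}
    void growChannel() {}
    void shrinkChannel() {}
    void squeezeForHigherPriority() {}
    void notifyEndHostAgents() {}
}
```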

  13. Network-Aware Scenario: Link-Segment Outage (DTSS, fully distributed) • Legend: NMA = Network Monitoring Agent; EHA = End-Host Agent; DTSS = Dataset Transfer & Scheduling Service • (1) An NMA detects a problem through monitoring (e.g. a link segment down) • (2) It notifies the DTSS • (3) The DTSS re-routes the traffic if bandwidth is available on a different route • Transparent to the end-systems • Rerouting takes the transfer priority into account • (4) Otherwise, if there is no alternative, the DTSS notifies the EHAs and puts the transfer on hold • End hosts and applications must interact with network services to achieve this
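
Assuming hypothetical Topology and EndHostAgent interfaces, the outage handling above could look roughly like this on the DTSS side (an illustrative sketch, not project code):

```java
// Sketch: on a link-segment failure reported by an NMA, try to re-route on the
// known topology; otherwise notify the EHAs and put the transfer on hold.
import java.util.List;
import java.util.Optional;

class OutageHandler {
    interface Topology {
        /** Alternative paths between two endpoints that avoid a failed segment. */
        List<String> alternativePaths(String src, String dst, String failedSegment);
        double availableBandwidth(String path);
    }
    interface EndHostAgent { void holdTransfer(String transferId, String reason); }

    void onSegmentDown(String failedSegment, String transferId, String src, String dst,
                       double requiredBw, Topology topo, List<EndHostAgent> ehAgents) {
        Optional<String> reroute = topo.alternativePaths(src, dst, failedSegment).stream()
                .filter(p -> topo.availableBandwidth(p) >= requiredBw)
                .findFirst();
        if (reroute.isPresent()) {
            // Transparent to the end systems: only the circuit changes.
            System.out.println("re-routing " + transferId + " via " + reroute.get());
        } else {
            // No alternative: the end hosts must be told, and the transfer is held.
            ehAgents.forEach(a -> a.holdTransfer(transferId, "segment down: " + failedSegment));
        }
    }
}
```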

  14. Network-Aware Scenario: End-Host Problem (DTSS, fully distributed) • Legend: NMA = Network Monitoring Agent; EHA = End-Host Agent; DTSS = Dataset Transfer & Scheduling Service • (1) The EHA detects a problem with the end-host (e.g. throughput on the NIC lower than expected) and notifies the DTSS • (2) The DTSS squeezes the corresponding circuit and gives more bandwidth to the next highest-priority transfer sharing (part of) the path • (3) The NMA senses the increased throughput and notifies the DTSS • (4) Once the NIC problem is resolved (sensed by network monitoring), the data transfer returns to its nominal bandwidth • Network AND end-host monitoring together allow root-cause analysis!

  15. End Host Agent (EHA) • Authorization; service discovery; system configuration • Complete local profiling of host hardware and software (CPU, memory, disk, network) • Complete monitoring of host processes, activity, load, disk I/O and network • End-to-end performance measurements • Acts as an active listener to "events" generated by local applications • The EHA implements secure APIs that applications can use for requests and interactions with network services (a sketch follows) • The EHA continuously receives estimates of the time-to-complete for its requests • Where possible, the EHA will help provide the correct system & network configuration to the end-hosts • Minimize!
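
A speculative sketch of the API surface an EHA might expose to local applications; every method and type name here is an assumption made for illustration, not the actual LISA/EHA interface.

```java
// Sketch of an End-Host Agent API: applications submit requests through the
// agent, which pushes back time-to-complete estimates and listens for events.
public interface EndHostAgentApi {

    /** Authorize the caller and discover the network services available to its VO. */
    String authorize(String credential);

    /** Submit a transfer request on behalf of a local application. */
    String requestTransfer(String sourcePath, String targetUrl,
                           long bytes, int priority, long deadlineEpochSec);

    /** Latest estimated time-to-complete for a request, refreshed by the EHA. */
    long estimatedSecondsToComplete(String requestId);

    /** Local profiling the EHA keeps for the host: CPU, memory, disk I/O, NIC. */
    HostProfile localProfile();

    /** Applications register events here so the EHA can act as an active
        listener (start, stall, completion, failure). */
    void onApplicationEvent(String requestId, String event);

    record HostProfile(int cores, long memBytes, double diskReadMBps,
                       double diskWriteMBps, double nicGbps) {}
}
```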

  16. End Host/User Agent is a Reality in EVO (also FDT, VINCI, etc.) • The EHA monitors various system & network parameters • Key characteristic: a coherent real-time architecture of inter-communicating agents and services with correlated actions • Enables (suggested) automatic actions based on anomalies to ensure proper quality (e.g. automatically limit the video if bandwidth is limited)

  17. Specialized Agents Supporting Robust Data Transfer Operations • Example agent services supporting transfer operations & management: • Build a real-time, dynamic map of the interconnection topology among the services, and of the channels on each link segment, including: • How the channels are allocated (bandwidth, priority) and used • Information on the total traffic and major flows on each segment • Locate network resources & services which have agreements with a given VO • List the routers and switches along each path that are "manageable" • Determine which path options exist between source and destination • Among N replicas of a data source and M possible network paths, "discover" (with monitoring) the least estimated time to completion of a transfer to a given destination; take the reliability record into account • Detect problems in real time, such as: • Loss or impairment of connectivity; CRC errors; asymmetric routing • A large rate of lost packets, or retransmissions above a threshold (see the sketch below)
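
The real-time detection bullets could translate into per-interval checks of interface counters against thresholds, roughly as in this sketch; the threshold values are placeholders, not operational settings, which would come from accumulated experience.

```java
// Sketch: flag link problems from per-interval counters (packet loss,
// retransmissions, CRC errors, loss of connectivity).
import java.util.ArrayList;
import java.util.List;

class LinkHealthCheck {
    record IntervalCounters(long packetsSent, long packetsLost,
                            long retransmissions, long crcErrors) {}

    static List<String> detect(IntervalCounters c) {
        List<String> alarms = new ArrayList<>();
        double lossRate = c.packetsSent() == 0 ? 0.0
                : (double) c.packetsLost() / c.packetsSent();
        double retransRate = c.packetsSent() == 0 ? 0.0
                : (double) c.retransmissions() / c.packetsSent();

        if (lossRate > 1e-3)    alarms.add("large rate of lost packets: " + lossRate);
        if (retransRate > 1e-2) alarms.add("retransmissions above threshold: " + retransRate);
        if (c.crcErrors() > 0)  alarms.add("CRC errors on segment: " + c.crcErrors());
        if (c.packetsSent() == 0 && c.packetsLost() == 0)
            alarms.add("no traffic seen: possible loss or impairment of connectivity");
        return alarms;
    }
}
```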

  18. The Scheduler • Implemented as a set of collaborating agents: no single point of failure • Independent agents control different segments or channels • They negotiate an end-to-end connection, using cost functions to find the "lowest-cost" path on the topological graph, based on VO, priorities, pre-reservations, etc. • An agent offers a "lease" for the use of a segment or channel to its peers • Periodic lease renewals are used by all agents (see the sketch below) • This allows a flexible response of the agents to task completion, application failure or network errors • If a segment failure or impairment is detected, the DTSS automatically provides path restoration based on the discovered network topology • Resource reallocation may be applied to preserve high-priority tasks • The supervising agents then release all other channels or segments along the path • An alternative path may be set up rapidly enough to avoid a TCP timeout, and thus allow the transfer to continue uninterrupted
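
A minimal sketch of the lease mechanism: an agent controlling one segment grants time-limited leases that must be renewed periodically, so an abandoned transfer (task completion, application failure, network error) frees its resources automatically. Durations and names are illustrative.

```java
// Sketch: lease-based control of one segment or channel by its agent.
import java.time.Instant;
import java.util.Optional;

class SegmentAgent {
    private final String segmentId;
    private String holder;            // peer currently holding the lease
    private Instant leaseExpiry;
    private static final long LEASE_SECONDS = 30;   // illustrative duration

    SegmentAgent(String segmentId) { this.segmentId = segmentId; }

    /** Offer a lease on this segment or channel to a requesting peer. */
    synchronized Optional<Instant> offerLease(String peer) {
        expireIfStale();
        if (holder != null && !holder.equals(peer)) return Optional.empty();
        holder = peer;
        leaseExpiry = Instant.now().plusSeconds(LEASE_SECONDS);
        return Optional.of(leaseExpiry);
    }

    /** Periodic renewal keeps the end-to-end path alive while the transfer runs. */
    synchronized boolean renew(String peer) {
        expireIfStale();
        if (!peer.equals(holder)) return false;
        leaseExpiry = Instant.now().plusSeconds(LEASE_SECONDS);
        return true;
    }

    /** If the holder stops renewing, the segment is released for other paths. */
    private void expireIfStale() {
        if (holder != null && Instant.now().isAfter(leaseExpiry)) {
            System.out.println("lease on " + segmentId + " expired, releasing " + holder);
            holder = null;
        }
    }
}
```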

  19. Dynamic Path Provisioning & Queueing • Channel allocation based on VO/priority [+ wait time, etc.] • Create on demand an end-to-end path or channel, and configure the end-hosts • Automatic recovery (rerouting) in case of errors • Dynamic reallocation of throughput per channel: to manage priorities and control time to completion, where needed • Reallocate resources that were requested but not used • Control loop: user request → scheduling → control → end-host agents, with real-time feedback from monitoring

  20. Dynamic end-to-end optical path creation & usage • 200+ MBytes/sec FDT transfer from a 1U node • 4 fiber-cut emulations • Minimize emails like this!

  21. FDT Disk-to-Disk WAN Transfers Between Two End-Hosts (CERN, Geneva to Caltech, via New York) • 1U nodes with 4 disks: reads and writes on 4 SATA disks in parallel on each server; mean traffic ~210 MB/s, ~0.75 TB per hour • 4U disk servers with 24 disks: reads and writes on two 12-port RAID controllers in parallel on each server; mean traffic ~545 MB/s, ~2 TB per hour • Working on integrating FDT with dCache

  22. Expose system information that cannot be handled automatically • Drill-down, with simple auto-diagnostics, using MonALISA/ApMon in AliEn • RSS feeds and email alerts when anomalies occur whose correction is not automated • Incrementally try to automate these anomaly corrections (where possible), based on accumulated experience

  23. System Architecture: the Main Services • Applications and end-user agents sit on top of the service layer • Core services: system evaluation & optimization; authentication, authorization, accounting; scheduling & queues; dynamic path & channel building and allocation; prediction; failure detection; topology discovery; control (path provisioning); learning • All services are driven by MONITORING and act on the network through GMPLS, MPLS, OS and SNMP interfaces • Several services are already operational and field tested

  24. System Architecture: Scenario View

  25. Interoperation with Other Network Domains

  26. Benefits to LHC • It is important to establish a smooth transition from the current situation to the one envisioned, without interruption of currently available services • Higher-performance transfers • Better utilization of our network resources • Applications interface with network services & monitoring • Instead of creating their own, so less manpower is required from CMS • Decreased email traffic • Less time spent on support; less manpower from CMS • End-to-end monitoring and alerts • More transparency for troubleshooting (where it is not automated) • Incrementally work on automating anomaly handling where possible, based on accumulated experience • Synergy with US-funded projects • Manpower is available to support the transition • A distributed system, with pervasive monitoring and rapid reaction time • Fewer single points of failure • Provides information and diagnosis that is not otherwise possible • Several services are already operational and field tested • We intend progressive development and deployment (not "all at once"), but there are strong benefits in early deployment: cost, efficiency, readiness

  27. Resources: For More Information • End Host Agent (LISA): http://monalisa.caltech.edu/monalisa__Interactive_Clients__LISA.html • ALICE monitoring and alerts: http://pcalimonitor.cern.ch/ • Fast Data Transfer (FDT): http://monalisa.cern.ch/FDT/ • Network services (VINCI): http://monalisa.caltech.edu/monalisa__Service_Applications__Vinci.html • EVO with End Host Agent: http://evo.caltech.edu/ • UltraLight: http://ultralight.org • See the various posters on this subject at CHEP

  28. Extra Slides Follow

  29. US LHCNet, ESnet Plan 2007-2010: 30-80 Gbps US-CERN, plus ESnet MANs • US LHCNet network plan: wavelength quadrangle NY - CHI - GVA - AMS; 2007-10: 30, 40, 60, 80 Gbps (3 to 8 x 10 Gbps US-CERN) • ESnet4: production IP core ≥10 Gbps for enterprise IP traffic; Science Data Network (SDN) core at 30-50 Gbps, with 40-60 Gbps circuit transport; metropolitan area rings • ESnet MANs to FNAL & BNL; dark fiber to FNAL; peering with GEANT2; high-speed cross-connects with Internet2/Abilene • International connectivity: GEANT2, SURFnet, IN2P3, AsiaPac, Japan, Australia; NSF/IRNC circuit; GVA-AMS connection via SURFnet or GEANT2 • Major DOE Office of Science sites connected at ESnet hubs

  30. FDT: Fast Data Transport, SC06 Results (11/14 - 11/15/06) • New capability level: ~70 Gbps per rack of low-cost 1U servers • Efficient data transfers: reading and writing at disk speed over WANs (with TCP) for the first time • SC06 results: 17.7 Gbps on one link; 8.6 Gbps to/from Korea • Written in Java: highly portable, runs on all major platforms • Based on an asynchronous, multithreaded system • Streams a dataset (list of files) continuously, from a managed pool of buffers in kernel space, through an open TCP socket (see the sketch below) • Smooth data flow from each disk to/from the network • No protocol start-phase between files • Stable disk-to-disk flows • Tampa - Caltech: stepping up to 10-to-10 and 8-to-8 1U server pairs, 9 + 7 = 16 Gbps, then solid overnight, using one 10G link (I. Legrand)
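
To make the "no protocol start-phase between files" point concrete, here is a small Java sketch of streaming a file list back-to-back through one TCP socket using kernel-side transfers. It illustrates the technique only and is not FDT's actual code (FDT adds managed buffer pools, multithreading and its own protocol); host names and paths in the example are hypothetical.

```java
// Sketch: stream a dataset (list of files) continuously through a single open
// TCP socket, letting the kernel move the data (FileChannel.transferTo) so the
// connection never leaves its steady state between files.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class ContinuousFileStreamer {
    public static void stream(List<Path> dataset, String host, int port) throws IOException {
        try (SocketChannel socket = SocketChannel.open(new InetSocketAddress(host, port))) {
            for (Path file : dataset) {
                try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ)) {
                    long sent = 0, size = in.size();
                    while (sent < size) {
                        // transferTo moves data disk -> socket in kernel space,
                        // keeping the flow smooth without user-space copies.
                        sent += in.transferTo(sent, size - sent, socket);
                    }
                }
                // No handshake between files: the next file starts immediately.
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: stream two local files to a receiver on port 9000.
        stream(List.of(Path.of("/data01/file1.root"), Path.of("/data01/file2.root")),
               "receiver.example.org", 9000);
    }
}
```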

  31. Dynamic Network Path Allocation & Automated Dataset Transfer • Example: >mlcopy A/fileX B/path/ • The MonALISA distributed service system monitors and controls the optical switches (via TL1) and creates the active light path between host A and host B; a regular IP path over the Internet remains available • The LISA agent sets up the end-host: network interfaces, TCP stack, kernel, routes; it then tells the application which interface to use (e.g. "use eth1.2, …") • Sequence: OS path available → configuring interfaces → starting data transfer → real-time monitoring • Errors are detected and the path is automatically recreated in less than the TCP timeout (< 1 second)

  32. The Functionality of the VINCI System • MonALISA services, ML proxy services and ML agents at each site (Site A, Site B, Site C) control the network layer by layer: • Layer 3: routers • Layer 2: Ethernet, LAN-PHY or WAN-PHY • Layer 1: DWDM fiber

  33. Real-Time Topology Discovery & Display • Monitoring network topology, latency and routers, across networks, routers and autonomous systems (AS)

  34. Four-Continent Testbed • Building a global, network-aware, end-to-end managed real-time Grid

  35. Scenario: Pre-emption of a Lower-Priority Transfer • Legend: DTSS = Data Transfer & Scheduling Service; NMA = Network Monitoring Agent; EHA = End-Host Agent • (1) At time t0, a high-priority request (e.g. a T0 - T1 transfer) is received by the DTSS • (2) The DTSS squeezes the normal-priority channel and allocates the necessary bandwidth to the high-priority transfer (circuit setup) • (3) The DTSS notifies the EHAs • At the end of the high-priority transfer (t1), the DTSS restores the originally allocated bandwidth and notifies the EHAs

  36. Advanced Examples • Reservation cancellation by the user • A transfer-request cancellation has to be propagated to the DTSS; it is not enough to cancel the job on the end-host • The DTSS will then schedule the next transfer in the queue ahead of time • The application might not have its data ready yet (e.g. staging from tape), so the DTSS has to communicate with the EHAs which could profit from the earlier schedule, and the end host has to advertise when it can accept the circuit • Reservation extension (transfer not finished on schedule) • If possible, the DTSS keeps the circuit: at the same bandwidth if available, taking the transfer priority into account, possibly with a quota limitation • Progress tracking: the EHA has to announce the remaining transfer size, and the estimated time to completion is recalculated taking the new bandwidth into account (see the sketch below)
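
The progress-tracking bullet amounts to a small calculation; here is a sketch with invented field names and an assumed quota rule, purely for illustration.

```java
// Sketch: recompute the estimated time to completion (ETC) for a reservation
// extension from the remaining size and the bandwidth the circuit can keep.
class ReservationExtension {
    /** Returns the new ETC in seconds, or -1 if the extension would exceed the
        transfer's remaining quota (the quota rule is an assumption). */
    static long recomputeEtc(long remainingBytes, double grantedBytesPerSec,
                             long remainingQuotaSeconds) {
        long etcSeconds = (long) Math.ceil(remainingBytes / grantedBytesPerSec);
        return etcSeconds <= remainingQuotaSeconds ? etcSeconds : -1;
    }

    public static void main(String[] args) {
        // 300 GB left, circuit kept but squeezed to 500 MB/s, 20 min of quota left.
        long etc = recomputeEtc(300_000_000_000L, 500_000_000.0, 1200);
        System.out.println(etc > 0 ? "extend by " + etc + " s"
                                   : "cannot extend within quota; requeue");
    }
}
```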

  37. Application in 2008 (Pseudo-log)
  Node1> ts –v –in mercury.ultralight.org:/data01/big/zmumu05687.root –out venus.ultralight.org:/mstore/events/data –prio 3 –deadline +2:50 –xsum
  -TS: Initiating file transfer setup…
  -TS: Contact path discovery service (PDS) through EHA for transfer request
  -PDS: Path discovery in progress…
  -PDS: Path RTT 128.4 ms, best effort path bottleneck is 10 GE
  -PDS: Path options found:
  -PDS:   Lightpath option exists end-to-end
  -PDS:   Virtual pipe option exists (partial)
  -PDS:   High-performance protocol capable end-systems exist
  -TS: Request 1.2 TB file transfer within 2 hours 50 minutes, priority 3
  -TS: Target host confirms available space for DN=smckee@ultralight.org
  -TS: EHA contacted…parameters transferred
  -EHA: Priority 3 request allowed for DN=smckee@ultralight.org
  -EHA: request scheduling details
  -EHA:   Lightpath prior scheduling (higher/same priority) precludes use
  -EHA:   Virtual pipe sizeable to 3 Gbps available for 1 hour starting in 52.4 minutes
  -EHA: request monitoring prediction along path
  -EHA:   FAST-UL transfer expected to deliver 1.2 Gbps (+0.8/-0.4) averaged over next 2 hours 50 minutes

  38. Hybrid Nets with Circuit-Oriented Services: Lambda Station Example (also OSCARS, TeraPaths, UltraLight) • A network path forwarding service to interface production facilities with advanced research networks • Goal: selective forwarding on a per-flow basis, and alternate network paths for high-impact data movement • Dynamic path modification, with graceful cutover & fallback • Lambda Station interacts with: host applications & systems; LAN infrastructure; site border infrastructure; advanced-technology WANs; remote Lambda Stations • (D. Petravick, P. DeMar)

  39. TeraPaths: BNL and Michigan; partnering with OSCARS (ESnet), Lambda Station (FNAL) and DWMI (SLAC) • Investigate: integration & use of LAN QoS and MPLS-based differentiated network services in the ATLAS distributed computing environment • As a way to manage the network as a critical resource • Architecture: the TeraPaths resource manager handles bandwidth requests & releases, Grid AA, network usage policy and monitoring; it issues MPLS and remote LAN QoS requests towards ESnet (OSCARS) and remote TeraPaths instances at the in(e)gress; GridFTP & SRM data management drive the transfers; traffic identification uses addresses, port numbers and DSCP bits • (Dantong Yu)

  40. Application in 2008
  Node1> ts –v –in mercury.ultralight.org:/data01/big/zmumu05687.root –out venus.ultralight.org:/mstore/events/data –prio 3 –deadline +2:50 –xsum
  -TS: Initiating file transfer setup…
  -TS: Contact path discovery service (PDS) through EHA for transfer request
  -PDS: Path discovery in progress…
  -PDS: Path RTT 128.4 ms, best effort path bottleneck is 10 GE
  -PDS: Path options found:
  -PDS:   Lightpath option exists end-to-end
  -PDS:   Virtual pipe option exists (partial)
  -PDS:   High-performance protocol capable end-systems exist
  -TS: Request 1.2 TB file transfer within 2 hours 50 minutes, priority 3
  -TS: Target host confirms available space for DN=smckee@ultralight.org
  -TS: EHA contacted…parameters transferred
  -EHA: Priority 3 request allowed for DN=smckee@ultralight.org
  -EHA: request scheduling details
  -EHA:   Lightpath prior scheduling (higher/same priority) precludes use
  -EHA:   Virtual pipe sizeable to 3 Gbps available for 1 hour starting in 52.4 minutes
  -EHA: request monitoring prediction along path
  -EHA:   FAST-UL transfer expected to deliver 1.2 Gbps (+0.8/-0.4) averaged over next 2 hours 50 minutes

  41. The Main Components of the Scheduler • Inputs: user/application requests (data transfers, storage) with the requested priority • Scheduling and decision logic, supported by dedicated agents • Possible path/channel options for the requested priority • Priority queues for each segment; reservations & active-transfer lists • Topological graph and traffic information; failure detection; progress monitor • Network control & configuration, and end-host control & configuration • Monitoring of the network & end hosts: SNMP, sFlow, TL1, trace, ping, available bandwidth, /proc, TCP configuration, ApMon, dedicated modules

  42. Learning and Prediction • We must also consider the applications that do not use our APIs, and provide effective network services for all applications • Learning algorithms (such as self-organizing neural networks) will be used to evaluate the traffic created by other applications, to identify major patterns, and to dynamically set up effective connectivity maps based on the monitoring data (a sketch follows) • It is very difficult, if not impossible, to predict all possible events in a complex environment like the Grid and to compile the knowledge about those events in advance • Learning is the only practical approach by which agents can acquire the information necessary to describe their environments • We approach this multi-agent learning task from two different levels: the local level of individual learning agents, and the global level of inter-agent communication • We need to ensure that each agent can be optimized from the local perspective, while the global monitoring mechanism acts as a "driving force" to evolve the agents collectively, based on previous experience
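
As an illustration of the self-organizing-network idea (not the project's actual learning code), a minimal one-dimensional self-organizing map that clusters observed flow feature vectors could look like this; the feature choice and decay factors are assumptions for the example.

```java
// Sketch: a small online self-organizing map (SOM) that clusters flow feature
// vectors (e.g. mean rate, duration, burstiness) to identify recurring traffic
// patterns from applications that do not use our APIs.
import java.util.Random;

class TrafficPatternSom {
    final double[][] units;          // one weight vector per map unit
    double learningRate = 0.5;
    double radius;                   // neighbourhood radius, shrinks over time

    TrafficPatternSom(int nUnits, int nFeatures, long seed) {
        units = new double[nUnits][nFeatures];
        Random rnd = new Random(seed);
        for (double[] u : units)
            for (int d = 0; d < nFeatures; d++) u[d] = rnd.nextDouble();
        radius = nUnits / 2.0;
    }

    /** Feed one observed flow feature vector and adapt the map online. */
    void observe(double[] flow) {
        int bmu = bestMatchingUnit(flow);
        for (int k = 0; k < units.length; k++) {
            // Gaussian neighbourhood around the best-matching unit.
            double h = Math.exp(-Math.pow(k - bmu, 2) / (2 * radius * radius));
            for (int d = 0; d < flow.length; d++)
                units[k][d] += learningRate * h * (flow[d] - units[k][d]);
        }
        learningRate *= 0.999;       // slow decay so the map keeps adapting
        radius = Math.max(0.5, radius * 0.999);
    }

    /** The unit index is the identified traffic pattern for this flow. */
    int bestMatchingUnit(double[] flow) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int k = 0; k < units.length; k++) {
            double dist = 0;
            for (int d = 0; d < flow.length; d++)
                dist += (flow[d] - units[k][d]) * (flow[d] - units[k][d]);
            if (dist < bestDist) { bestDist = dist; best = k; }
        }
        return best;
    }
}
```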
