Lambda Station
BNL
D. Petravick, Fermilab
October 25, 2004
Lambda Station
• Fermilab (Petravick)
• Caltech (Newman)
• Funded by the DOE Office of Science Network Research program.
• 2 or 3 year investigation of issues in bridging local production LANs to advanced networks.
Don Petravick -- Fermilab
Flow Distribution on ESnet
CMS Service Challenge
• Even the initial LHC service challenge would dominate ESnet.
• R&E networks sit outside the production network framework.
• Advanced concepts
• Fewer 9’s
• Much bandwidth
What’s Potential Performance?
R&E networks in the USA
• National Lambda Rail
• DOE UltraScienceNet
• UltraLight
• LHCNet
• HOPI
• FNAL <-> Starlight (humbly)
Characteristics
• DOE UltraScienceNet
• Scheduled availability of 1 and 10 Gbit light paths at its POPs
• UltraLight
• More lambdas
• Optical switching (Glimmerglass switch controlled by MonALISA)
What’s all this about?
Cost:
• National fiber infrastructure for R&E
• Between big POPs only
• Lightpath based
• Low cost, low-level transport
• Belief that general packet routing logic at high packet rates (and perhaps with large variation in destinations) makes networks prohibitively costly.
• Constrained to circuits
• Separate work to get out of the POPs and to the data.
• Higher-layer agnostic
• General transport (e.g. IP, Fibre Channel, ...)
Quality
• Immense efforts on network weather and network quality for shared networks
• Highest performance is achieved by knowledgeable, careful administration.
• Over the WAN? Consistent, multiple occurrences of such care.
• The inside of an optical path should be congestion-free and lossless (except for bit errors), and measurably so in a straightforward way.
• More lambdas
Why?
• Why do we seem to have created an industry?
• Doesn’t this just work with IP?
• Why are people tinkering with what seems to be a successful model?
• Naïve views from network-aware HEP storage system fellows (HEPiX talks)
Wide Area Characteristics
• The most prominent characteristic, compared to the LAN, is the very large bandwidth*delay product.
• Underlying structure: it’s a packet world!
• Possible to use pipes between specific sites
• These circuits can be both static and dynamic
• Both IP and non-IP (for example, Fibre Channel over SONET)
• FNAL has proposed investigations and has just begun studies with its storage systems to optimize WAN file transfers using pipes.
Bandwidth*Delay
• At least bandwidth*delay bytes must be kept in flight on the network to maintain bandwidth.
• This fact is independent of protocol.
• Current practice uses more than this lower limit. For example, US CMS used ~2x for their DC04.
• CERN <–> FNAL has a measured ~60 ms delay
• Using the 2x factor, 120 ms delay gives
• 30 MB/sec: ~3-4 MB “in flight”
• 1000 MB/sec: ~120 MB “in flight”
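The arithmetic on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `in_flight_mb` and its parameters are made up for this example, and the 2x factor is the DC04 rule of thumb quoted on the slide.

```python
def in_flight_mb(bandwidth_mb_per_s: float, delay_ms: float, factor: float = 1.0) -> float:
    """Return the data (in MB) that must be in flight to sustain the bandwidth.

    bandwidth * delay is the lower limit; `factor` scales it the way
    current practice does (the slide quotes ~2x for DC04).
    """
    return bandwidth_mb_per_s * (delay_ms / 1000.0) * factor


if __name__ == "__main__":
    # ~60 ms measured CERN <-> FNAL delay, 2x factor as on the slide.
    for rate in (30, 1000):
        mb = in_flight_mb(rate, 60, factor=2.0)
        print(f"{rate} MB/sec needs about {mb:.1f} MB in flight")
```

The two rates give roughly 3.6 MB and 120 MB, matching the "~3-4 MB" and "~120 MB" figures above.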
Bandwidth*Delay and IP
• Given a single lost packet and a standard MTU size of 1500 bytes, the host will receive many out-of-order packets before receiving the retransmitted missing packet.
• Must incur at least 2 “delays worth”
• FNAL <-> CERN (2*60 ms delay)
• 30 MB/sec: more than 2400 packets
• 1000 MB/sec: more than 80000 packets
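The packet counts above follow directly from rate, delay, and MTU. A minimal sketch, assuming 1 MB = 10^6 bytes and that every packet is a full 1500-byte MTU (the function name is hypothetical):

```python
def packets_before_retransmit(bandwidth_mb_per_s: int, one_way_delay_ms: int,
                              mtu_bytes: int = 1500) -> int:
    """Packets that arrive during the two delays needed to recover one loss.

    Integer arithmetic keeps the result exact:
    bytes/sec * (2 * delay in seconds) / MTU.
    """
    bytes_per_s = bandwidth_mb_per_s * 1_000_000
    return (bytes_per_s * 2 * one_way_delay_ms) // (1000 * mtu_bytes)


if __name__ == "__main__":
    print(packets_before_retransmit(30, 60))    # 2400
    print(packets_before_retransmit(1000, 60))  # 80000
```

The slide says "more than" because these are lower bounds: real packets are often smaller than the MTU, and recovery usually takes longer than two delays.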
Knee-Cliff-Collapse Model
• When load on a segment approaches a threshold, a modest increase in throughput is accompanied by a large increase in delay.
• Even more throughput results in congestion collapse.
• A network cannot be loaded arbitrarily.
• TCP tries to avoid collapse, but its solution has problems at large bandwidth*delay
Bandwidth and Delay and TCP
• The stream model of TCP implies packet buffering is in the kernel, which leads to kernel efficiency issues.
• Vanilla TCP behaves as if all packet loss is caused by congestion.
• TCP’s solution is to back off throughput to avoid congestion collapse, in AIMD fashion:
• Lost packet? Cut packets in flight by ½
• Success? Open window next time by one more packet
• This leads to a very large recovery time at high bandwidth*delay:
• Rho (recovery time) is proportional to RTT*RTT/MTU
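To see why the recovery time scales as RTT*RTT/MTU, here is a rough estimate under a textbook AIMD idealization: the window is halved on a loss and then grows by one packet per round trip. The function name and the example numbers (1500-byte MTU, 120 ms RTT taken from the 2*60 ms FNAL <-> CERN figure) are assumptions for illustration, not measurements.

```python
def aimd_recovery_seconds(bandwidth_bytes_per_s: float, rtt_s: float,
                          mtu_bytes: int = 1500) -> float:
    """Estimate the time to regain full rate after a single loss.

    Full window W = bandwidth * RTT / MTU packets. After halving, the
    window must regain W/2 packets at one packet per RTT, so
    recovery = (W / 2) * RTT = bandwidth * RTT**2 / (2 * MTU).
    """
    window_packets = bandwidth_bytes_per_s * rtt_s / mtu_bytes
    return (window_packets / 2) * rtt_s


if __name__ == "__main__":
    for rate_mb in (30, 1000):
        seconds = aimd_recovery_seconds(rate_mb * 1_000_000, 0.120)
        print(f"{rate_mb} MB/sec: about {seconds / 60:.1f} minutes to recover")
```

Under these assumptions a single loss at 1000 MB/sec costs on the order of an hour of reduced throughput, which is the motivation for the strategies that follow.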
Experience from the test stands. Resolved as local switch issue
Strategies
• Smaller, lower-bandwidth TCP streams in parallel
• Examples of these are GridFTP and bbFTP
• Tweak the AIMD algorithm
• Logic is in the sender’s kernel stack only (congestion window)
• FAST, and others. US CMS used an FNAL kernel mod in DC04
• May not be “fair” to others using shared network resources
• Break the stream model, use UDP and ‘cleverness’, especially for file transfers. But:
• You have to be careful and avoid congestion collapse.
• You need to be fair to other traffic, and be very certain of it
• Isolate the strategy by confining the transfer to a “pipe”
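The appeal of parallel streams can be shown with the same idealized AIMD model: if the total window is split evenly over N streams and one packet is lost, only one stream halves its window. The sketch below assumes equal-sized streams and a single isolated loss; the function name is hypothetical.

```python
def single_loss_impact(n_streams: int, total_window_packets: float, rtt_s: float):
    """Return (fraction of aggregate rate kept, recovery time in seconds).

    One of the N equal streams halves its window, so the aggregate loses
    1/(2N) of its rate, and that stream regains W/(2N) packets at one
    packet per RTT.
    """
    per_stream_window = total_window_packets / n_streams
    fraction_kept = 1.0 - 1.0 / (2 * n_streams)
    recovery_s = (per_stream_window / 2) * rtt_s
    return fraction_kept, recovery_s


if __name__ == "__main__":
    # 80000 packets in flight corresponds to 1000 MB/sec at 120 ms RTT.
    for n in (1, 8, 32):
        kept, recovery = single_loss_impact(n, 80_000, 0.120)
        print(f"{n:2d} streams: keep {kept:.1%} of rate, recover in {recovery:.0f} s")
```

This ignores the fairness concern raised on the slide: N streams also claim roughly N times the share of a congested link.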
Series of TCP investigations
Pipes and File Transfer Primitives
• Tell the network the bandwidth of your stream using RSVP, the Resource Reservation Protocol
• The network will forward the packets/sec you reserved and drop the rest (QoS)
• The network will not oversubscribe the total bandwidth.
• The network leaves some bandwidth out of the QoS for others.
• Unused bandwidth is not available to others at high QoS.
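"Forward what you reserved and drop the rest" is commonly modeled as a token bucket. The sketch below is an assumption made for illustration: RSVP only signals the reservation, and the slide does not say how any particular router enforces it. The class name and parameters are hypothetical.

```python
class TokenBucketPolicer:
    """Admit packets up to a reserved rate, with a small burst allowance."""

    def __init__(self, rate_pps: float, burst_packets: float):
        self.rate_pps = rate_pps          # reserved packets per second
        self.capacity = burst_packets     # maximum stored tokens
        self.tokens = float(burst_packets)
        self.last_time_s = 0.0

    def admit(self, now_s: float) -> bool:
        """Return True if a packet arriving at now_s is forwarded."""
        elapsed = now_s - self.last_time_s
        self.last_time_s = now_s
        # Refill tokens for the elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_pps)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the reservation: dropped


if __name__ == "__main__":
    # Offer 256 packets/sec against a 128 packets/sec reservation.
    policer = TokenBucketPolicer(rate_pps=128, burst_packets=8)
    forwarded = sum(policer.admit(i / 256) for i in range(256))
    print(f"forwarded {forwarded} of 256 offered packets")
```

A sender that exceeds its reservation sees steady loss, which is exactly the situation in which AIMD cuts the transfer rate.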
Storage Element
[Diagram: the Storage Element sits between the grid side (WAN; file stage in, file stage out) and the worker node side (LAN; POSIX-style I/O). Several file servers (FileSrv) serve a pool of worker nodes.]
Storage System and Bandwidth
• The Storage Element does not know the bandwidth of an individual stream very well at all
• For example, a disk may have many simultaneous accessors, or the file may be in memory cache and transferred immediately
• Bandwidth depends on the fileserver disk and your disk.
• Requested bandwidth too small?
• If QoS tosses a packet, AIMD will drastically affect the transfer rate
• Requested bandwidth too high?
• Bandwidth at QoS level is wasted, and the overall experimental rate suffers
• The Storage Element may know the aggregate bandwidth better than individual stream bandwidth.
• The Storage Element therefore needs to aggregate flows onto a pipe between sites, not deal with QoS on a single flow.
• This means the local network will be involved in aggregation.
Lambda Station investigations
Investigate support of static and dynamic pipes by storage systems in WAN transfers.
• Fiber to Starlight optical exchange at Northwestern University.
• Local improvements to forward traffic flows onto the pipe from our LAN
• Local improvements to admit traffic flows onto our LAN from the pipe
• Need changes to the Storage System to exploit the WAN changes.
Why last hop LAN?
• Very, very large commodity infrastructures have been built on LANs and used in HEP.
• Specialized SANs are not generally used in HEP
• The LAN must at least be the starting point for mingling advanced networks and large HENP data systems.
Fiber to Starlight
• FNAL’s fiber pair has the potential for 33 channels between FNAL and Starlight (3 to be activated soon)
• Starlight provides FNAL’s access to Research and Education Networks:
• ESnet
• DOE Science Ultranet
• Abilene
• LHCnet (DOE-funded link to CERN)
• SurfNet
• UKLight
• CA*Net
• National Lambda Rail
LAN – Pipe investigation
• Starlight path bypasses the FNAL border router
• Aggregation of many flows to fill a (dynamic) pipe.
• We believe that pipes will be ‘owned’ by a VO.
• Forwarding to the pipe is done on a per-flow basis
• Starlight path ties directly to the production LAN and production Storage Element (no dual NICs).
Forwarding Server
[Diagram: ESnet, Starlight, forwarding server, router and core network, file server.]
Flow-by-flow Strategy
• Storage element identifies flows to the forwarding server by using layer 5 information
• Host IP, Dest IP, Host Port, Dest Port and Transfer Protocol
• And VO information
• Forwarding server informs the peer site to allow admission
• Forwarding server configures the local router to forward the flow over the DWDM link; otherwise the flow takes the default route
• A 1 GB/s pipe is about 30 flows at 30 MB/s.
• If flows are 1 GB files, this yields about 1 flow change/sec
• Forwarding server allows flows to take an alternate path when the dynamic path is torn down.
• Firewalls may have issues with this.
• Incoming flows are analogous
• The flow-by-flow solution seems to suit the problem well, but there are plenty of implementation issues.
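The flow identifier and the "about 1 flow change/sec" estimate can be made concrete with a small sketch. The `FlowSpec` record and its field names are hypothetical (the slides do not define a message format), and the addresses are reserved documentation addresses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowSpec:
    """What the storage element would tell the forwarding server about a flow."""
    host_ip: str
    dest_ip: str
    host_port: int
    dest_port: int
    protocol: str
    vo: str  # virtual organization that 'owns' the pipe


def flow_changes_per_second(n_flows: int, file_mb: float, flow_mb_per_s: float) -> float:
    """Average rate at which flows finish and must be replaced on the pipe."""
    seconds_per_file = file_mb / flow_mb_per_s
    return n_flows / seconds_per_file


if __name__ == "__main__":
    flow = FlowSpec("192.0.2.10", "198.51.100.20", 50000, 2811, "gridftp", "cms")
    print(flow)
    # 30 flows of 1 GB (1000 MB) files at 30 MB/s each.
    print(f"{flow_changes_per_second(30, 1000, 30):.2f} flow changes/sec")
```

Because the record is frozen it is hashable, so a forwarding server could use it directly as a dictionary key for per-flow routing state.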
Changes to Storage Element to exploit dynamic pipes
• Build semantics into bulk copy interfaces that allow for batching transfers to use bandwidth when available.
• Based on bandwidth availability, dynamically change the number of files transferred in parallel
• Based on bandwidth availability, change the layer-5 (FTP) protocols used
• Switch from FTP to a UDP blaster (SABUL), for example.
• Or change the parameters used to tune layer-5 protocols, for example parallelism within FTP.
• Deal with flows which have not completed when the dynamic pipe is de-allocated.
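One way to picture "dynamically change the number of files transferred in parallel" is a simple sizing rule. This is a hypothetical sketch, not the Storage Element's actual logic: it assumes the available bandwidth and a typical per-file rate are known, and the function and parameter names are invented for the example.

```python
def parallel_transfers(available_mb_per_s: float, per_file_mb_per_s: float,
                       max_parallel: int) -> int:
    """Choose how many files to move at once for the bandwidth on offer.

    Returns 0 when no bandwidth is available; otherwise at least one
    transfer, and never more than max_parallel.
    """
    if available_mb_per_s <= 0:
        return 0
    wanted = int(available_mb_per_s // per_file_mb_per_s)
    return max(1, min(wanted, max_parallel))


if __name__ == "__main__":
    # A dynamic pipe comes up, shrinks, and is then de-allocated.
    for bandwidth in (1000, 300, 10, 0):
        n = parallel_transfers(bandwidth, per_file_mb_per_s=30, max_parallel=64)
        print(f"{bandwidth:5d} MB/s available: run {n} transfers in parallel")
```

The zero case corresponds to the last bullet above: when the pipe goes away, unfinished flows still need a plan, such as falling back to the default route.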
Summary (HEPiX talk)
• There are conventional and research approaches to wide area networks.
• The interactions in the wide area are interesting and important to grid-based data systems
• FNAL now has the facilities in place to investigate a number of these issues.
• Storage Elements are important parts of the investigation and require changes to achieve high-throughput, reliable transfers over the WAN
Summary (intro talk)
• The vision is that large-scale science is enabled by having systems which move data in a state-of-the-art manner.
• A problem is that software time constants are many years
• The tactic is to create demand and mutual understanding via interoperation of advanced networks and HEP data systems.