
Storage System Integration with High Performance Networks



Presentation Transcript


  1. Storage System Integration with High Performance Networks Jon Bakken and Don Petravick FNAL

  2. Overview • Review of some salient characteristics of wide-area networks. • Describe initial investigations at Fermilab for optimizing wide-area file transfers, integrated with production WAN/LAN and storage systems.

  3. Wide Area Characteristics • The most prominent characteristic, compared to a LAN, is the very large bandwidth*delay product. • Underlying structure – it’s a packet world! • Possible to use pipes between specific sites • These circuits can be both static and dynamic • Both IP and non-IP (for example, Fibre Channel over SONET) • FNAL has proposed investigations and has just begun studies with its storage systems to optimize WAN file transfers using pipes.

  4. Bandwidth*Delay • At least bandwidth*delay bytes must be kept in flight on the network to maintain bandwidth. • This fact is independent of protocol. • Current practice uses more than this lower limit. For example, US CMS used ~2x for their DC04. • CERN <–> FNAL has a measured ~60 ms delay • Using the 2x factor, a 120 ms delay gives • 30 MB/sec → ~3-4 MB “in flight” • 1000 MB/sec → ~120 MB “in flight”
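A quick back-of-the-envelope check of these figures (a sketch; the 2x practice factor and the 60 ms delay are taken from the slide):

```python
# In-flight bytes needed to sustain a given rate over a given delay,
# using the slide's 2x practice factor on top of bandwidth*delay.
def bytes_in_flight(rate_mb_per_s: float, delay_ms: float, factor: float = 2.0) -> float:
    return rate_mb_per_s * 1e6 * (delay_ms / 1000.0) * factor

for rate in (30, 1000):
    print(f"{rate} MB/sec -> ~{bytes_in_flight(rate, 60) / 1e6:.0f} MB in flight")
# 30 MB/sec -> ~4 MB in flight, 1000 MB/sec -> ~120 MB in flight
```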

  5. Bandwidth*Delay and IP • Given a single lost packet and a standard MTU size of 1500 bytes, the host will receive many out-of-order packets before receiving the retransmitted missing packet. • Must incur at least 2 “delays worth” • FNAL <-> CERN (2*60 ms delay) • 30 MB/sec: more than 2400 packets • 1000 MB/sec: more than 80000 packets
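The packet counts follow directly from the rate, the 1500-byte MTU, and the two-delay retransmission wait (a sketch of the slide's arithmetic):

```python
MTU = 1500                       # bytes per packet
RETRANSMIT_WAIT_S = 2 * 0.060    # at least two 60 ms delays, FNAL <-> CERN

for rate_mb in (30, 1000):
    packets = rate_mb * 1e6 * RETRANSMIT_WAIT_S / MTU
    print(f"{rate_mb} MB/sec: > {packets:.0f} out-of-order packets")
# 30 MB/sec: > 2400 packets, 1000 MB/sec: > 80000 packets
```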

  6. Knee-Cliff-Collapse Model • When load on a segment approaches a threshold, a modest increase in throughput is accompanied by a great increase in delay. • Even more throughput results in congestion collapse. • One cannot load a network arbitrarily. • TCP tries to avoid collapse, but its solution has problems at large bandwidth*delay

  7. Bandwidth and Delay and TCP • The stream model of TCP implies packet buffering is in the kernel – this leads to kernel efficiency issues. • Vanilla TCP behaves as if all packet loss is caused by congestion. • TCP’s solution is to back off throughput, in AIMD fashion, to avoid congestion collapse: • Lost packet? Cut packets in flight by ½ • Success? Open the window by one more packet per round trip • This leads to a very large recovery time at high bandwidth*delay: • 30 MB/sec drops to 15 MB/sec with just 1 lost packet • Recovery time is 15 MB / 1500-byte MTU = 10000 round trips × 120 ms • Recovery time is 1200 sec = 20 minutes!
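The 20-minute figure can be reproduced with the slide's own model: the lost half of the rate is regained one 1500-byte packet per 120 ms round trip (a sketch that follows the slide's arithmetic literally):

```python
MTU = 1500       # bytes regained per round trip
RTT_S = 0.120    # 2 * 60 ms, FNAL <-> CERN

deficit_bytes = 15e6                # 30 MB/sec halved to 15 MB/sec
round_trips = deficit_bytes / MTU   # 10000 window increments
print(round_trips * RTT_S)          # 1200.0 seconds, i.e. ~20 minutes
```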

  8. Strategies • Smaller, lower bandwidth TCP streams in parallel • Examples of these are GridFTP and BBftp • Tweak AIMD algorithm • Logic is in the sender’s kernel stack only (congestion window) • FAST, and others – USCMS used an FNAL kernel mod in DC04 • May not be “fair” to others using shared network resources • Break the stream model, use UDP and ‘cleverness’, especially for file transfers. But: • You have to be careful and avoid congestion collapse. • You need to be fair to other traffic, and be very certain of it • Isolate strategy by confining transfer to a “pipe”
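As a hedged illustration of the first strategy: with N parallel streams, a single loss halves only one stream, so the aggregate rate drops by 1/(2N) instead of 1/2 (the figures here are illustrative, not from the slide):

```python
# Aggregate rate right after one stream of N suffers a single loss.
def rate_after_one_loss(total_mb_per_s: float, n_streams: int) -> float:
    per_stream = total_mb_per_s / n_streams
    return total_mb_per_s - per_stream / 2   # only the losing stream halves

for n in (1, 4, 16):
    print(f"{n:2d} streams: 30 -> {rate_after_one_loss(30, n):.2f} MB/sec")
# 1 stream: 30 -> 15.00; 4 streams: 30 -> 26.25; 16 streams: 30 -> 29.06
```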

  9. Pipes and File Transfer Primitives • Tell the network the bandwidth of your stream using RSVP (the Resource Reservation Protocol) • The network will forward the packets/sec you reserved and drop the rest (QoS) • The network will not oversubscribe the total bandwidth. • The network leaves some bandwidth out of the QoS for others. • Unused bandwidth is not available to others at high QoS.
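RSVP signalling itself is not sketched here, but the sender-side consequence is: since the network forwards only the reserved packets/sec and drops the rest, the sender must pace to the reservation. A minimal, hypothetical pacing loop:

```python
import time

RESERVED_BYTES_PER_S = 30e6   # e.g. a 30 MB/sec reservation
PACKET = 1500                 # one MTU-sized send at a time

def paced_send(send_packet, n_packets: int) -> None:
    """Call send_packet() at most RESERVED_BYTES_PER_S / PACKET times per second."""
    interval = PACKET / RESERVED_BYTES_PER_S
    next_due = time.monotonic()
    for _ in range(n_packets):
        send_packet()
        next_due += interval
        delay = next_due - time.monotonic()
        if delay > 0:
            time.sleep(delay)   # stay under the reservation so QoS drops nothing
```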

  10. Storage Element [Diagram: a Storage Element whose file servers (FileSrv) face the Grid side over the WAN for file stage in/out, and serve worker nodes on the LAN side via POSIX-style I/O.]

  11. Storage System and Bandwidth • The Storage Element does not know the bandwidth of an individual stream very well at all • For example, a disk may have many simultaneous accessors, or the file may be in memory cache and transferred immediately • Bandwidth depends on the fileserver’s disk and your disk. • Requested bandwidth too small? • If QoS tosses a packet, AIMD will drastically affect the transfer rate • Requested bandwidth too high? • Bandwidth at the QoS level is wasted, and the overall experimental rate suffers • The Storage Element may know the aggregate bandwidth better than individual stream bandwidth. • The Storage Element therefore needs to aggregate flows onto a pipe between sites, not deal with QoS on a single flow. • This means the local network will be involved in aggregation.
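A minimal sketch of that aggregation idea, with hypothetical names: flows are grouped per site pair, and the pipe is sized to the aggregate, never to a single flow:

```python
from collections import defaultdict

# (local site, remote site) -> per-flow bandwidth estimates in MB/sec
pipes: defaultdict[tuple[str, str], list[float]] = defaultdict(list)

def register_flow(local_site: str, remote_site: str, est_mb_per_s: float) -> float:
    """Add a flow to its site-pair pipe and return the aggregate estimate."""
    pipe = pipes[(local_site, remote_site)]
    pipe.append(est_mb_per_s)
    return sum(pipe)   # the QoS reservation is sized to this aggregate

print(register_flow("FNAL", "CERN", 30.0))   # 30.0
print(register_flow("FNAL", "CERN", 30.0))   # 60.0 -> resize the pipe, not the flow
```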

  12. FNAL Investigations • Investigate support of static and dynamic pipes by storage systems in WAN transfers. • Fiber to the Starlight optical exchange at Northwestern University. • Local improvements to forward traffic flows onto the pipe from our LAN • Local improvements to admit traffic flows onto our LAN from the pipe • Need changes to the Storage System to exploit the WAN changes.

  13. Fiber to Starlight • FNAL’s fiber pair has the potential for 33 channels between FNAL and Starlight (3 to be activated soon) • Starlight provides FNAL’s access to Research and Education Networks: • ESnet • DOE Science Ultranet • Abilene • LHCnet (DOE-funded link to CERN) • SurfNet • UKLight • CA*Net • National Lambda Rail

  14. LAN – Pipe investigation • Starlight path bypasses FNAL border router • Aggregation of many flows to fill a (dynamic) pipe. • We believe that pipes will be ‘owned’ by a VO. • Forwarding to the pipe is done on a per flow basis • Starlight path ties directly to production LAN and production Storage Element (no dual NICs).

  15. Forwarding Server [Diagram: file servers attach to the router and core network; a forwarding server steers selected flows out to ESnet and Starlight.]

  16. Flow-by-flow Strategy • The storage element identifies flows to the forwarding server by using layer-5 information • Host IP, Dest IP, Host Port, Dest Port and Transfer Protocol • And VO information • The forwarding server informs the peer site to allow admission • The forwarding server configures the local router to forward the flow over the DWDM link, or the flow takes the default route • A 1 GB/s pipe is about 30 flows at 30 MB/s. • If the flows are 1 GB files, this yields about 1 flow change/sec • The forwarding server allows flows to take the alternate path when the dynamic path is torn down. • Firewalls may have issues with this. • Incoming flows are handled analogously • The flow-by-flow solution seems to suit the problem well, but there are plenty of implementation issues.
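A hypothetical sketch of the handoff: the Storage Element describes each flow by its 5-tuple plus VO, and the forwarding server decides between the DWDM link and the default route (all names and addresses here are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    host_ip: str
    dest_ip: str
    host_port: int
    dest_port: int
    protocol: str   # transfer protocol, e.g. "gridftp"
    vo: str         # the VO that "owns" the pipe

def route_for(flow: Flow, pipe_vo: str, pipe_is_up: bool) -> str:
    # Forward over the pipe only for the owning VO while the dynamic
    # path exists; otherwise the flow simply takes the default route.
    if pipe_is_up and flow.vo == pipe_vo:
        return "dwdm-link"
    return "default-route"

f = Flow("131.225.0.1", "192.65.0.1", 40000, 2811, "gridftp", "cms")
print(route_for(f, "cms", True))    # dwdm-link
print(route_for(f, "cms", False))   # default-route (pipe torn down)
```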

  17. Changes to the Storage Element to exploit dynamic pipes • Build semantics into bulk copy interfaces that allow for batching transfers to use bandwidth when available. • Based on bandwidth availability, dynamically change the number of files transferred in parallel • Based on bandwidth availability, change the layer-5 (FTP) protocols used • Switch from FTP to a UDP blaster (SABUL), for example. • Or change the parameters used to tune layer-5 protocols, for example parallelism within FTP. • Deal with flows which have not completed when the dynamic pipe is de-allocated.
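One of those adaptations, sketched with hypothetical numbers: derive how many files to move in parallel from the bandwidth the pipe currently offers, assuming each stream sustains roughly 30 MB/sec:

```python
PER_STREAM_MB_S = 30.0   # assumed sustainable rate of one transfer

def files_in_parallel(available_mb_per_s: float, queued_files: int) -> int:
    streams = max(1, int(available_mb_per_s // PER_STREAM_MB_S))
    return min(streams, queued_files)

print(files_in_parallel(1000.0, 100))   # 33 parallel transfers on a ~1 GB/s pipe
print(files_in_parallel(0.0, 100))      # 1 -> fall back to a single slow transfer
```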

  18. Summary • There are conventional and research approaches to wide-area networks. • The interactions in the wide area are interesting and important to grid-based data systems • FNAL now has the facilities in place to investigate a number of these issues. • Storage Elements are important parts of the investigation and require changes to achieve high throughput and reliable transfers over the WAN
