The DataTAG Project 1st European Across Grids Conference Santiago de Compostela, Spain

The DataTAG Project 1st European Across Grids Conference Santiago de Compostela, Spain Olivier H. Martin / CERN http://www.datatag.org

Presentation outline
• Partners
  • Funding agencies & cooperating networks
  • EU & US partners
• Goals
• Networking testbed
• Grid interoperability
• Early results
• Status & perspectives
• Major Grid networking issues
• PFLDnet workshop headlines

Funding agencies Cooperating Networks

EU contributors

US contributors

Main Goals
• End-to-end Gigabit Ethernet performance using innovative high-performance transport protocol stacks
• Assess, develop & demonstrate inter-domain QoS and bandwidth reservation techniques
• Interoperability between the middleware/testbeds of major Grid projects in Europe and North America
  • DataGrid, CrossGrid, possibly other EU-funded Grid projects
  • PPDG, GriPhyN, iVDGL, Teragrid (USA)
  • LCG (LHC Computing Grid)

In a nutshell
• Grid-related network research (WP2, WP3)
  • 2.5 Gbps transatlantic lambda between CERN (Geneva) and StarLight (Chicago) (WP1)
  • Dedicated to research (no production traffic)
  • Unique multi-vendor testbed with layer 2 and layer 3 capabilities
  • In effect, a distributed transatlantic Internet Exchange Point
• Interoperability between European and US Grids (WP4)
  • Middleware integration and coexistence; applications
  • GLUE integration/standardization
  • GLUE testbed and demo

Multi-vendor testbed with layer 2 & layer 3 capabilities
[Diagram labels: STARLIGHT (Chicago), CERN (Geneva), INFN (Bologna), INRIA (Lyon), Abilene, Canarie, ESnet, GEANT, Surfnet; 2.5 Gbps research link; Juniper, Cisco 6509, Cisco and Alcatel equipment; GBE attachments; M = layer 2 mux.]

Phase I - Generic configuration (August 2002)
[Diagram labels: CERN and StarLight, each with servers, a GigE switch and a C7606; 2.5 Gbps circuit.]

Phase II (March 2003)
[Diagram labels: CERN with servers, GigE switch, A1670 multiplexer, A7770, C7606, J-M10, C-ONS15454, n*GigE; VTHD routers, Amsterdam, GEANT; STARLIGHT with servers and the same equipment (ditto); Abilene, ESNet, Canarie.]

Phase III (September 2003, tentative)
[Diagram labels: CERN with servers, GigE switch, multi-service multiplexer, A7770, C7606, J-M10, C7609, C-ONS15454; 10 Gbps (n*2.5 Gbps), n*GigE/10GigE; VTHD routers, Amsterdam, GEANT; STARLIGHT with servers and the same equipment (ditto); Abilene, ESNet, Canarie.]

DataTAG connectivity: major 2.5/10 Gbps circuits between Europe & USA
[Map labels: UK SuperJANET4, FR ATRIUM/VTHD and INRIA, NL SURFnet, IT GARR-B, GEANT, CERN, New York, Abilene, ESNET, MREN, STAR-LIGHT, STAR-TAP; link speeds 2.5G, 3*2.5G, 10G.]

DataTAG network map (edoardo.martelli@cern.ch - last update: 20021204)
[Network map of the Chicago and Geneva sites. Routers: r04chi-Cisco7609, r04gva-Cisco7606, r05chi-JuniperM10, r05gva-JuniperM10, r06chi-Alcatel7770, r06gva-Alcatel7770, cernh4-Cisco7609, cernh7-Cisco7609, ar3-chicago-Cisco7606. Other equipment: Alcatel 1670 and ONS15454 multiplexers, Extreme Summit5i and Summit1i switches, Cisco5505 and Cisco2950 management switches, servers w01-w06chi, v10-v13chi and w01-w06gva. Links: STM-4, STM-16 and STM-64 circuits, 1GE and 10GE ports. Connected networks: SURFNET, CANARIE, VTHD/INRIA, ABILENE, Teragrid (JuniperT640), Sunnyvale, GEANT, GARR/CNAF, SWITCH, CERN external network.]

DataTAG/WP4 framework and relationships
[Diagram labels: GriPhyN, PPDG, iVDGL; HEP applications, other experiments; integration; HICB/HIJTB; interoperability standardization; GLUE.]

Status of GLUE activities in DataTAG
Interoperability: between Grid domains for core Grid services, coexistence of optimization/collective services
• Resource discovery and GLUE schema
• Authorization services: from VO LDAP to VOMS
• Common software deployment procedures
• IST2002 and SC2002 joint EU-US demos
• Data movement and description
• Job submission services

Some early results: Atlas Canada Lightpath Data Transfer Trial
A terabyte of research data was recently transferred disk-to-disk between Vancouver and CERN at close to Gbps rates. This is equivalent to transferring a full CD in less than 8 seconds (or a full-length DVD movie in less than 1 minute).
How much data is a terabyte? It is equivalent to the amount of data on approximately 1500 CDs (680 MB each) or 200 full-length DVD movies.
Corrie Kost, Steve McDonald (TRIUMF), Bryan Caron (Alberta), Wade Hong (Carleton)

Extreme Networks TRIUMF CERN The DataTAG Project

Comparative Results The DataTAG Project

Project status
• A great deal of expertise on high-speed transport protocols is now available through DataTAG
  • We plan more active dissemination in 2003 to share our experience with the Grid community at large
• The DataTAG testbed is open to other EU Grid projects
• To guarantee exclusive access to the testbed, a reservation application has been developed
  • It has proved to be essential
  • Additional features will need to be provided, as the access requirements to the testbed are evolving (e.g. access to GEANT, INRIA) and the testbed itself is changing (e.g. inclusion of additional layer 2 services)

Major Grid networking issues (1)
• QoS (Quality of Service)
  • Still largely unresolved on a wide scale because of the complexity of deployment
  • Non-elevated services like “Scavenger/LBE” (lower than best effort) or Alternate Best Effort (ABE) are very fashionable!
• End-to-end performance in the presence of firewalls
  • There is (and will always be) a lack of high-performance firewalls. Can we rely on products becoming available, or should a new architecture be evolved?
  • Full transparency
• Evolution of LAN infrastructure to 1 Gbps, then 10 Gbps
• Uniform end-to-end performance (LAN/WAN)

CERN’s new firewall architecture
[Diagram labels: regular flow; HTAR (High Throughput Access Route); CERNH2 (Cisco OSR 7603); Cisco PIX; Cabletron SSR; security monitor; Gbit Ethernet, 1/10 Gbit Ethernet and FastEthernet links.]

Major Grid networking issues (2)
• TCP/IP performance over high-bandwidth, long-distance networks
  • The loss of a single packet will affect a 10 Gbps stream with 100 ms RTT (round trip time) for 1.16 hours. During that time the average throughput will be 7.5 Gbps.
  • On the 2.5 Gbps DataTAG circuit with 100 ms RTT, this translates to a 38-minute recovery time, during which the average throughput will be 1.875 Gbps.
• Link error & loss event rates
  • A 2.5 Gbps circuit can carry about 0.2 million 1500-byte packets per second
  • A bit error rate of 10^-9 means 2.5 Gbps × 10^-9 = 2.5 bit errors per second, i.e. about one packet loss every 400 milliseconds
  • A bit error rate of 10^-11 means about one packet loss every 40 seconds

TCP/IP Responsiveness (I) (courtesy S. Ravot, Caltech)
• The responsiveness r measures how quickly we go back to using the network link at full capacity after experiencing a loss, assuming that the congestion window size is equal to the bandwidth-delay product when the packet is lost:
$$r = \frac{C \cdot RTT^{2}}{2 \cdot MSS}$$
where C is the capacity of the link.
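The responsiveness formula can be evaluated directly. The following is a minimal sketch, not code from the presentation: the function name, the default MSS of 1500 bytes and the `delayed_ack` flag are assumptions made for illustration. The capacity is taken in bits per second and the MSS is converted to bits so that the result is in seconds.

```python
def responsiveness_seconds(capacity_bps, rtt_s, mss_bytes=1500, delayed_ack=False):
    """Recovery time of standard TCP after one loss: r = C * RTT^2 / (2 * MSS)."""
    mss_bits = 8 * mss_bytes  # express the MSS in bits to match the capacity unit
    r = capacity_bps * rtt_s ** 2 / (2 * mss_bits)
    # With delayed ACKs the window grows by one segment every two RTTs,
    # so the recovery time doubles.
    return 2 * r if delayed_ack else r


if __name__ == "__main__":
    # Illustrative cases only: 10 Gbps at 100 ms, and 1 Gbps at 117 ms.
    for capacity, rtt in [(10e9, 0.100), (1e9, 0.117)]:
        r = responsiveness_seconds(capacity, rtt)
        print(f"C = {capacity / 1e9:g} Gbps, RTT = {rtt * 1000:g} ms: r = {r / 60:.1f} min")
```

Because the MSS appears in the denominator, a larger segment size shortens the recovery time in direct proportion.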

TCP/IP Responsiveness (II) (courtesy S. Ravot, Caltech)
The Linux kernel 2.4.x implements delayed acknowledgments. With delayed acknowledgments the congestion window grows half as fast, so the responsiveness is multiplied by two. Therefore, the values above have to be multiplied by two!

Maximum throughput with standard window sizes as a function of the RTT
RTT (ms)   W = 16 KB   W = 32 KB   W = 64 KB
25         640 KB/s    1.28 MB/s   2.56 MB/s
50         320 KB/s    640 KB/s    1.28 MB/s
100        160 KB/s    320 KB/s    640 KB/s
N.B. The best throughput one can hope for, on a standard intra-European path with 50 ms RTT, is only about 10 Mb/s!
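The table follows from the fact that a sender can have at most one window of data in flight per round trip, so throughput is bounded by window / RTT. A small sketch that reproduces the figures, assuming (as the slide's numbers imply) 1 KB = 1000 bytes; the function names are illustrative:

```python
def max_throughput_kB_per_s(window_kB, rtt_ms):
    """Upper bound on throughput: one full window per round trip."""
    return window_kB * 1000.0 / rtt_ms


def kB_per_s_to_Mb_per_s(rate_kB_per_s):
    """Convert kilobytes per second to megabits per second."""
    return rate_kB_per_s * 8 / 1000.0


if __name__ == "__main__":
    for rtt_ms in (25, 50, 100):
        row = [max_throughput_kB_per_s(w, rtt_ms) for w in (16, 32, 64)]
        print(rtt_ms, "ms:", ", ".join(f"{v:g} KB/s" for v in row))
```

With a 64 KB window and a 50 ms RTT the bound is 1280 KB/s, i.e. about 10 Mb/s, which is the figure quoted in the note.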

Considerations on WAN & LAN
• For many years the Wide Area Network has been the bottleneck, hence the common belief that if that bottleneck were to disappear, a global transparent Grid could easily be deployed!
• Unfortunately, in real environments good end-to-end performance, e.g. Gigabit Ethernet, is somewhat easier to achieve when the bottleneck link is in the WAN rather than in the LAN.
  • E.g. 1 GigE over 622M rather than 1 GigE over 2.5G

Considerations on WAN & LAN (cont.)
• The dream of abundant bandwidth has now become a reality in large, but not all, parts of the world!
• The challenge has shifted from getting adequate bandwidth to deploying adequate LANs and security infrastructure, as well as making effective use of them!
• Major transport protocol issues still need to be resolved; however, there are very encouraging signs that practical solutions may now be in sight (see PFLDnet summary).

PFLDnet workshop (CERN, Feb 3-4)
• 1st workshop on protocols for fast long-distance networks
  • Co-organized by Caltech & DataTAG
  • Sponsored by Cisco
• Most key actors were present
  • e.g. S. Floyd, T. Kelly, S. Low
• Headlines:
  • High Speed TCP (HSTCP), Limited Slow-Start
  • QuickStart, XCP, Tsunami
  • GridDT, Scalable TCP, FAST (Fast AQM (Active Queue Management) Scalable TCP)

TCP dynamics (10 Gbps, 100 ms RTT, 1500-byte packets)
• Window size (W) = Bandwidth × Round Trip Time
  • Wbits = 10 Gbps × 100 ms = 1 Gb
  • Wpackets = 1 Gb / (8 × 1500) = 83333 packets
• Standard Additive Increase Multiplicative Decrease (AIMD) mechanisms:
  • W = W/2 (halving the congestion window on a loss event)
  • W = W + 1 (increasing the congestion window by one packet every RTT)
• Time to recover from W/2 to W (congestion avoidance) at 1 packet per RTT:
  • RTT × Wpackets/2 = 1.157 hours
  • In practice, 1 packet per 2 RTT because of delayed ACKs, i.e. 2.31 hours
• Packets per second:
  • Wpackets / RTT = 833,333 packets per second

HSTCP (IETF Draft August 2002)
• Modifies TCP’s response function in order to allow high performance in high-speed environments and in the presence of packet losses
• Target
  • 10 Gbps performance in 100 ms Round Trip Time (RTT) environments
  • Acceptable fairness when competing with standard TCP in environments with packet loss rates of 10^-4 or 10^-5
• Standard TCP response function: Wmss = 1.2/sqrt(p)
  • Equivalent to W/1.5 RTTs between losses

HSTCP Response Function (additive increase, HSTCP vs standard TCP)
Values are for standard TCP; HSTCP values are given in parentheses.
Packet drop rate   Congestion window      RTTs between losses
10^-2              12                     8
10^-3              38                     25
10^-4              120 (263)              80 (38)
10^-5              379 (1795)             252 (57)
10^-6              1200 (12279)           800 (83)
10^-7              3795 (83981)           2530 (123)
…
10^-10             120000 (26864653)      80000 (388)
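The congestion window column can be regenerated from the two response functions. The sketch below is illustrative: the standard TCP function W = 1.2/sqrt(p) is the one quoted on the HSTCP slide, while the HSTCP function W = 0.12/p^0.835 is not stated in the slides and is taken here from the HighSpeed TCP specification (RFC 3649), where it applies for drop rates below about 10^-3. Function names are assumptions.

```python
import math


def standard_tcp_window(p):
    """Average congestion window (in segments) of standard TCP at drop rate p."""
    return 1.2 / math.sqrt(p)


def standard_tcp_rtts_between_losses(p):
    """Standard TCP sees one loss roughly every W/1.5 round trips."""
    return standard_tcp_window(p) / 1.5


def hstcp_window(p):
    """HSTCP response function as given in RFC 3649 (assumed, not from the slides)."""
    return 0.12 / p ** 0.835


if __name__ == "__main__":
    for exponent in (4, 5, 6, 7):
        p = 10.0 ** -exponent
        print(f"p = 1e-{exponent}: standard W = {standard_tcp_window(p):.0f}, "
              f"HSTCP W = {hstcp_window(p):.0f}")
```

At a drop rate of 10^-7, for example, standard TCP sustains a window of only a few thousand segments, while the HSTCP function allows a window of the size needed for 10 Gbps at 100 ms RTT.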

Limited Slow-Start (IETF Draft August 2002)
• The current « slow-start » procedure can result in increasing the congestion window by thousands of packets in a single RTT
  • Massive packet losses
  • Counter-productive
• Limited slow-start introduces a new parameter, max_ssthresh, in order to limit the increase of the congestion window
  • Applies when max_ssthresh < cwnd < ssthresh
  • Recommended value: 100 MSS
• When cwnd > max_ssthresh:
  • K = int(cwnd/(0.5*max_ssthresh))
  • cwnd += int(MSS/K) for each received ACK
  • instead of cwnd += MSS
• This ensures that cwnd is increased by at most max_ssthresh/2 per RTT, i.e.
  • ½ MSS per ACK when cwnd = max_ssthresh,
  • 1/3 MSS per ACK when cwnd = 1.5*max_ssthresh,
  • etc.
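The per-ACK rule above can be sketched as follows. This is an illustration under stated assumptions, not the draft's reference code: cwnd, MSS and max_ssthresh are held in bytes, one ACK is assumed per segment (no delayed ACKs), and the limited rule is applied from cwnd = max_ssthresh onwards so that the slide's "½ MSS when cwnd = max_ssthresh" example holds. The function names are hypothetical.

```python
def limited_slow_start_ack(cwnd, mss, max_ssthresh):
    """Return the new cwnd (bytes) after one ACK during slow-start."""
    if cwnd < max_ssthresh:
        return cwnd + mss  # ordinary slow-start: one MSS per ACK
    k = int(cwnd / (0.5 * max_ssthresh))
    return cwnd + int(mss / k)  # limited slow-start: MSS/K per ACK


def limited_slow_start_rtt(cwnd, mss, max_ssthresh):
    """Apply the per-ACK rule for one round trip, one ACK per segment in flight."""
    acks = cwnd // mss
    for _ in range(acks):
        cwnd = limited_slow_start_ack(cwnd, mss, max_ssthresh)
    return cwnd


if __name__ == "__main__":
    MSS = 1500
    MAX_SSTHRESH = 100 * MSS
    cwnd = MAX_SSTHRESH
    for rtt in range(1, 6):
        cwnd = limited_slow_start_rtt(cwnd, MSS, MAX_SSTHRESH)
        print(f"after RTT {rtt}: cwnd = {cwnd / MSS:.1f} MSS")
```

Starting from cwnd = max_ssthresh, the window grows by roughly max_ssthresh/2 per round trip instead of doubling.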

Limited Slow-Start (cont.)
• With limited slow-start it takes:
  • log2(max_ssthresh) RTTs to reach the condition where cwnd = max_ssthresh
  • log2(max_ssthresh) + (cwnd - max_ssthresh)/(max_ssthresh/2) RTTs to reach a congestion window of cwnd when cwnd > max_ssthresh
• Thus with max_ssthresh = 100 MSS
  • It would take 836 RTT to reach a congestion window of 83000 packets
  • Compared to 16 RTT otherwise (assuming NO packet drops)
  • Transient queue limited to 100 packets, against 32000 packets otherwise!
• Limited slow-start could be used in conjunction with rate-based pacing

Slow-start vs Limited Slow-start
[Figure: congestion window size (MSS, axis from 100 to 100000) versus time (RTT, axis from 16 to 16000) on a 10 Gbps path (RTT = 100 ms, MSS = 1500 B); marked levels: ssthresh (83333) and max_ssthresh (100).]

QuickStart
• Initial assumption is that routers have the ability to determine whether the destination link is significantly under-utilized
  • Similar capabilities are also assumed for Active Queue Management (AQM) and Explicit Congestion Notification (ECN) techniques
• Coarse-grain mechanism focusing only on the initial window size
• Incremental deployment
• New IP & TCP options
  • QS request (IP) & QS response (TCP)
• Initial window size = Rate*RTT*MSS

QuickStart (cont.)
• SYN/SYN-ACK IP packets
  • New IP option: Quick Start Request (QSR)
  • Two TTLs (Time To Live): IP & QSR
  • Sending rate expressed in packets per 100 ms
    • Therefore the maximum rate is 2560 packets/second
  • Rate-based pacing assumed
• Non-participating router ignores the QSR option
  • Therefore does not decrease the QSR TTL
• Participating router
  • Deletes the QSR option or resets the initial sending rate
  • Accepts or reduces the initial rate
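To see what the QuickStart rate encoding implies, the relation "initial window = Rate*RTT*MSS" from the previous slide can be combined with the 100 ms rate unit. This is a rough sketch: the function name and the example values (a granted rate of 256 packets per 100 ms, a 100 ms RTT and a 1500-byte MSS) are assumptions chosen to match the 2560 packets/second maximum quoted above.

```python
def quickstart_initial_window_bytes(rate_pkts_per_100ms, rtt_s, mss_bytes=1500):
    """Initial window = rate (packets/s) * RTT (s) * MSS (bytes)."""
    rate_pkts_per_s = rate_pkts_per_100ms * 10  # the rate unit is packets per 100 ms
    return rate_pkts_per_s * rtt_s * mss_bytes


if __name__ == "__main__":
    window = quickstart_initial_window_bytes(256, 0.100)
    print(f"initial window: {window:.0f} bytes ({window / 1500:.0f} packets)")
```

Under these assumptions the largest initial window on a 100 ms path is a few hundred packets, which is why the slide describes QuickStart as a coarse-grain mechanism for the initial window only.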

Scalable TCP (Tom Kelly, Cambridge)
• The Scalable TCP algorithm modifies the characteristic AIMD behaviour of TCP for the conditions found on high bandwidth-delay links.
• This work differs from the High Speed TCP proposal by using a fixed adjustment for both the increase and the decrease of the congestion window.
• Scalable TCP alters the congestion window, cwnd, on each acknowledgement in an RTT without loss by
  • cwnd -> cwnd + 0.02
• and for each window experiencing loss, cwnd is reduced by
  • cwnd -> cwnd - 0.125*cwnd

Scalable TCP (2)
• The responsiveness of a traditional TCP connection to loss events is proportional to the connection’s window size and round trip time.
• With Scalable TCP the responsiveness is proportional to the round trip time only.
  • This invariance to link bandwidth allows Scalable TCP to outperform traditional TCP in high-speed wide area networks.
• For example, for a connection with a round trip time of 200 ms:
  • a traditional TCP connection needs nearly 3 minutes to recover at 100 Mbit/s and 28 minutes at 1 Gbit/s,
  • while a connection using Scalable TCP has a packet loss recovery time of about 2.7 s at any rate.
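The recovery times quoted above can be checked with a short calculation. This is a sketch under explicit assumptions: 1500-byte packets, standard TCP growing by one packet per RTT without delayed ACKs, and Scalable TCP growing its window by about 1% per RTT (for instance 0.02 per ACK with one ACK for every two segments). The 1% figure is an assumption chosen because it reproduces the 2.7 s value; the function names are illustrative.

```python
import math


def standard_tcp_recovery_s(capacity_bps, rtt_s, mss_bytes=1500):
    """Time to grow from W/2 back to W at one packet per RTT."""
    window_pkts = capacity_bps * rtt_s / (8 * mss_bytes)
    return window_pkts / 2 * rtt_s


def scalable_tcp_recovery_s(rtt_s, growth_per_rtt=0.01, backoff=0.125):
    """Time to grow from (1 - backoff) * W back to W with multiplicative growth."""
    rtts = math.log(1 / (1 - backoff)) / math.log(1 + growth_per_rtt)
    return rtts * rtt_s


if __name__ == "__main__":
    rtt = 0.200
    for capacity in (100e6, 1e9):
        minutes = standard_tcp_recovery_s(capacity, rtt) / 60
        print(f"standard TCP at {capacity / 1e6:g} Mbit/s: {minutes:.1f} min")
    print(f"Scalable TCP at any rate: {scalable_tcp_recovery_s(rtt):.1f} s")
```

The link capacity does not appear in the Scalable TCP calculation at all, which is the bandwidth invariance the slide refers to.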

Scalable TCP status
• Scalable TCP has been implemented on a Linux 2.4.19 kernel.
• The implementation went through various performance debugging iterations, primarily relating to kernel internal network buffers and the SysKonnect driver.
  • These alterations, termed the gigabit kernel modifications, remove the copying of small packets in the SysKonnect driver and scale the device driver decoupling buffers to reflect Gigabit Ethernet devices.
• Initial performance results suggest that the variant is capable of providing high speed in a robust manner using only sender-side modifications.
  • Up to 400% improvement over standard Linux 2.4.19
• It is also intended to improve the code performance to lower CPU utilisation; currently, for example, a transfer rate of 1 Gbit/s uses 50% of a dual 2.2 GHz Xeon, including the user (non-kernel) copy.

Grid DT (Sylvain Ravot, Caltech)
• Similar to MulTCP, i.e. aggregates N virtual TCP connections on a single connection
• Avoids the brute-force approach of opening N parallel connections à la GridFTP or BBFTP
• Set of patches to Linux RedHat that allow control of:
  • Slow start threshold & behaviour
  • AIMD parameters

Linux Patch “GRID DT”
• Parameter tuning
  • New parameter to better start a TCP transfer
    • Set the value of the initial SSTHRESH
• Modifications of the TCP algorithms (RFC 2001)
  • Modification of the well-known congestion avoidance algorithm
    • During congestion avoidance, for every acknowledgement received, cwnd increases by A * (segment size) * (segment size) / cwnd. This is equivalent to increasing cwnd by A segments each RTT. A is called the additive increment.
  • Modification of the slow start algorithm
    • During slow start, for every acknowledgement received, cwnd increases by M segments. M is called the multiplicative increment.
  • Note: A=1 and M=1 in TCP RENO.
• Smaller backoff
  • Reduces the strong penalty imposed by a loss
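The two modified update rules can be written out in a few lines. This is a sketch of the rules as described on the slide, not the GRID DT patch itself: cwnd is kept in bytes as a float, one ACK per segment is assumed, and the function and parameter names are hypothetical.

```python
def congestion_avoidance_ack(cwnd, mss, a=1):
    """Per-ACK increase during congestion avoidance: A * MSS * MSS / cwnd."""
    return cwnd + a * mss * mss / cwnd


def slow_start_ack(cwnd, mss, m=1):
    """Per-ACK increase during slow start: M segments."""
    return cwnd + m * mss


def congestion_avoidance_rtt(cwnd, mss, a=1):
    """Apply the per-ACK rule once per segment in flight, i.e. for one RTT."""
    acks = int(cwnd // mss)
    for _ in range(acks):
        cwnd = congestion_avoidance_ack(cwnd, mss, a)
    return cwnd


if __name__ == "__main__":
    MSS = 1500
    start = 100.0 * MSS
    for a in (1, 4):
        grown = congestion_avoidance_rtt(start, MSS, a)
        print(f"A = {a}: cwnd grows by {(grown - start) / MSS:.2f} segments in one RTT")
```

The growth per RTT comes out slightly below A segments because cwnd increases during the round trip; with A = 1 and M = 1 the rules reduce to the TCP Reno behaviour noted on the slide.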

Grid DT
• Very simple modifications to the TCP/IP stack
• Alternative to multi-stream TCP transfers
  • Single stream vs multiple streams:
    • it is simpler
    • startup/shutdown are faster
    • fewer keys to manage (if it is secure)
• Virtual increase of the MTU
• Compensates for the effect of delayed ACKs
• Can improve “fairness”
  • between flows with different RTT
  • between flows with different MTU

Comments on above proposals
• Recent Internet history shows that:
  • any modification to the Internet standards can take years before being accepted and widely deployed,
  • especially if it involves router modifications, e.g. RED, ECN
• Therefore, the chances of getting QuickStart or XCP type proposals implemented in commercial routers soon are somewhat limited!
• Modifications to TCP stacks are more promising and much easier to deploy incrementally when:
  • only the TCP stack of the sender is affected
• Is active network technology a possible solution to help solve the de facto freeze of the Internet?

Additional slides • Tsunami (S. Wallace/Indiana Uni) • Grid DT (S. Ravot/Caltech) • FAST (S. Low/Caltech) The DataTAG Project

The Tsunami Protocol (S. Wallace, Indiana University)
• Developed specifically to address extremely high-performance batch file transfer over global-scale WANs.
• Transport is UDP using 32K datagrams/blocks superimposed over standard 1500-byte Ethernet packets.
• No sliding window (à la TCP); each missed/dropped block is re-requested autonomously (similar to smart ACK).
• Very limited congestion avoidance compared to TCP. Loss behavior is similar to Ethernet collision behavior, not TCP congestion avoidance.

Tsunami Protocol
[Diagram: UDP data flow from server to client as numbered blocks (3, 4, … 9), each carrying type, sequence number and data; TCP control flow from client to server carrying retransmit requests and the shutdown request.]

Effect of the MTU on the responsiveness
• A larger MTU improves the TCP responsiveness because cwnd increases by one MSS each RTT.
• Couldn’t reach wire speed with the standard MTU
• A larger MTU reduces the overhead per frame (saves CPU cycles, reduces the number of packets)
[Figure: effect of the MTU on a transfer between CERN and Starlight (RTT = 117 ms, bandwidth = 1 Gb/s)]

TCP (netlab.caltech.edu)
• Carries >90% of Internet traffic
• Prevents congestion collapse
• Not scalable to ultrascale networks
• Equilibrium and stability problems
[Figure: ns2 simulation]

FAST (netlab.caltech.edu)
• Intellectual advances
  • New mathematical theory of large-scale networks
  • FAST = Fast Active-queue-managed Scalable TCP
  • Innovative implementation: TCP stack in Linux
• Experimental facilities
  • “High energy physics networks”
  • Caltech and CERN/DataTAG site equipment: switches, routers, servers
  • Level(3) SNV-CHI OC192 link; DataTAG link; Cisco 12406, GbE and 10 GbE port cards donated
  • Abilene, Calren2, …
• Unique features
  • Delay (RTT) as congestion measure
  • Feedback loop for resilient window and stable throughput

SCinet Bandwidth Challenge, SC2002 (Baltimore, Nov 2002): highlights
• FAST TCP
  • Standard MTU
  • Peak window = 14,100 pkts
  • 940 Mbps single flow/GE card
    • 9.4 petabit-meter/sec
    • 1.9 times LSR
  • 9.4 Gbps with 10 flows
    • 37.0 petabit-meter/sec
    • 6.9 times LSR
  • 16 TB in 6 hours with 7 flows
• Implementation
  • Sender-side modification
  • Delay based
  • Stabilized Vegas
[Diagram labels: “Internet: distributed feedback system” (theory: Rf(s), Rb’(s), TCP, AQM, x, p); experiment map with Sunnyvale, Chicago, Baltimore and Geneva and distances of 1000 km, 3000 km and 7000 km; chart of the Sunnyvale-Geneva, Baltimore-Geneva and Baltimore-Sunnyvale paths comparing I2 LSR entries (29.3.00 multiple, 9.4.02 1 flow, 22.8.02 IPv6) with SC2002 runs of 1, 2 and 10 flows.]
C. Jin, D. Wei, S. Low, FAST Team & Partners, netlab.caltech.edu/FAST
