Long-Distance Experiments of the Data Reservoir System
Data Reservoir / GRAPE-DR Project
List of experiment members
• The University of Tokyo
• Fujitsu Computer Technologies
• WIDE project
• Chelsio Communications
Cooperating networks (West to East)
• WIDE
• APAN / JGN2
• IEEAF / Tyco Telecommunications
• Northwest Giga Pop
• CANARIE (CA*net 4)
• StarLight
• Abilene
• SURFnet
• SCinet
• The Universiteit van Amsterdam
• CERN
Computing System for Real Scientists
• Fast CPU, huge memory and disks, good graphics
• Cluster technology, DSM technology, graphics processors
• Grid technology
• Very fast remote file accesses
• Global file systems, data-parallel file systems, replication facilities
• Transparency to local computation
• No complex middleware, and no or only small modifications to existing software
• Real scientists are not computer scientists
• Computer scientists are not a workforce for real scientists
Objectives of Data Reservoir / GRAPE-DR (1)
• Sharing scientific data between distant research institutes
• Physics, astronomy, earth science, simulation data
• High utilization of available bandwidth
• Transferred file data rate > 90% of available bandwidth
• Including header overheads, initial negotiation overheads
• OS and file system transparency
• Storage-level data sharing (high-speed iSCSI protocol on stock TCP)
• Fast single file transfer
Objectives of Data Reservoir / GRAPE-DR (2)
• GRAPE-DR: very high-speed processor
• 2004 – 2008
• Successor of the GRAPE-6 astronomical simulator
• 2 PFLOPS on a 128-node cluster system
• 1 GFLOPS / processor
• 1024 processors / chip
• Semi-general-purpose processing
• N-body simulation, gridless fluid dynamics
• Linear solvers, molecular dynamics
• High-order database searching (genome, protein data, etc.)
Figure: data-intensive scientific computation through global networks. Data sources such as the X-ray astronomy satellite ASUKA, nuclear experiments, the Nobeyama Radio Observatory (VLBI), the Belle experiments, the Digital Sky Survey, the SUBARU telescope, and GRAPE-6 feed Data Reservoir nodes over a very high-speed network; files are distributed and shared, with local accesses for data analysis at the University of Tokyo.
Basic Architecture (figure): two Data Reservoir sites, each with cache disks and local file accesses, share data over a high-latency, very-high-bandwidth network using disk-block-level parallel, multi-stream transfer (a DSM-like distributed shared data architecture).
File accesses on Data Reservoir (figure): scientific detectors and user programs access a row of file servers (1st-level striping); the file servers reach disk servers (IBM x345, dual 2.6 GHz CPUs) through IP switches using iSCSI disk access (2nd-level striping). A two-level striping sketch follows.
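The two striping levels can be read as a simple address mapping from a logical block to a file server and then to a disk server. The sketch below is purely illustrative: the stripe widths, server counts, and mapping order are assumptions, not the actual Data Reservoir layout.

```python
# Illustrative two-level striping: logical block -> (file server, disk server, local block).
# Stripe widths and counts are assumptions, not the actual Data Reservoir parameters.

FILE_SERVERS = 4      # 1st-level stripe width (file servers)
DISK_SERVERS = 4      # 2nd-level stripe width (disk servers behind the IP switches)

def locate_block(logical_block: int) -> tuple[int, int, int]:
    file_server = logical_block % FILE_SERVERS        # 1st-level striping
    within_fs = logical_block // FILE_SERVERS
    disk_server = within_fs % DISK_SERVERS            # 2nd-level striping via iSCSI
    local_block = within_fs // DISK_SERVERS           # block index on that disk server
    return file_server, disk_server, local_block

for blk in range(8):
    fs, ds, lb = locate_block(blk)
    print(f"logical block {blk}: file server {fs}, disk server {ds}, local block {lb}")
```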
Global data transfer (figure): for replication, the disk servers at each site exchange data directly by iSCSI bulk transfer over the global network through the IP switches, separately from the file servers serving local user programs and scientific detectors.
1st Generation Data Reservoir
• 1st generation Data Reservoir (2001 – 2002)
• Efficient use of network bandwidth (SC2002 technical paper)
• Single filesystem image (by striping of iSCSI disks)
• Separation of users' local file accesses from remote disk-to-disk replication
• Low-level data sharing (disk-block level, iSCSI protocol)
• 26 servers for 570 Mbps data transfer
• High sustained efficiency (≒ 93%)
• Low TCP performance
• Unstable TCP performance
Problems of the BWC2002 experiments
• Low TCP bandwidth due to packet losses
• TCP congestion window size control
• Very slow recovery from the fast-recovery phase (> 20 min; see the sketch after this list)
• Imbalance among parallel iSCSI streams
• Packet scheduling by switches and routers
• Users and other network users care only about the total behavior of the parallel TCP streams
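To see why recovery can take tens of minutes, here is a minimal sketch of standard TCP Reno congestion avoidance, which regrows the window by only one segment per RTT after a halving. The path rate (about 1 Gbps) and RTT (about 200 ms for Tokyo-Baltimore) are illustrative assumptions, not numbers from the slides.

```python
# Rough estimate of TCP Reno recovery time on a long fat network.
# Assumed illustrative values (not from the slides): ~1 Gbps path, ~200 ms RTT.

link_bps = 1_000_000_000      # assumed path rate
rtt_s = 0.2                   # assumed round-trip time
mss_bytes = 1460              # standard Ethernet MSS

# Window (in segments) needed to fill the pipe: bandwidth-delay product / MSS.
bdp_segments = link_bps * rtt_s / (mss_bytes * 8)

# After a loss, Reno halves cwnd and then adds one segment per RTT,
# so regaining the lost half takes roughly bdp_segments / 2 round trips.
recovery_s = (bdp_segments / 2) * rtt_s

print(f"window to fill pipe : {bdp_segments:,.0f} segments")
print(f"recovery time       : {recovery_s / 60:.1f} minutes")
# -> on the order of 30 minutes, consistent with the "> 20 min" observation
```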
2nd Generation Data Reservoir
• 2nd generation Data Reservoir (2003)
• Inter-layer coordination for parallel TCP streams (SC2004 technical paper)
• Transmission Rate Controlled TCP (TRC-TCP) for Long Fat pipe Networks
• "Dulling Edges of Cooperative Parallel streams" (DECP): load balancing among parallel TCP streams
• LFN tunneling NIC (Comet TCP)
• 10 times the 2002 performance
• 7.01 Gbps (disk to disk, TRC-TCP), 7.53 Gbps (Comet TCP)
Figure: BWC2003 network path, 24,000 km (15,000 miles) in total, with 15,680 km (9,800 miles) and 8,320 km (5,200 miles) segments over OC-48 x 3, GbE x 4, and OC-192 links through a Juniper T320 router.
Results
• BWC2002
• Tokyo → Baltimore, 10,800 km (6,700 miles)
• Peak bandwidth (on the network): 600 Mbps
• Average file transfer bandwidth: 560 Mbps
• Bandwidth-distance product: 6,048 terabit-meters/second
• BWC2003
• Phoenix → Tokyo → Portland → Tokyo, 24,000 km (15,000 miles)
• Average file transfer bandwidth: 7.5 Gbps
• Bandwidth-distance product: ≒ 168 petabit-meters/second
• More than 25 times the BWC2002 performance in bandwidth-distance product (see the sketch below)
• Still low performance for each TCP stream (460 Mbps)
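The bandwidth-distance products follow directly from throughput times path length, in the bit-meters-per-second unit used in the later Land Speed Record slides. The sketch below recomputes them from the figures above; for BWC2003 it uses the 7.01 Gbps disk-to-disk TRC-TCP rate from the previous slide, which reproduces the ≒ 168 petabit-meters/second value.

```python
# Bandwidth-distance product: average throughput (bit/s) x path length (m).

def bw_distance_product(throughput_bps: float, distance_km: float) -> float:
    """Return the bandwidth-distance product in bit-meters per second."""
    return throughput_bps * distance_km * 1000.0

bwc2002 = bw_distance_product(560e6, 10_800)    # 560 Mbps over 10,800 km
bwc2003 = bw_distance_product(7.01e9, 24_000)   # 7.01 Gbps over 24,000 km

print(f"BWC2002: {bwc2002 / 1e12:,.0f} terabit-meters/s")   # ~6,048
print(f"BWC2003: {bwc2003 / 1e15:,.0f} petabit-meters/s")   # ~168
print(f"improvement: {bwc2003 / bwc2002:.0f}x")             # ~28x, i.e. more than 25x
```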
Problems of the BWC2003 experiments
• Still low single-stream TCP performance
• NICs are GbE
• Software overheads
• Large number of server and storage boxes for a 10 Gbps iSCSI disk service
• Disk density
• TCP single-stream performance
• CPU usage for managing iSCSI and TCP
3rd Generation Data Reservoir
• Hardware and software basis for 100 Gbps distributed data-sharing systems
• 10 Gbps disk data transfer by a single Data Reservoir server
• Transparent support for multiple filesystems (detection of modified disk blocks)
• Hardware (FPGA) implementation of the inter-layer coordination mechanisms
• 10 Gbps Long Fat pipe Network emulator and 10 Gbps data logger
Features of a single-server Data Reservoir
• Low-cost and high-bandwidth disk service
• One server + two RAID cards + 50 disks (see the sketch below)
• Small size, scalable to a 100 Gbps Data Reservoir (10 servers, 2 racks)
• Mother system for the GRAPE-DR simulation engine
• Compatible with the existing Data Reservoir system
• Single-stream TCP performance (must be more than 5 Gbps)
• Automatic adaptation to network conditions
• Maximized multi-stream TCP performance
• Load balancing among parallel TCP streams
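As a rough sanity check on the "50 disks per server" figure, the sketch below estimates the per-disk sequential rate needed to sustain 10 Gbps. The assumed per-disk rate is a typical figure for 2004-era drives, not a number from the slides.

```python
# Per-disk sequential bandwidth needed to sustain 10 Gbps with 50 striped disks.

target_bps = 10e9                # 10 Gbps of disk data
disks = 50                       # disks per server (two RAID cards)
assumed_disk_mb_s = 50.0         # assumed sustained rate of a 2004-era disk (assumption)

needed_mb_s = target_bps / 8 / 1e6 / disks
print(f"needed per disk : {needed_mb_s:.1f} MB/s")                 # ~25 MB/s
print(f"headroom factor : {assumed_disk_mb_s / needed_mb_s:.1f}x")  # ~2x margin
```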
Utilization of a 10 Gbps network
• A single-box 10 Gbps Data Reservoir server
• Quad Opteron server with multiple PCI-X buses (prototype: Sun V40z server; see the bus-bandwidth sketch below)
• Two Chelsio T110 TCP offload NICs
• Disk arrays for the necessary disk bandwidth
• Data Reservoir software (iSCSI daemon, disk driver, data transfer manager)
(Figure: block diagram of the Sun V40z quad Opteron server running Linux 2.6.6, with two Chelsio T110 TCP NICs (10GBASE-SR) on separate PCI-X buses connected to a 10 G Ethernet switch, Ultra320 SCSI adaptors on further PCI-X buses, and the Data Reservoir software on top.)
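The multiple PCI-X buses matter because a single PCI-X/133 slot cannot carry 10 Gbps by itself. A minimal sketch of the raw bus arithmetic, ignoring bus protocol overhead:

```python
# Raw peak bandwidth of one 64-bit / 133 MHz PCI-X bus versus a 10 GbE NIC.

bus_width_bits = 64
bus_clock_hz = 133e6
pci_x_peak_bps = bus_width_bits * bus_clock_hz   # theoretical peak, before overhead

print(f"PCI-X 133 peak : {pci_x_peak_bps / 1e9:.2f} Gbit/s")   # ~8.51 Gbit/s
print("10 GbE line    : 10.00 Gbit/s")
# A single slot tops out below line rate even before bus protocol overhead,
# which is why the NICs and SCSI adaptors are spread across several PCI-X buses
# and why the Summary later calls I/O bus bandwidth the remaining bottleneck.
```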
Single-stream TCP performance
• Importance of a hardware-supported "inter-layer optimization" mechanism
• Comet TCP hardware (TCP tunnel over a non-TCP protocol)
• Chelsio T110 NIC (pacing by the network processor; see the sketch below)
• Dual-ported FPGA-based NIC (under development at the University of Tokyo)
• Standard TCP, standard Ethernet frame size
• Major problems
• Conflict between long-RTT and short-RTT traffic
• Detection of optimal parameters
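Fine-grained pacing is hard to do in software because of the tiny inter-packet gaps involved at these rates. The sketch below just computes the gap a pacer must maintain with standard 1500-byte frames; it is illustrative arithmetic, not the T110 implementation.

```python
# Inter-packet gap a pacer must hold to send at a given rate with 1500-byte frames.

def pacing_gap_us(rate_bps: float, frame_bytes: int = 1500) -> float:
    """Time between frame transmissions, in microseconds."""
    return frame_bytes * 8 / rate_bps * 1e6

for gbps in (1, 5, 7.2, 10):
    print(f"{gbps:>4} Gbps -> {pacing_gap_us(gbps * 1e9):.2f} us between frames")
# At ~7 Gbps the gap is under 2 microseconds, far below the timer resolution of a
# general-purpose OS, hence pacing in the NIC's offload engine or in FPGA hardware.
```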
Tokyo – CERN experiment (Oct. 2004)
• CERN – Amsterdam – Chicago – Seattle – Tokyo
• SURFnet – CA*net 4 – IEEAF/Tyco – WIDE
• 18,500 km WAN PHY connection
• Performance result
• 7.21 Gbps (TCP payload), standard Ethernet frame size, iperf
• 7.53 Gbps (TCP payload), 8K jumbo frames, iperf
• 8.8 Gbps disk-to-disk performance
• 9 servers, 36 disks
• 36 parallel TCP streams
Figure: network topology of the CERN – Tokyo experiment. Data Reservoir nodes at the University of Tokyo (dual Opteron 248 servers, 2.2 GHz, 1 GB memory, Linux 2.6.6, with Chelsio T110 NICs behind Fujitsu XG800 12-port switches) and at CERN in Geneva (IBM x345 servers, dual Intel Xeon 2.4 GHz, 2 GB memory, Linux 2.6.6 / 2.4.x, behind a Foundry BI MG8, a Foundry FEXx448, and an Extreme Summit 400 via GbE) are connected by a 10GBASE-LW path through T-LEX (Foundry NetIron 40G) in Tokyo, Seattle (Pacific Northwest Gigapop), Vancouver, Minneapolis, Chicago (StarLight), and Amsterdam (NetherLight), over WIDE / IEEAF, CA*net 4, and SURFnet.
Difficulties in the CERN – Tokyo experiment
• Detecting SONET-level errors
• Synchronization errors
• Low-rate packet losses
• Lack of tools for debugging the network
• Loopback at L1 is the only useful method
• No tools for performance bugs
• Very long latency
SC2004 Bandwidth Challenge (Nov. 2004)
• CERN – Amsterdam – Chicago – Seattle – Tokyo – Chicago – Pittsburgh
• SURFnet – CA*net 4 – IEEAF/Tyco – WIDE – JGN2 – Abilene
• 31,248 km (RTT 433 ms); a shorter distance of 20,675 km counts under the LSR rules
• Performance result
• 7.21 Gbps (TCP payload), standard Ethernet frame size, iperf
• 1.6 Gbps disk-to-disk performance
• 1 server, 8 disks
• 8 parallel TCP streams
Figure: Bandwidth Challenge network, Pittsburgh – Tokyo – CERN single-stream TCP: a 31,248 km connection (latency 433 ms RTT) between the Data Reservoir booth at SC2004 and the Data Reservoir project at CERN (Geneva) through Tokyo. Dual Opteron 248 servers (2.2 GHz, 1 GB memory, Linux 2.6.6) with Chelsio T110 NICs and Fujitsu XG800 switches terminate the path at both ends; it traverses SCinet, Abilene, StarLight (Chicago), WIDE and APAN/JGN II (Tokyo T-LEX), IEEAF/Tyco/WIDE (Seattle, Pacific Northwest Gigapop), CANARIE (Vancouver, Calgary, Minneapolis), and SURFnet (Amsterdam NetherLight, University of Amsterdam) to CERN, through Juniper T320/T640, Procket, Foundry NetIron 40G and BI MG8, and Force10 routers and switches, plus ONS 15454 / HDXc optical equipment.
Pittsburgh – Tokyo – CERN single-stream TCP speed measurement
• Server configuration
• CPU: dual AMD Opteron 248, 2.2 GHz
• Motherboard: RioWorks HDAMA rev. E
• Memory: 1 GB, Corsair TwinX CMX512RE-3200LLPT x 2, PC3200, CL=2
• OS: Linux kernel 2.6.6
• Disk: Seagate IDE 80 GB, 7,200 rpm (disk speed not essential for this measurement)
• Network interface card
• Chelsio T110 (10GBASE-SR), TCP offload engine (TOE)
• PCI-X/133 MHz I/O bus connection
• TOE function ON
• Technical points
• Traditional tuning and optimization methods: congestion window size, buffer size, queue sizes, etc. (see the sketch below)
• Ethernet frame size (standard frame, jumbo frame)
• Tuning and optimization by inter-layer coordination (Univ. of Tokyo)
• Fine-grained pacing by the offload engine
• Optimization of the slow-start phase
• Utilization of flow control
• Performance of single-stream TCP on the 31,248 km circuit
• 7.21 Gbps (TCP payload), standard Ethernet frames
• ⇒ 224,985 terabit-meters/second
• 80% more than the previous Internet Land Speed Record
(Figures: observed bandwidth on the network at Tokyo T-LEX; map of the network used in the experiment over WIDE, APAN/JGN II, IEEAF/Tyco/WIDE, CANARIE, Abilene, and SURFnet via Seattle, Vancouver, Calgary, Minneapolis, Chicago, Amsterdam, and Geneva.)
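Among the traditional tuning knobs, the socket buffer must cover the bandwidth-delay product of the 31,248 km path. A minimal sketch of that sizing, using the RTT and rate from this slide (the formula is standard; the rounding commentary is ours):

```python
# Socket buffer needed to keep a single TCP stream filled on the SC2004 path.

rtt_s = 0.433                 # round-trip time on the 31,248 km path
target_bps = 7.21e9           # achieved TCP payload rate

bdp_bytes = target_bps * rtt_s / 8
print(f"bandwidth-delay product: {bdp_bytes / 2**20:.0f} MiB")
# Roughly 370 MiB in flight, so send/receive buffers (and the NIC's window
# handling) must be set far beyond Linux defaults to sustain 7+ Gbps at 433 ms RTT.
```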
Difficulties in the SC2004 LSR challenge
• Methods for performance debugging
• Dividing the path (CERN to Tokyo, Tokyo to Pittsburgh)
• Jumbo frame performance (reason unknown)
• Stability of the network
• QoS problem in the Procket 8800 routers
Year-end experiments
• Target
• > 30,000 km LSR distance
• L3 switching at Chicago and Amsterdam
• Period of the experiment
• 12/20 – 1/3
• Holiday season, when the public research networks are lightly used
• System configuration
• A pair of Opteron servers with Chelsio T110 NICs (at N-Otemachi)
• Another pair of Opteron servers with Chelsio T110 NICs for generating competing traffic
• ClearSight 10 Gbps packet analyzer for packet capturing
Figure: single-stream TCP path Tokyo – Chicago – Amsterdam – New York – Chicago – Tokyo. The loop runs from Opteron servers with Chelsio T110 NICs at the University of Tokyo (Fujitsu XG800, ClearSight 10 Gbps capture) through T-LEX and WIDE, across the Pacific over IEEAF/Tyco to Seattle (Pacific Northwest Gigapop), over CANARIE (Vancouver, Calgary, Minneapolis) to Chicago (StarLight), over SURFnet across the Atlantic to Amsterdam (NetherLight, University of Amsterdam), and back via New York (MANLAN), Abilene, TransPAC, and APAN/JGN to Tokyo. Equipment along the path includes OME 6550 and ONS 15454 optical nodes, HDXc cross-connects, Force10 E1200/E600, Foundry NetIron 40G, Cisco 6509 and 12416, Procket 8812/8801, and Juniper T640 routers, over OC-192 and 10 GbE WAN PHY links.
Figure 1a. Network topology (v0.06, Dec 21, 2004, data-reservoir@adm.s.u-tokyo.ac.jp): Data Reservoir servers opteron1 and opteron3 with T110 NICs, an XG1200 switch, and a ClearSight 10GE analyzer on a tap at Tokyo/N-Otemachi connect through T-LEX, UW, CA*net 4, StarLight, MANLAN, SURFnet, and UvA across Seattle, Vancouver, Chicago, NYC, and Amsterdam, using VLANs A (911), B (10), and D (914) over 10GE-LW/LR/SR and OC-192 circuits, through ONS 15454 and OME 6500 optical nodes, E600/E1200 and 6509 switches, PRO8801/PRO8812, T640, and 12416 routers, and HDXc cross-connects on the IEEAF, JGN II, APAN, TransPAC, and Abilene "operational and production-oriented high-performance research and education networks".
Figure 2. Network connection: Tokyo – Seattle – Vancouver – Calgary – Minneapolis – Chicago – NYC – Amsterdam, over CANARIE (CA*net 4), IEEAF/Tyco/WIDE, SURFnet, APAN/JGN2, and Abilene; routers or L3 switches and L1/L2 switches are distinguished along the path.
Difficulties in the Tokyo – AMS – Tokyo experiment
• Packet loss and flow control
• About 1 loss per several minutes with flow control off (see the throughput sketch after this list)
• About 100 losses per several minutes with flow control on
• Flow control at the edge switch does not work properly
• Network behavior differs by direction
• Artificial bursty behavior caused by switches and routers
• Outgoing packets from the NIC are paced properly
• Jumbo frame packet losses
• Packet losses begin at MTU 4500
• An NI40G and Force10 E1200 and E600 are on the path
• Lack of peak network flow information
• 5-minute average bandwidth is useless for TCP traffic
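Even "about 1 loss per several minutes" is significant at this latency. The sketch below uses the well-known Mathis et al. approximation for loss-limited TCP throughput; the roughly 500 ms RTT for the Tokyo – Amsterdam – Tokyo loop, the MSS, and the constant are textbook assumptions rather than measured values.

```python
import math

# Mathis et al. approximation: loss-limited TCP throughput ~ (MSS / RTT) * C / sqrt(p)

def mathis_throughput_bps(loss_prob: float, rtt_s: float = 0.5, mss_bytes: int = 1460) -> float:
    c = math.sqrt(3.0 / 2.0)                      # ~1.22 for periodic loss
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_prob)

for p in (1e-7, 1e-9, 1e-11):
    print(f"loss rate {p:.0e} -> ~{mathis_throughput_bps(p) / 1e9:.2f} Gbps ceiling")
# Sustaining multi-gigabit single-stream rates at ~500 ms RTT needs loss rates
# around 1e-11, i.e. essentially loss-free transport end to end, which is why a
# handful of losses every few minutes is fatal for standard TCP.
```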
Figure: network traffic on routers and switches during the submitted run, observed at StarLight (Force10 E1200), the University of Amsterdam (Force10 E600), Abilene (T640, NYCM to CHIN), and TransPAC (Procket 8801).
Figure: history of the single-stream IPv4 Internet Land Speed Record, plotted as bandwidth-distance product (Tbit·m/s, log scale) over the years 2000 – 2007. 2004/11/9: Data Reservoir project / WIDE project, 148,850 Tbit·m/s (Internet Land Speed Record). 2004/12/25: Data Reservoir project / WIDE project / SURFnet / CANARIE, 216,300 Tbit·m/s (experimental result).
Setting for instrumentation
• Capture packets at both the source end and the destination end (see the sketch below)
• Packet filtering by source/destination IP address
• Capture up to 512 MB
• Precise packet counting was very useful
(Figure: the source server and destination server are connected through a Fujitsu XG800 10 G L2 switch and the network under test, with a tap feeding the ClearSight 10 Gbps capture system.)
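A minimal sketch of the kind of per-flow packet counting that makes captures at both ends useful, here reading a saved pcap with scapy rather than the ClearSight system used in the experiments; the file name and addresses are placeholders.

```python
# Count packets of one flow (filtered by source/destination IP) in a capture file.
# Uses scapy to read a saved pcap; the file name and addresses are placeholders.
from scapy.all import PcapReader
from scapy.layers.inet import IP

SRC, DST = "10.0.0.1", "10.0.1.1"      # placeholder endpoints of the stream under test

packets = 0
ip_bytes = 0
with PcapReader("capture_source_end.pcap") as pcap:
    for pkt in pcap:
        if IP in pkt and pkt[IP].src == SRC and pkt[IP].dst == DST:
            packets += 1
            ip_bytes += pkt[IP].len

print(f"{packets} packets, {ip_bytes} IP bytes for {SRC} -> {DST}")
# Comparing the counts from the source-end and destination-end taps pinpoints
# exactly how many packets were lost in the network between them.
```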
Possible arrangement of devices
• 1U loss-less packet capturing box developed at the University of Tokyo
• Two 10GBASE-LR interfaces
• Lossless capture of up to 1 GB
• Precise time stamping
• Packet generating server
• 3U Opteron server + 10 Gbps Ethernet adaptor
• Pushes the network with more than 5 Gbps of TCP/UDP traffic
• Taps and a tap-selecting switch
• Currently, port mirroring is not good enough for instrumentation
• Captures backbone traffic and server in/out traffic
2-port packet capturing box
• Originally designed for emulating a 10 Gbps network (see the netem sketch below)
• Two 10GBASE-LR interfaces
• 0 to 1000 ms artificial latency (a very long FIFO queue)
• Error rate 0 to 0.1% (in 0.001% steps)
• Lossless packet capturing
• Utilizes the FIFO queue capability
• Data uploadable through USB
• Cheaper than ClearSight or Agilent Technologies analyzers
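A rough software analogue of the box's latency and error emulation can be set up with Linux netem; the sketch below simply issues the tc command from Python. The interface name is a placeholder, and netem is not the FPGA implementation, only a way to reproduce similar delay/loss settings on a test host.

```python
# Software analogue of the emulator settings using Linux netem (not the FPGA box):
# add artificial latency and a low random loss rate on an outgoing interface.
import subprocess

IFACE = "eth1"        # placeholder: interface facing the network under test
DELAY_MS = 200        # artificial latency (the box supports 0 to 1000 ms)
LOSS_PCT = 0.001      # random loss rate in percent (the box supports 0 to 0.1 %)

subprocess.run(
    ["tc", "qdisc", "replace", "dev", IFACE, "root", "netem",
     "delay", f"{DELAY_MS}ms", "loss", f"{LOSS_PCT}%"],
    check=True,
)
print(f"netem on {IFACE}: delay {DELAY_MS} ms, loss {LOSS_PCT} %")
```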
Setting for instrumentation (figure)
• Capture packets at both the source end and the destination end
• Packet filtering by source/destination IP address
(Figure: a tap between the backbone router or switch and the network under test feeds a server with a 10 G Ethernet adaptor and the 2-port packet capture box.)
More expensive plan (figure)
• Capture packets at both the source end and the destination end
• Packet filtering by source/destination IP address
(Figure: two taps on the backbone router or switch and the network under test, each feeding its own 2-port packet capture box, plus a server with a 10 G Ethernet adaptor.)
Summary
• Single-stream TCP
• We removed the TCP-related difficulties
• Now I/O bus bandwidth is the bottleneck
• Cheap and simple servers can exploit a 10 Gbps network
• Lack of methodology for high-performance network debugging
• 3 days of debugging (working overnight)
• 1 day of stable operation (usable for measurements)
• The network seems to fatigue: some trouble always happens
• We need something more effective
• Detailed issues
• Flow control (and QoS)
• Buffer size and policy
• Optical level settings