Realization and Utilization of high-BW TCP on real application
Kei Hiraki
Data Reservoir / GRAPE-DR project, The University of Tokyo
Computing Systems for Real Scientists
• Fast CPUs, huge memory and disks, good graphics
  • Cluster technology, DSM technology, graphics processors
  • Grid technology
• Very fast remote file access
  • Global file systems, data-parallel file systems, replication facilities
• Transparency to local computation
  • No complex middleware, and no or only small modifications to existing software
• Real scientists are not computer scientists
• Computer scientists are not a workforce for real scientists
Objectives of Data Reservoir / GRAPE-DR (1)
• Sharing scientific data between distant research institutes
  • Physics, astronomy, earth science, simulation data
• Very high-speed single-file transfer over a Long Fat pipe Network
  • > 10 Gbps, > 20,000 km, > 400 ms RTT
• High utilization of the available bandwidth
  • Transferred file data rate > 90% of the available bandwidth
  • Including header overheads and initial negotiation overheads
Objectives of Data Reservoir / GRAPE-DR (2)
• GRAPE-DR: very high-speed attached processor for a server
  • 2004 – 2008
  • Successor of the GRAPE-6 astronomical simulator
• 2 PFLOPS on a 128-node cluster system
  • 1 GFLOPS / processor
  • 1024 processors / chip
  • 8 chips / PCI card
  • 2 PCI cards / server
  • 2 M processors / system
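These per-component figures multiply out to the stated system totals; a quick back-of-envelope check, as a sketch only, using the numbers quoted on this slide:

    # Back-of-envelope check of the GRAPE-DR peak-performance target.
    flops_per_processor = 1e9        # 1 GFLOPS per processing element
    processors_per_chip = 1024
    chips_per_card      = 8
    cards_per_server    = 2
    servers             = 128        # 128-node cluster

    total_processors = processors_per_chip * chips_per_card * cards_per_server * servers
    total_flops      = total_processors * flops_per_processor

    print(f"{total_processors / 1e6:.1f} M processors")   # ~2.1 M processors
    print(f"{total_flops / 1e15:.1f} PFLOPS")             # ~2.1 PFLOPS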
Data-intensive scientific computation through global networks
(Diagram: data sources such as the X-ray astronomy satellite ASUKA, nuclear experiments, the Nobeyama Radio Observatory (VLBI), the Belle experiments, the Digital Sky Survey, the SUBARU telescope, and GRAPE-6 each feed a Data Reservoir; distributed shared files travel over a very high-speed network, and data analysis at the University of Tokyo uses local accesses to its own Data Reservoir.)
Basic Architecture
(Diagram: two Data Reservoirs, each serving local file accesses from cache disks, are connected by a high-latency, very-high-bandwidth network; data is distributed and shared between them in a DSM-like architecture, using disk-block-level parallel, multi-stream transfer.)
File accesses on Data Reservoir
(Diagram: scientific detectors and user programs access file servers through first-level striping; the file servers reach disk servers (IBM x345, 2 x 2.6 GHz) through IP switches using iSCSI, with second-level striping across the disk servers.)
Global Data Transfer
(Diagram: the same configuration during transfer; the disk servers on both sides exchange data by iSCSI bulk transfer through the IP switches over the global network, while scientific detectors and user programs keep using the file servers locally.)
Problems found in the 1st-generation Data Reservoir
• Low TCP bandwidth due to packet losses
  • TCP congestion window size control
  • Very slow recovery from the fast recovery phase (> 20 min)
• Imbalance among parallel iSCSI streams
  • Packet scheduling by switches and routers
  • Users and other network users care only about the total behavior of the parallel TCP streams
Fast Ethernet vs. GbE
• iperf, 30-second runs
• Minimum and average throughput: Fast Ethernet > GbE
(Charts: throughput over time for FE and GbE.)
Packet Transmission Rate
• Bursty behavior
  • Transmission for 20 ms against an RTT of 200 ms
  • Idle for the remaining 180 ms
(Chart: transmission rate over time; an annotation marks where packet loss occurred.)
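A hedged back-of-envelope illustration of the burstiness (the 100 Mbps average is an assumed figure chosen to reproduce the 20 ms / 180 ms split; the real trace may differ): the sender drains one congestion window at GbE line rate and then sits idle for the rest of the RTT.

    # Why transmission is bursty: the whole window drains at NIC line rate.
    rtt_s    = 0.2        # ~200 ms path RTT
    avg_bps  = 100e6      # average rate the stream sustains (assumed for illustration)
    line_bps = 1e9        # GbE line rate

    cwnd_bytes = avg_bps * rtt_s / 8          # one window's worth of data
    burst_s    = cwnd_bytes * 8 / line_bps    # time to drain it at line rate
    print(f"burst {burst_s * 1e3:.0f} ms, idle {(rtt_s - burst_s) * 1e3:.0f} ms")  # 20 ms / 180 ms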
Packet Spacing
• Ideal story: transmit one packet every RTT/cwnd
  • 24 µs interval for 500 Mbps (MTU 1500 B)
  • Too high a load for a software-only implementation
  • Low overhead overall, because pacing is only needed during the slow-start phase
(Diagram: packets spread evenly, RTT/cwnd apart, across one RTT.)
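A minimal sketch of the pacing arithmetic (illustrative helper functions, not the actual Data Reservoir code):

    # Pacing interval: one packet every RTT/cwnd, i.e. packet bits / target rate.
    def pacing_interval_us(rate_bps, mtu_bytes=1500):
        """Inter-packet gap (microseconds) that holds the target rate."""
        return mtu_bytes * 8 / rate_bps * 1e6

    def interval_from_cwnd_us(rtt_s, cwnd_packets):
        """Equivalent view: spread cwnd packets evenly over one RTT."""
        return rtt_s / cwnd_packets * 1e6

    print(pacing_interval_us(500e6))          # 24.0 us for 500 Mbps, MTU 1500 B
    print(interval_from_cwnd_us(0.2, 8333))   # ~24 us: one BDP worth of packets at 200 ms RTT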
Example case: 8-byte IPG
• Fast retransmit succeeds
• Smooth transition to congestion avoidance
• Congestion avoidance takes 28 minutes to recover to 550 Mbps
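The 28-minute figure is roughly what linear congestion-avoidance growth predicts on this path; a hedged estimate, assuming an RTT of about 200 ms, a 1460-byte MSS, and a window that collapsed to a small value after the loss:

    # Rough congestion-avoidance recovery time on a long fat pipe:
    # cwnd grows by ~1 segment per RTT, so regaining a full window takes ~cwnd RTTs.
    rtt_s      = 0.2
    mss_bytes  = 1460
    target_bps = 550e6

    cwnd_target = target_bps * rtt_s / (mss_bytes * 8)   # ~9400 segments in one BDP
    recovery_s  = cwnd_target * rtt_s                    # one segment gained per RTT

    print(f"target cwnd ~{cwnd_target:.0f} segments")
    print(f"recovery ~{recovery_s / 60:.0f} minutes")    # ~31 min, same order as the observed 28 min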
Best case: 1023-byte IPG
• Behaves like the Fast Ethernet case
• Proper transmission rate
• Spurious retransmits due to reordering
Imbalance within parallel TCP streams
• Imbalance among parallel iSCSI streams
  • Packet scheduling by switches and routers
  • Meaningless unfairness among the parallel streams
  • Users and other network users care only about the total behavior of the parallel TCP streams
• Our approach
  • Keep Σ cwnd_i constant, for fair TCP network usage toward other users
  • Balance the individual cwnd_i by communicating between the parallel TCP streams
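A minimal conceptual sketch of the rebalancing idea (hypothetical function and values, not the actual kernel-level implementation): the aggregate window is held constant so the group behaves like a single fair TCP flow, while the budget is redistributed evenly among the member streams.

    # Keep the aggregate congestion window of a group of parallel TCP streams
    # constant, and rebalance it evenly among them.
    def rebalance(cwnds):
        """Evenly redistribute the aggregate window; the sum stays constant."""
        total = sum(cwnds)
        n = len(cwnds)
        base, extra = divmod(total, n)
        # Spread any remainder over the first `extra` streams.
        return [base + 1 if i < extra else base for i in range(n)]

    streams  = [120, 15, 300, 45]           # unbalanced per-stream cwnd (in segments)
    balanced = rebalance(streams)
    print(balanced, sum(balanced) == sum(streams))   # [120, 120, 120, 120] True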
3rd Generation Data Reservoir
• Hardware and software basis for 100 Gbps distributed data-sharing systems
• 10 Gbps disk data transfer by a single Data Reservoir server
• Transparent support for multiple file systems (detection of modified disk blocks)
• Hardware (FPGA) implementation of inter-layer coordination mechanisms
• 10 Gbps Long Fat pipe Network emulator and 10 Gbps data logger
Utilization of a 10 Gbps network
• A single-box 10 Gbps Data Reservoir server
  • Quad Opteron server with multiple PCI-X buses (prototype: SUN V40z server)
  • Two Chelsio T110 TCP-offloading NICs
  • Disk arrays for the necessary disk bandwidth
  • Data Reservoir software (iSCSI daemon, disk driver, data transfer manager)
(Diagram: a SUN V40z quad Opteron server running Linux 2.6.6, with two Chelsio T110 TCP NICs on PCI-X buses connected by 10GBASE-SR to a 10G Ethernet switch, and Ultra320 SCSI adaptors on further PCI-X buses for the disks, all driven by the Data Reservoir software.)
Tokyo-CERN experiment (Oct. 2004)
• CERN - Amsterdam - Chicago - Seattle - Tokyo
  • SURFnet – CA*net 4 – IEEAF/Tyco – WIDE
  • 18,500 km WAN PHY connection
• Performance results
  • 7.21 Gbps (TCP payload), standard Ethernet frame size, iperf
  • 7.53 Gbps (TCP payload), 8 KB jumbo frames, iperf
  • 8.8 Gbps disk-to-disk performance
    • 9 servers, 36 disks
    • 36 parallel TCP streams
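For comparison, a hedged calculation of the theoretical TCP payload ceiling on 10 Gbps Ethernet, assuming an 8192-byte jumbo MTU and no TCP options; the measured iperf numbers sit somewhat below these ceilings:

    # Theoretical TCP payload ("goodput") ceiling on 10 Gbps Ethernet.
    def goodput_gbps(link_gbps, mtu):
        """Upper bound on TCP payload rate for a given Ethernet MTU."""
        wire_bytes = mtu + 8 + 14 + 4 + 12   # preamble+SFD, Ethernet header, FCS, inter-frame gap
        payload    = mtu - 20 - 20           # minus IP and TCP headers (no TCP options)
        return link_gbps * payload / wire_bytes

    print(f"{goodput_gbps(10, 1500):.2f} Gbps")   # ~9.49 Gbps ceiling vs. 7.21 Gbps measured
    print(f"{goodput_gbps(10, 8192):.2f} Gbps")   # ~9.91 Gbps ceiling vs. 7.53 Gbps measured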
Tokyo-CERN network connection
(Figure: end systems in Tokyo and Geneva linked via Seattle, Vancouver, Calgary, Minneapolis, Chicago, and Amsterdam over IEEAF, CA*net 4 (CANARIE), and SURFnet; the intermediate nodes are L1 or L2 switches.)
Network topology of the CERN-Tokyo experiment
(Figure: the Data Reservoir at the University of Tokyo consists of Opteron servers (dual Opteron 248 at 2.2 GHz, 1 GB memory, Linux 2.6.6, Chelsio T110 NICs) and IBM x345 servers (dual Intel Xeon 2.4 GHz, 2 GB memory, Linux 2.4.x / 2.6.6) behind Fujitsu XG800 12-port switches and a Foundry NetIron 40G at T-LEX, Tokyo. A 10GBASE-LW WAN PHY path runs via Seattle (Pacific Northwest Gigapop), Vancouver, Minneapolis, Chicago (StarLight), and Amsterdam (NetherLight) over WIDE / IEEAF, CA*net 4, and SURFnet. The Data Reservoir at CERN (Geneva) uses GbE-attached IBM x345 servers behind Foundry BI MG8 and FEXx448 switches and an Extreme Summit 400.)
LSR experiments
• Target
  • > 30,000 km LSR distance
  • L3 switching at Chicago and Amsterdam
• Period of the experiment
  • 12/20 – 1/3
  • Holiday season, when the public research networks are relatively idle
• System configuration
  • A pair of Opteron servers with Chelsio T110 NICs (at N-Otemachi)
  • Another pair of Opteron servers with Chelsio T110 NICs for competing-traffic generation
  • ClearSight 10 Gbps packet analyzer for packet capturing
Network used in the LSR experiment
(Figure 2. Network connection: Tokyo, Seattle, Vancouver, Calgary, Minneapolis, Chicago, New York, and Amsterdam, interconnected by IEEAF/Tyco/WIDE, CA*net 4 (CANARIE), SURFnet, APAN/JGN2, and Abilene; routers or L3 switches and L1 or L2 switches are marked.)
Single-stream TCP path: Tokyo - Chicago - Amsterdam - New York - Chicago - Tokyo
(Figure: Opteron servers with Chelsio T110 NICs at the University of Tokyo (T-LEX, Fujitsu XG800, Foundry NetIron 40G) send a single TCP stream over the IEEAF/Tyco/WIDE WAN PHY link via Seattle (Pacific Northwest Gigapop) and CA*net 4 (CANARIE) to Chicago (StarLight, Force10 E1200), over SURFnet to Amsterdam (NetherLight, University of Amsterdam Force10 E600, HDXc), on to New York (MANLAN), back over Abilene (T640, OC-192) to Chicago, and home to Tokyo over TransPAC / APAN/JGN (Procket 8812/8801, Cisco 12416, Cisco 6509); a ClearSight 10 Gbps analyzer captures the traffic.)
Network traffic on routers and switches
(Charts: traffic during the submitted run on the StarLight Force10 E1200, the University of Amsterdam Force10 E600, the Abilene T640 (NYCM to CHIN), and the TransPAC Procket 8801.)
Summary
• Single-stream TCP
  • We removed the TCP-related difficulties
  • The I/O bus bandwidth is now the bottleneck
  • Cheap, simple servers can enjoy a 10 Gbps network
• Lack of methodology for high-performance network debugging
  • 3 days of debugging (working overnight)
  • 1 day of stable operation (usable for measurements)
  • The network seems to get fatigued: some trouble always happens
  • We need something more effective
• Detailed issues
  • Flow control (and QoS)
  • Buffer size and policy
  • Optical-level settings
Systems used in long-distance TCP experiments: CERN, Pittsburgh, Tokyo
Efficient and effective utilization of the high-speed Internet
• Efficient and effective utilization of a 10 Gbps network is still very difficult
• PHY, MAC, data link, and switches
  • 10 Gbps is ready to use
• Network interface adaptors
  • 8 Gbps is ready to use; 10 Gbps in several months
  • Proper offloading, RDMA implementation
• I/O bus of a server
  • 20 Gbps is necessary to drive a 10 Gbps network
• Drivers, operating system
  • Too many interrupts, buffer memory management
• File system
  • Slow NFS service
  • Consistency problems
Difficulty in a 10 Gbps Data Reservoir
• Disk-to-disk single-stream TCP data transfer
• High CPU utilization (performance limited by the CPU)
  • Too many context switches
  • Too many interrupts from the network adaptor (> 30,000/s)
  • Data copies from buffer to buffer
• I/O bus bottleneck
  • PCI-X/133: maximum 7.6 Gbps data transfer
  • Waiting for PCI-X/266 or PCI Express x8 or x16 NICs
• Disk performance
  • Performance limit of the RAID adaptor
  • Number of disks needed for the data transfer (> 40 disks are required)
• File system
  • High bandwidth in file service is harder than plain data sharing
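The 7.6 Gbps ceiling is close to what the raw bus figures allow, and since disk-to-network data crosses an I/O bus once for the disks and once for the NIC, roughly twice the link rate is needed in aggregate. A rough sketch, assuming a 64-bit bus at 133 MHz:

    # Raw PCI-X/133 bandwidth versus the ~7.6 Gbps observed ceiling.
    bus_width_bits = 64
    bus_clock_hz   = 133e6
    raw_gbps = bus_width_bits * bus_clock_hz / 1e9       # ~8.5 Gbps theoretical peak

    print(f"raw PCI-X/133: {raw_gbps:.2f} Gbps")
    print(f"implied bus efficiency at 7.6 Gbps: {7.6 / raw_gbps:.0%}")   # ~89%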
High-speed IP networks in supercomputing (GRAPE-DR project)
• World's fastest computing system
  • 2 PFLOPS in 2008 (performance on actual application programs)
• Construction of a general-purpose massively parallel architecture
  • Low power consumption at PFLOPS-range performance
  • An MPP architecture more general-purpose than vector architectures
• Use of commodity networks for the interconnect
  • 10 Gbps optical network (2008) + MEMS switches
  • 100 Gbps optical network (2010)
Target performance
(Chart: peak performance in FLOPS versus year, roughly 1970 to 2050, with points for parallel processors and processor chips; GRAPE-DR targets 2 PFLOPS, compared with the Earth Simulator at 40 TFLOPS and the planned KEISOKU supercomputer at 10 PFLOPS.)
GRAPE-DR architecture
• Massively parallel processor
  • Pipelined connection of a large number of PEs
  • SIMASD (Single Instruction on Multiple and Shared Data)
    • Every instruction operates on data in local memory and in shared memory
    • An extension of vector architecture
• Issues
  • Compiler for the SIMASD architecture (currently under development: flat-C)
(Diagram: 512 PEs, each with local memory, an integer ALU, and a floating-point ALU, connected through an on-chip network and on-chip shared memory to external shared memory and the outside world.)
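As a conceptual illustration only (not GRAPE-DR code; NumPy stands in for the PE array), the SIMASD model can be pictured as every PE executing the same instruction on its own local memory together with an operand taken from shared memory:

    # Conceptual illustration of SIMASD: one instruction, applied by all PEs
    # in lockstep to local data combined with a shared operand.
    import numpy as np

    n_pe, words = 512, 4                       # 512 PEs, a few words of local memory each
    local_mem = np.arange(n_pe * words, dtype=np.float64).reshape(n_pe, words)
    shared    = np.full(words, 2.0)            # operand broadcast from shared memory

    # One "instruction": a multiply-add against the shared operand on every PE.
    local_mem = local_mem * shared + shared

    print(local_mem.shape)                     # (512, 4)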
Hierarchical construction of GRAPE-DR
• 512 PEs / chip, 512 GFLOPS / chip
• 2K PEs / PCI board, 2 TFLOPS / PCI board
• 8K PEs / server, 8 TFLOPS / server
• 1M PEs / node, 1 PFLOPS / node
• 2M PEs / system, 2 PFLOPS / system
(The diagram also shows the memory attached at each level.)
Network architecture inside a GRAPE-DR system
(Diagram: an IP storage system with iSCSI servers and AMD-based servers, attached via the memory bus and 100 Gbps optical interfaces, connected through a MEMS-based optical switch and a highly functional router to the outside IP network; an adaptive compiler acts as the total-system conductor for dynamic optimization.)