RFTP: The File Transfer Protocol Application in 100Gbps Networks • Brookhaven National Laboratory and Stony Brook University
Outline • Project Introduction • Challenges and Requirements • Software System Design • Project Status and Future Work • Testbed Evaluation and Demo Results
Today’s Data-intensive Applications • DOE Leadership Computing Facilities, data centers, grid and cloud computing, network storage • Explosion of data, and massive data processing • Scalable storage systems • Ultra-high speed network for data transfer: 100Gbps networks • [Figure: end system connectivity, courtesy of Mellanox]
End-to-End 100G Networking • [Figure: end-to-end networking at 100 Gbits/s. Each end host runs 100G applications on top of FTP 100 (our project and its role) and a 100G NIC; the two hosts are joined by a 100 Gbits/s backbone.]
Problem Definition and Scope • Reliable transfer (error checking and recovery) at 40/100G speeds places a heavy burden on processing power • A coordinated data transfer flow must traverse file systems and the network efficiently, which calls for data path decomposition • Data read-in/write-out: between disks and user memory (backend data path) • Transport: source host memory to destination host memory (frontend data path) • Cost-effective end-to-end data transfer (10x10GE vs. 3x40GE vs. 1x100GE) from sources to sinks
Challenges and Requirements • File system bottlenecks: how to do file stage-in/out • Kernel/software stacks are slow, the same problem as with TCP • Look into software solutions for zero-copy • fopen, sendfile, splice, O_DIRECT: each has some problem or restriction • Look into hardware RDMA to pull data directly into user space • Storage: • Need to support 100Gbps in/out from disk spindles (also SSD, flash I/O) • Multiple RAID controllers are needed to reach 100Gbps, i.e., multiple files need to be streamed into the buffer in parallel • A switch fabric interconnects disk servers and FTP 100 servers • Storage aggregation from disks into the FTP server disk partition
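As an illustration of the software zero-copy option named on this slide, here is a minimal Python sketch built on `socket.sendfile()`, which uses `os.sendfile()` where the platform supports it and otherwise falls back to ordinary `send()` calls. This is not the RFTP implementation; the function names, the 1 MiB payload, and the local socket pair are assumptions chosen only to make the example self-contained.

```python
import os
import socket
import tempfile
import threading


def send_file_zero_copy(path, sock):
    """Send a whole file over a connected stream socket; return bytes sent.

    With os.sendfile() available, the payload moves from the page cache to
    the socket inside the kernel, without a copy through user-space buffers.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)


def receive_all(sock, sink):
    """Drain the socket into `sink` until the peer closes its end."""
    while True:
        chunk = sock.recv(1 << 16)
        if not chunk:
            break
        sink.extend(chunk)


def demo():
    """Round-trip a temporary file through a local socket pair (assumed setup)."""
    payload = os.urandom(1 << 20)  # 1 MiB of test data
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    sender, receiver = socket.socketpair()
    received = bytearray()
    reader = threading.Thread(target=receive_all, args=(receiver, received))
    reader.start()
    try:
        sent = send_file_zero_copy(tmp.name, sender)
    finally:
        sender.close()   # signals end-of-stream to the reader thread
        reader.join()
        receiver.close()
        os.unlink(tmp.name)
    return sent == len(payload) and bytes(received) == payload
```

Calling `demo()` returns `True` when the received bytes match the file. The restriction the slide alludes to shows up even here: the source must be a regular file and the sink a stream socket, so this path does not help with the disk-write side at the receiver.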
Opportunities • CPU: multi-core/many-core/hybrid CPU/GPGPU processors for high processing bandwidth per machine • PCI Express 3 is now available • IBM x3750 M4 with Ivy Bridge to be generally available June 18, 2012 • Mellanox 40Gbps (true line speed) CE • Flash drive directly on the PCI bus or internal bus • Fusion I/O • Storage Class Memory
NUMA Arrangement Problem • Non-Uniform Memory Access (NUMA) further complicates the performance problem. • [Figure: a NUMA node with its local memory and a path to remote memory]
Our Current Testbed • [Figure: four-socket host (Socket 0 to Socket 3) with NUMA Nodes 0 to 7 and their memory banks, interconnected by HyperTransport; two Mellanox adapters (MLNX_0, MLNX_1) attach through PCI bridges.] • System has to be manually configured for performance
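One piece of the manual configuration mentioned above is keeping a transfer process on the cores of a chosen NUMA node. The following Linux-only Python sketch reads the node's CPU list from sysfs and sets the process affinity with `os.sched_setaffinity()`. The helper names are hypothetical, and which node is closest to a given adapter is something the operator must determine for the actual machine; the slides do not specify it.

```python
import os
from typing import List


def parse_cpulist(text: str) -> List[int]:
    """Parse a Linux cpulist string such as "0-3,8,10-11" into CPU ids."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def cpus_of_node(node: int) -> List[int]:
    """Read the CPUs that belong to one NUMA node from sysfs (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())


def pin_to_node(node: int) -> List[int]:
    """Restrict the calling process (pid 0 = self) to the CPUs of one node."""
    cpus = cpus_of_node(node)
    os.sched_setaffinity(0, cpus)
    return cpus
```

For example, `pin_to_node(0)` would confine the process to the cores of node 0 on a host that exposes that node. Affinity only controls where threads run; it does not by itself bind memory or interrupts, which need separate tuning.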
FTP 100 Design Requirements • Data transfer at this speed requires reading and writing multiple files in parallel. • Implement buffer management (stream multiple files into a buffer in system memory), and provide a handshake with the backend file systems • Challenge: synchronization between read and write
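The buffer management requirement can be pictured as several file readers feeding one bounded buffer that a single consumer drains. The Python sketch below uses a `queue.Queue` with a fixed capacity so that readers block when the buffer is full, which is the read/write synchronization point. The block size, buffer depth, and function names are assumptions for illustration, and the consumer here simply reassembles files in memory instead of sending them over a network.

```python
import queue
import threading

BLOCK_SIZE = 64 * 1024  # hypothetical block size
POOL_BLOCKS = 8         # hypothetical number of blocks the buffer can hold


def reader(path, blocks):
    """Stage one file into the shared buffer as (path, offset, data) blocks."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            data = f.read(BLOCK_SIZE)
            if not data:
                break
            blocks.put((path, offset, data))  # waits while the buffer is full
            offset += len(data)


def stream_files(paths):
    """Read all files in parallel; return {path: reassembled bytes}."""
    blocks = queue.Queue(maxsize=POOL_BLOCKS)
    readers = [threading.Thread(target=reader, args=(p, blocks)) for p in paths]
    for t in readers:
        t.start()

    def close_when_done():
        for t in readers:
            t.join()
        blocks.put(None)  # sentinel: no more blocks will arrive

    threading.Thread(target=close_when_done).start()

    out = {p: bytearray() for p in paths}
    while True:
        item = blocks.get()
        if item is None:
            break
        path, offset, data = item
        buf = out[path]
        end = offset + len(data)
        if len(buf) < end:
            buf.extend(b"\0" * (end - len(buf)))  # grow to fit this block
        buf[offset:end] = data  # place the block at its file offset
    return {p: bytes(b) for p, b in out.items()}
```

Tagging each block with its path and offset lets blocks from different files interleave freely in the buffer while the consumer still places every block correctly.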
System Design • Two modes: one based on the conventional TCP/IP stack with zero-copy and kernel bypass, and the other based on RDMA operations over various architectures. • [Figure: system design with RDMA mode]
Middleware Layer • The core of our system is the middleware layer, which is responsible for resource management, task scheduling and synchronization, and parallelism of RDMA operations.
A Clean Design of RFTP Workflow • Separate control/data channels, convenient rput/rget user commands, and clean workflow for performance optimization
Extensions to Standard FTP • We implement a set of file transfer commands that are based on RDMA capability, and they support our rput/rget file transfer
Efficient Flow Control • We maximize read/write throughput on both the network and the disks by implementing credit-based flow control • [Figure: an example of rput]
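The slide does not spell out the protocol, so the following is only a generic sketch of credit-based flow control, not the RFTP wire format. The receiver owns a fixed number of buffer blocks and grants one credit per free block; the sender transmits only while it holds credits; buffers return to the free pool as blocks are written out. All class names, the block count, and the disk rate are hypothetical.

```python
from collections import deque


class CreditReceiver:
    """Owns a fixed number of buffer blocks and advertises them as credits."""

    def __init__(self, num_blocks):
        self.free_blocks = num_blocks
        self.filled = deque()
        self.written = []

    def grant_credits(self):
        """Hand every currently free block to the sender as one credit each."""
        granted, self.free_blocks = self.free_blocks, 0
        return granted

    def deliver(self, block):
        """A block arrived from the network into a previously granted buffer."""
        self.filled.append(block)

    def flush_to_disk(self, max_blocks):
        """Write up to max_blocks out and return their buffers to the pool."""
        n = min(max_blocks, len(self.filled))
        for _ in range(n):
            self.written.append(self.filled.popleft())
        self.free_blocks += n
        return n


class CreditSender:
    """Sends blocks only while it holds credits from the receiver."""

    def __init__(self, blocks):
        self.pending = deque(blocks)
        self.credits = 0

    def send(self, receiver):
        sent = 0
        while self.credits > 0 and self.pending:
            receiver.deliver(self.pending.popleft())
            self.credits -= 1
            sent += 1
        return sent


def transfer(blocks, num_blocks=4, disk_rate=2):
    """Run the exchange in rounds; return (blocks written, peak buffers used)."""
    receiver = CreditReceiver(num_blocks)
    sender = CreditSender(blocks)
    peak = 0
    while sender.pending or receiver.filled:
        sender.credits += receiver.grant_credits()
        sender.send(receiver)
        peak = max(peak, len(receiver.filled))
        receiver.flush_to_disk(disk_rate)
    return receiver.written, peak
```

Because every block in flight corresponds to a credit, the receiver's buffer can never overflow, and the sender's rate automatically tracks how fast the receiving disks drain it.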
Why RFTP over Other Tools? • We are aware of other popular data transfer tools and standards. They all attempt to achieve maximum available network bandwidth • GridFTP from Globus • UDT, which is implemented in GridFTP • Fasp from AsperaSoft • Fast TCP from Caltech • HighSpeed TCP, RFC 3649, and other window-based TCP variants
Advantages • Performance advantages • One-sided RDMA semantics, minimum CPU consumption, efficient disk I/O operations, and minimum data copying • Can easily saturate the network • Flexibility advantages • Two modes: (1) OFED, RDMA operation, and (2) TCP/IP, with kernel zero-copy optimization (sendfile/splice) • Development/deployment advantages • Lightweight, portable, and compatible (with OFED and a minimal upgrade of the Linux kernel)
Project Status and Software Features • FTP over RDMA and TCP has been implemented. The code was released to the Climate 100 project • RDMA • Direct I/O • Sendfile and splice • Handles multiple files of various sizes, and directories • Parallel transfer of single large files • We continue to improve the software systems • One IPDPS HPGC workshop paper accepted, and an extended version for journal submission
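Parallel transfer of a single large file usually means splitting it into byte ranges that independent workers move concurrently. The Python sketch below shows one way to do that locally with `os.pread()` and `os.pwrite()`, which take explicit offsets so workers need no shared file position. The function names, stream count, and 1 MiB I/O size are assumptions, and a local copy stands in for the network path.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_ranges(file_size, num_streams):
    """Return (offset, length) pairs that cover [0, file_size) without overlap."""
    if file_size < 0 or num_streams < 1:
        raise ValueError("file_size must be >= 0 and num_streams >= 1")
    base, extra = divmod(file_size, num_streams)
    ranges = []
    offset = 0
    for i in range(num_streams):
        length = base + (1 if i < extra else 0)  # spread the remainder
        if length == 0:
            continue
        ranges.append((offset, length))
        offset += length
    return ranges


def parallel_copy(src, dst, num_streams=4):
    """Copy src to dst using one worker per byte range; return bytes copied."""
    size = os.path.getsize(src)
    ranges = split_ranges(size, num_streams)
    src_fd = os.open(src, os.O_RDONLY)
    dst_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.ftruncate(dst_fd, size)  # size the destination up front

        def copy_range(rng):
            offset, length = rng
            end = offset + length
            while offset < end:
                data = os.pread(src_fd, min(1 << 20, end - offset), offset)
                if not data:
                    raise IOError("unexpected end of file")
                # Advance by what was actually written; the rest is re-read.
                offset += os.pwrite(dst_fd, data, offset)

        with ThreadPoolExecutor(max_workers=num_streams) as pool:
            list(pool.map(copy_range, ranges))  # list() surfaces worker errors
    finally:
        os.close(src_fd)
        os.close(dst_fd)
    return size
```

For instance, `split_ranges(10, 3)` yields `[(0, 4), (4, 3), (7, 3)]`: the ranges differ by at most one byte, so no single stream becomes the straggler.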
Demo Systems • We held a demo at SuperComputing 2011 • The objectives of our demo were: • Validation of the performance, availability, and reliability of RFTP • Demonstration of line-speed bandwidth achieved in 40Gbps to 100Gbps networks
Testbed Evaluation • 100Gbps WAN testbed (our deployment network) • 100Gbps IP link between NERSC, ANL, and ORNL • Maximum physical disk bandwidth of 80Gbps • 10+ pairs of host-to-host connections between NERSC and ANL, and between NERSC and ORNL • 40Gbps LAN/MAN testbed (our development network) • 40Gbps RDMA link in the Long Island area • Up to 40Gbps SSD bandwidth • We configure and tune our system for good performance, e.g., by adjusting TCP parameters and varying file sizes.
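When tuning TCP parameters for a long, fast path, a common starting point is the bandwidth-delay product (BDP): the amount of data that must be in flight to keep the link full. The sketch below computes it with integer arithmetic and shows how a socket buffer size could be requested. The 50 ms round-trip time and the stream count in the usage note are assumptions for illustration; the slides do not report the testbed's RTT or the buffer sizes actually used.

```python
import socket


def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product in bytes: (bits/s * s) / 8, with RTT in ms."""
    return bandwidth_bps * rtt_ms // 8000


def per_stream_window(bandwidth_bps, rtt_ms, streams):
    """Window each of `streams` equal flows needs to fill the path together."""
    total = bdp_bytes(bandwidth_bps, rtt_ms)
    return -(-total // streams)  # integer ceiling division


def request_buffers(sock, nbytes):
    """Ask the kernel for send/receive buffers of nbytes.

    The kernel may clamp the request to its configured maximums, so the
    effective sizes should be read back with getsockopt().
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, nbytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    )
```

With the assumed 50 ms RTT, `bdp_bytes(100_000_000_000, 50)` gives 625,000,000 bytes for a 100Gbps path, and `per_stream_window(100_000_000_000, 50, 10)` gives 62,500,000 bytes per stream when the load is split across ten connections.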
Example Disk File Transfer Performance Results in DOE ANI Testbed • NERSC, ORNL, ANL • November 17 at SC’11
MAN Performance Results • Metropolitan NY testbed • RFTP saturates the physical link with a modest data block size
WAN Performance Results • Disk-to-disk, NERSC to ORNL, 9 host pairs • Memory-to-memory, NERSC to ORNL, 14 host pairs • Fill Up 100Gbps DOE ANI networks
Future Work • NUMA programming: assign CPU/interrupt/memory/PCI bridge to improve the data transfer rate in multi-core systems • Work with ESnet to get RoCE working on the DOE ANI testbed • Work with ORNL to migrate our FTP system directly onto layer 2 Ethernet-based data transport (Common Communication Interface) • Evaluate our software on hosts with PCI Gen3 and Mellanox 40Gbps cards • Emulate long-latency networks to improve RDMA-based data transfer, and compare with other data transfer applications (GridFTP, BBCP, etc.) • [Figure: a data source and a data sink connected through a 3rd host acting as a bridge, which runs an emulator with netem to add delay and losses]
Live Demo • 100Gbps WAN testbed • 14 pairs of memory-to-memory connections between NERSC and ANL • 9 pairs of disk-to-disk connections between NERSC and ORNL; file sizes are close to 100 GB at each host • Ganglia to monitor the aggregate traffic rate out of NERSC • 40Gbps LAN testbed at BNL • 40Gbps RDMA RoCE link between two hosts at BNL (netqos03, netqos04) • Memory-to-memory data transfer
LAN Performance Results • Local RDMA links • RFTP achieves near line-speed throughput