1. 100 GFTP: An Ultra-High Speed Data Transfer Service Over Next Generation 100 Gigabit Per Second Network Dantong Yu
Stony Brook University/Brookhaven National Lab
2. Outline Project Personnel Update
Dantong Yu,
Thomas Robertazzi
Post-doctoral associates
Qian Chen (September 27, 2010)
Shudong Jin (October 1, 2010)
Student members: Yufen Ren, Tan Li, Rajat Sharma
Project Introduction and Challenges
Software Architecture
Project Plan and Intermediate Testbed
Technical Discussion: RDMA vs. TCP
3. End-to-End 100G Networking
4. Problem Definition and Scope Conventional data transfer protocols (TCP/IP) and file I/O have performance gaps.
Reliable transfer (error checking and recovery) at 100G speed
Coordinated data transfer flow that efficiently traverses file systems and the network; the data path decomposes into:
Data read-in: from source disk to user memory (backend data path)
Need external collaborators to work on this together
Transport: from source host memory to destination host memory (frontend data path)
Data write-out: from user memory to destination disks (backend data path)
Cost-effective end-to-end data transfer (10 x 10GE vs. 1 x 100GE) from sources to sinks
Reduced port counts
5. Challenges (Manageable) Host System Bottlenecks:
Intel architecture: QuickPath Interconnect (QPI)
Theoretical rate: 6.4 GT/s; 6.4 x 16 (effective link width in bits) x 2 (two links, one per direction) / 8 = 25.6 GBytes/s
AMD architecture: HyperTransport
For HT 3.1, a 16-bit bus width gives the same rate, 25.6 GBytes/s.
Requires: PCIe 2.0 x16 (500 MB/s per lane x 16 = 8 GB/s in one direction).
PCI Express and PCIe-based network cards
All NICs are PCIe 2.0 x8: 500 MB/s per lane x 8 = 4 GB/s (one direction)
The fastest configuration, PCIe 2.0 x16 (500 MB/s per lane x 16 = 8 GB/s, one direction), is required for 40 Gbps.
PCIe 3.0 x16, which doubles the speed of PCIe 2.0, is required for 100 Gbps.
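The bus arithmetic on this slide can be rechecked with a short Python sketch. The function names are hypothetical, and the inputs are the slide's nominal figures (6.4 GT/s, 500 MB/s per lane, and a doubled per-lane rate for the third PCI Express generation), not measurements.

```python
# Back-of-the-envelope check of the host bus figures quoted on the slide.
# All inputs are the slide's nominal values; nothing here is measured.

def qpi_bandwidth_gbytes(transfer_rate_gt=6.4, link_width_bits=16, links=2):
    """Aggregate interconnect bandwidth in GBytes/s (both directions)."""
    return transfer_rate_gt * link_width_bits * links / 8

def pcie_bandwidth_gbytes(lanes, mb_per_lane=500):
    """One-direction bus bandwidth in GBytes/s for a given lane count."""
    return lanes * mb_per_lane / 1000

def can_carry(bus_gbytes, link_gbps):
    """True if a bus rate in GBytes/s covers a network rate in Gbit/s."""
    return bus_gbytes * 8 >= link_gbps

if __name__ == "__main__":
    print("QPI aggregate GBytes/s:", qpi_bandwidth_gbytes())
    gen2_x8 = pcie_bandwidth_gbytes(8)
    gen2_x16 = pcie_bandwidth_gbytes(16)
    # Assumption taken from the slide: the next generation doubles the per-lane rate.
    gen3_x16 = pcie_bandwidth_gbytes(16, mb_per_lane=1000)
    for name, rate in [("gen2 x8", gen2_x8), ("gen2 x16", gen2_x16), ("gen3 x16", gen3_x16)]:
        print(name, rate, "GBytes/s;",
              "40G ok" if can_carry(rate, 40) else "40G no",
              "100G ok" if can_carry(rate, 100) else "100G no")
```

The comparison ignores protocol overhead, so a bus that only just covers the line rate on paper would likely fall short in practice.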
6. Challenges with Some Uncertainties and Proposed Solutions File System Bottlenecks: how to do file stage-in/out
The kernel/software stack is slow, the same problem as with TCP.
Look into zero copy, where data is moved into user space in a single copy.
fopen, sendfile, and O_DIRECT each have some problem or restriction.
Look into Lustre RDMA to pull data directly into user space.
Can a single file client (single server) pull files at 100 Gbps?
Look for collaborators who have this type of expertise.
Storage:
Disk spindles need to sustain 100 Gbps in and out
Multiple RAID controllers (large cache).
LSI 3ware supports up to 2.5 GBytes/s read.
Multiple RAID controllers are needed to reach 100 Gbps,
i.e., multiple files need to be streamed into the buffer in parallel.
A switch fabric interconnects the disk servers and the FTP 100 servers.
Storage aggregation from disks into the FTP server disk partition.
Look for collaborators who have this type of expertise.
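As a rough sizing aid for the storage point, the sketch below estimates how many RAID controllers would be needed for a target network rate. The function name is hypothetical, and it assumes every controller sustains the quoted 2.5 GBytes/s peak read rate, which is optimistic.

```python
import math

def controllers_needed(target_gbps=100.0, controller_gbytes_per_s=2.5):
    """Minimum controller count whose combined read rate covers the target.

    Assumes each controller sustains its quoted peak (2.5 GBytes/s on the
    slide) and that the load spreads evenly across controllers.
    """
    target_gbytes_per_s = target_gbps / 8  # Gbit/s -> GBytes/s
    return math.ceil(target_gbytes_per_s / controller_gbytes_per_s)

if __name__ == "__main__":
    for rate in (40, 100):
        print(rate, "Gbps ->", controllers_needed(rate), "controllers")
```

Sustained rates below the quoted peak would push the count higher, which is one reason parallel file streaming is listed as a requirement.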
7. FTP 100 Design Challenges Data transfer at this performance level requires reading and writing multiple files concurrently.
Implement buffer management (stream multiple files into a buffer in system memory or NIC memory) and provide a handshake with the backend file systems
Challenge: synchronization between read and write
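One common way to handle the read/write synchronization named above is a bounded producer-consumer buffer. The following is a minimal Python sketch of that general pattern, not the project's design: `reader`, `drain`, `transfer`, and the `send` callback are hypothetical names, and in-memory streams stand in for real files and for the network send.

```python
import io
import queue
import threading

BLOCK_SIZE = 4  # deliberately tiny so the demo produces several blocks

def reader(file_id, src, buf):
    """Read one source in blocks and push them into the shared buffer."""
    while True:
        block = src.read(BLOCK_SIZE)
        if not block:
            break
        buf.put((file_id, block))  # blocks while the buffer is full
    buf.put((file_id, None))       # end-of-file marker for this source

def drain(buf, n_files, send):
    """Pull blocks from the buffer until every source has finished."""
    remaining = n_files
    while remaining:
        file_id, block = buf.get()
        if block is None:
            remaining -= 1
        else:
            send(file_id, block)

def transfer(sources, send, depth=8):
    """Stream several sources in parallel through one bounded buffer."""
    buf = queue.Queue(maxsize=depth)  # bounded: readers stall when it fills
    threads = [threading.Thread(target=reader, args=(i, s, buf))
               for i, s in enumerate(sources)]
    for t in threads:
        t.start()
    drain(buf, len(sources), send)
    for t in threads:
        t.join()

if __name__ == "__main__":
    received = {}
    transfer([io.BytesIO(b"a" * 10), io.BytesIO(b"b" * 5)],
             lambda fid, blk: received.setdefault(fid, []).append(blk))
    print({fid: b"".join(blks) for fid, blks in received.items()})
```

The bounded queue provides back-pressure: readers pause when the sender falls behind, so memory use stays fixed regardless of file size.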
8. End System Multi-Layer Capability View
9. FTP Development with OpenFabrics rdmacm (RDMA communication manager): user-space libraries for establishing RDMA communication. Includes both InfiniBand-specific and general RDMA communication management libraries for unreliable datagram, reliable connected, and multicast data transfers.
libibverbs: a library that allows user-space processes to use InfiniBand/RDMA "verbs" directly.
10. An Example of FTP via OpenFabrics
11. One Year Roadmap
12. Planned 25 Gbps Lustre System Testbed
13. 40 Gbps Data Transfer Testbed for December/2010
14. 100 Gbps Data Transfer Testbed Proposal
15. Conclusion For a single data transfer stream without disk operations, the RDMA transport is twice as fast as TCP, while RDMA incurs only 10% of the CPU load that TCP does.
FTP includes two components: networking and file operations. Compared with the RDMA operation, file operations (limited by disk performance) take most of the CPU usage. Therefore, a well-designed file buffering mode is critical.
16. Future work Set up the Lustre environment and configure Lustre with RDMA support
Start FTP migration to RDMA
Source control
Bug database
Documentation
Unit tests
17. SOME PRELIMINARY RESULTS
18. Current Environment Is there a switch between the two servers?
19. Tool - iperf Migrate iperf 2.0.5 to the RDMA environment with OFED (librdmacm and libibverbs).
2000+ source lines of code added.
From 8382 to 10562.
iperf usage extended:
-H: RDMA transfer mode instead of TCP/UDP
-G: pr (passive read) or pw (passive write)
Data is read from the server.
The server writes into the clients.
-O: output data file, for both the TCP server and the RDMA server
Only one stream is transferred
20. Test Suites Test suite 1: memory -> memory
Test suite 2: file -> memory -> memory
Test case 2.1: file (regular file) -> memory -> memory
Test case 2.2: file (/dev/zero) -> memory -> memory
Test suite 3: memory -> memory -> file
Test case 3.1: memory -> memory -> file (regular file)
Test case 3.2: memory -> memory -> file (/dev/null)
Test suite 4: file -> memory -> memory -> file
Test case 4.1: file (regular file) -> memory -> memory -> file (regular file)
Test case 4.2: file (/dev/zero) -> memory -> memory -> file (/dev/null)
21. File choice File operations use the standard I/O library
fread and fwrite, cached by the OS
Input from /dev/zero tests the maximum application data transfer rate including the file read operation, with the disk removed as the bottleneck
Output to /dev/null tests the maximum application data transfer rate including the file write operation, with the disk removed as the bottleneck
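To illustrate why /dev/zero and /dev/null isolate the file-operation cost from the disk, here is a minimal Python sketch that times a buffered block copy between two paths. It is not the modified iperf: `copy_throughput` is a hypothetical helper, and Python's buffered file objects stand in for fread/fwrite.

```python
import os
import time

def copy_throughput(src_path, dst_path, total_bytes, block_size=1 << 20):
    """Copy up to total_bytes in blocks; return (bytes_copied, Gbit/s).

    Uses buffered file objects, loosely analogous to fread/fwrite.
    """
    copied = 0
    start = time.perf_counter()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while copied < total_bytes:
            block = src.read(min(block_size, total_bytes - copied))
            if not block:  # source exhausted (regular file shorter than requested)
                break
            dst.write(block)
            copied += len(block)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return copied, copied * 8 / elapsed / 1e9

if __name__ == "__main__":
    # Assumes a Unix-like system where /dev/zero exists.
    n, gbps = copy_throughput("/dev/zero", os.devnull, 256 << 20)
    print(n, "bytes at roughly", round(gbps, 2), "Gbit/s (no disk involved)")
```

Replacing either path with a regular file brings the disk and page cache back into the measurement, which corresponds to the difference between the x.1 and x.2 test cases.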
22. Buffer choice RDMA operation block size is 10MB
RDMA READ/WRITE one time
Previous experiments show that, in this environment, once the block size exceeds 5 MB, increasing it further has little effect on the transfer speed
TCP read/write buffer size is the default
TCP window size: 85.3 KByte (default)
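The effect of a fixed TCP window can be estimated with the standard bandwidth-delay relation, throughput <= window / RTT. The slide gives only the window size; the round-trip times below are illustrative assumptions, not values from the testbed, and the function names are hypothetical.

```python
def window_limited_gbps(window_bytes, rtt_seconds):
    """Upper bound on single-stream TCP throughput for a fixed window."""
    return window_bytes * 8 / rtt_seconds / 1e9

def window_for_rate(rate_gbps, rtt_seconds):
    """Window size in bytes needed to keep a link of rate_gbps full."""
    return rate_gbps * 1e9 / 8 * rtt_seconds

if __name__ == "__main__":
    default_window = 85.3 * 1024  # the 85.3 KByte default reported by iperf
    for rtt in (0.0001, 0.001, 0.01):  # assumed RTTs: 0.1 ms, 1 ms, 10 ms
        print("RTT", rtt, "s: at most",
              round(window_limited_gbps(default_window, rtt), 3), "Gbit/s;",
              "window for 100G:", int(window_for_rate(100, rtt)), "bytes")
```

Under these assumptions the default window caps a single TCP stream well below 100 Gbps at any realistic RTT, so window tuning or parallel streams would be needed for a fair high-speed comparison.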
23. Test case 1: memory -> memory CPU
24. Test case 1: memory -> memory Bandwidth
25. Test case 2.1 (fread): file (regular file) -> memory -> memory CPU
26. Test case 2.1 (fread): file (regular file) -> memory -> memory Bandwidth
27. Test case 2.2 (five minutes) file(/dev/zero) -> memory -> memory CPU
28. Test case 2.2 (five minutes) file(/dev/zero) -> memory -> memory Bandwidth
29. Test case 3.1 (200 GB of file data generated): memory -> memory -> file (regular file) CPU Bandwidths are almost the same!
30. Test case 3.1 (200 GB of file data generated): memory -> memory -> file (regular file) Bandwidth Bandwidths are almost the same!
31. Test case 3.2: memory -> memory -> file (/dev/null) CPU
32. Test case 3.2: memory -> memory -> file (/dev/null) Bandwidth
33. Test case 4.1: file (regular file) -> memory -> memory -> file (regular file) CPU
34. Test case 4.1: file (regular file) -> memory -> memory -> file (regular file) Bandwidth
35. Test case 4.2:file(/dev/zero) -> memory -> memory -> file(/dev/null) CPU
36. Test case 4.2:file(/dev/zero) -> memory -> memory -> file(/dev/null) Bandwidth