Enabling a connection-oriented internet • Outline • Background • CHEETAH • Testbed • Software • Applications: GridFTP + Ensight • Security • Extension of work to connection-oriented internet • Networking research problems Malathi Veeraraghavan Univ. of Virginia mv@cs.virginia.edu Talk at the BNL, May 16, 2005
Team & Acknowledgment • Team (PI/Co-PIs): • Malathi Veeraraghavan, Univ. of Virginia • Nagi Rao, Bill Wing, Tony Mezzacappa, ORNL • John Blondin, NCSU • Ibrahim Habib, CUNY • UVA funding sources: • NSF EIN grant ANI-0335190 • NSF ITR small grant ANI-0312376 • DOE FG02-04ER25640
UVA team members • Postdocs: • Xuan Zheng • Haobo Wang • Graduate students • Xiuduan Fang (GridFTP/PVFS) • Zhanxiang Huang (MPLS) • Tao Li (testing) • Anant P. Mudambi (new transport protocol: FRTP) • Xiangfei Zhu (RSVP-TE)
eScience application requirements for the network • Our eScience partner: the Terascale Supernova Initiative (TSI) project • High end-to-end bandwidth for terabyte-sized file transfers • End-to-end Quality of Service (QoS) assurance for: • remote visualization • remote computational steering
“Communications” answer (no networking) • If • the number of end points is small • costs are not prohibitive • then purchase high-speed communication links end-to-end • Problems left to solve are limited to the end hosts: • disk limitations, bus speed limits, etc.
Important questions for networking • How many end users/hosts need access to the network? • Where are they located? • Role of networking: enable sharing • Answers for HEP community • 300-400 physicists in the US • Located in 30 univs/labs • 2000 physicists from 150 institutions world-wide
Networking community’s answers to eScience needs: two answers • Answer 1: Use existing networks (Internet2/ESnet); upgrade links • Improve TCP by changing only the software at the end hosts • High-speed TCP, Scalable TCP, FAST • Parallel TCP (e.g., GridFTP) • Pros: • Easy to deploy • Cons: • No end-to-end QoS assurance • Answer 2: Deploy new networks (switches) while upgrading links • Connection-oriented • e.g., Cheetah, Dragon, Canarie’s CA*net4 • Complements the type of service offered by Internet2 • Pros: • End-to-end QoS • Cons: • Cost
A little background on networking • What’s a network (wide-area)? • A bunch of communication links interconnected by network switches • Purpose of a network switch: • enable sharing of communication link resources • End hosts generate/store data that needs to be moved • Direct communication links between end hosts: doesn’t scale [Figure: four end hosts interconnected through a switch]
The fun in networks research: Type of sharing • Connectionless (CL) switch (packet switch) • No explicit requests to reserve bandwidth prior to data transfer • All control for bandwidth sharing is implemented at end hosts – TCP software • TCP keeps increasing the rate at which it sends packets • Congestion detected • Rate of sending packets dropped • Slowly starts increasing rate
TCP at end host controls sending rate • Bandwidth share given to one flow keeps varying depending on how many other flows join and leave within the first flow’s duration • Socialistic sharing • Fair, but hard to provide rate guarantees [Figure: sending rate of a flow through a connectionless packet switch, alternately increasing and being decreased]
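The increase/decrease cycle described on the last two slides can be illustrated with a toy additive-increase, multiplicative-decrease loop. This is a simplified sketch, not real TCP: the function name `aimd_rates`, the round-based timing, and the rule "congestion is detected when the rate exceeds capacity" are assumptions made for illustration.

```python
def aimd_rates(capacity, rounds, increase=1.0, decrease=0.5, start=1.0):
    """Return the sending rate used in each round by a single AIMD flow.

    Assumption: congestion is 'detected' whenever the current rate
    exceeds the link capacity; real TCP infers it from losses.
    """
    rate = start
    history = []
    for _ in range(rounds):
        history.append(rate)
        if rate > capacity:
            rate = rate * decrease   # congestion detected: drop the rate
        else:
            rate = rate + increase   # no congestion: keep probing upward
    return history


if __name__ == "__main__":
    # The rate climbs, overshoots the capacity, is halved, and climbs again.
    print(aimd_rates(capacity=3, rounds=8))
```

The sawtooth in the output is why a single flow cannot be promised a fixed rate on a connectionless path.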
A second type of sharing • Connection-oriented (CO) • Request a reservation • If accepted, can be guaranteed the assigned rate • Release reservation when done • Note: uncertainty in whether BW request will be granted • Multiplexed: lanes on a highway • Complete bandwidth on a link used for one connection • Distributed bandwidth management • Scales to large networks [Figure: BW-requests arriving at the bandwidth managers of two connection-oriented switches (packet or circuit)]
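The request / accept-or-reject / release cycle of a connection-oriented switch can be sketched with a per-link bandwidth manager. This is a minimal illustration under assumed names (`BandwidthManager`, `request`, `release`); it is not the signaling software of any particular switch.

```python
class BandwidthManager:
    """Tracks reservations on one link of fixed capacity (rates in Mbps)."""

    def __init__(self, capacity_mbps):
        self.capacity = capacity_mbps
        self.reserved = {}  # connection id -> reserved rate

    def available(self):
        return self.capacity - sum(self.reserved.values())

    def request(self, conn_id, rate_mbps):
        """Accept the reservation only if the full rate can be guaranteed."""
        if conn_id in self.reserved or rate_mbps <= 0:
            return False
        if rate_mbps > self.available():
            return False  # rejected: the caller may fall back or retry later
        self.reserved[conn_id] = rate_mbps
        return True

    def release(self, conn_id):
        """Return the bandwidth to the pool when the transfer is done."""
        return self.reserved.pop(conn_id, None) is not None


if __name__ == "__main__":
    link = BandwidthManager(1000)
    print(link.request("a", 600))  # accepted
    print(link.request("b", 600))  # rejected: only 400 Mbps left
    link.release("a")
    print(link.request("b", 600))  # accepted after the release
```

In a multi-switch network each switch on the path would run such a check for its own links, which is the sense in which the management is distributed.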
CHEETAH (Circuit-Switched High-speed End-to-End Transport Architecture) • Falls in the second category • deploying a new network • connection-oriented flavor of sharing • Meets TSI needs • High-bandwidth connections for file transfers • End-to-end QoS for remote visualization • Cons • High cost
CHEETAH topology & equipment [Figure: CHEETAH testbed topology. Labels: hosts and Ethernet switches in enterprise networks (NCSU, G. Tech) attached via GbE; Sycamore SN16000 switches at the Raleigh PoP (MCNC), the ORNL PoP, and the Atlanta PoP (SoX/SLR), linked by OC192 (NLR, SLR); link to UltraScience Net; 5 GbEs. SN16000 callouts: control, with bandwidth manager (dynamic distributed sharing); Gb/s and 10Gb/s Ethernet interface cards (maps GbE to equivalent SONET circuit); time-division multiplexing optical interface card]
CHEETAH concept • Use off-the-shelf circuit-based gateways • that support GMPLS routing and signaling protocols for dynamic circuit setup/release • enables the creation of large-scale shared CO networks • It is not a standalone network • Leverages the presence of connectionless IP service (host-to-host IP connectivity; DNS) • Implement cheetah software to run on end hosts • Integrate with host applications • applications generate requests for bandwidth as needed • SHORT-LIVED: increase sharing • Hold circuit for a few seconds/minutes and release
Cheetah solution leveraging the presence of the Internet • Use second NICs at hosts for circuit connectivity, leaving the primary NIC for Internet access • Two paths available between End host I and End host II: the connectionless Internet and the circuit-switched network • Should we attempt a circuit setup for ALL file transfers? • Attempt circuit setup • If rejected, fall back to using TCP/IP • Or is there a crossover file size below which we use the TCP/IP network and above which we attempt a circuit setup?
Two metrics: delay and utilization • For most regions of operation on 1Gbps circuits • in wide-area scenarios (50ms prop. delay) • delay: crossover size is ~10KB • utilization: crossover size is ~50MB • in local-area scenarios (1ms prop. delay) • delay: crossover size is a few MB to tens of MB • utilization: crossover size is a few MB
Cheetah software on end hosts • DNS query (to check if far end host is also on cheetah) • Routing decision: whether to use the TCP/IP path or attempt a cheetah circuit setup • Signaling client to request a circuit • Fixed-Rate Transport Protocol (FRTP), designed for circuits
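The crossover idea can be expressed as a small routing decision: small files go straight to TCP/IP, larger files attempt a circuit and fall back to TCP/IP if the setup is rejected. This is a hedged sketch, not the CHEETAH end-host code; `choose_path` and the `try_circuit_setup` callback are hypothetical names, and the crossover value is left to the caller.

```python
def choose_path(file_size_bytes, crossover_bytes, try_circuit_setup):
    """Return "tcp/ip" or "circuit" for one file transfer.

    try_circuit_setup is a caller-supplied function (hypothetical) that
    returns True if the network accepted the circuit request.
    """
    if file_size_bytes < crossover_bytes:
        return "tcp/ip"       # below the crossover: not worth a setup attempt
    if try_circuit_setup():
        return "circuit"      # reservation granted
    return "tcp/ip"           # setup rejected: fall back to the Internet path


if __name__ == "__main__":
    crossover = 50 * 10**6    # e.g. the ~50MB wide-area utilization crossover
    print(choose_path(10 * 10**3, crossover, lambda: True))
    print(choose_path(10**9, crossover, lambda: True))
    print(choose_path(10**9, crossover, lambda: False))
```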
Transport protocol problem • Variability in sender: • other processes (e.g. matlab) + disk access (disk head location) • Variability in receiver: if buffer not emptied out, data loss occurs [Figure: two end hosts connected by the circuit-switched network; on each host, the file transfer and Matlab processes run in user space above the filesystem and network protocols in the kernel, above the network card]
Effects of mismatch in nature of circuits and nature of hosts • If a high circuit rate is chosen, the receive buffer can fill up when the circuit rate is not matched to the receive rate • impacts delay + utilization • If a low circuit rate is chosen, delay will be higher than necessary • If the sending rate is not matched exactly with the circuit rate • circuit lies idle; utilization impacted
Transport protocol for end-to-end dedicated circuits • Requirements & solution: • No contention for bandwidth resources in network during user data flow (bandwidth already reserved) • No congestion control • Contention at end-hosts due to multitasking • Flow control: null or window based • Reliable transfer: error control • Detect/recover from drops in receiver buffer • High circuit utilization • Keep sending rate fixed to match circuit rate • Hence the name Fixed-Rate Transport Protocol (FRTP) • Receive rate selection important • Disk-to-disk transfer • FRTP module is handed a file descriptor instead of a buffer location in main memory
FRTP Implementation I • Null flow control • data blocks can get dropped at receiver • disk access variability and multitasking • recover through retransmissions • Implementation in user-space • Opens UDP and TCP sockets • UDP data channel on unidirectional dedicated circuit • TCP control channel on primary Internet path • Modified SABUL code
FRTP Implementation I (cont.) • SABUL implementation: • Uses busy-wait to maintain fixed low inter-packet times (to achieve fixed sending rate) • Drawback: high CPU utilization • Modified to: • send a burst of packets periodically • set a periodic timer; when process gets a signal indicating timer expiry, send a batch of packets • use data link layer flow control (i.e., Ethernet PAUSE) to prevent bursts
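The trade-off between busy-wait pacing and periodic bursts comes down to simple arithmetic: the inter-packet gap needed for a fixed rate, and the number of packets to send per timer expiry. The sketch below assumes a fixed packet size and a hypothetical timer period; the function names are illustrative and are not taken from SABUL or FRTP.

```python
import math


def inter_packet_gap_s(rate_bps, packet_bytes):
    """Gap between packet starts that a busy-wait sender must maintain."""
    return packet_bytes * 8 / rate_bps


def packets_per_burst(rate_bps, packet_bytes, timer_period_s):
    """Packets to send at each timer expiry to approximate the same rate."""
    return math.floor(rate_bps * timer_period_s / (packet_bytes * 8))


if __name__ == "__main__":
    # 1 Gbps circuit, 1500-byte packets, assumed 1 ms timer period.
    print(inter_packet_gap_s(10**9, 1500))        # seconds between packets
    print(packets_per_burst(10**9, 1500, 0.001))  # packets per burst
```

A gap of about 12 microseconds is far below typical timer granularity, which is why the busy-wait version burns CPU and the burst version needs link-layer flow control to smooth its output.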
FRTP Implementation II • Window-based flow control • prevents data blocks from being dropped at receiver • due to disk access variability and multitasking • Implementation in kernel-space • Uses Web100/Net100 code
Modifications • Web100/Net100 • Implement TCP • Added hooks to tune parameters at run-time • FRTP usage of this code/modifications • Used tuning capability to set: • initial ssthresh to the Bandwidth Delay Product • using fixed circuit rate for bandwidth • Made code modifications to set: • additive increase (AI), multiplicative decrease (MD) factors to 0 (sending rate will not change in congestion avoidance) • Sending rate increases to the circuit rate and stays there • Added advantage: • TCP’s self-clocking is a pretty good way to maintain fixed sending rate
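The initial ssthresh setting mentioned above is the bandwidth-delay product of the circuit. A minimal worked calculation follows; the 100 ms round-trip time (twice a 50 ms propagation delay) and the 1460-byte segment size are assumptions for the example, and the function names are hypothetical.

```python
def bdp_bytes(circuit_rate_bps, rtt_ms):
    """Bandwidth-delay product in bytes, using the fixed circuit rate."""
    return circuit_rate_bps * rtt_ms // 8000  # bits/s * ms -> bytes


def bdp_segments(circuit_rate_bps, rtt_ms, mss_bytes=1460):
    """Same quantity rounded up to whole segments."""
    return -(-bdp_bytes(circuit_rate_bps, rtt_ms) // mss_bytes)


if __name__ == "__main__":
    print(bdp_bytes(10**9, 100))     # 1 Gbps circuit, assumed 100 ms RTT
    print(bdp_segments(10**9, 100))
```

With the window held at this value and the increase/decrease factors set to 0, the sender keeps one bandwidth-delay product of data in flight, which corresponds to sending at the circuit rate.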
Disk-to-disk transfer requirement • Sender side actions: • read() system call: • move a block from disk to user space memory • send() system call: • write the block to network socket • sendfile(): reduces # of copies and # of system calls • Receive side actions: • open the file with the O_LARGEFILE flag • calibrate disk write rate limits • select file system (xfs, pvfs) • if multitasking receiver, use RT schedulers to schedule disk write thread to match circuit rate
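The two sender-side options can be contrasted in a few lines. The sketch below uses Python's standard `socket.sendfile()`, which uses the operating system's sendfile call where available and otherwise falls back to ordinary sends; the helper names and the block size are assumptions, and this is not the FRTP sender.

```python
import socket


def send_with_read_loop(sock, path, block_size=64 * 1024):
    """read() + send(): every block is copied into user space first."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            sock.sendall(block)
            sent += len(block)
    return sent


def send_with_sendfile(sock, path):
    """sendfile(): hand the open file to the socket layer in one call."""
    with open(path, "rb") as f:
        return sock.sendfile(f)
```

Both helpers expect a connected, blocking stream socket and return the number of bytes sent.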
GridFTP application • Disk considerations • Hardware solution: RAID striping • expensive solution • Split large file into small files and store small files on disks of different hosts in a cluster • not user-friendly • GridFTP striping with PVFS2 - striping across disks of different hosts of a cluster • best solution, but both GridFTP and PVFS2 code need modifications to use on dedicated circuits
Hardware solution • Equip host with a fast CPU, a RAID controller and disk array, and a 10Gbps NIC [Figure: RAID 0 (striping) with 5 disks; data blocks 1 to 10 distributed across the disks]
File splitting [Figure: file splitting between ORNL and NCSU]
Implementation of file splitting • Use GridFTP partial transfer feature • But disk space allocated on each host needs to equal the whole file size • Without GridFTP partial transfer • Wrote C programs to partition and assemble the file • Use any file transfer tool to transfer the partitions (which are distinct files) in parallel • Tools with third party transfer utility desirable
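The partition-and-assemble step can be sketched as below. The slide's programs were written in C; this Python version is only an illustration of the idea, with hypothetical function names and a `.partN` naming scheme chosen for the example.

```python
import os
import shutil


def split_file(path, num_parts):
    """Write num_parts consecutive pieces of path; return their paths."""
    size = os.path.getsize(path)
    part_size = -(-size // num_parts)  # ceiling division
    part_paths = []
    with open(path, "rb") as src:
        for i in range(num_parts):
            part_path = f"{path}.part{i}"
            with open(part_path, "wb") as dst:
                # Sketch only: a tool for terabyte files would copy in chunks.
                dst.write(src.read(part_size))
            part_paths.append(part_path)
    return part_paths


def assemble_file(part_paths, out_path):
    """Concatenate the parts, in order, into out_path."""
    with open(out_path, "wb") as dst:
        for part_path in part_paths:
            with open(part_path, "rb") as src:
                shutil.copyfileobj(src, dst)
```

Each part is an ordinary file, so any transfer tool can move the parts in parallel to different hosts before reassembly.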
GridFTP striped transfer over PVFS2 • PVFS2 (Parallel Virtual File System) • Three kinds of roles for nodes in PVFS2 • Compute node/client: on which applications are run • Metadata server: handles metadata operations • I/O node/server: stores file data for PVFS2 file systems • Stripes a file across multiple servers like RAID0 • But • The latest version 1.0.1 does not provide any specific utility to inspect data distribution • The pvfs2-cp tool ignores the -s option for configuring stripe size
Our work on PVFS2 • Determined how PVFS2 stripes files across hosts • Using the strace command, which provides a trace of system calls • Analyzed how the file is striped • Utility pvfs2-fs-dump gives the order of I/O servers used for file distribution (order obtained from config. file) • Change pvfs2 code: • PVFS2 stripes files starting with a random server (done with the PINT_cached_config_get_next_io() function call in file src/common/misc/pint-cached-config.c): jitter = (rand() % num_io_servers); • Change it into jitter = -1 to get a fixed order of data distribution • Change the default stripe size (original: 64KBytes)
• But the current GridFTP does not work in this ideal way. The data channel connections between the sending and receiving sides are arbitrary because the processing of SPAS and SPOR commands is nondeterministic. • Does not match with the dedicated circuit model • Code being modified [Figure: globus-url-copy holds control connections to both sides. One side, in Mode E, receives SPAS (Listen), which returns a list of host:port pairs, then STOR <FileName>; the other side, in Mode E, receives SPOR (Connect), which connects to the host-port pairs, then RETR <FileName>. Data channels between Hosts X1, X2 and Hosts A1, A2 carry Blocks 1, 2, 4, 5]
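Why a fixed starting server matters can be seen from a RAID0-style mapping of file offsets to I/O servers. The function below is an illustrative round-robin model, not PVFS2's distribution code: `locate_stripe` and its `start_server` parameter are hypothetical, with `start_server` standing in for the starting-server choice described above.

```python
def locate_stripe(offset, stripe_size, num_servers, start_server=0):
    """Map a file byte offset to (server index, offset in that server's data).

    Assumes plain round-robin striping beginning at start_server.
    """
    stripe_index = offset // stripe_size
    server = (start_server + stripe_index) % num_servers
    local_offset = (stripe_index // num_servers) * stripe_size + offset % stripe_size
    return server, local_offset


if __name__ == "__main__":
    stripe = 64 * 1024  # the original default stripe size from the slide
    for off in (0, stripe, 4 * stripe + 10):
        print(off, locate_stripe(off, stripe, num_servers=4))
```

With a random `start_server` the same offset lands on a different host for each file, so a sender cannot know in advance which circuit endpoint holds which block; a fixed start makes the layout predictable.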
Security: control plane • Cannot have arbitrary end hosts send bandwidth request messages to network switch • Place VPN server/firewall in front of SN16000’s control port • provides some DDoS attack protection (ns5) • Establish IPsec tunnels between each host and SN16000 through VPN server • Openswan software at Linux end hosts • Establish IPsec tunnels between SN16000s
Security: data plane • GridFTP security • Between client and server • Can use ssh, ssl, ipsec between end hosts connected on cheetah circuit
Back to outline • Outline • Background • CHEETAH • Testbed • Software • Applications • Security • Extension of work to connection-oriented internet • Network research problems
Networking community’s answers to eScience needs: three answers • Use existing connectionless networks with improved TCP • Deploy new connection-oriented networks • (e.g., cheetah) • Enable connection-oriented service in already deployed switches • MPLS • VLANs • Spend money upgrading links
Extend CHEETAH concept and apps to connection-oriented internet • On Internet2/ESnet: deployed IP routers (Cisco and Juniper) have: • MPLS capability (with RSVP-TE) • Connection-oriented service • Can map packets from one flow (five-tuple) to a reserved MPLS tunnel • Within LANs: • Ethernet switches have IEEE 802.1q VLAN capability • Ingress rate shaping • With external control software, can make these switches operate in connection-oriented mode
Many advantages to this approach • Already deployed (“just” enable!) • Bandwidth granularity can be low • improves bandwidth utilization • Allows for sharing of link bandwidth between CL and CO traffic
CO “internet” • Because heterogeneous connections will be needed at least in the short-term • MPLS segments (Label Switched Paths) • VLAN segments (within enterprises) • SONET circuits (popular in commercial world) • WDM lightpaths (research testbeds)
Back to outline • Outline • Background • CHEETAH • Testbed • Software • Applications • Security • Extension of work to connection-oriented internet • Network research problems
Networking research problems • Bandwidth sharing modes • Low load performance • Scheduled vs. immediate-request • Multi-level problem • Partial-path reservations • Fairness
Fixing the bandwidth for the transfer could be a bad thing: low load problem • Circuit switch (capacity C): each of the N transfers is allocated C/N capacity; the lone remaining transfer continues with capacity allocation C/N • Packet switch (capacity C): each of the N transfers gets C/N capacity; the lone remaining transfer enjoys full capacity C • Varying bandwidth list scheduling algorithm • uses knowledge of file size to make varying bandwidth allocations for transfer • catch: requires circuit switches to be reprogrammed multiple times within lifetime of a transfer (circuit) [Figure: transfers 1 to N sharing a link of capacity C through a circuit switch and through a packet switch]
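The low load problem can be made concrete by comparing completion times under the two sharing modes. The sketch below is an idealized model: the packet-switch case is treated as perfectly equal sharing among the transfers still active, all transfers start together, and the function names are hypothetical.

```python
def fixed_share_times(sizes, capacity):
    """Circuit case: each of N transfers keeps C/N for its whole lifetime."""
    n = len(sizes)
    return [size * n / capacity for size in sizes]


def equal_share_times(sizes, capacity):
    """Idealized packet case: capacity is re-divided as transfers finish."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    times = [0.0] * len(sizes)
    now = 0.0
    served = 0.0            # amount every active transfer has sent so far
    active = len(sizes)
    for i in order:
        now += (sizes[i] - served) * active / capacity
        served = sizes[i]
        times[i] = now
        active -= 1
    return times


if __name__ == "__main__":
    sizes = [1, 1, 4]       # e.g. gigabits, on a link of 1 gigabit/s
    print(fixed_share_times(sizes, 1))
    print(equal_share_times(sizes, 1))
```

In this example the largest transfer finishes in half the time once the others are gone, which is the gain a varying-bandwidth schedule tries to recover on circuits.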
Scheduled vs. immediate-request calls • Session type requests: • long holding times (2 hours) • specific rate • remote visualizations • scientists participate in sessions • best served with an advance reservation • File transfer requests: • file sizes provided not holding times • max rate specified but any rate can be allocated • scientists not involved; just computers • Large files (e.g. 1 TB on 1 Gbps takes 2.2 hours) • should be handled in scheduled mode • should we allocate 10Gbps and finish in 800 sec? • immediate-request? or scheduled? • depends on m, the number of 10Gbps circuits • Small files (e.g. 1 GB on 1 Gbps takes 8 sec) • should be handled in immediate-request mode
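The holding times quoted above follow directly from file size and circuit rate. A short check, assuming decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits/s) and ignoring protocol overhead; `transfer_seconds` is a hypothetical helper name.

```python
def transfer_seconds(size_bytes, rate_bps):
    """Ideal transfer time for a file on a dedicated circuit."""
    return size_bytes * 8 / rate_bps


if __name__ == "__main__":
    TB, GB = 10**12, 10**9
    print(transfer_seconds(TB, 10**9) / 3600)  # 1 TB at 1 Gbps, in hours
    print(transfer_seconds(TB, 10**10))        # 1 TB at 10 Gbps, in seconds
    print(transfer_seconds(GB, 10**9))         # 1 GB at 1 Gbps, in seconds
```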
Multi-level problem • A new problem: not yes/no but how much? • Real-time (interactive) audio-video applications generate data at a certain rate (constant or variable) • implication: application requests the required bandwidth from the network, and answer is binary (accept or reject); multiple classes • File transfers: “any” bandwidth that the network can provide could be acceptable • implication: application requests a MAX bandwidth, but the answer can be multi-level
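The difference between a binary and a multi-level answer can be sketched with two admission functions. This is an illustration only: the function names and the idea of a fixed list of allowed rate `levels` are assumptions, not part of any stated CHEETAH algorithm.

```python
def admit_binary(requested_rate, available):
    """Fixed-rate request: grant exactly what was asked, or nothing."""
    return requested_rate if requested_rate <= available else 0


def admit_multilevel(max_rate, available, levels):
    """File-transfer request: grant the largest allowed level that fits.

    levels is a hypothetical list of rates the network can allocate.
    Returns 0 if no level fits.
    """
    feasible = [r for r in levels if r <= max_rate and r <= available]
    return max(feasible) if feasible else 0


if __name__ == "__main__":
    levels = [100, 250, 500, 1000]                 # Mbps, assumed
    print(admit_binary(1000, available=600))       # rejected
    print(admit_multilevel(1000, 600, levels))     # partial grant
```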
Partial-path reservations • Peel off bandwidth (partial-path reservation) • Put back for CL traffic use when done • Set up MPLS tunnels dynamically for individual flows on bottleneck links • Typically the access link is the bottleneck [Figure: enterprises (GbE inside enterprises) connected to a wide-area network (Abilene, ESnet backbone; 10Gbps in WANs)]
Fairness • Call admission algorithms • Use Markov Decision Process (MDP) tools to balance fairness and overall throughput • Long-path and short-path calls • Large files (high-BW; high holding time) and short files (low-BW; low holding time) calls • Multi-level answer rather than binary accept/reject • CO traffic vs. CL traffic • Both with Fixed bandwidth and Varying bandwidth
Conclusions • End-to-end dedicated connections appear to be the right answer for many eScience applications • But, many networking problems need to be solved to achieve cost reduction through scaling • Utilization concerns: bandwidth sharing + FRTP • Specific concerns of TSI: TB file handling • PVFS2 and GridFTP • Web site: http://cheetah.cs.virginia.edu