Building Network-Centric Systems Liviu Iftode
Before WWW, people were happy... • Mostly local computing • Occasional TCP/IP networking with low expectations and mostly non-interactive traffic • Local area networks: file server (NFS) • Wide area networks (Internet): E-mail, Telnet, FTP • Networking was not a major concern for the OS [Figure: CS.umd.EDU and CS.rutgers.EDU connected over TCP/IP; labels: E-mail, Telnet, Emacs, NFS]
One Exception: Cluster Computing Multicomputers Clusters of computers • Cost-effective solution for high-performance distributed computing • TCP/IP networking was the headache • large software overheads • Software DSM not a network-centric system :-(
The Great WWW Challenge Web Browsing http://www.Bank.com • World Wide Web made access over the Internet easy • Internet became commercial • Dramatic increase of interactive traffic • WWW networking creates a network-centric system: Internet server • performance: service more network clients • availability: be accessible all the time over the network • security: protect resources against network attacks TCP/IP Bank.com
Network-Centric Systems: networking dominates the operating system • Mobile Systems • mobility-aware TCP/IP (Mobile IP, I-TCP, etc.), disconnected file systems (Coda), adaptation-aware applications for mobility (Odyssey), etc. • Internet Servers • resource allocation (Lazy Receive Processing, Resource Containers), OS shortcuts (Scout, IO-Lite), etc. • Pervasive/Ubiquitous Systems • TinyOS, sensor networks (Directed Diffusion, etc.), programmability (One.world, etc.) • Storage Networking • network-attached storage (NASD, etc.), peer-to-peer systems (OceanStore, etc.), secure file systems (SFS, Farsite), etc.
Big Picture • Research sparked by various OS-Networking tensions • Shift of focus from Performance to Availability and Manageability • Networking and Storage I/O Convergence • Server-based and serverless systems • TCP/IP and non-TCP/IP protocols • Local area, wide-area, ad-hoc and application/overlay networks • Significant interest from industry
Outline • TCP Servers • Migratory-TCP and Service Continuations • Cooperative Computing, Smart Messages and Spatial Programming • Federated File Systems • Talk Highlights and Conclusions
Problem 1: TCP/IP is too Expensive Breakdown of the CPU time for Apache (uniprocessor based Web-server)
Traditional Send/Receive Communication • Sender (App, OS, NIC): send(a); copy(a,send_buf); DMA(send_buf,NIC) • send_buf is transferred • Receiver (NIC, OS, App): interrupt; DMA(NIC,recv_buf); copy(recv_buf,b); receive(b)
Multiprocessor Server Performance Does not Scale [Figure: throughput (requests/s) versus offered load (connections/s), uniprocessor and dual processor] Apache Web server 1.3.20 on 1-way and 2-way 300 MHz Pentium II SMP, repeatedly accessing a static 16 KB file
TCP/IP-Application Co-Habitation • TCP/IP “steals” compute cycles and memory from applications • TCP/IP executes in kernel-mode: mode switching overhead • TCP/IP executes asynchronously • interrupt processing overhead • internal synchronization on multiprocessor servers causes execution serialization • Cache pollution • Hidden “Service-work” • TCP packet retransmission • TCP ACK processing • ARP request service • Extreme cases can compromise server performance • Receive livelocks • Denial-of-service (DoS) attacks
Two Solutions • Replace TCP/IP with a lightweight transport protocol • Offload some or all of the TCP processing from the host to a dedicated computing unit (processor, computer or “intelligent” network interface) • Industry: high-performance, expensive solutions • Memory-to-Memory Communication: InfiniBand • “Intelligent” network interface: TCP Offloading Engine (TOE) • Cost-effective and flexible solutions: TCP Servers
Memory-to-Memory (M-M) Communication [Figure: sender and receiver, each with Application, OS and Network Interface (NIC); the Send/Receive path goes through TCP/IP in the OS, the M-M path uses Remote DMA between memory buffers]
Memory-to-Memory Communication is Non-Intrusive • Path: App, NIC, NIC, App • RDMA_Write(a,b): a is transferred into b; b is updated • Sender: low overhead • Receiver: zero overhead
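The contrast between the two communication paths can be made concrete with a small Python sketch. It is a toy model, not a real NIC or RDMA API: the `Host` class, its counters and both function names are assumptions made for illustration, and every DMA transfer is modelled as a plain assignment that costs the host CPU nothing.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Toy host that counts the communication work done by its CPU."""
    memory: dict = field(default_factory=dict)
    cpu_copies: int = 0   # buffer copies performed by the host CPU
    interrupts: int = 0   # network interrupts taken by the host CPU


def send_receive(sender: Host, receiver: Host, src: str, dst: str) -> None:
    """Traditional path: one CPU copy on each side plus a receive interrupt."""
    send_buf = bytes(sender.memory[src])      # copy(a, send_buf)
    sender.cpu_copies += 1
    recv_buf = send_buf                       # DMA to NIC, wire, DMA to recv_buf
    receiver.interrupts += 1                  # the NIC interrupts the receiving CPU
    receiver.memory[dst] = bytes(recv_buf)    # copy(recv_buf, b)
    receiver.cpu_copies += 1


def rdma_write(sender: Host, receiver: Host, src: str, dst: str) -> None:
    """M-M path: the NICs move a into b; neither CPU copies any data."""
    receiver.memory[dst] = sender.memory[src]  # modelled as a NIC-to-NIC transfer


s1, r1 = Host(memory={"a": b"payload"}), Host()
send_receive(s1, r1, "a", "b")
s2, r2 = Host(memory={"a": b"payload"}), Host()
rdma_write(s2, r2, "a", "b")
print(r1.cpu_copies, r1.interrupts, r2.cpu_copies, r2.interrupts)  # prints: 1 1 0 0
```

The model ignores the small cost of posting the RDMA request on the sender, which is why the slide says "low" rather than "zero" overhead on that side.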
TCP Server at a Glance • A software offloading architecture using existing hardware • Basic idea: Dedicate one or more computing units exclusively to TCP/IP • Compared to TOE • tracks technology better: latest processors • flexible: adapts to changing load conditions • cost-effective: no extra hardware • Isolate application computation from network processing • Eliminate network interrupts and context switches • Efficient resource allocation • Additional performance gains (zero-copy) with extended socket API • Related work • Very preliminary offloading solutions: Piglet, CSP • Sockets Direct Protocol, Zero-copy TCP
Two TCP Server Architectures • TCP Servers for Multiprocessor Servers: the TCP-Server (TCP/IP) and the server application run on separate CPUs and communicate through shared memory • TCP Servers for Cluster-based Servers: the TCP-Server (TCP/IP) and the server application communicate over M-M
Where to Split TCP/IP Processing? (How much to offload?) • Application Processors: APPLICATION, SYSTEM CALLS • SEND path, top to bottom: copy_from_application_buffers, TCP_send, IP_send, packet_scheduler, setup_DMA, packet_out • RECEIVE path, top to bottom: copy_to_application_buffers, TCP_receive, IP_receive, software_interrupt_handler, interrupt_handler, packet_in • TCP Servers: run the part of both paths below the chosen split point
Evaluation Testbed • Multiprocessor Server • 4-Way 550MHz Intel Pentium II system running Apache 1.3.20 web server on Linux 2.4.9 • NIC : 3-Com 996-BT Gigabit Ethernet • Used sclients as a client program [Banga 97]
Comparative Throughput Clients issue file requests according to a web server trace
Adaptive TCP Servers • Static TCP Server configuration • Too few TCP Servers can lead to network processing becoming the bottleneck • Too many TCP Servers lead to degradation in performance of CPU intensive applications • Dynamic TCP Server configuration • Monitor the TCP Server queue lengths and system load • Dynamically add or remove TCP Server processors
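A dynamic configuration policy of this kind could look like the Python sketch below. It is only an illustration under stated assumptions: the function name, the threshold values and the rule of always leaving at least one processor on each side are hypothetical and are not taken from the TCP Servers implementation.

```python
def adjust_tcp_servers(n_tcp, n_cpus, queue_len, app_load,
                       high_queue=64, low_queue=8, high_load=0.9):
    """Return the new number of processors dedicated to TCP/IP.

    n_tcp      current number of TCP Server processors
    n_cpus     total processors in the machine
    queue_len  monitored TCP Server queue length (pending requests)
    app_load   utilization of the application processors, 0.0 to 1.0
    The three thresholds are made-up values for this sketch.
    """
    # Network processing is the bottleneck: add a TCP Server processor,
    # but always leave at least one processor to the application.
    if queue_len > high_queue and n_tcp < n_cpus - 1:
        return n_tcp + 1
    # TCP Servers are idle while the application is CPU-bound: give one back.
    if queue_len < low_queue and app_load > high_load and n_tcp > 1:
        return n_tcp - 1
    return n_tcp


# One monitoring step on a 4-way server with a long TCP Server queue.
print(adjust_tcp_servers(n_tcp=1, n_cpus=4, queue_len=100, app_load=0.5))  # prints: 2
```

Using two separate queue thresholds rather than one gives the controller some hysteresis, so it does not add and remove a processor on every monitoring step.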
Next Target: Storage Networking • Storage Networking dilemma • non-TCP/IP solutions require new wiring or tunneling over IP-based Ethernet networks • TCP/IP solutions require TCP offloading • TCP or not TCP? • M-M Communication (InfiniBand) • TCP Offloading • iSCSI (SCSI over IP) • DAFS (Direct Access File System)
Future Work: TCP Servers & iSCSI • Use TCP-Servers to connect to SCSI storage using the iSCSI protocol over TCP/IP networks [Figure: Server Appl and TCP-Server & iSCSI on separate CPUs with shared memory; iSCSI over TCP/IP to SCSI storage]
Problem 2: TCP/IP is too Rigid • Server vs. Service Availability • client interested in Service availability • Adverse conditions may affect service availability • internetwork congestion or failure • servers overloaded, failed or under DoS attack • TCP has one response • network delays => packet loss => retransmission • TCP limits the OS solutions for service availability • early binding of service to a server • client cannot switch to another server for sustained service after the connection is established
Service Availability through Migration Server 1 Client Server 2
Migratory TCP at a Glance • Migratory TCP migrates live connections among cooperative servers • Migration mechanism is generic (not application-specific), lightweight (fine-grained migration) and low-latency • Migration triggered by client or server • Servers can be geographically distributed (different IP addresses) • Requires changes to the server application • Totally transparent to the client application • Interoperates with existing TCP • Migration policies decoupled from migration mechanism
Basic Idea: Fine-Grained State Migration [Figure: Server1 process serving connections C1 to C6; the application state and connection state of connection C2 move to the Server2 process, which serves the client]
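Migratory-TCP (Lazy) Protocol • (0) Client connects to Server 1 • (1) Client sends Migration Request to Server 2 (connection C’) • (2) Server 2 sends State Request to Server 1 • (3) Server 1 returns State Reply • (4) Server 2 sends Migration Accept to the client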
Non-Intrusive Migration • Migrate state without involving old-server application (only old server OS) • Old server exports per-connection state periodically • Connection state and Application state can go out of sync • Upon migration, new server imports the last exported state of the migrated connection • OS uses connection state to synchronize with application • Non-intrusive migration with M-M communication • uses RDMA read to extract state from the old server with zero-overhead • works even when the old server is overloaded or frozen
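The interplay between a stale exported snapshot and the live connection state can be sketched in Python. This is a toy model under loud assumptions: the service streams a fixed byte sequence, the exported application state is just a stream offset, and the connection state is the number of bytes the client has acknowledged. The class and function names are hypothetical and do not come from the M-TCP code.

```python
DATA = bytes(range(256)) * 4  # stand-in for the content being streamed


class OldServer:
    def __init__(self):
        self.exported = {}  # conn_id -> last exported snapshot

    def export(self, conn_id, app_offset):
        """Periodic export: remember where the application was in the stream."""
        self.exported[conn_id] = {"app_offset": app_offset}

    def read_state(self, conn_id):
        """What the new server extracts (for example via RDMA read).

        Only the exported copy is touched; the application is not involved.
        """
        return dict(self.exported[conn_id])


def resume_on_new_server(snapshot, client_acked, nbytes):
    """Restart the application at the snapshot, then drop duplicate bytes."""
    start = snapshot["app_offset"]
    regenerated = DATA[start:client_acked + nbytes]   # application output since snapshot
    already_delivered = client_acked - start          # how far the snapshot lags behind
    return regenerated[already_delivered:]            # only bytes the client still needs


old = OldServer()
old.export("C2", app_offset=100)   # last export before the server degrades
client_received = DATA[:150]       # the old server kept sending after that export
new_bytes = resume_on_new_server(old.read_state("C2"), client_acked=150, nbytes=50)
client_received += new_bytes       # the client sees one continuous stream
```

The 50 bytes between offset 100 and offset 150 are regenerated by the restarted application and discarded by the new server, which is one way to read "OS uses connection state to synchronize with application".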
Service Continuation (SC) • SC = connection state + pipe state + exported state of every process in the service • SC API, Front-End Server Process (client socket, pipes to the back ends): sc = create_cont(C1); p1 = pipe(); associate(sc,p1); fork_exec(Process1); …; export(sc,state) • Back-End Server Process1: sc = open_cont(p1); …; export(sc,state) • Back-End Server Process2: sc = open_cont(p2); …; export(sc,state)
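How the SC calls on this slide fit together can be illustrated with an in-memory Python toy. Only the call names create_cont, associate, open_cont and export come from the slide; the `owner` argument, the `Handle` class, the pipe identifiers and the example state values are assumptions added so the sketch can run without a kernel.

```python
class ServiceContinuation:
    """Shared record for one client connection and its cooperating processes."""
    def __init__(self, conn):
        self.conn = conn
        self.pipes = []      # pipes associated with this continuation
        self.exported = {}   # process name -> last exported state


class Handle:
    """Per-process view of a shared SC (hypothetical helper)."""
    def __init__(self, sc, owner):
        self.sc, self.owner = sc, owner


_by_pipe = {}  # pipe id -> ServiceContinuation, stands in for kernel bookkeeping


def create_cont(conn, owner="front-end"):
    return Handle(ServiceContinuation(conn), owner)


def associate(handle, pipe_id):
    handle.sc.pipes.append(pipe_id)
    _by_pipe[pipe_id] = handle.sc


def open_cont(pipe_id, owner):
    return Handle(_by_pipe[pipe_id], owner)


def export(handle, state):
    # Each process overwrites only its own slot with its latest snapshot.
    handle.sc.exported[handle.owner] = dict(state)


front = create_cont("C1")
associate(front, "p1")
associate(front, "p2")
back1 = open_cont("p1", owner="back-end-1")
back2 = open_cont("p2", owner="back-end-2")
export(front, {"request": "GET /a"})
export(back1, {"row": 17})
export(back2, {"row": 4})
export(back1, {"row": 18})   # a later export replaces the earlier snapshot
```

After these calls the single continuation holds one snapshot per process, which is the unit that would move to another server together with the connection state.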
Related Work • Process migration: Sprite [Douglis ‘91], Locus [Walker ‘83], MOSIX [Barak ‘98], etc. • VM migration [Rosenblum ‘02, Nieh ‘02] • Migration in web server clusters [Snoeren ‘00, Luo ‘01] • Fault-tolerant TCP [Alvisi ‘00] • TCP extensions for host mobility: I-TCP [Bakre ‘95], Snoop TCP [Balakrishnan ‘95], end-to-end approaches [Snoeren ‘00], Msocks [Maltz ‘98] • SCTP (RFC 2960)
Evaluation • Implemented SC and M-TCP in FreeBSD kernel • Integrated SC in real Internet servers • web, media streaming, transactional DB • Microbenchmark • impact of migration on client perceived throughput for a two-process server using TTCP • Real applications • sustain web server throughput under load produced by increasing the number of client connections
Future Research: Use SC to Build Self-Healing Cluster-based Systems [Figure labels: SC2, SC3]
Problem 3: Computer Systems Move Outdoors • Examples: Linux car, sensors, Linux camera, Linux watch • Massive numbers of computers will be embedded everywhere in the physical world • Dynamic ad-hoc networking • How to execute user-defined applications over these networks?
Outdoor Distributed Computing • Traditional distributed computing has been indoor • Target: performance and/or fault tolerance • Stable configuration, robust networking (TCP/IP or M-M) • Relatively small scale • Functionally equivalent nodes • Message passing or shared memory programming • Outdoor Distributed Computing • Target: Collect/Disseminate distributed data and/or perform collective tasks • Volatile nodes and links • Node equivalence determined by their physical properties (content-based naming) • Data migration is not good • expensive to perform end-to-end transfer control • too rigid for such a dynamic network
Cooperative Computing at a Glance • Distributed computing with execution migration • Smart Message: carries the execution state (and possibly the code) in addition to the payload • execution state assumed to be small (explicit migration) • code usually cached (few applications) • Nodes “cooperate” by allowing Smart Messages • to execute on them • to use their memory to store “persistent” data (tags) • Nodes do not provide routing • Smart Message executes on each node of its path • Application executed on target nodes (nodes of interest) • Routing executed on each node of the path (self-routing) • During its lifetime, an application generates at least one, possibly multiple, smart messages
Smart vs. “Dumb” Messages • Example: Mary’s lunch (Appetizer, Entree, Dessert) • Execution migration versus data migration
SM Execution [Figure: Smart Messages moving across nodes, three of them Hot; node labels 0 1 1 1 2 2 3]
Routing:
  migrate(tag,timeout) {
    do {
      if (NextHot_tag) sys_migrate(NextHot_tag,timeout);
      else { spawn_SM(Route_Discovery,Hot); block_SM(NextHot_tag,timeout); }
    } until (Hot_tag or timeout);
  }
Application:
  do {
    migrate(Hot_tag,timeout);
    Water_tag = ON;
    N=N+1;
  } until (N==3 or timeout);
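The routing and application loops of this slide can be simulated in a few lines of Python. The sketch makes several assumptions that are not in the source: the network is a static graph, route discovery is a breadth-first search that leaves NextHot tags along the path, a hop budget stands in for the timeout, and the Smart Message carries the list of nodes it has already watered so it does not return to them. All function names are hypothetical.

```python
from collections import deque


def discover_route(graph, tags, start, watered):
    """Find the nearest Hot node not yet watered; leave NextHot tags on the path."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node != start and "Hot" in tags[node] and node not in watered:
            while parent[node] is not None:          # walk back towards start
                tags[parent[node]]["NextHot"] = node
                node = parent[node]
            return True
        for neighbor in graph[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return False


def migrate(graph, tags, node, watered, max_hops):
    """Routing part: follow NextHot tags hop by hop, discovering routes on demand."""
    for _ in range(max_hops):
        if "NextHot" not in tags[node] and not discover_route(graph, tags, node, watered):
            return None                               # no route: give up
        node = tags[node].pop("NextHot")              # one hop; the used tag is consumed
        if "Hot" in tags[node] and node not in watered:
            return node
    return None                                       # hop budget exhausted (timeout)


def water_hot_spots(graph, tags, start, wanted=3, max_hops=20):
    """Application part: water up to `wanted` Hot nodes."""
    node, watered = start, []
    while len(watered) < wanted:
        node = migrate(graph, tags, node, watered, max_hops)
        if node is None:
            break
        tags[node]["Water"] = "ON"
        watered.append(node)
    return watered


line = ["u", "a", "h1", "b", "h2", "h3", "c"]
graph = {n: [m for m in (line[i - 1] if i else None,
                         line[i + 1] if i + 1 < len(line) else None) if m]
         for i, n in enumerate(line)}
tags = {n: ({"Hot": True} if n.startswith("h") else {}) for n in line}
print(water_hot_spots(graph, tags, "u"))  # prints: ['h1', 'h2', 'h3']
```

The point of the sketch is that routing state lives in node tags while progress (the watered list, the counter N on the slide) travels inside the message.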
Cooperative Node Architecture [Figure: SM Arrival, Admission Manager, Scheduling, Virtual Machine, Tag Space, OS & I/O, SM Migration] • Admission control for resource security • Non-preemptive scheduling with timeout-kill • Tags created by SMs (limited lifetime) or I/O tags (permanent) • Global tag name space {hash(SM code), tag name} • Five protection domains defined using hash(SM code), SM source node ID, and SM starting time
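The {hash(SM code), tag name} naming scheme and tag lifetimes can be sketched in Python as follows. The choice of SHA-256, the `TagSpace` class and its methods, and the use of `lifetime=None` to stand for a permanent tag are assumptions for illustration; the five protection domains mentioned on the slide are not modelled.

```python
import hashlib


class TagSpace:
    """Toy per-node tag store keyed by (hash of SM code, tag name)."""

    def __init__(self):
        self._tags = {}  # (code hash, tag name) -> (value, expiry time or None)

    @staticmethod
    def _key(sm_code, name):
        # SHA-256 is an assumed choice; the slide only says hash(SM code).
        return (hashlib.sha256(sm_code).hexdigest(), name)

    def create(self, sm_code, name, value, now, lifetime=None):
        """Create or overwrite a tag; lifetime=None models a permanent tag."""
        expiry = None if lifetime is None else now + lifetime
        self._tags[self._key(sm_code, name)] = (value, expiry)

    def read(self, sm_code, name, now):
        """Return the tag value, or None if it is missing or has expired."""
        key = self._key(sm_code, name)
        entry = self._tags.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if expiry is not None and now >= expiry:
            del self._tags[key]   # expired tags are reclaimed lazily
            return None
        return value


space = TagSpace()
space.create(b"sprinkler code", "NextHot", "node-7", now=0, lifetime=10)
space.create(b"other app code", "NextHot", "node-2", now=0, lifetime=10)
print(space.read(b"sprinkler code", "NextHot", now=5))  # prints: node-7
```

Because the code hash is part of the key, two applications that pick the same tag name on the same node do not see each other's tags.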
Related Work • Mobile agents (D’Agents, Ajanta) • Active networks (ANTS, SNAP) • Sensor networks (Diffusion, TinyOS, TAG) • Pervasive computing (One.world)
Prototype Implementation • 8 HP iPAQs running Linux • 802.11 wireless communication • Sun Java K Virtual Machine • Geographic (simplified GPSR) and On-Demand (AODV) routing • Figure legend: user node, intermediate node, node of interest
Completion time:
Routing algorithm     Code not cached (ms)   Code cached (ms)
Geographic (GPSR)     415.6                  126.6
On-demand (AODV)      506.6                  314.7
Self-Routing • There is no best routing outdoors • Depends on application and node property dynamics • Application-controlled routing • Possible with Smart Messages (execution state carried in the message) • When migration times out, the application is upcalled on the current node to decide what to do next
Self-Routing Effectiveness (simulation) • geographical routing to reach target regions • on-demand routing within region • application decides when to switch between the two • Figure legend: starting node, node of interest, other node
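One way an application might make the switch between the two routing schemes is sketched below in Python. Everything here is an assumption made for illustration: regions are axis-aligned rectangles, geographic routing is plain greedy forwarding towards the region centre, and the function returns "upcall" when greedy forwarding cannot make progress so that the application can decide what to do next.

```python
import math


def in_region(pos, region):
    (xmin, ymin), (xmax, ymax) = region
    return xmin <= pos[0] <= xmax and ymin <= pos[1] <= ymax


def region_center(region):
    (xmin, ymin), (xmax, ymax) = region
    return ((xmin + xmax) / 2, (ymin + ymax) / 2)


def next_step(pos, neighbors, region):
    """Return (mode, next_hop) for a Smart Message currently at `pos`.

    mode is "on-demand" inside the target region, "geographic" while
    approaching it, and "upcall" when no neighbor is closer to the region.
    """
    if in_region(pos, region):
        return ("on-demand", None)
    target = region_center(region)
    best = min(neighbors, key=lambda n: math.dist(n, target), default=None)
    if best is None or math.dist(best, target) >= math.dist(pos, target):
        return ("upcall", None)
    return ("geographic", best)


region = ((10, 10), (20, 20))
print(next_step((0, 0), [(1, 1), (-1, 0)], region))  # prints: ('geographic', (1, 1))
```

Keeping this decision in application code, rather than in the node, is what the slides call self-routing.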
Next Target: Spatial Programming • Smart Messages: too low-level for programming • How to describe distributed computing over dynamic outdoor networks of embedded systems with limited knowledge about resource number, location, etc. • Spatial Programming (SP) design guidelines: • space is a first-order programming concept • resources named by their expected location and properties (spatial reference) • reference consistency: spatial reference-to-resource mappings are consistent throughout the program • program must tolerate resource dynamics • SP can be implemented using Smart Messages (the spatial reference mapping table carried as payload)
Spatial Programming Example • Mobile sprinklers with temperature sensors • Program sprinklers to water the hottest spot of the Left Hill • {Left_Hill:Hot}: spatial reference for hot spots on Left Hill [Figure: Left Hill, Right Hill, hot spot]
  for (i = 0; i < 10; i++)                    /* what if fewer than 10 hot spots? */
    if ({Left_Hill:Hot}[i].temp > Max_temp) {
      Max_temp = {Left_Hill:Hot}[i].temp;
      id = i;
    }
  {Left_Hill:Hot}[id].water = ON;             /* spatial reference consistency: [id] is the same spot as in the loop */
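The two annotations on this example, reference consistency and the case of fewer than 10 hot spots, can be illustrated with a Python sketch. It is not the Spatial Programming runtime: the `SpatialReference` class, the list-of-dicts world model and the sample temperatures are all assumptions chosen to keep the example small.

```python
class SpatialReference:
    """Toy model of a reference such as {Left_Hill:Hot}."""

    def __init__(self, world, space, prop):
        self.world, self.space, self.prop = world, space, prop
        self._bound = {}  # index -> resource; kept for reference consistency

    def __getitem__(self, i):
        if i not in self._bound:
            taken = list(self._bound.values())
            for r in self.world:
                if (r["space"] == self.space and self.prop in r["props"]
                        and all(r is not t for t in taken)):
                    self._bound[i] = r        # bind index i to this resource for good
                    break
            else:
                return None                   # fewer matching resources than asked for
        return self._bound[i]


def water_hottest(hot, max_spots=10):
    """Water the hottest of up to max_spots resources; return its index."""
    best = None
    for i in range(max_spots):
        spot = hot[i]
        if spot is None:                      # tolerate fewer than max_spots hot spots
            break
        if best is None or spot["temp"] > hot[best]["temp"]:
            best = i
    if best is not None:
        hot[best]["water"] = "ON"             # same index, same resource as in the loop
    return best


world = [
    {"space": "Left_Hill", "props": {"Hot"}, "temp": 40, "water": "OFF"},
    {"space": "Right_Hill", "props": {"Hot"}, "temp": 90, "water": "OFF"},
    {"space": "Left_Hill", "props": {"Hot"}, "temp": 55, "water": "OFF"},
    {"space": "Left_Hill", "props": set(), "temp": 20, "water": "OFF"},
    {"space": "Left_Hill", "props": {"Hot"}, "temp": 47, "water": "OFF"},
]
hot = SpatialReference(world, "Left_Hill", "Hot")
print(water_hottest(hot))  # prints: 1
```

Only three resources match the reference here, so the loop stops early instead of failing, and the index-to-resource table guarantees that the spot that gets watered is the one whose temperature was compared.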
Problem 4: Manageable Distributed File Systems • Most distributed file servers use TCP/IP both for client-server and intra-server communication • Strong file consistency, file locking and load balancing: difficult to provide • File servers require significant human effort to manage: add storage, move directories, etc • Cluster-based file servers are cost-effective • Scalable performance requires load balancing • Load balancing may require file migration • File migration limited if file naming is location-dependent • We need a scalable, location-independent and easy to manage cluster-based distributed file system
Federated File System at a Glance • Global file name space over a cluster of autonomous local file systems interconnected by an M-M network [Figure: applications A1, A2, A3 running on FedFS, layered over the Local FS of each node; nodes joined by the M-M Interconnect]
Location Independent Global File Naming • Virtual Directory (VD): union of local directories • volatile, created on demand (dirmerge) • contains information about files including location (homes of files) • assigned dynamically to nodes (managers) • supports location independent file naming and file migration • Directory Tables (DT): local caches of VD entries (~TLB) • Example: the virtual directory usr, containing file1 and file2, is the union of the local usr directories on Local file system 1 and Local file system 2
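The virtual directory and directory table ideas can be sketched in Python as below. This is a simplified illustration, not FedFS code: local file systems are modelled as nested dictionaries, `dirmerge` is written as a plain function, and the `DirectoryTable` class, its `merges` counter and the `migrate_file` helper are hypothetical names.

```python
def dirmerge(path, local_fs):
    """Build the virtual directory for `path`: file name -> home node."""
    vd = {}
    for node in sorted(local_fs):
        for name in sorted(local_fs[node].get(path, ())):
            vd.setdefault(name, node)
    return vd


class DirectoryTable:
    """Local cache of virtual directory entries, consulted before merging."""

    def __init__(self, local_fs):
        self.local_fs = local_fs
        self.cache = {}     # path -> virtual directory
        self.merges = 0     # how many times dirmerge had to run

    def home(self, path, name):
        if path not in self.cache:            # miss: create the VD on demand
            self.cache[path] = dirmerge(path, self.local_fs)
            self.merges += 1
        return self.cache[path].get(name)

    def invalidate(self, path):
        self.cache.pop(path, None)


def migrate_file(path, name, src, dst, local_fs, table):
    """Move a file between nodes; its global name (path, name) does not change."""
    local_fs[src][path].remove(name)
    local_fs[dst].setdefault(path, set()).add(name)
    table.invalidate(path)


local_fs = {"fs1": {"usr": {"file1"}}, "fs2": {"usr": {"file2"}}}
table = DirectoryTable(local_fs)
before = table.home("usr", "file1")
migrate_file("usr", "file1", "fs1", "fs2", local_fs, table)
after = table.home("usr", "file1")
print(before, after)  # prints: fs1 fs2
```

The client keeps asking for usr/file1 before and after the migration; only the home recorded in the virtual directory changes.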