U-Net: A User-Level Network Interface for Parallel and Distributed Computing
T. von Eicken, A. Basu, V. Buch and W. Vogels, Cornell University
Appears in SOSP 1995 (ACM SIGOPS Symposium on Operating Systems Principles)
Presented by: Joseph Paris
Introduction
• There has been a shift in the local area network bottleneck
  • Traditionally: limited bandwidth
  • Now: the message path through software
• Looking at the UNIX networking architecture, the message path through the kernel involves
  • Several copies
  • Crossing multiple levels of abstraction between device drivers and user applications
  • The result: overhead
• Processing overheads limit the peak communication bandwidth and cause high latency
• So upgrades in networking technology go largely unnoticed by the general user community
• A vendor-supplied problem?
  • Vendors may think of large data-stream cases and less about per-message overhead
Observation
• Most applications use relatively small messages and rely heavily on quick round-trip requests and replies
  • Distributed shared memory
  • Remote procedure calls
  • Remote object-oriented method invocations
  • Distributed cooperative file caches
• They could also benefit from more flexible interfaces to the network
  • The traditional architecture cannot easily support new protocols or interfaces
  • Integrating application-specific information into protocol processing gives
    • Higher efficiency
    • Greater flexibility
  • E.g. video, audio, transferring directly from data structures
Motivation
• Low end-to-end communication latencies
  • Separating processing overhead from network latency
  • Distributed systems that depend on low latency
    • Object-oriented technology
      • Objects are generally small (100 bytes vs. Kbytes)
    • Electronic workplace
      • Simple database servers that handle object naming, location, authentication, protection (20-80 bytes for requests, 40-200 bytes for responses)
    • Cache coherence
      • Keeping copies consistent introduces a large number of small coherence messages
    • Fault-tolerance algorithms and group communication
      • Global locks, scheduling, coherence
    • RPCs, file systems, etc.
Motivation
• Small-message bandwidth
  • The same trends that demand low latencies also demand high bandwidth for small messages
    • Object-oriented technology, electronic workplace, cache coherence, RPCs, etc.
  • Part of decreasing the overall end-to-end latency is having high bandwidth for small messages
  • Basically, we want to reach full network bandwidth with messages that are as small as possible
• Protocol interface flexibility
  • Traditionally
    • Protocol stacks are implemented as part of the kernel
    • Kernel and application buffer management are not integrated
  • Solution
    • Remove the communication subsystem's boundary with the application-specific protocols
    • Tight coupling between the communication protocol and the application
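The link between per-message overhead and small-message bandwidth can be illustrated with a simple cost model. This is a minimal sketch, assuming each message costs a fixed processing overhead plus its wire time; the function names, the link rate and the overhead values are illustrative assumptions, not figures from the talk.

```python
def effective_bandwidth(msg_bytes, overhead_s, link_bytes_per_s):
    """Delivered bytes/s when each message costs a fixed overhead plus wire time."""
    return msg_bytes / (overhead_s + msg_bytes / link_bytes_per_s)


def half_power_size(overhead_s, link_bytes_per_s):
    """Message size at which effective bandwidth reaches half of the link rate.

    Solving n / (o + n / B) = B / 2 for n gives n = o * B.
    """
    return overhead_s * link_bytes_per_s


if __name__ == "__main__":
    LINK = 15e6  # assumed link rate in bytes/s (illustrative only)
    for overhead in (100e-6, 10e-6):  # assumed per-message overheads in seconds
        n_half = half_power_size(overhead, LINK)
        print(f"overhead={overhead * 1e6:.0f} us -> half-power size {n_half:.0f} bytes")
```

Under this model, cutting the per-message overhead by a factor of ten also cuts by ten the message size needed to reach half of the link bandwidth, which is why low overhead matters for small messages.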
Solution: U-Net
• Why?
  • Focus on low latency and high bandwidth using small messages
  • Emphasis on protocol design and integration flexibility
  • Desire to meet these goals on widely available off-the-shelf hardware
• How?
  • Simply remove the kernel from the critical path of sending and receiving messages
    • Eliminates the system call overhead
    • Offers opportunities to streamline the buffer management
• What's required?
  • Virtualizing the network interface among processes
  • Protection, so that processes using the network cannot interfere with each other
  • Message multiplexing and de-multiplexing
  • Managing communication resources without the kernel
  • An efficient and versatile programming interface to the network
Design & Implementation of U-Net
• Virtualize the network interface so that a combination of OS and hardware mechanisms gives each process the illusion of owning the interface
  • In hardware
    • Components manipulated by a process correspond to real hardware
  • In software
    • Memory locations are interpreted by the OS
  • Or a combination of both
• The role of U-Net is limited to
  • Multiplexing the actual network interface among all processes
  • Enforcing protection boundaries
  • Enforcing resource consumption limits
• This leaves the process with control over
  • The contents of each message
  • Management of send and receive resources (such as buffers)
Design & Implementation of U-Net
• There are 3 main building blocks
  • Endpoints
    • Serve as an application's handle into the network and contain the other two
  • Communication segments
    • Regions of memory that hold message data
  • Message queues
    • Hold descriptors for messages that are to be sent or have been received
• Each process that wants to access the network
  • Creates one or more endpoints
  • Associates a communication segment with each endpoint
  • And a set of send, receive and free message queues
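The three building blocks can be pictured as plain data structures. The following is a minimal sketch, assuming a descriptor holds an offset and a length into the communication segment; the class names, field names and sizes are hypothetical and are not taken from the U-Net implementation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    """Describes one message: where its data sits in the communication segment."""
    offset: int   # byte offset into the communication segment
    length: int   # length in bytes
    tag: int = 0  # destination (send) or origin (receive) tag


@dataclass
class Endpoint:
    """An application's handle into the network (hypothetical layout)."""
    segment: bytearray                            # communication segment
    send_q: deque = field(default_factory=deque)  # messages to transmit
    recv_q: deque = field(default_factory=deque)  # messages that have arrived
    free_q: deque = field(default_factory=deque)  # empty buffers for arrivals


def create_endpoint(segment_size=4096, buf_size=1024):
    """Create an endpoint and offer the upper half of its segment as free buffers."""
    ep = Endpoint(segment=bytearray(segment_size))
    for off in range(segment_size // 2, segment_size, buf_size):
        ep.free_q.append(Descriptor(offset=off, length=buf_size))
    return ep
```

Splitting the segment into a send half and a receive half is only one possible policy; the point of the design is that the process, not the kernel, decides.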
Design & Implementation of U-Net
• Sending
  • The user process composes the data in the communication segment
  • It pushes a descriptor for the message onto the send queue
  • The network interface is then expected to pick the message up and insert it into the network
  • If the network backs up, the interface
    • Leaves the descriptor in the queue
    • Eventually exerts back-pressure on the user process when the queue becomes full
• Receiving
  • Messages are de-multiplexed based on their destination
  • Data is transferred to the appropriate communication segment
  • The message descriptor is pushed onto the corresponding receive queue
• Receive notification models
  • Polling
    • The process checks the receive queue periodically
  • Blocking
    • The process blocks until the next message arrives, via the UNIX select call
  • Event driven
    • The process registers an upcall
    • The upcall signals that the state of the receive queue satisfies a certain condition
    • Only two conditions are currently supported
      • Queue is non-empty
      • Queue is almost full
    • To amortize the cost of an upcall and keep performance high, all pending messages can be consumed in a single upcall
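The send and receive paths above can be sketched as a small simulation. This is a toy model, assuming a bounded send queue and a single "queue is non-empty" upcall; the class, the method names and the `on_nonempty` attribute are hypothetical and only illustrate the queue discipline, not the real interface.

```python
from collections import deque


class Endpoint:
    """Toy endpoint with a bounded send queue and a receive queue."""

    def __init__(self, send_slots=2):
        self.send_q = deque()
        self.recv_q = deque()
        self.send_slots = send_slots
        self.on_nonempty = None  # optional upcall (hypothetical name)

    def send(self, tag, payload):
        """Push a descriptor; False signals back-pressure (queue full)."""
        if len(self.send_q) >= self.send_slots:
            return False
        self.send_q.append((tag, payload))
        return True

    def transmit_one(self):
        """Simulated network interface: take one descriptor off the send queue."""
        return self.send_q.popleft() if self.send_q else None

    def deliver(self, tag, payload):
        """Simulated arrival: enqueue, then fire the upcall if the queue was empty."""
        was_empty = not self.recv_q
        self.recv_q.append((tag, payload))
        if was_empty and self.on_nonempty is not None:
            self.on_nonempty(self)

    def poll(self):
        """Polling receive: next message, or None if nothing has arrived."""
        return self.recv_q.popleft() if self.recv_q else None


def drain(ep):
    """Consume every pending message in one call, as an upcall handler would."""
    messages = []
    while (msg := ep.poll()) is not None:
        messages.append(msg)
    return messages
```

A caller that gets `False` from `send` has to retry later, which is how a full queue pushes back on the process; `drain` shows how one handler invocation can empty the whole receive queue.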
Design and Implementation of U-Net
• Multiplexing and de-multiplexing messages
  • A tag in each incoming message determines its
    • Destination endpoint
    • Communication segment
    • Message queue descriptor
  • The exact form of the message tag depends on the network substrate
    • E.g. ATM uses virtual channel identifiers
  • Tags are obtained through an OS-level service, which
    • Helps the application determine the correct tag to use, based on a specification of the destination process and the route between the two nodes
    • Handles route discovery
    • Handles switch-path setup
    • Handles other signaling that is specific to the network technology
  • Authentication and authorization
    • The service checks that the application is allowed to access the specific network resources
    • It also checks that there are no conflicts with other applications
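Tag-based de-multiplexing amounts to a table lookup. The sketch below assumes integer tags (standing in for something like ATM virtual channel identifiers) and uses a registration step as a stand-in for the OS-level service; the class and method names are hypothetical.

```python
class Demultiplexer:
    """Maps message tags to the receive queue of the owning endpoint."""

    def __init__(self):
        self.table = {}  # tag -> receive queue (a plain list in this sketch)

    def register(self, tag, recv_queue):
        """Grant a tag to an endpoint; refuse tags that are already owned."""
        if tag in self.table:
            raise PermissionError(f"tag {tag} already in use")
        self.table[tag] = recv_queue

    def deliver(self, tag, payload):
        """Route an incoming message; drop it if no endpoint owns the tag."""
        queue = self.table.get(tag)
        if queue is None:
            return False
        queue.append(payload)
        return True
```

Because a process can only receive on tags that were registered for it, the lookup table doubles as the protection boundary between applications.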
Design and Implementation of U-Net
• Base-level architecture
  • Current hardware cannot support direct access
    • "True zero-copy": data is sent directly out of the application's data structures without intermediate buffering
    • Requires special memory mapping so that the network interface can span the process's entire address space
  • So for now we only get base-level "zero-copy" support
    • Which in reality requires a single copy, namely between the application's data structures and a buffer in the communication segment
• Queue-based interface to the network
  • Stages messages in a limited-size communication segment on their way between application data structures and the network
  • Send and receive queues hold descriptors with information about the destination or origin endpoint, the length, and the offsets within the communication segment
  • Management of the send buffers is entirely up to the process
    • Buffers must be properly aligned for the requirements of the network interface
  • The process cannot control the order in which receive buffers are filled with incoming messages
  • Free queues hold descriptors for free buffers that are made available to the network interface for storing arriving messages
• Small-message optimization
  • Send and receive queues may hold entire messages in descriptors (instead of pointers to data)
  • Avoids buffer management and can improve round-trip latency
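The free queue and the small-message optimization can be sketched together. This is a simplified model, assuming a hypothetical inline threshold `INLINE_MAX` and messages that fit in a single buffer; the dictionary-based descriptors are illustrative only.

```python
from collections import deque

INLINE_MAX = 40  # assumed inline threshold in bytes (illustrative value)


def receive(segment, free_q, recv_q, payload):
    """Place an arriving message: inline in the descriptor if small,
    otherwise in a buffer taken from the free queue."""
    if len(payload) <= INLINE_MAX:
        recv_q.append({"inline": bytes(payload)})  # no buffer consumed
        return True
    if not free_q or len(payload) > free_q[0][1]:
        return False  # no suitable free buffer: the message is dropped
    offset, _size = free_q.popleft()
    segment[offset:offset + len(payload)] = payload
    recv_q.append({"offset": offset, "length": len(payload)})
    return True


def read(segment, desc):
    """Return the message bytes that a receive descriptor refers to."""
    if "inline" in desc:
        return desc["inline"]
    return bytes(segment[desc["offset"]:desc["offset"] + desc["length"]])
```

Small messages never touch the free queue in this sketch, which is the buffer-management saving the optimization is after.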
Evaluation
• Two U-Net implementations
  • SBA-100
    • Non-programmable; U-Net is implemented entirely in software
    • Performance is poor
    • 33-40% increase in overhead due to ATM header CRC calculation being done in software
  • SBA-200
    • Programmable, with custom firmware
    • Reflects the base-level U-Net architecture in hardware
• Three tests
  • U-Net Active Messages implementation (UAM)
    • Active Messages is a mechanism that allows efficient overlapping of communication with computation in multiprocessors
    • Communication takes the form of requests and matching replies
  • Split-C
    • A parallel extension to C for programming distributed-memory machines using a global address space abstraction
    • Consists of one thread of control per process from a single code image; the threads interact through reads and writes on shared data
    • Implemented with U-Net Active Messages
  • TCP/UDP
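The request and matching reply pattern of Active Messages can be sketched in a few lines. This is a toy, single-process illustration in which a message names the handler to run on arrival; the `Node` class and the handler names are hypothetical and do not reflect the UAM interface.

```python
class Node:
    """Toy node: each message names a handler that runs on arrival."""

    def __init__(self, name):
        self.name = name
        self.handlers = {}
        self.memory = {}

    def register(self, handler_id, fn):
        """Associate a handler function with an identifier."""
        self.handlers[handler_id] = fn

    def receive(self, sender, handler_id, *args):
        """Dispatch an arriving message directly to the named handler."""
        self.handlers[handler_id](self, sender, *args)


def read_request(node, sender, key):
    """Request handler: look up a value and send the matching reply."""
    sender.receive(node, "read_reply", key, node.memory.get(key))


def read_reply(node, sender, key, value):
    """Reply handler: store the fetched value locally."""
    node.memory[key] = value
```

A remote read then becomes one request handled on the owner node and one reply handled on the requester, which is the kind of small round trip the evaluation measures.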
Evaluation
• Active Messages (UAM)
  • [Figure: U-Net bandwidth as a function of message size]
  • [Figure: U-Net round-trip time as a function of message size]
Evaluation
• Split-C using UAM
  • [Figure: Overall execution time]
  • [Figure: CPU and network breakdown for two applications]
Evaluation
• TCP/UDP
  • [Figure: U-Net TCP bandwidth as a function of message size]
  • [Figure: U-Net UDP bandwidth as a function of message size]
Evaluation
• TCP/UDP
  • [Figure: Latency as a function of message size]
Conclusion
• Processing overhead on messages has been minimized
  • Latency experienced by the application is once again dominated by the actual message transmission time
• A simple network interface that supports traditional inter-networking protocols as well as abstractions such as Active Messages
• Demonstrates that removing the kernel from the communication path can offer new flexibility in addition to high performance
• TCP/UDP protocols achieve latencies and throughput close to the raw maximum