Communication in Tightly Coupled Systems CS 519: Operating System Theory Computer Science, Rutgers University Instructor: Thu D. Nguyen TA: Xiaoyan Li Spring 2002
Why Parallel Computing? Performance!

Processor Performance
But not just Performance
• At some point, we're willing to trade some performance for:
  • Ease of programming
  • Portability
  • Cost
• Ease of programming & portability
  • Parallel programming for the masses
  • Leverage new or faster hardware as soon as possible
• Cost
  • High-end parallel machines are expensive resources
Amdahl’s Law
• If a fraction s of a computation is not parallelizable, then the best achievable speedup on p processors is

  speedup(p) = 1 / (s + (1 − s)/p) ≤ 1/s

Pictorial Depiction of Amdahl’s Law

[Figure: execution time split into a serial fraction and a parallel fraction that shrinks as p grows]
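The bound above is easy to evaluate numerically. A minimal sketch (the function name is ours, not from the slides):

```python
def amdahl_speedup(s, p):
    """Best achievable speedup on p processors when a fraction s is serial."""
    return 1.0 / (s + (1.0 - s) / p)

# With a 10% serial fraction, 16 processors give only 6.4x,
# and no processor count can beat the 1/s = 10x ceiling.
print(amdahl_speedup(0.1, 16))   # 6.4
print(1.0 / 0.1)                 # 10.0 asymptotic limit
```

Note how quickly the serial fraction dominates: going from 16 to infinitely many processors buys less than a 2x improvement here.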
Parallel Applications
• Scientific computing is not the only class of parallel applications
• Examples of non-scientific parallel applications:
  • Data mining
  • Real-time rendering
  • Distributed servers
Centralized Memory Multiprocessors

[Figure: two CPUs, each with a cache, sharing a single memory over a common memory bus; an I/O bus connects the disk and network interface]
Distributed Shared-Memory (NUMA) Multiprocessors

[Figure: two nodes, each with a CPU, cache, local memory, memory bus, I/O bus, disk, and network interface, connected by a network]
Multicomputers

[Figure: nodes with the same hardware as in the NUMA diagram — CPU, cache, memory, I/O bus, disk, network interface — connected by a network]

Inter-processor communication in multicomputers is effected through message passing.
Basic Message Passing

[Figure: process P0 on node N0 performs a Send; process P1 on node N1 performs a Receive; the message travels between the nodes through the communication fabric]
Terminology
• Basic message passing:
  • Send: analogous to mailing a letter
  • Receive: analogous to picking up a letter from the mailbox
  • Scatter-gather: ability to “scatter” the data items in a message into multiple memory locations, and to “gather” data items from multiple memory locations into one message
• Network performance:
  • Latency: the time from when a Send is initiated until the first byte is received by a Receive
  • Bandwidth: the rate at which a sender is able to send data to a receiver
Scatter-Gather

[Figure: gather (send side) collects items from scattered memory locations into one message; scatter (receive side) distributes a message’s items into scattered memory locations]
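The gather/scatter idea can be sketched with byte buffers; this is an illustration of the concept, not any particular messaging API:

```python
def gather(buffers):
    # Gather: concatenate data from multiple "memory locations" into one message.
    return b"".join(bytes(b) for b in buffers)

def scatter(message, dests):
    # Scatter: copy consecutive slices of the message into multiple destinations.
    offset = 0
    for d in dests:
        n = len(d)
        d[:] = message[offset:offset + n]
        offset += n

src1, src2 = bytearray(b"hea"), bytearray(b"der")
msg = gather([src1, src2])        # one message: b"header"
out1, out2 = bytearray(2), bytearray(4)
scatter(msg, [out1, out2])        # out1 = b"he", out2 = b"ader"
```

Real interfaces (e.g., vectored I/O or MPI derived datatypes) do this without the intermediate copy, which is exactly why scatter-gather matters for performance.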
Basic Message Passing: Easy, Right?
• What can be easier than this, right?
• Well, think of the post office: even sending a letter involves many steps behind the scenes
Basic Message Passing: Not So Easy
• Why is it so complicated to send a letter if basic message passing is so easy?
• Well, it’s really not easy! Issues include:
  • Naming: how to specify the receiver?
  • Routing: how to forward the message to the correct receiver through intermediaries?
  • Buffering: what if the out port is not available? What if the receiver is not ready to receive the message?
  • Reliability: what if the message is lost in transit? What if the message is corrupted in transit?
  • Blocking: what if the receiver is ready to receive before the sender is ready to send?
Traditional Message Passing Implementation

[Figure: sender S and receiver R communicate through the kernel; the message M is copied through several buffers along the way]

• Kernel-based message passing: unnecessary data copying and traps into the kernel
Reliability
• Reliability problems:
  • Message loss
    • Most common approach: if a reply/ack does not arrive within some time interval, resend
  • Message corruption
    • Most common approach: send additional information (e.g., an error-correcting code) so the receiver can reconstruct the data, or simply detect corruption, if part of the message is lost or damaged. If reconstruction is not possible, throw away the corrupted message and pretend it was lost
  • Lack of buffer space
    • Most common approach: control the flow and size of messages to avoid running out of buffer space
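The loss and corruption strategies above can be sketched together: a checksum lets the receiver detect corruption and drop the frame (treating it as a loss), and the sender retransmits until it is acknowledged. The frame layout and channel callback are our own toy constructions:

```python
import hashlib

def frame(payload):
    # Prepend a digest so the receiver can detect corruption.
    return hashlib.md5(payload).digest() + payload

def deliver(f):
    digest, payload = f[:16], f[16:]
    if hashlib.md5(payload).digest() != digest:
        return None                  # corrupted in transit: drop, pretend lost
    return payload

def send_reliable(channel, payload, retries=5):
    """Resend until the channel acks (channel models send + timeout-bounded ack)."""
    for _ in range(retries):
        if channel(frame(payload)):  # ack arrived before the timeout
            return True
    return False                     # give up after `retries` attempts

# Toy channel: loses the first transmission, acks the second.
attempts = []
def lossy(f):
    attempts.append(f)
    return len(attempts) >= 2

print(send_reliable(lossy, b"hello"))   # True, after one retransmission
```

Real transports combine both mechanisms the same way: corrupted frames are silently discarded, and the retransmission timer recovers from both corruption and outright loss.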
Reliability
• Reliability is indeed a very hard problem in large-scale networks such as the Internet
  • The network is unreliable
  • Message loss can greatly impact performance
  • Mechanisms to address reliability can be costly even when there is no message loss
• Reliability is not as hard for parallel machines
  • The underlying network hardware is much more reliable
  • Less prone to buffer overflow, because they often have hardware flow control
• We address reliability later, for loosely coupled systems
Computation vs. Communication Cost
• 200 MHz clock → 5 ns instruction cycle
• Memory access:
  • L1: ~2-4 cycles → 10-20 ns
  • L2: ~5-10 cycles → 25-50 ns
  • Memory: ~50-200 cycles → 250-1000 ns
• Message roundtrip latency:
  • ~20 µs
• Suppose a 75% hit ratio in L1, no L2, 10 ns L1 access time, and 500 ns memory access time → average memory access time of 132.5 ns
• 1 message roundtrip latency ≈ 151 memory accesses
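The slide's arithmetic checks out; spelling it out:

```python
# Average memory access time = hit_ratio * L1_time + miss_ratio * memory_time
hit_ratio = 0.75
l1_ns, mem_ns = 10, 500
avg_ns = hit_ratio * l1_ns + (1 - hit_ratio) * mem_ns
print(avg_ns)                    # 132.5 ns

roundtrip_ns = 20_000            # ~20 us message roundtrip latency
print(roundtrip_ns / avg_ns)     # ~151 memory accesses per roundtrip
```

In other words, one message roundtrip costs as much as roughly 150 average memory accesses, which is why communication cost dominates so many parallel programs.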
Performance … Always Performance!
• So … obviously, when we talk about message passing, we want to know how to optimize for performance
• But … which aspects of message passing should we optimize?
  • We could try to optimize everything
  • Optimizing the wrong thing wastes precious resources; e.g., optimizing how we leave mail out for the mail carrier does not significantly increase the overall “speed” of mail delivery
• Subject of Martin et al., “Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture,” ISCA 1997
Martin et al.: LogP Model
Sensitivity to LogGP Parameters
• LogGP parameters:
  • L = delay incurred in passing a short message from source to destination
  • o = processor overhead involved in sending or receiving a message
  • g = minimum time between message transmissions or receptions (message bandwidth)
  • G = bulk gap = time per byte transferred for long transfers (byte bandwidth)
• Workstations connected with a Myrinet network and the Generic Active Messages layer
• Delay-insertion technique
• Applications written in Split-C, but they perform their own data caching
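Using the parameter definitions above, the one-way time of an n-byte bulk message is commonly modeled as sender overhead, plus a gap per additional byte, plus wire latency, plus receiver overhead. A sketch under that assumption (the formulation is the standard textbook one, not taken verbatim from the paper):

```python
def loggp_time_us(n_bytes, L, o, G):
    """Approximate one-way time (us) for an n-byte message under LogGP:
    send overhead + per-byte gap for the remaining bytes + latency + recv overhead."""
    return o + (n_bytes - 1) * G + L + o

# Illustrative numbers (ours, not measured): L=5us, o=2us/side, G=0.01us/byte.
print(loggp_time_us(4096, L=5.0, o=2.0, G=0.01))   # ~49.95 us for a 4 KB message
```

Note how, for large messages, the (n-1)·G term dominates, while for short messages the fixed 2o + L cost does, which is exactly the split the sensitivity experiments below probe.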
Sensitivity to Overhead

Sensitivity to Gap

Sensitivity to Latency

Sensitivity to Bulk Gap
Summary
• Runtime is strongly dependent on overhead and gap
  • Strong dependence on gap because of the burstiness of communication
• Not so sensitive to latency → we can effectively overlap computation and communication with non-blocking reads (writes usually do not stall the processor)
• Not sensitive to bulk gap → we have more bandwidth than we know what to do with
What’s the Point?
• What can we take away from Martin et al.’s study?
  • It’s extremely important to reduce overhead, because it may affect both “o” and “g”
  • All the “action” is currently in the OS and the network interface card (NIC)
• Subject of von Eicken et al., “Active Messages: a Mechanism for Integrated Communication and Computation,” ISCA 1992
An Efficient Low-Level Message Passing Interface
• von Eicken et al., “Active Messages: a Mechanism for Integrated Communication and Computation,” ISCA 1992
• von Eicken et al., “U-Net: A User-Level Network Interface for Parallel and Distributed Computing,” SOSP 1995
• Santos, Bianchini, and Amorim, “A Survey of Messaging Software Issues and Systems for Myrinet-Based Clusters,” PDCP 1999
von Eicken et al.: Active Messages
• Design challenges for large-scale multiprocessors:
  • Minimize communication overhead
  • Allow computation to overlap communication
  • Coordinate the above two without sacrificing processor cost/performance
• Problems with traditional message passing:
  • Send/receive are usually synchronous; no overlap between communication and computation
  • If not synchronous, buffering is needed (inside the kernel) on the receive side
• Active Messages approach:
  • Asynchronous communication model (send and continue)
  • Each message specifies a handler that integrates the message into the ongoing computation on the receiving side
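The core mechanism (each message names its own handler, which runs on arrival and folds the payload into the ongoing computation instead of being buffered) can be sketched as a dispatch table. The names here are ours for illustration:

```python
# Handler table: in real Active Messages the message carries the handler's
# address; here we use names as stand-ins.
handlers = {}

def register(name):
    def wrap(fn):
        handlers[name] = fn
        return fn
    return wrap

@register("add_to_sum")
def add_to_sum(state, payload):
    state["sum"] += payload          # integrate payload into the computation

def on_message(state, msg):
    name, payload = msg              # the message specifies its handler...
    handlers[name](state, payload)   # ...which runs immediately on arrival

state = {"sum": 0}
for m in [("add_to_sum", 3), ("add_to_sum", 4)]:
    on_message(state, m)
print(state["sum"])                  # 7
```

Because the handler consumes the message immediately, no receive-side buffering or matching with a posted receive is needed, which is where the overhead savings come from.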
Buffering
• Remember the buffering problem: what to do if the receiver is not ready to receive?
  • Drop the message
    • Typically very costly because of recovery costs
  • Leave the message in the NIC
    • Reduces network utilization
    • Can result in deadlocks
  • Wait until the receiver is ready (synchronous or 3-phase protocol)
  • Copy to an OS buffer and later copy to the user buffer
3-phase Protocol
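The 3-phase (rendezvous) idea is that the sender first requests permission, the receiver allocates a buffer and grants it, and only then does the data move. A toy sketch with in-process queues standing in for the network (all names ours):

```python
import threading
from queue import Queue

req, grant, data = Queue(), Queue(), Queue()   # stand-ins for network channels

def sender(payload):
    req.put(len(payload))        # phase 1: request to send, announcing the size
    grant.get()                  # phase 2: block until the receiver grants
    data.put(payload)            # phase 3: transfer the data

received = []
def receiver():
    size = req.get()
    buf = bytearray(size)        # buffer is allocated before granting
    grant.put("OK")
    buf[:] = data.get()          # data lands in a buffer known to be ready
    received.append(bytes(buf))

t = threading.Thread(target=receiver)
t.start()
sender(b"bulk data")
t.join()
print(received[0])               # b'bulk data'
```

The cost is clear from the sketch: every transfer pays an extra roundtrip (phases 1 and 2) before any payload moves, which is why the slides treat this as one option among several rather than a default.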
Incoming Message Copying

[Figure: an incoming message is first copied into message buffers in the OS address space, then copied again into the process address space]
Copying - Don’t Do It! Hennessy and Patterson, 1996
Overhead of Many Native MIs Too High
• Recall that overhead is critical to application performance
• Asynchronous send and receive overheads on many platforms (back in 1991):

[Table: per-platform values of Ts, Tb, and Tfb; Ts = time to start a message, Tb = time per byte, Tfb = time per flop (for comparison)]
Message Latency on Two Different LAN Technologies
von Eicken et al.: Active Receive
• The key idea is really to optimize receive: buffer management is more complex on the receiver

[Figure: a message consisting of a handler reference followed by its data]
Active Receive More Efficient

[Figure: with Active Messages, a message goes from P0 directly to its handler in P1; with copying, the message passes through OS buffers on both nodes before reaching P1]
Active Message Performance

| Machine | Send instructions | Send time (µs) | Receive instructions | Receive time (µs) |
|---------|-------------------|----------------|----------------------|-------------------|
| nCUBE/2 | 21                | 11.0           | 34                   | 15.0              |
| CM-5    |                   | 1.6            |                      | 1.7               |

The main difference between these AM implementations is that the CM-5 allows direct, user-level access to the network interface. More on this in a minute!
Any Drawback To Active Messages?
• Active Messages → SPMD
  • SPMD: Single Program, Multiple Data
  • This is because the sender must know the address of the handler on the receiver
• Not absolutely necessary, however
  • Can use indirection, i.e., a table mapping handler IDs to addresses on the receiver. The mapping has a performance cost, though.
User-Level Access to NIC
• Basic idea: allow protected user access to the NIC for implementing communication protocols at user level
User-level Communication
• Basic idea: remove the kernel from the critical path of sending and receiving messages
  • User-memory to user-memory: zero copy
  • Permission is checked once, when the mapping is established
  • Buffer management is left to the application
• Advantages:
  • Low communication latency
  • Low processor overhead
  • Approach the raw latency and bandwidth provided by the network
• One approach: U-Net
U-Net Abstraction

U-Net Endpoints
U-Net Basics
• Protection is provided by endpoints and communication channels
  • Endpoints, communication segments, and message queues are only accessible by the owning process (all allocated in user memory)
  • Outgoing messages are tagged with the originating endpoint address, and incoming messages are demultiplexed and delivered only to the correct endpoints
• For ideal performance, firmware at the NIC should implement the actual messaging and NI multiplexing (including tag checking). Protection must be implemented by the OS by validating requests for the creation of endpoints. Channel registration should also be implemented by the OS.
• Message queues can be placed in different memories to optimize polling
  • Receive queue allocated in host memory
  • Send and free queues allocated in NIC memory
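The tag-based demultiplexing described above can be sketched as follows; this models the idea (NIC checks the tag and delivers only to the owning endpoint's queue), not the actual U-Net data structures:

```python
from collections import defaultdict

registered = {7}                          # endpoint ids the OS has validated
recv_queues = defaultdict(list)           # endpoint id -> user-space receive queue

def nic_deliver(msg):
    """Model of the NIC firmware: demultiplex by tag, drop unknown endpoints."""
    tag, payload = msg
    if tag not in registered:             # protection: no such endpoint -> drop
        return False
    recv_queues[tag].append(payload)      # delivered only to the owning endpoint
    return True

nic_deliver((7, b"ok"))                   # accepted
nic_deliver((9, b"spoofed"))              # rejected: endpoint 9 was never created
print(recv_queues[7])                     # [b'ok']
```

The important split is visible here: the OS only populates `registered` (the slow, once-per-endpoint path), while `nic_deliver` runs on every message without any kernel involvement.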
U-Net Performance on ATM

U-Net UDP Performance

U-Net TCP Performance

U-Net Latency
Virtual Memory-Mapped Communication
• The receiver exports its receive buffers
• The sender must import a receive buffer before sending
• The sender’s permission to write into the receive buffer is checked once, when the export/import handshake is performed (usually at the beginning of the program)
• The sender can communicate directly with the network interface to send data into imported buffers without kernel intervention
• At the receiver, the network interface stores the received data directly into the exported receive buffer with no kernel intervention
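The export/import handshake and the check-once property can be sketched like this (class and method names are ours; real VMMC systems do the import check in the OS and the writes in NIC hardware):

```python
class Receiver:
    def __init__(self):
        self.exported = {}                      # buffer id -> (buffer, allowed sender)

    def export(self, buf_id, buf, sender_id):
        self.exported[buf_id] = (buf, sender_id)

class Sender:
    def __init__(self, sender_id):
        self.id = sender_id
        self.imported = {}

    def import_buffer(self, receiver, buf_id):
        buf, allowed = receiver.exported[buf_id]
        if allowed != self.id:
            raise PermissionError("import denied")  # permission checked once, here
        self.imported[buf_id] = buf                 # no per-send checks afterwards

    def write(self, buf_id, offset, payload):
        buf = self.imported[buf_id]                 # direct write, no kernel on path
        buf[offset:offset + len(payload)] = payload

r = Receiver()
r.export("rbuf", bytearray(8), sender_id=0)
s = Sender(0)
s.import_buffer(r, "rbuf")                          # one-time handshake
s.write("rbuf", 0, b"hi")                           # fast path: plain memory write
```

After the handshake, every `write` is just a memory operation into the exported buffer, which is the whole point: the per-message critical path contains no permission check and no kernel trap.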