Myrinet is a cost-effective SAN that interconnects clusters, offering high-performance packet communication and switching technology. It supports sustained data rates of up to 245 MB/s and has low latency for short messages. The LANai chip, a custom VLSI chip, controls data transfer between the host and the network, and packets are routed using wormhole switching. Myrinet can carry packets of different types or protocols concurrently, and offers back-pressure flow control and deadlock prevention.
Active Message Implementation CS 63995 - Networks of Embedded Systems Raquel S. Whittlesey-Harris
Myrinet • Developed by Myricom • Based on earlier research projects • Mosaic • Atomic LAN - Research Prototype • Implemented network mapping and address-to-route translation • Not suitable for larger systems • 1 m distance limitation • 1 D chain topology limitation
Myrinet • A SAN widely used to interconnect Clusters • Cost Effective • $1300 for a 64 bit/66MHz SAN/PCI interface card • High Performance packet communication and switching technology • Performance measurements for a 666MHz P-III with 64-bit/66MHz PCI (PC164C) Interface card (results based on message passing performance in application programs)
Myrinet • Sustained one-way data rate for large packets • 245 MB/s (1.96 Gb/s) • Latency for short messages • 7 us • Host CPU utilization per message send • 0.3 us • Host CPU utilization per message receive • 0.75 us
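The unit conversion on this slide can be checked with a few lines of Python. The sketch below also includes a rough one-way transfer estimate; the "latency plus size over bandwidth" model and both function names are illustrative assumptions, not something stated in the slides.

```python
def mbytes_to_gbits(rate_mb_per_s):
    """Convert MB/s to Gb/s, taking 1 MB = 10^6 bytes and 1 Gb = 10^9 bits."""
    return rate_mb_per_s * 8 / 1000


def one_way_time_us(size_bytes, rate_mb_per_s=245.0, latency_us=7.0):
    """Rough estimate (assumed model): fixed latency plus serialization time.

    Bytes divided by MB/s yields microseconds directly, since 1 MB/s is
    one byte per microsecond.
    """
    return latency_us + size_bytes / rate_mb_per_s


if __name__ == "__main__":
    print(mbytes_to_gbits(245))        # data rate in Gb/s
    print(one_way_time_us(9 * 1024))   # estimate for a 9 KB packet, in us
```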
Myrinet • Host Interface Design • Consists of: • LANai chip • Custom VLSI chip which controls the data transfer between the host and the network • Main component - programmable microcontroller controls DMA engines responsible for both data transfer directions (host to/from onboard memory, onboard memory to/from network) • Data must be written to NIC SRAM before it can be put into the network
Myrinet • LANai monitors network status and mapping • SRAM memory • SRAM ranges in size of 512 KB to 1 MB • Stores MCP (Myrinet Control Program) and several job queues (used for communications between the LANai and host device drivers or user-level libraries)
Myrinet • Data Packets • Routed using wormhole switching • Intermediate switches forward worms to output ports without waiting for the entire worm to be assembled • Worm can stretch over several nodes and links at any one time • May in principle be of any length • In practice, max size of worm is 9 KB (limit imposed by LANai control program - 1996) • Can encapsulate other types of packets (e.g., IP) without an adaptation layer
Myrinet • Packets consist of: • Routing header • Type field - may customize type (MCP) • Payload • CRC • Identified by type • Carry packets of many types or protocols concurrently
Myrinet • Special control symbols are used for back-pressure flow control (STOP, GO), resetting the network (FRES), and backward reset (BRES) • Worms are stored in slack buffers of upstream nodes when output ports are unavailable • Possibility of deadlock, and thus of dropped packets, which is unacceptable • Source Path routing is used • Source computes the route, which is included in the header • Each output port is stripped from the header before the packet is routed through that port
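Source path routing with header stripping can be sketched as follows. This is a minimal illustration only: the topology dictionary, the node names, and the function `forward` are hypothetical, and real Myrinet switches operate on routing bytes in hardware rather than on Python lists.

```python
def forward(topology, start_switch, route, payload):
    """Follow a source-computed route through a set of switches.

    topology maps (switch, output_port) -> next node (assumed structure).
    route is the list of output ports chosen by the source.
    """
    node = start_switch
    header = list(route)
    hops = []
    while header:
        port = header[0]
        header = header[1:]            # strip this port before leaving through it
        hops.append((node, port))
        node = topology[(node, port)]  # the packet arrives at the next node
    return node, hops, payload


if __name__ == "__main__":
    topo = {("sw0", 3): "sw1", ("sw1", 1): "hostB"}
    print(forward(topo, "sw0", [3, 1], b"hello"))
```

Because the header shrinks at every switch, the destination sees only the type field, payload, and trailing checksum.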
Myrinet • New checksums are appended to end of worm • Switches • Up to 16-port crossbar switches • Any network topology can be created • Ports automatically detect the absence/presence of a link • On start-up, the host interfaces automatically determine the network topology
AM Implementation on Myrinet and Solaris 2.x • AM - Four Layers • AM Library • Exports the programming interface to applications • Firmware - LCP (LANai Control Program) • In embedded processor on the NIC • Network Protocol Processing • Provides reliable and unduplicated message delivery between NICs
AM Implementation on Myrinet and Solaris 2.x • Virtual Network Segment Driver (see Figure below) • Abstracts network interfaces and communication resources • Processor and interconnection hardware
AM Implementation on Myrinet and Solaris 2.x • Endpoint structures reside on the NIC and host memories • NIC • LANai processor processes network protocol • AM library requests services from the LANai processor using the endpoint queues • Endpoint frames (in limited number) are mapped onto the application address space for direct access to NIC
AM Implementation on Myrinet and Solaris 2.x • Virtual Network Driver permits more endpoints than can be accommodated by the NIC • Segment driver framework of Solaris VM Layer is used to swap pages into the host memory when not utilized and cache active endpoints into the NIC memory • Transport operations read/write message headers directly from the NIC memory (see Fig. 11.6, [2]) • Firmware and the segment driver communicate with a “system endpoint” • Always pinned
AM Implementation on Myrinet and Solaris 2.x • Firmware queues requests in the system endpoint’s queue and interrupts the driver • Each NIC is provided with a unique identifying number (nic_number) • Firmware maintains routes from itself to every other destination NIC in the network • Shared Memory Key is required for the source to map to the receive pool of the destination endpoint • nic_number, endpoint_number, and shared memory key form the endpoint name structure en_t
AM Implementation on Myrinet and Solaris 2.x • Host • Shared memory protocol is processed • A kernel thread created by the driver services requests by the NIC firmware to the driver • Interrupt handler wakes up the kernel thread to service requests
AM Implementation on Myrinet and Solaris 2.x • Endpoint Main Components (see Fig. 11.7 [2]) • Control Block • Translation Table • Handler Table • Tag • Shared Memory Queue Block (Fig. 11.8 [2]) • Shared Memory Protocol • Directly deposits the message in the receiver’s message receive queue • Uses System V IPC Layer
AM Implementation on Myrinet and Solaris 2.x • AM_Map() - receive queue points onto shared memory address space • Identifier required for mapping is made available through the endpoint name • Receive queues only • Shared memory transfer writes directly onto the receive queue of the recipient via shared memory mappings
AM Implementation on Myrinet and Solaris 2.x • Network Queue Block • Located in the NIC Memory • Medium Staging Area • Kernel memory • DMAble • Request and Replies are placed in separate queues (send & request pools) • Resident Queue - In NIC memory • Non-Resident Queue - In host memory
AM Implementation on Myrinet and Solaris 2.x • 16 descriptors for each message pool • Hold firmware information needed to deliver messages • Send Pool • Host maintains the tail offset (producer) • NIC maintains the head offset (consumer) • Receive Pool • Host maintains the head offset (consumer) • NIC maintains the tail offset (producer)
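The head/tail split between host and NIC is the classic single-producer, single-consumer ring. Below is a minimal sketch, assuming monotonically increasing offsets that are reduced modulo the pool size; the class and method names are hypothetical, and the real pools live in NIC or host memory rather than in a Python list.

```python
POOL_SIZE = 16  # descriptors per message pool, as stated on the slide


class MessagePool:
    """Ring of descriptors: one side owns the tail, the other owns the head."""

    def __init__(self):
        self.slots = [None] * POOL_SIZE
        self.head = 0  # consumer offset (NIC for a send pool, host for a receive pool)
        self.tail = 0  # producer offset (host for a send pool, NIC for a receive pool)

    def produce(self, descriptor):
        """Producer side: returns False when all descriptors are in use."""
        if self.tail - self.head == POOL_SIZE:
            return False
        self.slots[self.tail % POOL_SIZE] = descriptor
        self.tail += 1
        return True

    def consume(self):
        """Consumer side: returns None when the pool is empty."""
        if self.head == self.tail:
            return None
        descriptor = self.slots[self.head % POOL_SIZE]
        self.slots[self.head % POOL_SIZE] = None
        self.head += 1
        return descriptor
```

Since each offset has exactly one writer, neither side needs to lock the other out in this simplified model.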
AM Implementation on Myrinet and Solaris 2.x • Queues • Types • Packet Queue (handler index and arguments) • Bulk Data Queue (data for bulk transfers) • Three partitions • Head information (accessed by receivers) • Tail information (accessed by senders) • Two FIFO data queues
AM Implementation on Myrinet and Solaris 2.x • Transport • Sending an AM • AM Layer decides to use Shared Memory or Network Protocol • Comparison between nic_numbers and names of source and destination endpoints • Short Messages are sent using Programmed I/O only • Message and payload fit in one packet descriptor
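The transport decision can be illustrated with the endpoint name triple. In this sketch the type `EndpointName`, its field `shm_key`, and the function `choose_transport` are hypothetical names; only the three components of the name and the nic_number comparison come from the slides.

```python
from typing import NamedTuple


class EndpointName(NamedTuple):
    """Stand-in for en_t: the three parts that name an endpoint."""
    nic_number: int
    endpoint_number: int
    shm_key: int  # shared memory key (field name is an assumption)


def choose_transport(src, dst):
    """Same NIC means same host, so shared memory can be used (assumed rule)."""
    if src.nic_number == dst.nic_number:
        return "shared_memory"
    return "network"


if __name__ == "__main__":
    a = EndpointName(nic_number=1, endpoint_number=0, shm_key=0x1234)
    b = EndpointName(nic_number=1, endpoint_number=3, shm_key=0x5678)
    c = EndpointName(nic_number=2, endpoint_number=0, shm_key=0x9ABC)
    print(choose_transport(a, b), choose_transport(a, c))
```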
AM Implementation on Myrinet and Solaris 2.x • Medium and Large Messages use the medium staging area for a copy of the user buffer • DMA is used to fetch data into NIC memory • To queue a short message, the packet tail is incremented using the compare and swap (CAS) instruction • Change type from free to claimed • If claim fails, queue is full • sender backs off exponentially and polls for messages to prevent deadlock
AM Implementation on Myrinet and Solaris 2.x • For bulk data, a bulk data block is claimed before obtaining a packet assignment • Marked as ready-bulk • After the packet and data is sent • Packet marked as free • Data block marked as invalid • May receive packets while sending • Accomplished through polling • Added overhead
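The claim-or-back-off step for a queue slot can be sketched as below. Python has no hardware compare-and-swap, so a lock emulates it here; the `Slot` class, the `poll` callback, and the retry limit are all assumptions made for illustration.

```python
import threading
import time

FREE, CLAIMED = "free", "claimed"


class Slot:
    """A queue slot whose state changes only through compare_and_swap."""

    def __init__(self):
        self.state = FREE
        self._lock = threading.Lock()  # emulates the atomicity of a CAS instruction

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self.state == expected:
                self.state = new
                return True
            return False


def claim_with_backoff(slot, poll, max_tries=8, base_delay=1e-6):
    """Try to move the slot from free to claimed.

    On failure (queue full), poll for incoming messages so the peer can make
    progress, then wait twice as long before the next attempt.
    """
    delay = base_delay
    for _ in range(max_tries):
        if slot.compare_and_swap(FREE, CLAIMED):
            return True
        poll()             # service receives while waiting, to avoid deadlock
        time.sleep(delay)
        delay *= 2         # exponential back-off
    return False
```

Polling inside the retry loop is what lets a sender keep receiving while it waits, at the cost of the extra overhead the slide mentions.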
AM Implementation on Myrinet and Solaris 2.x • Receiving an AM • Polling at receive pools • Handler indicated by the descriptor is invoked • For medium messages, a pointer to the receive buffer in the MSA is passed as an argument to the handler • For large messages, the data is copied onto the indicated offset in the VM segment • Descriptor is freed after execution of the handler
AM Implementation on Myrinet and Solaris 2.x • NIC Firmware • Key issues for LCP (LANai Control Program) • Scheduling of outgoing traffic • Weighted round-robin policy which focuses on active endpoints is used • Empty endpoints are skipped • An active endpoint gets 2^k attempts to send (k = 7 currently); the scheduler loiters if the endpoint empties
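One pass of such a scheduler might look like the sketch below. It assumes the budget is 2^k attempts per active endpoint and models loitering as unused attempts spent at an endpoint that has just emptied; the function name and the dictionary of queues are hypothetical.

```python
from collections import deque


def schedule_pass(endpoints, k=7):
    """Visit every endpoint once; return the (endpoint, message) pairs sent.

    endpoints maps an endpoint name to a deque of pending messages.
    """
    sent = []
    for name, queue in endpoints.items():
        if not queue:
            continue                      # empty endpoints are skipped
        for _ in range(2 ** k):           # per-endpoint budget of attempts
            if queue:
                sent.append((name, queue.popleft()))
            # otherwise the attempt is spent loitering, in case the
            # endpoint becomes active again (static in this sketch)
    return sent


if __name__ == "__main__":
    eps = {"a": deque([1, 2, 3]), "b": deque(), "c": deque([9])}
    print(schedule_pass(eps, k=1))
```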
AM Implementation on Myrinet and Solaris 2.x • Flow control mechanisms and policies • Independent logical channels are maintained by implementing a send and receive table • Row index corresponds to the destination/source nic_number • Column index corresponds to the channel • Sequence number, a pointer to the packet, and time packet was sent is kept for potential retransmission • The expected sequence number is kept for the purpose of matching (an ACK is sent upon receipt of proper packet - Stop and Wait protocol)
AM Implementation on Myrinet and Solaris 2.x • Timer management for packet retransmissions • A timer event is scheduled upon sending of a packet • Receiving an ACK deletes the event • All send table entries are scanned periodically for packet retransmit • 255 retransmissions allowed • Message declared undeliverable if no ACKs or NACKs received (destination endpoint unreachable)
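The stop-and-wait channel with bounded retransmission can be sketched as follows. The loss model (an iterator of per-transmission outcomes), the integer sequence numbers, and all names are assumptions for illustration; the firmware's actual tables and timers are not shown in the slides.

```python
MAX_RETRANSMITS = 255  # limit stated on the slide


class Receiver:
    """Receive-table entry for one channel: tracks the expected sequence number."""

    def __init__(self):
        self.expected_seq = 0
        self.delivered = []

    def on_packet(self, seq, payload):
        """Return True if an ACK should be sent."""
        if seq == self.expected_seq:
            self.delivered.append(payload)
            self.expected_seq += 1
            return True
        return seq < self.expected_seq  # duplicate: ACK again, do not redeliver


def send_stop_and_wait(seq, payload, receiver, outcomes, max_retransmits=MAX_RETRANSMITS):
    """Send one packet and wait for its ACK before sending anything else.

    outcomes yields "ok", "drop_packet" or "drop_ack" per transmission.
    Returns the number of retransmissions used, or None if undeliverable.
    """
    for attempt in range(1 + max_retransmits):
        outcome = next(outcomes, "ok")
        if outcome == "drop_packet":
            continue                     # timer expires, packet is resent
        acked = receiver.on_packet(seq, payload)
        if acked and outcome == "ok":
            return attempt               # ACK arrived, timer event deleted
    return None                          # declared undeliverable
```

A lost ACK causes a retransmission of a packet the receiver already has; matching against the expected sequence number is what keeps delivery unduplicated.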
AM Implementation on Myrinet and Solaris 2.x • Detecting and recovering from errors • Erroneous packets are dropped • Relies upon timeout and retransmission (no ACK) • Event handling • Firmware maintains event mask • Set and reset event mask requests are implemented by the driver (via ioctl()) and passed to the NIC through the endpoint request queue
AM Implementation on Myrinet and Solaris 2.x • Upon AM_Wait(), ioctl() is called and the driver blocks on the condition • Upon event, • The NIC interrupts the driver reporting the event occurrence, • The kernel sends a wake-up signal to the application thread that was blocked • For shared memory protocol, the sender has to wake up the receiver • Receiver will set a flag in shared memory before blocking on an event
AM Implementation on Myrinet and Solaris 2.x • Sender wakes up the receiver after checking if the flag is set (may directly call ioctl() or request that the NIC wake up the receiver)
AM on TOS • Because TOS is designed to respond to events (event-based programming model), AM seems to be a good fit as a communications framework • AM supplies event based primitives • Avoids busy-waiting for data arrival • Supports concurrent communications and computation which is desired in the constrained environment of tiny network devices
AM on TOS • Low overhead • Complements the limited resources of tiny network devices • Stack space used on blocking • Distributed event model • Nodes may send events to other nodes
AM on TOS • TinyOS messaging component • Commands are accepted from applications to initiate message transmissions • Events are “fired off” to message handlers • Based on the type of message received • An event signals the completion of a transmission
AM on TOS • TOS Commands • Send Commands • Destination address • Handler ID • Message body
AM on TOS • AM Component • Handles address checking and dispatch • Relies on sub components for basic packet transmission • Radio • Serial
AM on TOS • Packet Component • Interface provides transmission of fixed-size, 30 byte packets • Two events are fired • Upon completion of transmission • Upon completion of reception
AM on TOS • Packet Format
AM on TOS • Packet Transmission • If R0 matches the local address, the handler is invoked and the remaining 28 bytes are passed • Handler dispatch routines are generated at compile time to reduce the overhead of handler registration • Handler 0 - routing handler • Handler 255 - lists installed handlers
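Address checking and handler dispatch can be sketched as below. The byte layout (R0 in byte 0, the handler ID in byte 1, then 28 bytes of body) is an assumption inferred from the 30-byte packet size and the 28 remaining bytes, since the Packet Format figure is not reproduced here. The dictionary lookup stands in for the compile-time dispatch routines, and dropping unknown handler IDs is a choice made for this sketch.

```python
PACKET_SIZE = 30  # fixed packet size provided by the packet component


def dispatch(packet, local_addr, handlers):
    """Invoke the handler named in the packet if it is addressed to this node.

    handlers maps a handler ID to a callable taking the 28-byte body.
    Returns True if a handler ran.
    """
    if len(packet) != PACKET_SIZE:
        raise ValueError("expected a 30-byte packet")
    dest, handler_id, body = packet[0], packet[1], packet[2:]
    if dest != local_addr:
        return False              # not for this node
    handler = handlers.get(handler_id)
    if handler is None:
        return False              # unknown handler ID: drop the message
    handler(body)
    return True


if __name__ == "__main__":
    table = {5: lambda body: print("handler 5 got", len(body), "bytes")}
    dispatch(bytes([7, 5]) + bytes(28), local_addr=7, handlers=table)
```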
AM on TOS • Invalid handler IDs cause the message to be ignored • Source-based multi-hop routing is supported • Maximum of 4-hop communication • R1 - R4 are used to hold node addresses • N is used to identify the number of hops • S is used to identify the source node • HF is used to identify the handler ID of the destination node
AM on TOS • H0 is set to zero while the packet is en route • Receipt of packet • The hop count (N) is decremented • The next hop (R0) is rotated and the local address is pushed to the end • Records the route the packet took • Destination handler (HF) is placed into H0 if the next hop is the final destination
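The per-hop rewrite performed by the routing handler can be sketched as follows. This is one possible reading of the slides: N is taken to count the hops still to be made after the current node, and the route is modelled as a four-element list for R1 to R4. The dataclass and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultiHopPacket:
    r0: int           # address of the node that should receive the packet next
    h0: int           # handler to run there (0 = routing handler)
    n: int            # hops remaining after the current node (assumed meaning)
    route: List[int]  # R1..R4
    s: int            # source node
    hf: int           # handler ID to run at the final destination


def forward_hop(pkt, local_addr):
    """Routing handler (handler 0) at an intermediate node."""
    pkt.n -= 1
    next_hop = pkt.route[0]
    pkt.route = pkt.route[1:] + [local_addr]  # rotate and record the node visited
    pkt.r0 = next_hop
    if pkt.n == 0:
        pkt.h0 = pkt.hf                       # next hop is the final destination
    return pkt


if __name__ == "__main__":
    p = MultiHopPacket(r0=2, h0=0, n=2, route=[3, 4, 0, 0], s=1, hf=5)
    forward_hop(p, local_addr=2)
    forward_hop(p, local_addr=3)
    print(p)
```

After the last forwarding step, the tail of the route field lists the intermediate nodes in the order they were visited.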
AM on TOS • Special Addresses • Broadcast address • One to all • UART • Packet is forwarded to the UART rather than radio by device receiving the packet
AM on TOS • Routing • Two types of messages • Route update • Records the received information in the routing table and initiates the retransmission of the propagated route update message • Data collection • Responds to the receipt of a packet that needs to be forwarded towards the base station
AM on TOS • Checks the routing table • Updates the payload of the packet to record that it passed through the local node • Sends the packet toward the recipient stored in the routing table • Clock event • Periodically triggers the node to begin collecting data from sensors and transmitting it towards the base station
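The two routing handlers can be sketched as a small node model. Everything here is an illustrative assumption: the single-entry routing table (`parent`), the hop counter in the update, the rule that only the first update is recorded, and the `outbox` list standing in for the radio.

```python
class Node:
    def __init__(self, addr):
        self.addr = addr
        self.parent = None  # routing table: next hop toward the base station
        self.outbox = []    # (destination, message) pairs this node has sent

    def on_route_update(self, sender, hops):
        """Record the route, then propagate the update to neighbours."""
        if self.parent is not None:
            return          # keep the first route learned (assumed policy)
        self.parent = sender
        self.outbox.append(("broadcast", ("route_update", self.addr, hops + 1)))

    def on_data(self, path, reading):
        """Forward a data packet toward the base station."""
        if self.parent is None:
            return          # no route yet: nothing to do in this sketch
        # Record in the payload that the packet passed through this node.
        self.outbox.append((self.parent, ("data", path + [self.addr], reading)))


if __name__ == "__main__":
    n = Node(3)
    n.on_route_update(sender=1, hops=0)
    n.on_data(path=[5], reading=42)
    print(n.outbox)
```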
References • A. Mainwaring and D. Culler, “Active Message Applications Programming Interface and Communication Subsystem Organization”, Computer Science Division, University of California at Berkeley, Draft Technical Report, 1995. • N. Parab and M. Raghvendran, “Active Messages”, Center for Development of Advanced Computing, Bangalore, India. • P. Buonadonna, J. Hill and D. Culler, “Active Message Communication for Tiny Networked Sensors”, Computer Science Division, University of California at Berkeley. • N. Parab and M. Raghvendran, “Myrinet”, Center for Development of Advanced Computing, Bangalore, India.
References • “Myrinet Overview”, http://www.myri.com/myrinet/overview/index.html • “Myrinet Performance Measurements”, http://www.myri.com/myrinet/performance/index.html • “Myrinet Protocols”, http://www.cs.ucla.edu/~simonw/sigcom/node2.html