Infiniband architecture • Specification (Infiniband architecture specification release 1.2, Oct. 5, 2004) available at Infiniband Trade Association (http://www.infinibandta.org) • Potential improvements
Infiniband architecture overview • Components: • Links • Channel adapters • Switches • Routers • The specification allows Infiniband wide area networks, but Infiniband is mostly adopted as a system/storage area network. • Topology: • Irregular • Regular: Fat tree • Link speed: • 2.5 Gbps (1X), 10 Gbps (4X), and 30 Gbps (12X).
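The link speeds above scale linearly with the number of lanes. The sketch below reproduces that arithmetic. The 8b/10b encoding factor is an assumption added for illustration (it is not stated in the slides); it separates the signaling rate from the usable data rate.

```python
# Signaling rate of a single (1X) lane in Gbps, as listed in the slide above.
LANE_SIGNAL_GBPS = 2.5

# Assumption (not in the slides): 8b/10b line encoding, so 8 of every 10 bits carry data.
ENCODING_EFFICIENCY = 8 / 10


def link_rates(width):
    """Return (signaling rate, data rate) in Gbps for a link with `width` lanes."""
    signal = LANE_SIGNAL_GBPS * width
    return signal, signal * ENCODING_EFFICIENCY


for width in (1, 4, 12):
    signal, data = link_rates(width)
    print(f"{width}X: {signal:g} Gbps signaling, {data:g} Gbps data")
```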
Layers: somewhat similar to TCP/IP • Physical layer • Link layer • Error detection (CRC) • Flow control (credit based) • Switching, virtual lanes (VL) • Forwarding table computed by the subnet manager • Routing is not adaptive • Network layer: routing across subnets. • Of no use in the cluster environment • Transport layer • Reliable/unreliable, connection/datagram • Verbs: interface between adapters and OS/users
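Credit-based flow control at the link layer can be illustrated with a toy model: the sender may transmit only while it holds credits, and the receiver returns a credit each time it frees a buffer. The class and method names below are hypothetical, and the model counts whole packets rather than the finer-grained units a real link would use.

```python
from collections import deque


class CreditLink:
    """Toy model of one virtual lane with credit-based flow control."""

    def __init__(self, receiver_buffers):
        self.credits = receiver_buffers  # credits the sender currently holds
        self.rx_buffer = deque()         # packets waiting at the receiver

    def send(self, packet):
        """Transmit only if a credit is available; otherwise the sender stalls."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.rx_buffer.append(packet)
        return True

    def drain(self):
        """Receiver consumes one packet and returns one credit to the sender."""
        if not self.rx_buffer:
            return None
        self.credits += 1
        return self.rx_buffer.popleft()


link = CreditLink(receiver_buffers=2)
print([link.send(p) for p in ("p0", "p1", "p2")])  # third send is refused
link.drain()                                       # receiver frees one buffer
print(link.send("p2"))                             # now the send goes through
```

Because a packet is never sent without a credit, the receiver's buffer cannot overflow, so no packet is dropped for lack of buffer space.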
Packet format: • Local Route Header (LRH): 8 bytes. Used for local routing by switches within an IBA subnet • Global Route Header (GRH): 40 bytes. Used for routing between subnets • Base Transport Header (BTH): 12 bytes, for IBA transport • Reliable datagram extended transport header (RDETH): 4 bytes, only for reliable datagram • Datagram extended transport header (DETH): 8 bytes • RDMA extended transport header (RETH): 16 bytes • Atomic, ACK, Atomic ACK • Immediate data extended transport header: 4 bytes, optimized for small packets. • Invalidate • Invariant CRC and variant CRC: • One CRC over the fields that do not change in transit, and one that also covers the fields that can change.
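The header sizes above make it easy to estimate per-packet overhead. The sketch below uses only the sizes listed in the slide; the CRC sizes (4-byte invariant CRC, 2-byte variant CRC) and the choice of which headers a given packet carries are assumptions added for illustration.

```python
# Header sizes in bytes, taken from the slide above.
HEADER_BYTES = {
    "LRH": 8, "GRH": 40, "BTH": 12,
    "RDETH": 4, "DETH": 8, "RETH": 16, "ImmDt": 4,
}

# Assumption (not in the slides): 4-byte invariant CRC and 2-byte variant CRC.
CRC_BYTES = {"ICRC": 4, "VCRC": 2}


def overhead(headers):
    """Total non-payload bytes for a packet carrying the given headers."""
    return sum(HEADER_BYTES[h] for h in headers) + sum(CRC_BYTES.values())


def efficiency(headers, payload_bytes):
    """Fraction of the bytes on the wire that are payload."""
    return payload_bytes / (payload_bytes + overhead(headers))


# Assumed header stacks for two packet types that stay within one subnet.
local_send = ["LRH", "BTH"]
local_rdma_write = ["LRH", "BTH", "RETH"]

print(overhead(local_send), overhead(local_rdma_write))
print(f"{efficiency(local_send, 256):.3f}")
```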
Local Route Header: • Switching based on the destination port address (LID) • Multipath switching by allocating multiple LIDs to one port • GRH: addresses use the same format as IPv6 addresses (16 bytes each)
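LID-based switching and multipath via multiple LIDs can be sketched as a table lookup. The table contents, LID values, and function names below are hypothetical; the point is only that a switch forwards on the destination LID alone, so giving one port two LIDs lets the subnet manager install two different paths to it.

```python
# Hypothetical forwarding table of one switch: destination LID -> output port.
forwarding_table = {
    0x10: 1,  # first LID of destination port P, reached via output port 1
    0x11: 2,  # second LID of the same port P, reached via output port 2
    0x20: 3,  # a different destination port
}


def forward(dlid):
    """A switch looks only at the destination LID in the LRH."""
    return forwarding_table[dlid]


def pick_dlid(base_lid, num_lids, flow_id):
    """Sender-side choice among the LIDs assigned to one destination port."""
    return base_lid + (flow_id % num_lids)


# Two flows to the same destination port take different output ports.
for flow_id in (0, 1):
    dlid = pick_dlid(0x10, 2, flow_id)
    print(f"flow {flow_id}: DLID {dlid:#x} -> output port {forward(dlid)}")
```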
Verbs • OS/users access the adapter through verbs • Communication mechanism: Queue Pair (QP) • Supports the four types of services, including reliable connection service • Each connection takes one QP on each end. • Each QP has a send queue and a receive queue. • Users can post send requests to the send queue and receive requests to the receive queue. • Three types of send operations: SEND, RDMA (WRITE, READ, ATOMIC), MEMORY-BINDING • One receive operation (matching SEND)
Queue Pair: • The result status of an operation (send/receive) is stored in the completion queue. • Send and receive queues can bind to different completion queues. • Related system-level verbs: • Open QP, create completion queue, open HCA, open protection domain, register memory, allocate memory window, etc. • User-level verbs: • Post send/receive request, poll for completion.
To communicate: • Make system calls to set up everything (open QP, bind QP to port, bind completion queues, connect local QP to remote QP, register memory, etc.). • Post send/receive requests. • Check completion. • What if a packet arrives before a receive request is posted? • Not specified in the standard • The right response should be a ‘receiver not ready (RNR)’ error. The sender is back-pressured in this case.
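The post/poll workflow and the receiver-not-ready case can be illustrated with a toy in-memory model. This is not the verbs API: all class and method names are hypothetical, the setup calls are reduced to a single `connect`, and an RNR is reported immediately as a completion status rather than retried.

```python
from collections import deque


class CompletionQueue:
    def __init__(self):
        self.entries = deque()

    def poll(self):
        """Return the oldest completion, or None if the queue is empty."""
        return self.entries.popleft() if self.entries else None


class QueuePair:
    def __init__(self, send_cq, recv_cq):
        self.send_cq = send_cq     # completions for posted sends
        self.recv_cq = recv_cq     # completions for posted receives
        self.recv_queue = deque()  # receive requests waiting for data
        self.remote = None

    def connect(self, other):
        """One QP on each end of the connection."""
        self.remote, other.remote = other, self

    def post_recv(self, buffer_id):
        self.recv_queue.append(buffer_id)

    def post_send(self, data):
        peer = self.remote
        if not peer.recv_queue:
            # No receive request posted yet: report receiver not ready.
            self.send_cq.entries.append(("send", "RNR"))
            return
        buffer_id = peer.recv_queue.popleft()
        peer.recv_cq.entries.append(("recv", buffer_id, data))
        self.send_cq.entries.append(("send", "OK"))


a_cq, b_cq = CompletionQueue(), CompletionQueue()
a, b = QueuePair(a_cq, a_cq), QueuePair(b_cq, b_cq)
a.connect(b)

a.post_send("hello")   # arrives before any receive request is posted
b.post_recv("buf0")
a.post_send("hello")   # now matches the posted receive
print(a_cq.poll(), a_cq.poll(), b_cq.poll())
```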
Infiniband has a perfect software interface (Chien'94 paper): • The network subsystem realizes all user-level functionality. • User-level access to the network interface: a few machine instructions accomplish the transmission task without involving the OS. • The network supports in-order delivery and fault tolerance. • Buffer management is pushed out to the user.
SilverStorm 9024: • 24 ports 4X (10 Gbps) or 8 ports 12X (30 Gbps) • switch type: cut-through • switch latency: < 140 ns • switch bandwidth: 480 Gbps • forwarding table size: 48K • VL support: 8 + 1 management
SilverStorm 9240: • 24 expansion slots, each expansion module with 12 4X ports or 4 12X ports (24 x 12 = 288, a 288 by 288 switch) • switch type: cut-through • switch latency: < 140 ns to < 420 ns • switch bandwidth: 5.76 Tbps • forwarding table size: 48K • VL support: 8 + 1 management
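The quoted switch bandwidths follow from the port counts and link speeds if both directions of each full-duplex port are counted. That interpretation is an assumption, but it reproduces both figures:

```python
def aggregate_bandwidth_gbps(ports, port_gbps, directions=2):
    """Aggregate bandwidth, counting both directions of each port by default."""
    return ports * port_gbps * directions


# SilverStorm 9024: 24 ports at 4X (10 Gbps), or 8 ports at 12X (30 Gbps).
print(aggregate_bandwidth_gbps(24, 10))  # 480 Gbps
print(aggregate_bandwidth_gbps(8, 30))   # 480 Gbps

# SilverStorm 9240: 24 slots x 12 ports at 4X = 288 ports.
print(aggregate_bandwidth_gbps(24 * 12, 10) / 1000)  # 5.76 Tbps
```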
Potential improvements on Infiniband using compiled communication • Improving the internal Infiniband fabric: • Offline routing for static patterns (a static SM for a reduced traffic pattern) can be beneficial for irregular networks. • Simplify the layer architecture with a direct link model (for known patterns); the header can be simplified, although this may not matter much because Infiniband layers are thin. • Simplify the protection mechanism. • Circuit-switched Infiniband. • A reliable communication protocol is still needed. • Potential benefits can be evaluated by simulation.
Improving the messaging software (software-to-hardware interface): no chance. • Improving the MPI implementation over Infiniband: similar to our current work on Ethernet • Message scheduling for collective/point-to-point communications based on the network topology. • Exploring NIC features (buffers in the NIC, multicast) • Reducing the number of instructions in a library routine makes sense. Compiled communication can be used to optimize the MPI library. • Compiled communication can help improve the library implementation (e.g. reducing the number of message copies, early request posting, using RDMA, etc.).
One particular project: • Design algorithms for the Infiniband subnet manager • Improving routing performance for the Infiniband subnet manager (SM). • Objective: minimize the maximum channel load for a given traffic pattern • Optimize according to a given pattern: the traffic pattern in an application is usually not all-to-all • Default routing used in IBA SM • For a sparse traffic pattern, the maximum channel load can usually be minimized using the minimum interference principle. • Need to extend minimum interference routing to load-balanced, deadlock-free routing. • The best way to realize the IBA SM is still unclear at this time; we can probably do something here. • Irregular network or fat tree network
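The objective above, minimizing the maximum channel load for a given traffic pattern, can be made concrete with a small sketch. The greedy heuristic below is only an illustration in the spirit of minimum interference, not the project's algorithm: it ignores deadlock freedom, and the topology, node names, and candidate paths are hypothetical.

```python
from collections import Counter


def channels(path):
    """Directed channels (links) traversed by a path given as a node list."""
    return list(zip(path, path[1:]))


def max_channel_load(routes):
    """Largest number of routed flows sharing any single channel."""
    load = Counter()
    for path in routes.values():
        load.update(channels(path))
    return max(load.values()) if load else 0


def greedy_route(pattern, candidates):
    """For each flow, pick the candidate path whose busiest channel is least loaded."""
    load = Counter()
    routes = {}
    for pair in pattern:
        best = min(candidates[pair],
                   key=lambda p: max(load[ch] for ch in channels(p)))
        routes[pair] = best
        load.update(channels(best))
    return routes


# Hypothetical two-level network: leaf switches S1, S2 and spine switches T1, T2.
candidates = {
    ("A", "C"): [["A", "S1", "T1", "S2", "C"], ["A", "S1", "T2", "S2", "C"]],
    ("B", "D"): [["B", "S1", "T1", "S2", "D"], ["B", "S1", "T2", "S2", "D"]],
}
pattern = [("A", "C"), ("B", "D")]

naive = {pair: paths[0] for pair, paths in candidates.items()}  # everything via T1
balanced = greedy_route(pattern, candidates)
print(max_channel_load(naive), max_channel_load(balanced))
```

With this sparse pattern, sending both flows through the same spine puts two flows on one channel, while the pattern-aware choice keeps every channel at a load of one.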