Send and Receive Based Message-Passing for SCMP Charles W. Lewis, Jr. Thesis Defense Virginia Tech April 28th, 2004
This presentation introduces the SCMP architecture, discusses problems with the current SCMP message-passing system, and focuses on the design and performance of a new SCMP message-passing system.
1. Overview of SCMP
2. Original Message-Passing System
3. New Message-Passing System
4. Performance Comparisons
Problems with current design trends motivate the SCMP concept.
• As transistor sizes shrink, so do communication wires, leading to higher cross-chip communication latencies.
• ILP faces diminishing returns.
• Large, complex uniprocessors require extensive design and verification effort.
SCMP provides PLP through replication.
• Up to 64 identical nodes on-chip.
• Replicated nodes reduce complexity.
• A 2-D network eliminates cross-chip wires.
(Figure: SCMP network with 64 nodes.)
SCMP provides TLP through multi-thread hardware support.
• Up to 16 threads.
• Round-robin thread scheduling by hardware.
• On every node: a 4-stage RISC pipeline, 8MB memory, and networking hardware.
(Figure: SCMP node.)
The original messaging system has two message types: thread messages and data messages. (Figure: thread and data message formats.) Because they contain handler information, these message formats borrow from the Active Messages message-passing system.
The network uses wormhole routing and dimension-order routing.
• Every router multiplexes virtual-channel buffers over its physical channels.
• Head flits claim virtual-channel resources as they travel.
• If one message blocks, other messages may still continue as long as enough virtual channels are free.
• Messages move along the X axis, then the Y axis.
• Tail flits release virtual-channel resources as they travel.
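The X-then-Y movement described above can be sketched as a small routing function. This is a minimal illustration of dimension-order (XY) routing on a 2-D mesh; the coordinate tuples and `xy_route` helper are assumptions for illustration, not SCMP's actual router logic.

```python
# Illustrative dimension-order (XY) routing on a 2-D mesh.
# Nodes are (x, y) tuples; this is a sketch, not SCMP's router hardware.

def xy_route(src, dst):
    """Return the hop sequence from src to dst: all X hops first, then all Y hops."""
    x, y = src
    dx, dy = dst
    hops = []
    while x != dx:                       # travel the X dimension first
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:                       # then the Y dimension
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

# Example: a route across an 8x8 mesh
print(xy_route((0, 0), (2, 3)))  # two X hops, then three Y hops
```

Because every message between the same pair of nodes takes the same X-then-Y path, the routes form no cycles, which is what makes dimension-order routing deadlock free as long as messages drain.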
Dimension-order routing is deadlock free as long as messages eventually drain.
• Even with VCs, the network can still deadlock if messages do not drain.
• If all contexts are consumed, thread messages block at the NIU.
• Threads may not release until a data message is received.
• Data messages must not be stopped by congested thread messages.
• Data messages must have a separate path through the network.
(Figure: router with separate thread and data VC paths.)
The NIU bears most of the messaging load. (Figure: NIU with thread buffer, per-context data buffers, receive buffer, and injection/ejection channels connecting the router and memory.)
The original message-passing system uses requests and replies. Node A requires data held by Node B:
• Node A creates a thread on Node B.
• The new thread on Node B sends the data to Node A.
• The new thread on Node B sends a SYNC message when done.
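The three-message exchange above can be traced with a toy model. The `Node` class and message tuples below are illustrative assumptions; the model delivers messages in order, which the real SCMP network does not guarantee.

```python
# Toy trace of the original request/reply protocol (not the SCMP ISA).
# Message delivery here is in-order, unlike the real SCMP network.

class Node:
    def __init__(self, name, memory=None):
        self.name = name
        self.memory = memory or {}
        self.inbox = []

def request_reply(a, b, addr):
    """Node A asks Node B for the value at addr; returns A's message trace."""
    # 1. A sends a thread message that spawns a request thread on B.
    b.inbox.append(("THREAD", addr))
    # 2. The request thread on B replies with a data message...
    _, want = b.inbox.pop(0)
    a.inbox.append(("DATA", b.memory[want]))
    # 3. ...followed by a SYNC message marking completion.
    a.inbox.append(("SYNC", None))
    return [kind for kind, _ in a.inbox]

a, b = Node("A"), Node("B", {0x100: 42})
print(request_reply(a, b, 0x100))  # ['DATA', 'SYNC']
```

On the real network the SYNC can overtake the data message, which is exactly the race the slides describe next.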
Dynamic memory is a problem.
• The request thread on Node B must know: the source address, source stride, destination address, destination stride, and number of values to send.
• How can Node A know the source address and stride if Node B allocates the buffer dynamically?
• The program must contain global pointers.
In-order delivery of messages is a problem.
• The SCMP network does not guarantee in-order delivery of messages.
• The SYNC message may reach Node A before the data message.
• Node A will read bad values from memory.
Request threads and finite thread contexts are a problem.
• If a node holds highly demanded data, request threads may consume all of its contexts.
• Additional thread messages will block in the network.
Send-and-receive message-passing eliminates all of these problems.
• A thread must execute a receive before data will be accepted, so request messages are not needed.
• Messages are identified abstractly, so global pointers are not needed.
• Completion notification occurs locally, so SYNC messages are not needed.
Rendezvous mode uses an RTS/CTS handshake. Node B holds data required by Node A:
• Node B sends Node A an RTS message when the send is executed.
• After the receive is executed, Node A sends Node B a CTS message.
• Node B sends the data after receiving the CTS.
Ready mode forgoes the handshake to reduce message latency. Node B holds data required by Node A:
• Node B sends the data as soon as the send is executed.
• The user must ensure that the receive has already executed on Node A.
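The two send modes can be contrasted as message traces. The direction labels and tuple format below are assumptions for illustration; the message names (RTS, CTS, data) follow the slides.

```python
# Message traces for the two send modes (B holds data required by A).
# Tuple format ("B->A", kind, ...) is an illustrative convention.

def rendezvous(payload):
    """Rendezvous mode: data moves only after the RTS/CTS handshake completes."""
    return [
        ("B->A", "RTS"),            # send executed on B
        ("A->B", "CTS"),            # receive executed on A
        ("B->A", "DATA", payload),  # transfer proceeds after the CTS
    ]

def ready(payload):
    """Ready mode: no handshake; the user guarantees the receive already ran."""
    return [("B->A", "DATA", payload)]

print(rendezvous(7))  # three messages
print(ready(7))       # one message: lower latency, but unsafe if receive is late
```

The trade-off is visible in the trace lengths: ready mode removes two network crossings per transfer at the cost of a correctness obligation on the programmer.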
The implementation centers around two tables: the send table and the receive table. (Figure: send table entry and receive table entry formats.)
Send table entries may be in 4 states, and receive table entries may be in 5 states. (Figure: send table entry states and receive table entry states.)
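The sender-side flowcharts that follow imply a simple entry lifecycle, which can be sketched as a transition table. The state names In Use, In Progress, and Complete come from the slides; the EMPTY state and the event names are assumptions filled in for illustration.

```python
# Hedged sketch of a send-table entry's lifecycle, inferred from the flowcharts.
# "EMPTY" and the event names are assumptions; the other states are from the slides.
SEND_TRANSITIONS = {
    ("EMPTY", "sendh"): "IN_USE",               # entry created, RTS sent
    ("IN_USE", "cts_arrives"): "IN_PROGRESS",   # CTS received, flits start flowing
    ("IN_PROGRESS", "tail_flit_sent"): "COMPLETE",
}

def step(state, event):
    """Apply one event; unknown (state, event) pairs leave the state unchanged."""
    return SEND_TRANSITIONS.get((state, event), state)

s = "EMPTY"
for ev in ["sendh", "cts_arrives", "tail_flit_sent"]:
    s = step(s, ev)
print(s)  # COMPLETE
```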
The new messaging system has four message types: thread, data, RTS, and CTS messages. (Figure: the four message formats.)
The NIU now contains a data queue for every context. (Figure: NIU with thread, RTS, CTS, receive, and per-context data buffers, plus injection/ejection channels connecting the router and memory.)
Only five new instructions and one modified instruction are needed.
Rendezvous Mode Operation at the Sender
(Flowchart.) On sendh, if no send-table entry exists, one is created (In Use), the message head and tag are queued, and an RTS is sent; otherwise the thread suspends. When the CTS message arrives, a missing entry is an error; otherwise the entry moves to In Progress, data flits are sent starting from the queued head until the tail flit goes out, and the entry moves to Complete.
Rendezvous Mode Operation at the Receiver
(Flowchart.) When an RTS arrives with no matching entry, the RTS is recorded (RTS Rcvd); if a matching In Use entry exists, a CTS is sent and the entry moves to In Progress; otherwise the RTS is blocked. When a data message arrives for an entry that is In Progress, its flits are stored until the tail flit and the entry moves to Complete; otherwise the data is blocked. When str executes with no matching entry, the str is recorded (In Use); if an RTS was already recorded, a CTS is sent and the entry moves to In Progress.
RTS and CTS messages also need separate VC paths.
• RTS messages can block in the network: for a given RTS message to leave the network, the RTS messages ahead of it must be satisfied (a CTS message to the source, then a data message back).
• RTS and CTS messages therefore have their own VC paths.
(Figure: router with separate thread, data, RTS, and CTS VC paths.)
Ready Mode Operation at the Sender
(Flowchart.) On sendh, if no entry exists, one is created (In Use), the message head and tag are queued, and the entry moves to In Progress; data flits are sent starting from the queued head until the tail flit goes out, and the entry moves to Complete.
Ready Mode Operation at the Receiver
(Flowchart.) When a data message arrives with no matching entry, it is discarded; if the matching entry is not yet In Progress, the data is blocked; otherwise flits are stored until the tail flit and the entry moves to Complete. When str executes, the str is recorded and the entry moves to In Use.
Stressmark testing was used to verify that performance was not hurt. • DIS Stressmark Suite • Neighborhood Stressmark • Matrix Stressmark • Transitive Closure Stressmark • LU Factorization Stressmark
The neighborhood stressmark measures image texture.
• Every node owns a portion of the total rows.
• Every row owns complete sum and difference histograms.
• Each node determines, and requests, the pairs for the pixels in its rows.
• Each node fills in the sum and difference histograms.
• Histograms are shared: each node manages only a portion of each histogram, and only the correct portion is sent to a node.
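The sum and difference histograms above can be sketched for a single row. This is an illustrative sequential version; the pair distance of 1 and the 8-bit pixel depth are assumptions, and the SCMP version partitions both the rows and the histograms across nodes.

```python
# Illustrative sum/difference histograms for one row of pixels.
# distance=1 and 8-bit depth are assumptions for this sketch.
def sum_diff_histograms(row, distance=1, depth=256):
    hist_sum = [0] * (2 * depth - 1)   # sums span 0 .. 2*(depth-1)
    hist_diff = [0] * (2 * depth - 1)  # differences span -(depth-1) .. depth-1
    for i in range(len(row) - distance):
        a, b = row[i], row[i + distance]
        hist_sum[a + b] += 1
        hist_diff[(a - b) + (depth - 1)] += 1  # shift so the index is non-negative
    return hist_sum, hist_diff

hs, hd = sum_diff_histograms([10, 20, 10, 30])
print(hs[30], hs[40])  # pairs (10,20) and (20,10) both land in sum bucket 30
```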
The new system outperforms the old under the neighborhood stressmark.
The matrix stressmark solves a linear system of equations using the conjugate gradient method.
• Additional vectors r and p are used for intermediate steps.
• Every node has: its rows of A, its elements of b and r, and complete copies of x and p.
• After each iteration p must be globally redistributed: share with columns, then share with rows.
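The iteration being distributed can be sketched sequentially. This is a minimal conjugate-gradient solver for Ax = b with a symmetric positive-definite A, using the r and p vectors named above; the tolerance and iteration cap are assumptions, and the SCMP version splits the rows of A across nodes and redistributes p each iteration.

```python
# Minimal sequential conjugate gradient for Ax = b (A symmetric positive-definite).
# Uses the residual r and search direction p named in the slides.
def conjugate_gradient(A, b, iters=50, tol=1e-10):
    n = len(b)
    x = [0.0] * n
    r = b[:]                 # r = b - A x, and x starts at zero
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]  # new search direction
        rs = rs_new
    return x

print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))  # ~[1/11, 7/11]
```

The global redistribution cost comes from the matrix-vector product `Ap`: every node's rows of A need the full, freshly updated p.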
The new system provides marginal improvement over the original under the matrix stressmark.
The transitive closure stressmark solves the all-pairs shortest-path problem.
• Floyd-Warshall algorithm over an adjacency matrix D[i][j].
• Iterative improvements: D[i][j] = min(D[i][j], D[i][k] + D[k][j]).
• Each node owns a sub-block of the adjacency matrix.
• Each node needs a portion of row k and a portion of column k.
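The update rule above is the whole algorithm; a sequential sketch makes the row-k/column-k dependency visible. The example graph is an assumption for illustration.

```python
# Sequential Floyd-Warshall over an adjacency matrix D (INF = no edge).
# In the SCMP version each node owns a sub-block of D and must fetch its
# slice of row k and column k at every iteration of the outer loop.
INF = float("inf")

def floyd_warshall(D):
    n = len(D)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:   # relax path i -> k -> j
                    D[i][j] = D[i][k] + D[k][j]
    return D

D = [[0, 3, INF],
     [INF, 0, 1],
     [2, INF, 0]]
print(floyd_warshall(D))  # [[0, 3, 4], [3, 0, 1], [2, 5, 0]]
```

Only row k and column k are read during iteration k, which is why each node needs just those two slices rather than the whole matrix.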
The new system provides marginal improvement over the original under the transitive closure stressmark.
The LU factorization stressmark is used by linear system solvers.
• Factors a matrix into a lower triangular matrix and an upper triangular matrix.
• The matrix is divided into blocks.
• The pivot block is factored.
• The pivot column and row blocks are divided by the pivot.
• The inner active matrix blocks are modified by the pivot row and column blocks.
(Figure: pivot, pivot row, pivot column, and inner active matrix blocks.)
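The block steps above mirror the scalar right-looking LU factorization, sketched below without pivoting (an assumption for simplicity): step k factors the pivot, scales the pivot column, and updates the trailing (inner active) matrix.

```python
# Unblocked right-looking LU (no pivoting), L and U stored in place:
# each step k mirrors the slide's block steps.
def lu_in_place(A):
    n = len(A)
    for k in range(n):                         # "pivot block is factored"
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                 # pivot column divided by the pivot
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]   # inner active matrix update
    return A  # unit-lower L below the diagonal, U on and above it

A = [[4.0, 3.0],
     [6.0, 3.0]]
print(lu_in_place(A))  # [[4.0, 3.0], [1.5, -1.5]]  ->  L = [[1,0],[1.5,1]], U = [[4,3],[0,-1.5]]
```

In the blocked version each scalar operation becomes a block operation, so the pivot row and column blocks must be communicated to every node holding inner active blocks.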
The new system outperforms the original under the LU factorization stressmark.
Send-and-receive messaging for SCMP is worthwhile.
• Fixes problems with the original SCMP messaging system: global buffer pointers, races between data and SYNC messages, and request-thread storms.
• The programming model is more familiar.
• Performance is better.
Questions?