
Send and Receive Based Message-Passing for SCMP

Send and Receive Based Message-Passing for SCMP. Charles W. Lewis, Jr. Thesis Defense, Virginia Tech, April 28th, 2004.


Presentation Transcript


  1. Send and Receive Based Message-Passing for SCMP Charles W. Lewis, Jr. Thesis Defense Virginia Tech April 28th, 2004

  2. This presentation introduces the SCMP architecture, discusses problems with the current SCMP message-passing system, and focuses on the design and performance of a new SCMP message-passing system. 1. Overview of SCMP 2. Original Message-Passing System 3. New Message-Passing System 4. Performance Comparisons

  3. Problems with current design trends motivate the SCMP concept. • As transistor sizes shrink, so do communication wires. This leads to higher cross-chip communication latencies. • ILP faces diminishing returns. • Large and complex uni-processors require extensive amounts of design and verification.

  4. SCMP provides PLP through replication. • Up to 64 identical nodes on-chip • Replicated nodes reduce complexity • 2-D network eliminates cross-chip wires (Figure: SCMP network with 64 nodes.)

  5. SCMP provides TLP through multi-thread hardware support. • Up to 16 threads • Round-robin thread scheduling by hardware • On every node: 4-stage RISC pipeline, 8MB memory, networking hardware (Figure: SCMP node.)

  6. The original messaging system has two message types: thread messages and data messages. Because they contain handling information, these message formats borrow from the Active Messages message-passing system.

  7. The network uses wormhole and dimension-order routing. • Messages move along the X axis, then the Y axis. • Every router multiplexes virtual channel buffers over physical channels. • Head flits claim virtual channel resources as they travel. • Tail flits release virtual channel resources as they travel. • If one message blocks, other messages may still continue as long as enough virtual channels are free.
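
The X-then-Y rule can be illustrated with a minimal sketch. It assumes an 8x8 mesh (64 nodes) with row-major node numbering; the function name `xy_route` and the numbering scheme are illustrative assumptions, not taken from the slides.

```python
def xy_route(src, dst, width=8):
    """Return the node ids visited from src to dst, X axis first, then Y.

    Assumes row-major numbering on a width x width mesh (hypothetical).
    """
    x, y = src % width, src // width
    dst_x, dst_y = dst % width, dst // width
    path = [src]
    # Phase 1: correct the X coordinate completely.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append(y * width + x)
    # Phase 2: only then move along the Y axis.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append(y * width + x)
    return path


print(xy_route(0, 18))  # [0, 1, 2, 10, 18]
```

Because every message finishes its X hops before taking any Y hop, no message ever turns from Y back to X, which is what removes cyclic channel dependencies in this routing scheme.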

  8. Dimension-order routing is deadlock free as long as messages eventually drain. • Even with VCs, the network can still deadlock if messages don’t drain. • If all contexts are consumed, thread messages block at the NIU. • Threads may not release until a data message is received. • Data messages must not be stopped by congested thread messages. • Data messages must have a separate path through the network. (Figure: router with separate thread VCs and data VCs between its west and east sides.)

  9. The NIU bears most of the messaging load. (Figure: NIU block diagram with labels Thread Buffer, Data Buffer, Context 1, Context 2, Injection Channel, To Router, From Router, Ejection Channel, Receive Buffer, and Memory.)

  10. Messages are built through assembly instructions.

  11. The thread library facilitates thread messages.

  12. The send library facilitates data messages.

  13. The original message-passing system uses requests and replies. Node A requires data held by Node B. • Node A creates a thread on Node B • New thread on Node B sends data to Node A • New thread on Node B sends SYNC message when done (Figure: A sends Thread to B; B sends Data, then Sync, to A.)

  14. Dynamic memory is a problem. • Request thread on node B must know: • Source Address • Source Stride • Destination Address • Destination Stride • Number of Values to Send • How can Node A know the source address and stride if Node B allocates the buffer dynamically? • Program must contain global pointers

  15. In-order delivery of messages is a problem. • SCMP network does not guarantee in-order delivery of messages • SYNC message may reach Node A before data message • Node A will read bad values from memory (Figure: B sends Data and Sync to A.)

  16. Request threads and finite thread contexts are a problem. • If a node holds highly demanded data, request threads may consume all of its contexts • Additional thread messages will block in the network (Figure: NIU whose contexts are all occupied, with thread messages waiting.)

  17. Send-and-Receive message-passing eliminates all of these problems. • A thread must execute a receive before data will be accepted • Don’t need request messages • Messages are identified abstractly • Don’t need global pointers • Completion notification occurs locally • Don’t need SYNC messages

  18. Rendezvous mode uses an RTS/CTS handshake. Node B holds data required by Node A. • Node B sends Node A an RTS message when send is executed • After receive is executed, Node A sends Node B a CTS message • Node B sends data after receiving the CTS (Figure: B sends RTS to A; A sends CTS to B; B sends Data to A.)
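
The handshake ordering can be traced with a small sketch. This is an illustrative model only: the function `rendezvous_trace` and the string labels are hypothetical, and real message timing in the network is ignored. It shows that data moves only once both the RTS has arrived and the receive has been executed, regardless of which happens first.

```python
def rendezvous_trace(events):
    """Trace messages for one transfer from Node B (sender) to Node A.

    events: the order in which "send" (on B) and "receive" (on A) execute.
    """
    trace = []
    rts_arrived = False      # Node A has seen B's request to send
    receive_posted = False   # Node A has executed its receive
    cts_sent = False
    for event in events:
        if event == "send":
            trace.append("B->A RTS")
            rts_arrived = True
        elif event == "receive":
            receive_posted = True
        else:
            raise ValueError(f"unknown event: {event}")
        # Node A grants permission only when both conditions hold.
        if rts_arrived and receive_posted and not cts_sent:
            trace.append("A->B CTS")
            trace.append("B->A Data")  # B sends data after the CTS
            cts_sent = True
    return trace


print(rendezvous_trace(["send", "receive"]))
print(rendezvous_trace(["receive", "send"]))
```

Both orderings yield the same RTS, CTS, Data sequence; a send with no matching receive stops after the RTS.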

  19. Ready mode forgoes the handshake to reduce message latency. Node B holds data required by Node A. • Node B sends data when send is executed • User must ensure that receive has executed on Node A (Figure: B sends Data to A.)

  20. The implementation centers around two tables. (Figures: Send Table Entry; Receive Table Entry.)

  21. Send Table Entries may be in 4 states, and Receive Table Entries may be in 5 states. (Figures: Send Table Entry States; Receive Table Entry States.)

  22. The new messaging system has four message types: data, thread, CTS, and RTS messages.

  23. The NIU now contains a data queue for every context. (Figure: NIU block diagram with labels Thread Buffer, Data Buffer, Context 1, Context 2, RTS Buffer, CTS Buffer, Injection Channel, To Router, From Router, Ejection Channel, Receive Buffer, and Memory.)

  24. Only five new instructions and one modified instruction are needed.

  25. The thread library remains nearly the same.

  26. The new send library is more familiar.

  27. The receive library is all new.

  28. Rendezvous Mode Operation at the Sender. (Flowchart. Triggers: sendh, CTS Message Arrives, Head Flit @ Queue Head. Tests: No Entry?, Queue Waiting, Tail Flit Not Sent. Actions: SUSPEND, ERROR, Queue Head And Tag, Send RTS, Send Flit. Entry transitions: Create: Entry->In Use, Entry->In Progress, Entry->Complete.)

  29. Rendezvous Mode Operation at the Receiver. (Flowchart. Triggers: RTS Message Arrives, Data Message Arrives, str. Tests: No Entry, RTS Rcvd, In Use, In Progress, Tail Flit Not Stored. Actions: DISCARD, SUSPEND, Record RTS, Record str, Send CTS, Block RTS, Block Data, Store Flit. Entry transitions: Entry->RTS Rcvd, Entry->In Use, Entry->In Progress, Entry->Complete.)
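
A rough sketch of the receiver side as a state machine follows. The five state names come from the flowchart labels on this slide, but the branch structure of the flowchart did not survive in the transcript, so the transitions below are one plausible reading and should be treated as an assumption. The function names are hypothetical, and `str` is assumed here to be the event that posts the receive.

```python
from enum import Enum


class RecvState(Enum):
    NO_ENTRY = "No Entry"
    RTS_RCVD = "RTS Rcvd"
    IN_USE = "In Use"
    IN_PROGRESS = "In Progress"
    COMPLETE = "Complete"


def on_str(state):
    """Receive posted (assumed meaning of the slide's `str` trigger)."""
    if state is RecvState.RTS_RCVD:
        return RecvState.IN_PROGRESS, "send CTS"    # sender already asked
    if state is RecvState.NO_ENTRY:
        return RecvState.IN_USE, "record str"       # wait for the RTS
    return state, "suspend"


def on_rts(state):
    """An RTS message arrives from the sender."""
    if state is RecvState.NO_ENTRY:
        return RecvState.RTS_RCVD, "record RTS"     # receive not posted yet
    if state is RecvState.IN_USE:
        return RecvState.IN_PROGRESS, "send CTS"    # receive already posted
    return state, "block RTS"


def on_data_flit(state, is_tail):
    """A data flit arrives; the tail flit completes the entry."""
    if state is RecvState.NO_ENTRY:
        return state, "discard"
    if state is RecvState.IN_PROGRESS:
        next_state = RecvState.COMPLETE if is_tail else state
        return next_state, "store flit"
    return state, "block data"


state = RecvState.NO_ENTRY
state, action = on_rts(state)                       # RTS first
state, action = on_str(state)                       # then the receive
state, action = on_data_flit(state, is_tail=True)
print(state.value)  # Complete
```

Under this reading, the CTS is sent by whichever of the two events (RTS arrival or receive) happens second.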

  30. RTS and CTS Messages also need separate VC paths. • RTS messages can block in the network. • For a given RTS message to leave the network, RTS messages ahead of it must be satisfied: a CTS message goes to the source, and a data message comes back. • RTS and CTS messages have their own VC paths. (Figure: router with separate thread, data, RTS, and CTS VCs between its west and east sides.)

  31. Ready Mode Operation at the Sender. (Flowchart. Triggers: sendh, Head Flit @ Queue Head. Tests: No Entry?, Tail Flit Not Sent. Actions: ERROR, SUSPEND, Queue Head And Tag, Send Flit. Entry transitions: Create: Entry->In Use, Entry->In Progress, Entry->Complete.)

  32. Ready Mode Operation at the Receiver. (Flowchart. Triggers: Data Message Arrives, str. Tests: No Entry, In Progress, Tail Flit Not Stored. Actions: DISCARD, SUSPEND, Block Data, Record str, Store Flit. Entry transitions: Entry->In Use, Entry->Complete.)

  33. Stressmark testing was used to verify that performance was not hurt. • DIS Stressmark Suite • Neighborhood Stressmark • Matrix Stressmark • Transitive Closure Stressmark • LU Factorization Stressmark

  34. The neighborhood stressmark measures image texture. • Every node owns a portion of the total rows • Every row owns complete sum and difference histograms • Each node determines, and requests, the pairs for pixels in its rows • Each node fills in sum and difference histogram • Histograms are shared • Each node manages only a portion of each histogram • Only the correct portion is sent to a node
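
As a point of reference, the sum and difference histograms for pixel pairs can be sketched serially on a single node. The slide does not define how pairs are chosen, so the fixed offset `(dx, dy)` and the function name are assumptions for illustration; the distribution of rows and histogram portions across nodes is not modeled.

```python
from collections import Counter


def sum_diff_histograms(image, dx, dy):
    """Build sum and difference histograms over pixel pairs.

    Each pixel (r, c) is paired with the pixel at (r + dy, c + dx);
    pairs that fall outside the image are skipped.
    """
    sums, diffs = Counter(), Counter()
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                sums[a + b] += 1    # sum histogram bin
                diffs[a - b] += 1   # difference histogram bin
    return sums, diffs


sums, diffs = sum_diff_histograms([[1, 2], [3, 4]], dx=1, dy=0)
print(dict(sums), dict(diffs))  # {3: 1, 7: 1} {-1: 2}
```

In the parallel version described on the slide, the partner pixel may live on another node, which is why each node must request pairs for the pixels in its rows.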

  35. Queues with 16 flits perform best.

  36. The new system outperforms the old under the neighborhood stressmark.

  37. Matrix stressmark solves a linear system of equations using the Conjugate Gradient Method. • Additional vectors r and p used for intermediate steps • Every node has: • Rows of A • Elements of b and r • Complete x and p • After each iteration p must be globally redistributed • Share with columns • Share with rows
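
For readers unfamiliar with the method, a minimal serial sketch of Conjugate Gradient is given below, using the slide's names A, b, x, r, and p. It assumes A is symmetric positive definite; the function name, tolerance, and iteration cap are illustrative choices, and the distribution across nodes is not modeled.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A (lists of floats)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)              # residual r = b - A x, with x = 0
    p = list(r)              # search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old < tol:
            break
        # Each row of A needs the complete vector p; in the distributed
        # version this is why p is redistributed after every iteration.
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x


x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
print(x)  # close to [1/11, 7/11]
```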

  38. The new system provides marginal improvement over the original under the matrix stressmark.

  39. The transitive closure stressmark solves the all-pairs shortest-path problem. • Floyd-Warshall Algorithm • Adjacency Matrix • D[i][j] • Iterative Improvements • D[i][j] = min(D[i][j], D[i][k]+D[k][j]) • Each node owns sub-block of adjacency matrix • Each node needs portion of row k • Each node needs portion of column k (Figure: example graph with 16 numbered vertices.)
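
The update rule on this slide is the core of the Floyd-Warshall algorithm. A minimal serial sketch follows; the function name and the use of `float("inf")` for missing edges are illustrative choices, and the sub-block ownership across nodes is not modeled.

```python
INF = float("inf")


def floyd_warshall(D):
    """Return all-pairs shortest distances from an adjacency matrix D."""
    n = len(D)
    D = [row[:] for row in D]  # work on a copy
    for k in range(n):
        # Iteration k reads row k and column k of D, which is why each
        # node in the distributed version needs its portion of both.
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D


graph = [
    [0, 3, INF],
    [INF, 0, 1],
    [2, INF, 0],
]
print(floyd_warshall(graph))  # [[0, 3, 4], [3, 0, 1], [2, 5, 0]]
```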

  40. The new system provides marginal improvement over the original under the transitive closure stressmark.

  41. The LU factorization stressmark is used by linear system solvers. • Factors matrix into a lower triangular matrix and an upper triangular matrix. • Matrix is divided into blocks • Pivot block is factored • Pivot column and row blocks are divided by pivot. • Inner active matrix blocks are modified by the pivot row and column blocks. (Figure: blocked matrix showing the Pivot, Pivot Row, Pivot Column, and Inner Active Matrix.)
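
The same three steps (pivot, scale, update the inner active matrix) appear in the ordinary element-wise factorization, sketched below. This is a simplified scalar Doolittle variant, not the blocked algorithm from the slide: it works on single elements rather than blocks, scales only the pivot column, and assumes no zero pivots (no row exchanges). The function name is hypothetical.

```python
def lu_factor(A):
    """Factor A into unit lower-triangular L and upper-triangular U."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[float(v) for v in row] for row in A]
    for k in range(n):                      # U[k][k] is the pivot
        for i in range(k + 1, n):
            # Pivot column entry divided by the pivot.
            L[i][k] = U[i][k] / U[k][k]
            # Inner active matrix updated by pivot row and column.
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U


L, U = lu_factor([[4, 3], [6, 3]])
print(L)  # [[1.0, 0.0], [1.5, 1.0]]
print(U)  # [[4.0, 3.0], [0.0, -1.5]]
```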

  42. The new system outperforms the original under the LU factorization stressmark.

  43. Send-and-Receive Messaging for SCMP is worthwhile. • Fixes Problems With Original SCMP Messaging System • Global Buffer Pointers • Races between Data and SYNC messages • Request Thread Storms • Programming Model is more familiar • Performance is better Questions?
