
Generalized Data Transfers At Memory Bandwidth

Peter A. Dinda and David R. O’Hallaron, Carnegie Mellon University. http://www.cs.cmu.edu/~pdinda http://www.cs.cmu.edu/~droh


Presentation Transcript


  1. Generalized Data Transfers At Memory Bandwidth. Peter A. Dinda, David R. O’Hallaron. Carnegie Mellon University. http://www.cs.cmu.edu/~pdinda http://www.cs.cmu.edu/~droh

  2. Generalized Data Transfers. [Figure: sending node memory holding A, B, C; receiving node memory holding D, E, F.]

  3. Address Relations. [Figure: sending node memory (A, B, C) and receiving node memory (D, E, F).] Example relation: {(A,F),(B,D),(C,E)}. R = {(x,y) | data item at address x on sender is copied to address y on receiver}

  4. Send/Recv Implementation. [Figure: sending node memory (A, B, C) and receiving node memory (D, E, F) with relation {(A,F),(B,D),(C,E)}; stages shown: Message Assembly, Message Contents, Data Transfer, Message Disassembly.] (also put and get communication models)
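
The assembly and disassembly stages on this slide can be sketched in C as a gather on the sender and a scatter on the receiver. This is only an illustration under stated assumptions: the `addr_pair` type, the function names, and the use of element indices in place of raw addresses are hypothetical and are not taken from the authors' library.

```c
#include <stddef.h>

/* Hypothetical pair type: element x on the sender maps to element y
 * on the receiver. Indices stand in for addresses in this sketch. */
typedef struct { size_t x, y; } addr_pair;

/* Message assembly (sender side): gather the x-side elements into a
 * contiguous message buffer, in relation order. */
void assemble(const double *send_mem, const addr_pair *rel, size_t n,
              double *msg)
{
    for (size_t i = 0; i < n; i++)
        msg[i] = send_mem[rel[i].x];
}

/* Message disassembly (receiver side): scatter the message buffer to
 * the y-side elements, in the same relation order. */
void disassemble(double *recv_mem, const addr_pair *rel, size_t n,
                 const double *msg)
{
    for (size_t i = 0; i < n; i++)
        recv_mem[rel[i].y] = msg[i];
}
```

With A, B, C at sender indices 0, 1, 2 and D, E, F at receiver indices 0, 1, 2, the relation {(A,F),(B,D),(C,E)} becomes {(0,2),(1,0),(2,1)}.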

  5. Storing Address Relations
     Compute Address Relation - “Inspector” (Done Once)
         while not done
             compute_address_pair(x,y)
             store_address_pair(x,y)
         end while
     Assemble Message - “Executor” (Repeated Many Times)
         while not done
             get_address_pair(x,y)
             buffer[i++]=data[x]
         end while
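
A minimal C sketch of the inspector/executor split follows, using B=TRANSPOSE(A) on a single n x n row-major array as the example relation. The function names, the `addr_pair` type, and the choice of a local transpose are assumptions made for illustration; the slide's pseudocode does not say how pairs are computed or stored.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pair type: source element index x, destination index y. */
typedef struct { size_t x, y; } addr_pair;

/* Inspector: run once. Computes and stores every address pair for a
 * transpose of an n x n row-major array. Returns NULL on allocation
 * failure; the caller frees the result. */
addr_pair *inspect_transpose(size_t n)
{
    addr_pair *rel = malloc(n * n * sizeof *rel);
    if (rel == NULL)
        return NULL;
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            rel[k].x = i * n + j;   /* element (i,j) of the source      */
            rel[k].y = j * n + i;   /* element (j,i) of the destination */
            k++;
        }
    }
    return rel;
}

/* Executor: run many times. Replays the stored pairs to assemble the
 * message buffer, as in the slide's buffer[i++]=data[x] loop.
 * Returns the number of elements written. */
size_t execute_assemble(const double *data, const addr_pair *rel,
                        size_t count, double *buffer)
{
    size_t i = 0;
    for (size_t k = 0; k < count; k++)
        buffer[i++] = data[rel[k].x];
    return i;
}
```

The executor touches only the x side of each pair; a matching disassembly step on the receiver would use the y side.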

  6. Inspector/Executor [Saltz, et al]. In-line computation: do i=1,1000; call Work(); call COPY(); call Work(); enddo. [Figure: timelines for iterations i=1, 2, 3 comparing In-line Computation with Inspector/Executor; the Inspector appears once, at i=1, and the Executor appears in every iteration.]

  7. Context: Array Assignments. Code: dim A(N,N),B(N,N); do i=1,1000; call Work(A); call Work(B); end. [Figure: assignment B=A from Array A to Array B, labeled "Abstraction".] We concentrate on B=A and B=TRANSPOSE(A). More general forms exist.

  8. Distributed Arrays. Regular block-cyclic distributions as in High Performance Fortran (HPF). Distributions shown: (*,CYCLIC), (*,BLOCK), (*,CYCLIC(k)). [Figure columns: Distribution; Elements Processor 0 Owns; Local Array on Processor 0.]

  9. Representative Assignments (BLOCK,*) (CYCLIC,*) (BLOCK,*) (*,BLOCK) (CYCLIC,*) (BLOCK,*) (*,CYCLIC) (CYCLIC,*) Data Transpose

  10. Representing Address Relations
      • General Purpose
      • Space Efficiency
      • Hardware Limited Performance
      • In-line expansion

  11. AAPAIR: Simple Representation. Simple sequence of pointer pairs: (A,F), (B,D), (C,E), representing {(A,F),(B,D),(C,E)}. [Figure: sending node memory (A, B, C) and receiving node memory (D, E, F).] PROBLEM: Space Efficiency. PROBLEM: Performance.

  12. AABLK: Run-length Encoding. Sequence of pointer, pointer, length triples: (A,F,2), (B,D,2), (C,E,2), representing {(A,F),(A+1,F+1), (B,D),(B+1,D+1), (C,E),(C+1,E+1)}. PROBLEM: Strided Access.
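
One way to read the AABLK triples is as contiguous block copies, sketched below in C. The `aablk_run` type and `aablk_copy` function are hypothetical names, element indices stand in for pointers, and the sketch assumes source and destination memory do not overlap.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical triple: copy len consecutive elements starting at
 * source index src to destination index dst. */
typedef struct { size_t src, dst, len; } aablk_run;

/* Replay a sequence of (pointer, pointer, length) triples. Each run is
 * contiguous on both sides, so one memcpy per triple suffices.
 * Assumes src_mem and dst_mem are distinct, non-overlapping arrays. */
void aablk_copy(double *dst_mem, const double *src_mem,
                const aablk_run *runs, size_t nruns)
{
    for (size_t r = 0; r < nruns; r++)
        memcpy(dst_mem + runs[r].dst, src_mem + runs[r].src,
               runs[r].len * sizeof(double));
}
```

Three triples here cover six address pairs, which is the space saving over AAPAIR; the saving disappears when consecutive source elements are not consecutive at the destination, which is the strided-access problem the slide names.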

  13. DMRLE: Handling Strides. Sequence of offset, offset, length triples: (A,F,1), (g,h,2), representing {(A,F),(B,E),(C,D)}, where B-A = C-B = g and E-F = D-E = h. PROBLEM: Repeated Strides.
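
The slide does not spell out the decoding rule, so the C sketch below assumes one plausible reading: each triple (dx, dy, len) advances the current sender and receiver positions by (dx, dy) and copies one element, len times, starting from position (0, 0). The type and function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical triple: apply the offset pair (dx, dy) len times.
 * Offsets are signed because a stride may run backwards. */
typedef struct { ptrdiff_t dx, dy; size_t len; } dmrle_run;

/* Decode and copy in one pass. Positions start at 0 on both sides,
 * so a first triple (A, F, 1) selects the initial pair (A, F). */
void dmrle_copy(double *dst_mem, const double *src_mem,
                const dmrle_run *runs, size_t nruns)
{
    ptrdiff_t x = 0, y = 0;
    for (size_t r = 0; r < nruns; r++) {
        for (size_t k = 0; k < runs[r].len; k++) {
            x += runs[r].dx;        /* advance on the sender side   */
            y += runs[r].dy;        /* advance on the receiver side */
            dst_mem[y] = src_mem[x];
        }
    }
}
```

With A, B, C at indices 0, 2, 4 (g = 2) and D, E, F at indices 0, 1, 2 (h = -1), the triples (0,2,1), (2,-1,2) generate the pairs (0,2), (2,1), (4,0).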

  14. DMRLEC: Repeated Strides. Sequence of indices into a table of offset, offset, length triples. Table: 0: (A,F,1); 1: (g,h,2); 2: (u,v,1). Index sequence: 0 1 2 1. Represents {(A,F),(B,E),(C,D), (A’,F’),(B’,E’),(C’,D’)}, where B-A = C-B = B’-A’ = C’-B’ = g, E-F = D-E = E’-F’ = D’-E’ = h, A’-C = u and F’-D = v.
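
The table-plus-indices idea can be sketched in C as below. The sketch assumes each table entry (dx, dy, len) advances the current sender and receiver positions by (dx, dy) and copies one element, len times, starting from (0, 0); that decoding rule, along with the type and function names, is an assumption for illustration rather than the authors' definition.

```c
#include <stddef.h>

/* Hypothetical table entry: apply the offset pair (dx, dy) len times. */
typedef struct { ptrdiff_t dx, dy; size_t len; } dmrle_run;

/* Decode a sequence of table indices. A repeated stride pattern is
 * stored once in the table and referenced by index each time it recurs. */
void dmrlec_copy(double *dst_mem, const double *src_mem,
                 const dmrle_run *table,
                 const size_t *codes, size_t ncodes)
{
    ptrdiff_t x = 0, y = 0;
    for (size_t c = 0; c < ncodes; c++) {
        const dmrle_run *run = &table[codes[c]];
        for (size_t k = 0; k < run->len; k++) {
            x += run->dx;
            y += run->dy;
            dst_mem[y] = src_mem[x];
        }
    }
}
```

In the slide's example, the index sequence 0 1 2 1 reuses table entry 1 for both strided groups, so the (g,h,2) triple is stored only once.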

  15. Address Relation Storage Costs

  16. Copying & Superscalar Plateau. [Figure: issue timeline of alternating load and store instructions with stalls, showing free issue slots; dimensions labeled n and p.] Plateau = np = 2*3 = 6: the maximum number of non load/store instructions before copy bandwidth suffers.

  17. Paragon: No Superscalar Plateau

  18. Pentium 90: Clear Plateau

  19. DEC 3K/400a: Complex Plateau

  20. Measurement Details
      • Portable library written in C
      • Four representative assignments
      • 512x512, 1Kx1K, 2Kx2K arrays of doubles distributed on four processors
      • Six machines
      • Assembly and disassembly rates

  21. Measurement Testcases (BLOCK,*) (CYCLIC,*) (BLOCK,*) (*,BLOCK) (CYCLIC,*) (BLOCK,*) (*,CYCLIC) (CYCLIC,*) Data Transpose

  22. Performance: DEC 3K/400a

  23. Performance: IBM 250 (PPC601)

  24. Performance: IBM SP2 (PWR2)

  25. Performance: Paragon

  26. Performance: Pentium 90

  27. Performance: Pentium 133

  28. Conclusions
      • Exploit “Superscalar Plateau” using compact address relation encodings
      • Cheap enough even for scalar machines
      • Generalized data transfer with hardware-limited throughput
      • Many possible applications

  29. Copying with Address Relations. [Figure: a Copy Engine moves Data Items in and out, driven by Sender Data Addresses and Receiver Data Addresses from an Address Relation Decoder; the decoder reads the Address Relation via Address Relation Addresses and Address Relation Data.]

  30. A Simple Copy Engine. [Figure: two Copy Engines connected through the Comm. System, each moving Data; one is driven by Sender Data Adx and the other by Receiver Data Adx, each supplied by a Decoder that reads Address Relation Addresses and Address Relation Data.]
