Generalized Data Transfers At Memory Bandwidth
Peter A. Dinda, David R. O’Hallaron
Carnegie Mellon University
http://www.cs.cmu.edu/~pdinda
http://www.cs.cmu.edu/~droh
Generalized Data Transfers
[Figure: data items A, B, C in sending node memory; locations D, E, F in receiving node memory.]
Address Relations
[Figure: sending node memory holds A, B, C; receiving node memory holds D, E, F.]
Example: {(A,F),(B,D),(C,E)}
R = {(x,y) | data item at address x on sender is copied to address y on receiver}
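Send/Recv Implementation
[Figure: items A, B, C in sending node memory are moved to D, E, F in receiving node memory according to {(A,F),(B,D),(C,E)}.]
• Message assembly on the sending node
• Data transfer of the message contents
• Message disassembly on the receiving node
(also put and get communication models)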
Storing Address Relations
Compute Address Relation (“Inspector”), done once:
    while not done
        compute_address_pair(x,y)
        store_address_pair(x,y)
    end while
Assemble Message (“Executor”), repeated many times:
    while not done
        get_address_pair(x,y)
        buffer[i++]=data[x]
    end while
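The inspector/executor split above can be sketched in C. This is a minimal illustration, not the authors' library: the type `addr_pair` and the functions `inspect_transpose`, `execute_assemble`, and `execute_disassemble` are hypothetical names, and the relation is assumed to be the one for B=TRANSPOSE(A) on a row-major n-by-n array held as element indices.

```c
#include <assert.h>
#include <stddef.h>

/* One stored pair of the address relation (hypothetical layout):
   x is an element index on the sender, y on the receiver. */
typedef struct { size_t x, y; } addr_pair;

/* Inspector: run once. Stores the pairs for B = TRANSPOSE(A) on a
   row-major n-by-n array and returns how many pairs were stored. */
size_t inspect_transpose(size_t n, addr_pair *rel)
{
    size_t count = 0;
    assert(rel != NULL);
    for (size_t r = 0; r < n; r++)
        for (size_t c = 0; c < n; c++) {
            rel[count].x = r * n + c;   /* A(r,c) on the sender   */
            rel[count].y = c * n + r;   /* B(c,r) on the receiver */
            count++;
        }
    return count;
}

/* Executor, sender side: run every iteration. Gathers the data into
   a contiguous message buffer in relation order. */
void execute_assemble(const addr_pair *rel, size_t count,
                      const double *data, double *buffer)
{
    for (size_t i = 0; i < count; i++)
        buffer[i] = data[rel[i].x];
}

/* Executor, receiver side: scatters the message into place. */
void execute_disassemble(const addr_pair *rel, size_t count,
                         const double *buffer, double *data)
{
    for (size_t i = 0; i < count; i++)
        data[rel[i].y] = buffer[i];
}
```

The cost of `inspect_transpose` is paid once, while the two executor loops contain only loads, stores, and index bookkeeping.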
Inspector/Executor [Saltz, et al]
    do i=1,1000
        call Work()
        call COPY()
        call Work()
    enddo
[Figure: timelines comparing in-line computation with inspector/executor over iterations i=1, 2, 3, ... With inspector/executor, the inspector runs once at i=1 and the executor runs in every iteration.]
Context: Array Assignments
    dim A(N,N),B(N,N)
    do i=1,1000
        call Work(A)
        B=A
        call Work(B)
    end
[Figure: abstraction of the assignment from Array A to Array B.]
• We concentrate on B=A and B=TRANSPOSE(A)
• More general forms exist
Distributed Arrays
Regular block-cyclic distributions as in High Performance Fortran (HPF)
• (*,CYCLIC)
• (*,BLOCK)
• (*,CYCLIC(k))
[Figure: for each distribution, the elements processor 0 owns and the local array on processor 0.]
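As a rough illustration of how these distributions assign elements along the distributed dimension, the sketch below follows the usual HPF definitions: CYCLIC(k) deals blocks of k elements round-robin, CYCLIC is CYCLIC(1), and BLOCK uses one block of ceil(n/P) elements per processor. The function names are hypothetical and indices are 0-based.

```c
#include <assert.h>
#include <stddef.h>

/* Processor that owns global index i under CYCLIC(k) on nprocs processors. */
size_t cyclic_owner(size_t i, size_t k, size_t nprocs)
{
    assert(k > 0 && nprocs > 0);
    return (i / k) % nprocs;
}

/* Position of global index i inside its owner's local array. */
size_t cyclic_local(size_t i, size_t k, size_t nprocs)
{
    assert(k > 0 && nprocs > 0);
    return (i / (k * nprocs)) * k + i % k;
}

/* BLOCK over n elements behaves like CYCLIC(b) with b = ceil(n / nprocs). */
size_t block_size(size_t n, size_t nprocs)
{
    assert(nprocs > 0);
    return (n + nprocs - 1) / nprocs;
}
```

For example, under CYCLIC(2) on four processors, processor 0 owns global indices 0, 1, 8, 9, ..., so global index 9 lands at local position 3.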
Representative Assignments
[Figure: distributions shown, in order: (BLOCK,*) (CYCLIC,*) (BLOCK,*) (*,BLOCK) (CYCLIC,*) (BLOCK,*) (*,CYCLIC) (CYCLIC,*). Labels: Data, Transpose.]
Representing Address Relations
• General Purpose
• Space Efficiency
• Hardware Limited Performance
• In-line expansion
AAPAIR: Simple Representation
[Figure: sending node memory holds A, B, C; receiving node memory holds D, E, F.]
{(A,F),(B,D),(C,E)}
Simple sequence of pointer pairs: (A,F), (B,D), (C,E)
• PROBLEM: Space Efficiency
• PROBLEM: Performance
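A minimal C sketch of copying through a pointer-pair sequence follows. It assumes, for illustration only, that both sides live in one address space so the copy can be done directly; the names `aapair` and `aapair_copy` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One pointer pair of the relation: copy *from to *to. */
typedef struct { const double *from; double *to; } aapair;

/* Walk the sequence and copy one data item per pair. */
void aapair_copy(const aapair *rel, size_t count)
{
    assert(rel != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        *rel[i].to = *rel[i].from;
}
```

Every data item costs two stored pointers, and the loop loads both of them before it can move the item, which is where the space and performance problems named on the slide come from.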
AABLK: Run-length Encoding
{(A,F),(A+1,F+1), (B,D),(B+1,D+1), (C,E),(C+1,E+1)}
Sequence of pointer, pointer, length triples: (A,F,2), (B,D,2), (C,E,2)
• PROBLEM: Strided Access
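The run-length idea can be sketched in C as below. As before, a single address space is assumed for illustration, the names `aablk` and `aablk_copy` are hypothetical, and each run is assumed not to overlap its destination.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One (pointer, pointer, length) triple: len contiguous items. */
typedef struct { const double *from; double *to; size_t len; } aablk;

/* Copy each contiguous run with a single block move. */
void aablk_copy(const aablk *rel, size_t count)
{
    assert(rel != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        memcpy(rel[i].to, rel[i].from, rel[i].len * sizeof(double));
}
```

When accesses are strided rather than contiguous, every run has length 1 and this encoding degenerates to one triple per data item.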
DMRLE: Handling Strides
{(A,F),(B,E),(C,D)}
B-A = C-B = g
E-F = D-E = h
Sequence of offset, offset, length triples: (A,F,1), (g,h,2)
• PROBLEM: Repeated Strides
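One way to decode such triples is sketched below in C. It rests on an assumed reading of the slide: each triple (ds, dr, len) means "advance the sender position by ds and the receiver position by dr, copy one item, and do this len times", with both positions starting at 0 and measured in elements. The names `dmrle` and `dmrle_copy` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* (sender offset, receiver offset, length), offsets in elements and
   relative to the previously used address (assumed interpretation). */
typedef struct { ptrdiff_t ds, dr; size_t len; } dmrle;

void dmrle_copy(const dmrle *rel, size_t count,
                const double *src, double *dst)
{
    ptrdiff_t s = 0, r = 0;   /* current sender / receiver positions */
    assert(rel != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        for (size_t j = 0; j < rel[i].len; j++) {
            s += rel[i].ds;
            r += rel[i].dr;
            dst[r] = src[s];
        }
}
```

Under this reading, a stride of g on the sender and h on the receiver costs one triple no matter how many items it covers; negative offsets handle a receiver side that walks downward, as in E-F = D-E = h.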
DMRLEC: Repeated Strides
{(A,F),(B,E),(C,D), (A’,F’),(B’,E’),(C’,D’)}
B-A = C-B = B’-A’ = C’-B’ = g
E-F = D-E = E’-F’ = D’-E’ = h
A’-C = u and F’-D = v
Sequence of indices into table of offset, offset, length triples
Table: 0: (A,F,1)   1: (g,h,2)   2: (u,v,1)
Index sequence: 0 1 2 1
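The table-plus-indices scheme can be sketched in C as follows, under the same assumed reading of a triple as for DMRLE (advance both positions by the offsets, copy one item, repeat length times). The names `dmrle_entry` and `dmrlec_copy` and the one-byte index type are illustrative choices, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Table entry: (sender offset, receiver offset, length), in elements. */
typedef struct { ptrdiff_t ds, dr; size_t len; } dmrle_entry;

/* seq[] holds indices into table[]; a repeated stride pattern is
   stored once in the table and referenced many times. */
void dmrlec_copy(const dmrle_entry *table,
                 const unsigned char *seq, size_t nseq,
                 const double *src, double *dst)
{
    ptrdiff_t s = 0, r = 0;   /* current sender / receiver positions */
    assert(table != NULL && (seq != NULL || nseq == 0));
    for (size_t i = 0; i < nseq; i++) {
        const dmrle_entry *t = &table[seq[i]];
        for (size_t j = 0; j < t->len; j++) {
            s += t->ds;
            r += t->dr;
            dst[r] = src[s];
        }
    }
}
```

With a three-entry table and the index sequence 0 1 2 1, the second group of three items reuses entry 1 instead of storing the (g,h) stride again.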
Copying & Superscalar Plateau
[Figure: instructions issued over time for a copy loop. Loads and stores alternate, with stalls between them that leave free issue slots; the figure labels the dimensions of the free-slot region n and p.]
Plateau = np = 2*3 = 6
Plateau: maximum number of non load/store instructions before copy bandwidth suffers
Measurement Details
• Portable library written in C
• Four representative assignments
• 512x512, 1Kx1K, 2Kx2K arrays of doubles distributed on four processors
• Six machines
• Assembly and disassembly rates
Measurement Testcases
[Figure: distributions shown, in order: (BLOCK,*) (CYCLIC,*) (BLOCK,*) (*,BLOCK) (CYCLIC,*) (BLOCK,*) (*,CYCLIC) (CYCLIC,*). Labels: Data, Transpose.]
Conclusions
• Exploit “Superscalar Plateau” using compact address relation encodings
• Cheap enough even for scalar machines
• Generalized data transfer with hardware-limited throughput
• Many possible applications
Copying with Address Relations
[Figure: block diagram of a Copy Engine and an Address Relation Decoder. Labels: Data Items (in and out of the Copy Engine), Sender Data Addresses, Receiver Data Addresses, Address Relation Addresses, Address Relation Data.]
A Simple Copy Engine
[Figure: two Copy Engines, each paired with a Decoder, connected through the Comm. System. Labels: Data, Sender Data Adx, Receiver Data Adx, Address Relation Addresses, Address Relation Data.]