180 likes | 212 Views
Remote Store Programming. Henry Hoffmann David Wentzlaff Anant Agarwal High Performance Embedded Computing Workshop September 2009. Remote Store Programming Outline. Introduction/Motivation Key Features of RSP Usability Performance Evaluation Methodology Results Conclusion.
E N D
Remote Store Programming Henry Hoffmann David Wentzlaff Anant Agarwal High Performance Embedded Computing Workshop September 2009
Remote Store Programming Outline • Introduction/Motivation • Key Features of RSP • Usability • Performance • Evaluation • Methodology • Results • Conclusion
Multicore requires innovation to balance usability and performance • Parallel programming is becoming ubiquitous • Parallel programming is no longer the domain of select experts • Balancing ease-of-use and performance is more important than ever Cavium – 16 cores Tilera – 64 cores Intel – 4 cores Sun – 8 cores IBM – 9 cores Raw – 16 cores
Existing programming models do no combine usability and performance Threads with cache coherence (CC) Direct Memory Access (DMA) • Familiar loads and stores • Fine-grained comm. is easy Globally Shared Cache data Core 0 Core 1 Core 0 Core 1 Local Cache Local Cache Local Memory Local Memory data data data data store load store load RF RF RF DMA RF DMA Usability • Requires additional software API • Hard to schedule DMA transactions Performance • Programmer has complete control over locality • No control over locality Remote Store Programming (RSP) can combine usability with performance
Globally Shared Cache data Core 0 Core 1 Core 0 Core 1 Local Cache Local Cache Local Memory Local Memory data data data data RSP Threads DMA store load store load RF RF RF DMA RF DMA Core 0 Core 1 Local Cache Local Cache 2 data store load 1 RF RF RSP combines the usability of Threads with the performance of DMA Implications 1 • Familiar loads and stores • Fine-grained • One-sided Can develop high performance programs in a shorter amount of time Usability 2 Always access physically close memory and minimize load latency Performance • Software controls locality
Remote Store Programming Outline • Introduction/Motivation • Key Features of RSP • Usability • Performance • Evaluation • Methodology • Results • Conclusion
Memory 0 private Memory 1 private Remotely writable buffer Process 0 buffer = map_remote(4,1); *buffer = 42; barrier(); Process 1 buffer = remote_write_alloc(4,0); barrier(); print(buffer); Core 0 Core 1 1 RSP captures the usability of threads and shared memory Usability 2 Performance • Process Model • Each process has private memory by default • A process can grant write access to remote processes • Communication • Processes communicate by storing to remotely writable memory • Synchronization • Supports test-and-set, compare-and-swap, etc. • We assumer higher level primitives like barrier
Core 0 Core 1 Local Cache Local Cache buffer network store load RF RF Locality 1 RSP emphasizes locality for performance on large scale multicores Usability 2 Performance • Cores access close memories • Load latency is minimized Core 0 is a producer: Data transferred from Core 0’s registers to Core 1’s cache Core 1 is a consumer: All loads guaranteed to be local and minimum latency 1 2 3 Fine Grain Comm. • Complete overlap of communication and computation 4
1 Usability 2 Performance Globally Shared Cache data Core 0 Core 1 Core 0 Core 1 Local Cache Local Cache Local Memory Local Memory data data data data store load store load RF RF RF DMA RF DMA 2D FFT example illustrates performance and usability Threads DMA RSP Core 0 Core 1 Matrix A, C; shared Matrix B; pid = my_id(); init(A); row_fft(B,A, pid); barrier(); col_fft(C,B,pid); Local Cache Local Cache data store load RF RF Matrix A, B, B2, C; pid = my_id(); init(A); row_fft(B,A, pid); DMA_move(B2,B); col_fft(C,B2,pid); Matrix A, C; Matrix B = alloc_rsp(); pid = my_id(); init(A); row_fft(B,A, pid); barrier(); col_fft(C,B,pid); Easiest to write Fine-grain No locality Hard to write Coarse-grain High locality Easy to write Fine-grain High locality For more detail see: Hoffmann, Wentzlaff, Agarwal. Remote Store Programming: Mechanisms and Performance. Technical Report MIT-CSAIL-TR-2009-017. May, 2009
Core 0 Core 1 Local Cache Local Cache buffer network store load RF RF RSP requires incremental hardware support • RSP requires incremental additional hardware • In processor supporting cache coherent shared memory • Additional memory allocation mode for remotely writable memory • Do not update local cache when writing to remote memory • In processor without cache coherent shared memory • Memory allocation for remotely writable memory • Write misses on remote cores forward miss to allocating core
Remote Store Programming Outline • Introduction/Motivation • Key Differentiators of RSP • Usability • Performance • Evaluation • Methodology • Results • Conclusion
RSP, Threads, and DMA are compared on the TILEPro64 processor Emulate RSP on TILEPro64 Implement embedded benchmarks with RSP Compare performance to cache-coherence and DMA • TILEPro64 supports Cache-coherence and DMA • Additional memory allocation modes let us emulate RSP • Compare speedup and load latency
64 64 64 64 64 64 64 64 80.6 Error Diff. 2D FFT Bitonic sort Convolution RSP RSP RSP RSP RSP RSP RSP RSP 56 56 56 56 56 56 56 56 CC CC CC CC CC CC CC CC 48 48 48 48 48 48 48 48 40 40 40 40 40 40 40 40 32 32 32 32 32 32 32 32 24 24 24 24 24 24 24 24 Image Hist. Matrix Trans. H.264 Matrix Mul. 16 16 16 16 16 16 16 16 8 8 8 8 8 8 8 8 0 0 0 0 0 0 0 0 2 4 8 16 32 40 2 2 4 4 8 8 16 16 32 32 64 2 2 4 4 8 8 16 16 32 32 64 2 4 8 16 32 64 64 64 2 4 8 16 32 64 2 4 8 16 32 64 Speedups of RSP and cache-coherent benchmarks Speedup Speedup Cores Cores Cores Cores
RSP is outperforms threading with cache coherence for large numbers of cores Speedup of RSP versus Cache Coherence for selected benchmarks (higher is better) • For large numbers of cores, RSP provides significant benefit • 5 RSP apps are 1.5x faster than CC for 64 cores • RSP FFT is > 3x faster than CC for 64 cores Speed of RSP/Speed of CC Cores
RSP’s emphasis on locality results in low load latency Load latency of selected benchmarks on RSP versus Cache Coherence (lower is better) Average load latency of RSP/CC Cores
3.5 Cache Coherence 3 DMA Cache Coherence DMA RSP RSP 2.5 2 Performance Normalized to Shared Memory 1.5 1 0.5 0 2 4 8 16 32 40 Cores Comparison of RSP, Threading, and DMA for two applications H.264 Encode 2D FFT • RSP is faster than Threading and DMA due to fine emphasis on locality and fine-grain communication 3.5 3 2.5 2 Performance Relative to Shared Memory 1.5 1 0.5 0 2 4 8 16 32 64 Cores
Remote Store Programming Outline • Introduction/Motivation • Key Differentiators of RSP • Usability • Performance • Hardware Requirements • Evaluation • Methodology • Results • Conclusion
Conclusion • Talk presents Remote Store Programming • A programming model for multicore • Uses familiar communication mechanisms • Achieves high-performance through locality and fine-grain communication • Requires incremental hardware support • Conclusion: • Threads and shared memory good performance and easy to use • For large numbers of cores RSP can outperform threading because of greater locality • RSP is easier to use than DMA and slightly harder than Threads • Anticipated use for RSP: • Can supplement threads and cache-coherent shared memory on multicore • Most code uses standard threading and shared memory techniques • Performance critical sections of code or applications use RSP for additional performance • Gentle Slope to programming multicore