An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems
Isaac Gelado, Javier Cabezas, John Stone, Sanjay Patel, Nacho Navarro and Wen-mei Hwu
ASPLOS 2010 -- Pittsburgh
1. Introduction: Heterogeneous Computing
• Heterogeneous parallel systems:
  • CPU: sequential, control-intensive code
  • Accelerators: massively data-parallel code
• Existing programming models are DMA-based:
  • Explicit memory copies
  • Programmer-managed memory coherence
(Figure: CPU and accelerator exchanging IN/OUT buffers through explicit copies)
Outline
• Introduction
• Motivation
• ADSM: Asymmetric Distributed Shared Memory
• GMAC: Global Memory for ACcelerators
• Experimental Results
• Conclusions
2.1 Motivation: Reference System
• GPU-like accelerator with device memory:
  • High bandwidth
  • Weak consistency
  • Large page size
• CPU (N cores) with system memory:
  • Low latency
  • Strong consistency
  • Small page size
• Accelerator and CPU connected through a PCIe bus
2.2 Motivation: Memory Requirements
• Accelerator memory keeps growing (e.g. 6 GB on NVIDIA Fermi, 16 GB on PowerXCell 8i)
• High memory bandwidth requirements
• Non fully-coherent systems:
  • Long-latency coherence traffic
  • Different coherence protocols
2.3 Motivation: DMA-Based Programming
• CUDA sample code
• Duplicated pointers (foo on the CPU, dev_foo on the GPU)
• Explicit coherence management

void compute(FILE *file, int size) {
    float *foo, *dev_foo;
    foo = malloc(size);                        // host allocation
    fread(foo, size, 1, file);
    cudaMalloc(&dev_foo, size);                // duplicated device allocation
    cudaMemcpy(dev_foo, foo, size, cudaMemcpyHostToDevice);   // explicit coherence
    kernel<<<Dg, Db>>>(dev_foo, size);
    cudaMemcpy(foo, dev_foo, size, cudaMemcpyDeviceToHost);   // explicit coherence
    cpuComputation(foo);
    cudaFree(dev_foo);
    free(foo);
}
3.1 ADSM: Unified Virtual Address Space
• Unified virtual shared address space:
  • CPU: accesses both system and accelerator memory
  • Accelerator: accesses only its own memory
• Under ADSM, both use the same virtual address when referencing a shared data object
(Figure: shared object foo located in device memory, visible at the same virtual address from CPU and accelerator)
3.2 ADSM: Simplified Code

void compute(FILE *file, int size) {
    float *foo;
    foo = adsmMalloc(size);            // single pointer, data assignment
    fread(foo, size, 1, file);         // legacy support, peer DMA
    kernel<<<Dg, Db>>>(foo, size);
    cpuComputation(foo);
    adsmFree(foo);
}

• Simpler CPU code than in DMA-based programming models
• Hardware-independent code
3.3 ADSM: Memory Distribution
• Asymmetric Distributed Shared Memory principles:
  • The CPU accesses objects in accelerator memory, but not vice versa
  • All coherence actions are performed by the CPU
• Thrashing is unlikely to happen:
  • Synchronization variables: interrupt-based mechanisms and dedicated hardware
  • False sharing: data-object sharing granularity
3.4 ADSM: Consistency and Coherence
• Release consistency:
  • Consistency is only relevant from the CPU perspective
  • Implicit release/acquire at accelerator call/return (sketched below)
• Memory coherence:
  • Data-ownership information enables eager data transfers
  • The CPU maintains coherence
(Figure: shared object foo transferred to the accelerator on call and back to the CPU on return)
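A minimal sketch of how the implicit release/acquire at accelerator call/return could look in a GMAC-like runtime, closest in spirit to the eager Batch-Update protocol described later. The object table and the names (shared_obj, release_on_call, acquire_on_return) are illustrative assumptions, not the actual GMAC implementation.

#include <cuda_runtime.h>
#include <stddef.h>

/* Illustrative per-object bookkeeping; in GMAC host_ptr and dev_ptr
 * would hold the same virtual address. */
typedef struct {
    void  *host_ptr;
    void  *dev_ptr;
    size_t size;
    int    cpu_dirty;   /* modified by the CPU since the last transfer */
} shared_obj;

static shared_obj objs[64];
static int n_objs;

/* Implicit "release" at accelerator call: push CPU-modified objects
 * so the kernel sees up-to-date data. */
static void release_on_call(void) {
    for (int i = 0; i < n_objs; i++)
        if (objs[i].cpu_dirty) {
            cudaMemcpy(objs[i].dev_ptr, objs[i].host_ptr, objs[i].size,
                       cudaMemcpyHostToDevice);
            objs[i].cpu_dirty = 0;
        }
}

/* Implicit "acquire" at accelerator return: bring results back before
 * the CPU reads them (an eager variant; lazy protocols defer this). */
static void acquire_on_return(void) {
    for (int i = 0; i < n_objs; i++)
        cudaMemcpy(objs[i].host_ptr, objs[i].dev_ptr, objs[i].size,
                   cudaMemcpyDeviceToHost);
}

An accelerator call in this model would run release_on_call() before the kernel launch and acquire_on_return() once the kernel has completed, so the application code never issues an explicit copy.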
4. Global Memory for Accelerators
• ADSM implementation:
  • User-level shared library
  • GNU/Linux systems
  • NVIDIA CUDA GPUs
4.1 GMAC: Overall Design
• Layered design (interface sketched below):
  • CUDA-like front-end
  • Memory manager (different policies)
  • Kernel scheduler (FIFO)
  • Operating-system abstraction layer
  • Accelerator abstraction layer (CUDA)
• Supports multiple memory consistency protocols
• Operating-system- and accelerator-independent code
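As a rough illustration of how the layered design keeps the consistency protocols interchangeable, the sketch below defines a protocol interface in plain C. The struct and its fields are assumptions for illustration, not the actual GMAC source structure.

/* Hypothetical protocol interface: the memory manager is written against
 * these callbacks only, so Batch-, Lazy- and Rolling-Update can be swapped
 * without touching the OS or accelerator abstraction layers below. */
typedef struct coherence_protocol {
    const char *name;
    void (*on_kernel_call)(void);                 /* transfer data before launch   */
    void (*on_kernel_return)(void);               /* invalidate / copy back results */
    void (*on_cpu_access)(void *addr, int write); /* CPU read/write detection hook  */
} coherence_protocol;

/* Selected once at library initialization, e.g. from an environment variable. */
static const coherence_protocol *active_protocol;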
4.2 GMAC: Unified Address Space
• Virtual address space formed by GPU and system physical memories
• The GPU memory address range cannot be selected by software
• GMAC allocates the same virtual address range on both GPU and CPU (sketched below)
• Accelerator virtual memory would ease this process
(Figure: GPU physical address space and system virtual address space sharing the same range)
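A minimal sketch of one way to obtain the same virtual address range on both sides with the CUDA runtime and POSIX mmap, under the assumption that the host range hinted by the device pointer is free; error handling, page alignment and bookkeeping are omitted, and adsm_alloc_sketch is a hypothetical name, not the GMAC API.

#include <cuda_runtime.h>
#include <sys/mman.h>
#include <stddef.h>

void *adsm_alloc_sketch(size_t size) {
    void *dev_ptr = NULL;
    if (cudaMalloc(&dev_ptr, size) != cudaSuccess)   /* GPU range cannot be chosen */
        return NULL;

    /* Ask for the identical range in the host address space (hint only,
     * no MAP_FIXED), so CPU and GPU can share one pointer value. */
    void *host_ptr = mmap(dev_ptr, size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (host_ptr != dev_ptr) {            /* range already in use (see 4.4) */
        if (host_ptr != MAP_FAILED)
            munmap(host_ptr, size);
        cudaFree(dev_ptr);
        return NULL;
    }
    return host_ptr;                      /* same value as dev_ptr */
}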
4.3 GMAC: Coherence Protocols
• Batch-Update: copy all shared objects
• Lazy-Update: copy only modified / needed shared objects
  • Data-object granularity
  • Detects CPU read/write accesses to shared objects (mechanism sketched below)
• Rolling-Update: copy only modified / needed memory
  • Memory-block granularity
  • Fixed maximum number of modified blocks in system memory; data is flushed when the maximum is reached
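Lazy-Update and Rolling-Update need to detect CPU read/write accesses from user space; a common way to do that on GNU/Linux, and a plausible reading of this slide, is page protection plus a SIGSEGV handler. The sketch below only shows the mechanism; the block lookup and state update are left as comments with hypothetical names (find_block, fetch_from_gpu).

#include <signal.h>
#include <sys/mman.h>
#include <stddef.h>

/* Revoking access makes the next CPU touch of the block trap below. */
static void invalidate_block(void *start, size_t size) {
    mprotect(start, size, PROT_NONE);
}

static void access_handler(int sig, siginfo_t *info, void *ctx) {
    void *addr = info->si_addr;   /* faulting address inside a shared object */
    /* A real handler would do something like:
     *   block *b = find_block(addr);             // hypothetical lookup
     *   if (b->state == INVALID) fetch_from_gpu(b);
     *   mprotect(b->start, b->size, PROT_READ | PROT_WRITE);
     *   b->state = DIRTY;                        // or READ_ONLY on a read
     */
    (void)sig; (void)ctx; (void)addr;
}

static void install_access_detection(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = access_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}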
5.1 Results: GMAC vs. CUDA
• Batch-Update overheads:
  • Copies output data on every accelerator call
  • Copies data that is never used
• CUDA, Lazy-Update and Rolling-Update show similar performance
5.2 Results: Lazy vs. Rolling on 3D Stencil
• Extra data copies for small data objects
• Trade-off between bandwidth and page-fault overhead
6. Conclusions
• A unified virtual shared address space simplifies programming of heterogeneous systems
• Asymmetric Distributed Shared Memory:
  • The CPU accesses accelerator memory, but not vice versa
  • Coherence actions are executed only by the CPU
• Experimental results show no performance degradation
• Memory translation in accelerators is key to implementing ADSM efficiently
Thank you for your attention
Eager to start using GMAC? http://code.google.com/p/adsm/
igelado@ac.upc.edu
adsm-users@googlegroups.com
Backup Slides
4.4 GMAC: Memory Mapping
• Allocation might fail if the requested virtual address range is already in use
• Software workaround: allocate a different address range and provide a translation function (gmacSafePtr(), sketched below)
• Hardware solution: implement virtual memory in the GPU
(Figure: GPU physical address space mapped into a different system virtual address range)
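A minimal sketch of what the software workaround could look like: keep a per-object mapping between the host range and the device range and translate pointers before passing them to a kernel. Only the gmacSafePtr() name comes from the slide; the table and lookup below are illustrative assumptions, not the GMAC implementation.

#include <stddef.h>

typedef struct { char *host; char *dev; size_t size; } range_map;
static range_map maps[64];
static int n_maps;

void *gmacSafePtr(void *host_ptr) {
    for (int i = 0; i < n_maps; i++) {
        char *p = (char *)host_ptr;
        if (p >= maps[i].host && p < maps[i].host + maps[i].size)
            return maps[i].dev + (p - maps[i].host);   /* translate by offset */
    }
    return host_ptr;   /* unified mapping succeeded: no translation needed */
}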
4.5 GMAC: Protocol States
• Protocol states: Invalid, Read-Only, Dirty (sketched below)
• Events driving state transitions:
  • Batch-Update: call / return
  • Lazy-Update: call / return, read / write
  • Rolling-Update: call / return, read / write, flush
(Figure: state diagram over Invalid, Read-Only and Dirty with transitions on call, return, read, write and flush events)
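The sketch below writes the three states as a C enum with one plausible handler per event, to make the transition idea concrete; the exact transition table is given by the paper's state diagram, so the specific transitions here should be read as assumptions.

typedef enum { INVALID, READ_ONLY, DIRTY } block_state;

/* On a CPU read, an invalid block is first fetched from the accelerator. */
static block_state on_read(block_state s)  { return s == INVALID ? READ_ONLY : s; }

/* Any CPU write leaves the block dirty. */
static block_state on_write(block_state s) { (void)s; return DIRTY; }

/* On an accelerator call, dirty blocks are transferred; since the kernel
 * may modify memory, the CPU copy is no longer trusted afterwards. */
static block_state on_call(block_state s)  { (void)s; return INVALID; }

/* Rolling-Update only: an eagerly flushed block remains readable. */
static block_state on_flush(block_state s) { return s == DIRTY ? READ_ONLY : s; }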
4.6 GMAC: Rolling vs. Lazy
• Lazy-Update: transfer on kernel call
• Rolling-Update: transfer while the CPU computes
5.3 Results: Break-down of Execution
5.4 Results: Rolling Size vs. Block Size
• No appreciable effect on most benchmarks
• Small rolling sizes lead to performance anomalies
• Relatively large rolling sizes are preferred
6.1 Conclusions: Wish List
• GPU anonymous memory mappings:
  • GPU-to-CPU mappings never fail
  • Dynamic memory re-allocations
• GPU dynamic pinned memory:
  • No intermediate data copies on flush
• Peer DMA:
  • Speeds up I/O operations
  • No intermediate copies on GPU-to-GPU transfers