An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


Presentation Transcript


  1. An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems
  Isaac Gelado, Javier Cabezas, John Stone, Sanjay Patel, Nacho Navarro and Wen-mei Hwu
  ASPLOS 2010 -- Pittsburgh

  2. 1. Introduction: Heterogeneous Computing
  • Heterogeneous Parallel Systems:
  • CPU: sequential, control-intensive code
  • Accelerators: massively data-parallel code
  • Existing programming models are DMA based:
  • Explicit memory copies
  • Programmer-managed memory coherence

  3. Outline
  • Introduction
  • Motivation
  • ADSM: Asymmetric Distributed Shared Memory
  • GMAC: Global Memory for ACcelerators
  • Experimental Results
  • Conclusions

  4. 2.1 Motivation: Reference System
  • CPU (N cores) with system memory: low latency, strong consistency, small page size
  • GPU-like accelerator with device memory: high bandwidth, weak consistency, large page size
  • CPU and accelerator connected through the PCIe bus

  5. 2.2 Motivation: Memory Requirements
  • Accelerator memory keeps growing (e.g. 6 GB on NVIDIA Fermi, 16 GB on PowerXCell 8i)
  • High memory bandwidth requirements
  • Non-fully-coherent systems:
  • Long-latency coherence traffic
  • Different coherence protocols

  6. 2.3 Motivation: DMA-Based Programming
  • CUDA sample code
  • Duplicated pointers (foo on the CPU, dev_foo on the GPU)
  • Explicit coherence management

  void compute(FILE *file, int size) {
      float *foo, *dev_foo;
      foo = (float *)malloc(size);
      fread(foo, size, 1, file);
      cudaMalloc(&dev_foo, size);
      cudaMemcpy(dev_foo, foo, size, cudaMemcpyHostToDevice);
      kernel<<<Dg, Db>>>(dev_foo, size);
      cudaMemcpy(foo, dev_foo, size, cudaMemcpyDeviceToHost);
      cpuComputation(foo);
      cudaFree(dev_foo);
      free(foo);
  }

  7. 3.1 ADSM: Unified Virtual Address Space
  • Unified virtual shared address space
  • CPU: accesses both system and accelerator memory
  • Accelerator: accesses only its own memory
  • Under ADSM, both use the same virtual address when referencing a shared object

  8. 3.2 ADSM: Simplified Code
  • Single pointer (foo) used by both CPU and GPU
  • Slide annotations on the code: Data Assignment, Peer DMA, Legacy Support
  • Simpler CPU code than in DMA-based programming models
  • Hardware-independent code

  void compute(FILE *file, int size) {
      float *foo;
      foo = adsmMalloc(size);
      fread(foo, size, 1, file);
      kernel<<<Dg, Db>>>(foo, size);
      cpuComputation(foo);
      adsmFree(foo);
  }

  9. 3.3 ADSM: Memory Distribution
  • Asymmetric Distributed Shared Memory principles:
  • CPU accesses objects in accelerator memory, but not vice versa
  • All coherence actions are performed by the CPU
  • Thrashing unlikely to happen:
  • Synchronization variables: interrupt-based and dedicated hardware
  • False sharing: data-object sharing granularity

  10. 3.4 ADSM: Consistency and Coherence
  • Release consistency:
  • Consistency only relevant from the CPU perspective
  • Implicit release/acquire at accelerator call/return (sketched below)
  • Memory coherence:
  • Data ownership information enables eager data transfers
  • CPU maintains coherence
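  As a concrete illustration of the implicit release/acquire at the accelerator call boundary, here is a minimal C sketch; the names (block_t, adsm_launch, copy_to_acc, launch_kernel) and the block table are hypothetical and not part of the GMAC API.

  /* Hypothetical sketch: implicit release/acquire around an accelerator call.
   * Names (block_t, adsm_launch, copy_to_acc, launch_kernel) are illustrative
   * and are not part of the GMAC API. */
  #include <stdbool.h>
  #include <stddef.h>

  typedef struct {
      void  *addr;   /* host virtual address of the shared block    */
      size_t size;   /* block size in bytes                         */
      bool   dirty;  /* modified by the CPU since the last release? */
  } block_t;

  /* Stub: DMA transfer of one block to accelerator memory. */
  static void copy_to_acc(block_t *b) { (void)b; /* e.g. cudaMemcpy(...) */ }

  /* Stub: the actual kernel invocation on the accelerator. */
  static void launch_kernel(void) { /* e.g. kernel<<<Dg, Db>>>(...) */ }

  /* Implicit release on call: only blocks written by the CPU are transferred
   * (eager, ownership-driven). Implicit acquire on return: host copies are
   * treated as stale so later CPU reads fetch the accelerator's results. */
  static void adsm_launch(block_t *blocks, size_t n)
  {
      for (size_t i = 0; i < n; ++i) {
          if (blocks[i].dirty) {
              copy_to_acc(&blocks[i]);
              blocks[i].dirty = false;
          }
      }
      launch_kernel();
  }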

  11. 4. Global Memory for Accelerators (GMAC)
  • ADSM implementation
  • User-level shared library
  • GNU/Linux systems
  • NVIDIA CUDA GPUs

  12. 4.1 GMAC: Overall Design
  • Layer stack: CUDA-like front-end, memory manager (different policies), kernel scheduler (FIFO), operating system abstraction layer, accelerator abstraction layer (CUDA)
  • Layered design allows:
  • Multiple memory consistency protocols
  • Operating-system- and accelerator-independent code

  13. 4.2 GMAC: Unified Address Space
  • Virtual address space formed by GPU and system physical memories
  • The GPU memory address range cannot be selected
  • Allocate the same virtual memory address range in both GPU and CPU (sketched below)
  • Accelerator virtual memory would ease this process
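  A hedged sketch of one way to place a host mapping at the virtual address CUDA chose for the device allocation. It assumes a Linux kernel with MAP_FIXED_NOREPLACE (4.17 or newer) and the CUDA runtime; adsm_alloc is an illustrative name, and this is not GMAC's actual implementation. It also mirrors the later point that the mapping can fail if the range is already in use.

  /* Sketch: mirror a device allocation at the same host virtual address.
   * Assumes Linux with MAP_FIXED_NOREPLACE and the CUDA runtime; this is an
   * illustration, not the actual GMAC implementation. */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/mman.h>
  #include <cuda_runtime.h>

  void *adsm_alloc(size_t size)
  {
      void *dev = NULL;
      if (cudaMalloc(&dev, size) != cudaSuccess)
          return NULL;                              /* no device memory   */

      /* Try to claim the identical virtual range in the host address space
       * so CPU and GPU can use a single pointer value for the object.     */
      void *host = mmap(dev, size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
                        -1, 0);
      if (host != dev) {                            /* range already used */
          if (host != MAP_FAILED)
              munmap(host, size);
          cudaFree(dev);
          fprintf(stderr, "adsm_alloc: address range unavailable\n");
          return NULL;
      }
      return host;                                  /* same value as dev  */
  }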

  14. 4.3 GMAC: Coherence Protocols
  • Batch-Update: copy all shared objects
  • Lazy-Update: copy only the modified / needed shared objects
  • Data-object granularity
  • Detects CPU read/write accesses to shared objects (see the access-detection sketch below)
  • Rolling-Update: copy only the modified / needed memory
  • Memory-block granularity
  • Fixed maximum number of modified blocks in system memory; data is flushed when the maximum is reached
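  As a rough illustration of how CPU accesses to shared objects can be detected at block granularity, the sketch below uses mprotect plus a SIGSEGV handler, the classic user-level technique; the single-block bookkeeping and all names (track_block, fault_handler) are made up for the example and are not GMAC internals.

  /* Sketch: detect CPU accesses to a shared block with mprotect + a SIGSEGV
   * handler. Single-block bookkeeping; a real runtime tracks many blocks. */
  #define _GNU_SOURCE
  #include <signal.h>
  #include <stdint.h>
  #include <sys/mman.h>

  static void   *shared_base;         /* page-aligned start of the block */
  static size_t  shared_size;
  static volatile sig_atomic_t block_dirty;

  static void fault_handler(int sig, siginfo_t *si, void *ctx)
  {
      (void)sig; (void)ctx;
      uintptr_t addr = (uintptr_t)si->si_addr;
      if (addr >= (uintptr_t)shared_base &&
          addr <  (uintptr_t)shared_base + shared_size) {
          block_dirty = 1;                    /* CPU touched the block */
          mprotect(shared_base, shared_size,  /* allow the access and  */
                   PROT_READ | PROT_WRITE);   /* stop faulting on it   */
      }
      /* else: a real fault; a full runtime would chain the old handler */
  }

  void track_block(void *base, size_t size)
  {
      shared_base = base;                     /* must be page-aligned  */
      shared_size = size;

      struct sigaction sa;
      sa.sa_sigaction = fault_handler;
      sa.sa_flags = SA_SIGINFO;
      sigemptyset(&sa.sa_mask);
      sigaction(SIGSEGV, &sa, NULL);

      /* Revoke access so the first CPU read/write raises a fault. */
      mprotect(base, size, PROT_NONE);
  }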

  15. 5.1 Results: GMAC vs. CUDA
  • Batch-Update overheads:
  • Output data copied on every accelerator call
  • Unused data copied
  • Similar performance for CUDA, Lazy-Update and Rolling-Update

  16. 5.2 Results: Lazy vs. Rolling on 3D Stencil
  • Extra data copy for small data objects
  • Trade-off between bandwidth and page-fault overhead

  17. 6. Conclusions
  • A unified virtual shared address space simplifies programming of heterogeneous systems
  • Asymmetric Distributed Shared Memory:
  • CPU accesses accelerator memory, but not vice versa
  • Coherence actions executed only by the CPU
  • Experimental results show no performance degradation
  • Memory translation in accelerators is key to implementing ADSM efficiently

  18. Thank you for your attention
  Eager to start using GMAC? http://code.google.com/p/adsm/
  Contact: igelado@ac.upc.edu, adsm-users@googlegroups.com

  19. Backup Slides

  20. 4.4 GMAC: Memory Mapping
  • Allocation might fail if the virtual address range is already in use
  • Software fallback: allocate a different address range and provide a translation function (gmacSafePtr()), as sketched below
  • Hardware fix: implement virtual memory in the GPU
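  To show what a gmacSafePtr()-style software fallback could look like, here is a hypothetical sketch that translates a host pointer inside a shared object to its accelerator counterpart; the lookup table and the gmac_safe_ptr name are illustrative only, not GMAC's real data structures.

  /* Sketch of the software fallback: translate a host pointer inside a shared
   * object to its device counterpart. Table layout is illustrative only. */
  #include <stddef.h>
  #include <stdint.h>

  typedef struct {
      void  *host;   /* base of the object in host memory        */
      void  *dev;    /* base of the same object in device memory */
      size_t size;
  } mapping_t;

  static mapping_t table[64];
  static size_t    n_mappings;

  void *gmac_safe_ptr(const void *host_ptr)
  {
      uintptr_t p = (uintptr_t)host_ptr;
      for (size_t i = 0; i < n_mappings; ++i) {
          uintptr_t base = (uintptr_t)table[i].host;
          if (p >= base && p < base + table[i].size)
              return (char *)table[i].dev + (p - base);  /* same offset */
      }
      return (void *)host_ptr;  /* address is already usable as-is */
  }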

  21. 4.5 GMAC: Protocol States
  • Protocol states: Invalid, Read-only, Dirty
  • Batch-Update events: Call / Return
  • Lazy-Update events: Call / Return, Read / Write
  • Rolling-Update events: Call / Return, Read / Write, Flush
  • (The slide shows the state-transition diagram among Invalid, Read-only and Dirty; a sketch of the transitions follows below)
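  To make the state machine concrete, here is a hedged C sketch of the three states and the events as a transition function; the transitions are reconstructed from the slide bullets and may differ in detail from the real GMAC protocol.

  /* Sketch: ADSM-style block states and events as a transition function.
   * Transitions reconstructed from the slide bullets; details may differ
   * from the real GMAC protocol. */
  typedef enum { ST_INVALID, ST_READ_ONLY, ST_DIRTY } state_t;
  typedef enum { EV_CALL, EV_RETURN, EV_READ, EV_WRITE, EV_FLUSH } event_t;

  state_t next_state(state_t s, event_t e)
  {
      switch (e) {
      case EV_READ:    /* CPU read: fetch the block if the host copy is stale */
          return (s == ST_INVALID) ? ST_READ_ONLY : s;
      case EV_WRITE:   /* CPU write: host copy is now newer than the GPU's    */
          return ST_DIRTY;
      case EV_FLUSH:   /* Rolling-Update: push a dirty block ahead of time    */
          return (s == ST_DIRTY) ? ST_READ_ONLY : s;
      case EV_CALL:    /* accelerator call: the accelerator may modify data,  */
      case EV_RETURN:  /* so the host copy is treated as stale afterwards     */
          return ST_INVALID;
      }
      return s;
  }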

  22. 4.6 GMAC: Rolling vs. Lazy
  • Batch-Update: transfer on kernel call
  • Rolling-Update: transfer while the CPU computes

  23. 5.3 Results: Break-down of Execution

  24. 5.4 Results: Rolling Size vs. Block Size
  • Small rolling sizes lead to performance aberrations
  • Prefer relatively large rolling sizes
  • No appreciable effect on most benchmarks

  25. 6.1 Conclusions: Wish-list
  • GPU anonymous memory mappings:
  • GPU-to-CPU mappings never fail
  • Dynamic memory re-allocations
  • GPU dynamic pinned memory:
  • No intermediate data copies on flush
  • Peer DMA:
  • Speed up I/O operations
  • No intermediate copies on GPU-to-GPU transfers
