
Architectural Support for Synchronization-Free Deterministic Parallel Programming

Cedomir Segulja and Tarek S. Abdelrahman, The Edward S. Rogers Sr. Department of Electrical and Computer Engineering.


Presentation Transcript


  1. Architectural Support for Synchronization-Free Deterministic Parallel Programming. Cedomir Segulja and Tarek S. Abdelrahman, The Edward S. Rogers Sr. Department of Electrical and Computer Engineering

  2. Parallel Programming is Hard • Chip multiprocessors are now commonplace • Higher performance = coarse-grain parallelization • Parallel programming • Library-based approaches (e.g., Intel TBB) • Compiler directives (e.g., OpenMP) • Language extensions (e.g., Cilk and Cilk++) • … is still hard: (1) explicit synchronization, (2) non-deterministic execution Introduction

  3. Can we Make it Easier? • We did it before! • OOO, Superscalar microarchitectures • Can we do it again? • Utilize architectural support to hide the nature of the underlying coarse-grain parallel hardware Introduction

  4. The Challenge (architectural support vs. programming model) • (1) No architectural support: explicitly parallel programming • (2) Complex architectural support: sequential programming model Introduction

  5. Versioning • A novel synchronization mechanism • Dynamically detects and enforces dependences among coarse-grain units of computation (tasks) • Parallel programming with sequential semantics • Provides (1) implicit synchronization and (2) deterministic execution [Figure label: function call] Introduction

  6. Outline • Motivation • Versioning • Architectural Support • Programming Support • Evaluation • Conclusions and Future Work Outline

  7. Versioning: Basic Idea • Every shared memory location is assigned a version number, held in the Global Version Table (GVT) • Each running task maintains a pair of numbers for each shared memory location it accesses, an acquire number (ACQ) and a release number (REL), held in its Local Version Table (LVT) [Figure: LVTs of CPU 1, CPU 2 and CPU 3 and the GVT entry for a shared location x] Versioning

  8. The Anatomy of a Memory Access • Memory access: LVT access • LVT hit? No: proceed with memory access. Yes: GVT access • Version match? Yes: proceed with memory access. No: wait Versioning
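
The check on this slide can be sketched in C. This is a minimal sketch, not the ROKO implementation: it assumes that a "version match" means the location's GVT version equals the task's acquire number, and that a finishing task publishes its release number into the GVT. The names `lvt_entry`, `check_access`, `release_all` and the table sizes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GVT_SIZE 8  /* hypothetical number of tracked shared locations */
#define LVT_SIZE 4  /* hypothetical number of LVT entries per task */

typedef struct {
    size_t   loc;      /* index of the shared location in the GVT */
    uint32_t acquire;  /* version the task must observe before accessing */
    uint32_t release;  /* version the task publishes when it finishes (assumed) */
} lvt_entry;

typedef struct {
    lvt_entry entries[LVT_SIZE];
    size_t    count;
} lvt;

typedef enum { ACCESS_PROCEED, ACCESS_WAIT } access_result;

/* Global Version Table: one version number per shared location. */
uint32_t gvt[GVT_SIZE];

/* Returns the task's entry for `loc`, or NULL on an LVT miss. */
const lvt_entry *lvt_lookup(const lvt *t, size_t loc) {
    for (size_t i = 0; i < t->count; i++)
        if (t->entries[i].loc == loc)
            return &t->entries[i];
    return NULL;
}

/* LVT miss: proceed. LVT hit: proceed only if the GVT version matches. */
access_result check_access(const lvt *t, size_t loc) {
    assert(loc < GVT_SIZE);
    const lvt_entry *e = lvt_lookup(t, loc);
    if (e == NULL)
        return ACCESS_PROCEED;
    return gvt[loc] == e->acquire ? ACCESS_PROCEED : ACCESS_WAIT;
}

/* Assumed release step: publish the release numbers of a finished task. */
void release_all(const lvt *t) {
    for (size_t i = 0; i < t->count; i++)
        gvt[t->entries[i].loc] = t->entries[i].release;
}
```

With two tasks chained on one location (task 1 acquires 0 and releases 25, task 2 acquires 25 and releases 50), `check_access` makes task 2 wait until `release_all` has run for task 1, which is how sequential order would be enforced without explicit locks under these assumptions.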

  9. Architectural Support for Versioning • On-chip LVT: stores task’s numbers for shared data • Versioning co-processor: implements the logic of a memory access and creates version numbers • Counting Bloom Filter (CBF): stores a conservative estimate of task’s shared data • In-memory GVT: stores version numbers for shared data Architectural Support

  10. The Anatomy of a Memory Access (2) [Flowchart: memory access, then CBF access • CBF hit? No: proceed with memory access. Yes: LVT access • LVT hit? Yes: GVT access • Version match? No: wait (trap) • Update the CBF • Proceed with memory access] Architectural Support
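
The three decisions on this slide can be written as one small classification function. This is a sketch under stated assumptions, since the flattened flowchart does not show every branch: it assumes that an LVT miss after a CBF hit (a false positive) simply proceeds, and that "Update the CBF" happens on a version match before the access proceeds. The names `outcome` and `classify_access` are hypothetical.

```c
#include <stdbool.h>

typedef enum {
    OUTCOME_PROCEED,            /* no version check was needed */
    OUTCOME_UPDATE_AND_PROCEED, /* versions matched: update the CBF, then access (assumed) */
    OUTCOME_TRAP                /* version mismatch: the access waits via a trap */
} outcome;

/* Classifies a memory access from the three checks in the flowchart. */
outcome classify_access(bool cbf_hit, bool lvt_hit, bool version_match) {
    if (!cbf_hit)
        return OUTCOME_PROCEED;   /* a CBF miss never needs a version check */
    if (!lvt_hit)
        return OUTCOME_PROCEED;   /* assumed: CBF false positive, nothing to check */
    return version_match ? OUTCOME_UPDATE_AND_PROCEED : OUTCOME_TRAP;
}
```

The ordering reflects the design intent stated in the slides: the cheap CBF test comes first, so the LVT and GVT are consulted only for accesses that may need a version comparison.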

  11. Programming Support • Pragma parallel • Asynchronous execution of function calls • Pragma access expression <read> • Describes functions’ side-effects • Can handle pointers, array and recursive accesses Programming Support

  12. Experimental Evaluation • Prototype implementation: ROKO • Software Platform • The Roko pre-compiler • CLANG/LLVM framework • The Roko run-time • C & Assembly • Hardware Platform • FPGA-based SMP system modified in order to support versioning • LEON3, SPARC V8 compliant processor Evaluation

  13. Benchmarks Evaluation

  14. Parallelization • On-chip storage equivalent to 16% of 16KB L1 data cache Evaluation

  15. Application Results Evaluation

  16. Access Monitoring Overheads • Percentage of the time spent executing LVT and GVT lookups [Figure: memory-access flowchart with CBF access, LVT access, GVT access, version match, wait, and update the CBF] Evaluation

  17. Related Work • Architectural support for parallel programming • Programming models with implicit synchronization • Prometheus • Deterministic Parallel Java (DPJ) Related Work

  18. Conclusions • Versioning provides deterministic execution and alleviates the need for explicit synchronization • Support for concurrent reads • Assigning version numbers to adjustable regions of memory locations • Architectural support for versioning does not require intrusive changes to the processor Conclusions

  19. Conclusions • Proof-of-concept FPGA implementation delivers good performance in terms of application speedup • Low timing overheads • Requires on-chip storage equivalent to 16% of 16KB L1 data cache Conclusions

  20. Future Work • Support arbitrary units of computations • Compiler support • Assist in reporting functions’ side-effects • Filter data accesses that need to be monitored • Applicability of versioning to explicitly synchronized parallel codes Future Work

  21. Thank You!

  22. Main Slides • Motivation • Versioning • Architectural Support • Programming Support • Evaluation • Conclusions and Future Work Backup Slides • ROKO in More Detail: Software, Hardware • Evaluation in More Detail: Overheads, Memory Contention, TBB Comparison • Versioning in More Detail: Concurrent Reads, Variable-granularity, Algorithms, Programming Support

  23. Concurrent Reads • In addition to the acquire and release numbers, each task also maintains a delta number (Δ) for each shared memory address it accesses • Every shared memory location is assigned a write number and a read number [Figure: ACQ, REL and Δ entries of CPU 1 and CPU 2, with the checks "ACQ == write?" for a read access and "ACQ == write?", "Δ + read == R?" for a write access] Concurrent Reads

  24. Variable-granularity Versioning • A task may use a different version number during acquire and release • Version numbers are identified with an ID [Figure: ACQ and REL entries of CPU 1, CPU 2 and CPU 3 for a structure s and its fields s.a and s.b] Variable-granularity

  25. Versioning Algorithms (1) Algorithms

  26. Versioning Algorithms (2) • Before a task is spawned, its access table (AT) is created • Maps the variables reported with the access pragma to memory locations • For recursive accesses, the actual corresponding memory locations are identified by traversing the structures in memory, until a NULL pointer is encountered Algorithms
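
The recursive-access case can be illustrated with a linked list. The following is a minimal sketch, assuming an access table is just a bounded array of addresses; the types `node` and `access_table`, the function `at_add_recursive`, and the capacity are hypothetical and not taken from ROKO.

```c
#include <assert.h>
#include <stddef.h>

#define AT_CAPACITY 16  /* hypothetical upper bound on recorded locations */

/* A recursive structure: each node points to the next, ending in NULL. */
typedef struct node {
    int          value;
    struct node *next;
} node;

/* Access table: the memory locations a task is expected to touch. */
typedef struct {
    const void *locations[AT_CAPACITY];
    size_t      count;
} access_table;

/*
 * Records every node reachable from `head` by following `next` until a
 * NULL pointer is encountered (or the table is full). Returns the number
 * of locations added by this call.
 */
size_t at_add_recursive(access_table *at, const node *head) {
    assert(at != NULL);
    size_t added = 0;
    for (const node *n = head; n != NULL && at->count < AT_CAPACITY; n = n->next) {
        at->locations[at->count++] = n;
        added++;
    }
    return added;
}
```

For a three-node list, the table ends up with three entries in traversal order. A real run-time would presumably also attach acquire and release numbers to each recorded location, which this sketch leaves out.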

  27. Versioning Algorithms (3) Algorithms

  28. Describing Function’s Side-Effects • Describing pointer accesses • Describing array accesses • Describing recursive accesses • Some accesses may be omitted • Read-only data • Structure entry points Programming Model

  29. ROKO Software • Precompiler • Bare-C Cross Compilation System (BCC) • GNU C/C++ cross-compiler (3.4.4) • GNU Binutils 2.16.1 • Newlib 1.13.0 embedded C library • Scheduler • A task is scheduled in its respective sequential order • On a version mismatch: • If there exists a currently non-running sequentially earlier task, switch to it • Otherwise, continue by re-executing the faulting memory instruction ROKO in Details
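
The scheduler's reaction to a version mismatch can be sketched as a single decision function. This is an illustrative sketch, not ROKO's scheduler: it assumes tasks are indexed by sequential order, that "non-running" means waiting (neither running nor finished), and that the earliest such task is chosen when several exist. The names `task_state`, `RETRY_ACCESS` and `on_version_mismatch` are hypothetical.

```c
typedef enum { TASK_WAITING, TASK_RUNNING, TASK_DONE } task_state;

#define RETRY_ACCESS (-1)  /* no earlier task to run: re-execute the faulting access */

/*
 * `tasks[i]` is the state of the i-th task in sequential order and
 * `current` is the index of the task that hit the version mismatch.
 * Returns the index of the task to switch to, or RETRY_ACCESS.
 */
int on_version_mismatch(const task_state *tasks, int current) {
    for (int i = 0; i < current; i++) {
        if (tasks[i] == TASK_WAITING)
            return i;  /* a sequentially earlier task is not running: switch to it */
    }
    return RETRY_ACCESS;
}
```

Preferring sequentially earlier tasks makes sense because a mismatch means the faulting task depends on an earlier task's release, so running an earlier task is the most direct way to make progress.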

  30. ROKO Hardware • LVT entry = 128 bits ROKO in Details

  31. Versioning Overheads Overheads

  32. Benchmarks Overheads

  33. Task Management Overheads (1) Overheads

  34. Task Management Overheads (2) Overheads

  35. Task Management Overhead (3) Overheads

  36. Memory Contention Overheads

  37. TBB Comparison • Linux 2.6.21.1 • Issues: • Virtual memory, library functions (e.g., memory allocator and math functions) and process scheduling • Effort: • Static compilation with the same math library, real-time scheduling in Linux • Sequential execution times were within 5%, except: • Health 23.3% increase • Union 111.9% increase • We suspect this is due to malloc Evaluation

  38. Parallelization Evaluation

  39. Bloom filter • Fast structure • Holds only memory locations for which the comparison of version numbers must be done • Probabilistic • May report false positives • May indicate that the LVT needs to be accessed when not necessary • But never reports false negatives • A CBF miss always means that the memory access can be executed Architectural Support
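
The properties listed on this slide (possible false positives, no false negatives) follow directly from how a counting Bloom filter works. Below is a minimal software sketch; the slot count, the 32-bit address type, the two hash functions, and all names are assumptions made for illustration and say nothing about the hardware CBF in ROKO.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CBF_SLOTS 64  /* hypothetical size: 2^6 slots, indexed by 6 hash bits */

typedef struct {
    uint8_t count[CBF_SLOTS];  /* one small counter per slot */
} cbf;

/* Two cheap multiplicative hashes; the top 6 bits select one of 64 slots. */
static uint32_t cbf_hash1(uint32_t addr) {
    return (addr * 2654435761u) >> 26;
}
static uint32_t cbf_hash2(uint32_t addr) {
    return ((addr ^ (addr >> 15)) * 2246822519u) >> 26;
}

void cbf_init(cbf *f) {
    memset(f->count, 0, sizeof f->count);
}

/* True if `addr` may have been inserted; false means it definitely was not. */
bool cbf_may_contain(const cbf *f, uint32_t addr) {
    return f->count[cbf_hash1(addr)] > 0 && f->count[cbf_hash2(addr)] > 0;
}

void cbf_insert(cbf *f, uint32_t addr) {
    uint32_t a = cbf_hash1(addr), b = cbf_hash2(addr);
    /* Leave headroom so two increments on the same slot cannot overflow. */
    assert(f->count[a] < UINT8_MAX - 1 && f->count[b] < UINT8_MAX - 1);
    f->count[a]++;
    f->count[b]++;
}

/* Precondition: `addr` was previously inserted and not yet removed. */
void cbf_remove(cbf *f, uint32_t addr) {
    uint32_t a = cbf_hash1(addr), b = cbf_hash2(addr);
    assert(cbf_may_contain(f, addr));
    f->count[a]--;
    f->count[b]--;
}
```

Counters (rather than single bits) are what allow removal: once a location no longer needs a version comparison it can be taken out of the filter without disturbing other locations that happen to share a slot.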
