Architectural Support for Synchronization-Free Deterministic Parallel Programming. Cedomir Segulja and Tarek S. Abdelrahman The Edward S. Rogers Sr. Department of Electrical and Computer Engineering. Parallel Programming is Hard. Chip multiprocessors are now commonplace
Architectural Support for Synchronization-Free Deterministic Parallel Programming Cedomir Segulja and Tarek S. Abdelrahman The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
Parallel Programming is Hard • Chip multiprocessors are now commonplace • Higher performance = coarse-grain parallelization • Parallel programming • Library-based approaches (e.g., Intel TBB) • Compiler directives (e.g., OpenMP) • Language extensions (e.g., Cilk and Cilk++) • … is still hard • (1) Explicit synchronization • (2) Non-deterministic execution Introduction
Can we Make it Easier? • We did it before! • OOO, Superscalar microarchitectures • Can we do it again? • Utilize architectural support to hide the nature of the underlying coarse-grain parallel hardware Introduction
The Challenge • Trade-off between architectural support and programming model • (1) No architectural support: explicitly parallel programming • (2) Complex architectural support: sequential programming model Introduction
Versioning • A novel synchronization mechanism • Dynamically detects and enforces dependences among coarse-grain units of computation (tasks) • Parallel programming with sequential semantics • Provides • (1) Implicit synchronization • (2) Deterministic execution [Figure label: Function call] Introduction
Outline • Motivation • Versioning • Architectural Support • Programming Support • Evaluation • Conclusions and Future Work Outline
Versioning: Basic Idea • Every shared memory location is assigned a version number, stored in the Global Version Table (GVT) • Each running task maintains a pair of numbers for each shared memory location it accesses: an acquire number (ACQ) and a release number (REL), stored in its Local Version Table (LVT) [Figure: LVTs of CPU 1, CPU 2 and CPU 3 with ACQ/REL entries for x, and the GVT entry for x] Versioning
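A minimal sketch of the basic idea, under assumptions the slide does not spell out: a task may touch a location only while the location's global version equals the task's acquire number, and when the task is done with the location it writes its release number into the GVT. The names `GlobalVersionTable`, `Task`, `declare`, `can_access` and `release`, and the version values used, are hypothetical and for illustration only.

```python
class GlobalVersionTable:
    """Hypothetical GVT: one version number per shared memory location."""

    def __init__(self):
        self.version = {}  # address -> current version number

    def get(self, addr):
        return self.version.get(addr, 0)  # assumption: versions start at 0

    def set(self, addr, value):
        self.version[addr] = value


class Task:
    """Hypothetical task holding a Local Version Table (LVT)."""

    def __init__(self, name, gvt):
        self.name = name
        self.gvt = gvt
        self.lvt = {}  # address -> (acquire number, release number)

    def declare(self, addr, acquire, release):
        self.lvt[addr] = (acquire, release)

    def can_access(self, addr):
        # Locations without an LVT entry are not monitored for this task.
        if addr not in self.lvt:
            return True
        acquire, _ = self.lvt[addr]
        # Assumed rule: access is allowed only on a version match.
        return self.gvt.get(addr) == acquire

    def release(self, addr):
        # Assumed rule: releasing publishes the release number in the GVT.
        _, release = self.lvt.pop(addr)
        self.gvt.set(addr, release)


gvt = GlobalVersionTable()
first = Task("first", gvt)    # sequentially earlier task
second = Task("second", gvt)  # sequentially later task
first.declare("x", acquire=0, release=25)
second.declare("x", acquire=25, release=50)

order = [second.can_access("x"), first.can_access("x")]
first.release("x")
order.append(second.can_access("x"))
```

In this sketch the later task is held back until the earlier task releases `x`, which is how matching release and acquire numbers can enforce the sequential order without explicit locks.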
The Anatomy of a Memory Access [Flowchart; nodes: Memory access, LVT access, LVT hit? (Yes/No), GVT access, Version match? (Yes/No), Wait, Proceed with memory access] Versioning
Architectural Support for Versioning • On-chip LVT: stores task’s numbers for shared data • Versioning co-processor: implements the logic of a memory access and creates version numbers • Counting Bloom Filter (CBF): stores a conservative estimate of task’s shared data • In-memory GVT: stores version numbers for shared data Architectural Support
The Anatomy of a Memory Access (2) [Flowchart; nodes: Memory access, CBF access, CBF hit? (Yes/No), LVT access, LVT hit? (Yes/No), GVT access, Version match? (Yes/No), Wait, Trap, Update the CBF, Proceed with memory access] Architectural Support
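One plausible reading of this access path can be sketched as follows. The branch outcomes are assumptions: a CBF miss lets the access proceed, a CBF hit with no LVT entry is treated as a Bloom-filter false positive, a version mismatch makes the task wait, and "Update the CBF" is modeled as removing the location from the filter after a successful match. The "Trap" outcome is not modeled because its position in the flowchart cannot be recovered from the slide text. The function name `check_access` and the plain-dictionary tables are hypothetical.

```python
from collections import Counter


def check_access(addr, cbf, lvt, gvt):
    """Return 'proceed' or 'wait' for one memory access.

    cbf: Counter standing in for the counting Bloom filter
    lvt: dict address -> (acquire number, release number)
    gvt: dict address -> current version number
    """
    if cbf[addr] == 0:
        # CBF miss: the access never needs a version comparison.
        return "proceed"
    if addr not in lvt:
        # Assumed: CBF hit without an LVT entry is a false positive.
        return "proceed"
    acquire, _release = lvt[addr]
    if gvt.get(addr, 0) != acquire:
        # Version mismatch: the access must not happen yet.
        return "wait"
    # Assumed meaning of "Update the CBF": stop monitoring this location.
    cbf[addr] -= 1
    return "proceed"


cbf = Counter({"x": 1, "y": 1})  # "y" plays the role of a false positive
lvt = {"x": (25, 50)}
gvt = {"x": 0}

before = [check_access(a, cbf, lvt, gvt) for a in ("z", "y", "x")]
gvt["x"] = 25  # an earlier task releases x
after = check_access("x", cbf, lvt, gvt)
```

The filter comes first so that the common case, an access to data that needs no monitoring, costs a single fast lookup.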
Programming Support • Pragma parallel • Asynchronous execution of function calls • Pragma access expression <read> • Describes functions’ side-effects • Can handle pointers, array and recursive accesses Programming Support
Experimental Evaluation • Prototype implementation: ROKO • Software Platform • The Roko pre-compiler • CLANG/LLVM framework • The Roko run-time • C & Assembly • Hardware Platform • FPGA-based SMP system modified in order to support versioning • LEON3, SPARC V8 compliant processor Evaluation
Benchmarks Evaluation
Parallelization equivalent to 16% of 16KB L1 data cache Evaluation
Application Results Evaluation
Access Monitoring Overheads • Percentage of the time spent executing LVT and GVT lookups [Flowchart; nodes: Memory access, CBF access, CBF hit? (Yes/No), LVT access, LVT hit? (Yes/No), GVT access, Version match? (Yes/No), Wait, Update the CBF, Proceed with memory access] Evaluation
Related Work • Architectural support for parallel programming • Programming models with implicit synchronization • Prometheus • Deterministic Parallel Java (DPJ) Related Work
Conclusions • Versioning provides deterministic execution and alleviates the need for explicit synchronization • Support for concurrent reads • Assigning version numbers to adjustable regions of memory locations • Architectural support for versioning does not require intrusive changes to the processor Conclusions
Conclusions • Proof-of-concept FPGA implementation delivers good performance in terms of application speedup • Low timing overheads • Requires on-chip storage equivalent to 16% of 16KB L1 data cache Conclusions
Future Work • Support arbitrary units of computations • Compiler support • Assist in reporting functions’ side-effects • Filter data accesses that need to be monitored • Applicability of versioning to explicitly synchronized parallel codes Future Work
Thank You! Thank You!
Main Slides Architectural Support for Synchronization-Free Deterministic Parallel Programming Motivation Versioning Architectural Support Programming Support Evaluation Conclusions and Future Work Backup Slides • ROKO in More Details: • Software • Hardware • Evaluation in More Detail: • Overheads • Memory Contention • TBB Comparison • Versioning in More Details: • Concurrent Reads • Variable-granularity • Algorithms • Programming Support
Concurrent Reads • In addition to the acquire and release numbers, each task also maintains a delta number (Δ) for each shared memory address it accesses • Every shared memory location is assigned a write number and a read number [Figure: ACQ/REL/Δ entries for CPU 1 and CPU 2 and the write/read numbers of x; labels: Read access (1), Write access (2); checks shown: ACQ == write?, Δ + read == R?] Concurrent Reads
Variable-granularity Versioning • A task may use a different version number during acquire and release • Version numbers are identified with an ID [Figure: ACQ/REL entries for s, s.a and s.b on CPU 1, CPU 2 and CPU 3] Variable-granularity
Versioning Algorithms (1) Algorithms
Versioning Algorithms (2) • Before a task is spawned, its access table (AT) is created • Maps the variables reported with the access pragma to memory locations • For recursive accesses, the actual corresponding memory locations are identified by traversing the structures in memory, until a NULL pointer is encountered Algorithms
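The traversal for recursive accesses can be illustrated with a small sketch. It assumes an acyclic linked structure, uses Python's `id()` as a stand-in for a memory address, and uses `None` in place of the NULL pointer. The names `Node` and `build_access_table` and the `(object, field)` input format are hypothetical.

```python
class Node:
    """Hypothetical linked-list node standing in for a recursive structure."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def build_access_table(reported):
    """Map reported variables to the memory locations they cover.

    reported: list of (object, recursive_field) pairs, where
    recursive_field is None for a plain (non-recursive) access.
    """
    table = []
    for obj, field in reported:
        node = obj
        while node is not None:          # stop at the NULL pointer
            table.append(id(node))       # id() stands in for an address
            # Follow the recursive field, or stop after one location.
            node = getattr(node, field) if field else None
    return table


tail = Node(3)
middle = Node(2, tail)
head = Node(1, middle)
scalar = Node(99)

access_table = build_access_table([(head, "next"), (scalar, None)])
```

Here the recursive access to `head` contributes three locations and the plain access to `scalar` contributes one.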
Versioning Algorithms (3) Algorithms
Describing Function’s Side-Effects • Describing pointer accesses • Describing array accesses • Describing recursive accesses • Some accesses may be omitted • Read-only data • Structure entry points Programming Model
ROKO Software • Precompiler • Bare-C Cross Compilation System (BCC) • GNU C/C++ cross-compiler (3.4.4) • GNU Binutils-2.16.1 • Newlib 1.13.0 Embedded C-library • Scheduler • A task is scheduled in its respective sequential order • On a version mismatch: • If there exists a currently non-running sequentially earlier task, switch to it • Otherwise, continue by re-executing the faulting memory instruction ROKO in Details
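The scheduler's reaction to a version mismatch can be sketched as a small decision function. This is an illustration under assumptions, not the Roko run-time: tasks are identified by their sequential order number, task states are the strings `"running"`, `"ready"` and `"done"`, and when several earlier tasks are waiting the earliest one is chosen. The name `on_version_mismatch` is hypothetical.

```python
def on_version_mismatch(current, tasks):
    """Pick the task to run after `current` faults on a version mismatch.

    current: sequential order number of the faulting task
    tasks:   dict mapping order number -> 'running' | 'ready' | 'done'
    """
    # Sequentially earlier tasks that are not running at the moment.
    earlier = [seq for seq, state in tasks.items()
               if seq < current and state == "ready"]
    if earlier:
        # Assumption: prefer the earliest such task.
        return min(earlier)
    # No candidate: keep the task and re-execute the faulting instruction.
    return current


switch_case = on_version_mismatch(
    5, {1: "done", 2: "running", 3: "ready", 4: "ready", 5: "running"})
retry_case = on_version_mismatch(
    5, {1: "done", 2: "running", 5: "running", 6: "ready"})
```

Only sequentially earlier tasks are candidates, since a later task cannot produce the version the faulting task is waiting for.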
ROKO Hardware • LVT entry = 128 bits ROKO in Details
Versioning Overheads Overheads
Benchmarks Overheads
Task Management Overheads (1) Overheads
Task Management Overheads (2) Overheads
Task Management Overhead (3) Overheads
Memory Contention Overheads
TBB Comparison • Linux 2.6.21.1 • Issues: • Virtual memory, library functions (e.g., memory allocator and math functions) and process scheduling • Effort: • Static compilation with the same math library, real-time scheduling in Linux • Sequential execution times were within 5% • Health 23.3% increase • Union 111.9% increase • We suspect this is due to malloc Evaluation
Parallelization Evaluation
Bloom filter • Fast structure • Holds only memory locations for which the comparison of version numbers must be done • Probabilistic • May report false positives • May indicate that the LVT needs to be accessed when not necessary • But never reports false negatives • A CBF miss always means that the memory access can be executed Architectural Support
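The properties listed above can be seen in a minimal software counting Bloom filter. This is a generic sketch of the data structure, not the hardware design from the talk: the table size, the number of hash functions and the use of SHA-256 are arbitrary choices, and the class name `CountingBloomFilter` is hypothetical.

```python
import hashlib


class CountingBloomFilter:
    """Generic counting Bloom filter over small counters."""

    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.counters = [0] * size

    def _slots(self, item):
        # Derive `hashes` counter indices from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for slot in self._slots(item):
            self.counters[slot] += 1

    def remove(self, item):
        # Only valid for items that were added before.
        for slot in self._slots(item):
            self.counters[slot] -= 1

    def __contains__(self, item):
        # May be True for an item never added (false positive),
        # but is never False for an item currently in the filter.
        return all(self.counters[slot] > 0 for slot in self._slots(item))


cbf = CountingBloomFilter()
monitored = [0x1000, 0x1004, 0x2000]
for address in monitored:
    cbf.add(address)

all_hits = all(address in cbf for address in monitored)
cbf.remove(0x1004)
still_hits = 0x1000 in cbf and 0x2000 in cbf
```

Counters rather than single bits are what make `remove` possible, so a location can be dropped from the filter once it no longer needs version checks.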