Architectural Support for Synchronization-Free Deterministic Parallel Programming

Cedomir Segulja and Tarek S. Abdelrahman
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering

Parallel Programming is Hard
• Chip multiprocessors are now commonplace
• Higher performance = coarse-grain parallelization
• Parallel programming
  • Library-based approaches (e.g., Intel TBB)
  • Compiler directives (e.g., OpenMP)
  • Language extensions (e.g., Cilk and Cilk++)
• … is still hard
  1. Explicit synchronization
  2. Non-deterministic execution

Can we Make it Easier?
• We did it before!
  • OOO, Superscalar microarchitectures
• Can we do it again?
  • Utilize architectural support to hide the nature of the underlying coarse-grain parallel hardware

The Challenge
[Figure: design space with axes "Architectural support" and "Programming model". Point 1: no architectural support, explicitly parallel programming. Point 2: complex architectural support, sequential programming model.]

Versioning
• A novel synchronization mechanism
• Dynamically detects and enforces dependences among coarse-grain units of computation (tasks) [callout: Function call]
• Parallel programming with sequential semantics
• Provides
  1. Implicit synchronization
  2. Deterministic execution

Outline
• Motivation
• Versioning
• Architectural Support
• Programming Support
• Evaluation
• Conclusions and Future Work

Versioning: Basic Idea
• Each running task maintains a pair of numbers for each shared memory location it accesses: an acquire number (ACQ) and a release number (REL), held in its Local Version Table (LVT)
• Every shared memory location is assigned a version number, held in the Global Version Table (GVT)
[Figure: LVTs of CPU 1, CPU 2 and CPU 3 with ACQ and REL columns, shown next to the GVT; the example values are not recoverable from the transcript.]

The Anatomy of a Memory Access
[Flowchart]
• Memory access → LVT access
• LVT hit?
  • No → Proceed with memory access
  • Yes → GVT access
• Version match?
  • Yes → Proceed with memory access
  • No → Wait
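The flowchart can be modelled in a few lines of Python. This is a minimal sketch, not the ROKO implementation: the names `Task`, `try_access` and `release`, the dictionary layout, and in particular the release step (publishing the task's release number to the GVT once the task is done with a location) are assumptions made for illustration. The slides only state that an access proceeds on a version match and waits otherwise.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A task with its Local Version Table: address -> (acquire, release)."""
    name: str
    lvt: dict = field(default_factory=dict)


def try_access(task, gvt, addr):
    """Return True if the access may proceed, False if the task must wait."""
    entry = task.lvt.get(addr)
    if entry is None:
        # LVT miss: the location is not tracked for this task.
        return True
    acquire, _release = entry
    # GVT access followed by the version match check.
    return gvt.get(addr, 0) == acquire


def release(task, gvt, addr):
    """Assumed release step: publish the task's release number to the GVT."""
    _acquire, rel = task.lvt[addr]
    gvt[addr] = rel


# Two tasks touch address 0x10; in sequential order t1 comes before t2.
gvt = {0x10: 0}
t1 = Task("t1", {0x10: (0, 1)})
t2 = Task("t2", {0x10: (1, 2)})

t2_before = try_access(t2, gvt, 0x10)  # t2 is early: version mismatch
t1_ok = try_access(t1, gvt, 0x10)      # t1's acquire number matches version 0
release(t1, gvt, 0x10)                 # version becomes 1
t2_after = try_access(t2, gvt, 0x10)   # now t2's acquire number matches
untracked = try_access(t2, gvt, 0x20)  # LVT miss proceeds directly
```

Under these assumptions, chaining each task's release number to the acquire number of the next task in sequential order is what enforces the dependence without explicit locks.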

Architectural Support for Versioning
• On-chip LVT: stores task's numbers for shared data
• Versioning Co-processor: implements the logic of a memory access; creates version numbers
• Counting Bloom Filter (CBF): stores a conservative estimate of task's shared data
• In-memory GVT: stores version numbers for shared data

The Anatomy of a Memory Access (2)
[Flowchart]
• Memory access → CBF access
• CBF hit?
  • No → Proceed with memory access
  • Yes → LVT access
• LVT hit?
  • No → Proceed with memory access
  • Yes → GVT access
• Version match?
  • No → Wait (Trap)
  • Yes → Update the CBF, then proceed with memory access
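A rough Python model of the filtered access path follows. It is a sketch under stated assumptions rather than the hardware logic: the single-hash filter (`bucket`), the `acquired` flag, and the interpretation of "Update the CBF" as removing a location from the filter once its version has matched are all hypothetical choices made for illustration.

```python
from collections import Counter

NUM_BUCKETS = 8  # deliberately tiny so that false positives can occur


def bucket(addr):
    """Hypothetical single hash function of the counting filter."""
    return addr % NUM_BUCKETS


def access(addr, cbf, lvt, gvt):
    """Return 'proceed' or 'wait' for one memory access by a task."""
    if cbf[bucket(addr)] == 0:
        return "proceed"              # CBF miss: no version check needed
    entry = lvt.get(addr)
    if entry is None or entry["acquired"]:
        return "proceed"              # CBF false positive or already acquired
    if gvt.get(addr, 0) != entry["acq"]:
        return "wait"                 # version mismatch
    entry["acquired"] = True
    cbf[bucket(addr)] -= 1            # assumed CBF update: stop monitoring addr
    return "proceed"


lvt = {
    3: {"acq": 0, "rel": 1, "acquired": False},
    5: {"acq": 2, "rel": 3, "acquired": False},
}
gvt = {3: 0, 5: 1}
cbf = Counter(bucket(a) for a in lvt)  # one count per monitored address

r_private = access(4, cbf, lvt, gvt)   # CBF miss
r_false_pos = access(11, cbf, lvt, gvt)  # shares bucket 3, but LVT miss
r_match = access(3, cbf, lvt, gvt)     # version 0 matches acquire number 0
r_wait = access(5, cbf, lvt, gvt)      # version 1 does not match 2
```

The `acquired` flag keeps the counter from being decremented twice for the same address, which would otherwise introduce false negatives for other addresses sharing the bucket.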

Programming Support
• Pragma parallel
  • Asynchronous execution of function calls
• Pragma access expression <read>
  • Describes functions' side-effects
  • Can handle pointers, array and recursive accesses

Experimental Evaluation
• Prototype implementation: ROKO
• Software Platform
  • The Roko pre-compiler
    • CLANG/LLVM framework
  • The Roko run-time
    • C & Assembly
• Hardware Platform
  • FPGA-based SMP system modified in order to support versioning
  • LEON3, SPARC V8 compliant processor

Benchmarks

Parallelization
• Equivalent to 16% of 16KB L1 data cache

Application Results

Access Monitoring Overheads
[Figure: percentage of the time spent executing LVT and GVT lookups, shown alongside the memory-access flowchart (CBF access, LVT access, GVT access, wait, update the CBF).]

Related Work
• Architectural support for parallel programming
• Programming models with implicit synchronization
  • Prometheus
  • Deterministic Parallel Java (DPJ)

Conclusions
• Versioning provides deterministic execution and alleviates the need for explicit synchronization
• Support for concurrent reads
• Assigning version numbers to adjustable regions of memory locations
• Architectural support for versioning does not require intrusive changes to the processor

Conclusions (continued)
• Proof-of-concept FPGA implementation delivers good performance in terms of application speedup
• Low timing overheads
• Requires on-chip storage equivalent to 16% of 16KB L1 data cache

Future Work
• Support arbitrary units of computation
• Compiler support
  • Assist in reporting functions' side-effects
  • Filter data accesses that need to be monitored
• Applicability of versioning to explicitly synchronized parallel codes

Thank You!

Main Slides: Architectural Support for Synchronization-Free Deterministic Parallel Programming
• Motivation
• Versioning
• Architectural Support
• Programming Support
• Evaluation
• Conclusions and Future Work
Backup Slides
• ROKO in More Detail:
  • Software
  • Hardware
• Evaluation in More Detail:
  • Overheads
  • Memory Contention
  • TBB Comparison
• Versioning in More Detail:
  • Concurrent Reads
  • Variable-granularity
  • Algorithms
  • Programming Support

Concurrent Reads
• In addition to the acquire and release numbers, each task also maintains a delta number (Δ) for each shared memory address it accesses
• Every shared memory location is assigned a write number and a read number
[Figure: per-task tables for CPU 1 and CPU 2 with ACQ, REL and Δ columns, and the write and read numbers of a shared location. It illustrates (1) a read access and (2) a write access, with the checks "ACQ == write?", "ACQ == write?" and "Δ + read == R?". The example values are not recoverable from the transcript.]

Variable-granularity Versioning
• A task may use a different version number during acquire and release
• Version numbers are identified with an ID
[Figure: ACQ and REL tables for CPU 1, CPU 2 and CPU 3, with version numbers for the structure s and its fields s.a and s.b; the example values are not recoverable from the transcript.]

Versioning Algorithms (1)

Versioning Algorithms (2)
• Before a task is spawned, its access table (AT) is created
  • Maps the variables reported with the access pragma to memory locations
  • For recursive accesses, the actual corresponding memory locations are identified by traversing the structures in memory, until a NULL pointer is encountered
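The access table construction described above might look roughly like the following Python sketch. The function `build_access_table`, the tuple layout, the use of `id()` as a stand-in for a memory address, and the `"write"` mode are assumptions for illustration only; the slides name only the `<read>` mode and do not give the table format.

```python
class Node:
    """A singly linked list node; `next` is None at the end (the NULL pointer)."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def build_access_table(reported):
    """Map reported variables to memory locations.

    reported: name -> (object, mode, link), where link is the name of the
    pointer field to follow for a recursive access, or None otherwise.
    Returns a list of (label, location, mode) tuples.
    """
    table = []
    for name, (obj, mode, link) in reported.items():
        if link is None:
            table.append((name, id(obj), mode))
            continue
        node, depth = obj, 0
        while node is not None:  # traverse until the NULL pointer
            table.append((f"{name}[{depth}]", id(node), mode))
            node = getattr(node, link)
            depth += 1
    return table


c = Node(3)
b = Node(2, c)
a = Node(1, b)
counter = [0]

table = build_access_table({
    "head": (a, "read", "next"),       # recursive access through `next`
    "counter": (counter, "write", None),
})
labels = [label for label, _loc, _mode in table]
```

Note that this traversal assumes an acyclic structure; a cyclic list would never reach the terminating NULL pointer.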

Versioning Algorithms (3)

Describing Function's Side-Effects
• Describing pointer accesses
• Describing array accesses
• Describing recursive accesses
• Some accesses may be omitted
  • Read-only data
  • Structure entry points

ROKO Software
• Precompiler
• Bare-C Cross Compilation System (BCC)
  • GNU C/C++ cross-compiler (3.4.4)
  • GNU Binutils-2.16.1
  • Newlib 1.13.0 Embedded C-library
• Scheduler
  • A task is scheduled in its respective sequential order
  • On a version mismatch:
    • If there exists a currently non-running sequentially earlier task, switch
    • Otherwise, continue by re-executing the faulting memory instruction
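The scheduler's reaction to a version mismatch can be sketched as a small decision function. This is an illustrative model only: the `TaskState` record, the function name `on_version_mismatch`, and the choice of the earliest eligible task are assumptions, since the slides do not say which earlier task is picked when several qualify.

```python
from dataclasses import dataclass


@dataclass
class TaskState:
    seq: int               # position in sequential program order
    running: bool = False
    done: bool = False


def on_version_mismatch(current_seq, tasks):
    """Decide what the CPU does after a version mismatch in task current_seq.

    Returns ("switch", seq) if a sequentially earlier task that is neither
    running nor finished exists, otherwise ("retry", current_seq), meaning
    the faulting memory instruction is re-executed.
    """
    for task in sorted(tasks, key=lambda t: t.seq):
        if task.seq >= current_seq:
            break
        if not task.running and not task.done:
            return ("switch", task.seq)  # assumed policy: earliest eligible
    return ("retry", current_seq)


tasks = [
    TaskState(seq=0, done=True),
    TaskState(seq=1),                # spawned but not running
    TaskState(seq=2, running=True),
    TaskState(seq=3, running=True),  # the task that hit the mismatch
]
decision_switch = on_version_mismatch(3, tasks)

tasks[1].running = True              # now every earlier task is running or done
decision_retry = on_version_mismatch(3, tasks)
```

Preferring a sequentially earlier task is a natural fit here, because the mismatch means the faulting task is waiting on a version that only an earlier task can produce.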

ROKO Hardware
• LVT entry = 128 bits

Versioning Overheads

Benchmarks

Task Management Overheads (1)

Task Management Overheads (2)

Task Management Overheads (3)

Memory Contention

TBB Comparison
• Linux 2.6.21.1
• Issues:
  • Virtual memory, library functions (e.g., memory allocator and math functions) and process scheduling
• Effort:
  • Static compilation with the same math library, real-time scheduling in Linux
• Sequential execution times were within 5%
  • Health 23.3% increase
  • Union 111.9% increase
  • We suspect this is due to malloc

Parallelization

Bloom filter
• Fast structure
  • Holds only memory locations for which the comparison of version numbers must be done
• Probabilistic
  • May report false positives
    • May indicate that the LVT needs to be accessed when not necessary
  • But never reports false negatives
    • A CBF miss always means that the memory access can be executed
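To make the two properties concrete, here is a small counting Bloom filter in Python. It is a generic textbook-style sketch, not the ROKO hardware design: the class name, the table size, the number of hash functions and the SHA-256-based hashing are all assumptions chosen for readability.

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter over integer addresses."""

    def __init__(self, size=32, hashes=3):
        self.size = size
        self.hashes = hashes
        self.counts = [0] * size

    def _indexes(self, addr):
        # Derive `hashes` counter positions from the address.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, addr):
        for i in self._indexes(addr):
            self.counts[i] += 1

    def remove(self, addr):
        # Only call for addresses that were added earlier.
        for i in self._indexes(addr):
            self.counts[i] -= 1

    def might_contain(self, addr):
        # False means "definitely absent"; True means "possibly present".
        return all(self.counts[i] > 0 for i in self._indexes(addr))


cbf = CountingBloomFilter()
monitored = list(range(50))
for addr in monitored:
    cbf.add(addr)

# No false negatives: every monitored address is reported as a hit.
all_hits = all(cbf.might_contain(a) for a in monitored)

# False positives are possible: unmonitored addresses may also hit.
false_positives = sum(cbf.might_contain(a) for a in range(1000, 1100))

# Counters allow removal, which a plain bit-vector Bloom filter cannot do.
for addr in monitored:
    cbf.remove(addr)
empty_again = not any(cbf.counts)
```

A false positive only costs an unnecessary LVT lookup, whereas a false negative would skip a required version check, which is why the filter is built to err on the conservative side.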
