ECE 1147, Parallel Computation, Oct. 30, 2006. Implementation and Performance of Munin (Distributed Shared Memory System). Dongying Li. (Original Authors: J. B. Carter, et al.). Department of Electrical and Computer Engineering, University of Toronto.
Distributed Shared Memory • Shared address space spanning the processors of a distributed memory multiprocessor • [Figure: proc1, proc2, and proc3 all see X=0 in the shared address space]
Distributed Shared Memory • [Figure: proc0 ... procN, each with its own memory mem0 ... memN, connected by a network and presented as one shared memory]
Distributed Shared Memory • Design objectives • Good performance comparable to shared memory programs • No significant deviation from shared memory coding model • Low communication and message passing overheads
Munin System • Characteristic features • Software release consistency • Multiple consistency protocols • Same interface as the shared memory programming model • Threads, syncs, data sharing, etc. • Deviations • All shared variables annotated with their access pattern • Syncs explicitly visible to the runtime system (important for release consistency!)
Contents • Basic concepts • Shared object • Software release consistency • Multiple consistency protocols • Software implementation • Prototype overview • Execution process • Advanced programming features • Data object directory and delayed update queue • Synchronization • Performance • Overview of other DSM systems • Conclusion
Basic Concepts
Shared Object • [Figure: three 8-kilo regions holding the shared variables x and y]
Software Release Consistency • Sequential consistency • All processors observe the same order of operations • That order must correspond to some serial order • The only ordering constraint is that each processor's reads and writes appear in its own program order; there is no restriction on the relative ordering between processors • Synchronous read/write • Writes must be propagated before moving on to the next operation
Software Release Consistency • Special weak consistency protocol • Reduction of message passing overhead • Two categories of shared variable operations • Ordinary access • Read • Write • Synchronization access (lock, semaphore, barrier) • Acquire • Release
Software Release Consistency • Before an ordinary access (read or write) is allowed to perform, all previous acquires must have performed • Before a release is allowed to perform, all previous ordinary accesses must have performed • Before an acquire is allowed to perform, all previous releases must have performed • Before a release is allowed to perform, all previous acquires must have performed • In short: the results of writes made before a release are propagated before the next processor acquires the released lock
Release Consistency • Writes are propagated at release
Multiple Consistency Protocols • No single consistency protocol suits every parallel program • Shared variables are accessed in different ways within a single program • A variable's access pattern can change during execution • Multiple protocols allow per-variable tuning based on access pattern
Multiple Consistency Protocols • High-level sharing pattern annotations • Specified in the shared variable declaration • Combinations of low-level protocol parameters • Low-level protocol parameters • Specified in the shared variable directory • Each controls one specific aspect of the protocol
Protocol Parameters • I: Propagate an invalidation or an update after modification? • R: Replicas allowed on other nodes? • D: Delayed operations (update, invalidation) allowed? • FO: Fixed owner (no writes at other nodes)? • M: Multiple writers allowed? • S: Stable sharing pattern (accessed by a fixed set of threads)? • FL: Flush changes to owner and invalidate local copy? • W: Writable?
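One way to picture these eight parameters is as a flag word stored per shared object. The sketch below is only an illustration under assumptions: the names `proto_flags`, `PROTO_*`, `READ_ONLY_FLAGS`, and `may_write_locally` are hypothetical, the choice that a set I bit means "invalidate" is arbitrary, and the read-only setting is inferred solely from the annotation's description (replicable, not writable). It is not Munin's actual data layout.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding: one bit per protocol parameter from the slide. */
typedef unsigned int proto_flags;

enum {
    PROTO_I  = 1u << 0, /* set: invalidate after modification; clear: update */
    PROTO_R  = 1u << 1, /* replicas allowed on other nodes */
    PROTO_D  = 1u << 2, /* delayed operations allowed */
    PROTO_FO = 1u << 3, /* fixed owner */
    PROTO_M  = 1u << 4, /* multiple writers allowed */
    PROTO_S  = 1u << 5, /* stable sharing pattern */
    PROTO_FL = 1u << 6, /* flush changes to owner, invalidate local copy */
    PROTO_W  = 1u << 7  /* writable */
};

/* Illustrative setting only: read-only data may be replicated, never written. */
static const proto_flags READ_ONLY_FLAGS = PROTO_R;

/* Assumed rule for this sketch: a node may write its local copy if the
   object is writable and the node either owns it or multiple writers
   are allowed. */
static bool may_write_locally(proto_flags f, bool is_owner)
{
    /* "multiple writers" only makes sense for a writable object */
    assert(!(f & PROTO_M) || (f & PROTO_W));
    if (!(f & PROTO_W))
        return false;
    return is_owner || (f & PROTO_M) != 0;
}
```

With this encoding, `may_write_locally(READ_ONLY_FLAGS, true)` is false even for the owner, while an object flagged `PROTO_W | PROTO_M` can be written on any node.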
Sharing annotations • Read only • Simplest pattern: once initialized, never modified • Suitable for constants, etc. • Migratory • Accessed by only one thread at a time • Suitable for variables accessed only inside a critical section • Write-shared • Can be written concurrently by multiple threads • Different threads update different words of the variable • Producer-consumer • Written by only one thread and read by others • Replicate and update the object rather than invalidate it
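To see why concurrent writers to different words of a write-shared variable need not conflict, consider one common approach: keep an unmodified copy of the object (a "twin"), compare it word by word with the modified copy, and apply only the changed words elsewhere. The slide does not describe the mechanism, so treat the following as a minimal sketch under that assumption; `merge_diff` and the word granularity (`int`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Copy into dst only the words that differ between a writer's modified
   copy and its pristine twin. Returns the number of words merged. */
static size_t merge_diff(int *dst, const int *modified, const int *twin,
                         size_t nwords)
{
    assert(dst && modified && twin);
    size_t changed = 0;
    for (size_t i = 0; i < nwords; i++) {
        if (modified[i] != twin[i]) {
            dst[i] = modified[i];   /* this word was written locally */
            changed++;
        }
    }
    return changed;
}
```

If two threads start from the same twin and write disjoint words, merging both diffs into a third copy yields both updates, regardless of the order in which the diffs arrive.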
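Sharing annotations • Example: producer-consumer
    for (t = 0; t < timesteps; t++) {   /* some number of timesteps/iterations */
        /* new value of each interior point is the average of its four neighbours */
        for (i = 1; i < n - 1; i++)
            for (j = 1; j < n - 1; j++)
                temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                     grid[i][j-1] + grid[i][j+1]);
        /* copy the interior back into grid for the next iteration */
        for (i = 1; i < n - 1; i++)
            for (j = 1; j < n - 1; j++)
                grid[i][j] = temp[i][j];
    }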
Sharing annotations • Reduction • Accessed via fetch-and-op operations (read, write, then release) • Examples: min(), a++ • Result • Phase 1: multiple writers allowed • Phase 2: one thread (the one collecting the result) accesses it exclusively • Conventional • Conventional protocol for shared variables
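The fetch-and-op semantics of a reduction variable such as min() can be illustrated on a single machine with C11 atomics. This is an analogy only, not Munin's distributed mechanism; `fetch_and_min` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomically replace *shared with min(*shared, candidate).
   Returns the value observed just before the operation took effect. */
static int fetch_and_min(atomic_int *shared, int candidate)
{
    assert(shared);
    int old = atomic_load(shared);
    /* Retry until the stored value is already <= candidate, or until
       we succeed in installing candidate. On a failed exchange, old is
       refreshed with the current value and the condition is re-checked. */
    while (candidate < old &&
           !atomic_compare_exchange_weak(shared, &old, candidate)) {
        /* empty: the loop condition does the work */
    }
    return old;
}
```

Each caller sees the read, the comparison, and the write as one indivisible step, which is the property the reduction annotation relies on.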
Sharing annotations • [Figure: timelines of w(x) and r(x) accesses by several threads]
Software Implementation
Prototype Overview • A simple preprocessor that converts annotations into a suitable format • A linker that creates the shared memory segment • Library routines linked into the program • Operating system support for page fault handling and page table manipulation
Execution Process • Compiling • [Figure: sharing annotations pass through the Munin preprocessor to produce auxiliary files; the linker then builds the shared data description table and the shared data segment]
Execution Process • Initialization • [Figure: on P1, the Munin root thread and the user root thread running User_init(); the code and data segment are copied to P2 ... Pn, each of which runs a Munin worker thread]
Execution Process • Synchronization • [Figure: a synchronization operation passing between the Munin root thread on P1 and the user thread and Munin worker thread on P2 ... Pn]
Advanced Programming Features • Associating data with synchronization objects • [Figure: timelines of rel(m), acq(m), and the messages between them, together with the r(x) and w(x) accesses they protect]
Advanced Programming Features • PhaseChange() • Changes the producer-consumer relationship • Example: adaptive mesh SOR • ChangeAnnotation() • Changes the access pattern during execution • Invalidate() • Flush() • SingleObject() • PreAcquire()
Data Object Directory • Start address and size • Protocol parameters • Object state (valid, writable, invalid) • Copyset (which remote processors have copies) • Synchq (the corresponding synchronization object) • Probable owner • Home node • Access control semaphore • Links
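The fields listed above map naturally onto a record type. The layout below is a hypothetical sketch: the field names follow the slide, but the types, the 32-node limit on the copyset bitmap, the integer stand-in for the access control semaphore, and the helper functions are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum obj_state { OBJ_INVALID, OBJ_VALID, OBJ_WRITABLE };

/* Hypothetical layout of one data object directory entry. */
struct dir_entry {
    void             *start;          /* start address of the object */
    size_t            size;           /* size in bytes */
    unsigned int      proto;          /* protocol parameter bits */
    enum obj_state    state;          /* valid, writable, or invalid */
    uint32_t          copyset;        /* bit n set: node n holds a copy */
    int               synchq;         /* id of the associated sync object */
    int               probable_owner; /* best guess at the current owner */
    int               home_node;      /* home node of the object */
    int               access_sem;     /* stand-in for the access control semaphore */
    struct dir_entry *next;           /* link to another entry */
};

/* Record that a remote node now holds a copy (nodes 0..31 in this sketch). */
static void copyset_add(struct dir_entry *e, unsigned node)
{
    assert(node < 32);
    e->copyset |= (uint32_t)1 << node;
}

/* Does the given node hold a copy? */
static bool copyset_has(const struct dir_entry *e, unsigned node)
{
    return node < 32 && ((e->copyset >> node) & 1u) != 0;
}
```

A bitmap keeps the copyset compact and makes "send to every holder" a simple loop over set bits; a real system with more nodes would need a wider representation.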
Delayed Update Queue • [Figure: writes w(x) and w(y) made between acq(m) and rel(m), with the queued entries for x and y]
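The idea of delaying updates until a release can be sketched as a small queue that records which objects were written and hands them over once at release time. This is a simplified illustration, not Munin's implementation; `struct duq`, `duq_note_write`, `duq_flush`, the integer object ids, and the fixed capacity `DUQ_MAX` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DUQ_MAX 16

/* Hypothetical delayed update queue: remembers which objects were
   written since the last release, each at most once. */
struct duq {
    int    ids[DUQ_MAX];
    size_t len;
};

/* Record a write to object id; repeated writes to the same object
   coalesce into a single pending update. */
static void duq_note_write(struct duq *q, int id)
{
    for (size_t i = 0; i < q->len; i++)
        if (q->ids[i] == id)
            return;                 /* already queued */
    assert(q->len < DUQ_MAX);
    q->ids[q->len++] = id;
}

/* At release: copy the pending object ids to out (capacity DUQ_MAX),
   empty the queue, and return how many updates must be propagated. */
static size_t duq_flush(struct duq *q, int *out)
{
    size_t n = q->len;
    for (size_t i = 0; i < n; i++)
        out[i] = q->ids[i];
    q->len = 0;
    return n;
}
```

In this sketch, two writes to x followed by one write to y leave two entries in the queue, so a release propagates one update per object rather than one message per write.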
Synchronization • Queue based synchronization • Request – reply – lock forward mechanism • CreateLock(), AcquireLock(), ReleaseLock(), CreateBarrier(), WaitAtBarrier()
Performance
Performance Problem with Munin • Note: poor performance for the task-queue model (TSP-Q, quicksort, etc.) • E.g., speedup for TSP (16 procs) with MPI: code I 8.9, code II 13.4 • Speedup with Munin: code I 6.0, code II 8.9 • Major overhead: time threads spend waiting at the lock that protects the work queue, caused by transferring the whole work queue between threads
Overview of Other DSM Systems
Overview of Other DSM Systems • Clouds: per-segment (object) based consistency protocol • Mirage: per-page based • Orca: reliable ordered broadcast protocol • Amber: user responsible for distributing data among processors • Linda: shared variables in tuple space; atomic operations: insertion, removal, reading • Midway: uses entry consistency (weaker than release consistency) • DASH: hardware DSM
Conclusion • Objective: an efficient DSM system with a programming model similar to shared memory programming and low message passing overhead • Special features: multiple protocols, software release consistency • Implementation: synchronization realized by the Munin root thread and Munin worker threads