OpenMP for Networks of SMPs Y. Charlie Hu, Honghui Lu, Alan L. Cox, Willy Zwaenepoel. ECE1747 – Parallel Programming Vicky Tsang. Background. Published in the Journal of Parallel and Distributed Computing, vol. 60 (12), pp. 1512-1530, December 2000 Work to further improve TreadMarks
OpenMP for Networks of SMPs • Y. Charlie Hu, Honghui Lu, Alan L. Cox, Willy Zwaenepoel • ECE1747 – Parallel Programming • Vicky Tsang
Background • Published in the Journal of Parallel and Distributed Computing, vol. 60 (12), pp. 1512-1530, December 2000 • Work to further improve TreadMarks • Presents an alternative solution to MPI
Roadmap • Motivation • Solution • OpenMP API • TreadMarks • OpenMP Translator • Performance Measurement • Results • Conclusion
Motivation • To enable the programmer to rely on a single, standard, shared-memory API for parallelization within and between multiprocessors. • To provide a standard alternative to MPI?
Solution • Presents the first system that implements OpenMP on a network of shared-memory multiprocessors • Implemented via a translator converting OpenMP directives to calls in modified TreadMarks • Modified TreadMarks uses POSIX threads for parallelism within an SMP node
Solution • Original version of TreadMarks: • A Unix process was executed on each processor of the multiprocessor node and communication between processes was achieved through message passing • Fails to take advantage of hardware shared memory
Solution • Modified version of TreadMarks • POSIX threads used to implement parallelism • OpenMP threads within a multiprocessor share a single address space • Positive: • Reduces the number of changes to TreadMarks to support multithreading on a multiprocessor • OS maintains the coherence of page mappings automatically • Negative: • More difficult to provide uniform sharing of memory between threads on the same node and threads on different nodes
OpenMP API • Three kinds of directives: • Parallelism/work sharing • Data environment • Synchronization • Based on a fork-join model • Sequential code sections executed by master thread • Parallel code sections are executed by all threads, including the master thread
OpenMP API • Parallel directive – all threads perform the same computation • Work sharing directive – computation is divided among the threads • Data environment directive – control the sharing of program variables • Synchronization directive – control the synchronization between threads
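A minimal C sketch of the three kinds of directives, for illustration only. The function name `count_multiples` and its parameters are hypothetical and do not come from the paper. If the file is compiled without OpenMP support, the pragmas are ignored and the function runs sequentially with the same result.

```c
#include <assert.h>

/* Counts how many of the n values in data are multiples of k. */
long count_multiples(const int *data, long n, int k)
{
    long count = 0;   /* one copy, shared by every thread in the team */

    assert(k != 0);

    /* Parallelism: fork a team of threads; each executes the block below.
       Data environment: count is explicitly shared. */
    #pragma omp parallel shared(count)
    {
        long local = 0;   /* declared inside the region, so private per thread */
        long i;

        /* Work sharing: loop iterations are divided among the threads. */
        #pragma omp for
        for (i = 0; i < n; i++) {
            if (data[i] % k == 0) {
                local++;
            }
        }

        /* Synchronization: one thread at a time updates the shared total. */
        #pragma omp critical
        count += local;
    }   /* join: only the master thread continues past this point */

    return count;
}
```

For example, calling `count_multiples` on the array `{3, 6, 7, 9, 10, 12}` with `k = 3` counts the four multiples of three.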
TreadMarks • User-level SDSM system • Provides a global shared address space on top of physically distributed memories • Key functions performed are memory coherence and synchronization
TreadMarks – Memory Coherence • Minimize the amount of communication performed to maintain memory consistency by: • a lazy implementation of release consistency • reducing the impact of false sharing by allowing multiple concurrent writers to modify a page • Propagation of consistency information is postponed until the time of an acquire
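The multiple-writer idea can be sketched in C as follows. The slide does not describe the mechanism, so this is an assumption-laden illustration: it uses the twin-and-diff technique commonly associated with multiple-writer protocols, a tiny 8-word "page", and hypothetical names (`make_diff`, `apply_diff`, `merge_two_writers_demo`). It is not the TreadMarks code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_WORDS 8   /* tiny "page" for illustration; real pages are far larger */

/* One diff entry: the word at index idx now holds value. */
struct diff_entry { size_t idx; int value; };

/* Compare a modified page with its twin (the copy saved before the first
   write) and record only the words that changed. Returns the entry count. */
size_t make_diff(const int *twin, const int *page, struct diff_entry *out)
{
    size_t n = 0;
    for (size_t i = 0; i < PAGE_WORDS; i++) {
        if (page[i] != twin[i]) {
            out[n].idx = i;
            out[n].value = page[i];
            n++;
        }
    }
    return n;
}

/* Apply a diff to another copy of the page. */
void apply_diff(int *page, const struct diff_entry *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(d[i].idx < PAGE_WORDS);
        page[d[i].idx] = d[i].value;
    }
}

/* Two writers modify different words of the same page (false sharing).
   Merging both diffs preserves both updates. Returns 1 on success. */
int merge_two_writers_demo(void)
{
    int original[PAGE_WORDS] = {0};
    int writer_a[PAGE_WORDS], writer_b[PAGE_WORDS], merged[PAGE_WORDS];
    struct diff_entry da[PAGE_WORDS], db[PAGE_WORDS];

    memcpy(writer_a, original, sizeof original);
    memcpy(writer_b, original, sizeof original);
    memcpy(merged, original, sizeof original);

    writer_a[1] = 11;   /* writer A touches word 1 */
    writer_b[6] = 66;   /* writer B touches word 6 */

    size_t na = make_diff(original, writer_a, da);
    size_t nb = make_diff(original, writer_b, db);

    /* In a lazy scheme the diffs would be fetched at the next acquire;
       here they are simply applied one after the other. */
    apply_diff(merged, da, na);
    apply_diff(merged, db, nb);

    return na == 1 && nb == 1 &&
           merged[1] == 11 && merged[6] == 66 && merged[0] == 0;
}
```

Because each diff carries only the words its writer changed, neither writer overwrites the other's update, which is why false sharing costs less than it would with whole-page transfers.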
TreadMarks - Synchronization • Barrier implemented as acquire and release messages • Governed by a centralized manager
TreadMarks – Modifications for OpenMP • Inclusion of two primitives: • Tmk_fork • Tmk_join • All threads created at the start of a program’s execution to minimize overhead. • Slave threads are blocked during sequential execution until the next Tmk_fork is issued by the master thread.
TreadMarks – Modifications for Networks of Multiprocessors • POSIX threads enable sharing of data between processors within a node. Some data structures, such as message buffers, were placed in thread-private memory because their data must remain private to a thread. • A per-page mutex was added to allow greater concurrency in the page fault handler. • Synchronization functions in TreadMarks were modified to use POSIX thread-based synchronization between processors within a node and the existing TreadMarks synchronization functions between nodes. • A second mapping was added for the memory that is shared between nodes, so shared-memory pages can be updated while the first mapping remains invalid until the update is complete. This reduces the number of page protection operations performed by TreadMarks.
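The second-mapping idea can be illustrated with standard POSIX calls. This is a sketch under assumptions, not the TreadMarks implementation: it backs one page with a temporary file, and the function name `double_mapping_demo` and the variable names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Maps the same page twice: an application view that stays inaccessible,
   and an update view the runtime can always write. Returns 1 on success. */
int double_mapping_demo(void)
{
    const char *msg = "updated";
    long page = sysconf(_SC_PAGESIZE);
    FILE *backing = tmpfile();   /* backing object for the shared page */
    int ok = 0;

    if (backing == NULL || page <= 0) {
        if (backing != NULL) fclose(backing);
        return 0;
    }
    assert(strlen(msg) < (size_t)page);

    int fd = fileno(backing);
    if (ftruncate(fd, (off_t)page) != 0) {
        fclose(backing);
        return 0;
    }

    /* First mapping: what application threads see. PROT_NONE models an
       invalid page, so any access would fault. */
    char *app_view = mmap(NULL, (size_t)page, PROT_NONE, MAP_SHARED, fd, 0);
    /* Second mapping: same physical page, always readable and writable. */
    char *update_view = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);

    if (app_view != MAP_FAILED && update_view != MAP_FAILED) {
        /* Bring the page up to date through the second mapping while the
           first mapping is still invalid. */
        strcpy(update_view, msg);

        /* Only now make the application view accessible: one protection
           change instead of write-enabling and re-protecting during the update. */
        ok = mprotect(app_view, (size_t)page, PROT_READ) == 0 &&
             strcmp(app_view, msg) == 0;
    }

    if (app_view != MAP_FAILED) munmap(app_view, (size_t)page);
    if (update_view != MAP_FAILED) munmap(update_view, (size_t)page);
    fclose(backing);
    return ok;
}
```

Since other threads in the node share the application view, keeping it invalid until the update finishes prevents them from reading a half-updated page.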
OpenMP Translator • Synchronization directives translate directly to TreadMarks synchronization operations. • The compiler translates the code sections marked with parallel directives to fork-join code. • Data environment directives are implemented to work with both TreadMarks and POSIX threads, hiding the interface issues from the programmer.
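A rough C sketch of what translating a parallel region to fork-join code could look like. The slides do not give the signatures of Tmk_fork and Tmk_join, so the sketch uses plain POSIX threads as a stand-in; `region_body`, `region_args`, and `run_translated_loop` are hypothetical names. Unlike the modified TreadMarks, which creates all threads once at program start, this stand-in creates threads at each fork.

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4
#define N 1000

/* Original OpenMP source, for reference:
 *     #pragma omp parallel for
 *     for (i = 0; i < N; i++) a[i] = 2 * i;
 */

static int a[N];

struct region_args { int thread_id; int nthreads; };

/* The parallel region, outlined into its own procedure. Each thread
   works on one contiguous block of iterations. */
static void *region_body(void *p)
{
    const struct region_args *args = p;
    assert(args->nthreads > 0);

    int chunk = (N + args->nthreads - 1) / args->nthreads;
    int lo = args->thread_id * chunk;
    int hi = (lo + chunk < N) ? lo + chunk : N;

    for (int i = lo; i < hi; i++) {
        a[i] = 2 * i;
    }
    return NULL;
}

/* Stand-in for the generated fork-join code. Returns 1 if every
   element of a[] was computed correctly. */
int run_translated_loop(void)
{
    pthread_t tid[NTHREADS];
    struct region_args args[NTHREADS];
    int created[NTHREADS] = {0};

    /* "Fork": start the slave threads on the outlined procedure. */
    for (int t = 1; t < NTHREADS; t++) {
        args[t].thread_id = t;
        args[t].nthreads = NTHREADS;
        created[t] = (pthread_create(&tid[t], NULL, region_body, &args[t]) == 0);
        if (!created[t]) {
            region_body(&args[t]);   /* fallback: master runs this block */
        }
    }

    /* The master thread also executes the parallel region. */
    args[0].thread_id = 0;
    args[0].nthreads = NTHREADS;
    region_body(&args[0]);

    /* "Join": wait for the slaves before resuming sequential code. */
    for (int t = 1; t < NTHREADS; t++) {
        if (created[t]) pthread_join(tid[t], NULL);
    }

    for (int i = 0; i < N; i++) {
        if (a[i] != 2 * i) return 0;
    }
    return 1;
}
```

In the multi-node case the fork would also have to reach threads on other nodes, which this single-process sketch does not attempt to model.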
Performance Measurement • Platform • IBM SP2 consisting of four SMP nodes • Per node: • Four IBM PowerPC 604 processors • 1 GB memory • Running AIX 4.2
Performance Measurement • Applications • SPLASH-2 Barnes-Hut • NAS 3D-FFT • SPLASH-2 CLU • SPLASH-2 Water • Red-Black SOR • TSP • Modified Gram-Schmidt (MGS)
Conclusion • Enables the programmer to rely on a single, standard, shared-memory API for parallelization within and between multiprocessors. • Using hardware shared memory within a node reduced the amount of data and the number of messages transmitted. • The speedups of multithreaded TreadMarks codes on four four-way SMP SP2 nodes are within 7-30% of the MPI versions.
Critique • Solution allows easier implementation of program parallelization across multiprocessors if speedup is not crucial • OpenMP is easier on the programmer but speedup still not as good as MPI
Critique • Issues: • AIX has an inefficient implementation of page protection • Paper claims that every other brand of Unix, including Linux, uses data structures that handle mprotect operations more efficiently • Why wasn’t the solution implemented on another platform? • Paper failed to present a strong motivation for using this solution over MPI.