TreadMarks: Shared Memory Computing on Networks of Workstations C. Amza, A. L. Cox, S. Dwarkadas, P. J. Keleher, H. Lu, R. Rajamony, W. Yu, W. Zwaenepoel Rice University
INTRODUCTION • Distributed shared memory is a software abstraction allowing a set of workstations connected by a LAN to share a single paged virtual address space • Key issue in building a software DSM is minimizing the amount of data communication among the workstation memories
Why bother with DSM? • Key idea is to build fast parallel computers that are • Cheaper than shared memory multiprocessor architectures • As convenient to use
Conventional parallel architecture • [Figure: several CPUs, each with its own cache, all connected to a single shared memory]
Today’s architecture • Clusters of workstations are much more cost effective • No need to develop complex bus and cache structures • Can use off-the-shelf networking hardware • Gigabit Ethernet • Myrinet (1.5 Gb/s) • Can quickly integrate newest microprocessors
Limitations of cluster approach • Communication within a cluster of workstations is through message passing • Much harder to program than concurrent access to a shared memory • Many big programs were written for shared memory architectures • Converting them to a message passing architecture is a nightmare
Distributed shared memory • [Figure: the main memories of the individual workstations combined by DSM into one shared global address space]
Distributed shared memory • DSM makes a cluster of workstations look like a shared memory parallel computer • Easier to write new programs • Easier to port existing programs • Key problem is that DSM only provides the illusion of having a shared memory architecture • Data must still move back and forth among the workstations
Munin • Developed at Rice University • Based on software objects (variables) • Used the processor's virtual memory hardware to detect accesses to the shared objects • Included several techniques for reducing consistency-related communication • Only ran on top of the V kernel
Munin main strengths • Excellent performance • Portability of programs • Allowed programs written for a multiprocessor architecture to run on a cluster of workstations with a minimal number of changes ("dusty decks")
Munin main weakness • Very poor portability of Munin itself • Depended on some features of the V kernel • Not maintained since the late 80's
TreadMarks • Provides DSM as an array of bytes • Like Munin, • Uses release consistency • Offers a multiple writer protocol to fight false sharing • Runs at user-level on a number of UNIX platforms • Offers a very simple user interface
First example: Jacobi iteration • Illustrates the use of barriers • A barrier is a synchronization primitive that forces processes accessing it to wait until all processes have reached it • Forces processes to wait until all of them have completed a specific step
Jacobi iteration: overall organization • Operates on a two-dimensional array • Each processor (Proc 0 … Proc n-1) works on a specific band of rows • Boundary rows are shared
Jacobi iteration: overall organization • During each iteration step, each array element is set to the average of its four neighbors • Averages are stored in a scratch matrix and copied later into the shared matrix
Jacobi iteration: the barriers • Mark the end of each computation phase • Prevent processes from continuing the computation before all other processes have completed the previous phase and the new values are "installed" • Include an implicit release() followed by an implicit acquire() • To be explained later
Jacobi iteration: declarations
#define M
#define N
float *grid;          // shared array
float scratch[M][N];  // private array
Jacobi iteration: startup
main() {
  Tmk_startup();
  if (Tmk_proc_id == 0) {
    grid = Tmk_malloc(M*N*sizeof(float));
    initialize grid;
  } // if
  Tmk_barrier(0);
  length = M/Tmk_nprocs;
  begin = length*Tmk_proc_id;
  end = length*(Tmk_proc_id + 1);
Jacobi iteration: main loop
  for (number of iterations) {
    for (i = begin; i < end; i++)
      for (j = 0; j < N; j++)
        scratch[i][j] = (grid[i-1][j] + grid[i+1][j]
                       + grid[i][j-1] + grid[i][j+1])/4;
    Tmk_barrier(1);
    for (i = begin; i < end; i++)
      for (j = 0; j < N; j++)
        grid[i][j] = scratch[i][j];
    Tmk_barrier(2);
  } // main loop
} // main
Second example: TSP • Traveling salesman problem • Finding the shortest path through a number of cities • Program keeps a queue of partial tours • Most promising at the end
TSP: declarations
queue_type *Queue;
int *Shortest_length;
int queue_lock_id, min_lock_id;
TSP: startup
main() {
  Tmk_startup();
  queue_lock_id = 0;
  min_lock_id = 1;
  if (Tmk_proc_id == 0) {
    Queue = Tmk_malloc(sizeof(queue_type));
    Shortest_length = Tmk_malloc(sizeof(int));
    initialize Queue and Shortest_length;
  } // if
  Tmk_barrier(0);
TSP: while loop
  while (true) {
    Tmk_lock_acquire(queue_lock_id);
    if (queue is empty) {
      Tmk_lock_release(queue_lock_id);
      Tmk_exit();
    } // if
    Keep adding to queue until a long, promising tour appears at the head
    Path = Delete the tour from the head
    Tmk_lock_release(queue_lock_id);
TSP: end of main
    length = recursively try all cities not on Path, find the shortest tour length
    Tmk_lock_acquire(min_lock_id);
    if (length < *Shortest_length)
      *Shortest_length = length;
    Tmk_lock_release(min_lock_id);
  } // while
} // main
Critical sections • All accesses to shared variables are surrounded by the pair
Tmk_lock_acquire(lock_id);
…
Tmk_lock_release(lock_id);
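The same acquire/release pattern, sketched with POSIX mutexes standing in for Tmk_lock_acquire/Tmk_lock_release (an analogy, not the TreadMarks API):

```c
#include <pthread.h>

#define NTHREADS 4
#define NITERS   10000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;   /* plays the role of a shared variable */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);    /* Tmk_lock_acquire(lock_id) */
        counter++;                    /* access to shared data     */
        pthread_mutex_unlock(&lock);  /* Tmk_lock_release(lock_id) */
    }
    return NULL;
}

long run_lock_demo(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return counter;   /* NTHREADS * NITERS when updates are protected */
}
```

Without the lock the increments would race; with it, every update is inside a critical section, which is the precondition release consistency relies on.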
Implementation Issues • Consistency issues • False sharing
Consistency model (I) • Shared data are replicated at times • To speed up read accesses • All workstations must share a consistent view of all data • Strict consistency is not possible
Consistency model (II) • Various authors have proposed weaker consistency models • Cheaper to implement • Harder to use in a correct fashion • TreadMarks uses software release consistency • Only requires the memory to be consistent at specific synchronization points
SW release consistency (I) • Well-written parallel programs use locks to achieve mutual exclusion when they access shared variables • P(&mutex) and V(&mutex) • lock(&csect) and unlock(&csect) • acquire( ) and release( ) • Unprotected accesses can produce unpredictable results
SW release consistency (II) • SW release consistency will only guarantee correctness of operations performed within an acquire/release pair • No need to export the new values of shared variables until the release • Must guarantee that a workstation has received the most recent values of all shared variables when it completes an acquire
SW release consistency (III)
Process 1:
  shared int x;
  acquire();
  x = 1;
  release();  // export x = 1
Process 2:
  shared int x;
  acquire();  // wait for new value of x
  x++;
  release();  // export x = 2
SW release consistency (IV) • Must still decide how to release updated values • TreadMarks uses lazy release: • Delays propagation until an acquire is issued • Its predecessor Munin used eager release: • New values of shared variables were propagated at release time
SW release consistency (V) • [Figure: message timelines contrasting eager release (updates propagated at each release) with lazy release (updates propagated at the next acquire)]
False sharing • One processor accesses x while another accesses y • The page containing both x and y will move back and forth between the main memories of the workstations
Multiple write protocol (I) • Designed to fight false sharing • Uses a copy-on-write mechanism • Whenever a process is granted access to write-shared data, the page containing these data is marked copy-on-write • First attempt to modify the contents of the page results in the creation of an unmodified copy of the page (the twin)
Multiple write protocol (II) • At release time, TreadMarks • Performs a word-by-word comparison of the page and its twin • Stores the diff in the space used by the twin page • Informs all processors having a copy of the shared data of the update • These processors will request the diff the first time they access the page
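The word-by-word comparison can be sketched as follows (the diff layout and names here are illustrative, not TreadMarks' actual diff format):

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS 1024

/* One diff entry: which word changed, and its new value. */
struct diff_entry {
    size_t   offset;
    uint32_t value;
};

/* Compare a dirty page against its twin; record each changed word. */
size_t make_diff(const uint32_t *page, const uint32_t *twin,
                 size_t nwords, struct diff_entry *diff)
{
    size_t n = 0;
    for (size_t i = 0; i < nwords; i++)
        if (page[i] != twin[i])
            diff[n++] = (struct diff_entry){ i, page[i] };
    return n;
}

/* A remote copy applies the diff to bring its page up to date. */
void apply_diff(uint32_t *page, const struct diff_entry *diff, size_t n)
{
    for (size_t i = 0; i < n; i++)
        page[diff[i].offset] = diff[i].value;
}
```

Because only the changed words travel, two processes writing disjoint words of the same page produce non-overlapping diffs, which is what lets the multiple-writer protocol tolerate false sharing.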
Example • Before first write access: page holds x = 1, y = 2 • At first write: twin created, also holding x = 1, y = 2 • After: page holds x = 3, y = 2; comparing with the twin yields the diff "new value of x is 3"
Multiple write protocol (III) • TreadMarks could but does not check for conflicting updates to write-shared pages
The TreadMarks system • Entirely at user-level • Links to programs written in C, C++ and Fortran • Uses UDP/IP for communication (or AAL3/4 if machines are connected by an ATM LAN) • Uses the SIGIO signal to speed up processing of incoming requests • Uses the mprotect() system call to control access to shared pages
Performance evaluation (I) • Long discussion of two large TreadMarks applications
Performance evaluation (II) • A previous paper compared performance of TreadMarks with that of Munin • Munin performance typically was within 5 to 33% of the performance of hand-coded message passing versions of the same programs • TreadMarks was almost always better than Munin with one exception: • A 3-D FFT program
Performance Evaluation (III) • The 3-D FFT program was an iterative program that read some shared data outside any critical section • Doing otherwise would have been too costly • Munin used eager release, which ensured that the values read were never far out of date • Not true for TreadMarks!
Other DSM Implementations (I) • Sequentially-Consistent Software DSM (IVY): • Sends messages to other copies at each write • Much slower • Software release consistency with eager release (Munin)
Other DSM Implementations (II) • Entry consistency (Midway): • Requires each variable to be associated with a synchronization object (typically a lock) • Acquire/release operations on a given synchronization object only involve the variables associated with that object • Requires less data traffic • Does not handle dusty decks well
Other DSM Implementations (III) • Structured DSM Systems (Linda): • Offer the programmer a shared tuple space accessed using specific synchronized methods • Require a very different programming style
CONCLUSIONS • Can build an efficient DSM entirely in user space • Modern UNIX systems offer all the required primitives • Software release consistency model works very well • Lazy release is almost always better than eager release