Shared Memory Systems. Miodrag Bolic. Next 4 to 5 lectures. Overview. Outline. Characteristics of shared memory systems Programming shared-memory multiprocessors Hardware implementation Architectures Memory access Caches Synchronization. Characteristics of shared memory systems [3].
Shared Memory Systems Miodrag Bolic
Next 4 to 5 lectures Overview
Outline • Characteristics of shared memory systems • Programming shared-memory multiprocessors • Hardware implementation • Architectures • Memory access • Caches • Synchronization
Characteristics of shared memory systems [3] • Any processor can directly reference any memory location. • Communication occurs implicitly as a result of loads and stores. • Location of data in memory is transparent to the programmer. • Inherently provided on a wide range of platforms (standard processors today have specific extra hardware for shared memory systems). • Memory may be physically distributed among processors.
Requirements [3] • Support for memory coherency • The machine must make sure that all of the processing nodes have an accurate picture of the most up-to-date memory. • Support for atomic operations on data • The machine must allow for only one processor to change data at a time. • Nonatomic operation: One processor requests data and before the request is answered, another processor changes that data.
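The atomicity requirement above can be illustrated with a shared counter. The following is a minimal sketch, assuming C11 atomics and POSIX threads are available; the names `atomic_counter`, `add_atomic`, `run_atomic_increments`, `NUM_THREADS` and `INCREMENTS` are hypothetical and do not come from the slides.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 4        /* assumed number of worker threads */
#define INCREMENTS  100000   /* assumed increments per thread */

static atomic_int atomic_counter;   /* shared data, updated atomically */

static void *add_atomic(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++)
        atomic_fetch_add(&atomic_counter, 1);   /* load, add and store as one indivisible step */
    return NULL;
}

/* Runs NUM_THREADS workers and returns the final counter value. */
int run_atomic_increments(void)
{
    pthread_t t[NUM_THREADS];

    atomic_store(&atomic_counter, 0);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, add_atomic, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&atomic_counter);
}
```

With a plain `int` and `counter++` instead, two threads could read the same old value and one update would be lost, which is the nonatomic case described on the slide.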
Shared Memory Program [3] Sum all the elements of an array of size n.
    INITIALIZE;                            // assign proc_num and num_procs
    read_array(array_to_sum, size);        // read the array and array size from file
    if (proc_num == 0) {                   // initialize the sum
        LOCK(global_sum);
        global_sum = 0;
        UNLOCK(global_sum);
    }
    BARRIER(num_procs);                    // nobody adds to the sum before it is initialized
    local_sum = 0;
    size_to_sum = size / num_procs;
    lower_ind = size_to_sum * proc_num;
    upper_ind = size_to_sum * (proc_num + 1);
    if (proc_num == num_procs - 1)
        upper_ind = size;                  // last processor also sums any remainder
    for (i = lower_ind; i < upper_ind; i++)
        local_sum += array_to_sum[i];      // if size = 100 and num_procs = 4, processor 0 sums 0 to 24, processor 1 sums 25 to 49, etc.
    LOCK(global_sum);                      // locks the sum variable so only this process can change it
    global_sum += local_sum;
    UNLOCK(global_sum);                    // gives the sum back so other processors can add to it
    BARRIER(num_procs);                    // waits for num_procs processors to get to this point in the program
    if (proc_num == 0)
        printf("sum is %d\n", global_sum);
    END;
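The same pattern can be sketched with POSIX threads standing in for the processors. This is only an illustration under assumptions: `pthread_mutex_lock`/`pthread_mutex_unlock` play the role of LOCK/UNLOCK, `pthread_join` plays the role of the final BARRIER, and the names `sum_job`, `sum_worker`, `parallel_sum` and `MAX_PROCS` are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_PROCS 16   /* assumed upper bound on worker threads */

struct sum_job {
    const int *array;        /* shared input array */
    size_t size;             /* number of elements */
    int num_procs;           /* total number of workers */
    int proc_num;            /* this worker's index */
    long *global_sum;        /* shared result */
    pthread_mutex_t *lock;   /* protects *global_sum */
};

static void *sum_worker(void *arg)
{
    struct sum_job *job = arg;
    size_t chunk = job->size / (size_t)job->num_procs;
    size_t lower = chunk * (size_t)job->proc_num;
    /* The last worker also takes the remainder of the division. */
    size_t upper = (job->proc_num == job->num_procs - 1) ? job->size : lower + chunk;
    long local_sum = 0;

    for (size_t i = lower; i < upper; i++)
        local_sum += job->array[i];

    pthread_mutex_lock(job->lock);      /* LOCK(global_sum) */
    *job->global_sum += local_sum;
    pthread_mutex_unlock(job->lock);    /* UNLOCK(global_sum) */
    return NULL;
}

/* Sums 'size' elements of 'array' using up to MAX_PROCS threads. */
long parallel_sum(const int *array, size_t size, int num_procs)
{
    pthread_t threads[MAX_PROCS];
    struct sum_job jobs[MAX_PROCS];
    pthread_mutex_t lock;
    long global_sum = 0;   /* initialized before any worker starts */

    if (num_procs < 1) num_procs = 1;
    if (num_procs > MAX_PROCS) num_procs = MAX_PROCS;

    pthread_mutex_init(&lock, NULL);
    for (int p = 0; p < num_procs; p++) {
        jobs[p] = (struct sum_job){ array, size, num_procs, p, &global_sum, &lock };
        pthread_create(&threads[p], NULL, sum_worker, &jobs[p]);
    }
    for (int p = 0; p < num_procs; p++)
        pthread_join(threads[p], NULL);   /* all partial sums are in after this loop */
    pthread_mutex_destroy(&lock);
    return global_sum;
}
```

Each worker touches the shared sum only once, so the lock is contended at most `num_procs` times regardless of the array size.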
Multiprocessor Software Functions – Example [3] • INITIALIZE – assigns a number (proc_num) to each processor in the system; assigns the total number of processors (num_procs). • LOCK(data) • Allows a processor to “check out” a certain piece of shared data. • While one processor has the data locked, no other processors can obtain the lock. • The lock is blocking, so once a LOCK is encountered, execution of the program cannot proceed until the LOCK is obtained. • UNLOCK(data) – releases a lock so that other processors can obtain it. • BARRIER(n_procs) – When a BARRIER is encountered, a processor waits at that BARRIER until n_procs processors reach the BARRIER, then execution can proceed.
Architecture [3] • Figure: an example of a shared-bus architecture with 4 processors. • Both static and dynamic networks can be used to connect processors and shared memory.
Caches and Cache Coherence [4] • Caches play a key role in all cases • Reduce average data access time • Reduce bandwidth demands placed on shared interconnect • But private processor caches create a problem • Copies of a variable can be present in multiple caches • A write by one processor may not become visible to others • They’ll keep accessing the stale value in their caches • Cache coherence problem • Need to take actions to ensure visibility
Inconsistency in Data Sharing • Suppose two processors each use (read) a data item X from a shared memory. Then each processor’s cache will have a copy of X that is consistent with the shared memory copy. • Now suppose one processor modifies X (to X’). Now that processor’s cache is inconsistent with the other processor’s cache and the shared memory. • With a write-through cache, the shared memory copy will be made consistent, but the other processor still has an inconsistent value (X). • With a write-back cache, the shared memory copy will be updated eventually, when the block containing X (actually X’) is replaced or invalidated.
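The write-through case can be traced with a toy model. The sketch below is a single-threaded simulation, not real hardware behavior: it assumes one shared word X and two private one-entry caches, and the invalidation step in `write_x_invalidate` is just one possible coherence action chosen for illustration. All names (`toy_system`, `read_x`, `write_x_no_coherence`, `write_x_invalidate`) are hypothetical.

```c
#include <stdbool.h>

/* Toy model: one memory word X and two private one-entry caches. */
struct toy_cache { bool valid; int value; };

struct toy_system {
    int memory_x;
    struct toy_cache cache[2];
};

/* Read X through processor p's cache; fill the cache from memory on a miss. */
int read_x(struct toy_system *s, int p)
{
    if (!s->cache[p].valid) {
        s->cache[p].value = s->memory_x;
        s->cache[p].valid = true;
    }
    return s->cache[p].value;
}

/* Write-through write with no coherence action:
   updates the writer's cache and memory, leaves the other cache alone. */
void write_x_no_coherence(struct toy_system *s, int p, int v)
{
    s->cache[p].value = v;
    s->cache[p].valid = true;
    s->memory_x = v;
}

/* Same write, plus invalidation of the other processor's copy
   (an assumed coherence action, for illustration only). */
void write_x_invalidate(struct toy_system *s, int p, int v)
{
    write_x_no_coherence(s, p, v);
    s->cache[1 - p].valid = false;
}
```

After both processors have read X, a call to `write_x_no_coherence` by processor 0 leaves processor 1 reading its old cached copy even though memory is up to date; with `write_x_invalidate`, processor 1 misses and refetches the new value.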
Mutual Exclusion [4] • Provided by LOCK-UNLOCK around critical section • Set of operations we want to execute atomically • Implementation of LOCK/UNLOCK must guarantee mutual exclusion • Can lead to significant serialization if contended • Especially since we expect non-local accesses in the critical section • Mutex stands for “mutual exclusion”
Simple Software Lock [4]
    lock:   ld  register, location   /* copy location to register */
            cmp register, #0         /* compare with 0 */
            bnz lock                 /* if not 0, try again */
            st  location, #1         /* store 1 to mark it locked */
            ret                      /* return control to caller */
and
    unlock: st  location, #0         /* write 0 to location */
            ret                      /* return control to caller */
• Problem: lock needs atomicity in its own implementation • Read (test) and write (set) of lock variable by a process not atomic • Solution: atomic read-modify-write or exchange instructions • atomically test value of location and set it to another value, return success or failure somehow
Atomic Exchange Instruction [4] • Specifies a location and register. In a single atomic operation: • Value in location read into a register • Another value (function of value read or not) stored into location • Many variants • Simple example: test&set • Value in location read into a specified register • Constant 1 stored into location • Successful if value loaded into register is 0 • Other constants could be used instead of 1 and 0 • Can be used to build locks
Simple Test&Set Lock [4]
    lock:   t&s register, location
            bnz lock                 /* if not 0, try again */
            ret                      /* return control to caller */
    unlock: st  location, #0         /* write 0 to location */
            ret                      /* return control to caller */
• Other read-modify-write primitives can be used too • Swap
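In C, the same lock can be sketched with the C11 `atomic_flag` type, whose `atomic_flag_test_and_set` operation behaves like the t&s instruction above. This is a minimal sketch, not the implementation from [4]; the type and function names (`tas_lock`, `tas_try_lock`, `tas_lock_acquire`, `tas_lock_release`) are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_flag held; } tas_lock;   /* clear = free, set = held */

void tas_lock_init(tas_lock *l)
{
    atomic_flag_clear(&l->held);
}

/* One test&set attempt: returns true if the lock was free and is now ours. */
bool tas_try_lock(tas_lock *l)
{
    /* Sets the flag and returns its previous state in one atomic step. */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* lock: repeat test&set until the old value was "free". */
void tas_lock_acquire(tas_lock *l)
{
    while (!tas_try_lock(l))
        ;   /* spin: every failed attempt is still a write to the flag */
}

/* unlock: write "free" back to the location. */
void tas_lock_release(tas_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A critical section is then bracketed by `tas_lock_acquire(&l)` and `tas_lock_release(&l)`, matching the LOCK/UNLOCK pair used earlier.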
Mutual Exclusion: Altera [1] • A mutex allows cooperating processors to agree that one of them should be allowed mutually exclusive access to a hardware resource in the system. • Component that is added in SOPC builder • Shared memory can be accessed without using mutexes
Example: Opening and locking a mutex [2]
    #include <altera_avalon_mutex.h>

    /* get the mutex device handle */
    alt_mutex_dev* mutex = altera_avalon_mutex_open( "/dev/mutex" );
    /* acquire the mutex, setting the value to one */
    altera_avalon_mutex_lock( mutex, 1 );
    /* Access a shared resource here. */
    /* release the lock */
    altera_avalon_mutex_unlock( mutex );
Performance Criteria (T&S Lock) [4] • Uncontended Latency • Very low if repeatedly accessed by same processor; independent of p • Traffic • Lots if many processors compete; poor scaling with p • Each t&s generates invalidations, and all rush out again to t&s • Storage • Very small (single variable); independent of p • Fairness • Poor, can cause starvation • Test&set with backoff similar, but less traffic • Luckily, better hardware primitives as well as algorithms exist
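Test&set with backoff can be sketched as follows. This is an illustration under assumptions: exponential backoff is one common choice of delay policy, the bounds `MIN_DELAY` and `MAX_DELAY` are arbitrary, and the names `backoff_lock`, `spin_delay`, `backoff_try_lock`, `backoff_lock_acquire` and `backoff_lock_release` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MIN_DELAY 1u      /* assumed initial backoff, in busy-loop iterations */
#define MAX_DELAY 1024u   /* assumed cap on the backoff */

typedef struct { atomic_int locked; } backoff_lock;   /* 0 = free, 1 = held */

void backoff_lock_init(backoff_lock *l)
{
    atomic_init(&l->locked, 0);
}

/* Crude delay loop; volatile keeps the compiler from removing it. */
static void spin_delay(unsigned iterations)
{
    for (volatile unsigned i = 0; i < iterations; i++)
        ;
}

/* One test&set attempt: atomic_exchange stores 1 and returns the old value. */
bool backoff_try_lock(backoff_lock *l)
{
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/* After each failed attempt, wait before trying again and double the wait. */
void backoff_lock_acquire(backoff_lock *l)
{
    unsigned delay = MIN_DELAY;

    while (!backoff_try_lock(l)) {
        spin_delay(delay);
        if (delay < MAX_DELAY)
            delay *= 2;
    }
}

void backoff_lock_release(backoff_lock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Waiting between attempts reduces the number of test&set operations issued while the lock is held, at the cost of possibly noticing a release later than a tight spin would.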
Improved Hardware Primitives: LL-SC [4] • Goals: • Test with reads • Failed read-modify-write attempts don’t generate invalidations • Nice if a single primitive can implement a range of r-m-w operations • Load-Locked (or -linked), Store-Conditional • LL reads variable into register • Follow with arbitrary instructions to manipulate its value • SC tries to store back to location if and only if no one else has written to the variable since this processor’s LL • If SC succeeds, all three steps happened atomically • If it fails, it doesn’t write or generate invalidations (need to retry LL) • Success indicated by condition codes
Simple Lock with LL-SC [4]
    lock:   ll   reg1, location      /* LL location to reg1 */
            bnz  reg1, lock          /* if location is locked, try again */
            sc   location, reg2      /* SC reg2 into location */
            beqz lock                /* if failed, start again */
            ret                      /* return control to the caller of lock */
    unlock: st   location, #0        /* write 0 to location */
            ret
• SC can fail (without putting transaction on bus) if: • Tries to get bus but another processor’s SC gets bus first • LL, SC are not lock, unlock respectively • Only guarantee no conflicting write to lock variable between them • But can use directly to implement simple operations on shared variables
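Standard C has no direct LL/SC instructions, so the sketch below substitutes C11 `atomic_compare_exchange_weak`, which has the same retry structure and is allowed to fail spuriously. It shows a simple operation on a shared variable (fetch-and-add) built directly from the primitive, without a lock. The function name `fetch_and_add` is hypothetical.

```c
#include <stdatomic.h>

/* Atomically adds delta to *location and returns the previous value. */
int fetch_and_add(atomic_int *location, int delta)
{
    int old = atomic_load(location);   /* plays the role of LL: read the current value */

    /* Plays the role of SC: store old + delta only if *location still equals old.
       On failure, 'old' is refreshed with the current value and we retry. */
    while (!atomic_compare_exchange_weak(location, &old, old + delta))
        ;

    return old;
}
```

The loop mirrors the ll / modify / sc / branch-if-failed sequence: the store happens only if no other write to the variable intervened, otherwise the whole sequence is repeated.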
A Simple Centralized Barrier [2] • Shared counter maintains number of processes that have arrived • increment when arrive (lock), check until reaches numprocs
    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag;                            /* must be initialized to 0 */
    } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;               /* reset flag if first to reach */
        mycount = ++bar_name.counter;        /* mycount is private; includes this arrival */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                  /* last to arrive */
            bar_name.counter = 0;            /* reset for next barrier */
            bar_name.flag = 1;               /* release waiters */
        } else
            while (bar_name.flag == 0) {};   /* busy wait for release */
    }
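When the barrier above is reused back to back, the first arrival of the next episode resets the flag and can race with waiters that have not yet seen the release. The sketch below swaps in a different, well-known variant, the sense-reversing barrier, which avoids the reset by toggling the flag each episode. It assumes POSIX threads and C11 atomics; `sched_yield` is added only so the demo behaves well on machines with few cores. All names (`sr_barrier`, `sr_barrier_wait`, `run_barrier_demo`, and the `DEMO_*` constants) are hypothetical.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

typedef struct {
    pthread_mutex_t lock;   /* protects counter */
    int counter;            /* arrivals in the current episode */
    atomic_int flag;        /* release flag, toggled once per episode */
    int num_procs;
} sr_barrier;

void sr_barrier_init(sr_barrier *b, int num_procs)
{
    pthread_mutex_init(&b->lock, NULL);
    b->counter = 0;
    atomic_init(&b->flag, 0);
    b->num_procs = num_procs;
}

/* local_sense is private to each thread and must start at 0. */
void sr_barrier_wait(sr_barrier *b, int *local_sense)
{
    *local_sense = !*local_sense;            /* value the flag takes when this episode ends */
    pthread_mutex_lock(&b->lock);
    int mycount = ++b->counter;
    if (mycount == b->num_procs)
        b->counter = 0;                      /* last arrival resets for the next episode */
    pthread_mutex_unlock(&b->lock);

    if (mycount == b->num_procs) {
        atomic_store(&b->flag, *local_sense);            /* release waiters */
    } else {
        while (atomic_load(&b->flag) != *local_sense)
            sched_yield();                               /* wait for release */
    }
}

#define DEMO_THREADS 4
#define DEMO_ROUNDS  20

static sr_barrier demo_barrier;
static atomic_int arrivals[DEMO_ROUNDS];
static atomic_int violations;

static void *demo_worker(void *arg)
{
    (void)arg;
    int local_sense = 0;
    for (int r = 0; r < DEMO_ROUNDS; r++) {
        atomic_fetch_add(&arrivals[r], 1);
        sr_barrier_wait(&demo_barrier, &local_sense);
        /* Past the barrier, every thread must have registered for round r. */
        if (atomic_load(&arrivals[r]) != DEMO_THREADS)
            atomic_fetch_add(&violations, 1);
    }
    return NULL;
}

/* Returns the number of times a thread passed the barrier too early. */
int run_barrier_demo(void)
{
    pthread_t t[DEMO_THREADS];

    sr_barrier_init(&demo_barrier, DEMO_THREADS);
    atomic_store(&violations, 0);
    for (int r = 0; r < DEMO_ROUNDS; r++)
        atomic_store(&arrivals[r], 0);
    for (int i = 0; i < DEMO_THREADS; i++)
        pthread_create(&t[i], NULL, demo_worker, NULL);
    for (int i = 0; i < DEMO_THREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&violations);
}
```

Because waiters compare the flag against their own private sense rather than against a fixed value, no thread ever has to reset the flag, and consecutive barrier episodes cannot interfere with each other.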
References
[1] Altera Corp., Creating Multiprocessor NIOS II System Tutorial, 2005.
[2] Altera Corp., Altera Embedded Peripherals Handbook, 2005.
[3] J. Kowalczyk, “Multiprocessor Systems,” Xilinx, 2003.
[4] D. Culler, J. P. Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1999.