COM503 Parallel Computer Architecture & Programming
Lecture 4. Memory Consistency Models
Prof. Taeweon Suh
Computer Science Education, Korea University
Memory Consistency
• What do you expect from the following code?
    Processor 1          Processor 2
    A = 1;               while (flag == 0);
    flag = 1;            print A;
• Program order among P1's and P2's accesses to different locations is neither implied nor enforced by coherence
• Coherence requires only that the new value of A eventually become visible to processor P2 (not necessarily before the new value of flag is observed)
• Note that an x86 CPU is a superscalar with OOO (Out-Of-Order) execution
• What would you do if you want “print A” to print “1”?
Demo
    #include <stdio.h>
    #include <omp.h>

    int main()
    {
        int a, b;
        int a_tmp, b_tmp;

        a = 0;
        b = 0;

        #pragma omp parallel num_threads(2) shared(a, b)
        {
            //printf("Parallel region is executed by thread ID %d\n", omp_get_thread_num());

            /* Writer: executed by one thread of the team, no barrier afterwards */
            #pragma omp single nowait
            {
                a = 1;
                b = 2;
                //while(1);
            }

            /* Reader: executed by one thread, not synchronized with the writer */
            #pragma omp single nowait
            {
                a_tmp = a;
                b_tmp = b;
                printf("A = %d, B = %d\n", a_tmp, b_tmp);
                //while(1);
            }
        }
        return 0;
    }
Memory Consistency
• Use a barrier
    Processor 1          Processor 2
    A = 1;
    ---------- Barrier (b1) ----------
                         print A;
• A barrier is often built using reads and writes to ordinary shared variables (e.g., b1 above) rather than a special barrier operation
• Coherence does not say anything at all about the order among these accesses
• It would be interesting to see how OpenMP (or Pthreads) implements a barrier at the low level
• In addition, CPUs typically provide memory barrier (fence) instructions (such as sfence, lfence, and mfence on x86)
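The barrier version of the example can be sketched with POSIX threads. This is a minimal sketch, assuming a POSIX system that provides `pthread_barrier_t`; the function name `run_barrier_demo` is a hypothetical helper, not part of the lecture's demo. POSIX lists `pthread_barrier_wait` among the calls that synchronize memory, so P2's read after the barrier is expected to see P1's write before it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

static int A = 0;               /* ordinary shared variable */
static pthread_barrier_t b1;    /* barrier shared by both threads */

/* Processor 1: write A, then arrive at the barrier. */
static void *p1(void *arg) {
    (void)arg;
    A = 1;
    pthread_barrier_wait(&b1);
    return NULL;
}

/* Processor 2: wait at the barrier, then read A. */
static void *p2(void *arg) {
    pthread_barrier_wait(&b1);
    *(int *)arg = A;
    return NULL;
}

/* Hypothetical helper: runs the two threads once and returns the value P2 read. */
int run_barrier_demo(void) {
    pthread_t t1, t2;
    int seen = -1;
    A = 0;
    pthread_barrier_init(&b1, NULL, 2);
    pthread_create(&t2, NULL, p2, &seen);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_barrier_destroy(&b1);
    printf("A = %d\n", seen);
    return seen;
}
```

Calling `run_barrier_demo()` from `main` (and linking with the threads library where required) should print the value P2 observed.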
Memory Consistency
• So, clearly we need something more than coherence to give a shared address space clear semantics
  • That is, an ordering model that programmers can use to reason about the possible results and hence the correctness of their programs
• A memory consistency model for a shared address space specifies constraints on the order in which memory operations must appear to be performed (i.e., to become visible to the processors) with respect to one another
  • It includes operations to the same locations or to different locations, and by the same process or different processes, so in this sense memory consistency subsumes coherence
    Processor 1          Processor 2
    A = 1;
    ---------- Barrier (b1) ----------
                         print A;
Programmer’s Abstraction of Memory Subsystem
[Figure: processors P1, P2, ..., Pn, each with its own partial (program) order, connected through a single switch to one memory]
• Processors issue memory references as per program order
• The “switch” is randomly set after each memory reference
• Interleaving the partial (program) orders for different processes may yield a large number of possible total orders
Sequential Consistency
• Sequential consistency (SC)
  • Formalized by Lamport in 1979
  • A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program
• Implementing SC requires that the system (s/w and h/w) follow 2 constraints
  • Program order requirement: memory operations of a process must appear to become visible (to itself and others) in program order
  • Write atomicity: all writes (to any location) should appear to all processors to have occurred in the same order
Sequential Consistency
    /* Assume initial values of A and B are 0 */
    Processor 1          Processor 2
    (1a) A = 1;          (2a) print B;
    (1b) B = 2;          (2b) print A;
• What values of A and B do you expect to be printed on P2?
  • (A, B) = (0, 0)?
  • (A, B) = (1, 2)?
  • (A, B) = (1, 0)?
  • (A, B) = (0, 2)?
• Under SC, the result (0, 2) for (A, B) would not be allowed, since it would then appear that the writes of A and B by P1 executed out of program order
• The execution order 1b, 2a, 2b, 1a is not sequentially consistent
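The claim about (0, 2) can be checked by brute force. The sketch below enumerates every interleaving of (1a), (1b), (2a), (2b) that respects each processor's program order and records which (A, B) pairs P2 can print. The helper names `explore`, `sc_allows`, and `print_sc_outcomes` are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* seen[a][b] is set when some SC interleaving prints A == a and B == b. */
static int seen[2][3];

/* Extend an interleaving in which P1 has done i operations and P2 has done j.
   A and B are the current memory values; pA and pB are the values printed so far. */
static void explore(int i, int j, int A, int B, int pA, int pB) {
    if (i == 2 && j == 2) {
        seen[pA][pB] = 1;
        return;
    }
    if (i == 0) explore(1, j, 1, B, pA, pB);   /* (1a) A = 1   */
    if (i == 1) explore(2, j, A, 2, pA, pB);   /* (1b) B = 2   */
    if (j == 0) explore(i, 1, A, B, pA, B);    /* (2a) print B */
    if (j == 1) explore(i, 2, A, B, A, pB);    /* (2b) print A */
}

/* Returns 1 if the printed pair (A, B) = (a, b) is reachable under SC. */
int sc_allows(int a, int b) {
    for (int x = 0; x < 2; x++)
        for (int y = 0; y < 3; y++)
            seen[x][y] = 0;
    explore(0, 0, 0, 0, 0, 0);
    return seen[a][b];
}

/* Prints every reachable (A, B) pair. */
void print_sc_outcomes(void) {
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 3; b += 2)
            if (sc_allows(a, b))
                printf("(A, B) = (%d, %d) is allowed under SC\n", a, b);
}
```

Only six interleavings exist here, so the search is trivial; the same idea stops scaling quickly as the number of operations grows.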
How to Impose Constraint?
• In practice, to constrain compiler optimizations, multithreaded and parallel programs annotate the variables or memory references that are used to preserve orders
• A particularly stringent example is the use of the volatile qualifier in a variable declaration
  • It prevents the variable from being register allocated, or any memory operation on the variable from being reordered with respect to operations before or after it in program order
Reordering Impact Example
• How would reordering the memory operations affect semantics in a parallel program running on a multiprocessor, and in a threaded program in which the two processes are interleaved on the same processor?
    Processor 1          Processor 2
    A = 1;               while (flag == 0);
    flag = 1;            print A;
• The compiler may reorder the writes to A and flag with no impact on a sequential program
• It violates our intuition for both parallel programs and multithreaded uniprocessor programs
• For many compilers, these reorderings can be avoided by declaring the variable flag to be of type volatile integer (instead of integer)
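The slide's remedy is compiler dependent: ISO C only orders volatile accesses relative to other volatile accesses and says nothing about the order seen by other processors. A portable alternative is sketched below. It swaps the volatile flag for a C11 atomic with release/acquire ordering, so it is a different technique from the one on the slide. It assumes a C11 compiler with `<stdatomic.h>` and POSIX threads; `run_flag_handoff` is a hypothetical helper name.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int A = 0;             /* ordinary shared data */
static atomic_int flag = 0;   /* synchronization variable */

/* Processor 1: write the data, then publish it with a release store.
   The write to A may not be moved after the store to flag. */
static void *producer(void *arg) {
    (void)arg;
    A = 1;
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

/* Processor 2: spin with acquire loads. Once flag == 1 is observed,
   the earlier write A = 1 is guaranteed to be visible. */
static void *consumer(void *arg) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0) { }
    *(int *)arg = A;
    return NULL;
}

/* Hypothetical helper: runs both threads once and returns the value P2 read from A. */
int run_flag_handoff(void) {
    pthread_t p1, p2;
    int seen = -1;
    A = 0;
    atomic_store(&flag, 0);
    pthread_create(&p2, NULL, consumer, &seen);
    pthread_create(&p1, NULL, producer, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    printf("A = %d\n", seen);
    return seen;
}
```

Only flag needs to be atomic; A stays an ordinary variable because the release/acquire pair on flag orders the accesses around it.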
Problems with SC
• The SC model provides intuitive semantics to the programmer
  • The program order and a consistent interleaving across processes can be quite easily implemented
• However, its drawback is that it restricts many of the performance optimizations that modern uniprocessor compilers and microprocessors employ
  • With the high cost of memory access latency, computer systems achieve higher performance by reordering or overlapping the multiple memory or communication operations from a processor
  • Preserving the sufficient conditions for SC does not allow for much reordering or overlap in hardware
  • With SC, the compiler cannot reorder memory accesses even if they are to different locations, disallowing critical performance optimizations such as code motion, common-subexpression elimination, software pipelining, and even register allocation
Reality Check
• Unfortunately, many of the optimizations that are commonly employed in both compilers and processors violate the SC property
• Explicitly parallel programs use uniprocessor compilers, which are concerned only with preserving dependences to the same location
  • So, compilers routinely reorder accesses to different locations within a process, and a processor may in fact issue accesses out of the program order seen by the programmer
• Advanced compiler optimizations can change the order in which different memory locations are accessed or can even eliminate memory operations
  • Common subexpression elimination, constant propagation, register allocation, and loop transformations such as loop splitting, loop reversal, and blocking
Example: Register Allocation
• How can register allocation lead to a violation of SC even if the hardware satisfies SC?
    Original code                    After register allocation
    P1           P2                  P1           P2
    B = 0;       A = 0;              r1 = 0;      r2 = 0;
    A = 1;       B = 1;              A = 1;       B = 1;
    u = B;       v = A;              u = r1;      v = r2;
                                     B = r1;      A = r2;
• The result (u, v) = (0, 0) is disallowed under SC
• A uniprocessor compiler might easily perform these optimizations in each process
• They are valid for sequential programs since the reordered accesses are to different locations
Problems with SC
• Providing SC at the programmer’s interface implies supporting SC at lower-level interfaces
• If the sufficient conditions for SC are met, a processor waits for an access to complete before issuing the next one
  • So, most of the latency suffered by memory references is directly seen by processors as stall time
  • Although a processor may continue executing non-memory instructions while a single outstanding memory reference is being serviced, the expected benefit from such overlap is tiny, since even without ILP (Instruction-Level Parallelism) every third instruction on average is a memory reference
• So, we have to do something about this performance problem
Programmer’s interface: We focus mainly on the consistency model as seen by the programmer, that is, at the interface between the programmer and the rest of the system composed of the compiler, operating system, and hardware. For example, a processor may preserve all program orders presented to it among memory operations, but if the compiler has already reordered operations, then programmers can no longer reason with the simple model exported by the hardware
Solutions?
• One approach is to preserve SC at the programmer’s interface, but find ways to hide the long stalls from the processor
• 1st technique
  • The compiler does not reorder memory operations, but latency tolerance techniques such as data prefetching or multithreading are used to overlap data transfers with one another or with computation
  • But, the actual read and write operations are not issued before previous ones complete in program order
Solutions?
• 2nd technique
  • The compiler reorders operations as long as it can guarantee that SC will not be violated in the results
    • Compiler algorithms have been developed for this (Shasha and Snir 1988; Krishnamurthy and Yelick 1994, 1995)
  • At the hardware level
    • Memory operations are issued and executed out of program order, but are guaranteed to become visible to other processors in program order
    • This approach is well suited to dynamically scheduled processors that use an instruction lookahead buffer to find independent instructions to issue
      • Instructions are inserted in the lookahead buffer in program order
      • They are guaranteed to retire from the lookahead buffer in program order
    • Speculative execution, such as branch prediction
    • Speculative reads
      • Values returned by reads are used even before they are known to be correct
      • Later, roll back if they are incorrect
• Or
  • Change the memory consistency model itself!
Relaxed Consistency Models
• A completely different way to overcome the performance limitations imposed by SC is to change the memory consistency model itself
  • That is, not to guarantee such strong ordering constraints to the programmer, but still retain semantics that are intuitive enough to be useful
• The intuition behind the relaxed models is that SC is usually too conservative
  • Many of the orders it preserves are not really needed to satisfy a programmer’s intuition in most situations
• By relaxing the ordering constraints, these relaxed consistency models allow the compiler to reorder accesses before presenting them to the hardware, at least to some extent
• At the hardware level, they allow multiple memory accesses from the same process not only to be outstanding at a time, but even to complete or become visible out of order, thus allowing much of the latency to be overlapped and hidden from the processor
Example
    P1               P2
    A = 1;           while (flag == 0);
    B = 1;           u = A;
    flag = 1;        v = B;
[Figure: the code above shown twice, once labeled "Ordering Under SC" and once labeled "Ordering necessary for correct program semantics"]
• Writes to variables A and B by P1 can be reordered without affecting the results
  • All we must ensure is that both of them complete before the variable flag is set to 1
• Reads of variables A and B can be reordered at P2 once flag has been observed to change to value 1
• Even with these reorderings, the results look just like those of an SC execution
Reality Check
• It would be wonderful if system software or hardware could automatically detect which program orders are critical to maintaining SC semantics and allow the others to be violated for higher performance (Shasha and Snir 1988)
• However, the problem is intractable (in fact, undecidable) for general programs, and inexact solutions are often too conservative to be very useful
Relaxed Consistency Model
• A relaxed consistency model must specify 2 things
  • What program orders among memory operations are guaranteed to be preserved by the system, including whether write atomicity will be maintained
  • If not all program orders are guaranteed to be preserved by default, what mechanisms the system provides for a programmer to enforce order explicitly when desired
• As should be clear by now, the compiler and the hardware have their own system specifications, but we focus on the specification that the two together (or the system as a whole) present to the programmer
• For a processor architecture, the specification it exports governs the reorderings that it allows, and it also provides the order-preserving primitives
  • It is often called the processor’s memory model
Relaxed Consistency Model
• A programmer may use the consistency model to reason about correctness and insert the appropriate order-preserving mechanisms
  • However, this is a very low-level interface for a programmer
  • Parallel programming is challenging enough without having to think about reorderings and write atomicity
• What the programmer wants is a methodology for writing “safe” programs
• So, this is a contract: if the program follows certain high-level rules or provides enough program annotations (such as synchronization), then any system on which the program runs will always guarantee a sequentially consistent execution, regardless of the default orderings permitted by the system specifications
Relaxed Consistency Model
• The programmer’s responsibility is to use the rules and annotations, which hopefully does not involve reasoning at the level of potential orderings
• The system’s responsibility is to use the rules and annotations as constraints to maintain the illusion of sequential consistency
Ordering Specifications
• TSO (Total Store Ordering)
  • Sindhu, Frailong, and Cekleov 1991, Sun Microsystems
• PC (Processor Consistency)
  • Goodman 1989 and Gharachorloo 1990, Intel Pentium
• PSO (Partial Store Ordering)
  • Sindhu, Frailong, and Cekleov 1991, Sun Microsystems
• WO (Weak Ordering)
  • Dubois, Scheurich, and Briggs 1986
• RC (Release Consistency)
  • Gharachorloo 1990
• RMO (Relaxed Memory Ordering)
  • Weaver and Germond 1994, Sun Sparc V8 and V9
• Digital Alpha (Sites 1992) and IBM/Motorola PowerPC (May et al. 1994) models
1. Relaxing the Write-to-Read Program Order
• The main motivation is to allow the hardware to hide the latency of write operations
  • While the write miss is still in the write buffer and not yet visible to other processors, the processor can issue and complete reads that hit in its cache
• The models in this class (TSO and PC) preserve the programmer’s intuition quite well, for the most part, even without any special operations
  • TSO and PC allow a read to bypass an earlier incomplete write in program order
  • TSO and PC preserve the ordering of writes in program order
  • But, PC does not guarantee write atomicity
Write Atomicity
• Write atomicity ensures that nothing a processor does after it has seen the new value produced by a write (e.g., another write that it issues) becomes visible to other processes before they too have seen the new value for that write
  • All writes (to any location) should appear to all processors to have occurred in the same order
• Write serialization says that writes to the same location should appear to all processors to have occurred in the same order
Write Atomicity Example
• This example illustrates the importance of write atomicity for sequential consistency
    Processor 1      Processor 2          Processor 3
    A = 1;           while (A == 0);      while (B == 0);
                     B = 1;               print A;
• What happens if P2 writes B before it is guaranteed that P3 has seen the new value of A?
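The three-processor example can be tried out in C. This is a minimal sketch, assuming C11 atomics and POSIX threads; it uses the default sequentially consistent atomic operations, which place all accesses to A and B in one total order, so P3 cannot observe B == 1 and then read A == 0. The helper name `run_three_processors` is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int A, B;   /* both start at 0 */

/* Processor 1: A = 1 */
static void *p1(void *arg) {
    (void)arg;
    atomic_store(&A, 1);
    return NULL;
}

/* Processor 2: wait until the new value of A is seen, then write B */
static void *p2(void *arg) {
    (void)arg;
    while (atomic_load(&A) == 0) { }
    atomic_store(&B, 1);
    return NULL;
}

/* Processor 3: wait until the new value of B is seen, then read A */
static void *p3(void *arg) {
    while (atomic_load(&B) == 0) { }
    *(int *)arg = atomic_load(&A);
    return NULL;
}

/* Hypothetical helper: runs the three threads once and returns the A that P3 read. */
int run_three_processors(void) {
    pthread_t t1, t2, t3;
    int seen = -1;
    atomic_store(&A, 0);
    atomic_store(&B, 0);
    pthread_create(&t3, NULL, p3, &seen);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_join(t3, NULL);
    printf("P3 read A = %d\n", seen);
    return seen;
}
```

On a machine without write atomicity, the compiler has to emit whatever fences are needed to keep this guarantee, which is the cost the slide is pointing at.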
Example Code Sequences
• Are SC results guaranteed under TSO and PC?
    (a)  P1               P2
         A = 1;           while (flag == 0);
         flag = 1;        print A;

    (b)  P1               P2
         A = 1;           print B;
         B = 1;           print A;

    (c)  P1               P2                   P3
         A = 1;           while (A == 0);      while (B == 0);
                          B = 1;               print A;

    (d)  P1               P2
         A = 1;           B = 1;
         print B;         print A;
• A popular software-only mutual exclusion algorithm called Dekker’s algorithm (which is used in the absence of hardware support for atomic read-modify-write operations) relies on the property that A and B will not both be read as 0 in (d)
How to Ensure SC Semantics?
• To ensure SC semantics when desired (e.g., to port a program written under SC assumptions to a TSO or PC system), we need mechanisms to enforce 2 types of extra orderings
• A read does not complete before an earlier write in program order (applies to both TSO and PC)
  • Sun’s Sparc V9 provides memory barrier (MEMBAR) or fence instructions of different flavors that can ensure any desired ordering
    • MEMBAR prevents any read that follows it in program order from issuing before all writes that precede it have completed
  • On architectures that do not provide memory barrier instructions, it is possible to achieve this effect by substituting an atomic read-modify-write operation or sequence for the original read
    • A read-modify-write is treated as being both a read and a write, so it cannot be reordered with respect to previous writes in these models
• Write atomicity for a read operation (applies to PC)
  • Replacing a read with a read-modify-write also guarantees write atomicity at that read on machines supporting the PC model
  • Refer to Adve et al. 1993, referenced in the textbook
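The first kind of extra ordering (a read must not complete before an earlier write) can be illustrated with the two-processor sequence in which each processor writes one variable and then reads the other. This is a minimal sketch, assuming C11 atomics and POSIX threads; `atomic_thread_fence(memory_order_seq_cst)` stands in for a write-to-read fence such as MEMBAR, and the helper names `run_once` and `run_trials` are hypothetical. With the fence in both threads, the C11 rules forbid both reads returning 0.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int A, B;
static int r1, r2;   /* values "printed" by P1 and P2 */

/* P1: A = 1; fence; read B */
static void *p1(void *arg) {
    (void)arg;
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* keeps the read behind the write */
    r1 = atomic_load_explicit(&B, memory_order_relaxed);
    return NULL;
}

/* P2: B = 1; fence; read A */
static void *p2(void *arg) {
    (void)arg;
    atomic_store_explicit(&B, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r2 = atomic_load_explicit(&A, memory_order_relaxed);
    return NULL;
}

/* Runs the two threads once; returns r1 + r2 (0 would mean both reads saw 0). */
int run_once(void) {
    pthread_t t1, t2;
    atomic_store(&A, 0);
    atomic_store(&B, 0);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return r1 + r2;
}

/* Returns 1 if no trial produced the non-SC result (0, 0). */
int run_trials(int n) {
    for (int i = 0; i < n; i++)
        if (run_once() == 0)
            return 0;
    return 1;
}
```

Removing the two fences makes (0, 0) a permitted outcome, and on hardware with write buffers it can show up in practice, although a short test run may not happen to trigger it.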
2. Relaxing the W-R and W-W Program Orders
• It allows writes and reads to bypass earlier writes (to different locations)
• It enables multiple write misses to be fully overlapped and to become visible out of program order
• Sun Sparc’s Partial Store Ordering (PSO) model belongs to this category
• The only additional instruction we need over TSO is one that enforces W-W ordering in a process’s program order
  • In Sun’s Sparc V9, it can be achieved by using a MEMBAR instruction
  • Sun’s Sparc V8 provides a special instruction called store barrier (STBAR) to achieve this
3. Relaxing All Program Orders
• No program orders are guaranteed by default
• These models are particularly well matched to superscalar processors whose implementation allows for proceeding past read misses to other memory locations
• Prominent models in this category
  • Weak ordering (WO): WO is the seminal model
  • Release consistency (RC)
  • Sparc V9 relaxed memory ordering (RMO)
  • Digital Alpha model
  • IBM PowerPC model
Weak Ordering (WO)
• The motivation of WO is quite simple
  • Most parallel programs use synchronization operations to coordinate accesses to data when necessary
  • Between synchronization operations, they do not rely on the order of accesses being preserved
    P1, P2, ..., Pn
    ...
    Lock(TaskQ)
        newTask->next = Head;
        if (Head != NULL) Head->prev = newTask;
        Head = newTask;
    UnLock(TaskQ)
    ...
Illustration of WO
    Block 1:  Read/Write ... Read/Write
    Sync (Acquire)
    Block 2:  Read/Write ... Read/Write
    Sync (Release)
    Block 3:  Read/Write ... Read/Write
• Read, write, and read-modify-write operations in blocks 1, 2, and 3 can be arbitrarily reordered within their own block
Weak Ordering (WO)
• The intuitive semantics are not violated by any program reorderings as long as synchronization operations are not reordered with respect to data accesses
• Sufficient conditions to ensure a WO system
  • Before a synchronization operation is issued, the processor waits for all previous operations in program order to have completed
  • Similarly, memory accesses that follow the synchronization operation are not issued until the synchronization operation completes
• When synchronization operations are infrequent, as in many parallel programs, WO typically provides considerable reordering freedom to the hardware and compiler
Release Consistency (RC)
• Improvement over WO
• Acquire can be reordered with memory accesses in block 1
  • The purpose of an acquire is to delay memory accesses in block 2 until the acquire completes
  • No reason to wait for block 1 to complete before the acquire can be issued
• Release can be reordered with memory accesses in block 3
  • The purpose of a release is to grant access to the new data that are modified before the release in program order
  • No reason to delay processing block 3 until the release has completed
    Block 1:  Read/Write ... Read/Write
    Sync (Acquire)
    Block 2:  Read/Write ... Read/Write
    Sync (Release)
    Block 3:  Read/Write ... Read/Write
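The acquire and release roles described above map onto a simple spinlock. This is a minimal sketch, assuming C11 atomics and POSIX threads; it reuses the task-queue names from the WO slide (TaskQ, Head, newTask), while the `Task` type, the thread counts, and the helper `run_task_queue` are assumptions made for illustration. The lock is taken with acquire ordering and dropped with release ordering, so accesses inside the critical section stay between the two, while accesses outside it are free to move.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct Task { struct Task *next, *prev; } Task;

static atomic_flag TaskQ = ATOMIC_FLAG_INIT;   /* lock word for the queue */
static Task *Head = NULL;                      /* protected by TaskQ */

/* Acquire: later accesses may not move above this point. */
static void Lock(atomic_flag *l) {
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire)) { }
}

/* Release: earlier accesses may not move below this point. */
static void UnLock(atomic_flag *l) {
    atomic_flag_clear_explicit(l, memory_order_release);
}

enum { NTHREADS = 4, PER_THREAD = 1000 };
static Task tasks[NTHREADS][PER_THREAD];

/* Each worker pushes its own tasks onto the shared list. */
static void *worker(void *arg) {
    Task *mine = arg;
    for (int i = 0; i < PER_THREAD; i++) {
        Task *newTask = &mine[i];
        Lock(&TaskQ);
        newTask->next = Head;
        newTask->prev = NULL;
        if (Head != NULL) Head->prev = newTask;
        Head = newTask;
        UnLock(&TaskQ);
    }
    return NULL;
}

/* Hypothetical helper: runs the workers and returns the final list length. */
int run_task_queue(void) {
    pthread_t t[NTHREADS];
    int count = 0;
    Head = NULL;
    for (int k = 0; k < NTHREADS; k++)
        pthread_create(&t[k], NULL, worker, tasks[k]);
    for (int k = 0; k < NTHREADS; k++)
        pthread_join(t[k], NULL);
    for (Task *p = Head; p != NULL; p = p->next)
        count++;
    return count;
}
```

A production lock would normally back off or block instead of spinning; the busy-wait loop is kept only to make the acquire and release points easy to see.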
Memory Barriers of Commercial Processors
• Processors provide specific instructions called memory barriers or fences that can be used to enforce orderings
  • Synchronization operations (or acquires or releases) cause the compiler to insert the appropriate special instructions, or the programmer can insert these instructions directly
• Alpha supports 2 kinds of fence instructions: the memory barrier (MB) and the write memory barrier (WMB)
  • The MB fence is like a synchronization operation in WO
    • It waits for all previously issued memory accesses to complete before issuing any new accesses
  • The WMB fence imposes program order only between writes
    • Thus, a read issued after a WMB can still bypass a write access issued before the WMB
Memory Barriers of Commercial Processors
• The Sparc V9 RMO provides a fence or MEMBAR instruction with 4 flavor bits associated with it
  • Each bit indicates a particular type of ordering to be enforced between previous and following load-store operations
  • The 4 possibilities are R-R, R-W, W-R, and W-W
  • Any combination of these bits can be set, offering a variety of ordering choices
• The IBM PowerPC model provides only a single fence instruction, called SYNC, that is equivalent to Alpha’s MB fence
Programmer’s Interface
• A program running “correctly” on a system with TSO (with enough memory barriers) will not necessarily work “correctly” on a system with WO
• Programmer
  • Programmers ensure that all synchronization operations are explicitly labeled or identified
  • For example, LOCK, UNLOCK, and BARRIER
• System (compiler and hardware)
  • The compiler or run-time library translates these synchronization operations into the appropriate order-preserving operations (memory barriers or fences)
  • Then, the system (compiler plus hardware) guarantees sequentially consistent executions even though it may reorder operations between synchronization operations
A Typical Memory Hierarchy
• Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology
[Figure: memory hierarchy from higher level to lower level. On-chip components: CPU core with Reg File, ITLB, DTLB, L1I (Instr Cache), L1D (Data Cache), and L2 (Second Level) Cache; then Main Memory (DRAM); then Secondary Storage (Disk)]
• Note that the cache coherence hardware updates or invalidates only the memory and the caches (not the registers of the CPU)
The Memory Hierarchy: Why Does It Work?
• Temporal Locality (locality in time)
  • If a memory location is referenced, then it will tend to be referenced again soon
  • Keep the most recently accessed data items closer to the processor
• Spatial Locality (locality in space)
  • If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon
  • Move blocks consisting of contiguous words closer to the processor
Example of Locality
    int A[100], B[100], C[100], D;
    for (i = 0; i < 100; i++) {
        C[i] = A[i] * B[i] + D;
    }
[Figure: memory layout of A[0..99], B[0..99], C[0..99], and D at consecutive addresses, with one cache line (block) marked]
Slide from Prof. Sean Lee, Georgia Tech
Volatile
• When would you use a variable declaration with volatile, for example, in C?
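One common answer is a flag that is changed outside the normal flow of the program, such as by a signal handler (memory-mapped device registers are another typical case). The sketch below is one possible illustration, assuming a hosted C environment with `<signal.h>`; the names `got_signal`, `on_signal`, and `wait_for_signal_demo` are hypothetical. Without volatile, the compiler could keep the flag in a register and never re-read it in the loop.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>

/* Written by the handler, read by the main flow: must not be register allocated. */
static volatile sig_atomic_t got_signal = 0;

/* Signal handler: only sets the flag. */
static void on_signal(int signo) {
    (void)signo;
    got_signal = 1;
}

/* Hypothetical helper: installs the handler, raises SIGINT in this program,
   waits on the flag, and returns its final value (or -1 on failure). */
int wait_for_signal_demo(void) {
    got_signal = 0;
    if (signal(SIGINT, on_signal) == SIG_ERR)
        return -1;
    raise(SIGINT);                 /* returns after the handler has run */
    while (!got_signal) { }        /* volatile forces a fresh load each time */
    signal(SIGINT, SIG_DFL);       /* restore the default action */
    printf("got_signal = %d\n", (int)got_signal);
    return (int)got_signal;
}
```

Note that volatile here addresses only the compiler's view of a single thread; for communication between threads, the ordering questions from the earlier slides still apply.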