This programming example explores shared memory machines and the race conditions that arise when multiple processors access the same variable. It also discusses different approaches to building parallel machines and the evolution of shared cache.
CS 267: Shared Memory Machines. Programming Example: Sharks and Fish. James Demmel, demmel@cs.berkeley.edu, www.cs.berkeley.edu/~demmel/cs267_Spr06 CS267 Lecture 4
Basic Shared Memory Architecture • Processors all connected to a large shared memory • Where are caches? [Figure: processors P1, P2, …, Pn connected through an interconnect to a shared memory] • Now take a closer look at structure, costs, limits, programming
Outline • Evolution of Hardware and Software • CPUs getting exponentially faster than the memory they share • Hardware evolves to try to match speeds • Program semantics evolve too • Programs may change from correct to buggy, unless programmed carefully • Performance evolves as well • Well-tuned programs today may be inefficient tomorrow • Goal: teach a programming style likely to stay correct, if not always as efficient as possible • Use locks to avoid race conditions • Current research seeks best of both worlds • Example: Sharks and Fish (part of next homework)
Processor-DRAM Gap (latency) [Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000). CPU (µProc) performance improves 60%/yr. (“Moore’s Law”); DRAM improves 7%/yr.; the processor-memory performance gap grows 50%/year.]
Shared Memory Code for Computing a Sum s = f(A[0]) + f(A[1])
static int s = 0;
Thread 0: s = s + f(A[0])
Thread 1: s = s + f(A[1])
• Might get f(A[0]) + f(A[1]) or f(A[0]) or f(A[1]) • Problem is a race condition on variable s in the program • A race condition or data race occurs when: • two processors (or two threads) access the same variable, and at least one does a write. • The accesses are concurrent (not synchronized) so they could happen simultaneously
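The three possible results can be reproduced without real threads by replaying each thread's update as separate load, add, and store steps in a fixed order. The following is only an illustrative sketch: the function `replay_sum`, the `schedule` numbering, and the stand-in `f` (squaring) are assumptions made here, not part of the lecture.

```c
#include <assert.h>

/* Hypothetical stand-in for the slide's f(); any function of one int works. */
static int f(int x) { return x * x; }

/*
 * Replays "s = s + f(A[i])" for two threads as explicit load/add/store
 * steps in a fixed order, so each outcome is reproducible.
 * schedule 0: thread 0 finishes before thread 1 starts (no overlap)
 * schedule 1: both load s, then thread 0 stores, then thread 1 stores
 * schedule 2: both load s, then thread 1 stores, then thread 0 stores
 */
int replay_sum(const int A[2], int schedule) {
    int s = 0;
    int r0, r1;                 /* each thread's private register copy of s */
    assert(schedule >= 0 && schedule <= 2);
    switch (schedule) {
    case 0:
        r0 = s; s = r0 + f(A[0]);
        r1 = s; s = r1 + f(A[1]);
        break;
    case 1:
        r0 = s; r1 = s;
        s = r0 + f(A[0]);
        s = r1 + f(A[1]);       /* overwrites thread 0's update */
        break;
    default:
        r0 = s; r1 = s;
        s = r1 + f(A[1]);
        s = r0 + f(A[0]);       /* overwrites thread 1's update */
        break;
    }
    return s;
}
```

Schedule 0 yields f(A[0]) + f(A[1]); schedules 1 and 2 each lose one update, leaving only f(A[1]) or f(A[0]).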
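Approaches to Building Parallel Machines [Figure: three organizations arranged along a “Scale” axis. Shared Cache: P1 … Pn connect through a switch to an interleaved first-level cache ($) and interleaved main memory. Centralized Memory (UMA = Uniform Memory Access): P1 … Pn, each with its own cache, reach memory modules over an interconnection network. Distributed Memory (NUMA = Non-UMA): P1 … Pn, each with its own cache and memory, connected by an interconnection network.]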
Shared Cache: Advantages and Disadvantages Advantages • Placement of data in shared cache identical to single processor case • Only one copy of any cached block • Can’t have values of same memory location in different caches • Fine-grain sharing is possible • “Good” Interference • One processor may prefetch data for another • Can share data within a cache line without moving line Disadvantages • Bandwidth limitation • “Bad” Interference • One processor may flush another processor’s data
Evolution of Shared Cache • Alliant FX-8 (early 1980s) • eight 68020s with x-bar to 512 KB interleaved cache • Encore & Sequent (1980s) • first 32-bit micros (NS32032) • two to a board with a shared cache • Disappeared for a while, and then … • Cray X1 shares L3 cache • IBM Power 4, Power 5, BlueGene nodes share L2 cache • If switch and cache on chip, may have enough bandwidth again
Intuitive Memory Model • Reading an address should return the last value written to that address • Easy in uniprocessors • except for I/O • Cache coherence problem in MPs is more pervasive and more performance critical • More formally, this is called sequential consistency: “A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” [Lamport, 1979]
Sequential Consistency Intuition • Sequential consistency says the machine behaves as if it does the following [Figure: processors P0, P1, P2, P3 each connected to a single shared memory]
Memory Consistency Semantics What does this imply about program behavior? • No process ever sees “garbage” values, i.e., the average of 2 values • Processors always see values written by some processor • The value seen is constrained by program order on all processors • Time always moves forward • Example: spin lock • P1 writes data=1, then writes flag=1 • P2 waits until flag=1, then reads data • If P2 sees the new value of flag (=1), it must see the new value of data (=1)
initially: flag=0, data=0
P1: data = 1; flag = 1
P2: 10: if flag=0, goto 10; … = data
If Caches are Not “Coherent” • Coherence means different copies of the same location have the same value, incoherent otherwise: • p1 and p2 both have cached copies of data (= 0) • p1 writes data=1 • May “write through” to memory • p2 reads data, but gets the “stale” cached copy • This may happen even if it read an updated value of another variable, flag, that came from memory [Figure: memory and the caches of p1 and p2, each holding a copy of data; p1’s copy is updated to 1 while p2’s copy remains 0]
Snoopy Cache-Coherence Protocols • Memory bus is a broadcast medium • Caches contain information on which addresses they store • Cache Controller “snoops” all transactions on the bus • A transaction is a relevant transaction if it involves a cache block currently contained in this cache • Take action to ensure coherence • invalidate, update, or supply value • Many possible designs (see CS252 or CS258) [Figure: processors P0 … Pn, each with a cache ($) whose lines hold State, Address, and Data, attached to a memory bus with memory modules; each cache snoops the bus while a memory op from Pn is in progress]
Limits of Bus-Based Shared Memory Assume: 1 GHz processor w/o cache => 4 GB/s inst BW per processor (32-bit) => 1.2 GB/s data BW at 30% load-store. Suppose 98% inst hit rate and 95% data hit rate => 80 MB/s inst BW per processor => 60 MB/s data BW per processor => 140 MB/s combined BW. Assuming 1 GB/s bus bandwidth, ∴ 8 processors will saturate the bus. [Figure: processors (PROC) with caches on a shared bus with memory modules (MEM) and I/O; 5.2 GB/s between each processor and its cache, 140 MB/s between each cache and the bus]
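The arithmetic on this slide can be checked with a few lines of integer code. This is a small sketch under the slide's own numbers; the function names `bus_demand_mb_per_s` and `processors_to_saturate` are made up here for illustration.

```c
#include <assert.h>

/* Bandwidth figures are in MB/s; hit rates are in percent, as on the slide. */
int bus_demand_mb_per_s(int inst_bw, int data_bw,
                        int inst_hit_pct, int data_hit_pct) {
    assert(inst_hit_pct >= 0 && inst_hit_pct <= 100);
    assert(data_hit_pct >= 0 && data_hit_pct <= 100);
    /* only the traffic that misses in the cache reaches the bus */
    int inst_miss = inst_bw * (100 - inst_hit_pct) / 100;
    int data_miss = data_bw * (100 - data_hit_pct) / 100;
    return inst_miss + data_miss;
}

/* Smallest processor count whose combined demand reaches the bus bandwidth. */
int processors_to_saturate(int bus_bw, int demand_per_proc) {
    assert(demand_per_proc > 0);
    return (bus_bw + demand_per_proc - 1) / demand_per_proc; /* ceiling division */
}
```

With 4000 MB/s of instruction traffic at a 98% hit rate and 1200 MB/s of data traffic at a 95% hit rate, each processor puts 80 + 60 = 140 MB/s on the bus, and 1000 / 140 ≈ 7.1 rounds up to 8 processors.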
Sample Machines • Intel Pentium Pro Quad • Coherent • 4 processors • Sun Enterprise server • Coherent • Up to 16 processor and/or memory-I/O cards • IBM Blue Gene/L • L1 not coherent, L2 shared
Basic Choices in Memory/Cache Coherence • Keep a directory to track which memory stores the latest copy of data • Directory, like cache, may keep information such as: • Valid/invalid • Dirty (inconsistent with memory) • Shared (in other caches) • When a processor executes a write operation to shared data, basic design choices are: • With respect to memory: • Write through cache: do the write in memory as well as cache • Write back cache: wait and do the write later, when the item is flushed • With respect to other cached copies • Update: give all other processors the new value • Invalidate: all other processors remove from cache • See CS252 or CS258 for details
SGI Altix 3000 • A node contains up to 4 Itanium 2 processors and 32GB of memory • Network is SGI’s NUMAlink, the NUMAflex interconnect technology. • Uses a mixture of snoopy and directory-based coherence • Up to 512 processors that are cache coherent (global address space is possible for larger machines)
Cache Coherence and Sequential Consistency • There is a lot of hardware/work to ensure coherent caches • Never more than 1 version of data for a given address in caches • Data is always a value written by some processor • But other HW/SW features may break sequential consistency (SC): • The compiler reorders/removes code (e.g., your spin lock) • The compiler allocates a register for flag on Processor 2 and spins on that register value without ever completing • Write buffers (place to store writes while waiting to complete) • Processors may reorder writes to merge addresses (not FIFO) • Write X=1, Y=1, X=2 (second write to X may happen before Y’s) • Prefetch instructions cause read reordering (read data before flag) • The network reorders the two write messages. • The write to flag is nearby, whereas data is far away. • Some of these can be prevented by declaring variables “volatile” • Most current commercial SMPs give up SC • A correct program on an SC processor may be incorrect on one that is not
Programming with Weaker Memory Models than SC • Possible to reason about machines with fewer properties, but difficult • Some rules for programming with these models • Avoid race conditions • Use system-provided synchronization primitives • If you have race conditions on variables, make them volatile • At the assembly level, may use “fences” (or analogs) directly • The high level language support for these differs • Built-in synchronization primitives normally include the necessary fence operations • lock(); … only one thread at a time allowed here …; unlock() • Region between lock/unlock called critical region • For performance, need to keep critical region short
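As one portable way to express the needed ordering for the flag/data hand-off, the sketch below uses C11 atomics with release/acquire ordering, which play the role of the fences mentioned above. This postdates the lecture and is an assumption about how one might write it today, not the course's code; note that in ISO C, volatile by itself does not order accesses between threads. The names `producer`, `consumer`, and `run_handoff` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int data;             /* plain shared variable, published via flag */
static atomic_int flag;      /* synchronization variable */

static void *producer(void *arg) {          /* plays the role of P1 */
    (void)arg;
    data = 1;
    /* release: the write to data is ordered before the write to flag */
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {          /* plays the role of P2 */
    int *seen = arg;
    /* acquire: once flag == 1 is observed, the write to data is visible */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;  /* spin */
    *seen = data;
    return NULL;
}

/* Runs the P1/P2 hand-off once and returns the value of data that P2 read. */
int run_handoff(void) {
    pthread_t p1, p2;
    int seen = -1;
    data = 0;
    atomic_store(&flag, 0);
    int rc = pthread_create(&p2, NULL, consumer, &seen);
    assert(rc == 0);
    rc = pthread_create(&p1, NULL, producer, NULL);
    assert(rc == 0);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    return seen;
}
```

Because the store to flag is a release and the load is an acquire, a consumer that sees flag == 1 is guaranteed to read data == 1, which is the property the spin lock example relies on.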
Improved Code for Computing a Sum s = f(A[0]) + … + f(A[n-1])
static int s = 0;
static lock lk;
Thread 1: local_s1 = 0; for i = 0, n/2-1: local_s1 = local_s1 + f(A[i]); lock(lk); s = s + local_s1; unlock(lk);
Thread 2: local_s2 = 0; for i = n/2, n-1: local_s2 = local_s2 + f(A[i]); lock(lk); s = s + local_s2; unlock(lk);
• Since addition is associative, it’s OK to rearrange order • Critical section smaller • Most work outside it
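A possible Pthreads rendering of the two-thread sum is sketched here. It assumes `pthread_mutex_t` as the lock and uses a stand-in `f` (squaring); the names `partial_sum`, `parallel_sum`, and `struct range` are illustrative, not taken from the course code.

```c
#include <assert.h>
#include <pthread.h>

static int s;                                   /* shared sum */
static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for the slide's f(). */
static int f(int x) { return x * x; }

struct range { const int *A; int lo, hi; };     /* half-open [lo, hi) */

static void *partial_sum(void *arg) {
    const struct range *r = arg;
    int local_s = 0;                            /* private: no lock needed */
    for (int i = r->lo; i < r->hi; i++)
        local_s += f(r->A[i]);
    pthread_mutex_lock(&lk);                    /* critical section: one add */
    s += local_s;
    pthread_mutex_unlock(&lk);
    return NULL;
}

/* Sums f(A[0..n-1]) with two threads, each taking half of the array. */
int parallel_sum(const int *A, int n) {
    struct range halves[2] = { { A, 0, n / 2 }, { A, n / 2, n } };
    pthread_t t[2];
    s = 0;
    for (int k = 0; k < 2; k++) {
        int rc = pthread_create(&t[k], NULL, partial_sum, &halves[k]);
        assert(rc == 0);
    }
    for (int k = 0; k < 2; k++)
        pthread_join(t[k], NULL);
    return s;
}
```

Each thread takes the lock once, no matter how long its half of the array is, so the critical section stays short.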
Caches and Scientific Computing • Caches tend to perform worst on demanding applications that operate on large data sets • transaction processing • operating systems • sparse matrices • Modern scientific codes use tiling/blocking to become cache friendly • easier for dense matrix codes (e.g., matmul) than for sparse • tiling and parallelism are similar transformations to program
Sharing: A Performance Problem • True sharing • Frequent writes to a variable can create a bottleneck • OK for read-only or infrequently written data • Technique: make copies of the value, one per processor, if this is possible in the algorithm • Example problem: the data structure that stores the freelist/heap for malloc/free • False sharing • Cache block may also introduce artifacts • Two distinct variables in the same cache block • Technique: allocate data used by each processor contiguously, or at least avoid interleaving in memory • Example problem: an array of ints, one written frequently by each processor (many ints per cache line)
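One common way to avoid the false-sharing example (an array with one frequently written counter per processor) is to pad each counter out to a full cache line. The sketch below assumes a 64-byte line, which should be checked on the target machine; `struct padded_counter`, `bump`, and `count_padded` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>

#define CACHE_LINE 64   /* assumed line size in bytes; check the target machine */
#define NTHREADS 4

/* One counter per thread, padded so consecutive counters are a full line apart. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counters[NTHREADS];

struct job { int tid; long iters; };

static void *bump(void *arg) {
    const struct job *j = arg;
    for (long i = 0; i < j->iters; i++)
        counters[j->tid].value++;   /* each thread writes only its own slot */
    return NULL;
}

/* Each thread increments its own counter iters times; returns the total. */
long count_padded(long iters) {
    pthread_t t[NTHREADS];
    struct job jobs[NTHREADS];
    long total = 0;
    for (int k = 0; k < NTHREADS; k++) {
        counters[k].value = 0;      /* reset before thread k exists */
        jobs[k].tid = k;
        jobs[k].iters = iters;
        int rc = pthread_create(&t[k], NULL, bump, &jobs[k]);
        assert(rc == 0);
    }
    for (int k = 0; k < NTHREADS; k++) {
        pthread_join(t[k], NULL);
        total += counters[k].value;
    }
    return total;
}
```

No lock is needed because no two threads write the same variable; the padding only changes where the counters sit in memory, trading a little space for less coherence traffic.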
What to Take Away? • Programming shared memory machines • May allocate data in large shared region without too many worries about where • Memory hierarchy is critical to performance • Even more so than on uniprocessors, due to coherence traffic • For performance tuning, watch sharing (both true and false) • Semantics • Need to lock access to shared variable for read-modify-write • Sequential consistency is the natural semantics • Architects worked hard to make this work • Caches are coherent with buses or directories • No caching of remote data on shared address space machines • But compiler and processor may still get in the way • Non-blocking writes, read prefetching, code motion… • Avoid races or use machine-specific fences carefully
Creating Parallelism with Threads
Programming with Threads Several Thread Libraries • Pthreads is the POSIX standard • Solaris threads are very similar • Relatively low level • Portable but possibly slow • OpenMP is a newer standard • Support for scientific programming on shared memory • http://www.openMP.org • P4 (Parmacs) is an older portable package • Higher level than Pthreads • http://www.netlib.org/p4/index.html
Language Notions of Thread Creation • cobegin/coend • fork/join • cobegin cleaner, but fork is more general
cobegin job1(a1); job2(a2); coend
• Statements in block may run in parallel • cobegins may be nested • Scoped, so you cannot have a missing coend
tid1 = fork(job1, a1); job2(a2); join tid1;
• Forked function runs in parallel with current • join waits for completion (may be in different function)
Forking Posix Threads
Signature: int pthread_create(pthread_t *, const pthread_attr_t *, void * (*)(void *), void *);
Example call: errcode = pthread_create(&thread_id, &thread_attribute, &thread_fun, &fun_arg);
• thread_id is the thread id or handle (used to halt, etc.) • thread_attribute various attributes • standard default values obtained by passing a NULL pointer • thread_fun the function to be run (takes and returns void*) • fun_arg an argument can be passed to thread_fun when it starts • errcode will be set nonzero if the create operation fails
Posix Thread Example
#include <pthread.h>
#include <stdio.h>
/* thread functions take and return void* */
void *print_fun(void *message) { printf("%s \n", (char *) message); return NULL; }
int main() {
  pthread_t thread1, thread2;
  char *message1 = "Hello";
  char *message2 = "World";
  pthread_create(&thread1, NULL, print_fun, (void *) message1);
  pthread_create(&thread2, NULL, print_fun, (void *) message2);
  /* wait for both threads, so main does not exit before they print */
  pthread_join(thread1, NULL);
  pthread_join(thread2, NULL);
  return 0;
}
Compile using gcc -lpthread. See Millennium/Seaborg docs for paths/modules. Note: There is a race condition in the print statements
Loop Level Parallelism • Many scientific applications have parallelism in loops • With threads:
… my_stuff[n][n];
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++)
    … pthread_create(update_cell, …, my_stuff[i][j]);   (also need i & j)
• But overhead of thread creation is nontrivial
Shared Data and Threads • Variables declared outside of main are shared • Objects allocated on the heap may be shared (if a pointer is passed) • Variables on the stack are private: passing pointers to these around to other threads can cause problems • Often done by creating a large “thread data” struct • Passed into all threads as argument
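The “thread data” struct idea can be sketched as follows, using one thread per block of rows rather than one per cell so that thread-creation overhead is paid only a few times. The field names, the fixed `NTHREADS`, and the body of `update_cell` are assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4

/* Everything a worker needs, bundled so one pointer can be passed in. */
struct thread_data {
    int tid;          /* which worker this is */
    int n;            /* grid is n x n */
    double *grid;     /* shared, row-major */
};

/* Hypothetical per-cell update; it uses i and j as well as the cell. */
static void update_cell(double *cell, int i, int j) {
    *cell += i + j;
}

static void *update_rows(void *arg) {
    const struct thread_data *td = arg;
    int lo = td->tid * td->n / NTHREADS;          /* first row of this block */
    int hi = (td->tid + 1) * td->n / NTHREADS;    /* one past the last row */
    for (int i = lo; i < hi; i++)
        for (int j = 0; j < td->n; j++)
            update_cell(&td->grid[i * td->n + j], i, j);
    return NULL;
}

/* Updates every cell once using NTHREADS threads instead of n*n threads. */
void update_grid(double *grid, int n) {
    pthread_t t[NTHREADS];
    struct thread_data td[NTHREADS];   /* on this stack; must outlive the threads */
    for (int k = 0; k < NTHREADS; k++) {
        td[k] = (struct thread_data){ k, n, grid };
        int rc = pthread_create(&t[k], NULL, update_rows, &td[k]);
        assert(rc == 0);
    }
    for (int k = 0; k < NTHREADS; k++)
        pthread_join(t[k], NULL);      /* joined before td goes out of scope */
}
```

Passing pointers into the creating function's stack is safe here only because every thread is joined before `update_grid` returns, which is the hazard the slide warns about.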
Basic Types of Synchronization: Barrier
Barrier -- global synchronization • fork multiple copies of the same function “work” • SPMD “Single Program Multiple Data” • simple use of barriers -- all threads hit the same one
work_on_my_subgrid(); barrier; read_neighboring_values(); barrier;
• more complicated -- barriers on branches (or loops)
if (tid % 2 == 0) { work1(); barrier } else { barrier }
• barriers are not provided in all thread libraries
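Where the platform provides POSIX barriers (they are an optional part of the standard, so this is an assumption about the system), the simple write/barrier/read pattern might look like the sketch below. The arrays `value` and `neighbor` and the functions `work`, `run_round`, and `neighbor_of` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4

static pthread_barrier_t bar;
static int value[NTHREADS];      /* phase 1: each thread writes its own slot */
static int neighbor[NTHREADS];   /* phase 2: each thread reads its right neighbor */

static void *work(void *arg) {
    int tid = *(int *)arg;
    value[tid] = 10 * tid;                       /* work_on_my_subgrid() */
    pthread_barrier_wait(&bar);                  /* all writes are done past here */
    neighbor[tid] = value[(tid + 1) % NTHREADS]; /* read_neighboring_values() */
    return NULL;
}

/* Runs one write/barrier/read round; results land in neighbor[]. */
void run_round(void) {
    pthread_t t[NTHREADS];
    int ids[NTHREADS];
    int rc = pthread_barrier_init(&bar, NULL, NTHREADS);
    assert(rc == 0);
    for (int k = 0; k < NTHREADS; k++) {
        ids[k] = k;
        rc = pthread_create(&t[k], NULL, work, &ids[k]);
        assert(rc == 0);
    }
    for (int k = 0; k < NTHREADS; k++)
        pthread_join(t[k], NULL);
    pthread_barrier_destroy(&bar);
}

/* Accessor so callers can inspect what thread tid read from its neighbor. */
int neighbor_of(int tid) { return neighbor[tid]; }
```

Without the barrier, a thread could read its neighbor's slot before the neighbor has written it.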
Basic Types of Synchronization: Mutexes
Mutexes -- mutual exclusion aka locks • threads are working mostly independently • need to access common data structure
lock *l = alloc_and_init(); /* shared */
acquire(l); access data; release(l);
• Java and other languages have lexically scoped synchronization • similar to cobegin/coend vs. fork and join • Semaphores give guarantees on “fairness” in getting the lock, but the same idea of mutual exclusion • Locks only affect processors using them: • pair-wise synchronization
A Model Problem: Sharks and Fish • Illustration of parallel programming • Original version (discrete event only) proposed by Geoffrey Fox • Called WATOR • Sharks and fish living in a 2D toroidal ocean • We can imagine several variations to show different physical phenomena • Basic idea: sharks and fish living in an ocean • rules for movement • breeding, eating, and death • forces in the ocean • forces between sea creatures
Particle Systems • A particle system has • a finite number of particles. • moving in space according to Newton’s Laws (i.e., F = ma). • time is continuous. • Examples: • stars in space with laws of gravity. • electron beam and ion beam semiconductor manufacturing. • atoms in a molecule with electrostatic forces. • neutrons in a fission reactor. • cars on a freeway with Newton’s laws plus model of driver and engine. • Many simulations combine particle simulation techniques with some discrete event techniques • Sharks and Fish as simple example
Forces in Particle Systems • Force on each particle decomposed into near and far: force = external_force + nearby_force + far_field_force • External force • ocean current in sharks and fish • externally imposed electric field in electron beam. • Nearby force • sharks attracted to eat nearby fish • balls on a billiard table bounce off of each other. • Van der Waals forces in fluid (1/r^6). • Far-field force • fish attract other fish by gravity-like (1/r^2) force • gravity, electrostatics • forces governed by elliptic PDE.
Parallelism in External Forces • External forces are the simplest to implement. • The force on each particle is independent of other particles. • Called “embarrassingly parallel”. • Evenly distribute particles on processors • Any even distribution works. • Locality is not an issue, no communication. • For each particle on processor, apply the external force.
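A minimal sketch of this embarrassingly parallel case: each thread advances its own contiguous chunk of particles one explicit Euler step under a constant external force, using a = F / m from F = ma. The struct layouts, `NTHREADS`, and the function names are assumptions for illustration, not the course's Sharks and Fish code.

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS 2

struct particle { double x, y, vx, vy, mass; };

struct chunk {
    struct particle *p;   /* shared particle array */
    int lo, hi;           /* this thread's particles: [lo, hi) */
    double fx, fy;        /* external force, the same for every particle */
    double dt;            /* time step */
};

static void *apply_external(void *arg) {
    const struct chunk *c = arg;
    for (int i = c->lo; i < c->hi; i++) {
        struct particle *q = &c->p[i];
        /* explicit Euler: position is advanced with the old velocity */
        q->x += c->dt * q->vx;
        q->y += c->dt * q->vy;
        q->vx += c->dt * c->fx / q->mass;   /* a = F / m */
        q->vy += c->dt * c->fy / q->mass;
    }
    return NULL;
}

/* One time step under external force (fx, fy); threads never share a particle. */
void step_external(struct particle *p, int n, double fx, double fy, double dt) {
    pthread_t t[NTHREADS];
    struct chunk c[NTHREADS];
    for (int k = 0; k < NTHREADS; k++) {
        c[k] = (struct chunk){ p, k * n / NTHREADS, (k + 1) * n / NTHREADS,
                               fx, fy, dt };
        int rc = pthread_create(&t[k], NULL, apply_external, &c[k]);
        assert(rc == 0);
    }
    for (int k = 0; k < NTHREADS; k++)
        pthread_join(t[k], NULL);
}
```

No locks or barriers are needed within a step because each particle is read and written by exactly one thread; the joins at the end separate one time step from the next.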
Parallelism in Nearby Forces • Nearby forces require interaction and therefore communication. • Force may depend on other nearby particles: • Example: collisions. • simplest algorithm is O(n^2): look at all pairs to see if they collide. • Usual parallel model is decomposition of physical domain: • O(n/p) particles per processor if evenly distributed. • Often called domain decomposition (which also refers to numerical alg.) • Challenges: • Dealing with particles near processor boundaries • Dealing with load imbalance from nonuniformly distributed particles [Figure caption: Need to check for collisions between regions]
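The simplest all-pairs collision test can be sketched in a few lines. Modeling particles as discs with a radius is an assumption made here for illustration; `struct disc` and `count_collisions` are hypothetical names.

```c
#include <assert.h>

struct disc { double x, y, r; };   /* hypothetical: particles as discs of radius r */

/* Counts colliding pairs by testing all n(n-1)/2 pairs: the O(n^2) algorithm. */
int count_collisions(const struct disc *d, int n) {
    assert(n >= 0);
    int hits = 0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            double dx = d[i].x - d[j].x;
            double dy = d[i].y - d[j].y;
            double reach = d[i].r + d[j].r;
            /* compare squared distances to avoid a square root */
            if (dx * dx + dy * dy <= reach * reach)
                hits++;
        }
    }
    return hits;
}
```

With domain decomposition, each processor would run this test on its own region plus the particles close to its boundaries.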
Parallelism in Far-Field Forces • Far-field forces involve all-to-all interaction and therefore communication. • Force depends on all other particles: • Examples: gravity, protein folding • Simplest algorithm is O(n^2) • Just decomposing space does not help since every particle needs to “visit” every other particle. • Use more clever algorithms to lower O(n^2) to O(n log n) • Several later lectures • Implement by rotating particle sets. • Keeps processors busy • All processors eventually see all particles
Examine Sharks and Fish code • Gravitational forces among fish only • Use Euler’s method to move fish numerically • Sequential and Shared Memory with Pthreads: • www.cs.berkeley.edu/~demmel/cs267_Spr05/SharksAndFish
Extra Slides
Engineering: Intel Pentium Pro Quad SMP for the masses: • All coherence and multiprocessing glue in processor module • Highly integrated, targeted at high volume • Low latency and bandwidth
Engineering: SUN Enterprise • Proc + mem card - I/O card • 16 cards of either type • All memory accessed over bus, so symmetric • Higher bandwidth, higher latency bus
Outline • Historical perspective • Bus-based machines • Pentium SMP • IBM SP node • Directory-based (CC-NUMA) machine • Origin 2000 • Global address space machines • Cray T3D and (sort of) T3E
60s Mainframe Multiprocessors • Enhance memory capacity or I/O capabilities by adding memory modules or I/O devices • How do you enhance processing capacity? • Add processors • Already need an interconnect between slow memory banks and processor + I/O channels • cross-bar or multistage interconnection network [Figure: memory modules (Mem) and I/O controllers (IOC) with I/O devices, connected to processors (Proc) through an interconnect]
70s Breakthrough: Caches • Memory system scaled by adding memory modules • Both bandwidth and capacity • Memory was still a bottleneck • Enter… Caches! • Cache does two things: • Reduces average access time (latency) • Reduces bandwidth requirements to memory [Figure: a fast processor P with a cache holding A: 17, connected through an interconnect to slow memory and to an I/O device or another processor]
Technology Perspective
        Capacity        Speed
Logic:  2x in 3 years   2x in 3 years
DRAM:   4x in 3 years   1.4x in 10 years
Disk:   2x in 3 years   1.4x in 10 years
DRAM
Year    Size     Cycle Time
1980    64 Kb    250 ns
1983    256 Kb   220 ns
1986    1 Mb     190 ns
1989    4 Mb     165 ns
1992    16 Mb    145 ns
1995    64 Mb    120 ns
Change from 1980 to 1995: size 1000:1! cycle time 2:1!
Example: Write-thru Invalidate • Update and write-thru both use more memory bandwidth if there are writes to the same address • Update to the other caches • Write-thru to memory [Figure: processors P1, P2, P3 with caches ($), memory holding u:5, and I/O devices; in numbered steps 1 to 5, caches load u = 5, one processor writes u = 7, and later reads by other processors are marked u = ?]