Relaxed Consistency Models
Outline • Lazy Release Consistency • TreadMarks DSM system
Review: what makes a good consistency model? • Model is a contract between memory system and programmer • Programmer follows some rules about reads and writes • Model provides guarantees • Model embodies a tradeoff • Intuitive for programmer vs. Can be implemented efficiently
TreadMarks high-level goals • Better DSM performance • Run existing parallel (and “correct”) code
Which specific problems with IVY does TreadMarks want to fix? • False sharing: two machines use different variables on the same page, and at least one of them writes • IVY makes the page bounce back and forth between the machines • That is unnecessary when the two processes (threads) work on different variables • Whole-page transfers: TreadMarks sends only the written bytes, not whole pages
Goal 1: Reducing the data to be sent • Goal: don’t send the whole page, just the written bytes. • On M1 write fault: • tell other hosts to invalidate but keep a hidden copy. • M1 itself also keeps a hidden copy. • On M2 fault: • M2 asks M1 for recent modifications. • M1 “diffs” its current page against its hidden copy. • M1 sends the differences to M2. • M2 applies the diffs to its hidden copy to produce the up-to-date version.
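The hidden-copy and diff steps above can be sketched as follows. This is a minimal illustration, not TreadMarks code: the names `make_twin`, `compute_diff`, and `apply_diff`, the 16-byte page, and the run-length diff encoding are all assumptions made for the example.

```python
PAGE_SIZE = 16  # assumed tiny page size, for readability only


def make_twin(page):
    """Take the hidden copy of a page before the first write."""
    return bytes(page)


def compute_diff(twin, page):
    """Return a list of (offset, new_bytes) runs where page differs from twin."""
    diff = []
    i = 0
    n = len(page)
    while i < n:
        if page[i] != twin[i]:
            start = i
            while i < n and page[i] != twin[i]:
                i += 1
            diff.append((start, bytes(page[start:i])))
        else:
            i += 1
    return diff


def apply_diff(page, diff):
    """Patch a local copy of the page with the runs in a diff."""
    for offset, data in diff:
        page[offset:offset + len(data)] = data


# M1 write-faults: it keeps a hidden copy, then modifies a few bytes.
m1_page = bytearray(PAGE_SIZE)
m2_page = bytearray(PAGE_SIZE)   # M2's stale hidden copy
twin = make_twin(m1_page)
m1_page[3] = 7
m1_page[4] = 8
m1_page[10] = 9

# M2 faults: M1 diffs its page against the hidden copy and ships only the runs.
page_diff = compute_diff(twin, m1_page)
apply_diff(m2_page, page_diff)
```

Here only three bytes (in two runs) cross the network instead of the whole page.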
Goal 2: allow multiple readers+writers • To cope with false sharing: • no invalidation when a machine writes • no r/w-to-r/o demotion when a machine reads • so there will be multiple “different” copies of a page! Which one should a reader look at? • Diffs help here: they can merge writes to the same page • But when should the diffs be sent? • With no invalidations and no page faults, what triggers sending diffs?
Release Consistency • Think about how you write multithreaded code: to access shared data, you first acquire a lock, then access the data, and finally release the lock. This is considered “correct” programming practice. • In a distributed environment, imagine a lock server: each process must get a lock from the lock server before accessing shared resources. • Thus, on release, we can send write diffs to all copies of the pages written. • This is a new consistency model!
Release Consistency Model • M0 won’t see M1’s writes until M1 releases a lock • so machines can temporarily disagree on memory contents • If programs always follow the locking rules: • locks force an order, so there are no stale reads, just as with sequential consistency • But if you do not follow this guideline (don’t lock): • reads can return stale data • concurrent writes to the same variable cause trouble (data race) • Benefit? • multiple machines can have copies of a page, even when one or more of them write • no bouncing of pages due to false sharing • read copies can co-exist with writers • relies on write diffs; otherwise concurrent writes to the same page can’t be reconciled
Lazy Release Consistency Model • Do we really need to update the pages at the moment a lock is released? If you never use a variable that some other process updates, you do not need to be notified of updates to it. • Only fetch write diffs on acquire of a lock, and only fetch them from the previous holder of that lock. Thus nothing is sent at the time of a write or a release. • This is called the Lazy Release Consistency Model (LRC), and it is another new consistency model! • LRC hides some writes that RC reveals. • Benefit? • if you don’t acquire the lock on an object, you don’t have to fetch updates to it • if you use just some variables on a page, there is no need to fetch writes to the others • less network traffic
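A toy sketch of the lazy rule (fetch diffs only on acquire, only from the previous holder of that lock) might look like this. The `Lock` and `Machine` classes are hypothetical, diffs are modelled as plain dictionaries of variable writes, each machine holds one lock at a time, and the causal part of LRC (writes the previous holder had seen under other locks) is deliberately left out.

```python
class Lock:
    """Hypothetical lock record: remembers which machine released it last."""

    def __init__(self):
        self.last_releaser = None


class Machine:
    def __init__(self, name):
        self.name = name
        self.memory = {"x": 0, "y": 0}  # local copy of the shared page
        self.dirty = {}                 # writes made since the last acquire
        self.diffs = {}                 # lock -> writes known under that lock
        self.diff_replies_sent = 0      # counts network replies, for illustration

    def acquire(self, lock):
        prev = lock.last_releaser
        if prev is not None and prev is not self:
            # Lazy step: pull diffs only now, and only from the previous holder.
            received = prev.diffs.get(lock, {})
            self.memory.update(received)
            # Remember them so a later acquirer can get them from us.
            self.diffs.setdefault(lock, {}).update(received)
            prev.diff_replies_sent += 1
        self.dirty = {}

    def write(self, var, value):
        self.memory[var] = value        # purely local: no message is sent
        self.dirty[var] = value

    def release(self, lock):
        # Nothing is sent on release; the diffs are only recorded locally.
        self.diffs.setdefault(lock, {}).update(self.dirty)
        lock.last_releaser = self


lock1, lock2 = Lock(), Lock()
m0, m1, m2 = Machine("M0"), Machine("M1"), Machine("M2")

m0.acquire(lock1); m0.write("x", 1); m0.release(lock1)
m1.acquire(lock2); m1.write("y", 1); m1.release(lock2)
m2.acquire(lock1); seen = dict(m2.memory); m2.release(lock1)
```

In this run M2 sees `x = 1` but still has the old `y`, and M1 is never contacted, which is the traffic saving the slide describes.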
Sequential vs Release Consistency • Sequential consistency: every write is broadcast; more message passing • Release consistency: writes are broadcast only at synchronization points; more memory overhead
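Read-Write False Sharing [Figure: timeline of operations w(x), w(x), w(x), r(x), r(y), r(y)]
Read-Write False Sharing [Figure: timeline of operations w(x), w(x), r(y), r(y), r(x), with a synch point]
Write-Write False Sharing [Figure: timeline of operations w(x), w(x), w(x), r(x), w(y), w(y), with a synch point]
Multiple-Writer False Sharing [Figure: timeline of operations w(x), w(x), w(x), w(y), r(x), w(y), with a synch point]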
Example 1 (false sharing) • x and y are on the same page (a: acquire, r: release) • M0: a1 for (…) x++ r1 • M1: a2 for (…) y++ r2 a1 print x, y r1 • What does IVY do? • What does TreadMarks do? • M0 and M1 both get a cached writeable copy of the page • when they release, each computes a diff against the original page • M1’s a1 causes it to pull write diffs from the last holder of lock 1, so M1 updates x in its page
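Example 1 can be traced with a small sketch. It assumes the page is a two-slot list (slot 0 is x, slot 1 is y), that a diff is a dictionary from slot index to new value, and that the loops run 3 and 5 times; none of these details come from the slides.

```python
def diff(original, modified):
    """Slots whose value changed relative to the original page."""
    return {i: new for i, (old, new) in enumerate(zip(original, modified))
            if old != new}


original = [0, 0]            # slot 0 holds x, slot 1 holds y
m0_copy = list(original)     # both machines hold a writeable copy
m1_copy = list(original)

for _ in range(3):           # M0: a1 for (...) x++ r1
    m0_copy[0] += 1
for _ in range(5):           # M1: a2 for (...) y++ r2
    m1_copy[1] += 1

# On release, each machine diffs its copy against the original page.
m0_diff = diff(original, m0_copy)
m1_diff = diff(original, m1_copy)

# M1: a1 ... pulls the write diffs from the last holder of lock 1 (M0).
for slot, value in m0_diff.items():
    m1_copy[slot] = value

x, y = m1_copy               # M1: print x, y
```

The two diffs touch different slots, so they merge without conflict and the page never has to bounce between M0 and M1.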
Example 2 (LRC) • x and y are on the same page • M0: a1 x=1 r1 • M1: a2 y=1 r2 • M2: a1 print x r1 • What does IVY do? • What does TreadMarks do? • M2 asks only the previous holder of lock 1 for write diffs • M2 does not see M1’s modification to y, even though it is on the same page
Discussion • Q: is LRC a win over IVY if each variable is on a separate page? (No) • Q: why is LRC a reasonably intuitive model for programmers? • It is the same as sequential consistency if programmers always acquire and release locks around shared accesses (i.e. follow the rules defined by LRC) • but non-locking code does not work, e.g. v=f(); done=1;
Example 3 (motivate vector timestamps) • M0: a1 x=1 r1 • M1: a1 a2 y=x r2 r1 • M2: a2 print x, y r2 • What’s the “right” answer? • we need to define what LRC guarantees • answer: when you acquire a lock, • you see all writes by the previous holder and all writes the previous holder saw
What does TreadMarks do for example 3? • M2 and M1 need to decide what M2 needs and doesn’t already have • TreadMarks uses “vector timestamps” for this • each machine numbers its releases (i.e. write diffs) • M1 tells M2 what it had seen at release: M0’s writes through #20, and so on: • 0:20 • 1:25 • 2:19 • 3:36 • …… • this is a “vector timestamp” • M2 remembers a vector timestamp of the writes it has seen • M2 compares it with M1’s VT to see which writes it needs from other machines
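The comparison step can be sketched as below. M1's vector timestamp uses the four entries shown on the slide; M2's timestamp and the helper names `missing_intervals` and `merge` are made up for illustration.

```python
def missing_intervals(mine, theirs):
    """For each machine, the release numbers covered by `theirs` but not `mine`."""
    needed = {}
    for machine, upto in theirs.items():
        have = mine.get(machine, 0)
        if upto > have:
            needed[machine] = list(range(have + 1, upto + 1))
    return needed


def merge(mine, theirs):
    """Element-wise maximum: the timestamp after applying the missing diffs."""
    return {m: max(mine.get(m, 0), theirs.get(m, 0))
            for m in mine.keys() | theirs.keys()}


m1_vt = {0: 20, 1: 25, 2: 19, 3: 36}   # entries from the slide
m2_vt = {0: 18, 1: 23, 2: 21, 3: 30}   # hypothetical: what M2 has seen so far

needed = missing_intervals(m2_vt, m1_vt)
m2_vt_after = merge(m2_vt, m1_vt)
```

With these assumed values M2 needs releases 19-20 from M0, 24-25 from M1, and 31-36 from M3, and nothing for its own entry, where it is already ahead of what M1 knew.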
Discussions • VTs order writes to the same variable by different machines: • M0: a1 x=1 r1 a2 y=9 r2 • M1: a1 x=2 r1 • M2: a1 a2 z = x + y r2 r1 • M2 is going to hear “x=1” from M0 and “x=2” from M1. • How does M2 know what to do? • Could the VTs for two values of the same variable not be ordered? • M0: a1 x=1 r1 • M1: a2 x=2 r2 • M2: a1 a2 print x r2 r1
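Both cases can be checked with a standard vector-timestamp comparison. This is a generic sketch, and the concrete timestamps below are assumptions: each release is counted as interval 1 on its machine, and in the first case M1 is assumed to acquire lock 1 after M0 released it.

```python
def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for two VTs."""
    keys = a.keys() | b.keys()
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# First case: both writes to x are made under lock 1, M0 first.
x1_vt = {"M0": 1}               # M0: a1 x=1 r1
x2_vt = {"M0": 1, "M1": 1}      # M1 acquired lock 1 from M0, so it saw M0's release
ordered = compare(x1_vt, x2_vt)

# Second case: the writes use different locks, so neither machine saw the other.
x1_other = {"M0": 1}            # M0: a1 x=1 r1
x2_other = {"M1": 1}            # M1: a2 x=2 r2
unordered = compare(x1_other, x2_other)
```

In the first case x=1 is ordered before x=2, so M2 applies the diffs in that order and ends with x=2. In the second case the timestamps are concurrent: the program has a data race and the value M2 prints is not well defined.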
Programmer rules / system guarantees? • Programmer must lock around all writes to shared variables • this orders writes to the same variable; otherwise the “latest value” is not well defined • to read the latest value, you must lock • if you read without a lock, you are still guaranteed to see the values that contributed to the variables you did lock
Example of when LRC might work too hard • M0: a2 z=99 r2 a1 x=1 r1 • M1: a1 y=x r1 • TreadMarks will send z to M1 because it comes before x=1 in VT order • (assuming x and z are on the same page) • Even if they are on different pages, M1 must invalidate z’s page • But M1 doesn’t use z • How could a system understand that z isn’t needed? • Require locking of all data you read, which allows relaxing the causal part of the LRC model
Q: could TreadMarks work without using VM page protection? • It uses VM to: • detect writes, to avoid making hidden copies (for diffs) if not needed • detect reads to pages, to know whether to fetch a diff • neither is really crucial • so TreadMarks doesn’t depend on VM as much as IVY does • IVY used VM faults to decide what data has to be moved and when • TreadMarks uses acquire()/release() and diffs for that purpose
TreadMarks Implementation • Looks a lot like pthreads • Implicit message passing • Implicit process creation • Only standard Unix System Calls • Message Passing • Memory Management
Eager vs. Lazy RC • Eager RC: sends messages at release of a lock or at barriers; broadcasts messages to all nodes • Lazy RC: sends messages when locks are acquired; a message goes only to the required node
Memory Consistency • Done by creating diffs • Eager RC creates diffs at barriers • Lazy RC creates diffs at the first use of a page
Vector Timestamps [Figure: timeline of three processes with vector timestamps. p1: w(x), rel; p2: acq, w(y), rel; p3: acq, r(x), r(y). Timestamps shown: (0, 0, 0), (1, 0, 0), (1, 1, 0).]
Garbage Collection • Used to merge all diffs and recover memory • Occurs only at barriers • All nodes that have a copy of a page must have all diffs of that page
DSM successful? • clusters of cooperating machines are hugely successful • DSM not so much • main justification is transparency for existing threaded code • that's not interesting for new apps • and transparency makes it hard to get high performance • MapReduce or message-passing or shared storage more common than DSM