TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems
Pete Keleher, Alan Cox, Sandhya Dwarkadas, Willy Zwaenepoel
Agenda • DSM Overview • TreadMarks Overview • Vector Clocks • Multi-writer Protocol (diffs) • TreadMarks Algorithm • Implementation • Limitations
DSM Overview
[Figure: four processors (Proc), each with its own memory (Mem), unified behind one global address space]
• Global address space virtualization of disparate physical memory
• Program using normal thread/locking techniques (no MPI)
DSM Overview
[Figure repeated: processors with local memories]
• Communication overhead is incurred to synchronize memory
• Maximize parallel computation and limit communication to improve performance
TreadMarks Overview
• Minimize communication to improve DSM performance
• Lazy Release Consistency (Vector Clocks)
• Multiple Writers (Lazy Diff Creation)
• Delay communication as long as possible (possibly avoid it entirely)
TreadMarks Overview: Release Consistency
[Figure: timelines for P1 and P2, each performing a write w(x)]
• Release Consistency:
• Shared memory updates must be visible by the time the release is visible
• No need to send updates immediately upon a write
TreadMarks Overview: Lazy Release Consistency
[Figure: timelines for P1 and P2, each performing a write w(x)]
• Lazy Release Consistency:
• Shared memory updates are not made visible until the time of the acquire
• No update is propagated if it is never acquired (see the sketch below)
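To make the two timings concrete, here is a minimal usage sketch written against a Tmk_-style API as in the paper's examples. The exact signatures, and the names shared_min, init, and update_min, are illustrative assumptions, not the paper's code.

```c
/* Usage sketch of eager vs. lazy release consistency; the Tmk_ calls
 * follow TreadMarks' naming convention but are assumptions here. */
extern void  Tmk_lock_acquire(unsigned id);
extern void  Tmk_lock_release(unsigned id);
extern char *Tmk_malloc(unsigned size);

int *shared_min;  /* hypothetical value in the DSM-managed shared heap */

void init(void)
{
    shared_min = (int *)Tmk_malloc(sizeof(int));  /* shared allocation */
}

void update_min(int candidate)
{
    Tmk_lock_acquire(0);
    if (candidate < *shared_min)   /* w(x): write inside the critical */
        *shared_min = candidate;   /* section guarded by lock 0       */
    Tmk_lock_release(0);
    /* Eager RC: updates/invalidations are pushed at this release.
     * Lazy RC (TreadMarks): nothing is sent here; the next process to
     * acquire lock 0 pulls invalidations at its acquire, and the diff
     * only if it actually touches the page. */
}
```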
Vector Clocks
[Figure: timelines for P1, P2, P3]
• Global clock mechanism for identifying the causal ordering of events in distributed systems
• Mattern (1989) and Fidge (1991)
Vector Clocks
[Figure: P1, P2, P3 each starting with vector (0, 0, 0)]
• Each process maintains a vector of counters
• One for each process in the system
Vector Clocks
[Figure: P1 steps to (1, 0, 0) and P3 to (0, 0, 1)]
• A process increments its own counter upon a local event
Vector Clocks
[Figure: P3 sends at (0, 0, 2); P1 merges and steps to (2, 0, 2); P1 later sends at (3, 0, 2); P2 merges and steps to (3, 1, 2)]
• A process increments its own counter and updates all other counters (elementwise maximum) upon receiving a message
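As a concrete reference, a minimal C sketch of these update rules; the NPROCS constant and the helper names are illustrative, not TreadMarks' own.

```c
/* Minimal vector clock sketch (Mattern/Fidge style). */
#define NPROCS 3

typedef struct { unsigned c[NPROCS]; } vclock;

/* Local event (including a message send): bump only our own counter. */
void vc_local_event(vclock *vc, int me)
{
    vc->c[me]++;
}

/* Message receipt: take the elementwise max with the sender's clock,
 * then bump our own counter for the receive event itself. */
void vc_receive(vclock *vc, const vclock *msg, int me)
{
    for (int i = 0; i < NPROCS; i++)
        if (msg->c[i] > vc->c[i])
            vc->c[i] = msg->c[i];
    vc->c[me]++;
}

/* "a happened before b" iff a <= b elementwise and a != b; this is
 * the "smaller timestamp" test the TreadMarks algorithm slides use. */
int vc_precedes(const vclock *a, const vclock *b)
{
    int strictly_less = 0;
    for (int i = 0; i < NPROCS; i++) {
        if (a->c[i] > b->c[i]) return 0;
        if (a->c[i] < b->c[i]) strictly_less = 1;
    }
    return strictly_less;
}
```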
Diff Creation
[Figure: P1 and P2 each retain a copy of the page they are about to modify]
• Retains a copy of the page upon first writing
Diff Creation
[Figure: P1 and P2 compare modified pages against their saved copies]
• Create a diff by comparing the modified page against the original (run-length coded)
Diff Creation
[Figure: P1 and P2 exchange diffs]
• Send the diff to other processes
Lazy Diff Creation
[Figure: P1 and P2 timelines]
• Diffs are created only when a page is invalidated
• Or when the modifications are requested explicitly (access miss on an invalidated page)
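A minimal sketch of the mechanism: the first write to a protected page saves a copy (the paper's "twin"), and the diff is computed lazily by comparing the twin against the modified page. A simple (offset, value) word encoding stands in for the paper's run-length coding here; the page size and names are illustrative.

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE  4096
#define PAGE_WORDS (PAGE_SIZE / sizeof(unsigned))

/* Write-fault handler, first write to the page: save a twin. */
unsigned *make_twin(const unsigned *page)
{
    unsigned *twin = malloc(PAGE_SIZE);
    memcpy(twin, page, PAGE_SIZE);
    return twin;
}

/* Lazy step: run only when the page is invalidated or its
 * modifications are requested (access miss on an invalidated page).
 * Emits (word offset, new value) pairs; 'out' needs 2*PAGE_WORDS. */
size_t make_diff(const unsigned *page, const unsigned *twin, unsigned *out)
{
    size_t n = 0;
    for (size_t i = 0; i < PAGE_WORDS; i++) {
        if (page[i] != twin[i]) {     /* word changed since the twin */
            out[n++] = (unsigned)i;
            out[n++] = page[i];
        }
    }
    return n;                         /* diff length in words */
}
```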
TreadMarks Algorithm
[Figure: P1 at (1, 0, 0) acquiring; P3 at (0, 0, 1)]
• P1 cannot proceed past the acquire until:
• All modifications have been received from processes whose vector timestamps are smaller than P1's
TreadMarks Algorithm
[Figure: P1 sends its vector timestamp (1, 0, 0) to the releaser, which replies with its updated timestamp (1, 0, 1) and invalidations]
• On acquire:
• P1 sends its vector timestamp to the releaser
• The releaser attaches invalidations for all updated counters
• The releaser sends its updated vector timestamp with the invalidations
TreadMarks Algorithm
[Figure: P1 writes x and exchanges diffs with P3 after the invalidation]
• Diffs are generated when:
• Receiving an invalidation (i.e., P1 had also made prior updates to this page)
• The page is accessed (miss)
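A compact sketch of this acquire-time exchange, reusing the vclock type from the vector clock sketch; the interval type, field names, and helpers are illustrative, not TreadMarks' actual code.

```c
#include <stdlib.h>

#define NPROCS 3

typedef struct { unsigned c[NPROCS]; } vclock;

typedef struct interval {
    int      proc;            /* process that created the interval  */
    unsigned vtime;           /* that process's counter at creation */
    int      npages;          /* pages written in the interval,     */
    int      pages[64];       /*   i.e. its write notices           */
    struct interval *next;
} interval;

/* Releaser side: select the intervals the acquirer has not seen,
 * i.e. those newer than the acquirer's vector timestamp entry. */
interval *missing_intervals(const interval *log, const vclock *acq_vc)
{
    interval *reply = NULL;
    for (const interval *it = log; it; it = it->next) {
        if (it->vtime > acq_vc->c[it->proc]) {
            interval *copy = malloc(sizeof *copy);
            *copy = *it;              /* ordering is not preserved   */
            copy->next = reply;       /* in this sketch; invalidation*/
            reply = copy;             /* sets are order-insensitive  */
        }
    }
    return reply;
}

/* Acquirer side: apply the attached write notices by invalidating
 * the pages; a real implementation would mprotect() them PROT_NONE. */
void apply_write_notices(const interval *it, void (*invalidate)(int page))
{
    for (; it; it = it->next)
        for (int i = 0; i < it->npages; i++)
            invalidate(it->pages[i]);
}
```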
TreadMarks Implementation: Data Structures
[Figure: page array and proc array; per-page write notice records carrying a proc_id and a pointer to an interval record; interval records carrying a vector clock counter; a diff pool holding the encoded modifications]
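The figure's structures can be read as roughly the following C declarations; this is a plausible sketch of the layout, with field names and types that are illustrative rather than taken from the TreadMarks sources.

```c
#define NPROCS 4

struct diff;                          /* encoded page modifications     */
struct interval_rec;                  /* forward declaration            */

struct write_notice_rec {
    int page;                         /* page this notice invalidates   */
    int proc_id;                      /* process that wrote the page    */
    struct diff         *diff;        /* created lazily; NULL until then*/
    struct interval_rec *interval;    /* interval this notice belongs to*/
    struct write_notice_rec *next;    /* per-page chain                 */
};

struct interval_rec {
    unsigned vc[NPROCS];              /* vector clock of the interval   */
    struct write_notice_rec *notices; /* write notices in this interval */
    struct interval_rec     *next;    /* per-process interval list      */
};

/* Page array: per-page state, including chains of write notices. */
struct page_entry {
    struct write_notice_rec *notices[NPROCS];
};

/* Proc array: each entry heads a process's list of interval records. */
extern struct interval_rec *proc_array[NPROCS];
extern struct page_entry    page_array[]; /* one entry per shared page  */
```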
TreadMarks Implementation: Locks
• Each lock is statically assigned a manager (round-robin)
• The manager tracks which processor last requested the lock
• Lock acquires are sent to the manager (and forwarded to the last processor to obtain the lock)
• Upon release, the holder sends (for each interval):
• The processor ID and vector timestamp
• Any invalidations that are necessary
TreadMarks Implementation: Barriers
• Centralized barrier manager
• Upon arrival at the barrier, each client:
• Notifies the manager of intervals the manager does not already have
• These are incorporated when the manager itself arrives at the barrier
• When all clients have arrived:
• The manager notifies each client of the intervals it does not already have
• Expensive (see the sketch below)
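A sketch of the centralized barrier under stated assumptions: the vclock type from the earlier sketches, and send/recv helpers that are stand-ins for TreadMarks' actual request/response messages.

```c
#define NPROCS  4
#define MANAGER 0

typedef struct { unsigned c[NPROCS]; } vclock;

extern int    my_id;                  /* this process's id              */
extern vclock my_vc;                  /* local vector timestamp         */
extern vclock managers_vc;            /* manager's vc as of last barrier*/

/* Send every interval the peer lacks, judged against their_vc. */
extern void   send_missing_intervals(int to, const vclock *their_vc);
extern vclock recv_arrival(int from); /* blocks; returns sender's vc    */
extern void   recv_release(void);     /* blocks for the manager's reply */

void barrier(void)
{
    if (my_id == MANAGER) {
        vclock client_vc[NPROCS];
        for (int p = 0; p < NPROCS; p++)   /* gather: arrivals carry    */
            if (p != MANAGER)              /* intervals the manager     */
                client_vc[p] = recv_arrival(p); /* does not yet have    */
        for (int p = 0; p < NPROCS; p++)   /* release: send each client */
            if (p != MANAGER)              /* the intervals it lacks    */
                send_missing_intervals(p, &client_vc[p]);
    } else {
        send_missing_intervals(MANAGER, &managers_vc); /* arrive        */
        recv_release();                                /* wait for ok   */
    }
}
```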
Limitations
• Achieved nearly linear speedup for the TSP, Jacobi, Quicksort, and ILINK algorithms
• Water:
• Each molecule in the simulation is protected by a lock and frequently accessed
• Barriers are also used for synchronization
• Speedup is limited by the algorithm's low computation-to-communication ratio (many fine-grained messages)
Limitations
• TSP:
• Eager Release Consistency performs better than Lazy Release Consistency (Fig. 9)
• Updates occur only on invalidations and access misses (at writes/synchronization points)
• The TSP algorithm reads the stale 'current minimum' value without synchronization, so under LRC it sees older bounds for longer
Limitations • Depends on events (write/synchronization) to trigger consistency operations • More opportunities to read stale data (TSP) • Reduced redundancy increases risk of data loss
Summary
• Improves performance by improving the computation-to-communication ratio
• Delays consistency updates until an acquire (or a subsequent page access) requires them
• Weaker consistency implies a greater likelihood of reading stale data and of data loss
• Procrastination = Performance