Munin, Clouds and Treadmarks Distributed Shared Memory course Taken from a presentation of: Maya Maimon (University of Haifa, Israel).
Introduction • There are two fundamental models for parallel programming: • Shared memory model • A direct extension of the conventional uniprocessor model, wherein each processor is provided with the abstraction that there is but a single memory in the machine. • An update to shared data therefore becomes visible to all the processors in the system. • Distributed memory model • Each processor has a private memory to which no other processor has direct access. • Communication is by explicit message passing.
The Goal • Distributed memory machines are easier to build. • The shared memory programming model is, however, more attractive, since the message passing paradigm requires programmers to explicitly partition data and manage communication. • Using a programming model that supports a global address space, an applications programmer can focus on algorithmic development rather than on managing partitioned data sets and communicating values. • A distributed shared memory (DSM) system provides a shared memory programming model on a distributed memory machine.
Distributed Shared Memory • The system consists of the same hardware as a distributed memory machine, with the addition of a software layer that provides the abstraction of a single shared memory. • In practice, each memory remains physically independent, and all communication takes place through explicit message passing performed by the DSM software layer. • DSM systems combine the best features of the two models. They support the convenient shared memory programming model on distributed memory hardware, which is more scalable and less expensive to build.
Software Distributed Shared Memory [Figure: nodes 1 through n, each with its own memory (Mem), connected by a network, with a distributed shared memory layer spanning all nodes.] • Page based, permissions, … • Single system image, shared virtual address space, …
The challenge • Performance • What is the source of the overhead in DSM? • The large amount of communication that is required to maintain consistency. • In other words, the maintenance of the shared memory abstraction.
False Sharing • False sharing is a situation in which two or more processes access different variables within a page and at least one of the accesses is a write. • If only one process is allowed to write to a page at a time, false sharing leads to unnecessary communication, called the “ping-pong” effect.
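Understanding False Sharing [Figure: process A repeatedly performs w(x) while process B repeatedly performs r(y). Top: x and y lie on the same page p, which is transferred between A and B on every access. Bottom: x lies on page p1 and y on page p2, so A and B do not interfere.]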
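The ping-pong effect can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not from the slides: a single copy of each page migrates to whichever process touches it next (reads included), and the function name `count_page_transfers` is hypothetical.

```python
def count_page_transfers(accesses, page_of):
    """Count page transfers under a toy single-copy, page-granularity protocol.

    accesses: list of (process, variable) pairs in execution order.
    page_of:  dict mapping each variable to the page that holds it.
    Assumption: the page migrates to whichever process accesses it next.
    """
    holder = {}      # page -> process that currently holds the page
    transfers = 0
    for process, variable in accesses:
        page = page_of[variable]
        if page in holder and holder[page] != process:
            transfers += 1          # the page has to move: one "ping" or "pong"
        holder[page] = process
    return transfers


# A writes x and B reads y, interleaved three times, as in the figure.
accesses = [("A", "x"), ("B", "y")] * 3

same_page = count_page_transfers(accesses, {"x": "p", "y": "p"})
separate_pages = count_page_transfers(accesses, {"x": "p1", "y": "p2"})
print(same_page, separate_pages)    # 5 0
```

A and B never touch the same variable, yet placing x and y on one page costs five transfers in this model, while placing them on separate pages costs none.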
Techniques for reducing the amount of communication • Software release consistency • Multiple consistency protocols • Write-shared protocol • An update-with-timeout mechanism • These techniques have been incorporated in the Munin DSM system.
The need for Relaxed Consistency Schemes • In any implementation of Sequential Consistency there should be some global control mechanism. • If a processor writes to shared data, all the other processors must know about it before it can write again. • Either writes or reads require memory synchronization operations. • In most implementations, writes require some kind of memory synchronization. [Figure: processes A and B, with the writes w(x), w(y), w(x) each requiring synchronization.]
1) Relaxed Consistency Schemes • The Relaxed Consistency Schemes are designed to allow fewer memory synchronization operations. • Writes can be delayed, aggregated, or eliminated. • This results in less communication and therefore higher performance. [Figure: processes A and B, with the writes w(x), w(y), w(x) synchronized only at a barrier.]
The idea • Release consistency exploits the fact that programmers use synchronization to separate accesses to shared variables by different threads. • The system then only needs to guarantee that memory is consistent at (select) synchronization points.
False Sharing in Relaxed Consistency Schemes • False sharing has much smaller overhead in relaxed consistency models. • The overhead induced by false sharing can be further reduced by the use of multiple-writer protocols. • Multiple-writer protocols allow multiple processes to simultaneously modify their local copy of a shared page. • The modifications are merged at certain points of execution.
Release Consistency [Gharachorloo et al. 1990, DASH]* • Introduces a special type of variable, called synchronization variables or locks. • Locks cannot be read or written to. They can be acquired and released. For a lock L those operations are denoted by acquire(L) and release(L) respectively. • We will say that a process that acquired a lock L but has not released it holds the lock L. • No more than one process can hold a lock L. One process holds the lock while others wait. (*) K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J.L. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15-26. IEEE, May 1990.
Using Release and Acquire to define execution-flow synchronization primitives • Let a set of processes release tokens by reaching the release operation in their program order. • Let another set (possibly with overlap) acquire those tokens by performing an acquire operation, where the acquire can proceed only when all tokens have already arrived from all releasing processes. • 2-way synchronization = lock-unlock, 1 release, 1 acquire • n-way synchronization = barrier, n releases, n acquires • PARC’s synch = k-way synchronization
Formal Definition of Release Consistency • Conditions for Release Consistency: • (A) Before a read or write access is allowed to perform with respect to any other processor, all previous acquire accesses must be performed, and • (B) Before a release access is allowed to perform with respect to any other processor, all previous read or write accesses must be performed, and • (C) acquire and release accesses are sequentially consistent.
Understanding RC [Figure: timeline t. Process A performs w(x)1 and then rel(L1). Process B performs a series of reads of x, marked r(x)0, r(x)?, r(x)1, and then acq(L1) followed by r(x)1.] • r(x)?: it is undefined what value is read here. It can be any value written by some process; here it can be 0 or 1. • After rel(L1): from this point all processes must see the value 1 in x. 1 must be read according to rule (B), but the programmer cannot be sure of it. • After acq(L1): the programmer is sure that the read will return 1, according to rules (C) and (A).
Acquire and Release • release serves as a memory-synch operation, or a flush of the local modifications to the attention of all other processes. • According to the definition, the acquire and release operations are used not only for synchronization of execution, but also for synchronization of memory, i.e. for propagation of writes from/to other processes. • This allows the two expensive kinds of synchronization to overlap. • It also turns out to be semantically simpler for the programmer.
Acquire and Release (cont.) • A release followed by an acquire of the same lock guarantees to the programmer that all writes previous to the release will be seen by all reads following the acquire. • The idea is to let the programmer decide which blocks of operations need to be synchronized, and put them between a matching pair of acquire-release operations. • In the absence of release/acquire pairs, there is no assurance that modifications will ever propagate between processes.
Implementing RC • The first implementation was proposed by the inventors of RC and is called DASH. • DASH combats memory latency by pipelining writes to shared memory. • The processor is stalled only when executing a release, at which time it must wait for all its previous writes to perform. [Figure: P1 performs w(x), w(y), w(z) and rel(L). x, y and z are sent to P2 as separate messages, each answered by an ack; P1 is stalled at rel(L) until the acks arrive.]
Implementing RC (cont.) • It is important to reduce the number of message exchanges, because every message has additional fixed overhead, independent of its size. • Another implementation of RC, called Munin, reduces the number of messages by buffering writes until a release. [Figure: P1 performs w(x), w(y), w(z) and rel(L). x, y and z are sent to P2 in a single message at the release, answered by a single Ack(x,y,z).]
Eager Release Consistency [Carter et al. 1991, Munin]* • Implementation of Release Consistency (not a new memory model). • Postpone sending modifications to the next release. • Upon a release, send all accumulated modifications to all caching processes. • No memory-synchronization operations on an acquire. • Upon a miss (no local caching of the variable), get the latest modification from the latest modifier (this needs some more control to store its identity, no big deal). (*) John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and Performance of MUNIN. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 152-164, October 1991.
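The buffering behaviour described above can be sketched in a few lines. This is a toy model, not Munin's implementation: the class `EagerNode`, its fields and the assumption that every peer caches all shared variables are hypothetical, and acknowledgements and misses are left out.

```python
class EagerNode:
    """One process in a toy eager-release-consistency system (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.memory = {}        # local copy of the shared variables
        self.pending = {}       # writes buffered since the last release
        self.peers = []         # other nodes assumed to cache the same data
        self.messages_sent = 0

    def write(self, var, value):
        self.memory[var] = value
        self.pending[var] = value   # buffered locally, nothing is sent yet

    def read(self, var):
        return self.memory.get(var, 0)

    def release(self):
        # One message per peer carries all buffered writes together.
        if self.pending:
            for peer in self.peers:
                peer.memory.update(self.pending)
                self.messages_sent += 1
        self.pending = {}


p1, p2 = EagerNode("P1"), EagerNode("P2")
p1.peers = [p2]

p1.write("x", 1)
p1.write("y", 1)
p1.write("z", 1)
before = p2.read("x")   # 0: the writes are still buffered at P1
p1.release()
after = p2.read("x")    # 1: the release pushed x, y and z in one message
print(before, after, p1.messages_sent)
```

Three writes cost a single update message to P2, where a write-through scheme in the style of the DASH slide would send one message per write.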
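Understanding ERC [Figure: timeline t for processes A, B and C. B performs w(x)1, w(y)1 and rel(L1), and the changes to x and y are sent to the other processes. C performs acq(L2), w(z)1 and rel(L2), and the change to z is sent to the other processes. A performs acq(L1). Each process applies incoming changes when they arrive; reads of x, y and z return 0 before the changes are applied and 1 afterwards.] • A release operation does not complete (is not performed) until the acknowledgements from all the processes are received.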
Release vs. Sequential Consistency • Ordinary reads and writes can be buffered or pipelined between synchronization points. • Ordinary reads and writes following a release do not have to be delayed for the release to complete. • An acquire access does not have to delay for previous ordinary reads and writes to complete.
2) Write-Shared Protocol (Supporting Multiple Writers in ERC) • Modifications are detected by twinning. • When writing to an unmodified page, its twin is created. • When releasing, the final copy of a page is compared to its twin. • The resulting difference is called a diff. • Twinning and diffing not only allow multiple writers, but also reduce communication. • Sending a diff is cheaper than sending an entire page.
Twinning and Diffing [Figure: a write to page P creates a twin and a writable working copy; at release, the working copy is compared with the twin to produce the diff.]
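Twinning and diffing can be sketched as follows. This is a minimal illustration, assuming a page is a list of words and a diff is a mapping from word index to new value; the helper names `make_twin`, `make_diff` and `apply_diff` are hypothetical and do not come from Munin.

```python
def make_twin(page):
    """Copy the page before the first write so changes can be detected later."""
    return list(page)


def make_diff(twin, page):
    """Return {index: new_value} for every word that differs from the twin."""
    return {i: new for i, (old, new) in enumerate(zip(twin, page)) if old != new}


def apply_diff(page, diff):
    """Merge a diff into a copy of the page, in place."""
    for index, value in diff.items():
        page[index] = value


original = [0] * 8

# Two writers modify different words of the same page concurrently.
copy_a, copy_b = list(original), list(original)
twin_a, twin_b = make_twin(copy_a), make_twin(copy_b)
copy_a[1] = 7
copy_b[6] = 9

# At release, each writer sends only its diff, not the whole page.
diff_a = make_diff(twin_a, copy_a)
diff_b = make_diff(twin_b, copy_b)

merged = list(original)
apply_diff(merged, diff_a)
apply_diff(merged, diff_b)
print(diff_a, diff_b, merged)
```

Each diff holds one word instead of eight, and because the two writers touched different words the diffs merge without conflict, which is what makes false sharing cheap under a multiple-writer protocol.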
3) Multiple Consistency Protocol • The use of a single protocol for all shared data leads to a situation where some programs can be handled effectively by a given DSM system, while others cannot. • Types of access patterns that should be supported: • Conventional shared • Read only • Migratory • Write shared • Synchronization
Update-Based vs. Invalidate-Based • In update-based protocols the modifications themselves are sent, whereas in invalidate-based protocols only notifications of modifications are sent. [Figure: P1 performs w(x)1, w(y)2 and rel(L). Update-based: P1 sends x:=1 and y:=2 to P2. Invalidate-based: P1 sends only “I changed x and y” to P2.]
Update-Based vs. Invalidate-Based (cont.) • Invalidations are smaller than updates. • The bigger the coherency unit, the bigger the difference. • In invalidation-based schemes there can be significant overhead due to access misses. [Figure: P1 performs w(x)1, w(y)2 and rel(L), sending inv(x) and inv(y) to P2. After acq(L), P2 performs r(y) and r(x); each read misses, so P2 issues get(x) and get(y) and receives x=1 and y=2.]
4) Update with Timeout Mechanism • The problem: updates to a data item are propagated to all of its replicas, including those that are no longer being used. • In DSM systems, this problem becomes even more severe, because the main memories of the nodes in which the replicas are kept are very large and it takes a long time before a page gets replaced, if at all. • Solution: invalidate the old copies.
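One way to picture the timeout idea is a replica that drops itself when an update arrives long after its last local access. The sketch below is an assumption about how such a mechanism could look, not a description of Munin's code: the class `Replica`, the explicit logical clock `now` and the return convention are all hypothetical.

```python
class Replica:
    """Toy replica that invalidates itself if it has not been used recently."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.value = None
        self.valid = False
        self.last_access = 0

    def install(self, value, now):
        self.value, self.valid, self.last_access = value, True, now

    def read(self, now):
        self.last_access = now
        return self.value if self.valid else None   # None stands in for a miss

    def on_update(self, value, now):
        """Apply an incoming update, or drop the copy if it looks unused.

        Returns True if the update was applied, False if the replica is
        (now) invalid and the sender can stop updating it.
        """
        if not self.valid:
            return False
        if now - self.last_access > self.timeout:
            self.valid = False      # stale copy: invalidate instead of updating
            self.value = None
            return False
        self.value = value
        return True


replica = Replica(timeout=10)
replica.install(1, now=0)
applied_early = replica.on_update(2, now=5)    # used recently: update applied
seen = replica.read(now=6)
applied_late = replica.on_update(3, now=30)    # idle for 24 ticks: copy dropped
print(applied_early, seen, applied_late, replica.valid)
```

After the late update the replica is invalid, so the writer no longer needs to send it further updates; a later read would be treated as an ordinary miss.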
Reducing the Number of Messages • In DASH and Munin systems all processes (or all processes that cache the page) see the updates of a process. • Consider the following example of execution in Munin: [Figure: lock L is passed along a chain. P1: w(x), rel(L). P2: acq(L), w(x), rel(L). P3: acq(L), w(x), rel(L). P4: acq(L), r(x). Each release sends updates to all the other processes.] • There are many unneeded messages. In DASH even more. • This problem exists in invalidation-based schemes as well.
Reducing the Number of Messages (cont.) • Logically, however, it suffices to update each processor’s copy only when it acquires L. [Figure: the same chain. P1: w(x), rel(L). P2: acq(L), w(x), rel(L). P3: acq(L), w(x), rel(L). P4: acq(L), r(x). Updates travel only from each releaser to the next acquirer.] • Therefore, a new algorithm for implementing RC, called Lazy Release Consistency (LRC), was proposed. • LRC is aimed at reducing both the number of messages and the amount of data exchanged.
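The difference between the two schemes on the chain P1 to P4 can be made concrete by counting update messages. This is a rough sketch under stated assumptions: all four processes cache the page, only update-carrying messages are counted (no acknowledgements, no lock traffic), and the function name `count_update_messages` is hypothetical.

```python
def count_update_messages(events, cachers):
    """Count update messages for an eager and a lazy scheme on one lock.

    events:  list of (process, op) pairs, op being "rel" or "acq".
    cachers: set of processes assumed to cache the modified page.
    Eager: every release sends the updates to every other cacher.
    Lazy:  every acquire fetches the updates from the last releaser only.
    """
    eager = lazy = 0
    last_releaser = None
    for process, op in events:
        if op == "rel":
            eager += len(cachers - {process})
            last_releaser = process
        elif op == "acq" and last_releaser not in (None, process):
            lazy += 1
    return eager, lazy


# The execution from the slide: L is handed from P1 to P2 to P3 to P4.
events = [
    ("P1", "rel"),
    ("P2", "acq"), ("P2", "rel"),
    ("P3", "acq"), ("P3", "rel"),
    ("P4", "acq"),
]
eager, lazy = count_update_messages(events, {"P1", "P2", "P3", "P4"})
print(eager, lazy)    # 9 3
```

Under these assumptions the eager scheme sends nine update messages, most of them to processes that never read x before it is overwritten again, while the lazy scheme sends three.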
Lazy Release Consistency [Keleher et al., Treadmarks 1992]* • The idea is to postpone sending of modifications until a remote processor actually needs them. • Invalidate-based protocol • The BIG advantage: no need to get modifications that are irrelevant, because they are already masked by newer ones. • NOTE: implements a slightly more relaxed memory model than RC! (*) P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. Treadmarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the 1994 Winter Usenix Conference, pages 115-132, Jan. 1994.
Lazy release consistency • Enforces consistency at the time of an acquire. • Causes fewer messages to be sent. At the time of a lock release, Munin sends messages to all processors that cache data modified by the releasing processor. In contrast, in lazy release consistency, consistency messages only travel between the last releaser and the new acquirer. • But: • It is somewhat more complicated. • After a release, Munin can forget about all modifications the releasing processor made prior to the release. Lazy release consistency cannot, because a processor that acquires the lock later may still need them.
Multiple-writer protocol in Treadmarks • The problem with a single-writer protocol: • False sharing can cause single-writer protocols to perform even worse. • Multiple-writer protocols allow multiple writers to simultaneously modify the same page, with consistency traffic deferred until a later time, in particular until synchronization occurs. • This is implemented using diffs.
Multiple-writer protocol in Treadmarks (cont.) • The primary benefit of using diffs is that they can be used to implement multiple-writer protocols, • but they can also significantly reduce overall bandwidth requirements because diffs are typically much smaller than a page.
Formal Definition of Lazy Release Consistency • Conditions for Lazy Release Consistency: • Before a read or write access is allowed to perform with respect to any other process, all previous acquire accesses must be performed with respect to that other process, and • Before a release access is allowed to perform with respect to any other process, all previous read or write accesses must be performed with respect to that other process, and • acquire and release accesses are sequentially consistent.
Understanding the LRC Memory Model [Figure: timeline t for processes A, B and C. One process performs w(x)1 and rel(L1); another performs acq(L1) and a third performs acq(L2). Reads of x are marked r(x)0, r(x)1 or r(x)? (undefined).] • It is guaranteed that the acquirer of the same lock sees the modifications that precede the release in program order.
CLOUDS • Clouds is a distributed operating system that integrates a set of loosely coupled machines into a conceptually centralized system. • The system is composed of: • Compute servers • Data servers • User workstations • Compute server and data server are logical designations
The Object Oriented Model • Structures a software system as a set of objects. • An object consists of data and operations. • Objects respond to messages. [Figure: a message is sent to object 1; object 1 responds by executing a method. The method reads or updates data and may send messages to other objects. In the end, the method sends a reply to the message sender.]
The Object Thread Model • Clouds has a similar structure, the object-thread model, implemented at the operating system level. • Clouds objects are large-grained encapsulations of code and data which model distinct, persistent virtual address spaces. • The contents of an object are an instance of a class; a class is a compiled program module. • Unlike virtual address spaces in conventional operating systems, the contents of a Clouds object are long-lived. That is, a Clouds object exists forever and survives system crashes and shutdowns (like a file) until explicitly deleted.
Execution • A thread is a logical path of execution that executes code in objects, traversing objects as it executes. • Unlike processes in a conventional operating system, a Clouds thread is not bound to a single address space, since each object represents a distinct address space.
Execution example • A thread is created by a user or a program. • The thread executes an entry point in an object. • It has access to the persistent data in the object. • It may invoke operations in other objects. • The thread temporarily leaves the calling object and enters the called object. • It then returns to the calling object.
Clouds… • Clouds uses objects as the abstraction of storage and threads as the abstraction of computation. • Clouds is a general purpose operating system. It is intended to support all types of languages and applications, distributed or not. All applications can view the system as a monolith, but distributed applications may choose to view the system as loosely coupled compute and data servers. • Each compute facility in Clouds has access to all resources in the system.
Distributed Computations • A novel feature of Clouds that makes distributed programming simpler is the automatic migration of objects via Clouds object invocation and distributed shared memory (DSM) mechanisms.
Synchronous invocation [Figure: compute server A and a data server holding O1.] • Compute server A creates thread 1 to invoke O1. • The data and code of O1 are demand-paged to A, and thread 1 executes in O1. • While executing in O1, thread 1 invokes O2. • If the system chooses to run the invocation on A, the pages of O2, which are on another node, are demand-paged to node A. • Alternatively, another node (B) does the computation: the thread invokes B to perform the operation on O2 (the pages are demand-paged from the data server to B), and then the results are returned to A.
Synchronous invocation • This invocation mechanism is called synchronous, because execution in the current environment of the invoking thread waits until the invoked operation completes and results are returned.
Asynchronous invocation • When thread t1 executing in object 1 asynchronously invokes an operation of object 2, a new thread t2 is created which executes the invoked operation. • Asynchronous invocation returns a variable p which can be used as a placeholder for a return value of any type. • Thread t1 can continue with execution in object 1. • At a later time, thread t1 can wait for the completion of the operation being executed by t2 by executing a claim call on p. When claim p returns, t2 has completed and the results returned by the invoked operation are available to t1 through p.
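The placeholder p and the claim call resemble what many languages call a future. The following is an analogy in Python's standard `concurrent.futures` module, not Clouds code: `operation_on_object2` is a hypothetical stand-in for the invoked operation, and both threads run in one ordinary process rather than in Clouds objects.

```python
from concurrent.futures import ThreadPoolExecutor


def operation_on_object2(n):
    """Stand-in for the operation that t2 executes in object 2."""
    return sum(range(n))


with ThreadPoolExecutor(max_workers=1) as pool:
    # Asynchronous invocation: a worker thread (t2) starts, and the caller
    # immediately gets back a placeholder p for the eventual result.
    p = pool.submit(operation_on_object2, 10)

    # t1 continues with its own work in the meantime.
    local_work = [i * i for i in range(3)]

    # "claim p": block until t2 has completed, then read the result through p.
    result = p.result()

print(local_work, result)
```

Calling `p.result()` a second time returns the same value without blocking, which matches the slide's description of the result remaining available to t1 through p.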
More details… • The differences from RPC • Object migration • In Clouds, a DSM coherence protocol is used to ensure that the data in an object is seen by concurrent threads in a consistent fashion even if they are executing on different compute servers. • Since several threads can simultaneously execute in an object, it is necessary to coordinate access to the object data. This is handled by Clouds object programmers using Clouds system synchronization mechanisms such as semaphores and locks. Through these mechanisms, threads may synchronize their actions regardless of where they execute.
Advantages • An application programmer does not have to be aware of distribution. • A distributed program can be written as a centralized program where processes communicate and synchronize using shared memory. • The degree of distribution or concurrency does not have to be decided at the time the program is written. • Clouds treats computation and storage orthogonally. • Clouds uses a single-level persistent store for all objects.