CS4231 Parallel and Distributed Algorithms AY 2006/2007 Semester 2. Lecture 11 Instructor: Haifeng YU. Today’s Roadmap. Back to parallel systems Some simplified exploration on concurrency control in database systems Every database is a parallel system
CS4231 Parallel and Distributed Algorithms AY 2006/2007 Semester 2 Lecture 11 Instructor: Haifeng YU
Today’s Roadmap
• Back to parallel systems
• Some simplified exploration of concurrency control in database systems
• Every database is a parallel system
• http://research.microsoft.com/~philbe/ccontrol/
• Define “sequential consistency” in databases: Serializability
• Two-phase locking protocol to ensure serializability
• Define “linearizability” in databases: External consistency
• Two-phase locking ensures external consistency as well
CS4231 Parallel and Distributed Algorithms AY2006/2007 Semester 2
Database is Just an Abstract Data Type
• Abstract data type: A piece of data with allowed operations on the data
• Integer X, read(), write()
• Stack, push(), pop()
• By definition, a database is a shared abstract data type
• Accessed by multiple users
• Processes may perform various operations (called transactions) on the database
• Database consistency specifies what behavior is allowed when it is accessed by multiple processes
Transactions
• Operations are called transactions in the database context
• There can be infinitely many kinds of transactions (a database is more flexible than, for example, a stack!)
• Each transaction may contain
• StartTransaction();
• CommitTransaction();
• AbortTransaction();
• Read(x);
• Write(y, value);
• The term operation in the textbook refers to Read() and Write(). To avoid confusion, we will call them primitive operations.
An Example Transaction
    StartTransaction();
    seatBooked = false;
    read(number of available seats on a flight);
    if (number > 0) {
        number--;
        write back number;
        seatBooked = true;
    }
    CommitTransaction();
The Scheduler and Concurrency Control
[Figure: Transaction1, Transaction2, and Transaction3 send Start/Commit/Abort/Read/Write to the Scheduler; the Scheduler sends Read/Write to the Database]
• The job of the scheduler is concurrency control (i.e., ensuring the consistency of the database when it is accessed by multiple processes)
The Scheduler and Concurrency Control
[Figure: transactions send Start/Commit/Abort/Read/Write to the Scheduler; the Scheduler sends Read/Write to the Database]
• The scheduler itself is multi-threaded and may submit reads/writes to the database in parallel.
• We assume that the database ensures sequential consistency for these reads/writes.
Carry Over Definitions from Lecture 3
• A history H is a sequence of invocations and responses of transactions ordered by wall clock time
• Sequential history
• Legal sequential history
• Equivalency between two histories
• Process order
• A history H is sequentially consistent if it is equivalent to some legal sequential history S that preserves process order
Serializability and Sequential Consistency
• Most databases use serializability as the definition of consistency: a customized version of sequential consistency specially designed for databases
• Same as sequential consistency except for the following caveats
• Caveat 1:
• When defining serializability, we assume that all transactions are from different processes (no process issues two transactions)
• What it means: Process order is empty
• Why reasonable: In DB applications, this is usually the case
• Why helpful: Simplifies the design of the scheduler and gives it more flexibility to improve performance
• Corner case: If a user issues two transactions sequentially to the database, the second transaction may not see the effects of the first.
• This does not violate serializability, but most implementations of the scheduler will not have such behavior
Serializability and Sequential Consistency
• Caveat 2:
• In sequential consistency, each operation is executed by a single process (each operation is sequential)
• Transactions are complex enough that we should allow parallel reads/writes in a transaction (as in the book).
• Each transaction is itself a parallel system!
• But for this lecture we will assume that each transaction is sequential (this makes no significant difference to the results)
• Read the book if you are interested in the extension to parallel transactions
Serializability and Sequential Consistency
• Caveat 3: Definition of equivalency
• Two histories are equivalent if they have the same set of events
• Same events imply all responses are the same
• For transactions, responses include all the values written into the database in the transaction and all the values output to the user
• Transactions may be so complex that we cannot easily make the judgment. Consider the following transaction:
    UpdateX() {
        StartTransaction();
        tmp = Read(X);
        tmp = (4*tmp^2 + 5*tmp + 1);
        Write(X, tmp);
        CommitTransaction();
    }
[Figure: an example history of UpdateX transactions, shown on the right of the original slide]
• Initially x = 1: A legal sequential history will have a final x value of 451. The history on the right is not sequentially consistent.
• Initially x = -0.5: A legal sequential history will have a final x value of -0.5 (-0.5 is the root of the equation tmp = 4tmp^2 + 5tmp + 1). The history on the right is sequentially consistent.
• Whether it is sequentially consistent depends on the value of x (and on insight into the code!)
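The arithmetic on this slide can be checked with a short sketch. The figure with the interleaved history is not available in the text, so the sketch assumes two UpdateX transactions and an interleaving in which both read x before either writes; `update_x`, `serial`, and `both_read_first` are hypothetical names introduced only for illustration.

```python
def update_x(x):
    """Value that UpdateX writes back after reading x."""
    return 4 * x**2 + 5 * x + 1


def serial(x0):
    """Two UpdateX transactions executed one after the other."""
    return update_x(update_x(x0))


def both_read_first(x0):
    """Assumed interleaving: both transactions read x0 before either writes.

    Both then write the same value, so the final x is update_x(x0).
    """
    first_write = update_x(x0)
    second_write = update_x(x0)
    assert first_write == second_write
    return second_write


for x0 in (1, -0.5):
    # Compare the final x of the serial run with the interleaved run.
    print(x0, serial(x0), both_read_first(x0))
```

Under these assumptions the two runs disagree for x = 1 and agree for x = -0.5, which is why a scheduler that only looks at reads and writes cannot tell the two cases apart.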
Serializability and Sequential Consistency
• Caveat 3: Definition of equivalency
• Schedulers are not as smart as we are and cannot figure that out
• So we are going to be more pessimistic and define conflict equivalency
• Two primitive operations are conflicting if:
• They are both writes and they write the same data item, or
• One is a read and the other is a write, and they read/write the same data item
• Two histories H and H’ are conflict equivalent iff
• They contain the same set of transactions
• For any two conflicting primitive operations p1 and p2: p1 is before p2 in H ⇔ p1 is before p2 in H’
• Conflict equivalency ⇒ equivalency (assuming transactions are deterministic) (why?)
• The reverse is not true (by the earlier example)
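The definition of conflict equivalency can be sketched as a small check. The encoding of a history as a list of `(transaction, op, item)` tuples and the function names are assumptions made for illustration. The sketch also assumes that only operations from different transactions count as conflicting, which the slide does not state explicitly.

```python
from itertools import combinations


def conflicts(p, q):
    """Same data item, at least one write, different transactions (assumed)."""
    (t1, op1, item1), (t2, op2, item2) = p, q
    return t1 != t2 and item1 == item2 and "W" in (op1, op2)


def conflict_order(history):
    """Ordered pairs (p, q) of conflicting operations with p before q."""
    return {(p, q) for p, q in combinations(history, 2) if conflicts(p, q)}


def conflict_equivalent(h1, h2):
    """Same operations, and every conflicting pair is ordered the same way."""
    return sorted(h1) == sorted(h2) and conflict_order(h1) == conflict_order(h2)


h = [("T1", "R", "x"), ("T2", "W", "x"), ("T1", "W", "y")]
# Swapping the non-conflicting W(x) by T2 and W(y) by T1 keeps equivalence.
h_swapped = [("T1", "R", "x"), ("T1", "W", "y"), ("T2", "W", "x")]
# Moving W(x) by T2 ahead of R(x) by T1 reverses a conflicting pair.
h_reversed = [("T2", "W", "x"), ("T1", "R", "x"), ("T1", "W", "y")]

print(conflict_equivalent(h, h_swapped), conflict_equivalent(h, h_reversed))
```

The check never looks at the values read or written, which is the pessimism the slide describes.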
Serializability
• A history H is serializable if it is conflict equivalent to some legal sequential history S
• (For comparison: A history H is sequentially consistent if it is equivalent to some legal sequential history S that preserves process order.)
• Different from linearizability: Serializability does not need to preserve the operation partial order.
• A later transaction may not see the effects of an earlier transaction.
• Possible in most commercial databases.
• But the chance is small due to the way the scheduler is actually implemented. (You actually need to spend some effort to increase the chance.)
Serialization Graph
• A serialization graph SG(H) of a history H is a directed graph where:
• Each transaction is a vertex in the graph
• A directed edge from W to V exists iff W has a primitive operation p1 and V has a primitive operation p2 where p1 is before p2 in H and p1 and p2 conflict
• Example history:
• R(x) (by T1) R(x) (by T2) W(x) (by T1) W(y) (by T2) W(y) (by T1) R(x) (by T3) W(x) (by T3)
[Figure: SG(H) with vertices T1, T2, T3]
• SG(H) may or may not be transitive
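As a minimal sketch of the construction, the code below builds the edge set of SG(H) for the example history on this slide. The tuple encoding `(transaction, op, item)` and the name `serialization_graph` are assumptions for illustration, and edges are only drawn between different transactions.

```python
def serialization_graph(history):
    """Return the set of edges (W, V) of SG(H).

    An edge W -> V is added when an operation of W comes before a
    conflicting operation of V in the history.
    """
    edges = set()
    for i, (t1, op1, item1) in enumerate(history):
        for t2, op2, item2 in history[i + 1:]:
            if t1 != t2 and item1 == item2 and "W" in (op1, op2):
                edges.add((t1, t2))
    return edges


# Example history from the slide, in order.
H = [
    ("T1", "R", "x"),
    ("T2", "R", "x"),
    ("T1", "W", "x"),
    ("T2", "W", "y"),
    ("T1", "W", "y"),
    ("T3", "R", "x"),
    ("T3", "W", "x"),
]

print(sorted(serialization_graph(H)))
```

The nested loop compares every pair of operations, so the graph is built in time quadratic in the length of the history.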
Serializability Theorem
• Theorem: A history H is serializable iff SG(H) is acyclic.
• If SG(H) is acyclic, then H is serializable:
• Without loss of generality, let T1, T2, … be a topological ordering of the vertices in SG(H). Let S be the sequential history obtained by executing T1, T2, … sequentially. By definition, S is a legal sequential history. We need to show that H is conflict equivalent to S.
• Proof by contradiction. Assume H is not conflict equivalent to S. Then there exist transactions W (containing primitive operation p1) and V (containing primitive operation p2) where p1 and p2 conflict and are ordered differently in H and S. Without loss of generality, suppose p1 is before p2 in H. Then there must be an edge from W to V in SG(H), so W precedes V in the topological ordering and p1 is before p2 in S as well. Contradiction.
Serializability Theorem
• Theorem: A history H is serializable iff SG(H) is acyclic.
• If H is serializable, then SG(H) is acyclic:
• Proof by contradiction. Assume that SG(H) has a cycle T1 → T2 → … → Tk → T1. History H is conflict equivalent to some sequential history S. Because T1 has an edge to T2 in SG(H), T1 has an operation p1 and T2 has an operation p2 where p1 and p2 conflict and p1 is before p2 in H. Since S is conflict equivalent to H, p1 must be before p2 in S as well. Since S is a serial history, T1 must be before T2 in S.
• By the same argument, T2 is before T3 in S, T3 is before T4 in S, …, and Tk is before T1 in S. This is impossible because S is a serial history: T1 would have to come before itself.
Serialization Graph and Theorem
• The serialization graph gives us a systematic way to determine whether a history is serializable
• The determination can always be done in a polynomial number of steps
• But for sequential consistency:
• We did not have a systematic way
• In some cases, we have to enumerate all serial histories to compare, which takes an exponential number of steps
• Why did we not discuss this before for sequential consistency?
• Can you derive a similar theorem for sequential consistency, so that we can always make the determination in a polynomial number of steps?
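The polynomial-time determination amounts to a cycle check on the serialization graph. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the function name `serial_order` is hypothetical, and the first edge set is the one the example history on the Serialization Graph slide appears to yield when worked out by hand, so treat it as an assumption.

```python
from graphlib import CycleError, TopologicalSorter


def serial_order(transactions, edges):
    """Return a serial order consistent with the edges, or None if cyclic."""
    sorter = TopologicalSorter({t: set() for t in transactions})
    for src, dst in edges:
        # dst depends on src, so src is placed earlier in the order.
        sorter.add(dst, src)
    try:
        return list(sorter.static_order())
    except CycleError:
        return None


acyclic = {("T2", "T1"), ("T1", "T3"), ("T2", "T3")}
cyclic = {("T1", "T2"), ("T2", "T1")}

print(serial_order(["T1", "T2", "T3"], acyclic))
print(serial_order(["T1", "T2"], cyclic))
```

When an order is returned, executing the transactions in that order gives the sequential history S used in the first half of the proof of the theorem.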
Ensuring Serializability: All About Performance
• The scheduler can protect the entire database using a single critical section
• Essentially produces a sequential history
• Not efficient: readers (i.e., query transactions) should be able to access the database concurrently
• The scheduler can protect the entire database using a Reader/Writer lock (cf. the Reader/Writer problem in Lecture 2)
• Query transactions obtain the reader lock
• Update transactions obtain the writer lock
• But databases are large and each transaction only touches a small portion
Ensuring Serializability: All About Performance
• Partition the database and use separate reader/writer locks for each partition
• In the extreme, each partition is a data item
Locking individual data items for a transaction
    AcquireReaderLock(x);
    AcquireWriterLock(y);
    Read(x);
    do some computation;
    Write(y, value);
    ReleaseReaderLock(x);
    ReleaseWriterLock(y);
Ensuring Serializability: All About Performance
• But the performance is still not very good
• We may overestimate the set of data items that a transaction needs to access
• We hold the locks for too long (imagine that the computation is solving some time-consuming problem)
Locking individual data items for a transaction
    AcquireReaderLock(x);
    AcquireWriterLock(y);
    Read(x);
    do some computation;
    Write(y, value);
    ReleaseReaderLock(x);
    ReleaseWriterLock(y);
Ensuring Serializability: All About Performance
• Lock the data items only when we use them
• This won’t work (even intuitively)
    AcquireReaderLock(x);
    Read(x);
    ReleaseReaderLock(x);
    do some computation;
    AcquireWriterLock(y);
    Write(y, value);
    ReleaseWriterLock(y);
Ensuring Serializability: All About Performance
• Prove that the history is not serializable using the serializability theorem
• It is impossible here to prove that it is not sequentially consistent
A Widely Used Protocol: Two-phase Locking
• Two-phase locking:
• A transaction must acquire a lock on data item v before reading or writing v
• A transaction cannot obtain any further locks once it releases any lock
• Growing phase followed by shrinking phase
• May result in deadlock
• Side note: A transaction may “upgrade” a reader lock to a writer lock. This is considered a new lock acquisition as well.
• In the previous example, process 0 will not release the lock on x until the end.
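The two rules can be sketched as a check over one transaction's sequence of steps. The step encoding (`"rlock"`, `"wlock"`, `"unlock"`, `"read"`, `"write"`) and the name `follows_2pl` are assumptions for illustration. For brevity the sketch does not distinguish reader locks from writer locks when checking an access. The two sample sequences mirror the earlier slides: holding both locks throughout, and locking each item only while it is used.

```python
def follows_2pl(steps):
    """Check one transaction's steps against the two-phase locking rules."""
    held = set()
    released_any = False
    for action, item in steps:
        if action in ("rlock", "wlock"):
            if released_any:
                return False  # rule 2: no new lock after any release
            held.add(item)
        elif action == "unlock":
            held.discard(item)
            released_any = True
        elif item not in held:
            return False  # rule 1: read/write requires a held lock
    return True


hold_all = [
    ("rlock", "x"), ("wlock", "y"),
    ("read", "x"), ("write", "y"),
    ("unlock", "x"), ("unlock", "y"),
]
lock_when_used = [
    ("rlock", "x"), ("read", "x"), ("unlock", "x"),
    ("wlock", "y"), ("write", "y"), ("unlock", "y"),
]

print(follows_2pl(hold_all), follows_2pl(lock_when_used))
```

A two-phase variant of the second sequence would acquire the writer lock on y before releasing the reader lock on x, which keeps the lock on x until after the computation.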
Correctness of Two-phase Locking
• Lemma 1: Let H be a history produced by two-phase locking. Suppose that SG(H) contains an edge from W to V. Then there exists some data item x such that W unlocks x before V locks x in H.
• Proof: By definition of SG(H), if there is an edge from W to V, there exist two primitive operations p1 (in W) and p2 (in V) such that they are conflicting and p1 is before p2 in H. Let x be the data item that p1 and p2 read or write.
• By the two-phase locking rule, W needs to lock x before p1 occurs and V needs to lock x before p2 occurs. Because p1 and p2 conflict, at least one of them is a write and needs the writer lock, so W and V cannot hold their locks on x at the same time. Since p1 is before p2, the only possibility is that V locks x after W unlocks x.
Correctness of Two-phase Locking
• Lemma 2: Let H be a history produced by two-phase locking. Suppose that SG(H) contains the path T_1 → T_2 → … → T_n. Then there exist data items x and y (x and y do not need to be distinct) such that T_1 unlocks x before T_n locks y in H.
• Proof: By induction on n. Lemma 1 proves the case n = 2. Assume the lemma holds for n-1; we will prove that it still holds for n.
• By the inductive assumption, there exist x and z such that T_1 unlocks x before T_{n-1} locks z in H. Because there is an edge from T_{n-1} to T_n in SG(H), Lemma 1 tells us that we can find a data item y such that T_{n-1} unlocks y before T_n locks y.
• By the two-phase locking rule, T_{n-1} can only unlock y after it locks z. (key step!) Thus T_1 unlocks x before T_n locks y.
Correctness of Two-phase Locking
• Theorem: Every history H generated by two-phase locking is serializable.
• Proof by contradiction. Assume H is not serializable. Then SG(H) contains a cycle T_1 → T_2 → … → T_n → T_1. By Lemma 2, we can find data items x and y such that T_1 unlocks x before T_1 locks y. But this violates the two-phase locking rule.
Linearizability in Databases
• (From Lecture 3) A history H is linearizable if
1. It is equivalent to some legal sequential history S, and
2. The operation partial order induced by H is a subset of the operation partial order induced by S
• As for sequential consistency, we will customize the definition for the database context
• Caveat 1: Assume that all transactions are from different processes
• Caveat 2: Transactions may be parallel (we do not consider these)
• Caveat 3: Conflict equivalent instead of equivalent
Linearizability in Databases
• For databases, linearizability is sometimes called external consistency. A history is externally consistent if:
1. It is conflict equivalent to some legal sequential history S, and
2. The operation partial order induced by H is a subset of the operation partial order induced by S
• Two-phase locking actually ensures external consistency
• Cf. slide 8, “most implementations of the scheduler will not have such behavior” that violates external order
Two-Phase Locking Preserves External Consistency
• Theorem: Any history H generated by two-phase locking is externally consistent.
• Proof: For each transaction T in H, we define its linearization point to be the time immediately after it acquires its last lock. Obviously, by the two-phase locking rule, T has not released any locks at its linearization point. (This is where we leverage the two-phase locking property.) We construct a legal sequential history S to be all transactions ordered by their linearization points.
Two-Phase Locking Preserves External Consistency
• Claim 1: The operation (transaction) partial order induced by H is a subset of the operation (transaction) partial order induced by S
• Proof: Suppose W → V belongs to the transaction partial order induced by H. This means that W finishes before V starts. Obviously W finishes acquiring all its locks before V finishes acquiring all its locks. Thus W’s linearization point is before V’s, and W is before V in S.
Two-Phase Locking Preserves External Consistency
• Claim 2: H is conflict equivalent to S.
• H and S contain the same set of transactions (obvious)
• For any two conflicting primitive operations p1 and p2: p1 is before p2 in H ⇔ p1 is before p2 in S
• Proof: It is sufficient to prove that p1 is before p2 in H ⇒ p1 is before p2 in S (why?)
• Let x be the data item accessed by both p1 and p2. Let W be the transaction containing p1 and V be the transaction containing p2. Because p1 is before p2 in H, W must unlock x before V locks x. We have, in time order:
[Figure: timeline with W’s linearization point, then W unlocks x, then V locks x, then V’s linearization point]
• W will be before V in S, and thus p1 will be before p2 in S
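The construction of S in this proof can be sketched by ordering transactions by the time of their last lock acquisition. The event encoding `(time, transaction, action, item)`, the timestamps, and the name `lock_point_order` are hypothetical and serve only to illustrate the idea.

```python
def lock_point_order(events):
    """Order transactions by the time at which each acquires its last lock."""
    lock_point = {}
    for time, txn, action, _item in sorted(events):
        if action == "lock":
            # Later lock events overwrite earlier ones for the same transaction.
            lock_point[txn] = time
    return sorted(lock_point, key=lock_point.get)


# V starts first, but it has to wait for x until W releases it.
events = [
    (0, "V", "lock", "z"),
    (1, "W", "lock", "x"),
    (2, "W", "lock", "y"),    # W's last lock acquisition
    (3, "W", "unlock", "x"),
    (4, "V", "lock", "x"),    # V's last lock acquisition
    (5, "W", "unlock", "y"),
    (6, "V", "unlock", "x"),
    (7, "V", "unlock", "z"),
]

print(lock_point_order(events))
```

In this run W unlocks x before V locks x, and W is placed before V in the resulting order, matching the timeline used in the proof of Claim 2.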
Summary
• Define “sequential consistency” in databases: Serializability
• Two-phase locking protocol to ensure serializability
• Define “linearizability” in databases: External consistency
• Two-phase locking ensures external consistency as well