Transactions, Concluded, and the Future of Data Management. Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems December 4, 2003. Slide content courtesy of Susan Davidson, Raghu Ramakrishnan & Johannes Gehrke. Final Administrivia.
Transactions, Concluded, and the Future of Data Management Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems December 4, 2003 Slide content courtesy of Susan Davidson, Raghu Ramakrishnan & Johannes Gehrke
Final Administrivia • Project demos today and tomorrow • Final exam handed out at the end of today’s class • Finals plus project reports due by 1PM, 12/18/2003 • Project reports should be ballpark 10-15 pages • Remember, quality and clarity of presentation matter! • Also, email me a brief message detailing: • Your contributions to the project • Your group members’ contributions and your assessment of “group dynamics” • Turn in at my office, 576 Levine Hall, or to my assistant, Kathy Venit, in 308 Levine Hall
Last Time… • We were discussing isolation levels • How to keep transactions from interfering with one another • Or at least, how to minimize this • Recall the strongest version of isolation was serializability
Theory of Serializability • A schedule of a set of transactions is a linear ordering of their actions • e.g. for the simultaneous deposits example: R1(X.bal) R2(X.bal) W1(X.bal) W2(X.bal) • A serial schedule is one in which all the steps of each transaction occur consecutively • A serializable schedule is one which is equivalent to some serial schedule (i.e. given any initial state, the final state is the same as one produced by some serial schedule) • The example above is neither serial nor serializable
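The simultaneous deposits example can be replayed in a few lines. This is only a sketch: the deposit amounts, the starting balance, and the helper name `run` are assumptions made for illustration, and each transaction is modeled as "read the balance, then write back the copy plus a deposit".

```python
def run(schedule, initial_balance, deposits):
    """Replay read/write steps against a single balance X.bal.

    schedule is a list of (op, txn) pairs with op in {"R", "W"}.
    A read stores the current balance in the transaction's private copy;
    a write stores that private copy plus the transaction's deposit.
    """
    balance = initial_balance
    local = {}  # per-transaction private copy of X.bal
    for op, txn in schedule:
        if op == "R":
            local[txn] = balance
        else:  # "W"
            balance = local[txn] + deposits[txn]
    return balance


deposits = {1: 100, 2: 50}  # hypothetical amounts for T1 and T2

# R1(X.bal) R2(X.bal) W1(X.bal) W2(X.bal), as on the slide
interleaved = [("R", 1), ("R", 2), ("W", 1), ("W", 2)]
serial_12 = [("R", 1), ("W", 1), ("R", 2), ("W", 2)]
serial_21 = [("R", 2), ("W", 2), ("R", 1), ("W", 1)]

interleaved_result = run(interleaved, 0, deposits)
serial_results = {run(serial_12, 0, deposits), run(serial_21, 0, deposits)}
print(interleaved_result, serial_results)
```

Under these assumptions both serial orders end with the same balance, while the interleaved schedule loses T1's deposit, so it matches neither serial schedule.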
Questions of Concern • Given a schedule S, is it serializable? • How can we "restrict" transactions in progress to guarantee that only serializable schedules are produced?
Conflicting Actions • Consider a schedule S in which there are two consecutive actions Ii and Ij of transactions Ti and Tj respectively • If Ii and Ij refer to different data items, then swapping Ii and Ij does not matter • If Ii and Ij refer to the same data item Q, then swapping Ii and Ij matters if and only if one of the actions is a write • Ri(Q) Wj(Q) produces a different final value for Q than Wj(Q) Ri(Q)
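The conflict rule above fits in a one-line predicate. In this sketch the `Action` tuple and the function name `conflicts` are illustrative choices, not notation from the slides.

```python
from typing import NamedTuple


class Action(NamedTuple):
    kind: str  # "R" for read, "W" for write
    txn: int   # index of the issuing transaction
    item: str  # data item the action touches


def conflicts(a, b):
    """Two actions conflict if they come from different transactions,
    touch the same data item, and at least one of them is a write."""
    return a.txn != b.txn and a.item == b.item and "W" in (a.kind, b.kind)


read_write = conflicts(Action("R", 1, "Q"), Action("W", 2, "Q"))
read_read = conflicts(Action("R", 1, "Q"), Action("R", 2, "Q"))
different_items = conflicts(Action("W", 1, "Q"), Action("W", 2, "P"))
print(read_write, read_read, different_items)
```

Swapping two adjacent actions is harmless exactly when `conflicts` returns `False` for them.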
Testing for Serializability • Given a schedule S, we can construct a di-graph G=(V,E) called a precedence graph • V : all transactions in S • E : Ti → Tj whenever an action of Ti precedes and conflicts with an action of Tj in S • Theorem: A schedule S is conflict serializable if and only if its precedence graph contains no cycles • Note that testing for a cycle in a digraph can be done in time O(|V|^2)
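A minimal sketch of the test follows. It assumes a schedule is given as a list of (kind, transaction, item) triples, and all function names are illustrative. The cycle check uses a depth-first search, which runs in O(|V| + |E|) and therefore stays within the O(|V|^2) bound quoted above.

```python
def precedence_graph(schedule):
    """Return the edge set {(Ti, Tj)} of the precedence graph.

    schedule is a list of (kind, txn, item) triples in execution order.
    """
    edges = set()
    for i, (k1, t1, x1) in enumerate(schedule):
        for k2, t2, x2 in schedule[i + 1:]:
            # An earlier action of t1 conflicts with a later action of t2.
            if t1 != t2 and x1 == x2 and "W" in (k1, k2):
                edges.add((t1, t2))
    return edges


def has_cycle(nodes, edges):
    """Depth-first search with three colours; a grey successor means a cycle."""
    succ = {n: [] for n in nodes}
    for a, b in edges:
        succ[a].append(b)
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {n: WHITE for n in nodes}

    def visit(n):
        colour[n] = GREY
        for m in succ[n]:
            if colour[m] == GREY:
                return True
            if colour[m] == WHITE and visit(m):
                return True
        colour[n] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in nodes)


def conflict_serializable(schedule):
    nodes = {t for _, t, _ in schedule}
    return not has_cycle(nodes, precedence_graph(schedule))


# The simultaneous deposits schedule: R1(X) R2(X) W1(X) W2(X)
deposits = [("R", 1, "X"), ("R", 2, "X"), ("W", 1, "X"), ("W", 2, "X")]
# A serial schedule of the same two transactions
serial = [("R", 1, "X"), ("W", 1, "X"), ("R", 2, "X"), ("W", 2, "X")]
print(conflict_serializable(deposits), conflict_serializable(serial))
```

For the deposits schedule the graph contains both T1 → T2 and T2 → T1, so the test reports it as not conflict serializable.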
An Example • Schedule over T1, T2, T3 (actions in slide order): R(X,Y,Z) R(X) W(X) R(Y) W(Y) R(Y) R(X) W(Z) • Precedence graph over T1, T2, T3 is cyclic: not serializable
Another Example • Schedule over T1, T2, T3: R2(X) W2(X) R3(X) W3(X) R1(Y) W1(Y) R2(Y) W2(Y) • Precedence graph T1 → T2 → T3 is acyclic: serializable
Producing the Equivalent Serial Schedule • If the precedence graph for a schedule is acyclic, then an equivalent serial schedule can be found by a topological sort of the graph • For the second example, the equivalent serial schedule is: • R1(Y)W1(Y) R2(X)W2(X) R2(Y)W2(Y) R3(X)W3(X)
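The topological sort can be sketched with Python's standard `graphlib` module (Python 3.9+). The interleaved schedule below is an assumption: its transaction subscripts are inferred from the serial schedule stated above, and the function names are illustrative.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def serial_order(schedule):
    """Topologically sort the precedence graph of a schedule.

    schedule is a list of (kind, txn, item) triples in execution order.
    TopologicalSorter raises graphlib.CycleError if the graph is cyclic.
    """
    preds = {t: set() for _, t, _ in schedule}  # txn -> transactions that must come first
    for i, (k1, t1, x1) in enumerate(schedule):
        for k2, t2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "W" in (k1, k2):
                preds[t2].add(t1)
    return list(TopologicalSorter(preds).static_order())


def serialize(schedule):
    """Rewrite the schedule with each transaction's actions run consecutively."""
    out = []
    for t in serial_order(schedule):
        out.extend(f"{k}{txn}({x})" for k, txn, x in schedule if txn == t)
    return " ".join(out)


# Assumed interleaving for the second example (subscripts inferred)
second_example = [
    ("R", 2, "X"), ("W", 2, "X"), ("R", 3, "X"), ("W", 3, "X"),
    ("R", 1, "Y"), ("W", 1, "Y"), ("R", 2, "Y"), ("W", 2, "Y"),
]
order = serial_order(second_example)
serial_text = serialize(second_example)
print(order, serial_text)
```

Here the graph is the chain T1 → T2 → T3, so the sort has only one possible result; in general a precedence graph may admit several topological orders, each giving an equivalent serial schedule.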
Locking and Serializability • We said that for a serializable schedule, a transaction must hold all locks until it terminates (a condition called strict locking) • It turns out that this is crucial to guarantee serializability • Note that the first (bad) example could have been produced if transactions acquired and immediately released locks.
Well-Formed, Two-Phased Transactions • A transaction is well-formed if it acquires at least a shared lock on Q before reading Q or an exclusive lock on Q before writing Q and doesn’t release the lock until the action is performed • Locks are also released by the end of the transaction • A transaction is two-phased if it never acquires a lock after unlocking one • i.e., there are two phases: a growing phase in which the transaction acquires locks, and a shrinking phase in which locks are released
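Both definitions can be checked mechanically on a single transaction's event list. In this sketch the event encoding ("S" and "X" for lock requests, "U" for unlock, "R" and "W" for data actions) and the function names are assumptions made for illustration.

```python
def is_well_formed(events):
    """Check that every read/write is covered by a suitable lock and that
    no lock is still held when the transaction ends.

    events is a list of (op, item) pairs for one transaction, where op is
    "S" (shared lock), "X" (exclusive lock), "U" (unlock), "R" or "W".
    """
    held = {}  # item -> strongest lock mode currently held
    for op, item in events:
        if op in ("S", "X"):
            if held.get(item) != "X":  # never downgrade an exclusive lock
                held[item] = op
        elif op == "U":
            held.pop(item, None)
        elif op == "R" and item not in held:
            return False
        elif op == "W" and held.get(item) != "X":
            return False
    return not held


def is_two_phased(events):
    """Check that no lock is acquired after the first unlock."""
    unlocked = False
    for op, _ in events:
        if op == "U":
            unlocked = True
        elif op in ("S", "X") and unlocked:
            return False
    return True


# Acquires everything first, then releases: growing phase, then shrinking phase.
two_phase = [("X", "A"), ("R", "A"), ("W", "A"), ("X", "B"), ("W", "B"),
             ("U", "A"), ("U", "B")]
# Releases A before locking B: well-formed, but not two-phased.
early_release = [("X", "A"), ("W", "A"), ("U", "A"),
                 ("X", "B"), ("W", "B"), ("U", "B")]
# Writes under a shared lock only: not well-formed.
weak_lock = [("S", "A"), ("W", "A"), ("U", "A")]

print(is_well_formed(two_phase), is_two_phased(two_phase))
print(is_well_formed(early_release), is_two_phased(early_release))
print(is_well_formed(weak_lock))
```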
Two-Phased Locking Theorem • If all transactions are well-formed and two-phase, then any schedule in which conflicting locks are never granted is serializable • i.e., there is a very simple scheduler! • However, if some transaction is not well-formed or two-phase, then there is some schedule in which conflicting locks are never granted but which fails to be serializable • i.e., one bad apple spoils the bunch.
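The "very simple scheduler" only has to refuse conflicting lock requests. The lock table below is a sketch of that single rule, not a full lock manager: the class and method names are hypothetical, and a denied request simply returns `False` where a real system would block or queue the transaction.

```python
class LockTable:
    """Grants shared ("S") and exclusive ("X") locks, never two conflicting ones."""

    def __init__(self):
        self.held = {}  # item -> {txn: mode}

    def try_lock(self, txn, item, mode):
        holders = self.held.setdefault(item, {})
        others = [m for t, m in holders.items() if t != txn]
        if mode == "S":
            # A shared lock is compatible only with other shared locks.
            compatible = all(m == "S" for m in others)
        else:
            # An exclusive lock is compatible with no other holder.
            compatible = not others
        if not compatible:
            return False
        if holders.get(txn) != "X":  # keep the strongest mode already held
            holders[txn] = mode
        return True

    def unlock(self, txn, item):
        self.held.get(item, {}).pop(txn, None)


locks = LockTable()
t1_shared = locks.try_lock(1, "Q", "S")      # granted
t2_shared = locks.try_lock(2, "Q", "S")      # granted, shared locks coexist
t2_upgrade = locks.try_lock(2, "Q", "X")     # denied while T1 still reads Q
locks.unlock(1, "Q")
t2_retry = locks.try_lock(2, "Q", "X")       # granted after T1 releases
t1_again = locks.try_lock(1, "Q", "S")       # denied, T2 holds Q exclusively
print(t1_shared, t2_shared, t2_upgrade, t2_retry, t1_again)
```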
Summary of Transactions • Transactions are all-or-nothing units of work guaranteed despite concurrency or failures in the system • Theoretically, the “correct” execution of transactions is serializable (i.e. equivalent to some serial execution) • Practically, this may adversely affect throughput, hence isolation levels • With isolation levels, users can specify the level of “incorrectness” they are willing to tolerate
What to Look for Down the Road • … well, no one really knows the answer to this… • … But here are some hints, ideas, and hot directions • Sensors and streaming data • Peer-to-peer meets databases • “The Semantic Web” • Collaborative data sharing
Sensors and Streaming Data • No databases at all… • … Instead we have networks of simple sensors • Madden, starting at MIT • Gehrke, Cornell • Widom, Stanford • queries are in SQL • data is live and “streaming” • we compute aggregates over “windows”
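The idea of computing aggregates over windows can be illustrated with a sliding average. This is a Python sketch of the concept only, not the SQL dialect or API of any of the systems named above; the readings and the function name are made up.

```python
from collections import deque


def sliding_averages(readings, size):
    """Yield the average of the most recent `size` readings after each arrival.

    Early results cover fewer readings until the window has filled up.
    """
    window = deque()
    total = 0.0
    for value in readings:
        window.append(value)
        total += value
        if len(window) > size:
            total -= window.popleft()  # the oldest reading leaves the window
        yield total / len(window)


temperatures = [20, 22, 24, 26]  # hypothetical live sensor readings
averages = list(sliding_averages(temperatures, 2))
print(averages)
```

Because the function is a generator, it handles one reading at a time and never needs the whole stream in memory, which is the property a query over live data relies on.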
What’s Interesting Here • We’re not talking about data on disk – we’re talking about queries over “current readings” • Sensors are generally “stupid” and may be battery-operated • A lot of challenges are networking-related: how to aggregate data before it gets sent, etc. • The next step (e.g., work initiated here @ Penn): including sensors that capture images – a very different problem! • This has many more compelling applications – security, monitoring, correlating multiple sensors, rescue operations, military logistics and coordination, etc.
Peer-to-Peer Computing • Fundamentally, our model of DBMSs tends to be centralized • Even for data integration: there’s a single mediator • This has many implications: central administration, central coordination, etc. • What can be gained from borrowing a page from peer-to-peer systems like Napster, Kazaa, etc.? • A better architecture? • Solutions to many problems unsolved by distributed DBMSs? • Replication, object location, distributed optimization, resiliency to failure, … • New types of applications, e.g., in integration?
P2P Work • As a new architecture for storage and querying • PIER (Berkeley), P-Grid (EPFL), Medusa (MIT) • A better way of thinking about translating and exchanging data • Piazza (Washington), Orchestra (Penn), Hyperion (Toronto), work at Trento
The Semantic Web • In some ways, a very “pie-in-the-sky” vision • But some real and concrete problems might be partly solvable • Goal is really very similar to data integration, where somehow we have mappings between the schemas • Currently, most people in the SW community are from the knowledge representation community and use RDF • Focus: very rich ways of describing schemas – “ontologies” – that blend querying with class definitions • “Teachers are people who teach students”; “Tenure-track professors are teachers at universities who can get tenure”; etc. • Implicit take on the problem: if we create better languages for describing ontologies, it’s easier to mediate between schemas
Holes in the Semantic Web • What issues and concerns came up in the data integration assignment you had? • Do you think a richer schema language would help for these? • Do you think “better normalization” would help? • Fundamentally, we need: • Languages for describing not only relationships, but also transformations between formats (e.g., XML schemas) • Automatic or partly automated ways of discovering mappings and correspondences • These are all database problems, and the solution likely must come from the DB community • This is part of what P2P systems like Piazza and Hyperion try to address
My Take on the Future • We’ve evolved from a world where data management is about controlling the data • Instead, data management is about translating and transforming data using declarative languages • It should ultimately become much like TCP or SOAP – a set of standard services for “getting stuff” from one point to another, or from one form to another • It’s the plumbing that connects different applications using different formats • Orchestra project at Penn: focuses on how to build a system for supporting collaborative science • People publish and map data in different schemas • What happens if people start updating it? • How do you propagate, manage, trace, reconcile changes?