360 likes | 571 Views
Lecture 1 Introduction to Principles of Distributed Computing. Sergio Rajsbaum Math Institute UNAM, Mexico. Lecture 1. Part I: Two-phase commit. An example of a distributed protocol
E N D
Lecture 1Introduction to Principles of Distributed Computing Sergio Rajsbaum Math Institute UNAM, Mexico
Lecture 1 • Part I: Two-phase commit. An example of a distributed protocol • Part II: What is a distributed system and its parameters. Problems solved in such a system. The need for a theoretical foundation
Part I: Two-phase commit An example of an important distributed DBMS protocol
Principles of Distributed Computing • With the development of networking infrastructures and software, it is more and more rare to find a system that does not interact with other systems • Distributed computing studies systems where components interact and collaborate • Principles of distributed computing tries to understand the fundamental possibilities and limitations of such systems, with a precise, scientific approach • Goal: to design efficient and reliable systems, and techniques to design them, analyze them and prove them correct, or to prove impossibility results when no protocol exists
Distributed Commit An example from databases [ Garcia-Molina, Ullman, Widom, Database Systems, 2001 ]
Distributed Commit • A distributed transaction with components at several sites should execute atomically • Example: A manager of a chain of stores wants to query all the stores, find the inventory of toothbrushes at each, and issue instructions to move toothbrushes from store to store in order to balance the inventory. • The operation is done by a single global transaction T that has component Ti at the i-th store and a component T0 at the office where the manages is located.
Sequence of activities performed by T • Component T0 is created at the site of the manager • T0 sends messages to all the stores instructing them to create components Ti • Each Ti executes a query at store i to discover the number of toothbrushes in inventory and reports this number to T0 • T0 takes these numbers and determines, by some algorithm we shall not discuss, what shipments of toothbrushes are desired. T0 then sends messages such as “store 10 should ship 500 toothbrushes to store 7” to the appropriate stores • Stores receiving instructions update their inventory and perform the shipments
Atomicity • Make sure it does not happen: some of the actions of T get executed, but others do not • We do assume atomicity of each Ti, through mechanisms such as logging and recovery • Failures make difficult the achievement of atomicity of T • A site fails or is disconnected from the network • A bug in the algorithm to redistribute toothbrushes instructs store 10 to ship more than it has
Example of failures • Suppose T10 replies to T0’s 1st message with its inventory. • The machine at 10 then crashes, the instructions form T0 are never received by T10 • However, T7 sees no problem, and receives the instructions from T0 • Can distributed transaction T ever commit?
Two-phase commit • To guarantee atomicity distributed DBMS use a protocol for deciding whether or not to commit a distributed transaction • Each component of the transaction will commit, or non will • The protocol is coordinator based. It could be the site at which the transaction originates, such as T0. Call it C
Phase 1 • C sends to each site “prepare T” • Each site receiving it, decides whether to commit or abort its component of T. It must eventually send this response to C • At this point the site performs all actions associated with the local T, to be sure the local component does not abort later due to a local failure, and sends “ready T” to C. Only C could instruct it to abort later on • If the site send “don’t commit” is can locally abort T, since T will surely abort even if another wants to commit
Phase 2 • It begins when C receives the responses “ready” or “don’t commit” by all. Or after a timeout, in case of a site fails or gets disconnected • If C receives “ready” from all then it decides to commit T. Sends “commit T” to all • If C receives a “don’t commit” it aborts T, and sends “abort” to all • If a site receives “commit” it commits its local component of T; if it receives “abort” it aborts it
Recovery in case of failures Two cases: when C fails or when another fails
Recovery of a non-C site • Suppose a site S fails during a two-phase commit. Need to make sure when it recovers is does not introduce inconsistencies • The hard case: its last log is it sent a “ready” to C • Then it must communicate to at least one other site to find out the global decision for T. In the worst case, no other site can be contacted and the local component of T must be kept active
What if C fails? • The surviving participant sites must wait for C to recover, or after a waiting time, elect a new coordinator • Leader election problem is in itself a complex problem. The simple solution of interchanging IP addresses and electing the smallest works often, but may fail • Once a new leader C’ exists, it polls the sites about the status of all transactions • If some site had already received a “commit” from C then C’ sends “commit” to all • If some site had already received a “ abort” it sends it to all • If no site received such messages, and at least one was not ready to commit, it is safe to abort T
The hard case • No site had received “commit” or “abort” and every site is ready to commit • C’ cannot be sure if C found some reason to abort T (perhaps for a local reason, or some delays), or to commit it and it had already committed it locally. • Must wait until the original C recovers • In real systems the administrator has the ability to intervene manually and force the waiting transaction components to finish • The result is a possible loss of atomicity, and the person executing T will be notified
So, is the protocol correct ? Is it efficient? Is manual intervention unavoidable ??
How to analyze the protocol ? • What exactly does it mean to be correct? • Not just “either all commit or all abort” • What is the termination correctness condition ? • Under what circumstances it is correct? • E.g. Trivial if no failures are possible. • What if there are malicious failures? • How to choose the time-outs? • What is its performance? • Depends on the timeouts, delays. • On the type and number of failures • Is it efficient? Are there more efficient protocols?
Part II: What is a distributed system and its parameters. Problems solved in such a system. The need for a theoretical foundation
What is distributed computing? • Any system where several independent computing components interact • This broad definition encompasses • VLSI chips, and any modern PC • tightly-coupled shared memory multiprocessor • local area cluster of workstations • internet, WEB, Web services • wireless networks, sensor networks, ad-hoc networks • cooperating robots, mobile agents, P2P systems
Computing components • Referred to processors or processes in the literature • Can represent a • microprocessor • process in a multiprocessing operating system • Java thread • mobile agent, mobile node (e.g. laptop), robot • computing element in a VLSI chip
Interaction – message passing vs. shared memory • Processors need to communicate with each other to collaborate, via • Message passing • Point-to-point channels, defining an interconnection graph • All-to-all using an underlying infrastructure (e.g. TCP/IP) • Broadcast; wireless, satellite • Shared memory • Shared-objects: read/write, test&set, compare&swap, etc • Usually harder to implement, easier to program
collaborate A distributed system processors Communication media
Failures • Any system that includes many components running over a long period of time must consider the possibility of failures • of processors and communication media • of different severity • from processor crashes or message loses, to • malicious Byzantine behavior
A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. Leslie Lamport
Parallel vs. distributed computing • Distributed computing • focuses on the more loosely coupled end of the spectrum • Each processor has its own semi-independent agenda, but needs to coordinate with others for fault-tolerance, sharing resources, availability, etc. • Processors tend to be somewhat independent: sometimes physically separated, running at independent speeds, communicating through a media with unpredictable delays • Multi-layered, many different issues need to be addressed: communication, synchronization, fault-tolerance… • Parallel processing • studies the more tightly-coupled end of the spectrum • usually all processors are dedicated to perform one large task • as fast as possible. Parallelizing a problem is a main goal.
Many kinds of problems • Clock synchronization • Routing • Broadcasting • Naming • P2P, how to share and find resources • sharing resources, mutual exclusion • Increasing fault-tolerance, failure detection • Security, authentication, cryptography • Database transactions, atomic commitment • Backups, reliable storage, file systems • Applications, airline reservation, banking, electronic commerce, publish/subscribe systems, web search, web caching, …
Multi-layered, complex interactionsAn example • A fault-tolerant broadcast service is useful to build a higher level database transaction module • Naming, authentication is required • And may work more efficiently if clocks are tightly synchronized • And good routing schemes should exist • If the clock synchronization is attacked, the whole system may be compromised
Chaos We need a good foundation, principles of distributed computing
Chaos • Too many models, problems and orthogonal, interacting issues • Very hard to get things right, to reproduce operating scenarios • Sometimes it is easy to adapt a solution to a different model, sometimes a small change in the model makes a problem unsolvable
Distributed computing theory • Models • Good models [Schneider Ch.2 in Distributed Systems, Mullender (Ed.)] • Relation between models: solve a problem only once; solve it in the strongest possible model • Problems • Search of paradigms that represent fundamental distributed computing issues • Relations between problems: hierarchies of solvable and unsolvable problems; reductions • Solutions • Design algorithms, verification techniques, programming abstractions • Impossibility results and lower bounds • Efficiency measures • Time, communication, failures, recovery time, bottlenecks, congestion
BibliographyTheory of distributed computing textbooks • Attiya, Welch, Distributed Computing, Wiley-Interscience, 2 ed., 2004 • Garg, Elements of Distributed Computing, Wiley-IEEE, 2002 • Lynch, Distributed Algorithms, Morgan Kaufmann,1997 • Tel, Introduction to Distributed Algorithms, Cambridge U., 2 ed. 2001
Bibliographyothers • Distributed Algorithms and Systems http://www.md.chalmers.se/~tsigas/DISAS/index.html • Conferences: DISC, PODC,… • Journals: Distributed Computing,… • Special issue PODC 20th anniversary, Sept. 2003 • ACM SIGACT News Distributed Computing Column. Also one in EATCS Bulletin