Reference Book: Principles of Distributed Database Systems. Chapters: Chapter 12: Distributed DBMS Reliability; Chapter 14: Distributed Object Database Management Systems; Chapter 16: Current Issues. Preethi Vishwanath, Week 2: 12th September 2006 – 24th September 2006.
Reliability Concepts – Definitions • A system is a mechanism that consists of a collection of components and interacts with its environment with a recognizable pattern of behavior. • Each component of a system is itself a system, commonly called a subsystem. • The way the components of a system are put together is called the design of the system. • The external state of a system can be defined as the response that the system gives to an external stimulus. • The behavior of the system in responding to all possible stimuli from its environment is laid out in an authoritative specification of its behavior. • Any deviation of a system from the behavior described in the specification is considered a failure. • Internal states that could lead the system to failure are called erroneous states. • Any error in the internal states of the components of a system, or in the design of a system, is called a fault in the system. • A permanent fault, also called a hard fault, is one that reflects an irreversible change in the behavior of the system.
Reliability • Refers to the probability that the system under consideration does not experience any failures in a given time interval: R(t) = Pr{0 failures in time [0, t] | no failures at t = 0}, where R(t) is the reliability of the system. Availability • Refers to the probability that the system is operational according to its specification at a given point in time t. With exponential failure and repair rates: A = μ / (λ + μ), where λ is the failure rate and μ is the repair rate. Mean Time Between Failures (MTBF) • The expected time between subsequent failures in a system with repair. • Can be calculated either from empirical data or from the reliability function. • Related to the failure rate: MTBF = ∫0∞ R(t) dt. Mean Time To Repair (MTTR) • The expected time to repair a failed system. • Related to the repair rate. • The steady-state availability of a system with exponential failure and repair rates can be specified as A = MTBF / (MTBF + MTTR).
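The two availability formulas above agree when failures and repairs are exponential, since then MTBF = 1/λ and MTTR = 1/μ. A minimal sketch with assumed example rates (the numbers are illustrative, not from the slides):

```python
# Steady-state availability computed two ways; with exponential rates,
# MTBF = 1/lam and MTTR = 1/mu, so mu/(lam + mu) == MTBF/(MTBF + MTTR).

def availability_from_rates(lam: float, mu: float) -> float:
    """Availability from failure rate lam and repair rate mu."""
    return mu / (lam + mu)

def availability_from_times(mtbf: float, mttr: float) -> float:
    """Availability from mean time between failures and mean time to repair."""
    return mtbf / (mtbf + mttr)

lam = 0.001   # assumed: one failure per 1000 hours
mu = 0.5      # assumed: repairs take 2 hours on average
a1 = availability_from_rates(lam, mu)
a2 = availability_from_times(1 / lam, 1 / mu)
assert abs(a1 - a2) < 1e-12   # both formulas give the same availability
```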
Fault Tolerance • Refers to a system design approach which recognizes that faults will occur. • Fault prevention/fault intolerance • Aims at ensuring that the implemented system will not contain any faults. • Two aspects: • Fault avoidance • Refers to the techniques used to make sure that faults are not introduced into the system. • Involves detailed design methodologies such as design walkthroughs, design inspections, etc. • Fault removal • Refers to the techniques employed to detect any faults that might have remained in the system despite the application of fault avoidance, and to remove these faults. • Fault detection • Issues a warning when a failure occurs but does not provide any means of tolerating the failure. • Latent failure • One that is detected some time after its occurrence. • Mean time to detect • Average error latency time over a number of identical systems. • Fail-stop modules • Constantly monitor themselves and, upon detecting a fault, shut themselves down automatically. • Fail-fast • Implemented in software by defensive programming, where each software module checks its own state during state transitions. • Different ways of implementing process pairs • Lock-step • Automatic checkpointing • State checkpointing • Data checkpointing • Persistent process pairs
Failures in Distributed DBMS Site (system) failures • Always assumed to result in the loss of main memory contents. • A total failure refers to the simultaneous failure of all sites in the distributed system. • A partial failure indicates the failure of only some sites while the others remain operational. Transaction failures • Causes include incorrect input data and the detection of a present or potential deadlock. • The usual approach in cases of transaction failure is to abort the transaction. Media failures • Refer to failures of the secondary storage devices that store the database. • Duplexing of disk storage and maintaining archival copies of the database are common techniques that deal with this sort of catastrophic problem. Communication failures • Unique to the distributed case. • The most common ones are errors in messages, improperly ordered messages, lost messages, and line failures. • The failure of the communication network to deliver messages and their confirmations within a specified period is called a performance failure.
[Figure: Interface between the local recovery manager and the buffer manager. The local recovery manager issues Fetch and Flush requests to the database buffer manager, which Reads and Writes pages between the stable database in secondary storage and the volatile database buffers in main memory.]
Recovery Information In-place update recovery information • It is necessary to store information about database state changes in order to recover; this information is recorded in the database log. • REDO action: the log needs to include sufficient data to permit the redo, taking the old database state and recovering the new state. • UNDO action: the log needs to include sufficient data to permit the undo, taking the new database state and recovering the old state. Out-of-place update recovery information • Typical techniques: • Shadowing: every time an update is made, the old stable storage page, called the shadow page, is left intact and a new page with the updated data item values is written into the stable database. • Differential files Network partitioning • Simple partitioning: the network is divided into only two components. • Multiple partitioning: the network is divided into more than two components. Centralized protocols • Primary site: it makes sense to permit the operation of the partition that contains the primary site, since it manages the locks. • Primary copy: more than one partition may be operational for different queries. Voting-based protocols • Transactions are executed if a majority of the sites vote to execute them. • Quorum-based voting can be used as a replica control method, as well as a commit method to ensure transaction atomicity in the presence of network partitioning. • In the case of non-replicated databases, this involves the integration of the voting principle with commit protocols.
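The majority-voting rule above can be sketched in a few lines (the vote counts are hypothetical, not from the slides). Since only one partition can hold a strict majority, at most one partition remains operational after a split:

```python
# Majority voting for network partitioning: a partition may execute
# transactions only if it holds a strict majority of all votes.

def can_operate(votes_in_partition: int, total_votes: int) -> bool:
    """True if this partition holds a strict majority of all votes."""
    return votes_in_partition > total_votes / 2

# Assumed scenario: five sites, one vote each; the network splits 3-2.
total = 5
assert can_operate(3, total) is True    # majority partition may proceed
assert can_operate(2, total) is False   # minority partition must block
assert can_operate(2, 4) is False       # an even split blocks both sides
```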
2 Phase Commit Protocol • The two-phase commit protocol is a distributed algorithm which lets all sites in a distributed system agree to commit a transaction. • The protocol results in either all nodes committing the transaction or all aborting it, even in the case of site failures and message losses. • Basic algorithm • Commit-request phase • The coordinator sends a query-to-commit message to all cohorts. • The cohorts execute the transaction up to the point where they will be asked to commit. They each write an entry to their undo log and an entry to their redo log. • Each cohort replies with an agreement message if the transaction succeeded, or an abort message if the transaction failed. • The coordinator waits until it has a message from each cohort.
Commit phase • Success: if the coordinator received an agreement message from all cohorts during the commit-request phase: • The coordinator writes a commit record into its log. • The coordinator sends a commit message to all the cohorts. • Each cohort completes the operation, and releases all the locks and resources held during the transaction. • Each cohort sends an acknowledgement to the coordinator. • The coordinator completes the transaction when all acknowledgements have been received. • Failure: if any cohort sent an abort message during the commit-request phase: • The coordinator sends a rollback message to all the cohorts. • Each cohort undoes the transaction using its undo log, and releases the resources and locks held during the transaction. • Each cohort sends an acknowledgement to the coordinator. • The coordinator completes the transaction when all acknowledgements have been received.
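The two phases above can be sketched as follows. This is a minimal in-memory model of the decision logic only; the real protocol involves durable logs, network messages, and timeout handling, all elided here, and the class and method names are assumptions:

```python
# Two-phase commit decision logic: collect votes, then commit only if
# every cohort voted to commit; otherwise all cohorts roll back.
from enum import Enum

class Vote(Enum):
    COMMIT = "agree"
    ABORT = "abort"

class Cohort:
    def __init__(self, will_succeed: bool):
        self.will_succeed = will_succeed
        self.state = "initial"

    def prepare(self) -> Vote:
        # Execute up to the commit point; undo/redo log writes elided.
        self.state = "ready" if self.will_succeed else "aborted"
        return Vote.COMMIT if self.will_succeed else Vote.ABORT

    def finish(self, decision: str) -> None:
        # Commit, or undo via the undo log; lock release elided.
        self.state = decision

def two_phase_commit(cohorts: list) -> str:
    # Phase 1: commit-request -- gather a vote from every cohort.
    votes = [c.prepare() for c in cohorts]
    # Phase 2: global commit only if ALL cohorts agreed.
    decision = "committed" if all(v is Vote.COMMIT for v in votes) else "aborted"
    for c in cohorts:
        c.finish(decision)
    return decision

assert two_phase_commit([Cohort(True), Cohort(True)]) == "committed"
assert two_phase_commit([Cohort(True), Cohort(False)]) == "aborted"
```

Note how a single abort vote forces a global abort: atomicity comes from the all-or-nothing decision rule, not from any single site.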
3 Phase Commit • Non-blocking when failures are restricted to site failures. • A commit protocol that is synchronous within one state transition is non-blocking if and only if its state transition diagram contains neither of the following: • A state that is "adjacent" to both a commit and an abort state. • A noncommittal state that is "adjacent" to a commit state. [Figure: 3PC state transition diagrams. Coordinator: initial →(Commit / Prepare)→ wait →(Vote-abort / Global-abort)→ abort, or wait →(Vote-commit / Prepare-to-commit)→ pre-commit →(Ready-to-commit / Global-commit)→ commit. Participant: initial →(Prepare / Vote-commit)→ ready →(Global-abort / Ack)→ abort, or ready →(Prepare-to-commit / Ready-to-commit)→ pre-commit →(Global-commit / Ack)→ commit.]
Replication and Replica Control Protocols • Having replicas of data items improves system availability. • Advantages • With careful design, it is possible to ensure that single points of failure are eliminated. • Overall system availability is maintained even when one or more sites fail. • Disadvantages • Whenever updates are introduced, the complexity of keeping replicas consistent arises; this is the topic of replication protocols.
Concepts Object • Represents a real entity in the system. • Represented as a pair (object identity, state). • Enables referential object sharing. State • Either an atomic value or a constructed value. Value • An element of D is a value, called an atomic value. • [a1:v1, …, an:vn], in which each ai is an element of A and each vi is either a value or an element of I, is called a tuple value. • {v1, …, vn}, in which each vi is either a value or an element of I, is called a set value. Class • Grouping of common objects; a template for all common objects. Inheritance • Declaring a type to be a subtype of another. Abstract data types • A template for all objects of that type. • Describes a type of data by providing a domain of data with the same structure, as well as the operations applicable to objects of that domain. • This abstraction capability is commonly referred to as encapsulation. Composition (aggregation) • Restrictions on composite objects result in complex objects. • The composite-object relationship between types can be represented by a composition graph. Collection • User-defined grouping of objects; similar to a class in that it groups objects. Subtyping • Based on the specialization relationship among types.
Object Distribution Design Path partitioning • A concept describing the clustering of all the objects forming a composite object into a partition. • Can be represented as a hierarchy of nodes forming a structural index. • The index contains references to all the component objects of a composite object, eliminating the need to traverse the class composition hierarchy. Class partitioning algorithms • The main issue is to improve the performance of user queries and applications by reducing irrelevant data access. • Affinity-based approach: affinity among instance variables and methods, and affinity among multiple methods, can be used for horizontal and vertical class partitioning. • Cost-driven approach Allocation • Local behavior – local object: the behavior, the object to which it is applied, and the arguments are all co-located. No special mechanism is needed to handle this case. • Local behavior – remote object: the behavior and the object to which it is applied are at different sites. Two ways to deal with it: • Move the remote object to the site where the behavior is located. • Ship the behavior implementation to the site where the object is located.
[Figure: Client-server architectures. Object server architecture: the client runs the application, object browser, programmatic and query interfaces, and an object manager; the server runs the object manager, query optimizer, lock manager, storage manager, and page cache manager over the object database. Page server architecture: the client additionally runs the file & index manager and the page & cache manager; the server runs the lock manager, storage manager, and page cache manager over the object database.]
Cache Consistency • A problem in any data-shipping system that moves data to the clients. • Cache consistency algorithms • Avoidance-based synchronous algorithms • Clients retain read locks across transactions, but they relinquish write locks at the end of the transaction. • Clients send lock requests to the server and block until the server responds. • If a client requests a write lock on a page that is cached at other clients, the server calls back those clients' copies before granting the lock. • Avoidance-based asynchronous algorithms • Do not have the message-blocking overhead present in synchronous algorithms. • Clients send lock escalation messages to the server and continue application processing. • Avoidance-based deferred algorithms • Clients batch their lock escalation requests and send them to the server at commit time. • The server blocks the updating client if other clients are reading the updated objects. • Detection-based synchronous algorithms • Clients contact the server whenever they access a page in their cache, to ensure that the page is not stale or being written to by other clients. • Detection-based asynchronous algorithms • Clients send lock escalation requests to the server, but optimistically assume that their requests will be successful. • After a client transaction commits, the server propagates the updated pages to all the other clients that have also cached the affected pages. • Detection-based deferred algorithms • Can outperform callback locking algorithms, even while encountering a higher abort rate, if the client transaction state completely fits into the client cache and all application processing is strictly performed at the clients.
Object Identifier Management • Object identifiers (OIDs) are system generated. • Used to uniquely identify every object. • Transient object identity can be implemented more efficiently. • Two common solutions • Physical identifier approach (POID) • Equates the OID with the physical address of the corresponding object. • Advantage: the object can be obtained directly from the OID. • Drawback: all the parent objects and indexes must be updated whenever an object is moved to a different page. • Logical identifier approach (LOID) • Consists of allocating a system-wide unique OID. • Since OIDs are invariant, there is no overhead due to object movement.
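The LOID trade-off above can be illustrated with a small indirection table (the class and field names are hypothetical, for illustration only): a POID is the physical address itself, while a LOID is mapped through a table, so moving an object changes one map entry rather than every parent and index that references it.

```python
# LOID indirection: OIDs stay invariant while physical addresses change.

class LoidStore:
    def __init__(self):
        self.next_oid = 0
        self.oid_to_addr = {}  # OID -> (page, slot)

    def create(self, page: int, slot: int) -> int:
        oid = self.next_oid
        self.next_oid += 1
        self.oid_to_addr[oid] = (page, slot)
        return oid

    def locate(self, oid: int):
        # One extra lookup compared to a POID -- the cost of indirection.
        return self.oid_to_addr[oid]

    def move(self, oid: int, page: int, slot: int) -> None:
        # Only the indirection entry changes; references to the OID stay valid.
        self.oid_to_addr[oid] = (page, slot)

store = LoidStore()
oid = store.create(page=7, slot=3)
store.move(oid, page=12, slot=0)   # object relocated to another page
assert store.locate(oid) == (12, 0)
```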
Object Migration Three alternatives can be considered for the migration of classes (types): • The source code is moved and recompiled at the destination. • The compiled version of a class is migrated just like any other object. • The source code of the class definition is moved, but not its compiled operations, for which a lazy migration strategy is used. Objects can be in one of four states: • Ready: ready objects are not currently invoked, or have not received a message, but are ready to be invoked or to receive a message. • Active: active objects are currently involved in an activity in response to an invocation or a message. • Waiting: waiting objects have invoked another object and are waiting for a response. • Suspended: suspended objects are temporarily unavailable for invocation. Migration involves two steps: • Shipping the object from the source to the destination, and • Creating a proxy at the source, replacing the original object.
Object Clustering • Difficult for two reasons: • It is not orthogonal to the object identity implementation. Logical OIDs incur more overhead, but enable vertical partitioning of classes. • Clustering of complex objects along the composition relationship is more involved because of object sharing. • Given a class graph, there are three basic storage models for object clustering: • The decomposition storage model partitions each object class into binary relations. • The normalized storage model stores each class as a separate relation. • The direct storage model enables multi-class clustering of complex objects based on the composition relationship.
Distributed Garbage Collection • As programs modify objects and remove references, a persistent object may become unreachable from the persistent roots of the system when there is no longer any reference to it. • Basic garbage collection algorithms can be categorized as reference counting or tracing-based. • Reference counting • Each object has an associated count of the references to it. • Each time a program creates an additional reference that points to an object, the object's count is incremented. • When a reference to an object is destroyed, the corresponding count is decremented. • Tracing-based • Mark and sweep algorithms • Two-phase algorithms: the first phase, the mark phase, starts from the roots and marks every reachable object. • Once all live objects are marked, the memory is examined and unmarked objects are reclaimed. • Copy-based algorithms • Divide memory into two disjoint areas: • From-space, from which programs manipulate objects. • To-space, which is left empty.
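The mark-and-sweep scheme above can be sketched on a toy in-memory object graph (the class and function names are hypothetical): mark everything reachable from the roots, then sweep away whatever stayed unmarked.

```python
# Mark and sweep: phase 1 marks reachable objects, phase 2 reclaims the rest.

class Obj:
    def __init__(self, name: str):
        self.name = name
        self.refs = []        # outgoing references to other Obj instances
        self.marked = False

def mark(roots) -> None:
    # Depth-first traversal from the roots; every visited object is live.
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if not obj.marked:
            obj.marked = True
            stack.extend(obj.refs)

def sweep(heap):
    # Keep only marked objects; clear marks for the next collection cycle.
    live = [o for o in heap if o.marked]
    for o in live:
        o.marked = False
    return live

a, b, c = Obj("a"), Obj("b"), Obj("c")
a.refs.append(b)            # a -> b; c has no incoming references
heap = [a, b, c]
mark([a])                   # a is the only root
heap = sweep(heap)
assert [o.name for o in heap] == ["a", "b"]   # c was reclaimed
```

Unlike reference counting, tracing handles cycles: two objects referencing each other are still reclaimed if neither is reachable from a root.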
Object Query Processing – Important Issues Object query processor architectures • Open OODB project: separation between the user query language parsing structures and the operator graph on which the optimizer operates. • EPOQ project: an approach to query optimization extensibility where the search space is divided into regions. • TIGUKAT project: uses an object approach to query processing extensibility; an extensible uniform behavioral model characterized by a purely behavioral semantics and a uniform approach to objects. Query processing issues • Search space and transformation rules. • Search algorithm. • Cost function: can be defined recursively based on the algebraic processing tree. • Parameterization. Path expression rewriting and algebraic optimization • Path indexes. Query execution • Path index algorithms: • Create an index on each class traversed. • Define indexes on objects across their type inheritance. • Access support relations: a data structure that stores selected path expressions. • Set matching algorithms. • Centralized algorithms. • Join execution algorithms.
Data Delivery Alternatives • Pull-only: transfer of data from servers to clients is initiated by a client pull. The arrival of new data items or updates to existing data items is carried out at a server without notification to clients unless clients explicitly poll the server. • Push-only: transfer of data from servers to clients is initiated by a server push in the absence of any specific request from clients. • Hybrid: combines the client-pull and server-push mechanisms. [Figure: Architecture of a data warehouse. Query/analysis, reporting, and data mining tools issue queries against a target database; an integration component loads the target database from the source databases, described by a metadata repository.]
Semi-structured Data • Free and commercial databases on product information, etc.; the interface to such sources is typically a collection of fill-out forms. • Typically modeled as a labeled graph. • Labeled graphs are self-describing and have no schema. • The Object Exchange Model (OEM) is used to represent such a labeled graph. Each object has: • A label, which is the name of the object class. • A type, which is either atomic (integer, string, etc.) or set. • A value, which is either atomic or a set of objects. • An optional object identifier. [Figure: Mediator-wrapper architecture. A web server consults a global data dictionary and accesses each data source through its own wrapper.]
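The four OEM components listed above can be sketched as a small data structure (the field names mirror the slide; the sample data is hypothetical). Set-valued objects nest other objects, which is what makes the model self-describing without a schema:

```python
# An OEM-style object: (label, type, value, optional oid), where a "set"
# object's value is a list of nested OEM objects.
from dataclasses import dataclass
from typing import Optional

@dataclass
class OemObject:
    label: str                  # name of the object class
    type: str                   # "integer", "string", ... or "set"
    value: object               # atomic value, or list of OemObject for sets
    oid: Optional[str] = None   # optional object identifier

book = OemObject(
    label="book", type="set", oid="o1",
    value=[
        OemObject(label="title", type="string",
                  value="Principles of Distributed Database Systems"),
        OemObject(label="edition", type="integer", value=2),
    ],
)

assert book.type == "set"
assert [child.label for child in book.value] == ["title", "edition"]
```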
Problems with the Pull-based Approach • Users need to know a priori where and when to look for data. • Mismatch between the asymmetric nature of some applications and the symmetric communications infrastructure in applications such as the Internet. • Types of asymmetry: • Network asymmetry: the network bandwidth from client to server differs from that from server to client. • Distributed information systems: imbalance between the number of clients and the number of servers. • Data asymmetry: the amount of data transferred from client to server differs from that from server to client. • Data volatility. Why push-based technologies? • A response to some of the problems inherent in pull-based systems. Algorithm – Push-based Approach • Order the data items from hottest to coldest. • Partition the data items into ranges of items such that the items in each range have similar application access profiles. The number of ranges is denoted num_ranges. • Choose the relative broadcast frequency of each range as an integer (rel_freqi for range i). • Divide each range into smaller units, called chunks (Cij is the j-th chunk of range i). The number of chunks into which range i is divided is num_chunksi = max_chunks / rel_freqi, where max_chunks is the least common multiple of the rel_freqi, ∀i. • Create the broadcast schedule by interleaving the chunks of each range using the following procedure: for i from 0 to max_chunks − 1 do; for j from 1 to num_ranges do; Broadcast chunk Cj,(i mod num_chunksj); end-for; end-for
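The scheduling procedure above can be sketched directly (the item names and frequencies are hypothetical). Hot ranges are split into fewer chunks, so each of their chunks reappears more often in the broadcast cycle:

```python
# Broadcast-disk style schedule: interleave chunks of each range so that
# a range with relative frequency f recurs f times per major cycle.
from math import lcm

def broadcast_schedule(ranges, rel_freq):
    """ranges[i]: list of items in range i (hottest first);
    rel_freq[i]: its integer relative broadcast frequency."""
    max_chunks = lcm(*rel_freq)                    # LCM of all frequencies
    num_chunks = [max_chunks // f for f in rel_freq]
    # Split each range's items into num_chunks[i] chunks.
    chunks = []
    for items, n in zip(ranges, num_chunks):
        size = max(1, -(-len(items) // n))         # ceiling division
        chunks.append([items[k * size:(k + 1) * size] for k in range(n)])
    schedule = []
    for i in range(max_chunks):                    # outer loop over minor cycles
        for j in range(len(ranges)):               # one chunk from each range
            schedule.extend(chunks[j][i % num_chunks[j]])
    return schedule

# Hot range broadcast twice as often as the cold range.
sched = broadcast_schedule([["h1"], ["c1", "c2"]], rel_freq=[2, 1])
assert sched == ["h1", "c1", "h1", "c2"]   # h1 airs twice per cycle
```

Note that "h1" occupies half the cycle while "c1" and "c2" share the other half, matching the chosen 2:1 frequency ratio.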
Differences between pull-based and push-based systems • Cache replacement policies. • Prefetching mechanism. • An idealized algorithm for page replacement is one which evicts the page with the smallest ratio between its probability of access and its frequency of broadcast. • The PIX algorithm calculates the "cost" of replacing a page and replaces the least costly one. The operation of the algorithm is as follows: • When a page Pi is brought into the cache and inserted into a chain: Pri = 0, LTi = CurrentTime. • When Pi is accessed again, it is moved to the top of its own chain and the following calculations are made: Pri = HF / (CurrentTime − LTi) + (1 − HF) × Pri, LTi = CurrentTime. • If a new page needs to be flushed out to open up space, a lix value is calculated for the pages at the bottom of each chain and the page with the lowest lix value is flushed out. The lix value is calculated as lixi = Pri / rel_freqi, where rel_freqi is the relative broadcast frequency of the range (disk) to which page Pi belongs.
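The replacement rule above can be sketched as follows (the page names and the HF constant are assumed example values): each page keeps a running probability-of-access estimate, and the eviction victim is the candidate with the lowest lix = Pr / rel_freq.

```python
# lix-style replacement: estimate access probability per page, divide by
# broadcast frequency, and evict the page with the lowest resulting value.

HF = 0.25  # assumed weighting constant for the running estimate

class Page:
    def __init__(self, name: str, rel_freq: int, now: float):
        self.name = name
        self.rel_freq = rel_freq   # broadcast frequency of the page's range
        self.pr = 0.0              # running probability-of-access estimate
        self.last_access = now     # LT_i

    def on_access(self, now: float) -> None:
        # Exponentially weighted update from the inter-access time.
        self.pr = HF / (now - self.last_access) + (1 - HF) * self.pr
        self.last_access = now

    def lix(self) -> float:
        return self.pr / self.rel_freq

def victim(candidates):
    """Pick the page to flush: lowest lix among the chain-bottom pages."""
    return min(candidates, key=Page.lix)

hot = Page("hot", rel_freq=2, now=0.0)
cold = Page("cold", rel_freq=2, now=0.0)
hot.on_access(now=1.0)             # recently accessed -> higher estimate
assert victim([hot, cold]).name == "cold"
```

Dividing by the broadcast frequency captures the key push-based insight: a frequently broadcast page is cheap to refetch from the air, so it is a better eviction candidate than an equally popular but rarely broadcast page.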