290 likes | 416 Views
Investigating Distributed Database Systems. Challenges and Technology Kishore Puppala Rao. Definitions. A database is a logically related collection of data, stored in one or many files
E N D
Investigating Distributed Database Systems Challenges and Technology Kishore Puppala Rao
Definitions • A database is a logically related collection of data, stored in one or many files • A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network
Architecture • Client/server architectures • Multiple clients, single server – this is the most common and straightforward implementation • Multiple clients, multiple servers – more flexible. DB distributed over multiple servers. Each client directs requests to a “home” server.
Architecture (cont’d) • DB is physically distributed by fragmenting and replicating data (discussed later) • Regardless of architecture, implementation details of queries, transactions and DB operations should be transparent to users.
Architecture (Peer-to-peer) • No distinction between client and server • Each site has functionality of both client and server • E.g. File-sharing apps such as BearShare, LiveWire • Sophisticated protocols needed to manage data distributed across multiple sites
Fragmentation • Partitions the data • Subdivides each relation either vertically (by project operation) or horizontally (by selection operation) • Facilitates the placement of data close to its place of use, reducing transmission costs
Replication • Refers to duplication of data for access and/or security purposes • Fragments or whole database may be replicated • Replication involves keeping physical separate copies of data at different sites
Distributed vs. Parallel • Distributed DBMS are not parallel DBMS, although distinction may be unclear • Distributed DBMS assume loose connection between processors operating independently, perhaps under different operating systems
Parallel DBMS • Multiple processors under same operating system. • Architecture: Shared-none, shared-disk, or shared memory • Shared-Nothing: Each processor has exclusive access to its main memory and disk. Each processing element (PE) is a local site.
Parallel DBMS (cont’d) • Shared-memory: Each PE has access to any memory module or disk through some fast connection (e.g. LAN or cross-bar switch) • Shared-disk: Each PE has exclusive access to its own memory, but shared access to any disk via a fast connection. PE accesses DB pages on shared disk and copy to local cache
Transparency • Distributed (and Parallel) DBMS must provide same functionality and consistency of centralized DBMS. • Transparency implies presenting a consistent view that shields the user from implementation details such as fragmentation, replication, and distribution. • Introduces major challenges
Challenges • Query processing and optimization • Concurrency control • Reliability protocols • Replication protocols
Query Processing and Optimization • Techniques needed to address difficulties arising from data distribution and fragmentation. Localization techniques employed. • Algebraic queries on global relations are transformed to operate on fragments • Opportunities for parallel processing are identified (fragments are stored at different sites), unnecessary work is eliminated (not all fragments may be involved in the query)
Query optimization • Determining the execution sites for distributed operations • Identifying the best distributed algorithm for distributed operations • Changing the order of operations in a query
Concurrency Control • Challenge in synchronizing user transactions is to extend serializability and concurrency to the distributed execution environment • Serializability: The ability to perform a set of operations in parallel with the same effect as if they were performed in a certain sequence, requires: • (a) execution of the set of transactions at each site must be serializable • (b) the serialization orders of these transactions at all these sites must be identical
Concurrency (cont’d) • If locking-based algorithms used, lock management may be centralized or distributed • Deadlocks must be avoided • Deadlock detection and management in a distributed database can be difficult
Reliability protocols • Several types of failures: System, media, transaction, communication • May be difficult to differentiate type of failure • Distributed reliability protocols enforce transaction atomicity (commit all or commit nothing)
Reliability (cont’d) • E.g. of Atomic commitment protocol: Two-phase commit • All sites involved in the execution of a distributed transaction must agree to commit the transaction before it is made permanent.
Replication protocols • Each logical data item has a number of physical instances • Challenge is to maintain (or approximate) consistency among physical copies as user updates logical data • Example criterion: One-copy equivalence – All physical copies of logical data should be equivalent after being updated by a transaction • Read-One/Write All (ROWA) protocol – enforces one-copy equivalence. Disadvantage: failure of one site may block entire transaction
Replication (cont’d) • Alternative algorithms relax ROWA by mapping each write to a subset of the physical copies • Quorum-based voting: Copies are assigned votes; read and (especially) write operations have to collect votes and reach a quorum to commit data. (see class notes)
Research and Trends • Workflow models (advanced transaction models) • Network scaling problems • Multi-database systems and interoperability • Distributed object management
Trends (cont’d) • Primitive objects are not simple-structured data. Can consist of programs, voice, images, etc. • Distributed DBMS must handle increasingly larger data objects. E.g. 1MB storage needed for 1 digital X-Ray image (1024x1024) @ 8 bits/pixel • Most commercial DBMS (e.g. MS SQL Server 2000, Oracle 8i) provide some sort of distribution • Emergence of broadband networks eliminates the network as a bottleneck
Trends (cont’d) • Mobile computing is escalating in interest and prevalence • Mobile stations may download data as needed • Alternatively, more powerful mobile stations may store native data for sharing with others • Mobility raises issues of address migration, maintenance of directories, and determining the location of stations • Object-oriented DBMS e.g. CORBA (platform independent), COM/OLE (MS-specific)
CORBA • Common Object Request Broker Architecture • Facilitates the maintenance and DB access of data from a number of autonomous and heterogeneous sources (e.g. file systems, spreadsheets) via a multidatabase approach • Provides a generic platform for distributed computing
CORBA (cont’d) • In multidatabase systems, the main problem is the heterogeneity extant at four levels: platform, communication, database system, and semantic. • CORBA facilitates implementation transparency by providing client access via interfaces defined in a special Interface Definition Language (IDL), independent of the databases actual software and hardware environment. • Provides location transparency, allowing clients to access DB objects independent of location and communication protocols
CORBA (cont’d) • Provides a common interface to mask heterogeneity among native database system implementations based on different data models (e.g. flat-file, relational, spreadsheet) and query languages • Common interface overcomes semantic conflicts such as schema and data conflicts
References • M.T. Ozsu and P. Valduriez, "Distributed and Parallel Database Systems – Technology and Current State-of-the-Art", ACM Computing Surveys, 28(1): 125 - 128, March 1996. • A. Dogac, C. Dengi and M.T. Ozsu, "Distributed Object Computing Platforms", Communications of ACM, 41(9): 95-103, September 1998. • J. N. Gray, “Notes on Data Base Operating Systems.” Operating Systems: An Advanced Course. R. Bayer, R.M. Graham (eds.) New York: Springer-Verlag, 1979, pp. 393-481.
References (cont’d) • M.T. Ozsu, "The Push/Pull Effect - Can Distributed Database Technology Meet The Challenges of New Applications?", Database Programming & Design, April 1997.