1 / 23

Chapter 25: Distributed Databases

Chapter 25: Distributed Databases. Definitions. Distributed Database – a collection of of multiple logically interrelated databases distributed over a computer network

tlassiter
Download Presentation

Chapter 25: Distributed Databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Chapter 25: Distributed Databases

  2. Definitions • Distributed Database– a collection of of multiple logically interrelated databases distributed over a computer network • Distributed Database Management System – A software system that manages a distributed database while making he distribution transparent to the user.

  3. Motivations for Distributed DBs • No centralized point of failure. • Local Autonomy. • There’s a whole lot of data out there to store. • Replication of Data for Disaster Recovery and High Availability (think RAID on a network) • High-throughput query processing (either inter-query or intra-query parallelism), dynamic load-balancing, • Poor people can’t afford supercomputers.

  4. Drawbacks of DDBs: • Security • Increased complexity of Database Design • Increased complexity of Software • Data integrity and resolution of concurrent operations. • Cost (But if you’re big enough to need one, you probably can afford one?)

  5. Transparency: Transparency of Data: • Location Transparency – A command works the same no matter where in the system it is issued • Naming Transparency – We can refer to data by the same name, from anywhere in the system, with no further specification. • Replication Transparency – Hides multiple copies of data from user • Fragmentation Transparency – Hide the fact that data is fragmented (ie, different sections of correlated data may be in different locations)

  6. Two Fundamental Patterns for Fragmenting Data • Horizontal – Store Whole Tuples on Different machines. • Nice because we can use standard relational algebra statements to define a restriction on a relation that creates these: • s”newyork” (City) s “chicago” (City) (Do we need to know all possible values for City in order to fully specify a fragmentation.)

  7. Vertical – Store Different Fields of the same tuples on Different machines. Use Projection Op to declare these: P (Acct #, Branch, Client Name Account) P (Acct #, Balance Account) (Notice this requires redundant storage of at least one primary key per tuple)

  8. Redundant / Non-Redundant Allocations: • Full Replication (Completely Redundant) • Good read time, good recoverability • Requires more coordination for multiple writers on same data, hogs disk space • No Replication (Non-Redundant) • Easier to coordinate multiple writers, multiple readers. But no backup in case of disaster. • Partial Replication • Trade-off between the above two options.

  9. Global Directory • Global Centralized (Why have a DDBMS at all if you’re going to do this?) • Dispersed or no Global Directory • Completely Replicated • Local-Master Directory • Each node has its own catalog of data • Each node has a directory to all of its data that is replicated elsewhere.

  10. Each database in a distributed database is distinct from all other databases in the system and has its own global database name.

  11. Name Resolution in Oracle8 • Every data object in every schema in every database has a unique identifying name: • SELECT * FROMscott.emp@sales.us.americas.acme_auto.com; • A remote query is a query that selects information from one or more remote tables, all of which reside at the same remote node. For example: • SELECT * FROM scott.dept@sales.us.americas.acme_auto.com;

  12. Remote and Distributed SQL Statements in Oracle8 • A remote update is an update that modifies data in one or more tables, all of which are located at the same remote node. For example: • UPDATE scott.dept@sales.us.americas.acme_auto.com SET loc = 'NEW YORK' WHERE deptno = 10; • A distributed query retrieves information from two or more nodes. For example: • SELECT ename, dname FROM scott.emp, scott.dept@sales.us.americas.acme_auto.com d WHERE e.deptno = d.deptno;

  13. A distributed update modifies data on two or more nodes. A distributed update is possible using a PL/SQL subprogram unit, such as a procedure or trigger, that includes two or more remote updates that access data on different nodes. For example: BEGIN UPDATE scott.dept@sales.us.americas.acme_auto.com SET loc = 'NEW YORK' WHERE deptno = 10; UPDATE scott.emp SET deptno = 11 WHERE deptno = 10; END;

  14. 2-Phase Commit Process • Easy to trigger with the COMMIT directive. • The Recoverer (RECO) background process on each server involved in the transaction coordinates to resolve any in-doubt transactions. • All RECOs either commit or roll-back the change in a consistent manner.

  15. Chapter 28: Data Warehousing and OLAP

  16. Data Warehousing “a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of managements decisions” • Decision Support Systems or Executive Information Systems • Online Analytical Processing (OLAP) analysis of complex data from a data warehouse

  17. Data Warehouses • Optimized for providing general information about large data sets instead of explicit information about individual data records • Multidimensional Matrices called Data Cubes (or hypercubes) • Efficient storage, data marts, distributed DW, federate DW

  18. Some steps in Data Acquisition • Data is extracted (from multiple heterogeneous sources) • Data must be formatted • Data must be cleaned (the most involved step) Data can be backflushed to its source after cleaning. • Data must be converted from its source (relational, OO, hierarchical) to the DW’s multidimensional scheme. • The data must actually be loaded.

  19. Basic Operations • Pivot (rotate) • Roll-Up (grouping) • Drill-Down (subdivision) • Slice and dice: Perform projection operations on dimensions. • Sort (data, by some criteria) • Select Data (by value or range)

  20. Chunk-Offset Compression Only stores the addresses and data for valid cells in each chunk in a (offset, cellValue) format Heum-Geun Kang and Chin-Wan Chung, Exploiting versions for on-line data warehouse maintenance in MOLAP servers, VLDB, 2002.

  21. Bitmap Indexing

  22. Multidimensional Schema • Components: • Dimension Tables – tuples of attributes of the dimension • Fact Table – Holds tuples that correspond to recorded facts. • Patterns: • Star Schema – A single table for each dimension. • Snowflake Schema – obtained by normalizing a star schema, creating a new hierarchy of multiple dimensional tables • Fact Constellation – A set of fact tables that share some dimension tables

  23. Data Warehousing vs. Materialized Views • DWs exist as persistent storage instead of being materialized on-demand. • DWs are multidimensional, not relational. Views of a relational database are relational. • DWs can be indexed to optimize performance. Views are dependant on the structure of the underlying database. • DWs contain compositions of data collected from multiple datasources. Views are derived from a single database.

More Related