Modern Databases • Willem Visser • RW334
The Web is Changing the Game • Databases used to be the domain of corporations with a limited amount of data and a limited number of users • Very valuable information, but not a lot of it • Important users, but not many of them • In the modern web-driven world • Enormous amounts of data are being generated • Millions of users are interested in that data
What is Wrong here? How to make the DB scale?
Distributed Database • A single logical database spread physically across computers in multiple locations that are connected by a data communications link
Major Objectives • Location Transparency • User does not have to know the location of the data • Data requests automatically forwarded to appropriate sites • Local Autonomy • Local site can operate with its database when network connections fail • Each site controls its own data, security, logging, recovery
Distributed Databases Advantages • Increased reliability/availability • Local control over data • Modular growth • Lower communication costs • Faster response for certain queries
Distributed Database Disadvantages • Software cost and complexity • Processing overhead • Data integrity exposure • Slower response for certain queries
Options for Distributing a Database • Data replication • Copies of data distributed to different sites • Horizontal partitioning/Sharding • Different rows of a table distributed to different sites • Vertical partitioning • Different columns of a table distributed to different sites • Combinations of the above
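The three options above can be sketched on a toy in-memory table. This is only an illustration: the rows, the column names, and the choice of `id % num_sites` as the placement rule are assumptions, not something the slides prescribe.

```python
# Toy table: each row is a dict. All names and values are illustrative.
rows = [
    {"id": 1, "name": "Ann", "city": "Cape Town", "balance": 120},
    {"id": 2, "name": "Ben", "city": "Durban", "balance": 75},
    {"id": 3, "name": "Cat", "city": "Cape Town", "balance": 310},
    {"id": 4, "name": "Dan", "city": "Pretoria", "balance": 42},
]

def replicate(table, num_sites):
    """Replication: every site holds a full copy of the table."""
    return [[dict(row) for row in table] for _ in range(num_sites)]

def horizontal_partition(table, num_sites):
    """Sharding: whole rows go to a site chosen from the key (assumed rule)."""
    sites = [[] for _ in range(num_sites)]
    for row in table:
        sites[row["id"] % num_sites].append(row)
    return sites

def vertical_partition(table, column_groups, key="id"):
    """Each site stores some columns plus the key, so rows can be rejoined."""
    return [
        [{c: row[c] for c in [key, *cols]} for row in table]
        for cols in column_groups
    ]

copies = replicate(rows, 3)
shards = horizontal_partition(rows, 2)
fragments = vertical_partition(rows, [["name", "city"], ["balance"]])
```

Note that every vertical fragment repeats the key column; without it the fragments could not be joined back into the original rows.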
Data Replication • Advantages: • Reliability • Fast response • May avoid complicated distributed transaction integrity routines (if replicated data is refreshed at scheduled intervals) • Decouples nodes (transactions proceed even if some nodes are down) • Reduced network traffic at prime time (if updates can be delayed)
Data Replication (cont.) • Disadvantages: • Additional requirements for storage space • Additional time for update operations • Complexity and cost of updating • Integrity exposure of getting incorrect data if replicated data is not updated simultaneously • Therefore, better when used for non-volatile (read-only) data
Factors in Choice of Distributed Strategy • Funding, autonomy, security • Site data referencing patterns • Growth and expansion needs • Technological capabilities • Costs of managing complex technologies • Need for reliable service
Distributed DBMS • Distributed database requires distributed DBMS • Functions of a distributed DBMS: • Locate data with a distributed data dictionary • Determine location from which to retrieve data and process query components • DBMS translation between nodes with different local DBMSs (using middleware) • Data management functions: security, concurrency, deadlock control, query optimization, failure recovery • Data consistency (via multiphase commit protocols) • Global primary key control • Scalability • Data and stored procedure replication • Allowing for different DBMSs and application code at different nodes
Distributed DBMS Transparency Objectives • Location Transparency • User/application does not need to know where data resides • Replication Transparency • User/application does not need to know about duplication • Failure Transparency • Either all or none of the actions of a transaction are committed • Each site has a transaction manager • Logs transactions and before and after images • Concurrency control scheme to ensure data integrity • Requires special commit protocol
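The "all or none" requirement is usually met with a two-phase commit. The slides only say "special commit protocol", so treating it as two-phase commit is an assumption, and the `Site` class below is a hypothetical stand-in for a per-site transaction manager. A minimal sketch:

```python
class Site:
    """A participant with a trivial local transaction manager (hypothetical)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a site that must vote "no"
        self.data = {}
        self.pending = None
        self.log = []

    def prepare(self, updates):
        # Phase 1: log the before and after images, then vote.
        self.log.append(("prepare", dict(self.data), dict(updates)))
        if self.can_commit:
            self.pending = dict(updates)
        return self.can_commit

    def commit(self):
        # Phase 2 (success): make the pending updates visible.
        self.data.update(self.pending)
        self.pending = None
        self.log.append(("commit",))

    def abort(self):
        # Phase 2 (failure): discard the pending updates.
        self.pending = None
        self.log.append(("abort",))

def two_phase_commit(sites, updates):
    """Commit at every site only if every site votes yes."""
    votes = [site.prepare(updates) for site in sites]
    if all(votes):
        for site in sites:
            site.commit()
        return True
    for site in sites:
        site.abort()
    return False

a, b, c = Site("A"), Site("B"), Site("C", can_commit=False)
first = two_phase_commit([a, b], {"x": 1})    # both vote yes
second = two_phase_commit([a, c], {"x": 2})   # C votes no, so A must not change
```

A real protocol also has to handle coordinator crashes and timeouts, which this sketch leaves out.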
Query Optimization • In a query involving a multi-site join and, possibly, a distributed database with replicated files, the distributed DBMS must decide where to access the data and how to proceed with the join. Three-step process: • Query decomposition: rewritten and simplified • Data localization: query fragmented so that fragments reference data at only one site • Global optimization: • Order in which to execute query fragments • Data movement between sites • Where parts of the query will be executed • Semi-join operation: only the joining attribute of the query is sent from one site to the other, rather than all selected attributes
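The semi-join idea can be sketched with two in-memory "sites". The tables, the column names, and the function name `semi_join` are all made up for illustration; the point is only which data crosses the network at each step.

```python
# Site 1 holds customers, site 2 holds orders (illustrative data).
customers_site1 = [
    {"cust_id": 1, "name": "Ann"},
    {"cust_id": 2, "name": "Ben"},
    {"cust_id": 3, "name": "Cat"},
]
orders_site2 = [
    {"order_id": 10, "cust_id": 1, "total": 50},
    {"order_id": 11, "cust_id": 3, "total": 20},
    {"order_id": 12, "cust_id": 9, "total": 5},
    {"order_id": 13, "cust_id": 1, "total": 7},
]

def semi_join(local_rows, remote_rows, attr):
    # Step 1: ship only the distinct join-attribute values to the remote site.
    shipped_keys = {row[attr] for row in local_rows}
    # Step 2 (at the remote site): keep only the rows that can join.
    reduced = [row for row in remote_rows if row[attr] in shipped_keys]
    # Step 3: ship the reduced rows back and finish the join locally.
    joined = [
        {**left, **right}
        for left in local_rows
        for right in reduced
        if left[attr] == right[attr]
    ]
    return shipped_keys, reduced, joined

shipped, reduced, joined = semi_join(customers_site1, orders_site2, "cust_id")
```

Here three key values travel to site 2 and three order rows travel back; order 12 never leaves site 2 because no customer at site 1 matches it.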
Brewer’s CAP Theorem • Eric Brewer, keynote at the ACM Symposium on Principles of Distributed Computing, 2000 • You cannot have all three of: • Consistency • Availability • Partition Tolerance • The system must keep functioning under anything short of complete network failure • Theorem proven in 2002 by Gilbert and Lynch • See http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
Dealing with CAP? • Drop Partition Tolerance • Don’t partition, but then you have serious scalability issues, and scalability is probably why you wanted to partition in the first place • Drop Availability • Wait for all the partitions to sync before allowing any usage • This is as bad for scalability as having no partitioning • Drop Consistency • Eventual Consistency seems to work in most cases • If you have to drop one, this is the preferred option • Flies against most DB principles
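A rough sketch of what "eventual consistency" can look like: replicas accept writes independently and reconcile later. The slides do not name a reconciliation rule, so the last-write-wins timestamps and the `Replica` and `sync` names below are assumptions chosen for brevity.

```python
class Replica:
    """One copy of the data; keeps (timestamp, value) per key."""

    def __init__(self):
        self.store = {}

    def write(self, key, value, timestamp):
        current = self.store.get(key)
        # Last-write-wins (assumed rule): keep the entry with the newer timestamp.
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]

def sync(a, b):
    """Exchange entries in both directions; the newer timestamp wins."""
    for key, (ts, value) in list(a.store.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.store.items()):
        a.write(key, value, ts)

r1, r2 = Replica(), Replica()
r1.write("status", "open", timestamp=1)
r2.write("status", "closed", timestamp=2)  # accepted while the replicas are apart
stale = r1.read("status")                  # r1 still answers with the old value
sync(r1, r2)                               # after syncing, both replicas agree
```

Between the second write and the sync, a reader of `r1` sees stale data. That window is the consistency being given up in exchange for availability.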
Relational DB? • Seems like we are assuming the DB must still be relational • Web also forces a new concept • Not all data look the same anymore! • Email messages, Images, News documents, Facebook updates, Tweets,… • Relations are too rigid • Semi-structured data
The Information-Integration Problem • Related data exists in many places and could, in principle, work together. • But different databases differ in: • Model (relational, object-oriented?). • Schema (normalized/ not normalized?). • Terminology: are consultants employees? Retirees? Subcontractors? • Conventions (meters versus feet?). • How do we model information residing in heterogeneous sources (if we cannot combine it all in a single new database)?
Example • Suppose we are integrating information about bars in some town. • Every bar has a database. • One may use a relational DBMS; another keeps the menu in an MS-Word document. • One stores the phones of distributors, another does not. • One distinguishes ales from other beers, another doesn’t. • One counts beer inventory by bottles, another by cases.
Semi-structured Data • Purpose: represent data from independent sources more flexibly than either relational or object-oriented models. • Think of objects, but with the type of each object its own business, not that of its “class.” • Labels to indicate meaning of substructures. • Data is self-describing: structural information is part of the data.
Graphs of Semistructured Data • Nodes = objects. • Labels on arcs (attributes, relationships). • Atomic values at leaf nodes (nodes with no arcs out). • Flexibility: no restriction on: • Labels out of a node. • Number of successors with a given label.
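Example: Data Graph • [Figure: data graph with a root node and arcs labeled beer, beer, and bar; other arc labels are manf, name, prize, year, award, servedAt, and addr; leaf values are A.B., Bud, M’lob, 1995, Gold, Joe’s, and Maple] • Callouts in the figure: the beer object for Bud; the bar object for Joe’s Bar; notice a new kind of data.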
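One way to hold such a graph in code is as lists of (label, child) arcs with atomic values at the leaves. The sketch below borrows the labels and values of the beer example, but which arc hangs off which node is an assumption, since the figure itself is not reproduced here. The helper names `follow` and `follow_path` are hypothetical.

```python
# A node is either an atomic value (a leaf) or a list of (label, child) arcs.
# A list of pairs, unlike a dict, lets the same label appear any number of times.
joes = [("name", "Joe's"), ("addr", "Maple")]
bud = [("name", "Bud"), ("manf", "A.B."), ("servedAt", joes)]
mlob = [
    ("name", "M'lob"),
    ("manf", "A.B."),
    ("prize", [("year", 1995), ("award", "Gold")]),
]
root = [("beer", bud), ("beer", mlob), ("bar", joes)]

def follow(node, label):
    """Return every successor reached from node by an arc with this label."""
    if not isinstance(node, list):
        return []  # leaves have no outgoing arcs
    return [child for arc, child in node if arc == label]

def follow_path(node, labels):
    """Follow a sequence of labels from node, collecting all nodes reached."""
    nodes = [node]
    for label in labels:
        nodes = [child for n in nodes for child in follow(n, label)]
    return nodes

beer_names = follow_path(root, ["beer", "name"])
prize_years = follow_path(root, ["beer", "prize", "year"])
served_at = follow_path(root, ["beer", "servedAt", "name"])
```

Only one of the two beers has a prize arc and only one has servedAt, yet both are reached through the same "beer" label. No schema had to be declared first.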
XML • XML = Extensible Markup Language. • While HTML uses tags for formatting (e.g., “italic”), XML uses tags for semantics (e.g., “this is an address”). • Key idea: create tag sets for a domain (e.g., bars), and translate all data into properly tagged XML documents. • Well-formed XML: XML that is syntactically correct • Tags and their nesting are otherwise totally arbitrary. • Valid XML: XML that conforms to a DTD (Document Type Definition) • The DTD imposes some structure on the tags, but is much more flexible than a relational database schema. • DTD and XML Schema • Meta-data for XML • Describe what are valid XML structures
XML and Semi-structured Data • Well-Formed XML with nested tags is exactly the same idea as trees of semi-structured data. • XML also enables non-tree structures (with references to IDs of nodes), as does the semi-structured data model.
Example: Well-Formed XML
<?xml version = "1.0" standalone = "yes" ?>
<BARS>
  <BAR><NAME>Joe’s Bar</NAME>
    <BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER>
  </BAR>
  <BAR> …
</BARS>
• Callouts in the figure: root tag (BARS); tags surrounding a BAR element; a NAME subelement; a BEER subelement.
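Example • The <BARS> XML document as a tree: • [Figure: root BARS with three BAR children; the first BAR has a NAME child (Joe’s Bar) and two BEER children, one with NAME Bud and PRICE 2.50, the other with NAME Miller and PRICE 3.00; the other BAR subtrees are elided]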
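The same tree can be recovered by parsing the document with Python's standard `xml.etree.ElementTree` module. This sketch uses only the one BAR that the slide spells out, since the second BAR is elided in the source. The `outline` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0" standalone="yes"?>
<BARS>
  <BAR><NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME><PRICE>3.00</PRICE></BEER>
  </BAR>
</BARS>"""

root = ET.fromstring(doc)

def outline(elem, depth=0):
    """Return the element tree as indented lines, one line per element."""
    text = (elem.text or "").strip()
    lines = ["  " * depth + elem.tag + (f": {text}" if text else "")]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each nested tag becomes a child node and each piece of text becomes a leaf value, which matches the tree view of semi-structured data.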
DTD Elements • The description of an element consists of its name (tag), and a parenthesized description of any nested tags. • Includes order of subtags and their multiplicity. • Leaves (text elements) have #PCDATA (Parsed Character DATA ) in place of nested tags.
Example: DTD
<!DOCTYPE BARS [
  <!ELEMENT BARS (BAR*)>
  <!ELEMENT BAR (NAME, BEER+)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT BEER (NAME, PRICE)>
  <!ELEMENT PRICE (#PCDATA)>
]>
• A BARS object has zero or more BARs nested within. • A BAR has one NAME and one or more BEER subobjects. • A BEER has a NAME and a PRICE. • NAME and PRICE are text.
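Each element declaration in the DTD is in effect a regular expression over the sequence of child tags. The sketch below checks a parsed document against the DTD above by hand. It is not a real DTD validator: the content models are transcribed manually into Python regular expressions, and `CONTENT_MODELS` and `conforms` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the DTD, rewritten by hand as regular expressions over
# the space-terminated sequence of child tags. None stands for #PCDATA.
CONTENT_MODELS = {
    "BARS": r"(BAR )*",
    "BAR": r"NAME (BEER )+",
    "BEER": r"NAME PRICE ",
    "NAME": None,
    "PRICE": None,
}

def conforms(elem):
    """Check elem and all its descendants against CONTENT_MODELS."""
    if elem.tag not in CONTENT_MODELS:
        return False
    model = CONTENT_MODELS[elem.tag]
    if model is None:
        return len(elem) == 0  # text only, no nested elements
    child_tags = "".join(child.tag + " " for child in elem)
    if re.fullmatch(model, child_tags) is None:
        return False
    return all(conforms(child) for child in elem)

good = ET.fromstring(
    "<BARS><BAR><NAME>Joe's Bar</NAME>"
    "<BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER></BAR></BARS>"
)
no_beer = ET.fromstring("<BARS><BAR><NAME>Joe's Bar</NAME></BAR></BARS>")
wrong_order = ET.fromstring(
    "<BARS><BAR><NAME>Joe's Bar</NAME>"
    "<BEER><PRICE>2.50</PRICE><NAME>Bud</NAME></BEER></BAR></BARS>"
)
```

`no_beer` and `wrong_order` are both well-formed XML, but neither is valid against this DTD. That is the difference between the two notions.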
Querying XML • Why query XML documents? • Special XML databases • Major DBMSs “speak” XML • Does the world need a new query language? • Most of the world's business data is stored in relational databases • The relational language SQL is mature and well established • Can SQL be adapted to query XML data? • Leverage existing software • Leverage existing user skills
XML vs Relational Data • Relational data is "flat": rows and columns • XML data is nested, and its depth may be irregular and unpredictable • Relations can represent hierarchic data by foreign keys or by structured datatypes • In XML it is natural to search for objects at unknown levels of the hierarchy: • "Find all the red things"
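A small sketch of searching at unknown depth, using the limited XPath support in Python's `xml.etree.ElementTree`. The document and the `color` attribute are invented for illustration; the slides do not say how "red" is recorded.

```python
import xml.etree.ElementTree as ET

# Red things sit at three different depths in this made-up document.
doc = ET.fromstring("""
<things>
  <flag color="red"/>
  <fruit><cherry color="red"/><lime color="green"/></fruit>
  <street><corner><stopsign color="red"/></corner></street>
</things>""")

# ".//*" reaches every descendant element, whatever its depth or tag.
red_things = [elem.tag for elem in doc.findall(".//*[@color='red']")]
```

The query never mentions `fruit`, `street`, or `corner`. A relational query would normally need to know which tables and columns to look in.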
XML vs Relational Data (cont.) • Relational data is uniform and repetitive • All bank accounts are similar in structure • Metadata can be factored out to a system catalog • XML data is highly variable • Every web page is different • Each XML object needs to be self-describing • Metadata is distributed throughout the document • Queries may access metadata as well as data: "Find elements whose name is the same as their content": • //*[name(.) = string(.)]
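The XPath expression above mixes metadata (the element name) with data (the element's text). The XPath subset in Python's standard `xml.etree.ElementTree` has no `name()` or `string()` functions, so this sketch expresses the same test in plain Python over a made-up document.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<menu><beer>beer</beer><wine>merlot</wine>"
    "<snack><nuts>nuts</nuts></snack></menu>"
)

def name_equals_content(root):
    """Elements whose tag equals the concatenation of all their descendant text."""
    # elem.tag plays the role of name(.); joining itertext() plays string(.).
    return [elem for elem in root.iter() if elem.tag == "".join(elem.itertext())]

matches = [elem.tag for elem in name_equals_content(doc)]
```

`snack` does not match even though it contains a matching element, because its string value is "nuts", not "snack".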
XML vs Relational Data (cont.) • Relational queries return uniform sets of rows • The results of an XML query may have mixed types and complex structures • "Red things": a flag, a cherry, a stop sign, ... • Elements can be mixed with atomic values • XML queries need to be able to perform structural transformations • Example: invert a hierarchy
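As an illustration of inverting a hierarchy, the sketch below turns "bars containing beers" into "beers listing the bars that serve them". The second bar, Sue's Bar, and the output tag names are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

bars = ET.fromstring(
    "<BARS>"
    "<BAR><NAME>Joe's Bar</NAME>"
    "<BEER><NAME>Bud</NAME></BEER><BEER><NAME>Miller</NAME></BEER></BAR>"
    "<BAR><NAME>Sue's Bar</NAME><BEER><NAME>Bud</NAME></BEER></BAR>"
    "</BARS>"
)

def invert(bars_root):
    """Build a BEERS tree in which each BEER lists the BARs that serve it."""
    beers_root = ET.Element("BEERS")
    by_name = {}  # beer name -> its BEER element in the new tree
    for bar in bars_root.findall("BAR"):
        bar_name = bar.findtext("NAME")
        for beer in bar.findall("BEER"):
            beer_name = beer.findtext("NAME")
            if beer_name not in by_name:
                node = ET.SubElement(beers_root, "BEER")
                ET.SubElement(node, "NAME").text = beer_name
                by_name[beer_name] = node
            ET.SubElement(by_name[beer_name], "BAR").text = bar_name
    return beers_root

beers = invert(bars)
print(ET.tostring(beers, encoding="unicode"))
```

The result is a new tree with a different shape, not a set of rows.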
XML vs Relational Data (cont.) • The rows of a relation are unordered • Any desired output ordering must be derived from values • The elements in an XML document are ordered • Implications for query: • Preserve input order in query results • Specify an output ordering at multiple levels • "Find the fifth step" • "Find all the tools used before the hammer"
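Both order-based queries can be sketched against a made-up procedure document. The `step` elements and the `tool` attribute are assumptions for the example; position predicates such as `step[5]` are part of the XPath subset that `xml.etree.ElementTree` supports.

```python
import xml.etree.ElementTree as ET

procedure = ET.fromstring(
    "<procedure>"
    "<step tool='saw'/><step tool='drill'/><step tool='hammer'/>"
    "<step tool='sander'/><step tool='brush'/>"
    "</procedure>"
)

# "Find the fifth step": positions are 1-based and follow document order.
fifth = procedure.find("step[5]")

def tools_before(root, target):
    """'Find all the tools used before the hammer': relies on document order."""
    tools = []
    for step in root.findall("step"):
        if step.get("tool") == target:
            break
        tools.append(step.get("tool"))
    return tools

before_hammer = tools_before(procedure, "hammer")
```

Neither query would make sense over a relation unless an explicit sequence-number column had been added.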
XML vs Relational Data (cont.) • Relational data is "dense" • Every row has a value in every column • A "null" value is needed for missing or inapplicable data • XML data can be "sparse" • Missing or inapplicable elements can be "empty" or "not there" • This gives XML a degree of freedom not present in relational databases
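The difference between "empty" and "not there" is visible when parsing. The sketch uses an invented `people` document and a hypothetical `phone_status` helper.

```python
import xml.etree.ElementTree as ET

people = ET.fromstring(
    "<people>"
    "<person><name>Ann</name><phone>555-0100</phone></person>"
    "<person><name>Ben</name><phone/></person>"
    "<person><name>Cat</name></person>"
    "</people>"
)

def phone_status(person):
    phone = person.find("phone")
    if phone is None:                  # no <phone> element at all
        return "not there"
    if not (phone.text or "").strip():  # <phone/> is present but has no text
        return "empty"
    return phone.text

statuses = {p.findtext("name"): phone_status(p) for p in people}
```

A relational row would have to collapse Ben's and Cat's cases into the same NULL unless the schema added an extra column to tell them apart.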
XPATH and XQUERY • XPATH is a language for describing paths in XML documents. • Really think of the semi-structured data graph and its paths. • Why do we need a path description language? We can’t get at the data using just Relation.Attribute expressions. • XQUERY is a full query language for XML documents with power similar to OQL (Object Query Language, the query language for object-oriented databases).
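A few path expressions over the bars document, using the XPath subset built into Python's `xml.etree.ElementTree`. This is a sketch of the path idea only: the module implements a small part of XPATH and no XQUERY at all.

```python
import xml.etree.ElementTree as ET

bars = ET.fromstring(
    "<BARS><BAR><NAME>Joe's Bar</NAME>"
    "<BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER>"
    "<BEER><NAME>Miller</NAME><PRICE>3.00</PRICE></BEER>"
    "</BAR></BARS>"
)

# A path is a sequence of tags to follow, starting from the BARS root element.
prices = [p.text for p in bars.findall("BAR/BEER/PRICE")]

# ".//" means "at any depth": every NAME, whether of a bar or of a beer.
all_names = [n.text for n in bars.findall(".//NAME")]

# A predicate in brackets filters the nodes reached at that step of the path.
bud_price = bars.findtext("BAR/BEER[NAME='Bud']/PRICE")
```

The second query has no natural Relation.Attribute equivalent, because NAME occurs at two different levels of the document.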