590 likes | 704 Views
Chapter 10: XML. The world of XML. Context. The dawn of database technology 70s A DBMS is a flexible store-recall system for digital information It provides permanent memory for structured information. Context.
E N D
Chapter 10: XML The world of XML
Context • The dawn of database technology 70s • A DBMS is a flexible store-recall system for digital information • It provides permanent memory for structured information
Context • Database Managements technology for administrative settings ‘completed’ in the early 80s • Search for demanding application areas that could benefit from a database approach • A sound datamodel to structure the information and maintain integrity rules • A high level programming language model to manipulate the data • Separation of concerns between modelling and manipulation, and physical storage and order of execution thanks to query optimizer technology
Context • Demanding areas of research in DBMS core technology: • Office Information systems, e.g. document modelling and workflow • CAD/CAM, e.g. how to manage the design of an airplane or nucleur power plant • GIS, e.g. managing remote sensing information • WWW, e.g. how to integrate heterogenous sources • Agent-based systems, e.g. reactive systems • Multimedia, e.g. video storage/retrieval • Datamining, e.g. discovery of client profiles • Sensor networks, e.g. small footprint and energy-wise computing
Context • Demanding areas of research in DBMS core technology: • Office Information systems, Extensible DBMS, blobs • CAD/CAM, Object-oriented DBMS, geometry • GIS, GIS DBMS, geometry and images • Agent-based systems, Active DBMS, triggers • Multimedia, MM DBMS, feature analysis • Datamining, Datawarehouse systems, cube, association rules • Sensor networks, P2P databases, ad-hoc networking
Context • Application interaction with DBMS • Proprietary application programming interface, shielding the hardware distinctions • Use readable interfaces to improve monitoring and development • Example: in Monetdb the interaction is based on ascii text with the first character indicative for the message type ‘>’ prompt, await for next request ‘!’ error occurred, rest is the message ‘[‘ start of a tuple answer • Language embedding to remove the impedance mismatch, i.e. avoid cost of transforming data • Effectively failed in the OO world
Context • Learning points database perspective, • Database system should not be concerned with the user-interaction technology, ‘they should be blind and deaf’ • The strong requirements on schema, integrity rules and processing is a harness • Interaction with applications should be self-descriptive as much as possible, because, you can not a priori know a complete schema • Need for semi-structured databases
Semi-structured data • Properties of semistructured databases: • The schema is not given in advance and may be implicit in the data • The schema is relatively large and changes frequently • The schema is descriptive rather than prescriptive, integrity rules may be violated • The data is not strongly typed, the values of attributes may be of different type • Stanford Lore system is the prototypical first attempt to support semi-structured databases
Context Accidentally, in the world of digital publishing there is a need for a simple datamodel to structure information SMGL HTML XML XHTML XPATH XQUERY XSLT By the end 90s, the document world meets the database world
Introduction • XML: Extensible Markup Language • Defined by the WWW Consortium (W3C) • Originally intended as a document markup language not a database language • Documents have tags giving extra information about sections of the document • E.g. <title> XML </title> <slide> Introduction …</slide> • Derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML • Extensible, unlike HTML • Users can add new tags, and separately specify how the tag should be handled for display
XML Introduction (Cont.) • The ability to specify new tags, and to create nested tag structures made XML a great way to exchange data, not just documents. • Much of the use of XML has been in data exchange applications, not as a replacement for HTML • Tags make data (relatively) self-documenting • E.g.<bank> <account> <account-number> A-101 </account-number> <branch-name> Downtown </branch-name> <balance> 500 </balance> </account> <depositor> <account-number> A-101 </account-number> <customer-name> Johnson </customer-name> </depositor> </bank>
XML: Motivation • Data interchange is critical in today’s networked world • Examples: • Banking: funds transfer • Order processing (especially inter-company orders) • Scientific data • Chemistry: ChemML, … • Genetics: BSML (Bio-Sequence Markup Language), … • Paper flow of information between organizations is being replaced by electronic flow of information • Each application area has its own set of standards for representing information (W3C maintains ca 30 standards) • XML has become the basis for all new generation data interchange formats
XML Motivation (Cont.) • Each XML based standard defines what are valid elements, using • XML type specification languages to specify the syntax • DTD (Document Type Descriptors) • XML Schema • Plus textual descriptions of the semantics • XML allows new tags to be defined as required • However, this may be constrained by DTDs • A wide variety of tools is available for parsing, browsing and querying XML documents/data
Motivation for Nesting • Nesting of data is useful in data transfer • Example: elements representing customer-id, customer name, and address nested within an order element • Nesting is not supported, or discouraged, in relational databases • With multiple orders, customer name and address are stored redundantly • normalization replaces nested structures in each order by foreign key into table storing customer name and address information • Nesting is supported in object-relational databases and NF2 • But nesting is appropriate when transferring data • External application does not have direct access to data referenced by a foreign key
<bank-1><customer> <customer-name> Hayes </customer-name> <customer-street> Main </customer-street> <customer-city> Harrison </customer-city> <account> <account-number> A-102 </account-number> <branch-name> Perryridge </branch-name> <balance> 400 </balance> </account> <account> … </account> </customer> . . </bank-1> Example of Nested Elements
Mixture of text with sub-elements is legal in XML. Example: <account> This account is seldom used any more. <account-number> A-102</account-number> <branch-name> Perryridge</branch-name> <balance>400 </balance></account> Useful for document markup, but discouraged for data representation Structure of XML Data (Cont.)
Elements can have attributes <account acct-type = “checking” > <account-number> A-102 </account-number> <branch-name> Perryridge </branch-name> <balance> 400 </balance> </account> Attributes are specified by name=value pairs inside the starting tag of an element An element may have several attributes, but each attribute name can only occur once <account acct-type = “checking” monthly-fee=“5”> Attributes
Attributes Vs. Subelements • Distinction between subelement and attribute • In the context of documents, attributes are part of markup, while subelement contents are part of the basic document contents • In the context of data representation, the difference is unclear and may be confusing • Same information can be represented in two ways • <account account-number = “A-101”> …. </account> • <account> <account-number>A-101</account-number> … </account> • Suggestion: use attributes for identifiers of elements, and use subelements for contents
XML Document Schema • Database schemas constrain what information can be stored, and the data types of stored values • XML documents are not required to have an associated schema • However, schemas are very important for XML data exchange • Otherwise, a site cannot automatically interpret data received from another site • Two mechanisms for specifying XML schema • Document Type Definition (DTD) • Widely used • XML Schema • Newer, not yet widely used
Attribute specification : for each attribute Name Type of attribute CDATA ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs) more on this later Whether mandatory (#REQUIRED) has a default value (value), or neither (#IMPLIED) Examples <!ATTLIST account acct-type CDATA “checking”> <!ATTLIST customer customer-id ID # REQUIRED accounts IDREFS # REQUIRED > Attribute Specification in DTD
IDs and IDREFs • An element can have at most one attribute of type ID • The ID attribute value of each element in an XML document must be distinct • Thus the ID attribute value is an object identifier • An attribute of type IDREF must contain the ID value of an element in the same document • An attribute of type IDREFS contains a set of (0 or more) ID values. Each ID value must contain the ID value of an element in the same document
Limitations of DTDs • No typing of text elements and attributes • All values are strings, no integers, reals, etc. • Difficult to specify unordered sets of subelements • Order is usually irrelevant in databases • (A | B)* allows specification of an unordered set, but • Cannot ensure that each of A and B occurs only once • IDs and IDREFs are untyped • The owners attribute of an account may contain a reference to another account, which is meaningless • owners attribute should ideally be constrained to refer to customer elements
XML Schema • XML Schema is a more sophisticated schema language which addresses the drawbacks of DTDs. Supports • Typing of values • E.g. integer, string, etc • Also, constraints on min/max values • User defined types • Is itself specified in XML syntax, unlike DTDs • More standard representation, but verbose • Is integrated with namespaces • Many more features • List types, uniqueness and foreign key constraints, inheritance .. • BUT: significantly more complicated than DTDs, not yet widely used.
Storage of XML Data • XML data can be stored in • Non-relational data stores • Flat files • Natural for storing XML • But has all problems discussed in Chapter 1 (no concurrency, no recovery, …) • XML database • Database built specifically for storing XML data, supporting DOM model and declarative querying • Currently no commercial-grade scaleable system • Relational databases • Data must be translated into relational form • Advantage: mature database systems • Disadvantages: overhead of translating data and queries
Storing XML in Relational Databases • Store as string • E.g. store each top level element as a string field of a tuple in a database • Use a single relation to store all elements, or • Use a separate relation for each top-level element type • E.g. account, customer, depositor • Indexing: • Store values of subelements/attributes to be indexed, such as customer-name and account-number as extra fields of the relation, and build indices • Oracle 9 supports function indices which use the result of a function as the key value. Here, the function should return the value of the required subelement/attribute • SQL server 2005 same
Storing XML in Relational Databases • Store as string • E.g. store each top level element as a string field of a tuple in a database • Benefits: • Can store any XML data even without DTD • As long as there are many top-level elements in a document, strings are small compared to full document, allowing faster access to individual elements. • Drawback: Need to parse strings to access values inside the elements; parsing is slow.
OEM model • Semi structured and XML databases can be modelled as graph-problems • Early prototypes directly supported the graph model as the physical implementation scheme. Querying the graph model was implemented using graph traversals • XML without IDREFS can be modelled as trees
Storing XML as Relations (Cont.) • Tree representation: model XML data as tree and store using relationsnodes(id, type, label, value) child (child-id, parent-id) • Each element/attribute is given a unique identifier • Type indicates element/attribute • Label specifies the tag name of the element/name of attribute • Value is the text value of the element/attribute • The relation child notes the parent-child relationships in the tree • Can add an extra attribute to child to record ordering of children • Benefit: Can store any XML data, even without DTD • Drawbacks: • Data is broken up into too many pieces, increasing space overheads • Even simple queries require a large number of joins, which can be slow
Storing XML in Relations (Cont.) • Map to relations • If DTD of document is known, you can map data to relations • Bottom-level elements and attributes are mapped to attributes of relations • A relation is created for each element type • An id attribute to store a unique id for each element • all element attributes become relation attributes • All subelements that occur only once become attributes • For text-valued subelements, store the text as attribute value • For complex subelements, store the id of the subelement • Subelements that can occur multiple times represented in a separate table • Similar to handling of multivalued attributes when converting ER diagrams to tables • Benefits: • Efficient storage • Can translate XML queries into SQL, execute efficiently, and then translate SQL results back to XML
Alternative mappings • Mapping the structure • The Edge approach • The Attribute approach • The Universal Table approach • The Normalized Universal approach • The Dataguide approach • Mapping values • Separate value tables • Inlining • Shredding
Edge approach • Use a single Edge table to capture the graph structure Edge(source, ordinal, name, flag, target) Flag: {value, reference} Keys: {source, ordinal) Index: source, {name,target}
Attribute approach • Group all attributes with the same name into one table Aname(source,ordinal,flag, target) Key: {source,ordinal} Index:{target}
Universal approach • Use the Universal Table, all attributes are stored as columns Universal(source, ord-1,flag-1,target-1, …,ord-n,flag-n,target-n) Key: source, index: target-i
Normalized Universal • Same as Universal, but factor out the repeating values Universal(source, ord-1,flag-1,target-1, …,ord-n,flag-n,target-n) Overflow_n(source,ord, flag,target) Key: source, and {source,ord} Index: target-i
Mapping values • Separate value tables • Use V_type(vid, value) tables, eg. int(vid,val), str(vid,val),….
Mapping values • Inlining • As illustrated in previous mappings, inline the values in the structure relations
Shredding • Try to recognize repeating structures and map them to separate tables • Handle the remainder through any of the previous methods
Evaluation • Some results reported by Florescu, Kossmann using a commercial DBMS on documents of 100K objects in 1999 • Database storage overhead:
Evaluation • Some results reported by Florescu, Kossmann using a commercial DBMS on documents of 100K objects in 1999 • Bulk loading:
Evaluation • Some results reported by Florescu, Kossmann using a commercial DBMS on documents of 100K objects in 1999 • Reconstruction:
The Data Semistructured data instance = a large graph
The indexing problem • The storage problem • Store the graph in a relational DBMS • Develop a new database storage structure • The indexing problem: • Input: large, irregular data graph • Output: index structure for evaluating (regular) path expressions, e.g. bib.paper.author.firstname
XSet: a simple index for XML • Part of the Ninja project at Berkeley • Example XML data:
XSet: a simple index for XML Each node = a hashtable Each entry = list of pointers to data nodes (not shown)
XSet: Efficient query evaluation • SELECT X FROM part.name X -yes • SELECT X FROM part.supplier.name X -yes • SELECT X FROM part.*.subpart.name X -maybe • SELECT X FROM *.supplier.name X -maybe Will gain when index fits in memory
Region Algebras • structured text = text with tags (like XML) • data = sequence of characters [c1c2c3 …] • region = interval in the text • representation (x,y) = [cx,cx+1, … cy] • example: <section> … </section> • region set = a set of regions • example all <section> regions (may be nested) • region algebra = operators on region set, s1 op s2
Region algebra: some operators • s1 intersect s2 = {r | r s1, r s2} • s1 included s2 = {r | rs1, r’ s2, r r’} • s1 including s2 = {r | r s1, r’ s2, r r’} • s1 parent s2 = {r | r s1, r’ s2, r is a parent of r’} • s1 child s2 = {r | r s1, r’ s2, r is child of r’} Examples: <subpart> included <part> <part>including<subpart>
Efficient computation of Region Algebra Operators Example: s1 included s2 s1 = {(x1,x1'), (x2,x2'), …} s2 = {(y1,y1'), (y2,y2'), …} (i.e. assume each consists of disjoint regions) Algorithm: if xi < yj then i := i + 1 if xi' > yj' then j := j + 1 otherwise: print (xi,xi'), do i := i + 1 Can do in sub-linear time when one region is very small
From path expressions to region expressions part.name name child (part child root) part.supplier.name name child (supplier child (part child root)) *.supplier.name name child supplier part.*.subpart.name name child (subpart included (part child root)) Region expressions correspond to simple XPath expressions