XML: Data Driving Business? Laks V.S. Lakshmanan, IIT Bombay and Concordia University
XML: Data Model • What is an XML document? • Linearization of a tree structure • Every node of the tree can have several character strings associated with it • The information content of the document is the tree structure together with the character strings • Is XML just a syntax for data interchange and serialization?
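The "linearization of a tree" view can be made concrete with a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample document and the helper name `walk` are assumptions for illustration, not part of the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not from the slides).
doc = "<book><title>Some Title</title><author>A. Author</author></book>"

def walk(node, depth=0):
    """Yield (depth, tag, text) for every element, in document order."""
    yield depth, node.tag, (node.text or "").strip()
    for child in node:
        yield from walk(child, depth + 1)

root = ET.fromstring(doc)          # text -> tree
rows = list(walk(root))
for depth, tag, text in rows:
    print("  " * depth + tag, repr(text))

# Serializing the tree gives back the linear form.
linear = ET.tostring(root, encoding="unicode")
```

Parsing recovers the tree and its character strings; serializing the tree yields the linear text again.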
XML: Data Model • Types of nodes • Element, e.g. <p a1="A1" ... an="An">c1 ... cm</p> • Document, e.g. <!DOCTYPE name [markup declarations]> • Processing instruction, e.g. <?xml version="1.0"?> • Comment, e.g. <!--This is a comment--> • Atomic data, e.g. <Data>
What is a DTD? • A Document Type Definition (DTD) serves as a grammar • A document type definition specifies: • the elements that are permissible in a document of this type • for each element, the possible attributes, their range of values and defaults • for each element, the structure of its contents, including: • which elements can occur and in what order • whether text characters can occur
Example of a DTD
  <!DOCTYPE Bookslist [
    <!ELEMENT Bookslist (book)*>
    <!ELEMENT book (title, author*, publisher)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT author (#PCDATA)>
    <!ELEMENT publisher (#PCDATA)>
  ]>
XML and DTD • Well-formed documents • Tags should be nested properly and attributes should be unique. • Valid documents • Well-formed documents that conform to a Document Type Definition (DTD) • DTDs are used to • Constrain structure • Declare entities • Provide some default values for attributes
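The two well-formedness conditions above can be checked with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is an assumption. It only tests well-formedness: `ElementTree` does not validate against a DTD, so checking validity would need a validating parser.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as XML, False on any parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b></b></a>"))     # properly nested tags
print(is_well_formed("<a><b></a></b>"))     # improperly nested tags
print(is_well_formed('<a x="1" x="2"/>'))   # duplicate attribute
```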
DTD Limitations • Too document-oriented • Too simple and too complicated at the same time • Too limited to represent complex structures • IDREFs are not typed • No notion of inheritance or sub-typing • Too many ways to represent the same thing • Names are global, not local
DTD vs. Database Schema • Order is significant in a DTD but not in a database • A DTD does not provide for data types • A DTD cannot specify keys
XMLSchema • Why XMLSchema? • Based on XML syntax • Can be parsed and manipulated like any XML document • Supports a variety of data types • Allows extending vocabularies and inheriting from elements • Provides namespace integration • Provides logical grouping of attributes
XMLSchema: An example
  <datatype name="PriceType">
    <basetype name="decimal"/>
    <minExclusive>0.00</minExclusive>
    <scale>2</scale>
  </datatype>
  <element name="price" type="PriceType"> </element>
  <element name='Person'> ... </element>
  <element name='Employee'>
    <refines name='Person'/>
    ...
  </element>
XML Data • Superset of XMLSchema • Can express database relationships too • E.g.:
  <elementType id="booktable">
    <element id="titleID" type="#title"/>
    <element type="#author"/>
    <element type="#pages"/>
    <key id="bookkey">
      <keyPart href="#titleID"/>
    </key>
  </elementType>
Semistructured data • Data that is neither raw nor as strictly typed as in databases • Examples of semistructured data • An HTML file with one entry per restaurant that provides info on prices, addresses, styles • BibTeX files • Genome and scientific databases • Online documentation
Semistructured data: Main aspects • Structure • Irregular • Implicit • Partial • Schema • Very large • Rapidly evolving • Distinction between data and schema is blurred
Semistructured data: Data model • Object Exchange Model (OEM) • Lightweight and flexible • Data representation • As a graph with objects as vertices and labels on edges • Each object has a unique object identifier • Some objects are atomic, e.g., integer, real, … • The value of a complex object is a set of object references
Semistructured data: Query Languages • Lorel • Based on OQL • E.g. select author:X from biblio.book.author X • Computes the set of book authors • Forms a new node and connects it with edges labelled author to the nodes resulting from evaluation of the path expression
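A minimal sketch of an OEM-style graph and of how a simple path such as biblio.book.author could be evaluated over it. The class `OEMObject`, the helper `eval_path`, and the sample data are assumptions made for illustration. Labeled edges are kept in a list here for simplicity, and the sketch does not reproduce Lorel's own evaluator.

```python
from itertools import count

_next_oid = count(1)

class OEMObject:
    """An object with a unique oid, an optional atomic value, and labeled edges."""
    def __init__(self, value=None):
        self.oid = f"&o{next(_next_oid)}"
        self.value = value   # set for atomic objects only
        self.edges = []      # (label, OEMObject) pairs for complex objects

    def add(self, label, child):
        self.edges.append((label, child))
        return child

    def follow(self, label):
        return [child for lab, child in self.edges if lab == label]

# Hypothetical sample data.
root = OEMObject()
biblio = root.add("biblio", OEMObject())
book = biblio.add("book", OEMObject())
book.add("title", OEMObject("Some Title"))
book.add("author", OEMObject("A. Author"))
book.add("author", OEMObject("B. Author"))
book.add("year", OEMObject(1998))

def eval_path(start, labels):
    """Follow the labels one step at a time, keeping every object reached."""
    frontier = [start]
    for label in labels:
        frontier = [c for obj in frontier for c in obj.follow(label)]
    return frontier

authors = [o.value for o in eval_path(root, ["biblio", "book", "author"])]
print(authors)
```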
Lorel: Salient features • Coercion • Forces comparison operators to handle comparisons between objects of different types, such as a string and an integer • E.g. select row:X from biblio.paper X where X.year=1998 ==> year could be stored as a string or as an integer
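The idea behind coercion can be sketched as a comparison that first tries to bring both operands to a common type. The function `coerce_eq` and the sample records are assumptions; Lorel's actual coercion rules are more detailed than this.

```python
def coerce_eq(a, b):
    """Compare two atomic values, trying a numeric comparison before a string one."""
    try:
        return float(a) == float(b)
    except (TypeError, ValueError):
        return str(a) == str(b)

# Hypothetical papers whose year is stored with different types.
papers = [
    {"title": "P1", "year": 1998},
    {"title": "P2", "year": "1998"},
    {"title": "P3", "year": "1997"},
]

rows = [p["title"] for p in papers if coerce_eq(p["year"], 1998)]
print(rows)
```

Without coercion, the record that stores the year as a string would be missed by the condition X.year=1998.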
Lorel: Salient Features • Path expressions • Data model allows arbitrary nesting • Queries should hence be able to probe arbitrary depth • Provided by path expressions • Eg. select title:t from chapter(.section)* s, s.title t where t like "*XML*"
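The path expression chapter(.section)* can be read as "the chapter itself, plus anything reachable through zero or more section edges". A rough Python equivalent over a hypothetical document is sketched below; the helper `star` and the sample data are assumptions, and the wildcard pattern "*XML*" is approximated by a substring test.

```python
import xml.etree.ElementTree as ET

# Hypothetical chapter with sections nested to different depths.
chapter = ET.fromstring("""
<chapter>
  <title>Intro</title>
  <section>
    <title>XML basics</title>
    <section><title>Why XML matters</title></section>
    <section><title>History</title></section>
  </section>
</chapter>""")

def star(node, label):
    """Yield node and every node reachable by zero or more `label` edges."""
    yield node
    for child in node.findall(label):
        yield from star(child, label)

titles = [
    t.text
    for s in star(chapter, "section")   # s ranges over chapter(.section)*
    for t in s.findall("title")         # s.title t
    if "XML" in t.text                  # t like "*XML*"
]
print(titles)
```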
UnQL • Based on an edge-labeled graph model • Coercion not supported • More precise knowledge of the data is needed • Pattern usage • E.g. select title: X where {biblio: {paper: {title: X, year: Y}}} in db, Y>1998
UnQL • Path variables • Paths can be used as data too • E.g. select @P from db1 @P.X where matches(".*(U|u)biquitin.*", X) ==> Determines where the string "ubiquitin" appears in db1
Semistructured data vs. XML • Both are schema-less and self-describing • XML is ordered and semistructured data is not • XML can mix text and elements • XML has lots of other stuff: entities, processing instructions, comments
Requirements of an XML Query Language • XML Output • Server-side processing • Query operations • Selection, Extraction, Reduction, Restructuring, Combination • No schema required • Exploit available schema • Preserve order and association • Programmatic Manipulation
Requirements of an XML Query Language • XML representation • Mutual embedding with XML • XLink and XPointer cognizant • Support for new data types • Suitable for metadata
XML Query Languages • XQL • XML-QL • Quilt
XQL • Simple expressions • //product[@maker='BSA']: all products whose maker attribute is 'BSA' • Filters • author/address[@type='email']: address nodes whose type attribute is 'email' • Subscripts • section[1,3 to 5]: nodes with position 1, 3, 4, 5
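The first two XQL expressions happen to fall within the limited XPath subset that Python's standard `xml.etree.ElementTree` understands, so they can be tried out as a rough sketch. The sample documents are assumptions. `ElementTree` needs a leading "." before "//" and does not support XQL subscript lists such as [1,3 to 5], so those are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog with products at different nesting depths.
catalog = ET.fromstring("""
<catalog>
  <product maker="BSA"><name>Bike</name></product>
  <product maker="Acme"><name>Anvil</name></product>
  <group><product maker="BSA"><name>Bell</name></product></group>
</catalog>""")

# //product[@maker='BSA']
bsa = [p.findtext("name") for p in catalog.findall(".//product[@maker='BSA']")]

# Hypothetical author data for the filter example.
people = ET.fromstring("""
<people>
  <author>
    <address type="email">a@example.org</address>
    <address type="postal">1 Main St</address>
  </author>
</people>""")

# author/address[@type='email']
emails = [a.text for a in people.findall("author/address[@type='email']")]

print(bsa, emails)
```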
XQL • Supports Boolean and set operators • q1 and q2 • q1 union q2 • Grouping • //invoice{q1}: groups the results of q1 by invoice • Sequence • a before b • Others: node(), text(), ...
XQL: Limitations • Flattening • Not possible, as the results of patterns and filters are not modeled by an intermediate relation • Restructuring • Not possible, as flattening is not permitted • Tag variables • Not supported • Sorting
XML Query Languages • XQL • XML-QL • Quilt
XML-QL • Simple examples
  WHERE <book>
          <publisher> <name>Addison-Wesley</name> </publisher>
          <title> $t </title>
          <author> $a </author>
        </book> IN "www.a.b.c/bib.xml"
  CONSTRUCT <result>
              <author> $a </author>
              <title> $t </title>
            </result>
XML-QL • Grouping
  WHERE <book> $p </> IN "www.a.b.c/bib.xml",
        <title> $t </>,
        <publisher> <name>Addison-Wesley</> </publisher> IN $p
  CONSTRUCT <result>
              <title> $t </>
              WHERE <author> $a </> IN $p
              CONSTRUCT <author> $a </>
            </>
  ==> Groups by title.
XML-QL • Tag variables
  WHERE <$p>
          <title> $t </title>
          <year> 1995 </>
          <$e> Smith </>
        </> IN "www.a.b.c/bib.xml",
        $e IN {author, editor}
  CONSTRUCT <$p>
              <title> $t </title>
              <$e> Smith </>
            </>
  ==> List of books where Smith could be either author or editor
XML-QL • Regular path expressions
  WHERE <part*>
          <name> $r </>
          <brand> Ford </>
        </> IN "www.a.b.c/bib.xml"
  CONSTRUCT <result> $r </>
  ==> Gets the list of names of parts, irrespective of the nesting of parts in the document.
XML-QL • Skolem functions
  WHERE <$>
          <author> <firstname> $fn </> <lastname> $ln </> </>
          <title> $t </>
        </> IN "www.a.b.c/bib.xml"
  CONSTRUCT <person ID=PersonID($fn, $ln)>
              <firstname> $fn </>
              <lastname> $ln </>
              <publicationtitle> $t </>
            </>
  ==> PersonID is a Skolem function. It generates a new ID for each distinct value of ($fn, $ln); for a value seen before, the content is appended to the existing node.
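The behaviour of a Skolem function can be sketched in Python as a memoized ID generator: equal arguments always map to the same ID, so later bindings for the same person extend the node that already exists. The class `Skolem`, the ID format, and the sample bindings are assumptions for illustration.

```python
from itertools import count

class Skolem:
    """Return the same fresh ID every time it is called with the same arguments."""
    def __init__(self, prefix):
        self.prefix = prefix
        self._ids = {}
        self._counter = count(1)

    def __call__(self, *args):
        if args not in self._ids:
            self._ids[args] = f"{self.prefix}{next(self._counter)}"
        return self._ids[args]

person_id = Skolem("p")

# Hypothetical ($fn, $ln, $t) bindings produced by a WHERE clause.
bindings = [("Ann", "Lee", "T1"), ("Bob", "Ray", "T2"), ("Ann", "Lee", "T3")]

people = {}
for fn, ln, title in bindings:
    node = people.setdefault(
        person_id(fn, ln),
        {"firstname": fn, "lastname": ln, "publicationtitle": []},
    )
    node["publicationtitle"].append(title)   # append to the existing node

print(people)
```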
XML-QL • Allows integrating data from multiple sources • Can query order as well • Provides for embedding query within data • Allows function definitions • Is relationally complete
XML-QL • Is everything fine? • Pattern specifications are too verbose • The result of the WHERE clause is a relation composed of scalar values • So it cannot preserve information about hierarchy and sequence • Hence it cannot handle queries related to hierarchy and sequence
XML Query Languages • XQL • XML-QL • Quilt
Quilt • Combines strengths of XML-QL and XQL • Derives ability to navigate and select nodes based on sequence from XQL • Binding of variables done like in XML-QL
Quilt • An example
  FOR $b IN //book
  WHERE exists($b/title) AND NOT exists($b/author)
  RETURN $b/title
  ==> Lists the titles of those books that have no author info
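For comparison, the same FOR / WHERE / RETURN pipeline can be sketched in Python with the standard `xml.etree.ElementTree`. The sample bibliography is an assumption, and the comprehension only mirrors the shape of the query, not Quilt's actual semantics.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography: one book with an author, one without, one without a title.
bib = ET.fromstring("""
<bib>
  <book><title>A</title><author>X</author></book>
  <book><title>B</title></book>
  <book><author>Y</author></book>
</bib>""")

titles = [
    t.text
    for b in bib.iter("book")                   # FOR $b IN //book
    if b.find("title") is not None              # WHERE exists($b/title)
    and b.find("author") is None                #   AND NOT exists($b/author)
    for t in b.findall("title")                 # RETURN $b/title
]
print(titles)
```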
Quilt • Flow of data in a Quilt expression: XML input → FOR/LET → tuples of bound variables → WHERE → selected tuples → RETURN → XML output
Quilt: Filtering Documents • Need to preserve the relationships among selected elements • E.g. [Figure: example tree with nodes labelled A, B, and C; filter = A|B]
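One way to read "preserve the relationships among selected elements" is that each kept node is re-attached to its nearest kept ancestor. The sketch below implements that reading in Python; the function `filter_tree`, the helper `shape`, and the sample tree are assumptions, and Quilt's own filter may differ in detail.

```python
import xml.etree.ElementTree as ET

def filter_tree(node, keep):
    """Return a forest holding only nodes whose tag is in `keep`.

    A kept node becomes the parent of the kept nodes found below it;
    a dropped node passes its kept descendants up to its own parent.
    """
    kept_below = [t for child in node for t in filter_tree(child, keep)]
    if node.tag in keep:
        copy = ET.Element(node.tag, dict(node.attrib))
        copy.extend(kept_below)
        return [copy]
    return kept_below

def shape(elem):
    """Nested (tag, children) view of a tree, convenient for printing."""
    return (elem.tag, [shape(c) for c in elem])

# Hypothetical tree with nodes labelled A, B, and C.
tree = ET.fromstring("<C><B><A/><C><B/></C></B><A><C><A/></C></A></C>")
forest = [shape(t) for t in filter_tree(tree, {"A", "B"})]
print(forest)
```

In this sample the root C is dropped, so the result is a forest of two trees, and the inner C nodes disappear while their A and B descendants stay attached to the nearest kept ancestor.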
Quilt • Can perform Sorting • Aggregation provided • Allows recursive functions
Quilt: The real power of it • Sample document
  <section>
    <section.title>Procedure</section.title>
    The patient was taken to the operating room where she was placed in a supine position and
    <Anesthesia>induced under general anesthesia. </Anesthesia>
    <Prep>
      <action>Foley catheter was placed to decompress the bladder</action>
      and the abdomen was then prepped and draped in sterile fashion.
    </Prep>
    <Incision>
      A curvilinear incision was made
      <Geography>in the midline immediately infraumbilical</Geography>
      and the subcutaneous tissue was divided
      <Instrument>using electrocautery.</Instrument>
    </Incision>
    The fascia was identified and
    <action>#2 0 Maxon stay sutures were placed on each side of the midline.</action>
    <Incision>
      The fascia was divided using
      <Instrument>electrocautery</Instrument>
      and the peritoneum was entered.
    </Incision>
    <Observation>The small bowel was identified</Observation>
    and
    <action> the <Instrument>Hasson trocar</Instrument></action>
    :
  </section>
Quilt: The real power of it
• In each section with title "Procedure", what Instruments were used in the second Incision?
  FOR $s IN //section[section.title="Procedure"]
  RETURN ($s//Incision)[2]/Instrument
• In each section with title "Procedure", what are the first two instruments to be used?
  FOR $s IN //section[section.title="Procedure"]
  RETURN ($s//Instrument)[1-2]
Quilt: The real power of it
• In the first procedure, what happened between the first incision and the second incision?
  FOR $proc IN //section[section.title="Procedure"][1],
      $bet IN $proc//((* AFTER ($proc//Incision)[1]) BEFORE ($proc//Incision)[2])
  RETURN $bet
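The queries about the second Incision and the first two Instruments rely on document order, which a plain Python sketch can mimic with list positions. The fragment below is a shortened, hypothetical version of the sample document, and the helper `procedure_sections` is an assumption. The BEFORE/AFTER query is not covered here.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample procedure document.
report = ET.fromstring("""
<section>
  <section.title>Procedure</section.title>
  <Incision>A curvilinear incision was made
    <Instrument>using electrocautery.</Instrument>
  </Incision>
  <Incision>The fascia was divided using
    <Instrument>electrocautery</Instrument>
  </Incision>
  <action>the <Instrument>Hasson trocar</Instrument></action>
</section>""")

def procedure_sections(root):
    """Yield every section whose section.title child reads "Procedure"."""
    for s in root.iter("section"):
        if any(c.tag == "section.title" and c.text == "Procedure" for c in s):
            yield s

second_incision_instruments = []
first_two_instruments = []
for s in procedure_sections(report):
    incisions = list(s.iter("Incision"))            # $s//Incision in document order
    if len(incisions) >= 2:                         # ($s//Incision)[2]/Instrument
        second_incision_instruments += [
            i.text for i in incisions[1].findall("Instrument")
        ]
    # ($s//Instrument)[1-2]: first two instruments in document order
    first_two_instruments += [i.text for i in list(s.iter("Instrument"))[:2]]

print(second_incision_instruments, first_two_instruments)
```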
XML Storage • Text files • Simple • Would require a special-purpose query processor • Relational databases • Ternary relations [Florescu et al.] • Inlining methods [Shanmugasundaram et al.] • STORED [Mary Fernandez]
XML Storage • Object-oriented databases [Sophie Cluet et al.] • Native storage
XML Storage • Using ternary relations • Edge labels are kept in a table together with the IDs of the two objects that each edge connects • Values of leaf nodes are stored in a separate table
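Store XML in Ternary Relation • [Figure: object &o1 has a paper edge to &o2; &o2 has title, author, author, and year edges to the leaf objects &o3 to &o6, whose values include “The Calculus”, “…”, “…”, and “1986”; edges are stored in a Ref table and leaf values in a Val table]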
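A minimal sketch of shredding a document into the two tables, with Ref holding (source, label, target) edges and Val holding (oid, value) pairs for leaves. The function `shred`, the oid numbering, and the author names in the sample are assumptions; the author values are elided in the figure.

```python
import xml.etree.ElementTree as ET
from itertools import count

def shred(root):
    """Return (ref, val): edge rows (source, label, target) and leaf rows (oid, value)."""
    oids = (f"&o{n}" for n in count(1))
    ref, val = [], []

    def visit(node, parent_oid):
        oid = next(oids)
        ref.append((parent_oid, node.tag, oid))       # one row per edge
        if len(node) == 0:                            # leaf: store its text value
            val.append((oid, (node.text or "").strip()))
        for child in node:
            visit(child, oid)

    visit(root, next(oids))                           # &o1 is the entry object
    return ref, val

# Hypothetical document shaped like the figure.
paper = ET.fromstring(
    "<paper><title>The Calculus</title>"
    "<author>A. Author</author><author>B. Author</author>"
    "<year>1986</year></paper>"
)
ref, val = shred(paper)
for row in ref:
    print("Ref", row)
for row in val:
    print("Val", row)
```

Both tables can be loaded into an ordinary relational database, and a path query then becomes a chain of self-joins on Ref.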
XML Storage • DTDs are converted into a DTD graph • Inlining methods • Basic inlining • Shared inlining • Hybrid inlining