An algebra for XML

An algebra for XML. Leonidas Galanis and Stratis Viglas, University of Wisconsin-Madison, Department of Computer Sciences. Outline: What kind of operations do we need? Why are XML data different? How do we overcome the problems that arise? A concrete algebra. Using this algebra inside Niagara.

Presentation Transcript


  1. An algebra for XML Leonidas Galanis, Stratis Viglas University of Wisconsin-Madison Department of Computer Sciences

  2. Outline • What kind of operations do we need? • Why are XML data different? • How do we overcome the problems that arise? • A concrete algebra • Using this algebra inside Niagara

  3. What do we need? • Pattern retrieval • Selections • Projections • Joins • Element Construction

  4. So, why is it different? • Relational algebra has selections, projections and joins • Object oriented algebras have pattern-like constructs (path expressions) • Just use these, add a construction operator and we’re set! • …not really

  5. Key underlying difference • Relational model: there is a database schema, everything is flat • Object-oriented models: there is a class definition, a known kind of hierarchy • What does XML have? • DTDs and XML Schemata can act as a schema • But most XML files out there do not conform to a DTD or XML Schema • We don’t really know the schema of the data. We just know the data are there and they have some context

  6. The data model • XML file is a DAG of vertices • Arcs coming out of each vertex • Three types of arcs: Attribute, Element, IDRef • Each arc is named • Moreover, there is an ordering on arcs and nodes [Figure: a book vertex with isbn, title and author arcs, annotated with order numbers]
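
The data model above can be sketched in a few lines of Python. This is only an illustration, not the Niagara representation: the class names `Arc` and `Vertex` and the helper `build` are hypothetical, and IDRef arcs are left out because identifying them would require a DTD.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import xml.etree.ElementTree as ET


@dataclass
class Arc:
    kind: str          # "attribute", "element", or "idref" (idref unused here)
    name: str          # every arc is named
    target: "Vertex"


@dataclass
class Vertex:
    value: Optional[str] = None
    arcs: List[Arc] = field(default_factory=list)  # list position records arc order


def build(elem: ET.Element) -> Vertex:
    """Turn a parsed element into a vertex with attribute and element arcs."""
    text = (elem.text or "").strip()
    vertex = Vertex(value=text or None)
    for name, value in elem.attrib.items():
        vertex.arcs.append(Arc("attribute", name, Vertex(value)))
    for child in elem:
        vertex.arcs.append(Arc("element", child.tag, build(child)))
    return vertex


# Abridged copy of the first book from the example file.
BOOK = (
    '<book isbn="01"><title>Foundations of Databases</title>'
    "<author>Abiteboul</author><author>Vianu</author></book>"
)
book = build(ET.fromstring(BOOK))
arc_summary = [(a.kind, a.name, a.target.value) for a in book.arcs]
print(arc_summary)
```

Keeping the arcs in a list is one simple way to record their order; a model that ignores order could use a set instead.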

  7. Use the bare minimum • All operators operate on a set of vertices of the same type • Use relative path expressions • Use selections and joins to filter out the data • Conditions are based on path expressions • Use plain projections to project out specific elements • Output construction based on wrapping elements with a tag • Build on these principles as we go along

  8. Example File: books.xml
     <bib>
       <book isbn="01">
         <title>Foundations of Databases</title>
         <author>Abiteboul</author>
         <author>Vianu</author>
         <author>Hull</author>
       </book>
       <book isbn="02">
         <title>Principles of Database Systems</title>
         <author>Ramakrishnan</author>
       </book>
       <book isbn="03">
         <title>Niagara Blues</title>
         <author>Galanis</author>
       </book>
     </bib>

  9. Example File: articles.xml
     <proc>
       <article>
         <title>The OO7 Benchmark</title>
         <author>DeWitt</author>
         <author>Carey</author>
         <author>Naughton</author>
       </article>
       <article>
         <title>Magic is relevant</title>
         <author>Ramakrishnan</author>
       </article>
       <article>
         <title>The Niagara Insomniac</title>
         <author>Viglas</author>
       </article>
     </proc>

  10. Vertex specification • Given a vertex, follow a path down through its descendant vertices • Return all vertices reachable via the path expression • Assume that, given an arc, we can differentiate between element, attribute and IDRef arcs.

  11. Vertex specification example • Suppose we have reached vertices pointed to by “book” arcs • We want the authors of these books • So we follow the author arc • Result is a set of vertices pointed to by “author” arcs • Let’s call this operator Follow [Figure: Follow(author) applied to book vertices, yielding author vertices]
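
As a rough sketch of this first version of Follow, the snippet below walks a dotted path of element arcs from a set of vertices. It assumes parsed `xml.etree.ElementTree` elements stand in for vertices; the function name `follow` and the dotted path syntax are illustrative choices, not the Niagara implementation.

```python
import xml.etree.ElementTree as ET

BOOKS = """<bib>
  <book isbn="01"><title>Foundations of Databases</title>
    <author>Abiteboul</author><author>Vianu</author><author>Hull</author></book>
  <book isbn="02"><title>Principles of Database Systems</title>
    <author>Ramakrishnan</author></book>
  <book isbn="03"><title>Niagara Blues</title><author>Galanis</author></book>
</bib>"""


def follow(vertices, path):
    """Return every vertex reachable from the inputs along a dotted path of element arcs."""
    current = list(vertices)
    for name in path.split("."):
        # Iterating an Element yields its child elements in document order.
        current = [child for v in current for child in v if child.tag == name]
    return current


root = ET.fromstring(BOOKS)
books = follow([root], "book")       # vertices pointed to by "book" arcs
authors = follow(books, "author")    # vertices pointed to by "author" arcs
author_names = [a.text for a in authors]
print(author_names)
```

Note that the result contains only author vertices: the books they came from are no longer accessible, which is the limitation the later slides address.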

  12. Selections • Filter out the input based on some qualification • Select(condition) • e.g.: Select(book.author = “Hull”) • What are the semantics? • What kind of elements are flowing through the system? • Can we overlay multiple selections?

  13. Selections (example) • Suppose we want the titles of books written by a specific author • How far should we go into the initial Follow? • If we follow to book.author, then we lose access to book.title • If we follow to book, we are better off • What if we want the author as well? (i.e., only the specified author should appear in the output) • This can be a problem… [Figure: a book vertex with author and title arcs, marking where the selection is applied]

  14. Query on books.xml
     <bib>
       <book isbn="01">
         <title>Foundations of Databases</title>
         <author>Abiteboul</author>
         <author>Vianu</author>
         <author>Hull</author>
       </book>
       <book isbn="02">
         <title>Principles of Database Systems</title>
         <author>Ramakrishnan</author>
       </book>
       <book isbn="03">
         <title>Niagara: A programmer’s waterfall</title>
         <author>Galanis</author>
       </book>
     </bib>

  15. Proposed Solution • Permit more than one Follow operator • Change the assumption: operators no longer work on a single type of input • Instead, they work on a collection { } of bags [ ] of vertices • Example: { [book1, book1.title1], [book2, book2.title2], … } • Relational analogy: • Vertex = attribute • Bag = tuple • Collection = relation

  16. Solution • Even more, change the semantics of the Follow operator • Evaluate a specified path expression in all elements of all bags of a collection • For each qualifying element, create a new bag containing the old vertex plus the qualifying vertex • Same as un-nesting in Object-Oriented algebras • Example: Follow(book) yields { [book1], [book2], [book3] }; Follow(author) on that collection yields { [book1, author1], [book1, author2], …, [book3, author1] }, i.e. { [Foundations…, Vianu], [Foundations…, Hull], …, [Niagara…, Galanis] }
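
The un-nesting behaviour described above can be sketched as follows. This is a minimal illustration under a few assumptions: bags are Python lists of `ElementTree` elements, the helper names `reach` and `follow` are hypothetical, and a new bag keeps every vertex of the old bag plus the newly reached one.

```python
import xml.etree.ElementTree as ET

BOOKS = """<bib>
  <book isbn="01"><title>Foundations of Databases</title>
    <author>Abiteboul</author><author>Vianu</author><author>Hull</author></book>
  <book isbn="02"><title>Principles of Database Systems</title>
    <author>Ramakrishnan</author></book>
  <book isbn="03"><title>Niagara Blues</title><author>Galanis</author></book>
</bib>"""


def reach(vertex, path):
    """Vertices reachable from one vertex along a dotted path of element arcs."""
    current = [vertex]
    for name in path.split("."):
        current = [c for v in current for c in v if c.tag == name]
    return current


def follow(collection, path):
    """Un-nesting Follow: one output bag per (input bag, qualifying vertex) pair."""
    out = []
    for bag in collection:
        for vertex in bag:
            for hit in reach(vertex, path):
                out.append(bag + [hit])   # old bag plus the qualifying vertex
    return out


root = ET.fromstring(BOOKS)
books = [[b] for b in root.findall("book")]   # { [book1], [book2], [book3] }
pairs = follow(books, "author")               # { [book1, author1], ... }
labels = [(bag[0].findtext("title"), bag[1].text) for bag in pairs]
print(labels)
```

Because each output bag still carries the book vertex, a later operator can reach both the author and the title of the same book.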

  17. Joins • Join two collections based on some qualification • j(condition) • What is the output of a join? • [Beech, Malhotra, Rys]: Add an IDRef arc from one vertex to the joining vertex • But IDRef arcs are directed • So in their model, joins are not commutative

  18. Our Solution • Each bag of the resulting collection is a concatenation of the joining bags • The same as concatenating tuples in the relational paradigm • Even more, bags are unordered
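
A join that concatenates bags can be sketched as a nested loop. The snippet assumes bags are lists of `ElementTree` elements and uses abridged copies of the two example files; `author_bags`, `join`, `same_author` and `as_sets` are hypothetical helper names introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

BOOKS = (
    "<bib><book><title>Principles of Database Systems</title>"
    "<author>Ramakrishnan</author></book>"
    "<book><title>Niagara Blues</title><author>Galanis</author></book></bib>"
)
ARTICLES = (
    "<proc><article><title>Magic is relevant</title>"
    "<author>Ramakrishnan</author></article>"
    "<article><title>The Niagara Insomniac</title>"
    "<author>Viglas</author></article></proc>"
)


def author_bags(root, tag):
    """Build bags of the form [item, author] for every author of every item."""
    return [[item, a] for item in root.findall(tag) for a in item.findall("author")]


def join(left, right, condition):
    """Concatenate every pair of bags that satisfies the condition."""
    return [lb + rb for lb in left for rb in right if condition(lb, rb)]


def same_author(a, b):
    return a[1].text == b[1].text


def as_sets(collection):
    """View each bag as an unordered set of vertex identities."""
    return {frozenset(id(v) for v in bag) for bag in collection}


books = author_bags(ET.fromstring(BOOKS), "book")
articles = author_bags(ET.fromstring(ARTICLES), "article")
joined = join(books, articles, same_author)
swapped = join(articles, books, same_author)
titles = [
    sorted(v.findtext("title") for v in bag if v.tag in ("book", "article"))
    for bag in joined
]
print(titles)
```

Since bags are unordered, `joined` and `swapped` describe the same result once each bag is viewed as a set, which is what makes this join commutative.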

  19. Problems • Suppose we are operating on two streams: books and articles • We have joined on the author • We want a selection on the book’s title • Using relative path expressions, what path expression are we going to specify? [Figure: a selection (title = Niagara) above j(author = author) over the book and article streams]

  20. Possible solution • Use absolute path expressions • Now we can distinguish between different sources • But what if we can evaluate the path expression on different elements of the bag? • For instance, given bags of [book, book.author], book.author.lastname can be evaluated on both elements of the bag • Choose the element of the bag with the greatest common prefix for evaluation
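
The prefix rule can be sketched as a small function. It assumes each bag element is identified by the absolute path that produced it, and it treats an element as a candidate only when its whole path is a prefix of the target path; both are assumptions of this sketch, and the name `choose_anchor` is hypothetical.

```python
def choose_anchor(bag_paths, target):
    """Pick the bag element whose path is the longest prefix of the target path.

    Returns the chosen path and the remaining steps still to be followed from it.
    """
    target_steps = target.split(".")
    best = None
    for path in bag_paths:
        steps = path.split(".")
        is_prefix = target_steps[: len(steps)] == steps
        if is_prefix and (best is None or len(steps) > len(best)):
            best = steps
    if best is None:
        raise ValueError(f"no bag element can evaluate {target!r}")
    return ".".join(best), ".".join(target_steps[len(best):])


# Bags of [book, book.author]: both elements qualify, the longer prefix wins.
choice = choose_anchor(["book", "book.author"], "book.author.lastname")

# After a join, the absolute path tells the book source apart from the article source.
joined_choice = choose_anchor(
    ["book", "book.author", "article", "article.author"], "book.title"
)
print(choice, joined_choice)
```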

  21. Cleaner solution • The previous solution works, but relies on an implicit rule for choosing where to evaluate the path expression • Introduce a reverse part in the path expressions • A reverse part designates backward satisfaction constraints • Examples: • lastname:book.author instructs following the lastname arc from book.author vertices • author.lastname:book instructs following the author.lastname arc from book vertices • This way, the path expression itself specifies on which element of the bag it is to be evaluated
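
The two examples above can be made concrete with a short sketch. It assumes a bag is a dictionary keyed by the path that produced each vertex, and it uses a hypothetical book element with a lastname child, since the example files do not contain one; `parse` and `evaluate` are illustrative names.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the example files have no <lastname> element.
BOOK = (
    "<book><title>Foundations of Databases</title>"
    "<author><lastname>Hull</lastname></author></book>"
)


def parse(expr):
    """Split 'forward:reverse' into the forward steps and the anchoring path."""
    forward, sep, anchor = expr.partition(":")
    if not sep or not forward or not anchor:
        raise ValueError(f"expected 'forward:reverse', got {expr!r}")
    return forward.split("."), anchor


def evaluate(expr, bag):
    """Follow the forward part starting from the bag element named by the reverse part."""
    steps, anchor = parse(expr)
    current = [bag[anchor]]
    for name in steps:
        current = [c for v in current for c in v if c.tag == name]
    return [v.text for v in current]


book = ET.fromstring(BOOK)
bag = {"book": book, "book.author": book.find("author")}

from_author = evaluate("lastname:book.author", bag)
from_book = evaluate("author.lastname:book", bag)
title = evaluate("title:book", bag)
print(from_author, from_book, title)
```

Both lastname expressions reach the same vertex here; the difference is only which bag element serves as the starting point.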

  22. Projections • With the tools we have, it’s easy to project out elements • We just specify, using the correct path expression, which element of the bag we wish to project • Let’s call this operator Expose • Example: Expose(lastname:book.author, title:book) • Expose creates element content

  23. Element construction • We need a way to specify the vertex that encloses the projected ones • Call this operator Vertex, written v • Creates the vertex, as well as the named arc that leads to it • Example: v(book_author) creates a vertex reached by an arc named book_author

  24. One last step… • We need to be able to construct complex elements, i.e., a way to handle arbitrary nesting • Each path expression designated inside an Expose operator can be tagged with a Vertex operation

  25. Element construction example • v(book_info)[Expose(v(name)[lastname:book.author], v(title)[title:book])] constructs a book_info element with a name child holding the lastnames and a title child holding the titles
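
The nesting of Vertex and Expose can be sketched with `xml.etree.ElementTree`. This is a simplification: the bag is assumed to hold already evaluated text values keyed by their path expression, the values are hypothetical, and the function names `vertex` and `expose` are illustrative.

```python
import xml.etree.ElementTree as ET


def vertex(tag, content):
    """Wrap content (a string or a list of elements) in a new element named tag."""
    element = ET.Element(tag)
    if isinstance(content, str):
        element.text = content
    else:
        element.extend(content)
    return element


def expose(bag, *tagged_paths):
    """Project the listed bag entries, wrapping each in the tag given for it."""
    return [vertex(tag, bag[path]) for tag, path in tagged_paths]


# Hypothetical bag: one value per path expression.
bag = {
    "lastname:book.author": "Hull",
    "title:book": "Foundations of Databases",
}

book_info = vertex(
    "book_info",
    expose(bag, ("name", "lastname:book.author"), ("title", "title:book")),
)
xml_out = ET.tostring(book_info, encoding="unicode")
print(xml_out)
```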

  26. The Niagara Algebra • Six basic operators: • Source • Follow • Select • Join • Expose • Vertex • Regular path expressions used for element specification • Differentiation between tags, elements, contents • Filtering and construction operators • Assume an unordered XML data model

  27. Source operator • Input: the initial collection • Singleton bags, each containing the root of one XML file • Output: either the initial collection, or a subset of it • The selection can also be based on conformance to a DTD or XML schema • Examples • Source(“*”): the initial collection (the “from *” clause) • Source(“foo.xml”): { [foo.xml] } • Source(“bib*.xml”): { [bib90.xml], [bib91.xml], … } • Source(“*”, “book.dtd”): { [books.xml], [morebooks.xml], … }
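
The pattern-based examples above can be approximated with shell-style matching. The sketch assumes the initial collection is just a list of document names (the list below is hypothetical) and leaves out the DTD conformance filter; `source` is an illustrative function, not the Niagara operator itself.

```python
from fnmatch import fnmatchcase

# Hypothetical initial collection of document names.
DOCUMENTS = ["foo.xml", "bib90.xml", "bib91.xml", "books.xml"]


def source(pattern, documents=DOCUMENTS):
    """Return singleton bags for every document whose name matches the pattern."""
    return [[name] for name in documents if fnmatchcase(name, pattern)]


everything = source("*")
single = source("foo.xml")
bibs = source("bib*.xml")
print(everything, single, bibs)
```

In a fuller version each bag would hold the root vertex of the parsed file rather than its name.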

  28. Putting it all together… • Query: “Book titles of authors named Ramakrishnan who have written an article as well” • Plan, from the output down to the sources: v(raghu_title) • Expose(book:title) • j(author:book = author:article) • Select(author:article = Ramakrishnan) • Follow(book), Follow(article) • s(books.xml), s(articles.xml) • Result: <raghu_title>Principles…</raghu_title>
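
The whole query can be run end to end with a small sketch. It is not the Niagara engine: bags are modelled as dictionaries keyed by a label for each vertex, the two files are abridged copies of the examples, and every function name and signature here is a hypothetical stand-in for the corresponding operator.

```python
import xml.etree.ElementTree as ET

BOOKS = (
    "<bib>"
    "<book><title>Foundations of Databases</title><author>Hull</author></book>"
    "<book><title>Principles of Database Systems</title>"
    "<author>Ramakrishnan</author></book>"
    "<book><title>Niagara Blues</title><author>Galanis</author></book>"
    "</bib>"
)
ARTICLES = (
    "<proc>"
    "<article><title>The OO7 Benchmark</title><author>DeWitt</author></article>"
    "<article><title>Magic is relevant</title><author>Ramakrishnan</author></article>"
    "<article><title>The Niagara Insomniac</title><author>Viglas</author></article>"
    "</proc>"
)


def source(xml_text, label):
    """One singleton bag holding the root of the document."""
    return [{label: ET.fromstring(xml_text)}]


def follow(collection, name, start, label):
    """Extend each bag with every child named `name` of the vertex labelled `start`."""
    return [
        {**bag, label: child}
        for bag in collection
        for child in bag[start]
        if child.tag == name
    ]


def select(collection, condition):
    return [bag for bag in collection if condition(bag)]


def join(left, right, condition):
    """Concatenate every pair of bags that satisfies the condition."""
    return [{**lb, **rb} for lb in left for rb in right if condition(lb, rb)]


def vertex(tag, text):
    element = ET.Element(tag)
    element.text = text
    return element


books = follow(source(BOOKS, "bib"), "book", "bib", "book")
book_authors = follow(books, "author", "book", "author:book")
articles = follow(source(ARTICLES, "proc"), "article", "proc", "article")
article_authors = follow(articles, "author", "article", "author:article")

selected = select(
    article_authors, lambda bag: bag["author:article"].text == "Ramakrishnan"
)
joined = join(
    book_authors,
    selected,
    lambda lb, rb: lb["author:book"].text == rb["author:article"].text,
)
titles = follow(joined, "title", "book", "title:book")
result = [vertex("raghu_title", bag["title:book"].text) for bag in titles]
output = [ET.tostring(element, encoding="unicode") for element in result]
print(output)
```

Labelling each vertex in the bag plays the role of the reverse part of a path expression: it says which element of the bag an operator should start from.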

  29. Summary • Operators operate on a collection of bags of vertices • Path expressions identify vertices • Following of path expressions, Selections and Joins filter the input • XML output is constructed with Expose/Vertex operations • These are complicated data, so it’s a complicated algebra • …but it seems to work