Understand XML operations, data nuances, & construction in Niagara using algebra. Learn pattern retrieval, selections, projections, joins, & element construction. Simplify queries with path expressions & projections on vertices.
An algebra for XML • Leonidas Galanis, Stratis Viglas • University of Wisconsin-Madison, Department of Computer Sciences
Outline • What kind of operations do we need? • Why are XML data different? • How do we overcome the problems that arise? • A concrete algebra • Using this algebra inside Niagara
What do we need? • Pattern retrieval • Selections • Projections • Joins • Element Construction
So, why is it different? • Relational algebra has selections, projections and joins • Object oriented algebras have pattern-like constructs (path expressions) • Just use these, add a construction operator and we’re set! • …not really
Key underlying difference • Relational model: there is a database schema, everything is flat • Object-oriented models: there is a class definition, a known kind of hierarchy • What does XML have? • DTDs, XML Schemata can act as a schema • Most XML files out there do not conform to a DTD/XML Schema • We don’t really know the data schema. We just know the data are there and they have some context
The data model • XML file is a DAG of vertices • Arcs coming out of each vertex • Three types of arcs: • Attribute • Element • IDRef • Each arc is named • Even more, there is an ordering on arcs and nodes • [Figure: a book vertex with outgoing isbn, title and author arcs; arcs and vertices are numbered to show the ordering]
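The data model above can be sketched in a few lines of Python. This is only an illustration: the class names `Vertex` and `Arc`, the `kind` strings and the `add` helper are assumptions made for the example, not part of Niagara.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Arc:
    kind: str          # assumed labels: "element", "attribute" or "idref"
    name: str          # every arc is named
    target: "Vertex"   # the vertex the arc points to


@dataclass
class Vertex:
    value: Optional[str] = None
    # List order stands in for the ordering on arcs.
    arcs: List[Arc] = field(default_factory=list)

    def add(self, kind: str, name: str, target: "Vertex") -> "Vertex":
        """Append a named arc of the given kind and return its target."""
        self.arcs.append(Arc(kind, name, target))
        return target


# The first book of books.xml as a vertex with named, ordered arcs.
book = Vertex()
book.add("attribute", "isbn", Vertex("01"))
book.add("element", "title", Vertex("Foundations of Databases"))
for name in ("Abiteboul", "Vianu", "Hull"):
    book.add("element", "author", Vertex(name))

authors = [arc.target.value for arc in book.arcs if arc.name == "author"]
print(authors)
```

Keeping the arcs in a list preserves document order, which is the simplest way to model the ordering mentioned on the slide.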
Use the bare minimum • All operators operate on a set of vertices of the same type • Use relative path expressions • Use selections and joins to filter out the data • Conditions are based on path expressions • Use plain projections to project out specific elements • Output construction based on wrapping elements with a tag • Build on these principles as we go along
Example File: books.xml <bib> <book isbn="01"> <title>Foundations of Databases</title> <author>Abiteboul</author> <author>Vianu</author> <author>Hull</author> </book> <book isbn="02"> <title>Principles of Database Systems</title> <author>Ramakrishnan</author> </book> <book isbn="03"> <title>Niagara Blues</title> <author>Galanis</author> </book> </bib>
Example File: articles.xml <proc> <article> <title>The OO7 Benchmark</title> <author>DeWitt</author> <author>Carey</author> <author>Naughton</author> </article> <article> <title>Magic is relevant</title> <author>Ramakrishnan</author> </article> <article> <title>The Niagara Insomniac</title> <author>Viglas</author> </article> </proc>
Vertex specification • Given a vertex, follow a path down through its descendant vertices • Return all vertices reachable via the path expression • Assume that, given an arc, we can differentiate between element, attribute and IDRef arcs.
Vertex specification example • Suppose we have reached vertices pointed to by “book” arcs • We want the authors of these books • So we follow the author arc • Result is a set of vertices pointed to by “author” arcs • Let’s call this operator Follow • [Figure: Follow(author) applied to book vertices yields the author vertices]
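A minimal sketch of this first, single-type version of Follow, using Python's standard `xml.etree.ElementTree` as a stand-in for the vertex graph. The function name `follow` and the choice to handle element arcs only are assumptions of the example.

```python
import xml.etree.ElementTree as ET

BOOKS = """<bib>
  <book isbn="01"><title>Foundations of Databases</title>
    <author>Abiteboul</author><author>Vianu</author><author>Hull</author></book>
  <book isbn="02"><title>Principles of Database Systems</title>
    <author>Ramakrishnan</author></book>
  <book isbn="03"><title>Niagara Blues</title>
    <author>Galanis</author></book>
</bib>"""


def follow(vertices, arc_name):
    """Return every vertex reached from `vertices` over an element arc `arc_name`."""
    return [child for v in vertices for child in v.findall(arc_name)]


root = ET.fromstring(BOOKS)
books = follow([root], "book")        # vertices pointed to by "book" arcs
authors = follow(books, "author")     # vertices pointed to by "author" arcs
author_names = [a.text for a in authors]
print(author_names)
```

Note that the result holds only author vertices: the books they came from are no longer reachable, which is exactly the limitation the selection slides run into.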
Selections • Filter out the input based on some qualification • Select(condition) • e.g.: Select(book.author = “Hull”) • What are the semantics? • What kind of elements are flowing through the system? • Can we overlay multiple selections?
Selections (example) • Suppose we want the titles of books written by a specific author • How far should we go into the initial Follow? • If we follow to book.author, then we lose access to book.title • If we follow to book, we are better off • What if we want the author as well? (i.e., only the specified author should appear in the output) • This can be a problem… • [Figure: a book vertex with author and title arcs, with the selection point marked]
Query on books.xml <bib> <book isbn="01"> <title>Foundations of Databases</title> <author>Abiteboul</author> <author>Vianu</author> <author>Hull</author> </book> <book isbn="02"> <title>Principles of Database Systems</title> <author>Ramakrishnan</author> </book> <book isbn="03"> <title>Niagara: A programmer’s waterfall</title> <author>Galanis</author> </book> </bib>
Proposed Solution • Permit more than one Follow operator • Change the assumption: operators no longer work on a single type of input • Instead, they work on a collection { } of bags [ ] of vertices • Example: • { [book1, book1.title1], [book2, book2.title2], …} • Relational analogy: • Vertex = attribute • Bag = tuple • Collection = relation
Solution • Even more, change the semantics of the Follow operator • Evaluate a specified path expression on all elements of all bags of a collection • For each qualifying element, create a new bag containing the old vertex plus the qualifying vertex • Same as un-nesting in Object-Oriented algebras • Example: Follow(book) yields { [book1], [book2], [book3] } • Follow(author) on that collection yields { [book1, author1], [book1, author2], …, [book3, author1] }, with contents such as { [Foundations…, Vianu], [Foundations…, Hull], …, [Niagara…, Galanis] }
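The revised Follow can be sketched as an un-nesting step over a collection of bags. In this sketch a bag is a Python tuple of `ElementTree` elements and a collection is a list of bags; both representations, and the reading that the new bag keeps everything in the old one, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

BOOKS = """<bib>
  <book isbn="01"><title>Foundations of Databases</title>
    <author>Abiteboul</author><author>Vianu</author><author>Hull</author></book>
  <book isbn="02"><title>Principles of Database Systems</title>
    <author>Ramakrishnan</author></book>
  <book isbn="03"><title>Niagara Blues</title>
    <author>Galanis</author></book>
</bib>"""


def follow(collection, arc_name):
    """Evaluate `arc_name` on every element of every bag.

    Each hit produces a new bag: the old bag plus the qualifying vertex.
    """
    result = []
    for bag in collection:
        for vertex in bag:
            for hit in vertex.findall(arc_name):
                result.append(bag + (hit,))
    return result


root = ET.fromstring(BOOKS)
books = [(b,) for b in root.findall("book")]   # { [book1], [book2], [book3] }
bags = follow(books, "author")                 # { [book1, author1], ... }

pairs = [(book.findtext("title"), author.text) for book, author in bags]
for pair in pairs:
    print(pair)
```

Because each bag still carries its book vertex, a later operator can reach both the author and the title, which the single-type Follow could not do.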
Joins • Join two collections based on some qualification • j(condition) • What is the output of a join? • [Beech, Malhotra, Rys]: Add an IDRef arc from one vertex to the joining vertex • But IDRef arcs are directed • So in their model, joins are not commutative
Our Solution • Each bag of the resulting collection is a concatenation of the joining bags • The same as concatenating tuples in the relational paradigm • Even more, bags are unordered, so the order of the join inputs does not matter
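A small sketch of a join that concatenates bags. Here a bag is modelled as a `frozenset` of `(path, text)` pairs, which is an assumption chosen to make the "bags are unordered" point visible (a real bag could hold duplicates, which a set cannot); the helper names `join` and `value` are hypothetical.

```python
def join(left, right, condition):
    """Concatenate every pair of bags that satisfies `condition`."""
    return {l | r for l in left for r in right if condition(l, r)}


def value(bag, path):
    """Text of the vertex labelled `path` in `bag`, or None if absent."""
    return next((text for p, text in bag if p == path), None)


books = {
    frozenset({("book.title", "Principles of Database Systems"),
               ("book.author", "Ramakrishnan")}),
    frozenset({("book.title", "Niagara Blues"),
               ("book.author", "Galanis")}),
}
articles = {
    frozenset({("article.title", "Magic is relevant"),
               ("article.author", "Ramakrishnan")}),
    frozenset({("article.title", "The Niagara Insomniac"),
               ("article.author", "Viglas")}),
}


def same_author(book_bag, article_bag):
    return value(book_bag, "book.author") == value(article_bag, "article.author")


joined = join(books, articles, same_author)
# Swapping the inputs gives the same collection of bags.
flipped = join(articles, books, lambda a, b: same_author(b, a))
print(joined == flipped)
```

With unordered bags the concatenation is just a union, so swapping the inputs changes nothing, unlike a join that records its result as a directed IDRef arc.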
Problems • Suppose we are operating on two streams: books and articles • We have joined on the author • We want a selection on the book’s title • Using relative path expressions, what path expression are we going to specify? • [Figure: query plan with a selection (title = Niagara) above j(author = author), whose inputs are the book and article streams]
Possible solution • Use absolute path expressions • Now we can distinguish between different sources • But what if we can evaluate the path expression on different elements of the bag? • For instance, given bags of [book, book.author], book.author.lastname can be evaluated on both elements of the bag • Choose the element of the bag with the greatest common prefix for evaluation
Cleaner solution • The previous solution works, but it relies on an implicit rule for choosing where a path expression is evaluated • Introduce a reverse part in the path expressions • A reverse part designates backward satisfaction constraints • Examples: • lastname:book.author instructs following the lastname arc from book.author vertices • author.lastname:book instructs following the author.lastname path from book vertices • This way, the path expression itself determines which element of the bag it is evaluated on
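One way to picture the reverse part is as a label lookup inside the bag. In the sketch below each bag entry is a `(path, element)` pair, where `path` records how the vertex was reached; this labelling, the function names `parse` and `evaluate`, and the sample book with a `lastname` child are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def parse(expr):
    """Split 'forward:reverse' into the forward steps and the reverse label."""
    forward, _, reverse = expr.partition(":")
    return forward.split("."), reverse


def evaluate(bag, expr):
    """Follow the forward part from the bag entries whose path equals the reverse part."""
    forward, reverse = parse(expr)
    hits = []
    for path, vertex in bag:
        if path == reverse:
            hits.extend(vertex.findall("/".join(forward)))
    return hits


# Hypothetical book whose author has a lastname child.
book = ET.fromstring(
    "<book><title>Foundations of Databases</title>"
    "<author><lastname>Hull</lastname></author></book>"
)
bag = [("book", book), ("book.author", book.find("author"))]

from_author = [v.text for v in evaluate(bag, "lastname:book.author")]
from_book = [v.text for v in evaluate(bag, "author.lastname:book")]
no_match = evaluate(bag, "title:book.author")   # authors have no title arc
print(from_author, from_book, no_match)
```

Both expressions reach the same lastname vertex, but each names the bag element it starts from, so no greatest-common-prefix rule is needed.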
Projections • With the tools we have, it’s easy to project out elements • We just specify, using the correct path expression, which element of the bag we wish to project • Let’s call this operator Expose • Example: • Expose(lastname:book.author, title:book) • Expose creates element content
Element construction • We need a way to specify the vertex that encloses the projected ones • Call this operator Vertex, written v • Creates the vertex, as well as the named arc that leads to it • Example: • v(book_author) • [Figure: a new vertex reached by an arc named book_author]
One last step… • We need to be able to construct complex elements, i.e., a way to handle arbitrary nesting • Each path expression designated inside an Expose operator can be tagged with a Vertex operation
Element construction example • v(book_info)[(v(name)[lastname:book.author], v(title)[title:book])] constructs: • [Figure: a book_info vertex with two children, name and title, holding the lastnames and the titles respectively]
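The nesting of Vertex around exposed content can be mimicked with `xml.etree.ElementTree`. This is a sketch under assumptions: a bag is a plain dict from path expression to already-evaluated text, and the helper names `expose` and `vertex` are hypothetical.

```python
import xml.etree.ElementTree as ET


def expose(bag, path):
    """Project out the content stored under `path` in the bag."""
    return bag[path]


def vertex(tag, content):
    """Wrap `content` (text or a list of elements) in a new element `tag`."""
    element = ET.Element(tag)
    if isinstance(content, str):
        element.text = content
    else:
        element.extend(content)
    return element


# One bag, with values already evaluated for the two path expressions.
bag = {
    "lastname:book.author": "Hull",
    "title:book": "Foundations of Databases",
}

book_info = vertex("book_info", [
    vertex("name", expose(bag, "lastname:book.author")),
    vertex("title", expose(bag, "title:book")),
])
xml_text = ET.tostring(book_info, encoding="unicode")
print(xml_text)
```

Since `vertex` accepts a list of elements as well as text, the same call nests to any depth, which is the arbitrary nesting the previous slide asks for.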
The Niagara Algebra • Six basic operators: • Source • Follow • Select • Join • Expose • Vertex • Regular path expressions used for element specification • Differentiation between tags, elements, contents • Filtering and construction operators • Assume an unordered XML data model
Source operator • Input: the initial collection • Singleton bags, each containing the root of one XML file • Output: either the initial collection, or a subset of it • The selection can also be based on conformance to a DTD or XML schema • Examples • Source(“*”): the initial collection (the “from *” clause) • Source(“foo.xml”): { [foo.xml] } • Source(“bib*.xml”): { [bib90.xml], [bib91.xml], … } • Source(“*”, “book.dtd”): { [books.xml], [morebooks.xml], … }
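The pattern-based examples above can be imitated with shell-style matching from Python's standard `fnmatch` module. The sketch assumes the initial collection is just a list of file names, and it leaves out the DTD or XML Schema conformance filter entirely; the function name `source` and the file list are made up for the example.

```python
from fnmatch import fnmatchcase


def source(available, pattern):
    """Return singleton bags for every file name matching `pattern`."""
    return [[name] for name in available if fnmatchcase(name, pattern)]


# Hypothetical initial collection of XML files.
files = ["foo.xml", "bib90.xml", "bib91.xml", "books.xml"]

everything = source(files, "*")
just_foo = source(files, "foo.xml")
bibs = source(files, "bib*.xml")
print(bibs)
```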
Putting it all together… • Query: “Book titles of authors named Ramakrishnan who have written an article as well” • Plan, from the top: v(raghu_title), then Expose(title:book), then j(author:book = author:article) • Left join input: Follow(book) over s(books.xml) • Right join input: Select(author:article = Ramakrishnan) over Follow(article) over s(articles.xml) • Output: <raghu_title>Principles…</raghu_title>
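The whole plan can be traced end to end in a short sketch. Everything here is an illustrative assumption rather than Niagara's implementation: bags are dicts from a path label to an `ElementTree` element, `source` folds the Source operator and the first Follow together, and authors are un-nested with an explicit Follow before the join.

```python
import xml.etree.ElementTree as ET

BOOKS = """<bib>
  <book isbn="02"><title>Principles of Database Systems</title>
    <author>Ramakrishnan</author></book>
  <book isbn="03"><title>Niagara Blues</title>
    <author>Galanis</author></book>
</bib>"""
ARTICLES = """<proc>
  <article><title>Magic is relevant</title>
    <author>Ramakrishnan</author></article>
  <article><title>The Niagara Insomniac</title>
    <author>Viglas</author></article>
</proc>"""


def source(text, arc):
    """Source plus the first Follow: one bag per `arc` child of the root."""
    return [{arc: v} for v in ET.fromstring(text).findall(arc)]


def follow(collection, start, arc):
    """Un-nest `arc` from the vertex labelled `start`; label hits 'arc:start'."""
    label = arc + ":" + start
    return [{**bag, label: hit}
            for bag in collection for hit in bag[start].findall(arc)]


def select(collection, predicate):
    return [bag for bag in collection if predicate(bag)]


def join(left, right, condition):
    """Concatenate the joining bags (labels on the two sides do not clash)."""
    return [{**l, **r} for l in left for r in right if condition(l, r)]


def vertex(tag, text):
    element = ET.Element(tag)
    element.text = text
    return element


books = follow(source(BOOKS, "book"), "book", "author")
articles = select(
    follow(source(ARTICLES, "article"), "article", "author"),
    lambda bag: bag["author:article"].text == "Ramakrishnan",
)
joined = join(
    books, articles,
    lambda b, a: b["author:book"].text == a["author:article"].text,
)
# Expose title:book, then wrap it with v(raghu_title).
output = [
    ET.tostring(vertex("raghu_title", bag["book"].findtext("title")),
                encoding="unicode")
    for bag in joined
]
print(output)
```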
Summary • Operators operate on a collection of bags of vertices • Path expressions identify vertices • Following of path expressions, Selections and Joins filter the input • XML output is constructed with Expose/Vertex operations • These are complicated data, so it’s a complicated algebra • …but it seems to work