Peter Boncz CWI The Netherlands MonetDB/XQuery: Using a Relational DBMS for XML
Peter Boncz Pathfinder - MonetDB/XQuery TU Delft 10-5-2005 Outline • Basic XML / XQuery • Introduction of Pathfinder and MonetDB projects • Relational XQuery • XPath steps in the pre/post plane • Translating for-loops, and beyond • Optimizations • Order prevention • Loop-Lifted Staircase join • Join recognition • Outlook • Conclusions
XML • Standard, flexible syntax for data exchange • Regular, structured data Database content of all kinds: Inventory, billing, orders, … “Small” typed values • Irregular, unstructured text Documents of all kinds: Transcripts, books, legal briefs, … “Large” untyped values • Lingua franca of B2B Applications… • Increase access to products & services • Integrate disparate data sources • Automate business processes • … and numerous other application domains • Bio-informatics, library science, …
XML: A First Look • XML document describing catalog of books
<?xml version="1.0" encoding="ISO-8859-1" ?>
<catalog>
  <book isbn="ISBN 1565114302">
    <title>No Such Thing as a Bad Day</title>
    <author>Hamilton Jordan</author>
    <publisher>Longstreet Press, Inc.</publisher>
    <price currency="USD">17.60</price>
    <review>
      <reviewer>Publisher</reviewer>: This book is the moving account of one man's successful battles against three cancers ...<title>No Such Thing as a Bad Day</title> is warmly recommended.
    </review>
  </book>
  <!-- more books and specifications -->
</catalog>
XQuery 1.0 • Functional, strongly-typed query language • XQuery 1.0 = XPath 2.0 for navigation, selection, extraction + A few more expressions For-Let-Where-Order By-Return (FLWOR) XML construction Operators on types + User-defined functions & modules + Strong typing
XSLT vs. XQuery • XSLT 1.0: XML → XML, HTML, Text • Loosely-typed scripting language • Format XML in HTML for display in browser • Must be highly tolerant of variability/errors in data • XQuery 1.0: XML → XML • Strongly-typed query language • Large-scale database access • Must guarantee safety/correctness of operations on data • Over time, XSLT & XQuery may both serve needs of many application domains • XQuery will become a hidden, commodity language
Navigation, Selection, Extraction • Titles of all books published by Longstreet Press
$cat/catalog/book[publisher="Longstreet Press"]/title
<title>No Such Thing as a Bad Day</title>
• Publications with Jerome Simeon as author or editor
$cat//*[(author|editor) = "Jerome Simeon"]
<book><title>XQuery from the Experts</title>…</book>
<spec><title>XQuery Formal Semantics</title>…</spec>
Transformation & Construction • First author & title of books published by A/W
for $b in $cat//book[publisher = "Addison Wesley"]
return <awbook> { $b/author[1], $b/title } </awbook>
<awbook>
  <author>Don Chamberlin</author>
  <title>XQuery from the Experts</title>
</awbook>
Sequences & Iteration • Sequence constructor: return all books followed by all W3C specifications
($cat/catalog/book, $cat/catalog/W3Cspec)
• XPath expression: return all books & W3C specifications in doc order
$cat/catalog/(book|W3Cspec)
• For expression • Similar to map: apply function to each item in sequence. Return number of authors in each book
for $b in $cat/catalog/book return fn:count($b/author)
=> (3,1,2,…)
Conditional & Quantified • Conditional
if (//show[year >= 2000]) then "A-OK!" else "Error!"
• Existential quantification • Implicit meaning of predicate expressions
//show[year >= 2000]
• Explicit expression:
//show[some $y in ./year satisfies $y >= 2000]
• Universal quantification
//show[every $y in year satisfies $y >= 2000]
Putting It Together • For each author, return the number of books and the receipts of books published in past 2 years, ordered by name
let $cat := fn:doc("www.bn.com/catalog.xml"),
    $sales := fn:doc("www.publishersweekly.com/sales.xml")      (: Join :)
for $author in distinct-values($cat//author)                     (: Grouping :)
let $books := $cat//book[@year >= 2000 and author = $author],
    $receipts := $sales/book[@isbn = $books/@isbn]/receipts      (: S.J. :)
order by $author                                                 (: Ordering :)
return                                                           (: XML Construction :)
  <sales>
    { $author }
    <count> { fn:count($books) (: Aggregation :) } </count>
    <total> { fn:sum($receipts) } </total>
  </sales>
Recursive Processing • Recursive functions support recursive data
Input:
<part id="001">
  <part id="002">
    <part id="003"/>
  </part>
  <part id="004"/>
</part>
=> Output:
<partCt count="2" id="001">
  <partCt count="1" id="002">
    <partCt count="0" id="003"/>
  </partCt>
  <partCt count="0" id="004"/>
</partCt>
declare function partCount($p as element(part)) as element(partCt) {
  <partCt count="{ count($p/part) }">
    { $p/@id, for $p2 in $p/part return partCount($p2) }
  </partCt>
}
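The same recursion can be tried outside an XQuery engine. What follows is only a minimal Python sketch of the part-to-partCt transformation on this slide, not code from the talk: the function name `part_count` is hypothetical, and the standard library's `xml.etree.ElementTree` stands in for the XQuery data model.

```python
import xml.etree.ElementTree as ET

def part_count(part):
    """Build a <partCt> element mirroring a <part> element (hypothetical helper)."""
    children = part.findall("part")  # direct <part> children only, like $p/part
    out = ET.Element("partCt", {
        "count": str(len(children)),
        "id": part.get("id", ""),    # copy the id attribute, like $p/@id
    })
    for child in children:
        out.append(part_count(child))  # recurse into each sub-part
    return out

src = ET.fromstring(
    '<part id="001"><part id="002"><part id="003"/></part><part id="004"/></part>'
)
result = part_count(src)
print(ET.tostring(result, encoding="unicode"))
```

The recursion depth follows the nesting depth of the document, which is why the slide calls recursive functions a natural fit for recursive data.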
XML Schema Languages • Many variants… • DTDs, XML Schema, RELAX NG, XDuce • … with similar goals to define • Types of literal (terminal) data • Names of elements & attributes • XQuery designed to support (all of) XML Schema • Structural & name constraints over types • Regular tree expressions over elements, attributes, atomic types
TeXQuery: Full-text extensions • Text search & querying of structured content • Limited support in XQuery 1.0 • String operators with collation sequences
$cat//book[contains(review/text(), "two thumbs up")]
• Stop words, proximity searching, ranking. Ex: "Tony Blair" within two words of "George Bush" • Phrases that span tags and annotations. Ex: Match "Mr. English sponsored the bill" in
<sponsor> Mr. English </sponsor> <footnote> for himself and <co-sponsor> Mr. Coyne </co-sponsor> </footnote> sponsored the bill in the <committee-name> Committee for Financial Services </committee-name>
XQuery Systems: 2 Approaches • Tree-based • Tree is basic data structure • Also on disk (if an XQuery DBMS) • Navigational Approach • Galax [Simeon..], Flux [Koch..], X-Hive • Tree Algebra Approach • TIMBER [Jagadish..] • Relational • Data shredded in relational tables • XQuery translated into database query (e.g. SQL)
The Pathfinder Project • Challenge / Goal: • Turn RDBMSs into efficient XQuery engines • People: • Maurice van Keulen • University of Twente • Torsten Grust, Jens Teubner • University of Konstanz • Jan Rittinger • University of Konstanz & CWI • Task: generate code for MonetDB
MonetDB: Applied CS Research at CWI • a decade of "query-intensive" application experience • image retrieval: Peter Bosch ImageSpotter • audio/video retrieval: Alex van Ballegooij RAM • XML text retrieval: de Vries / Hiemstra TIJAH • biological sequences: Arno Siebes BRICKS • XML databases: Albrecht Schmidt XMark • Grust / van Keulen Pathfinder • GIS: Wilco Quak MAGNUM • data warehousing / OLAP / data mining • SPSS DataDistilleries • Univ. Massachusetts PROXIMITY • CWI research group successfully spun off DataDistilleries (now SPSS)
Pathfinder - MonetDB [Figure: architecture diagram. Surviving labels: Parser, Sem. Analysis, Core Translation, Typechecking, Relational Algebra, Core to MIL Translation, MIL (Query Algebra), SQL, Database, MonetDB]
Open Source • MonetDB + Pathfinder on Sourceforge • Mozilla License • Project Homepage • http://monetdb.cwi.nl • Developers website: • http://sf.net/projects/monetdb RoadMap • 14-apr-04: initial Beta release MonetDB/SQL • 30-sep-04: first official release MonetDB/SQL • 30-may-05: beta release of MonetDB/XQuery (i.e. Pathfinder)
MonetDB
MonetDB Particulars • Column wise fragmentation • BAT: Binary Association Tables [oid,X] • Don’t touch what you don’t need
Binary Association Tables (BATs)
BAT storage as thin arrays
MonetDB Particulars • Column wise fragmentation • BAT: Binary Association Tables [oid,X] • Don’t touch what you don’t need • Void (virtual-oid) columns • Contain dense sequence 0,1,2,3,4,… • Require no space • Positional access (nice for XPath skipping) • pre = void
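The bullets above can be made concrete with a toy model. This is a minimal Python sketch, not MonetDB code: the class `VoidBat` and its methods `fetch` and `select` are hypothetical names, and a Python list stands in for the thin array that holds the tail column. Because the head is a dense sequence, it is never stored, and an oid doubles as an array offset.

```python
from dataclasses import dataclass
from typing import Generic, List, TypeVar

T = TypeVar("T")

@dataclass
class VoidBat(Generic[T]):
    """Toy stand-in for a BAT [void, T]: only the tail column is materialized."""
    tail: List[T]
    seqbase: int = 0  # first oid of the dense (virtual) head sequence

    def fetch(self, oid: int) -> T:
        # Positional access: the oid is an offset, so no search or index is needed.
        return self.tail[oid - self.seqbase]

    def select(self, value: T) -> List[int]:
        # Scans this one column only; other columns of the table stay untouched.
        return [self.seqbase + i for i, v in enumerate(self.tail) if v == value]

# One three-row table, fragmented column-wise into two BATs that share oids.
name = VoidBat(["catalog", "book", "title"])
level = VoidBat([0, 1, 2])

hits = name.select("book")                           # oids of matching rows
levels_of_hits = [level.fetch(oid) for oid in hits]  # positional lookup in another column
print(hits, levels_of_hits)
```

If the pre-order rank of a node is used as the oid ("pre = void"), looking up any property of node number n is a single array access.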
DBMS Architecture
Monet: DBMS Microkernel
MonetDB: extensible architecture • Front-end/back-end: • support multiple data models • support multiple end-user languages • support diverse application domains [Figure label: Pathfinder XQuery Frontend]
Architecture
Outline • Basic XML / XQuery • Introduction of Pathfinder and MonetDB projects • Relational XQuery • XPath steps in the pre/post plane • Translating for-loops, and beyond • MonetDB Implementation • Data structures • Optimizations • Order prevention • Loop-Lifted Staircase join • Join recognition • Outlook • Conclusions
XPath on an RDBMS • Node-based relational encoding of XQuery's data model
Pre/Post • Pre/Level/Size encoding: done for better skipping and updates
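The two numberings named on this slide can be computed in one document traversal. The following is a minimal Python sketch under stated assumptions, not the Pathfinder shredder: only element nodes are numbered (text and attribute nodes are ignored), and the names `encode`, `descendants_by_size` and `descendants_by_post` are hypothetical. With dense pre-numbers, the descendants of a node are exactly the `size` rows that follow it, which is what makes positional skipping possible.

```python
import xml.etree.ElementTree as ET

def encode(root):
    """Return one row (pre, post, size, level, tag) per element, in pre order."""
    rows = []
    post_counter = 0

    def visit(node, level):
        nonlocal post_counter
        pre = len(rows)           # pre rank = position in document order
        rows.append(None)         # reserve the slot at index pre
        for child in node:
            visit(child, level + 1)
        size = len(rows) - pre - 1  # number of descendants
        rows[pre] = (pre, post_counter, size, level, node.tag)
        post_counter += 1         # post rank = order in which nodes are closed

    visit(root, 0)
    return rows

def descendants_by_size(rows, pre):
    # pre/size: descendants are a contiguous slice, no comparison per row.
    size = rows[pre][2]
    return rows[pre + 1 : pre + 1 + size]

def descendants_by_post(rows, pre):
    # pre/post plane: v is a descendant of c iff pre(v) > pre(c) and post(v) < post(c).
    ctx = rows[pre]
    return [r for r in rows if r[0] > ctx[0] and r[1] < ctx[1]]

doc = ET.fromstring("<a><b><c/><d/></b><e/></a>")
table = encode(doc)
for row in table:
    print(row)
```

For element-only trees numbered this way, the two encodings carry the same information, since post = pre + size - level holds for every row.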
Updates • Dense pre-numbers are nice for XPath • Positional skipping in Staircase join! • But how to handle updates? [Figure: pre-numbering, dense vs. not dense]
XPath XQuery
Sequence Representation • sequence = table of items • add pos column for maintaining order • ignore polymorphism for the moment
(10, "x", <a/>, 10) →
pos   item
1     10
2     "x"
3     pre(a)
4     10
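The pos/item table can be mimicked with plain tuples. This is a minimal Python sketch, not MonetDB's storage layout: `NodeRef`, `to_table` and `from_table` are hypothetical names, and the pre value 7 for the `<a/>` node is made up for illustration. Like the slide, it ignores polymorphism and simply lets one column hold mixed item types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeRef:
    """A node item is stored as a reference to its pre rank (hypothetical type)."""
    pre: int

def to_table(seq):
    # The explicit pos column records order, so the rows themselves are unordered.
    return [(pos, item) for pos, item in enumerate(seq, start=1)]

def from_table(rows):
    # Sequence order is recovered by sorting on pos, not by storage order.
    return [item for _, item in sorted(rows, key=lambda r: r[0])]

seq = [10, "x", NodeRef(7), 10]  # models (10, "x", <a/>, 10); 7 is an assumed pre value
table = to_table(seq)
shuffled = list(reversed(table))  # a relation has no inherent row order
print(table)
print(from_table(shuffled) == seq)
```

Note that the duplicate item 10 appears twice with different pos values, so a sequence with repeated items is still a proper set of rows.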
For-loops: the iter column
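Only the title of this slide survives, so the following is an assumption about what the iter column does rather than a reconstruction of the slide: a minimal Python sketch of the loop-lifting idea, in which every iteration of a for-loop gets its own iter value and the loop body is evaluated for all iterations at once over rows (iter, pos, item). The query `for $x in (10, 20) return ($x, $x + 1)` and the names `lift_for`, `body` and `flatten` are chosen purely for illustration.

```python
def lift_for(seq):
    # Bind $x: in iteration i the variable is the singleton (iter=i, pos=1, item).
    return [(i, 1, item) for i, item in enumerate(seq, start=1)]

def body(x_rows):
    # Evaluate ($x, $x + 1) for every iteration in one pass over the table.
    out = []
    for it, _, item in x_rows:
        out.append((it, 1, item))      # first item of this iteration's result
        out.append((it, 2, item + 1))  # second item of this iteration's result
    return out

def flatten(rows):
    # Result of the whole for-expression: order by (iter, pos), then renumber pos.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    return [(pos, item) for pos, (_, _, item) in enumerate(ordered, start=1)]

lifted = lift_for([10, 20])
result = flatten(body(lifted))
print(lifted)
print(result)
```

The point of the extra column is that the per-iteration work becomes a bulk operation on one table, which is the kind of operation a relational engine is built for.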