360 likes | 526 Views
Storing and Querying XML Documents Using Relational Databases. Mustafa Atay matay@wayne.edu Wayne State University Detroit, MI February 28, 2006. Outline of Talk. Schema Mapping Data Mapping Query Mapping Reconstruction Conclusions. What is XML? HTML vs. XML Problem Statement
E N D
Storing and Querying XML Documents Using Relational Databases Mustafa Atay matay@wayne.edu Wayne State University Detroit, MI February 28, 2006
Outline of Talk • Schema Mapping • Data Mapping • Query Mapping • Reconstruction • Conclusions • What is XML? • HTML vs. XML • Problem Statement • Schema-based Relational Approach Wayne State University
What is XML? • eXtensible Markup Language • primarily created by Jon Bosak of Sun Microsystems • officially recommended by W3C (World Wide Web Consortium) since 1998 • a simplified form of SGML (Standard Generalized Markup Language) Wayne State University
What is XML? (cont.) • a meta language • allows you to create and format your own document markups • separates content from format • a method for putting structured data into a text file; these files are • easy to read • unambiguous • extensible • platform-independent Wayne State University
<html> <body> <table border=1> <tr> <td><b><font color=red>First Value</td> <td><b><font type="Arial" color=blue>Second Value</td> <tr> <td><b>87</td> <td><b>99</td> <tr> <td><b>45</td> <td><b>67</td> <tr> <td><b>86</td> <td><b>84</td> </table> </body> </html> HTML vs. XML Wayne State University
<?xml version="1.0"?> <!DOCTYPE house SYSTEM "house.dtd"> <house year=“1980”> <entrance>front door</entrance> <entrance>back door</entrance> <rooms> <living_room> <window>double hung1</window> <window>double hung2</window> <leads_to>kitchen</leads_to> <leads_to>hallway</leads_to> </living_room> <kitchen> <window>double hung1</window> <leads_to>living_room</leads_to> </kitchen> </rooms> </house> HTML vs. XML (cont.) Wayne State University
HTML vs. XML (cont.) • HTML - uses tags and attributes - content and formatting can be placed together - tags and attributes are pre-determined and rigid - describes what a document looks like - doesn’t allow user to define content rules • XML - uses tags and attributes - content and format are separate; formatting is contained in a stylesheet - allows user to create his/her own set of tags and attributes - describes the information in a document - allows user to define content rules (DTD) Wayne State University
Why Storing and Querying XML? • XML has emerged as the standard for representing and exchanging data on the World Wide Web. • The increasing amount of XML documents requires the need to store and query XML data efficiently. Wayne State University
A Sample XML Dataset • European Bioinformatics Institute Databases • ftp://ftp.ebi.ac.uk/pub/databases/interpro/ • match.xml ~ 700MB Wayne State University
Approaches of Storing and Querying XML Documents • using Native XML repositories • Software AG’s Tamino • eXcelon’s XIS • using XML-enabled commercial database systems • Oracle XML DB • DB2 XML Extender • Microsoft SQLXML • using RDBMS/ORDBMS to store and query XML documents (Relational Approach) Wayne State University
Why to store XML in RDBMS? • to get advantage of mature RDBMS technology in efficient storage, indexing and optimization techniques • to enable companies or researchers to store and query XML data using their existing RDBMS system • to enable processing of transformed XML data using both XML and relational queries from a middleware environment Wayne State University
Relational Approach • XML-Publishing • XPERANTO - Carey et al., WebDB’00 • SilkRoute – M. Fernandez et al., WWW’00 • Schema-less approach • Edge – D. Florescu et al., IEEE DEB’99 • STORED – A. Deutsch et al., SIGMOD’99 • Schema-based approach • Basic, Shared and Hybrid inlining – J. Shanmugasundaram et al., VLDB’99 • ODTDMap – M. Atay et al., IS’06 Wayne State University
Schema-based Relational Approach • Schema Mapping • XML data model is mapped into the relational model • Data Mapping • XML documents are shredded and composed into tuples to be inserted into the relational database • Query Mapping • XML queries are translated into SQL queries • Reverse Data Mapping (Reconstruction) • Original XML document is recovered from the RDBMS Wayne State University
Schema Mapping • Schema mapping algorithm ODTDMap contains the following steps: • Simplifying DTDs • Creating and inlining DTD graphs • Generating relational schema and the schema mapping file. Wayne State University
Sample DTD – univ.dtd <!DOCTYPE univ [ <!ELEMENT univ (colleges, schools?) > <!ATTLIST univ uName CDATA #REQUIRED> <!ELEMENT colleges (college+) > <!ELEMENT college (dep*) > <!ATTLIST college cName CDATA #REQUIRED> <!ELEMENT schools (school+) > <!ELEMENT school (dep*) > <!ATTLIST school sName CDATA #REQUIRED> <!ELEMENT dep (tel?, fax?, website?) > <!ATTLIST dep dName CDATA #REQUIRED> <!ELEMENT tel (#PCDATA) > <!ELEMENT fax (#PCDATA) > <!ELEMENT website (#PCDATA) > ]> Wayne State University
Creating DTD Graph <!DOCTYPE univ [ <!ELEMENT univ (colleges, schools?) > <!ATTLIST univ uName CDATA #REQUIRED> <!ELEMENT colleges (college+) > <!ELEMENT college (dep*) > <!ATTLIST college cName CDATA #REQUIRED> <!ELEMENT schools (school+) > <!ELEMENT school (dep*) > <!ATTLIST school sName CDATA #REQUIRED> <!ELEMENT dep (tel?, fax?, website?) > <!ATTLIST dep dName CDATA #REQUIRED> <!ELEMENT tel (#PCDATA) > <!ELEMENT fax (#PCDATA) > <!ELEMENT website (#PCDATA) > ]> Wayne State University
Inlining DTD Graph Wayne State University
Generating Relational Schema Wayne State University
Data Mapping • Challenging issues of data mapping • Should respect to schema mapping • Varying document structure • Scalability • We introduced two efficient linear algorithms • OXInsert • main memory data mapping algorithm • DOM-based • SDM • streaming data mapping algorithm • SAX-based Wayne State University
Sample XML document - univ.xml <?xml version=”1.0” ?> <!DOCTYPE univ SYSTEM “univ.dtd”> <univ uname=”WSU”> <colleges> <college cname=”Science”> <dep dname=”CS”> <website>www.cs.wayne.edu</website> </dep> </college> <college cname=”Engineering”> <dep dname=”ECE”> <tel>313-5773920</tel> </dep> <dep dname=”IE”></dep> </college> <college cname=”Pharmacy”></college> </colleges> </univ> Wayne State University
XMLTree for univ.xml <?xml version=”1.0” ?> <!DOCTYPE univ SYSTEM “univ.dtd”> 1 <univ uname=”WSU”> 3 <colleges> 4 <college cname=”Science”> 6 <dep dname=”CS”> 8 <website>www.cs.wayne.edu</website> </dep> </college> 9 <college cname=”Engineering”> 11 <dep dname=”ECE”> 13 <tel>313-5773920</tel> </dep> 14 <dep dname=”IE”></dep> </college> 16 <college cname=”Pharmacy”></college> </colleges> </univ> Wayne State University
XMLTree for univ.xml <?xml version=”1.0” ?> <!DOCTYPE univ SYSTEM “univ.dtd”> 1<univ uname=”WSU”> 3<colleges> 4 <college cname=”Science”> 6 <dep dname=”CS”> 8 <website>www.cs.wayne.edu</website> </dep> </college> 9 <college cname=”Engineering”> 11 <dep dname=”ECE”> 13 <tel>313-5773920</tel> </dep> 14 <dep dname=”IE”></dep> </college> 15 <college cname=”Pharmacy”></college> </colleges> </univ> Wayne State University
Database state after univ.xml is mapped Wayne State University
Performance of OXInsert and SDM Wayne State University
Data Mapping Across Different Schema Mappings Wayne State University
Query Mapping • We translate simple XPath expressions to SQL • XPath is the core of XML query languages. • We identified 3 algorithms for query mapping • Naïve • Cluster • Containment Join Wayne State University
Naïve • Takes an XPath expression creates a nested SQL query comprised of SQL queries for each XPath step • e.g. • XPath: /univ /colleges /college /dep[@dName=‘CS’] • SQL: Select dep.ID from dep where dep.dName=‘CS’ and dep.parentID in (Select college.ID from college where college.parentID in (Select colleges.ID from univ where colleges.parentID in (Select univ.ID from univ where univ.parentID=0) ) ) Wayne State University
Cluster • A cluster is a sequence of consecutive elements stored in the same table • Takes an XPath expression and creates a nested SQL query comprised of SQL queries for each XPath cluster • e.g. • XPath: /univ /colleges/college/dep[@dName=‘CS’] • SQL: Select dep.ID from dep where dep.dName=‘CS’ and dep.parentID in (Select college.ID from college where college.parentID in (Select colleges.ID from univ) ) Wayne State University
Containment Join • Relies on the well-formedness of XML documents • Requires the pre-computation of max. ID of descendants of each element instance (endID) • Facilitates efficient evaluation of recursive XML queries • e.g. • XPath: /univ /colleges /college /dep[@dName=‘CS’] • SQL: Select dep.ID from dep, college, univ where dep.dName=‘CS’ and dep.ID>=college.ID and dep.ID<=college.endID and college.parentID=univ.colleges.ID Wayne State University
A Recursive Query Example • XPath: /univ //dep • Sub queries of the recursive query • /univ /colleges /college /dep • /univ /schools /school /dep • Naïve: 8 SQL queries + 6 joins + 1 union • Cluster: 6 SQL queries + 4 joins + 1 union • Containment Join: 1 SQL query + 1 join Select dep.ID from dep, univ where dep.ID>=univ.ID and dep.ID<=univ.endID Wayne State University
Reconstruction • In query mapping stage, the elements selected by an XML query can be returned in one of the following two modes: • Select mode: returns IDs • Reconstruct mode: returns XML subtrees • Algorithm Reconstruct reconstructs the XML subtree rooted at a given element • The importance of Reconstruction lies in two aspects: • XML subtree reconstruction has great impact on the query response time in reconstruct mode. • It demonstrates that our mapping scheme is lossless Wayne State University
Conclusions • Schema mapping [1,5] • lossless and order preserving • processing set-valued XML attributes • simple processing of recursion • Data mapping [1,3] • We described the first linear-time schema-based data mapping algorithms • We justified their effectiveness on different schema mapping algorithms Wayne State University
Conclusions (cont.) • Query mapping • We identified 3 algorithms • Our CJ algorithm outperforms the only published recursive query mapping algorithm by Krishnamurthy et al., IEEE ICDE’04 • Reconstruction [2] • We introduced an efficient reconstruction algorithm • It can be used in relational schema-based mapping unlike its rivals used in XML-publishing Wayne State University
Future Work • Extending the schema mapping to XML Schema • Extending the query mapping to XQuery • Introducing DTD/Schema constraints to the proposed mapping scheme • Incorporating access control methods to the proposed mapping scheme Wayne State University
Acknowledgements • Dr. Shiyong Lu • Dr. Farshad Fotouhi • Artem Chebotko • Dapeng Liu • Yezhou Sun Wayne State University
Publications • Mustafa Atay, Artem Chebotko, Dapeng Liu, Shiyong Lu, Farshad Fotouhi, "Efficient Schema-based XML-to-Relational Data Mapping", International Journal of Information Systems, 2006. (to appear) • Artem Chebotko, Dapeng Liu, Mustafa Atay, Shiyong Lu and Farshad Fotouhi, “Reconstructing XML Subtrees from Relational Storage of XML Documents”, in Proc. of the 2nd International Workshop on XML Schema and Data Management (XSDM’05), in conjunction with ICDE’2005, Tokyo, Japan, April, 2005 • Mustafa Atay, Yezhou Sun, Dapeng Liu, Shiyong Lu and Farshad Fotouhi, “Mapping XML Data to Relational Data: DOM-based Approach”, in Proc. of the 8th IASTED International Conference on Internet and Multimedia Systems and Applications (IMSA’2004). Kauai, Hawaii, USA. August, 2004. • Shiyong Lu, Yezhou Sun, Mustafa Atay, and Farshad Fotouhi, “On the consistency of XML DTDs”,International Journal of Data and Knowledge Engineering, 2004. • Shiyong Lu, Yezhou. Sun, Mustafa Atay, and Farshad Fotouhi, “A new inlining algorithm for mapping XML DTDs to relational schemas”, In Proc. of the First International Workshop on XML Schema and Data Management, in conjuction with the 22nd ACM International Conference on Conceptual Modeling (ER2003), Chicago, Illinois, USA, October 2003. • Shiyong Lu, Yezhou Sun, Mustafa Atay, Farshad Fotouhi, "A Sufficient and Necessary Condition for the Consistency of XML DTDs", in Proc. of the First International Workshop on XML Schema and Data Management, in conjunction with the 22nd ACM International Conference on Conceptual Modeling (ER'2003), Chicago, Illinois, USA, October, 2003. Wayne State University