A Metadata Integration Assistant Generator for Heterogeneous Databases

A Metadata Integration Assistant Generator for Heterogeneous Databases Young-Kwang Nam Joseph Goguen Guilian Wang

Data Integration in Synthetic Scientific Applications Applications Integrated result without inconsistency, etc. Query global unified schema/ontology Integration System local schema/ontology local schema/ontology local schema/ontology … data source 1 data source 2 data source n

Platform & System Heterogeneity OS, Hardware DBMSs, Concurrency control and recovery capabilities Syntactic & Structural Heterogeneity Machine readable aspects of representation Data models, Schemas, Semantic Heterogeneity Naming conflicts: synonyms, homonyms Scaling & precision conflicts Sampling rates, error distribution, etc. Why Difficult: Data Heterogeneity

No all-encompassing system satisfies everyone: frequent update of sources frequent change of user requirements non-published data from one’s own lab Simplicity and readability are more desirable than completeness or exhaustiveness to domain scientists Domain knowledge is crucial for solving heterogeneities query optimization Desirable to support domain scientists to do data integration on their own More Difficult: Flexible Integration

A Common Data Integration Architecture Query Result An Integrated View Materialized or Virtual Mediator Wrapper Wrapper Wrapper data source 1 data source 2 data source n …

Structural approach (Mediated schema approach) integration by generating mediated schema that characterize a set of data sources Semantic approach (Ontology-based approach) difficult to integrate structural aspects of sources from semantic perspective due to inherent embedded semantics within local schemas & implicit assumptions integration by sharing a common ontology among the differentdata sources Structural vs. Semanticwrt Mediation Level

Global-as-view approach each item in Global schema/ontology as a view (query) over source schemas/ontologies query(G) = query(f(S1, S2, …, Sn)) straightforward query rewriting Local-as-view approach Each source as a view/query over global schema/ontology query(G) = query(f1-1 (S1), f2-1(S2), …, fn-1 (Sn)) easy adding or removing sources Global-as-view vs. Local-as-viewwrt Mapping Direction

TSIMMIS (Stanford & IBM, 1995) MedMaker (Stanford, 1996) MIX (SDSC&UCSD, 2000) IM (AT&T, 1996) Clio+Garlic (IBM, 2000) DIXSE (UT, 2001) XYLEME (2001) Representative Systems • HERMES (UMD, 1994) • SIMS (USC, 1996) • Observer (UG, 1996) • Infosleuth (MCC, 1997) • COIN (MIT, 1999) • Ontobroker (Ger., 2000) • KIND (SDSC&UCSD, 2001)

Virtual Integration: retrieve data and resolve conflicts at query time, easy maintenance Structural Approach: take users’ knowledge on data semantics hidden in structural information as input to achieve semantic mediation Local-as-view: easily adds or removes sources, convenient to fit applications GUI for specifying semantic mappings through assigning same index to same meaning nodes (paths) Automatically generate DDXMI for query decomposition Semantic functions Our Approach

User query (XML query) Column or Path DDXMI query Generator/ collector Column or Path For each DB query2 queryn resultn query1 result2 result1 XML/DB engine2 XML/DB enginen XML/DB engine1 XML/DB1 XML/DB2 XML/DBn Current Prototype Architecture

Include Database or XML document name or location information Contain table columns or XML path information Function or operation name for resolving semantic issues about table columns or XML elements and attributes Distributed Database XML Metadata Interface (DDXMI)

DDXMI DTD <!ELEMENT DDXMIA (DDXMI.header, DDXMI.isequivalent, documentspec)> <!ELEMENT DDXMI.header (documentation,version,date,authorization)> <!ELEMENT documentation (#PCDATA)> <!ELEMENT version (#PCDATA)> <!ELEMENT date (#PCDATA)> <!ELEMENT authorization (#PCDATA)> <!ELEMENT DDXMI.isequivalent (source,destination*)*> <!ELEMENT source (#PCDATA)> <!ELEMENT destination (#PCDATA)> <!ELEMENT documentspec (document, (elementname,operation*)*)> <!ELEMENT document (#PCDATA)> <!ELEMENT elementname (#PCDATA)> <!ELEMENT operation (#PCDATA)>

Define a Master DTD (global schema) based on application requirements for choosing elements or tables from the distributed systems Parse the master DTD and generate a path for each element from root to current element Assign the master index number to the site element node which has the same meaning of the master DTD node May include a function name for some nodes Generate DDXMI file automatically by collecting over same index numbers How to generate DDXMI

Generate Master Index

Site1 : Book1 DTD Tree Index number function name

0 book.xml 1 /book 11 /book/price 12 /book/author 121 /book/author/full_name 1211 /book/author/full_name/first_name 1212 /book/author/full_name/last_name 13 /book/title 14 /book/year 15 /book/publisher 16 /book/editor 161 /book/editor/affiliation 162 /book/editor/full_name Book1 Path Information Site1 Index Master Index • 0 book1.xml • 1 /bib/book • 11 /bib/book/price • /bib/book/author • 1211 /bib/book/author/first • 1212 /bib/book/author/last • /bib/book/title • 15 /bib/book/publisher • 16 /bib/book/editor • 161 /bib/book/editor/affiliation • 162 /bib/book/editor/last • 162 /bib/book/editor/first

Site 2 : Book2 DTD Tree

0 book.xml 1 /book 11 /book/price 12 /book/author 121 /book/author/full_name 1211 /book/author/full_name/first_name 1212 /book/author/full_name/last_name 13 /book/title 14 /book/year 15 /book/publisher 16 /book/editor 161 /book/editor/affiliation 162 /book/editor/full_name Book2 Path Information Site2 Index Master Index 0 book2.xml 1 /arts/book 12 /arts/book/author 1211 /arts/book/author/firstname 1212 /arts/book/author/lastname 13 /arts/book/title 15 /arts/book/publisher

Site 3 : Book3 DTD Tree

0 book.xml 1 /book 11 /book/price 12 /book/author 121 /book/author/full_name 1211 /book/author/full_name/first_name 1212 /book/author/full_name/last_name 13 /book/title 14 /book/year 15 /book/publisher 16 /book/editor 161 /book/editor/affiliation 162 /book/editor/full_name Book3 Path Information Site3 Index Master Index 0 book3.xml • /bookstore/book 11 /bookstore/book/price 12 /bookstore/book/author 1211 /bookstore/book/author/name 1212 /bookstore/book/author/name 13 /bookstore/book/title

XQL : takes a document point of view XML-QL : takes a database point of view Quilt : draws from both areas proposed by Don Chamberlin, Jonathan Robie, and Daniela Florescu Kweelt (University of Washington), a XML query engine based on Quilt, used in our prototype XQuery proposal follows Quilt closely XML Query Languages

Parse the master query, a query over the global schema If encounter a path, depending on its kind, get corresponding path name from DDXMI file and substitute it If there is no corresponding path in the DDXMI, then put it as a null value  no queries generated for that site How to generate site queries

Master index Site Index bookstore book book editor year author price title publisher price_info affiliation full_name price full_name last_name first_name DDXMI How to get site element names [In Quilt Query] 1.book  bookstore/book 2. price  bookstore/book/price_info/price  price_info/price cut!! <source>book</source> <destination>booksore/book</destination> <source>book/price</source> <destination>bookstore/book/price_info/price<destination>

1:1 Mapping Example FOR $book IN document("book.xml")//book [publisher = "Addison-Wesley"] RETURN <book>$book/title</book> Book1 Master index bib Book2 book arts book book price editor publisher title year author title publisher publisher title Book3 full_name affiliation full_name bookstore book last_name first_name title

Query Execution Result

Book1 Master index bib Book2 arts book book book price editor editor year author title publisher first last DDXMI Book3 bookstore full_name affiliation full_name <source>/book/editor/full_name</source> <destination>/bib/book/editor/last,/bib/book/editor/first</destination> book last_name first_name 1:N Mapping Example FOR $edi IN document("book.xml")//book/editor RETURN <editor>$edi/full_name</editor>

Book1 Book2 bib arts Master index book book book author author price editor year author title publisher first Book3 last firstname lastname bookstore full_name full_name book affiliation author first_name last_name name N:1 Mapping Example FOR $a IN document("book.xml")//book//author RETURN <author> $a/last_name,$a/first_name </author> <operation>fstring</operation> <operation>lstring</operation>

Query Generation Result import split as UDF_split; FUNCTION fstring($str) { split(" ",$str)[1]} FUNCTION lstring($str) { split(" ",$str)[2]} FOR $a IN document("book3.xml") //book//author RETURN <author> fstring($a/name),lstring($a/name) </author>

Book1 Master index bib Book2 arts book book book price editor price year author title publisher Book3 bookstore full_name affiliation full_name book last_name first_name price Semantic Function Involved Example FOR $book IN document("book.xml")//book RETURN <book> $book/title,$book/author,$book/price </book> <operation>div(100)</operation>

Handle attributes: one DTD has an attribute but others don’t, or an attribute in one DTD as an element in others More efficient way for generating DDXMI file automatically when there are many paths in the master DTD e.g., tree:tree mapping: if two paths are indicated as the same and have the same children, then the index numbers should be generated automatically Migrate to XML schemas, instead of DTDs Support JOIN, PRODUCT generated by queries Move to XQuery and a query engine with distributed query support Integrate the individual site query results as one return as a single data source ready for further analysis Provide mechanisms for removing redundancy Justify the semantics of the query generated Remaining Issues

Our prototype uses distributed metadata to generate a GUI tool to describe mappings between master and local databases by assigning index numbers and specifying conversion function names Uses Quilt as its XML query language. A DDXMI file is generated based on the mappings, and is used to translate queries over the virtual master database into sub-queries to local databases An experiment testing feasibility is reported in which 3 different bibliography databases are integrated. Implemented with Java Webserver and JavaCC Move to real applications, e.g. in the context of NSF project SEEK (Science Environment for Ecological Knowledge) Conclusion

A Metadata Integration Assistant Generator for Heterogeneous Databases