Managing Uncertainty of XML Schema Matching

Reynold Cheng, Jian Gong, David W. Cheung • The University of Hong Kong

The data integration problem • Querying the source data through a target query interface • Example: querying multiple data sources through a mediated query interface • [Figure: a query interface over the target schema, connected by schema mappings to the source schemas of several data sources]

Creation of the schema mapping • Step 1: find element correspondences, with similarities, between the source schema and the target schema • Schema matching tools can be used (COMA++) • Step 2: create a schema mapping from the correspondences • Match each source schema element with a target schema element • Example: Purchase Order schemas • Sample mapping • Order - ORDER • BOC - nil • BP - IP • BCN - ICN • …… • [Figure: element correspondences with similarities]

Schema mapping and uncertainty • The schema mapping between two schemas can be uncertain • There is an exponential number of possible schema mappings • Find the top-h schema mappings with the highest scores (a score aggregates the similarities of the correspondences a mapping contains) • Compute the probability of each mapping by normalizing the scores • Example: uncertain mappings between the Purchase Order schemas
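A minimal sketch of the normalization step, assuming each of the top-h mappings already has a non-negative score. The function name and the sample scores are illustrative, not taken from the talk.

```python
def mapping_probabilities(scores):
    """Turn mapping scores into probabilities by dividing each by the total."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    return {name: score / total for name, score in scores.items()}


# Hypothetical scores for three candidate mappings.
scores = {"M1": 3.0, "M2": 2.0, "M3": 1.0}
probs = mapping_probabilities(scores)
```

Only the top-h mappings are normalized, so the probabilities are relative to that set rather than to all possible mappings.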

Data integration reloaded • We study the problem of managing uncertainty in XML schema matching • 1. Representation of uncertain mappings? • 2. Efficient query evaluation? • 3. Uncertain mapping generation?

Our main observation • Substantial overlap among uncertain mappings • Shared element correspondences • Can this reduce storage cost? • Can this reduce query time?

Validating our assumption • How much overlap is there in real-world schema mappings? • Overlap ratio (o-ratio): the average overlap of the top-100 possible schema mappings
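The slide does not spell out how the overlap of two mappings is measured. The sketch below assumes one plausible reading: the overlap of a pair is the number of shared correspondences divided by the number of distinct correspondences in the pair, and the o-ratio is the average over all pairs. Both the formula and the names are assumptions.

```python
from itertools import combinations


def overlap(a, b):
    """Assumed pairwise overlap: shared correspondences over all correspondences."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def o_ratio(mappings):
    """Average pairwise overlap of a list of mappings (sets of (source, target) pairs)."""
    pairs = list(combinations(mappings, 2))
    if not pairs:
        return 1.0
    return sum(overlap(a, b) for a, b in pairs) / len(pairs)


# Two mappings that differ only in which source element maps to ICN.
m1 = {("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN")}
m2 = {("Order", "ORDER"), ("BP", "IP"), ("RCN", "ICN")}
ratio = o_ratio([m1, m2])
```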

Our contribution • Propose the block tree: a novel data structure to represent a set of uncertain schema mappings • Reduces storage cost for the mappings • Supports efficient query evaluation • Can be constructed efficiently • Propose the probabilistic twig query (PTQ) • Extends twig query semantics to uncertain mappings • Efficient evaluation with the block tree • Top-k PTQ and its computation issues • Improve the possible mapping generation process • A partitioning-based approach suitable for XML matching • Conduct experiments on real data to validate our methods

Related work • XML query evaluation • Twig query evaluation [QYD07] • Querying probabilistic XML document [KYS08] • Schema matching approaches and tools [RB01] • COMA [DR02] • Data integration • Theoretical foundation [Len02] • XML query rewriting for data integration [YP04] • Data integration with uncertainty [DHY07] • Managing uncertainty in schema matching • Find top-h schema mappings [Gal06] • Find h-maximum bipartite matching [Murty86]

Outline • Introduction • Approaches • Block tree • Query evaluation • Mapping generation • Results • Conclusion

Data model • XML schema and document [QYD07] • Node-labeled tree • Document nodes may carry text values • Schema mapping [DHY07] • One-to-one mapping (no conflicts) • Uncertain mappings • M1: Order-ORDER, …, BCN-ICN, … • M2: Order-ORDER, …, RCN-ICN, … • … • [Figure: source document, source schema, target schema]

The block • Each block, which is attached to a target schema element, consists of: • C: a set of correspondences • M: a set of mappings • Semantics: the mappings in M share the correspondences in C • Drawback: there can be an exponential number of blocks

The c-block • A c-block (constrained block) is a block that: • Contains correspondences for all elements in its sub-tree of the target schema (so that it is more useful for query evaluation) • Is shared by more mappings than a threshold (otherwise it is not worth storing) • Example c-block • # of possible mappings: 5 • Threshold = 0.4
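The two c-block conditions can be sketched as a small check. This assumes the threshold is a fraction of the possible mappings and that meeting it exactly is enough (the slide says "more than", so the boundary case is a guess). The `Block` class and the element names in the example are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    target: str                 # target schema element the block is attached to
    correspondences: frozenset  # C: (source element, target element) pairs
    mappings: frozenset         # M: ids of the mappings that share C


def is_c_block(block, subtree_elements, total_mappings, threshold):
    """Check the two c-block conditions for one block."""
    covered = {t for _, t in block.correspondences}
    covers_subtree = set(subtree_elements) <= covered
    # Assumption: the threshold is a fraction, and reaching it exactly counts.
    frequent = len(block.mappings) / total_mappings >= threshold
    return covers_subtree and frequent


# Hypothetical block at target element IP, whose sub-tree is {IP, ICN}.
block = Block(
    target="IP",
    correspondences=frozenset({("BP", "IP"), ("BCN", "ICN")}),
    mappings=frozenset({"M1", "M2", "M3"}),
)
ok = is_c_block(block, ["IP", "ICN"], total_mappings=5, threshold=0.4)
```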

The block tree • An auxiliary data structure for uncertain mappings • Indexes shared correspondences • Reduces storage cost • Creation of the block tree • Follows the structure of the target schema • Bottom-up construction with pruning • Lemma 1: the c-blocks for an element can be created from the c-blocks of its children (detail) • Lemma 2: if an element has no c-block, then its parent (if any) has no c-block
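A rough sketch of bottom-up construction with Lemma 2 pruning. It does not implement the Lemma 1 combination step, whose details are not given on the slide. Instead it regroups the mappings from scratch at every element, which yields blocks that satisfy the c-block definition but is a brute-force stand-in. The input layout and all names are assumptions.

```python
from collections import defaultdict


def subtree(schema, node):
    """All target elements in the sub-tree rooted at node (node included)."""
    out = [node]
    for child in schema.get(node, []):
        out.extend(subtree(schema, child))
    return out


def build_c_blocks(schema, root, mappings, threshold):
    """Return {target element: [(correspondences, mapping ids), ...]}.

    schema:   {target element: [child elements]}
    mappings: {mapping id: {target element: source element}}
    """
    blocks = {}

    def visit(node):
        children = schema.get(node, [])
        for child in children:  # bottom-up: children first
            visit(child)
        # Lemma 2: a child without c-blocks means the parent has none either.
        if any(not blocks[child] for child in children):
            blocks[node] = []
            return
        elements = subtree(schema, node)
        groups = defaultdict(set)
        for mid, m in mappings.items():
            if all(t in m for t in elements):
                key = frozenset((m[t], t) for t in elements)
                groups[key].add(mid)
        blocks[node] = [
            (corr, ids) for corr, ids in groups.items()
            if len(ids) / len(mappings) >= threshold
        ]

    visit(root)
    return blocks


# Hypothetical target schema and five possible mappings.
schema = {"ORDER": ["IP", "SP"], "IP": ["ICN"]}
mappings = {
    "M1": {"ORDER": "Order", "IP": "BP", "ICN": "BCN", "SP": "Seller"},
    "M2": {"ORDER": "Order", "IP": "BP", "ICN": "BCN", "SP": "Vendor"},
    "M3": {"ORDER": "Order", "IP": "BP", "ICN": "RCN", "SP": "Seller"},
    "M4": {"ORDER": "Order", "IP": "RP", "ICN": "RCN", "SP": "Supplier"},
    "M5": {"ORDER": "Order", "IP": "BP", "ICN": "RCN"},
}
blocks = build_c_blocks(schema, "ORDER", mappings, threshold=0.4)
```

In this example the root ORDER ends up with no c-block, because no two mappings agree on its whole sub-tree.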

The block tree • Reducing the storage cost of uncertain mappings • If part of a mapping is in the block tree, replace it with a link • The more blocks there are, the more space may be saved

Outline • Introduction • Problem • Approaches • Block tree • Query evaluation • Mapping generation • Results • Conclusion

Traditional query model (single mapping) • Twig query evaluation through a target schema [YP04] • Step 1: rewrite the target query (which is posed against the target schema) into a source query, based on the schema mapping • Step 2: evaluate the source query on the source document • M1: Order-ORDER, BP-IP, BCN-ICN, … • [Figure: the target query is rewritten into a source query, which is evaluated on the source document]
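Step 1 can be sketched as a label substitution over the twig. This is a simplification: it represents a twig as `(label, [children])`, keeps the twig shape unchanged, and ignores how the rewritten source elements are connected in the source schema. The representation and names are assumptions, not the rewriting algorithm of [YP04].

```python
def rewrite(twig, target_to_source):
    """Rewrite a target twig into a source twig.

    twig: (label, [child twigs]); target_to_source: {target element: source element}.
    Returns None if some query element has no counterpart in the mapping.
    """
    label, children = twig
    if label not in target_to_source:
        return None
    new_children = []
    for child in children:
        rewritten = rewrite(child, target_to_source)
        if rewritten is None:
            return None
        new_children.append(rewritten)
    return (target_to_source[label], new_children)


# Mapping M1 from the slide, stored as target -> source.
m1 = {"ORDER": "Order", "IP": "BP", "ICN": "BCN"}
target_query = ("ORDER", [("IP", [("ICN", [])])])
source_query = rewrite(target_query, m1)
```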

Our query model (uncertain mappings) • Query evaluation with uncertain mappings [DHY07] • Uncertain mappings: pM = {(M1,Pr(M1)), …, (Mh,Pr(Mh))} • Baseline approach: evaluate QT with each mapping in pM separately • [Figure: the target query QT is rewritten with each mapping Mi into a source query Qi, which is evaluated on the source document DS to give the answer (Ri, Pr(Mi)), for i = 1, …, h]

Drawback of the baseline approach • Drawback: redundant computation • Multiple mappings, say M1 and M2, may be identical with regard to a target query QT, i.e., R1 = R2 • In this case, only one of the mappings needs to be considered • [Figure: mappings m1 and m2 in the same block are rewritten to Q1 and Q2 and evaluated on DS, giving (R1, Pr(M1)) and (R2, Pr(M2))]
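One simple way to avoid the redundancy, sketched here without the block tree: group the mappings by how they translate the elements that the query mentions, and evaluate once per group. Agreeing on the query elements is a sufficient condition for R1 = R2, not the block-based method of the talk. The function names and the `evaluate` callback are hypothetical.

```python
from collections import defaultdict


def evaluate_grouped(query_elements, mappings, evaluate):
    """Evaluate once per distinct rewriting of the query elements.

    mappings: {id: (target_to_source dict, probability)}
    evaluate: callable taking the tuple of source elements, returning answers.
    """
    groups = defaultdict(list)
    for mid, (m, prob) in mappings.items():
        key = tuple(m.get(t) for t in query_elements)
        groups[key].append((mid, prob))
    answers = {}
    for key, members in groups.items():
        result = evaluate(key)  # shared by every mapping in the group
        for mid, prob in members:
            answers[mid] = (result, prob)
    return answers


calls = []


def fake_evaluate(source_elements):
    """Stand-in for source query evaluation; records how often it runs."""
    calls.append(source_elements)
    return f"answers for {source_elements}"


mappings = {
    "M1": ({"ORDER": "Order", "IP": "BP", "ICN": "BCN"}, 0.5),
    "M2": ({"ORDER": "Order", "IP": "RP", "ICN": "BCN"}, 0.3),
    "M3": ({"ORDER": "Order", "IP": "BP", "ICN": "RCN"}, 0.2),
}
answers = evaluate_grouped(["ICN"], mappings, fake_evaluate)
```

Here M1 and M2 agree on ICN, so the query //ICN is evaluated twice instead of three times, and each mapping keeps its own probability.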

Query evaluation with block tree • Our intuition: use the sharing information in the block tree whenever possible • Consider the root of a query • Case 1: the root is found in the block tree; use the blocks to evaluate the whole query (only one mapping in the block is considered) • [Figure: sub-query at “w”]

Query example • Case 1: the root is found in the block tree; use the blocks to evaluate the whole query • Only one mapping in the block is used • The remaining mappings must still be dealt with

Query evaluation with block tree • Consider the root of a query • Case 1: the root is found in the block tree; use the blocks to evaluate the whole query • Case 2: the root is not found; decompose the query, invoke recursion, and join the partial answers

Query example • Case 2: the root is not found; decompose the query, invoke recursion, and join the partial answers (similar to a structural join) • No valid block exists • [Figure: QT is decomposed into QT1 + QT2 + QT3]

Top-k probabilistic twig query • The user is only interested in the k answer tuple sets {Ri,Pr(Ri)} whose probabilities are the highest • Notice that Pr(Ri) = Pr(Mi) • Equivalently, only consider the k mappings whose probabilities are the highest • The block tree also supports efficient evaluation of top-k probabilistic twig queries • Use the previous algorithm with a filtered set of mappings • See the paper for more details
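Since Pr(Ri) = Pr(Mi), the filtering step reduces to keeping the k most probable mappings before running the evaluation. A minimal sketch, with illustrative names and probabilities:

```python
import heapq


def top_k_mappings(p_mappings, k):
    """Keep the k (mapping, probability) pairs with the highest probability."""
    return heapq.nlargest(k, p_mappings, key=lambda pair: pair[1])


# Hypothetical uncertain mappings pM.
p_mappings = [("M1", 0.4), ("M2", 0.3), ("M3", 0.2), ("M4", 0.1)]
filtered = top_k_mappings(p_mappings, 2)
```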

Outline • Introduction • Approaches • Block tree • Query evaluation • Mapping generation • Results • Conclusion

The mapping generation problem • Given • Source schema S, target schema T • A set of element correspondences (es,et) with similarities between S and T • A schema mapping m between S and T consists of a set of correspondences (es,et) • et may be EMPTY, i.e., es matches no element in T • Each element in S occurs exactly once in m • Each element in T occurs at most once in m • m’s score is computed by aggregating the similarities of its correspondences • Return the h mappings m1, …, mh with the highest scores
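The constraints on a mapping and its score can be sketched as follows. The slide does not say which aggregation is used, so summation is an assumption here, as are the similarity values. `None` stands for EMPTY.

```python
def is_valid_mapping(mapping, source_elements):
    """Each source element occurs exactly once; each target element at most once."""
    sources = [es for es, _ in mapping]
    targets = [et for _, et in mapping if et is not None]
    return (sorted(sources) == sorted(source_elements)
            and len(targets) == len(set(targets)))


def score(mapping, similarity):
    """Aggregate similarities by summation (an assumed aggregation function)."""
    return sum(similarity.get((es, et), 0.0)
               for es, et in mapping if et is not None)


source_elements = ["Order", "BOC", "BP", "BCN"]
# Sample mapping from the talk; BOC maps to EMPTY.
m = [("Order", "ORDER"), ("BOC", None), ("BP", "IP"), ("BCN", "ICN")]
# Hypothetical similarities, as a matcher might report them.
similarity = {("Order", "ORDER"): 0.9, ("BP", "IP"): 0.8, ("BCN", "ICN"): 0.7}
valid = is_valid_mapping(m, source_elements)
total = score(m, similarity)
```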

Mapping generation • Baseline solution • Find the h-maximum bipartite matchings (min-cost flow) • Polynomial in the size of the bipartite graph • Drawback: does not perform well for large XML schemas • Modeled as a bipartite graph • Image elements are inserted for “self-loops”

Mapping generation • Observation: XML schema matching is usually sparse • Example: elements s1 and s3 match t1 and t2, while elements s2 and s4 match t4 • Improvement: a divide-and-conquer approach • Derive partitions (maximal connected sub-graphs) of the bipartite graph • Find the top-h partial mappings from each partition • Merge partial mappings to obtain complete mappings
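The three divide-and-conquer steps can be sketched as below. Two assumptions are made: scores are aggregated by summation, so a complete top-h mapping only ever combines top-h partial mappings, and the per-partition search is plain enumeration standing in for an h-maximum bipartite matching algorithm. The similarity values are invented for the slide's s1..s4 / t1, t2, t4 example.

```python
from collections import defaultdict
from itertools import product
import heapq


def partitions(correspondences):
    """Split {(s, t): similarity} into maximal connected sub-graphs."""
    adj = defaultdict(set)
    for s, t in correspondences:
        adj[("s", s)].add(("t", t))
        adj[("t", t)].add(("s", s))
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        parts.append({(s, t): sim for (s, t), sim in correspondences.items()
                      if ("s", s) in comp})
    return parts


def top_h_partial(part, h):
    """Enumerate one-to-one partial mappings of a partition; keep the best h."""
    sources = sorted({s for s, _ in part})
    options = [[None] + sorted(t for (s2, t) in part if s2 == s) for s in sources]
    found = []
    for choice in product(*options):
        used = [t for t in choice if t is not None]
        if len(used) != len(set(used)):
            continue  # a target element would be used twice
        pairs = tuple(zip(sources, choice))
        total = sum(part[(s, t)] for s, t in pairs if t is not None)
        found.append((total, pairs))
    return heapq.nlargest(h, found, key=lambda item: item[0])


def top_h_mappings(correspondences, h):
    """Merge per-partition results, pruning to h candidates after each merge."""
    merged = [(0.0, ())]
    for part in partitions(correspondences):
        partial = top_h_partial(part, h)
        combos = [(s1 + s2, m1 + m2) for s1, m1 in merged for s2, m2 in partial]
        merged = heapq.nlargest(h, combos, key=lambda item: item[0])
    return merged


correspondences = {
    ("s1", "t1"): 0.9, ("s1", "t2"): 0.5,
    ("s3", "t1"): 0.6, ("s3", "t2"): 0.8,
    ("s2", "t4"): 0.7, ("s4", "t4"): 0.4,
}
parts = partitions(correspondences)
best = top_h_mappings(correspondences, h=2)
```

The example splits into two partitions, so each enumeration only sees two source elements instead of four.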

Outline • Introduction • Problem • Approaches • Results • Conclusion

Dataset and results • XML schemas and documents • 7 schemas for purchase orders, obtained from various e-commerce standards (e.g., XCBL, OpenTrans) • Accompanying sample XML documents • Schema matching • Tool: COMA++, with different schema matching methods • 10 datasets: (source-schema, target-schema, matching-method) • Target queries • 10 hand-written queries

Results – block tree • How much space does the block tree save for storing uncertain mappings, and why? • When the threshold is larger, fewer c-blocks are created, and therefore less space is saved by the c-blocks

Results – block tree • Is the block tree effective? • Intuitively, larger blocks tend to be more useful, as they can answer more queries without decomposition

Results – block tree • The block tree can be created efficiently • Fast, and controllable

Results – query • Can the block tree really improve query performance? • Yes, as the number of mappings varies

Results – query • Can it scale? • It scales with the number of mappings • The top-k probabilistic twig query may achieve even better performance

Results – mapping generation • Top-h mapping generation • A simple algorithm with a large improvement in practice (~90%) • Reason: the bipartite graph is sparse, so many small partitions are obtained

Conclusion • We study the problem of managing uncertainty in XML schema matching • Uncertain mapping generation • Efficient query evaluation • Representation with block tree

Thanks! • Q & A • More discussions are welcome in the poster session! • Contact: GONG Jian, Jim • jgong@cs.hku.hk • Department of Computer Science, The University of Hong Kong

References • [Len02] Lenzerini, “Data integration: a theoretical perspective”, in PODS, 2002 • [YP04] Yu et al, “Constraint-based XML query rewriting for data integration”, in SIGMOD, 2004 • [DR02] Do et al, “COMA: a system for flexible combination of schema matching approaches”, in VLDB, 2002 • [Gal06] Gal, “Managing uncertainty in schema matching with top-k schema mappings”, in J. Data Semantics VI, 2006 • [DHY07] Dong et al, “Data integration with uncertainty”, in VLDB, 2007 • [QYD07] Qin et al, “TwigList: make twig pattern matching fast”, in DASFAA, 2007 • [Murty86] Murty, “An algorithm for ranking all the assignment in increasing order of cost”, Operations Research, vol 16, 1986 • [RB01] Rahm et al, “A survey of approaches to automatic schema matching”, VLDB J, vol 10, 2001 • [KYS08] Kimelfeld et al, “Query efficiency in probabilistic XML models”, in SIGMOD, 2008 • …

Query rewriting • Given • A target twig query QT • A schema mapping m between S and T, which is a set of correspondences (es,et) • Mapping semantics • For each sub-tree in the source document DS that contains a set of source elements in m, there exists a sub-tree in the target document DT that contains the corresponding target elements • Procedure • For each element in QT, replace it with a source element • Connect all the source elements

Query evaluation and uncertainty • The uncertainty in mappings may affect query answers • Example: a source document • Uncertain mappings • M1: Order-ORDER, …, BCN-ICN, … • M2: Order-ORDER, …, RCN-ICN, … • … • Target query • Q: //ICN, which finds all ICNs (contact names of invoice parties) in the purchase order • [Figure: source document, with the nodes returned by M1 and the nodes returned by M2 marked]

Lemma 1: an example • Lemma 1 (conceptually): the c-blocks for a schema element t can be created from the c-blocks of t’s children (detail)

Results • What queries did we use?
