720 likes | 911 Views
IEEE Intl. Conference on Data Engineering 2012. Evaluating Probabilistic Queries over Uncertain Matching. Reynold Cheng , Jian Gong, David Cheung, and Jiefeng Cheng. Background: Hidden Databases. Target query. Query interface (e.g., web form). Location = Central Price < 5M
E N D
IEEE Intl. Conference on Data Engineering 2012 Evaluating Probabilistic Queries over Uncertain Matching • Reynold Cheng, Jian Gong, David Cheung, and Jiefeng Cheng
Background: Hidden Databases Target query Query interface (e.g., web form) Location = Central Price < 5M Size > 700 ft DB instances; hidden from users Source DB …… …… 2 2
Background: schema matching source attribute Source schema S: (pname, email-addr, permanent-addr, current-addr) Schema matching (e.g., from COMA++) correspondence Target schema T: (name, email, mailing-addr, home-addr, office-addr) Target attribute Target Query
Background: Schema mapping Source Query Mapping: a subset of matching S: (pname, email-addr, permanent-addr, current-addr) Many different mappings Better if we can know their confidence! T: (name, email, mailing-addr, home-addr, office-addr) Target Query
Probabilistic mappings • A set of h pairs (Mi, Pr(Mi)), where Pr(Mi) is the probability that mapping Miexists [Gal06, DHY07, CGC10] Bipartite matching on similarity scores Querying on these mappings produce answers with confidence Similarity score
Basic query solution • Example Target query: SELECT phone FROM Person WHERE addr=‘aaa’ “123”, 0.3 “456”, 0.3 m1: Source query: SELECT ophone FROM Person WHERE oaddr=‘aaa’
Basic query solution • Example Target query: SELECT phone FROM Person WHERE addr=‘aaa’ “123”, 0.3 “456”, 0.3 m1: Source query: SELECT ophone FROM Person WHERE oaddr=‘aaa’ “123”, 0.2 “456”, 0.2 m2:
Basic query solution • Example Target query: SELECT phone FROM Person WHERE addr=‘aaa’ “123”, 0.5 “456”, 0.5 m1, m2:
Variants of basic solutions • Enhanced basic (or e-basic): groups identical source queries, and evaluates the distinct ones • Much better than basic! • e-MQO: attempts to improve e-basic by applying multi-query optimization [ZLFL07] on distinct source queries • Experimentally worse than e-basic, since generating a good multi-query plan for lots of mappings is expensive • We use e-basic to compare with our new algorithms
Correspondence Overlap • Probabilistic mappings can have many common correspondences Q-sharing and O-sharinguses this to improve query efficiency
Query-level Sharing (Q-Sharing) • If the query for mappings m1 and m2 are identical, only 1 query needs to be issued. Target query: SELECT addr FROM Person WHERE phone=‘123’ m1 and m2 Source query: SELECT oaddr FROM Customer WHERE ophone=‘123’
Q-Sharing • Example Target query: SELECT pname FROM Person WHERE addr=‘abc’ Partition the mappings Partition Tree P1: {m1, m2} P2: {m3, m4} P3: {m5} Representative mappings: {m1, m3, m5} Only 3 out of 5 mappings are used for query reformulation.
Problem of Q-Sharing • Given a target query, two mappings may share only some query operators, but not all. Target query: SELECT addr FROM Person WHERE phone=‘123’ Q-sharing does not work!
O-Sharing • Share query operator evaluation for two mappings with the same correspondence Target query: SELECT addr FROM Person WHERE phone=‘123’ • m2 and m3 shares the selection condition • 1. Obtain tuples with ophone=123 for m2 and m3 • 2. For m2, retrieve oaddr; for m3, retrieve haddr
O-sharing: Example • Target query • Probabilistic mappings
O-sharing: Example • An execution unit (e-unit) u1 captures the current status of a target query 2) Mapping set 3) next-op 1) Query plan
O-sharing: Example • Execution of an e-unit u1 select next operator (details later) For m1 and m2, addr oaddr Process m1 and m2 in a batch For m3, m4, and m5, addr haddr Process m3-m5 in a batch
O-sharing: Example • New e-units u2 and u3 are generated • The process goes on until no more e-units are produced • Mapping set of u1 is partitioned • Intermediate results are generated
Operator Selection • Method 1: Randomly select the next operator
Operator Selection • Method 2: SNF(or Smallest Number of Partition First) chooses a target operator that leads to the fewest mapping partitions Mapped to 3 source attributes, i.e., 3 mapping partitions 4 mapping partitions
Operator Selection • Method 3: SEF(or Smallest Entropy First) chooses a target operator that leads to the lowest entropy phone addr
Advantages of O-sharing • Interleaves query rewriting and operator execution • May not have to consider the whole target query for every mapping, due to empty intermediate result • The current o-sharing solution supports selection, projection, join, MIN, MAX, and SUM operators
Probabilistic top-k queries • Query semantic • Returns k tuples whose probabilities are the highest, among those with non-zero probabilities • Our new algorithm can prune non-answers tuples • Avoid evaluating the actual probabilities of all answer tuples • This is done by partially expanding the e-units
Experimental setup • Schemas and data are about purchase orders • Source schema: TPC-H • 100MB database, with 1M tuples • 46 attributes, 8 relations • 3 Target schemas provided by COMA++ • Excel, Noris, Paragon • 48, 66, and 69 attributes • Schema matcher: COMA++ • 10 target queries: selection, projection, join, COUNT, and SUM • 100 probabilistic mappings • SEF is used for o-sharing
Operator selection strategies • SNF is much better than Random, and SEF further improves SNF.
Top-k query performance • Top-k query could improve the query performance, especially when the query returns a large set of results.
Related work • Schema matching • Uncertainty is not considered in most existing work • Probabilistic schema mapping [Gal06, DHY07] • Uncertain XML schema matching [CGC10, GCC11] • Computing and storing of probabilistic XML mappings • Evaluating of probabilistic XML queries
Conclusions • Probabilistic mappings can be used to handle uncertainty of schema matching • To efficiently handle table semantics, we examine q-sharing and o-sharing • They exploit the correspondences of mappings that share a query or its query operators • We plan to study the use of o-sharing on other queries (e.g., set difference and recursive queries)
Thank you! Reynold Cheng (HKU) URL: http://www.cs.hku.hk/~ckcheng Email: ckcheng@cs.hku.hk
References • [CGC10] R. Cheng, J. Gong, and D. Cheung. “Managing uncertainty in XML schema matching”, in ICDE, 2010 • [GCC11] J. Gong, R. Cheng, and D. Cheung. “Efficient Management of Uncertainty in XML Schema Matching”, in VLDBJ, 2011. • [Len02] Lenzerini, “Data integration: a theoretical perspective”, in PODS, 2002 • [YP04] Yu et al, “Constraint-based XML query rewriting for data integration”, in SIGMOD, 2004 • [DR02] Do et al, “COMA: a system for flexible combination of schema matching approaches”, in VLDB, 2002 • [Gal06] Gal, “Managing uncertainty in schema matching with top-k schema mappings”, in J. Data Semantics VI, 2006 • [DHY07] Dong et al, “Data integration with uncertainty”, in VLDB, 2007 • [QYD07] Qin et al, “TwigList: make twig pattern matching fast”, in DASFAA, 2007 • [Murty86] Murty, “An algorithm for ranking all the assignment in increasing order of cost”, Operations Research, vol 16, 1986 • [RB01] Rahm et al, “A survey of approaches to automatic schema matching”, VLDB J, vol 10, 2001 • [KYS08] Kimelfeld et al, “Query efficiency in probabilistic XML models”, in SIGMOD, 2008 • [ZLFL07] J. Zhou, P. Larson, J. Freytag, and W. Lehner, “Efficient exploitation of similar subexpressions for query processing,” in SIGMOD, 2007.
Probabilistic mappings • We assume that the schema matching is represented by h probabilistic mappings. • The probability of eachmapping is obtained by using a bipartite matching algorithm on the similarity scores of correspondences [CGC10]
Generating the top-h mappings • Use a h-maximum bipartite matching algorithm to find the hmappings with the highest scores • See [CGC10] • Image elements are inserted to model the absence of correspondence We use approach 3
Probabilistic mappings • Find the h mappings with the highest scores, using a bipartite matching algorithm [CGC10] • For each Mi, obtain Pr(Mi) by normalizing Mi’s score with the sum of scores of the h mappings
Basic Solutions • e-basic is the best among the simple solutions. We thus compare it with q-sharing and o-sharing.
Overlap of Mappings Fraction of no. of common correspondences over no. of distinct correspondences
Probabilistic Query Evaluation • 2 ways to reformulate and evaluate a target query. • By-table semantic • All tuples in source tables use the same possible mapping • By-tuple semantic • Each tuple in source tables may use a differentpossible mapping • Details in Appendix B
By-table semantic • All tuples in source tables use the same possible mapping • The query answers from the mapping Mi have the probability Pr(Mi) • If duplicate removal is enforced, then a tuple t returned by both M1 and M2 has probability • Pr(t) = Pr(M1) + Pr(M2)
By-table semantic • Example Target query: SELECT mailing-addr from T When m1 is considered, the query answer: Sunnyvale, 0.5 When m2 is considered, the query answer: Sunnyvale, 0.4 Mountain View, 0.4 When m3 is considered, the query answer: alice@, 0.1 bob@, 0.1 Final query answer (with duplicates removed): Sunnyvale, 0.9 Mountain View, 0.4 alice@, 0.1 bob@, 0.1
basic solution • Evaluate the target query for every possible mapping Mi • The query answers from the mapping Mihave the probability Pr(Mi) • If duplicate removal is enforced, then a tuple t returned by both M1 and M2 has probability • Pr(t) = Pr(M1) + Pr(M2) • Very expensive if the no. of mappings,|M|, is huge
A basic solution • Example Target query: SELECT phone FROM Person WHERE addr=‘aaa’ “123”, 0.3 “456”, 0.3 m1: Source query: SELECT ophone FROM Person WHERE oaddr=‘aaa’
A basic solution • Example Target query: SELECT phone FROM Person WHERE addr=‘aaa’ “123”, 0.3 “456”, 0.3 m1: Source query: SELECT ophone FROM Person WHERE oaddr=‘aaa’ “123”, 0.2 “456”, 0.2 m2:
A basic solution • Example Target query: SELECT phone FROM Person WHERE addr=‘aaa’ “123”, 0.5 “456”, 0.5 m1, m2:
5 Algorithms for comparison • Basic: consider each possible mapping separately • e-basic: first clusters the identical source queries, then evaluate this set of distinct source queries • e-MQO: improve the e-basic by applying multi-query optimization with the set of distinct source queries • Our solutions: q-sharing and o-sharing
Target Query Evaluation • 5 algorithms for querying probabilistic mappings: • basic • e-basic • e-MQO • Q-sharing • O-sharing
Q-Sharing Source query: SELECT oaddr FROM Customer WHERE ophone=‘123’ Query answer for m1 and m2: aaa, 0.5