
Evaluating Probabilistic Queries over Uncertain Matching



Presentation Transcript


  1. IEEE Intl. Conference on Data Engineering 2012 Evaluating Probabilistic Queries over Uncertain Matching • Reynold Cheng, Jian Gong, David Cheung, and Jiefeng Cheng

  2. Background: Hidden Databases • A target query (e.g., Location = Central, Price < 5M, Size > 700 ft) is posed through a query interface such as a web form • The source DB instances behind the interface are hidden from users

  3. Background: Schema Matching • Source schema S: (pname, email-addr, permanent-addr, current-addr) • Target schema T: (name, email, mailing-addr, home-addr, office-addr) • A schema matching (e.g., produced by COMA++) is a set of correspondences between source attributes and target attributes; the target query is posed over T

  4. Background: Schema Mapping • A mapping is a subset of the matching; it reformulates a target query over T: (name, email, mailing-addr, home-addr, office-addr) into a source query over S: (pname, email-addr, permanent-addr, current-addr) • Many different mappings exist; it would be better if we knew their confidence!

  5. Probabilistic mappings • A set of h pairs (Mi, Pr(Mi)), where Pr(Mi) is the probability that mapping Mi exists [Gal06, DHY07, CGC10] • The mappings and their probabilities are derived by bipartite matching on the similarity scores of the correspondences • Querying over these mappings produces answers with confidence values
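As a concrete illustration of this data model, here is a minimal Python sketch; the attribute names follow the running example of the schema-matching slides, but the data structure itself is an assumption, not the paper's code:

```python
# A minimal sketch of the data model (the Python structure is my own; attribute
# names follow the running example): h pairs (Mi, Pr(Mi)), where each Mi maps
# target attributes to source attributes and the probabilities sum to 1.
prob_mappings = [
    # Mi: target attribute -> source attribute            Pr(Mi)
    ({"name": "pname", "mailing-addr": "current-addr"},   0.5),   # M1
    ({"name": "pname", "mailing-addr": "permanent-addr"}, 0.4),   # M2
    ({"name": "pname", "mailing-addr": "email-addr"},     0.1),   # M3
]
assert abs(sum(p for _, p in prob_mappings) - 1.0) < 1e-9
```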

  6. Basic query solution • Example target query: SELECT phone FROM Person WHERE addr=‘aaa’ • Under m1, the source query SELECT ophone FROM Person WHERE oaddr=‘aaa’ returns “123” and “456”, each with probability 0.3

  7. Basic query solution • Example target query: SELECT phone FROM Person WHERE addr=‘aaa’ • Under m2, the same source query SELECT ophone FROM Person WHERE oaddr=‘aaa’ returns “123” and “456”, each with probability 0.2

  8. Basic query solution • Example target query: SELECT phone FROM Person WHERE addr=‘aaa’ • Combining m1 and m2, the answers are “123” with probability 0.5 and “456” with probability 0.5
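The basic solution on slides 6-8 can be summarized in a short sketch: rewrite the target query once per mapping, tag each answer with Pr(Mi), and sum the probabilities of duplicate answers. The source tuples and helper names below are illustrative assumptions, not the paper's code:

```python
# A minimal sketch of the basic solution (source data and helper names are
# illustrative assumptions): rewrite the target query once per mapping, tag each
# answer with Pr(Mi), and sum the probabilities of duplicate answers.
from collections import defaultdict

source_person = [                          # hypothetical source relation
    {"ophone": "123", "oaddr": "aaa"},
    {"ophone": "456", "oaddr": "aaa"},
    {"ophone": "789", "oaddr": "bbb"},
]

mappings = [                               # both rewrite the query identically, as on the slides
    ({"phone": "ophone", "addr": "oaddr"}, 0.3),   # m1, Pr(m1) = 0.3
    ({"phone": "ophone", "addr": "oaddr"}, 0.2),   # m2, Pr(m2) = 0.2
]

def basic(proj_attr, sel_attr, sel_val, mappings, source):
    """SELECT proj_attr FROM source WHERE sel_attr = sel_val, evaluated once per mapping."""
    answers = defaultdict(float)
    for m, prob in mappings:
        proj, sel = m[proj_attr], m[sel_attr]          # rewrite target attributes
        for row in source:
            if row[sel] == sel_val:
                answers[row[proj]] += prob             # Pr(t) = sum of Pr(Mi) returning t
    return dict(answers)

print(basic("phone", "addr", "aaa", mappings, source_person))
# {'123': 0.5, '456': 0.5} -- the merged result on slide 8
```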

  9. Variants of basic solutions • Enhanced basic (or e-basic): groups identical source queries, and evaluates the distinct ones • Much better than basic! • e-MQO: attempts to improve e-basic by applying multi-query optimization [ZLFL07] on distinct source queries • Experimentally worse than e-basic, since generating a good multi-query plan for lots of mappings is expensive • We use e-basic to compare with our new algorithms
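A sketch of the e-basic idea, under the assumption that "identical source queries" can be detected by comparing the rewritten attribute lists (the grouping key and helper names are mine):

```python
# Sketch of the e-basic idea (the grouping key and helper names are assumptions):
# mappings whose rewritten source queries are identical are grouped, each distinct
# source query is issued once, and its answers are shared by the whole group.
from collections import defaultdict

def rewrite(target_attrs, mapping):
    """Rewrite a target query, given as a tuple of attribute names, via one mapping."""
    return tuple(mapping[a] for a in target_attrs)

def group_by_source_query(target_attrs, mappings):
    groups = defaultdict(list)                 # rewritten query -> [(mapping, prob), ...]
    for m, prob in mappings:
        groups[rewrite(target_attrs, m)].append((m, prob))
    return groups

mappings = [
    ({"phone": "ophone", "addr": "oaddr"}, 0.3),   # m1
    ({"phone": "ophone", "addr": "oaddr"}, 0.2),   # m2: same rewriting as m1
    ({"phone": "hphone", "addr": "haddr"}, 0.5),   # m3: different rewriting
]
groups = group_by_source_query(("phone", "addr"), mappings)
print(len(groups))   # 2 distinct source queries instead of 3
```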

  10. Correspondence Overlap • Probabilistic mappings can have many common correspondences • Q-sharing and O-sharing use this to improve query efficiency

  11. Query-level Sharing (Q-Sharing) • If the source queries for mappings m1 and m2 are identical, only one query needs to be issued • Target query: SELECT addr FROM Person WHERE phone=‘123’ • For both m1 and m2, the source query is SELECT oaddr FROM Customer WHERE ophone=‘123’

  12. Q-Sharing • Example target query: SELECT pname FROM Person WHERE addr=‘abc’ • Partition the mappings with a partition tree: P1 = {m1, m2}, P2 = {m3, m4}, P3 = {m5} • Representative mappings: {m1, m3, m5} • Only 3 out of 5 mappings are used for query reformulation
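A rough sketch of the partition step behind this slide. The mappings m1-m5 below are invented so that they reproduce the partitions P1-P3 shown above, and the refinement loop is my own simplification of a partition-tree construction, not the paper's exact algorithm:

```python
# A rough sketch of the Q-sharing partition step (a simplification, not the paper's
# exact partition-tree algorithm). The mappings m1-m5 are invented so that they
# reproduce the partitions P1-P3 on the slide for the query attributes pname, addr.
from collections import defaultdict

def partition(mappings, query_attrs):
    """Refine the mapping set one query attribute at a time; mappings that rewrite
    every query attribute identically end up in the same partition."""
    parts = {(): list(mappings)}
    for attr in query_attrs:
        refined = defaultdict(list)
        for key, group in parts.items():
            for name, m in group:
                refined[key + (m[attr],)].append((name, m))
        parts = refined
    return parts

mappings = [
    ("m1", {"pname": "oname", "addr": "oaddr"}),
    ("m2", {"pname": "oname", "addr": "oaddr"}),
    ("m3", {"pname": "oname", "addr": "haddr"}),
    ("m4", {"pname": "oname", "addr": "haddr"}),
    ("m5", {"pname": "cname", "addr": "oaddr"}),
]
for key, group in partition(mappings, ["pname", "addr"]).items():
    print(key, [name for name, _ in group])
# ('oname', 'oaddr') ['m1', 'm2']   <- representative m1
# ('oname', 'haddr') ['m3', 'm4']   <- representative m3
# ('cname', 'oaddr') ['m5']         <- representative m5
```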

  13. Problem of Q-Sharing • Given a target query, two mappings may share only some query operators, but not all. Target query: SELECT addr FROM Person WHERE phone=‘123’ Q-sharing does not work!

  14. O-Sharing • Share query operator evaluation for two mappings with the same correspondence • Target query: SELECT addr FROM Person WHERE phone=‘123’ • m2 and m3 share the selection condition • 1. Obtain tuples with ophone=123 for m2 and m3 • 2. For m2, retrieve oaddr; for m3, retrieve haddr
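The sharing of the selection condition between m2 and m3 can be illustrated with a tiny sketch; the table contents and attribute values below are made up:

```python
# A tiny sketch of the shared selection on this slide (table contents are made up).
# m2 and m3 both map the target attribute phone to the source attribute ophone, so
# the selection ophone = '123' is scanned once; only the projection differs.
customer = [
    {"ophone": "123", "oaddr": "aaa", "haddr": "xxx"},
    {"ophone": "123", "oaddr": "bbb", "haddr": "yyy"},
    {"ophone": "999", "oaddr": "ccc", "haddr": "zzz"},
]

shared = [row for row in customer if row["ophone"] == "123"]   # evaluated once for m2 and m3

answers_m2 = [row["oaddr"] for row in shared]   # m2: addr -> oaddr
answers_m3 = [row["haddr"] for row in shared]   # m3: addr -> haddr
print(answers_m2, answers_m3)                   # ['aaa', 'bbb'] ['xxx', 'yyy']
```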

  15. O-sharing: Example • Target query • Probabilistic mappings

  16. O-sharing: Example • An execution unit (e-unit) u1 captures the current status of a target query: 1) query plan, 2) mapping set, 3) next-op

  17. O-sharing: Example • Execution of an e-unit u1: select the next operator (details later) • For m1 and m2, addr → oaddr: process m1 and m2 in a batch • For m3, m4, and m5, addr → haddr: process m3-m5 in a batch

  18. O-sharing: Example • New e-units u2 and u3 are generated • The process goes on until no more e-units are produced • Mapping set of u1 is partitioned • Intermediate results are generated
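The e-unit life cycle of slides 16-18 can be approximated as follows. This is my own abstraction, not the paper's data structures; the real o-sharing algorithm also evaluates each selected operator against the source database and discards e-units whose intermediate results are empty, which this sketch omits:

```python
# A simplified sketch of e-unit expansion (my own abstraction, not the paper's exact
# data structures). An e-unit bundles the remaining target operators, the mapping
# set it covers, and the rewriting decisions made so far; executing its next
# operator partitions the mapping set and spawns one child e-unit per partition.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class EUnit:
    remaining_ops: list                 # target attributes still to be rewritten
    mappings: list                      # list of (mapping, prob) covered by this e-unit
    partial: dict = field(default_factory=dict)   # source attributes chosen so far

def expand(unit, choose_next_op):
    """Execute one operator of an e-unit and return the child e-units."""
    if not unit.remaining_ops:
        return []                                       # fully processed
    op = choose_next_op(unit)                           # operator selection (random/SNF/SEF)
    rest = [o for o in unit.remaining_ops if o != op]
    buckets = defaultdict(list)
    for m, p in unit.mappings:                          # batch mappings with the same rewriting
        buckets[m[op]].append((m, p))
    return [EUnit(rest, ms, {**unit.partial, op: src}) for src, ms in buckets.items()]

def run(root, choose_next_op):
    """Expand e-units until no more are produced; return the completed ones."""
    frontier, leaves = [root], []
    while frontier:
        u = frontier.pop()
        children = expand(u, choose_next_op)
        if children:
            frontier.extend(children)
        else:
            leaves.append(u)
    return leaves

root = EUnit(remaining_ops=["pname", "addr"],
             mappings=[({"pname": "oname", "addr": "oaddr"}, 0.3),   # m1
                       ({"pname": "oname", "addr": "haddr"}, 0.3),   # m2
                       ({"pname": "cname", "addr": "haddr"}, 0.4)])  # m3
print(len(run(root, choose_next_op=lambda u: u.remaining_ops[0])))   # 3 completed e-units
```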

  19. Operator Selection • Method 1: Randomly select the next operator

  20. Operator Selection • Method 2: SNF (or Smallest Number of Partitions First) chooses the target operator that leads to the fewest mapping partitions • In the example, one candidate operator is mapped to 3 source attributes (i.e., 3 mapping partitions) and the other to 4 (4 mapping partitions)

  21. Operator Selection • Method 3: SEF (or Smallest Entropy First) chooses the target operator (e.g., phone vs. addr) that leads to the lowest entropy
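A sketch of how SNF and SEF could pick the next operator. The partition count mirrors slide 20 directly; the entropy used for SEF (entropy of the distribution of mappings over the partitions a candidate operator would induce) is my assumption about what "lowest entropy" means here, and the mappings are invented:

```python
# Sketch of the SNF and SEF operator-selection heuristics. The entropy formula is
# an assumption: entropy of the distribution of mappings over induced partitions.
import math
from collections import Counter

def partitions_for(op, mappings):
    """Partition sizes induced by target attribute `op`: one partition per
    distinct source attribute that the mappings map `op` to."""
    return Counter(m[op] for m, _ in mappings)

def snf(candidate_ops, mappings):
    """Smallest Number of Partitions First: fewest distinct source attributes."""
    return min(candidate_ops, key=lambda op: len(partitions_for(op, mappings)))

def sef(candidate_ops, mappings):
    """Smallest Entropy First: the most skewed (lowest-entropy) partitioning."""
    def entropy(op):
        sizes = partitions_for(op, mappings).values()
        total = sum(sizes)
        return -sum((s / total) * math.log2(s / total) for s in sizes)
    return min(candidate_ops, key=entropy)

mappings = [                                       # illustrative mappings only
    ({"phone": "ophone", "addr": "oaddr"}, 0.3),
    ({"phone": "ophone", "addr": "haddr"}, 0.3),
    ({"phone": "hphone", "addr": "haddr"}, 0.2),
    ({"phone": "hphone", "addr": "waddr"}, 0.2),
]
print(snf(["phone", "addr"], mappings))   # 'phone': 2 partitions vs 3 for 'addr'
print(sef(["phone", "addr"], mappings))   # 'phone': entropy 1.0 vs 1.5 for 'addr'
```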

  22. Advantages of O-sharing • Interleaves query rewriting and operator execution • May not have to consider the whole target query for every mapping, due to empty intermediate results • The current o-sharing solution supports selection, projection, join, MIN, MAX, and SUM operators

  23. Probabilistic top-k queries • Query semantics • Returns the k tuples whose probabilities are the highest, among those with non-zero probabilities • Our new algorithm can prune non-answer tuples • Avoids evaluating the actual probabilities of all answer tuples • This is done by partially expanding the e-units
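A heavily simplified sketch of the pruning idea: a tuple's probability can grow by at most the probability mass of the mappings not evaluated yet, so a tuple whose upper bound cannot reach the current k-th highest confirmed probability can be discarded. The bound and the function below are my illustration, not the paper's algorithm:

```python
# A heavily simplified sketch of top-k pruning (the bound is my illustration, not
# the paper's algorithm). `confirmed` holds each tuple's probability accumulated
# from the mappings evaluated so far; `remaining_mass` is the total probability of
# the mappings not evaluated yet, so no tuple can gain more than that.
import heapq

def topk_with_pruning(k, confirmed, remaining_mass):
    kth_lower = (sorted(confirmed.values(), reverse=True)[k - 1]
                 if len(confirmed) >= k else 0.0)
    # A tuple whose best-case probability cannot reach the k-th confirmed
    # probability can never enter the top-k, so it is pruned.
    survivors = {t: p for t, p in confirmed.items() if p + remaining_mass >= kth_lower}
    return heapq.nlargest(k, survivors.items(), key=lambda kv: kv[1])

print(topk_with_pruning(2, {"t1": 0.6, "t2": 0.5, "t3": 0.1}, remaining_mass=0.2))
# [('t1', 0.6), ('t2', 0.5)] -- t3 is pruned because 0.1 + 0.2 < 0.5
```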

  24. Experimental setup • Schemas and data are about purchase orders • Source schema: TPC-H • 100MB database, with 1M tuples • 46 attributes, 8 relations • 3 Target schemas provided by COMA++ • Excel, Noris, Paragon • 48, 66, and 69 attributes • Schema matcher: COMA++ • 10 target queries: selection, projection, join, COUNT, and SUM • 100 probabilistic mappings • SEF is used for o-sharing

  25. Query performance

  26. Effect of query size

  27. Operator selection strategies • SNF is much better than Random, and SEF further improves SNF.

  28. Top-k query performance • Top-k queries can improve query performance, especially when the query returns a large result set.

  29. Related work • Schema matching • Uncertainty is not considered in most existing work • Probabilistic schema mapping [Gal06, DHY07] • Uncertain XML schema matching [CGC10, GCC11] • Computation and storage of probabilistic XML mappings • Evaluation of probabilistic XML queries

  30. Conclusions • Probabilistic mappings can be used to handle the uncertainty of schema matching • To evaluate queries efficiently under the by-table semantics, we propose q-sharing and o-sharing • They exploit the correspondences of mappings that share a query or its query operators • We plan to study the use of o-sharing on other queries (e.g., set difference and recursive queries)

  31. Thank you! Reynold Cheng (HKU) URL: http://www.cs.hku.hk/~ckcheng Email: ckcheng@cs.hku.hk

  32. References • [CGC10] R. Cheng, J. Gong, and D. Cheung, “Managing uncertainty in XML schema matching”, in ICDE, 2010 • [GCC11] J. Gong, R. Cheng, and D. Cheung, “Efficient management of uncertainty in XML schema matching”, in VLDB J., 2011 • [Len02] Lenzerini, “Data integration: a theoretical perspective”, in PODS, 2002 • [YP04] Yu et al., “Constraint-based XML query rewriting for data integration”, in SIGMOD, 2004 • [DR02] Do et al., “COMA: a system for flexible combination of schema matching approaches”, in VLDB, 2002 • [Gal06] Gal, “Managing uncertainty in schema matching with top-k schema mappings”, in J. Data Semantics VI, 2006 • [DHY07] Dong et al., “Data integration with uncertainty”, in VLDB, 2007 • [QYD07] Qin et al., “TwigList: make twig pattern matching fast”, in DASFAA, 2007 • [Murty68] Murty, “An algorithm for ranking all the assignments in order of increasing cost”, Operations Research, vol. 16, 1968 • [RB01] Rahm et al., “A survey of approaches to automatic schema matching”, VLDB J., vol. 10, 2001 • [KYS08] Kimelfeld et al., “Query efficiency in probabilistic XML models”, in SIGMOD, 2008 • [ZLFL07] J. Zhou, P. Larson, J. Freytag, and W. Lehner, “Efficient exploitation of similar subexpressions for query processing”, in SIGMOD, 2007

  33. Probabilistic mappings • We assume that the schema matching is represented by h probabilistic mappings. • The probability of each mapping is obtained by using a bipartite matching algorithm on the similarity scores of correspondences [CGC10]

  34. Generating the top-h mappings • Use an h-maximum bipartite matching algorithm to find the h mappings with the highest scores • See [CGC10] • Image elements are inserted to model the absence of a correspondence • We use approach 3

  35. Probabilistic mappings • Find the h mappings with the highest scores, using a bipartite matching algorithm [CGC10] • For each Mi, obtain Pr(Mi) by normalizing Mi’s score with the sum of scores of the h mappings
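The normalization step in a short sketch. The scores below are illustrative; they are chosen so that the resulting probabilities match the 0.5, 0.4, 0.1 used in the by-table example of slide 42:

```python
# Normalization sketch: Pr(Mi) = score(Mi) / sum of scores of the h retained mappings.
# The scores are illustrative assumptions, not values from the paper.
def normalize(scores):
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

print(normalize({"m1": 45, "m2": 36, "m3": 9}))
# {'m1': 0.5, 'm2': 0.4, 'm3': 0.1}
```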

  36. Target queries

  37. Basic Solutions • e-basic is the best among the simple solutions. We thus compare it with q-sharing and o-sharing.

  38. Overlap of Mappings • Overlap = (no. of common correspondences) / (no. of distinct correspondences)

  39. Operator Selection Strategies

  40. Probabilistic Query Evaluation • 2 ways to reformulate and evaluate a target query • By-table semantics • All tuples in source tables use the same possible mapping • By-tuple semantics • Each tuple in source tables may use a different possible mapping • Details in Appendix B

  41. By-table semantics • All tuples in source tables use the same possible mapping • The query answers from mapping Mi have probability Pr(Mi) • If duplicate removal is enforced, then a tuple t returned by both M1 and M2 has probability • Pr(t) = Pr(M1) + Pr(M2)

  42. By-table semantic • Example Target query: SELECT mailing-addr from T When m1 is considered, the query answer: Sunnyvale, 0.5 When m2 is considered, the query answer: Sunnyvale, 0.4 Mountain View, 0.4 When m3 is considered, the query answer: alice@, 0.1 bob@, 0.1 Final query answer (with duplicates removed): Sunnyvale, 0.9 Mountain View, 0.4 alice@, 0.1 bob@, 0.1
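The slide-42 numbers can be reproduced with a short sketch. The source tuples below are reconstructed from the answers shown on the slide, so they are an assumption rather than given data:

```python
# Reproducing the slide-42 numbers. The source tuples are reconstructed from the
# answers shown on the slide (an assumption), with the target attribute
# mailing-addr mapped to current-addr (m1), permanent-addr (m2), email-addr (m3).
from collections import defaultdict

source = [
    {"pname": "Alice", "email-addr": "alice@", "permanent-addr": "Mountain View", "current-addr": "Sunnyvale"},
    {"pname": "Bob",   "email-addr": "bob@",   "permanent-addr": "Sunnyvale",     "current-addr": "Sunnyvale"},
]
mappings = [("current-addr", 0.5), ("permanent-addr", 0.4), ("email-addr", 0.1)]

answers = defaultdict(float)
for src_attr, prob in mappings:
    for value in {row[src_attr] for row in source}:    # duplicate removal within one mapping
        answers[value] += prob                          # sum Pr(Mi) over mappings returning t

print(dict(answers))
# Sunnyvale: 0.9, Mountain View: 0.4, alice@: 0.1, bob@: 0.1 -- as on the slide
```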

  43. Basic solution • Evaluate the target query for every possible mapping Mi • The query answers from mapping Mi have probability Pr(Mi) • If duplicate removal is enforced, then a tuple t returned by both M1 and M2 has probability • Pr(t) = Pr(M1) + Pr(M2) • Very expensive if the number of mappings, |M|, is huge

  44. A basic solution • Example target query: SELECT phone FROM Person WHERE addr=‘aaa’ • Under m1, the source query SELECT ophone FROM Person WHERE oaddr=‘aaa’ returns “123” and “456”, each with probability 0.3

  45. A basic solution • Example target query: SELECT phone FROM Person WHERE addr=‘aaa’ • Under m2, the same source query SELECT ophone FROM Person WHERE oaddr=‘aaa’ returns “123” and “456”, each with probability 0.2

  46. A basic solution • Example target query: SELECT phone FROM Person WHERE addr=‘aaa’ • Combining m1 and m2, the answers are “123” with probability 0.5 and “456” with probability 0.5

  47. 5 Algorithms for comparison • Basic: considers each possible mapping separately • e-basic: first clusters identical source queries, then evaluates the set of distinct source queries • e-MQO: improves e-basic by applying multi-query optimization to the set of distinct source queries • Our solutions: q-sharing and o-sharing

  48. Target Query Evaluation • 5 algorithms for querying probabilistic mappings: • basic • e-basic • e-MQO • Q-sharing • O-sharing

  49. Q-Sharing Source query: SELECT oaddr FROM Customer WHERE ophone=‘123’ Query answer for m1 and m2: aaa, 0.5
