
Integrating Web Query Results: Holistic Schema Matching

Shui-Lung Chuang and Kevin C. Chang

Presentation Transcript


  1. Integrating Web Query Results: Holistic Schema Matching Shui-Lung Chuang and Kevin C. Chang

  2. Big Picture: Deep-Web Data Integration • To integrate many sources in the “same domain” • Example: the query author=“Donald Knuth” returns records such as <concrete math, $65> and <art of comp. prog., $60>

  3. Big Picture: Deep-Web Data Integration • To integrate many sources in the “same domain” • Query side (diagram labels): Query Forms, Form Extraction, Form Integration, Unified Query Form • Example: the query author=“Donald Knuth” returns records such as <concrete math, $65> and <art of comp. prog., $60>

  4. Big Picture: Deep-Web Data Integration • To integrate many sources in the “same domain” • Query side (diagram labels): Query Forms, Form Extraction, Form Integration, Unified Query Form • Result side (diagram labels): Query Results, Result Extraction, Result Integration, Integrated Query Results • Example: the query author=“Donald Knuth” returns records such as <concrete math, $65> and <art of comp. prog., $60>

  5. Big Picture: Deep-Web Data Integration • To integrate many sources in the “same domain” • Query side (diagram labels): Query Forms, Form Extraction, Form Integration, Unified Query Form • Result side (diagram labels): Query Results, Result Extraction, Result Integration, Integrated Query Results • Example: the query author=“Donald Knuth” returns records such as <concrete math, $65> and <art of comp. prog., $60> • “This Study” is marked on the diagram

  6. Have a look at the real data

  7. Objective: Finding Similar Fields (Schema Matching) • We seek to integrate multiple sources • And, of course, to be more accurate and more automatic, i.e., to need less pre-configured domain- or source-specific knowledge

  8. Schema Matching on Query Results • A source = a table with the content obtained by similar queries

  9. Schema Matching on Query Results • A source = a table with the content obtained by similar queries • A field = label? + values • Example values: “Our Price:” $15.95, $13.97, …; $20.99, $35.85, …

  10. The common instance-based matching approach • A source = a table with the content obtained by similar queries • A field = label? + values • Field similarity + Best-First matching (~87% per pair of book sources)
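A minimal sketch of the "field similarity + best-first matching" idea on slide 10. The slides do not say which similarity measure is used, so `value_pattern` and `field_similarity` (Jaccard overlap of coarse value patterns) are hypothetical stand-ins, as are the field names and sample values. The `threshold` parameter is there only to show where a stopping rule would have to go.

```python
import re
from itertools import product


def value_pattern(value):
    """Collapse digit runs to '9' and letter runs to 'a', e.g. '$15.95' -> '$9.9'."""
    return re.sub(r"[A-Za-z]+", "a", re.sub(r"\d+", "9", value))


def field_similarity(values1, values2):
    """Jaccard overlap of the value patterns of two fields (assumed measure)."""
    p1 = {value_pattern(v) for v in values1}
    p2 = {value_pattern(v) for v in values2}
    union = p1 | p2
    return len(p1 & p2) / len(union) if union else 0.0


def best_first_match(source1, source2, threshold):
    """Greedily pair the most similar unmatched fields, best score first."""
    scored = sorted(
        ((field_similarity(v1, v2), f1, f2)
         for (f1, v1), (f2, v2) in product(source1.items(), source2.items())),
        key=lambda t: -t[0],
    )
    used1, used2, matches = set(), set(), []
    for score, f1, f2 in scored:
        if score < threshold:
            break  # stopping point: depends entirely on the chosen threshold
        if f1 in used1 or f2 in used2:
            continue  # each field is matched at most once
        matches.append((f1, f2))
        used1.add(f1)
        used2.add(f2)
    return matches


# Hypothetical sample data: two sources, each a mapping field -> values.
source1 = {
    "a1": ["Concrete Mathematics", "Art of Computer Programming"],
    "a2": ["$15.95", "$13.97"],
}
source2 = {
    "b1": ["$20.99", "$35.85"],
    "b2": ["Concrete Math", "TeX Book"],
}

loose = best_first_match(source1, source2, threshold=0.3)
strict = best_first_match(source1, source2, threshold=0.6)
```

With the looser threshold both the price fields and the title fields are paired; with the stricter one only the price fields survive, which is the sensitivity that slide 11 asks about.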

  11. Problem 1: When to stop? • Need some threshold?

  12. Problem 2: How to leverage more info? • Structure info (beyond content) • Source 1: pos<(a1,a2), pos<(a1,a3), …, pos>(a6,a7), num>(a6,a7), … • Source 2: pos<(b1,b2), pos<(b1,b3), …, pos>(b5,b6), num>(b6,b5), …

  13. Problem 2: How to leverage more info? Airfare Example • Structure info (beyond content)
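A small sketch of how structure facts like those on slide 12 might be derived from one source. The reading of the notation is an assumption: `pos<(x, y)` is taken to mean "x is displayed before y" and `num>(x, y)` to mean "the numeric value of x exceeds that of y in every record". The function name, field names, and sample values are hypothetical.

```python
from itertools import combinations


def structure_facts(field_order, numeric_values):
    """Return a set of (predicate, field, field) facts for one source.

    field_order    -- field names in display order
    numeric_values -- mapping field -> list of numbers, one per record
    """
    facts = set()
    # pos<: every earlier field precedes every later field.
    for before, after in combinations(field_order, 2):
        facts.add(("pos<", before, after))
    # num>: one field's value is larger in every aligned record.
    for f, g in combinations(numeric_values, 2):
        pairs = list(zip(numeric_values[f], numeric_values[g]))
        if pairs and all(x > y for x, y in pairs):
            facts.add(("num>", f, g))
        elif pairs and all(y > x for x, y in pairs):
            facts.add(("num>", g, f))
    return facts


facts = structure_facts(
    ["title", "list_price", "our_price"],
    {"list_price": [30.99, 24.50], "our_price": [19.22, 19.20]},
)
```

Here the list price is larger than the store price in both records, so the sketch emits `num>(list_price, our_price)` alongside the three ordering facts.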

  14. Problem 3: How to combine multiple sources? • Sources 1, 2, 3, 4, … • Chained pairs: (1–2) (2–3) (3–4) • Progressive: (((1–2)–3)–4)

  15. Problem 3: How to combine multiple sources? • Linear combination: (1–2) (2–3) (3–4), or (((1–2)–3)–4) • Error propagation (fields a, b, c in 3 sources) • Example field values: Our Price: $10.99, Our Price: $23.99; $24.50, $35.99; Price: $13.10; Save: $20.23, Save: 10%; Jul 10, 2007, Oct 26, 2008

  16. In a nutshell, the problems are … • Needing some knowledge input to guide better matching, e.g., a threshold, information about structure • Lacking a way to effectively combine multiple sources

  17. Our Idea: Holistic Schema Matching • Hypothesize a domain schema model M • Diagram labels: Knowledge, Sources 1, 2, 3, 4, …, Domain Schema Model, Matching Results

  18. Our Idea: Holistic Schema Matching • Hypothesize a domain schema model M that • encodes the knowledge • describes all the sources

  19. Our Idea: Holistic Schema Matching • Hypothesize a domain schema model M that • encodes the knowledge • describes all the sources • Turn matching multiple sources into finding the domain model that describes them

  20. Our Approach to Holistic Matching • How to get from Sources 1, 2, 3, 4, … to Matching Results?

  21. Our Approach to Holistic Matching • Holistically: aggregate the matchings of all sources (Meta-Matching)

  22. Our Approach to Holistic Matching • Holistically: aggregate the matchings of all sources (Meta-Matching) • Iteratively: learn the domain model from the matching and then refine … • Steps: Meta-Matching, Learn from Matching (Domain Schema Model), Refine Matching

  23. Meta-Matching: Find one matching most consistent with all • Input matchings: 1–2, 1–3, 1–4, 2–3, 2–4, 3–4

  24. Meta-Matching: Find one matching most consistent with all • Generate some matching candidates • Input matchings: 1–2, 1–3, 1–4, 2–3, 2–4, 3–4 • Fields: a1, b2, a2, b1, c2 (merge steps labeled C6 to C9 on the diagram)

  25. Meta-Matching: Find one matching most consistent with all • Generate some matching candidates • Input matchings: 1–2, 1–3, 1–4, 2–3, 2–4, 3–4 • Fields: a1, b2, a2, b1, c2 (merge steps labeled C6 to C9 on the diagram) • Candidates: P1 = {a1} {b2} {a2, b1} {c2}; P2 = {a1, b2} {a2, b1} {c2}; P3 = {a1, b2} {a2, b1, c2}

  26. Meta-Matching: Find one matching most consistent with all • Generate some matching candidates • Select the most consistent one (F-measure against the input matchings, IMs) • Input matchings (IMs): 1–2, 1–3, 1–4, 2–3, 2–4, 3–4 • Fields: a1, b2, a2, b1, c2 (merge steps labeled C6 to C9 on the diagram) • Candidates: P1 = {a1} {b2} {a2, b1} {c2}; P2 = {a1, b2} {a2, b1} {c2}; P3 = {a1, b2} {a2, b1, c2}
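One way the "select the most consistent candidate" step could look, as a sketch only. The slide names F-measure but not how it is applied, so the scoring below (average F-measure between the pairs a candidate implies and each pairwise input matching) is an assumption, and the input matchings are made-up data. Field names follow the slide, and the first letter of a field name is assumed to identify its source.

```python
from itertools import combinations


def implied_pairs(partition, src_a, src_b):
    """Field pairs between two sources that a candidate partition puts together."""
    pairs = set()
    for cluster in partition:
        for x, y in combinations(cluster, 2):
            if {x[0], y[0]} == {src_a, src_b}:
                pairs.add(frozenset((x, y)))
    return pairs


def f_measure(predicted, gold):
    """Harmonic mean of precision and recall over two sets of pairs."""
    if not predicted and not gold:
        return 1.0
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(predicted), hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def consistency(partition, input_matchings):
    """Average F-measure of a candidate against every pairwise input matching."""
    scores = [
        f_measure(implied_pairs(partition, a, b), gold)
        for (a, b), gold in input_matchings.items()
    ]
    return sum(scores) / len(scores)


def meta_match(candidates, input_matchings):
    """Name of the candidate most consistent with the input matchings."""
    return max(candidates, key=lambda name: consistency(candidates[name], input_matchings))


# Candidates as listed on the slide.
candidates = {
    "P1": [{"a1"}, {"b2"}, {"a2", "b1"}, {"c2"}],
    "P2": [{"a1", "b2"}, {"a2", "b1"}, {"c2"}],
    "P3": [{"a1", "b2"}, {"a2", "b1", "c2"}],
}
# Hypothetical pairwise input matchings between sources a, b, and c.
input_matchings = {
    ("a", "b"): {frozenset(("a1", "b2")), frozenset(("a2", "b1"))},
    ("a", "c"): {frozenset(("a2", "c2"))},
    ("b", "c"): {frozenset(("b1", "c2"))},
}

best = meta_match(candidates, input_matchings)
```

With these inputs P3 agrees with all three pairwise matchings, while P1 and P2 miss the pairs involving c2, so P3 is selected.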

  27. Learn Model: The Matching => A more complete table • Example columns: (#title), (#author), Retail Price: $20.22, List Price: $30.99, Our Price: $19.22, $19.20

  28. Learn Model: The Matching => A more complete table • 1. A more complete set of fields in the domain • Example columns: (#title), (#author), Retail Price: $20.22, List Price: $30.99, Our Price: $19.22, $19.20

  29. Learn Model: The Matching => A more complete table • 2. More labels + instances => more content evidence • Example labels for one field: Retail price:, List price:, Retail, Buy new, Original price, … • Example labels for another field: Format, Binding, …, with values paperback, hardcover, Hard Cover, Electronic, trade paper, … • Example price values: $20.99, $35.85, $40.99, …

  30. Learn Model: The Matching => A more complete table • 3. Structure info revealed • Source 1: pos<(a1,a2):1, …, num>(a6,a7):1, …, first(a1):1, first(a2):0, … • Source 2: pos<(b1,b2):1, …, num>(b5,b4), …, first(b1):1, first(b2):0, … • Source 3: pos<(c1,c2):1, …, first(c1):1, first(c2):0, …

  31. Learn Model: The Matching => A more complete table • 3. Structure info revealed • Source 1: pos<(a1,a2):1, …, num>(a6,a7):1, …, first(a1):1, first(a2):0, … • Source 2: pos<(b1,b2):1, …, num>(b5,b4), …, first(b1):1, first(b2):0, … • Source 3: pos<(c1,c2):1, …, first(c1):1, first(c2):0, … • Learned for the domain: pos<(A2,A3):1, pos<(A7,A8):0.6, …, num>(A7,A8):1, first(A1):1, exist(A1):0.5, …, first(A2):0.5, exist(A2):1, …

  32. Learn Model: The Matching => A more complete table • Domain model: • A set of nodes (A1, A2, A3, A4, …), each encoding the content of one field • A set of soft constraints f(v1,…,vk):c, encoding the structure info between nodes, e.g., pos<(A1,A2):1, num>(A7,A8):1, first(A1):1, exist(A1):0.5
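A sketch of how soft constraints with confidences might be aggregated from matched sources. The slides show values such as `first(A1):1, exist(A1):0.5` but not the formulas, so the definitions here are assumptions: `exist(A)` is the fraction of sources containing A, `first(A)` is the fraction of those sources where A comes first, and `pos<(A, B)` is the fraction of sources containing both where A precedes B. The function name and the two sample sources are hypothetical.

```python
from collections import Counter
from itertools import combinations


def learn_constraints(sources):
    """Map each soft constraint to a confidence in [0, 1].

    sources -- list of sources, each a list of domain attributes in display
               order (attributes are assumed unique within a source).
    """
    exist, first, before = Counter(), Counter(), Counter()
    for attrs in sources:
        exist.update(attrs)
        if attrs:
            first[attrs[0]] += 1
        for a, b in combinations(attrs, 2):  # a is displayed before b
            before[(a, b)] += 1

    model = {}
    for attr, count in exist.items():
        model[("exist", attr)] = count / len(sources)
        model[("first", attr)] = first[attr] / count
    for (a, b), count in before.items():
        both = count + before[(b, a)]  # sources that contain both attributes
        model[("pos<", a, b)] = count / both
    return model


# Two matched sources: one shows A1, the other starts directly with A2.
model = learn_constraints([["A1", "A2", "A3"], ["A2", "A3"]])
```

Under these assumed definitions the two-source example yields the kind of values shown on the slide: A1 exists in half the sources and is always first when present, while A2 always exists and is first half the time.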

  33. Refine Matching: “Classify” each source to the domain model • Domain model M: nodes A1, A2, A3, A4, …; soft constraints f(v1,…,vk):c, e.g., pos<(A1,A2):1, num>(A7,A8):1, first(A1):1, exist(A1):0.5 • Source model S: nodes x1, x2, x3, x4, …; constraints f(v1,…,vk):0/1, e.g., pos<(x1,x2):1, first(x1):1, exist(x1):, first(x2):0

  34. Example: Correcting Matching Errors • site 1: [ 1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13 ]; site 2: [ 1, 20, 3, 5, 14, 6, 11, 12, 15, 13, 9, 10 ]; site 3: [ 1, 2, 12, 9, 3, 17, 6, 5 ]; site 4: [ 2, 3, 6, 11, 18, 14, 5 ]; site 5: [ 2, 3, 18, 19, 6, 4, 17, 11 ]; site 6: [ 2, 3, 17, 19, 5, 14, 10, 12, 11 ]; site 7: [ 1, 2, 3, 5, 6, 18, 9, 11, 12, 13 ]; site 8: [ 2, 3, 5, 17, 6, 18, 4, 11, 12, 15, 16 ]; site 9: [ 1, 2, 3, 18, 5 ]; site 10: [ 1, 2, 3, 17, 18, 6, 11, 12, 15 ] • Domain model: nodes A1, A2, A3, A4, …; soft constraints f(v1,…,vk):c, e.g., pos<(A1,A2):1, num>(A7,A8):1, first(A1):1, exist(A1):0.5

  35. Example: Correcting Matching Errors • site 1: [ 1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13 ]; site 2: [ 1, 2, 3, 5, 14, 6, 11, 12, 15, 13, 9, 10 ]; site 3: [ 1, 2, 12, 9, 3, 17, 6, 5 ]; site 4: [ 2, 3, 6, 11, 18, 14, 5 ]; site 5: [ 2, 3, 18, 19, 6, 4, 17, 11 ]; site 6: [ 2, 3, 17, 19, 5, 14, 10, 12, 11 ]; site 7: [ 1, 2, 3, 5, 6, 18, 9, 11, 12, 13 ]; site 8: [ 2, 3, 5, 17, 6, 18, 4, 11, 12, 15, 16 ]; site 9: [ 1, 2, 3, 18, 5 ]; site 10: [ 1, 2, 3, 17, 18, 6, 11, 12, 15 ] • Domain model: nodes A1, A2, A3, A4, …; soft constraints f(v1,…,vk):c, e.g., pos<(A1,A2):1, num>(A7,A8):1, first(A1):1, exist(A1):0.5 • Correction: site 2: [ 1, 20, 3, 5, 14, 6, 11, 12, 15, 13, 9, 10 ] becomes site 2: [ 1, 2, 3, 5, 14, 6, 11, 12, 15, 13, 9, 10 ]
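A rough sketch of how structural evidence alone could produce the site 2 correction shown on slides 34 and 35. This is not the authors' classifier: the heuristic (treat an attribute seen in no other source as suspect, then pick the replacement whose existence frequency and ordering agree best with the other sources) is a hypothetical simplification that ignores content features. The site lists are copied from slide 34; all function names are made up.

```python
from itertools import combinations

# Attribute sequences of sites 1 and 3 to 10, as listed on slide 34.
OTHER_SITES = [
    [1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13],
    [1, 2, 12, 9, 3, 17, 6, 5],
    [2, 3, 6, 11, 18, 14, 5],
    [2, 3, 18, 19, 6, 4, 17, 11],
    [2, 3, 17, 19, 5, 14, 10, 12, 11],
    [1, 2, 3, 5, 6, 18, 9, 11, 12, 13],
    [2, 3, 5, 17, 6, 18, 4, 11, 12, 15, 16],
    [1, 2, 3, 18, 5],
    [1, 2, 3, 17, 18, 6, 11, 12, 15],
]
SITE_2 = [1, 20, 3, 5, 14, 6, 11, 12, 15, 13, 9, 10]  # 20 is the matching error


def learn(sources):
    """Count how often each attribute exists and how often a precedes b."""
    exist, before = {}, {}
    for attrs in sources:
        for a in attrs:
            exist[a] = exist.get(a, 0) + 1
        for a, b in combinations(attrs, 2):
            before[(a, b)] = before.get((a, b), 0) + 1
    return exist, before


def order_conf(before, a, b):
    """Fraction of co-occurrences in which a precedes b, or None if never seen together."""
    fwd, rev = before.get((a, b), 0), before.get((b, a), 0)
    return fwd / (fwd + rev) if fwd + rev else None


def fit(cand, index, fields, exist, before, n_sources):
    """Existence frequency times average ordering agreement for cand at position index."""
    checks = []
    for j, other in enumerate(fields):
        if j == index:
            continue
        conf = order_conf(before, other, cand) if j < index else order_conf(before, cand, other)
        if conf is not None:
            checks.append(conf)
    order = sum(checks) / len(checks) if checks else 0.0
    return exist[cand] / n_sources * order


def refine(target, others):
    """Replace attributes unsupported by any other source with the best-fitting one."""
    exist, before = learn(others)
    refined = list(target)
    for i, attr in enumerate(target):
        if attr in exist:
            continue  # supported by at least one other source, keep it
        candidates = [c for c in exist if c not in refined]
        if candidates:
            refined[i] = max(
                candidates,
                key=lambda c: fit(c, i, refined, exist, before, len(others)),
            )
    return refined


corrected = refine(SITE_2, OTHER_SITES)
```

On this data attribute 2 appears in every other site and always sits right after attribute 1 or at the front, so it is the best structural fit for the slot that was mislabeled as 20.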

  36. In summary, our approach works as … • Diagram labels: spatial, temporal, sources S1 to S5, domain model

  37. Experiment • Goals • Look at the performance of matching all sources • Look at the matching performance of individual pairs of sources • Look at the results on extracted data

  38. Experiment Setup • Domains (10 sources each) • Airfare, e.g., expedia.com, united.com, etc. • Book, e.g., amazon.com, bn.com, etc. • Car, e.g., cars.com • Album, e.g., allmusic.com, etc. • The 1st response pages for 3 queries (~300 records in a domain) • Comparison methods • ChainMatch: (1-2) (2-3) (3-4) • ProgMatch: (((1-2)-3)-4) • ClusMatch: (1-2-3-4) by agglomerative clustering • InitMatch: meta-matching only, i.e., without iteration

  39. Experiment Results on All Sources • The matching performance on all sources • All-source matching is better than linearly combining two-source matchings • The matching gives useful feedback to refine itself

  40. Experiment Results on All Sources • The matching performance on all sources • The performance over iterations (converges within 3-4 iterations)

  41. Experiment Results on Two Sources • Matching all sources also helps the matching of individual pairs of sources • Airfare: PairMatch: .77, CorpusMatch: .80, HoliMatch: .95 • Diagram: matching of all sources (1, 2, 3, 4, …) vs. matchings of two (1–2, 1–3, 1–4, 2–3, 2–4, 3–4)

  42. Experiment Results on Extracted Data • Observation • Better extraction => better matching

  43. Conclusions • Problem: we address query result integration through the concept of Holistic Schema Matching • Approach: we develop an approach that turns the matching of (N sources)×(N sources) into iterative matching of (N sources)×(1 domain model) • Evaluation: we conduct extensive experiments to show the feasibility of the approach

  44. The End Thanks!!

  45. Implementation • A field is modeled as a graph model (Label, Value) • Examples: $35.07; Our Price: $34.07; Low Price: $23.05; ISBN: 012569586161; UPC# 014633147841 • Features used • Content features: word, integer, float, date, time, punctuation • Structure features: field dist: exist; positions: first, pos<, last, adjacent; value comparison: num>, time>
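A minimal sketch of extracting some of the content features named on slide 45 from a field's text. The tokenizer and the `token_type` rules are assumptions made for illustration; the date and time features are left out because the slides do not say how they are recognized, and only ASCII letters count as words here.

```python
import re
from collections import Counter

# Floats first, then integers, then ASCII words, then any other single symbol.
TOKEN = re.compile(r"\d+\.\d+|\d+|[A-Za-z]+|[^\sA-Za-z0-9]")


def token_type(token):
    """Classify one token as float, integer, word, or punctuation."""
    if re.fullmatch(r"\d+\.\d+", token):
        return "float"
    if token.isdigit():
        return "integer"
    if token.isalpha():
        return "word"
    return "punctuation"


def content_features(text):
    """Count how many tokens of each type a label or value contains."""
    return Counter(token_type(t) for t in TOKEN.findall(text))


price = content_features("Our Price: $34.07")
upc = content_features("UPC# 014633147841")
```

Counts like these give each field a coarse content profile, so "Our Price: $34.07" and "Low Price: $23.05" look alike while an ISBN or UPC field does not.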
