Integrating Web Query Results: Holistic Schema Matching Shui-Lung Chuang and Kevin C. Chang
Big Picture: Deep-Web Data Integration • To integrate many sources in the “same domain” • Query side: Query Forms → Form Extraction → Form Integration → Unified Query Form (e.g., author=“Donald Knuth”) • Result side: Query Results (e.g., <concrete math, $65>, <art of comp. prog., $60>, …) → Result Extraction → Result Integration → Integrated Query Results • [Figure label “This Study” marks the part of this pipeline covered by the talk]
Objective: Finding Similar Fields (Schema Matching) • We seek to • Integrate multiple sources • And, of course, • Be more accurate • Be more automatic, i.e., need less pre-configured domain- or source-specific knowledge [Figure: query-result tables of Sources 1, 2, 3, 4, …]
Schema Matching on Query Results • A source = A table with the content obtained by similar queries • A field = label? + values • Example (Sources 1 and 2): Our Price: $15.95, $13.97, ……; $20.99, $35.85, ……
The common instance-based matching approach • A source = A table with the content obtained by similar queries • A field = label? + values • Field similarity + best-first matching between Sources 1 and 2 (~87% for a pair of book sources)
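The "field similarity + best-first matching" baseline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Jaccard similarity over token sets, the `threshold` parameter, and the function names are all assumptions made for the example.

```python
from itertools import product

def jaccard(a, b):
    """Jaccard similarity of two token collections (a stand-in field similarity)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def best_first_match(fields1, fields2, sim=jaccard, threshold=0.0):
    """Greedy one-to-one matching: repeatedly take the most similar unmatched pair.

    fields1, fields2: {field id: list of content tokens} for two sources.
    Matching stops once the best remaining similarity is <= threshold.
    """
    scored = sorted(
        ((sim(v1, v2), f1, f2)
         for (f1, v1), (f2, v2) in product(fields1.items(), fields2.items())),
        key=lambda t: -t[0],
    )
    used1, used2, matches = set(), set(), []
    for score, f1, f2 in scored:
        if score <= threshold:
            break
        if f1 not in used1 and f2 not in used2:
            matches.append((f1, f2, score))
            used1.add(f1)
            used2.add(f2)
    return matches

# Hypothetical fields described by coarse content tokens.
source1 = {"a1": ["word"], "a2": ["$", "float"]}
source2 = {"b1": ["$", "float", "label"], "b2": ["word"]}
print(best_first_match(source1, source2))
```

The result depends directly on where the greedy loop is cut off, which is why the choice of `threshold` matters for this kind of matcher.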
Problem 1: When to stop? • Need some threshold? [Figure: candidate matches between the fields of Sources 1 and 2, one marked “?”]
Problem 2: How to leverage more info? • Structure info (beyond content) • Source 1: pos<(a1,a2), pos<(a1,a3), …, pos>(a6,a7), num>(a6,a7), …… • Source 2: pos<(b1,b2), pos<(b1,b3), …, pos>(b5,b6), num>(b6,b5), ……
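A small sketch of how such structure predicates might be derived from one source. The semantics are assumptions for illustration: pos<(f,g) is read as "f is displayed before g", pos>(f,g) as "f is displayed after g", and num>(f,g) as "the numeric value of f exceeds that of g". The function name and inputs are hypothetical.

```python
def structure_predicates(order, numeric=None):
    """Derive structure predicates for one source.

    order:   field ids in display order, e.g. ["a1", "a2", "a3"]
    numeric: optional {field id: representative numeric value}
    """
    numeric = numeric or {}
    preds = set()
    for i, f in enumerate(order):
        for g in order[i + 1:]:
            preds.add(("pos<", f, g))  # f appears before g
            preds.add(("pos>", g, f))  # equivalently, g appears after f
    for f in numeric:
        for g in numeric:
            if f != g and numeric[f] > numeric[g]:
                preds.add(("num>", f, g))  # value of f is larger than value of g
    return preds

# Hypothetical source: a2 holds a list price, a3 a lower sale price.
preds = structure_predicates(["a1", "a2", "a3"], {"a2": 30.99, "a3": 19.22})
print(sorted(preds))
```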
Problem 2: How to leverage more info? Airfare Example • Structure info (beyond content)
Problem 3: How to combine multiple sources? • (1–2), (2–3), (3–4) • (((1–2)–3)–4) [Figure: Sources 1, 2, 3, 4, …]
Problem 3: How to combine multiple sources? • Linear combination: (1–2), (2–3), (3–4) or (((1–2)–3)–4) • Error propagation [Figure: fields a, b, c in 3 sources, paired as a b, a c, b c; Sources 1, 2, 3, 4, …] [Figure: result records 1. and 2. with columns 1 and 2; sample cells: Our Price: $10.99, Our Price: $23.99, $24.50, $35.99, Price: $13.10, Save: $20.23, Save: 10%, Jul 10, 2007, Oct 26, 2008]
In a nutshell, the problems are … • Needing some knowledge input to guide better matching • E.g., threshold, information about structure • Lacking a way to effectively combine multiple sources [Figure: Knowledge, Sources 1, 2, 3, 4, …, Matching, Matching Results; links marked “X” and “?”]
Our Idea: Holistic Schema Matching • Hypothesize a domain schema model that • encodes the knowledge • describes all the sources • Turn matching multiple sources into finding the domain model that describes them [Figure: Knowledge, Sources 1, 2, 3, 4, …, Domain Schema Model M, Matching Results]
Our Approach to Holistic Matching • Holistically: Aggregate the matchings of all sources (Meta-Matching) • Iteratively: Learn the domain model from the matching and then refine … [Figure: Sources 1, 2, 3, 4, … → Meta-Matching → Matching Results → Learn from Matching → Domain Schema Model → Refine Matching → Matching Results → …]
Meta-Matching: Find one matching most consistent with all • Generate some matching candidates • Initial: {a1} {b2} {a2} {b1} {c2} • P1: {a1} {b2} {a2, b1} {c2} • P2: {a1, b2} {a2, b1} {c2} • P3: {a1, b2} {a2, b1, c2} • Select the most consistent one: compare each candidate with the Input Matchings (IMs) 1–2, 1–3, 1–4, 2–3, 2–4, 3–4 by F-measure [Figure: merge tree with nodes C6, C7, C8, C9 over fields a1, b2, a2, b1, c2]
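The selection step can be sketched as below, using the candidate partitions from the slide. The slide only states that candidates are compared with the input matchings by F-measure; averaging one F-measure per source pair, the convention that the first letter of a field id names its source, and the sample input matchings (one of which is deliberately wrong) are assumptions for illustration.

```python
from itertools import combinations

def pairs(partition):
    """Cross-source field pairs that a partition places in the same group."""
    out = set()
    for group in partition:
        for f, g in combinations(sorted(group), 2):
            if f[0] != g[0]:  # assumption: first letter identifies the source
                out.add((f, g))
    return out

def f_measure(candidate_pairs, input_pairs):
    """Harmonic mean of precision and recall of candidate pairs vs. input pairs."""
    if not candidate_pairs and not input_pairs:
        return 1.0
    tp = len(candidate_pairs & input_pairs)
    if tp == 0:
        return 0.0
    precision = tp / len(candidate_pairs)
    recall = tp / len(input_pairs)
    return 2 * precision * recall / (precision + recall)

def meta_match(candidates, input_matchings):
    """Pick the candidate with the highest mean F-measure over all input matchings.

    input_matchings: {(source1, source2): set of (field, field) pairs}
    """
    def score(partition):
        cand = pairs(partition)
        total = 0.0
        for (s, t), im in input_matchings.items():
            restricted = {(f, g) for f, g in cand if {f[0], g[0]} == {s, t}}
            total += f_measure(restricted, im)
        return total / len(input_matchings)
    return max(candidates, key=score)

P0 = [{"a1"}, {"b2"}, {"a2"}, {"b1"}, {"c2"}]
P1 = [{"a1"}, {"b2"}, {"a2", "b1"}, {"c2"}]
P2 = [{"a1", "b2"}, {"a2", "b1"}, {"c2"}]
P3 = [{"a1", "b2"}, {"a2", "b1", "c2"}]

# Hypothetical pairwise matchings; the a-c matching contains an error.
ims = {
    ("a", "b"): {("a1", "b2"), ("a2", "b1")},
    ("a", "c"): {("a1", "c2")},
    ("b", "c"): {("b1", "c2")},
}
print(meta_match([P0, P1, P2, P3], ims))
```

In this toy input the erroneous a-c matching is outvoted by the other two, so the most complete grouping is still selected.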
Learn Model: The Matching => A more complete table 1. A more complete set of fields in the domain [Figure: merged table of sources 1., 2., 3. with cells such as (#title), (#author), Retail Price: $20.22, List Price: $30.99, Our Price: $19.22, $19.20, …]
Learn Model: The Matching => A more complete table 2. More labels + instances => more content evidence • Examples: labels Retail price:, List price:, Retail, Buy new, Original price, ……; labels Format, Binding, … with values paperback, hardcover, Hard Cover, Electronic, trade paper, ……; values $20.99, $35.85, $40.99, …… [Figure: merged table of sources 1., 2., 3. with cells such as (#title), (#author), Retail Price: $20.22, List Price: $30.99, Our Price: $19.22, $19.20, …]
Learn Model: The Matching => A more complete table 3. Structure info revealed • Source 1: pos<(a1,a2):1, …, num>(a6,a7):1, …, first(a1):1, first(a2):0, … • Source 2: pos<(b1,b2):1, …, num>(b5,b4), …, first(b1):1, first(b2):0, … • Source 3: pos<(c1,c2):1, …, first(c1):1, first(c2):0, … • Domain: pos<(A2,A3):1, pos<(A7,A8):0.6, …, num>(A7,A8):1, first(A1):1, exist(A1):0.5, …, first(A2):0.5, exist(A2):1, …
Learn Model: The Matching => A more complete table • A set of nodes, each encoding the content of one field • A set of soft constraints, encoding the structure info between nodes [Domain model: nodes A1, A2, A3, A4, …; soft constraints of the form f(v1,…,vk):c, e.g., pos<(A1,A2):1, …, num>(A7,A8):1, first(A1):1, exist(A1):0.5]
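One way the confidences c of the soft constraints could be estimated from matched sources is by simple counting, sketched below. The estimation rule is an assumption: exist(A) is taken as the fraction of sources containing A, first(A) as the fraction of those sources in which A comes first, and pos<(A,B) as the fraction of sources containing both in which A precedes B. The function name and input layout are hypothetical.

```python
from itertools import combinations

def learn_constraints(sites):
    """Estimate soft-constraint confidences from matched sources.

    sites: list of sources, each a list of domain-field ids in display order.
    Returns {constraint tuple: confidence in [0, 1]}.
    """
    n = len(sites)
    fields = sorted({f for s in sites for f in s})
    model = {}
    for f in fields:
        having = [s for s in sites if f in s]
        model[("exist", f)] = len(having) / n
        model[("first", f)] = sum(s[0] == f for s in having) / len(having)
    for f, g in combinations(fields, 2):
        both = [s for s in sites if f in s and g in s]
        if both:  # only record order constraints that have evidence
            model[("pos<", f, g)] = sum(s.index(f) < s.index(g) for s in both) / len(both)
    return model

# Two hypothetical matched sources over domain fields A1, A2, A3.
model = learn_constraints([["A1", "A2", "A3"], ["A2", "A3"]])
for constraint, c in sorted(model.items()):
    print(constraint, c)
```

With these two sources the sketch yields first(A1):1, exist(A1):0.5, first(A2):0.5, and exist(A2):1, the same kind of values shown for the domain model.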
Refine Matching: “Classify” each source to the domain model • Domain model M: nodes A1, A2, A3, A4, …; constraints f(v1,…,vk):c, e.g., pos<(A1,A2):1, …, num>(A7,A8):1, first(A1):1, exist(A1):0.5 • Source model S: nodes x1, x2, x3, x4, …; constraints f(v1,…,vk):0/1, e.g., pos<(x1,x2):1, …, first(x1):1, exist(x1):, first(x2):0
Example: Correcting Matching Errors site 1: [ 1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13 ]; site 2: [ 1, 20, 3, 5, 14, 6, 11, 12, 15, 13, 9, 10 ]; site 3: [ 1, 2, 12, 9, 3, 17, 6, 5 ]; site 4: [ 2, 3, 6, 11, 18, 14, 5 ]; site 5: [ 2, 3, 18, 19, 6, 4, 17, 11 ]; site 6: [ 2, 3, 17, 19, 5, 14, 10, 12, 11 ]; site 7: [ 1, 2, 3, 5, 6, 18, 9, 11, 12, 13 ]; site 8: [ 2, 3, 5, 17, 6, 18, 4, 11, 12, 15, 16 ]; site 9: [ 1, 2, 3, 18, 5 ]; site 10: [ 1, 2, 3, 17, 18, 6, 11, 12, 15 ]; [Domain model: pos<(A1,A2):1, …, num>(A7,A8):1, first(A1):1, exist(A1):0.5, …]
Example: Correcting Matching Errors site 1: [ 1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13 ]; site 2: [ 1, 2, 3, 5, 14, 6, 11, 12, 15, 13, 9, 10 ]; site 3: [ 1, 2, 12, 9, 3, 17, 6, 5 ]; site 4: [ 2, 3, 6, 11, 18, 14, 5 ]; site 5: [ 2, 3, 18, 19, 6, 4, 17, 11 ]; site 6: [ 2, 3, 17, 19, 5, 14, 10, 12, 11 ]; site 7: [ 1, 2, 3, 5, 6, 18, 9, 11, 12, 13 ]; site 8: [ 2, 3, 5, 17, 6, 18, 4, 11, 12, 15, 16 ]; site 9: [ 1, 2, 3, 18, 5 ]; site 10: [ 1, 2, 3, 17, 18, 6, 11, 12, 15 ]; • Correction for site 2: [ 1, 20, 3, 5, 14, 6, 11, 12, 15, 13, 9, 10 ] → [ 1, 2, 3, 5, 14, 6, 11, 12, 15, 13, 9, 10 ] [Domain model: pos<(A1,A2):1, …, num>(A7,A8):1, first(A1):1, exist(A1):0.5, …]
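The correction of site 2 can be reproduced with a deliberately simplified refinement step. Several things here are assumptions rather than the method from the talk: the integers are read as domain-field ids in display order, only position (pos<) evidence from the other nine sites is used, and a candidate's score is its mean agreement with that evidence. Content similarity, which the full approach also uses, is left out.

```python
def pos_before(sites, f, g):
    """Fraction of sites containing both f and g where f precedes g (None if no evidence)."""
    both = [s for s in sites if f in s and g in s]
    if not both:
        return None
    return sum(s.index(f) < s.index(g) for s in both) / len(both)

def refine_slot(site, slot, others):
    """Re-assign site[slot] to the domain field that best fits the learned order."""
    known = set(site) - {site[slot]}
    candidates = {f for s in others for f in s} - known
    def score(c):
        agreements = []
        for j, g in enumerate(site):
            if j == slot:
                continue
            p = pos_before(others, c, g)
            if p is not None:
                # c sits before g in this site when slot < j, after g otherwise
                agreements.append(p if slot < j else 1 - p)
        return sum(agreements) / len(agreements) if agreements else 0.0
    return max(sorted(candidates), key=score)

others = [
    [1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13],       # site 1
    [1, 2, 12, 9, 3, 17, 6, 5],                 # site 3
    [2, 3, 6, 11, 18, 14, 5],                   # site 4
    [2, 3, 18, 19, 6, 4, 17, 11],               # site 5
    [2, 3, 17, 19, 5, 14, 10, 12, 11],          # site 6
    [1, 2, 3, 5, 6, 18, 9, 11, 12, 13],         # site 7
    [2, 3, 5, 17, 6, 18, 4, 11, 12, 15, 16],    # site 8
    [1, 2, 3, 18, 5],                           # site 9
    [1, 2, 3, 17, 18, 6, 11, 12, 15],           # site 10
]
site2 = [1, 20, 3, 5, 14, 6, 11, 12, 15, 13, 9, 10]
print(refine_slot(site2, 1, others))
```

Field 2 is the only candidate whose position in site 2 (after 1, before 3 and everything else) agrees with every other site, so the suspicious id 20 is reassigned to 2.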
In summary, our approach works as … [Figure: sources S1, S2, S3, S4, S5, ……, shown twice; labels: Spatial, temporal, domain model]
Experiment • Goals • Look at the performance of matching all sources • Look at the matching performance on individual pairs of sources • Look at the results on extracted data
Experiment Setup • Domains (10 sources each) • Airfare, e.g., expedia.com, united.com, etc. • Book, e.g., amazon.com, bn.com, etc. • Car, e.g., cars.com • Album, e.g., allmusic.com, etc. • Data: the 1st response pages for 3 queries (~300 records in a domain) • Comparison methods • ChainMatch: (1-2) (2-3) (3-4) • ProgMatch: (((1-2)-3)-4) • ClusMatch: (1-2-3-4) by agglomerative clustering • InitMatch: meta-matching only, i.e., without iteration
Experiment Results on All Sources • The matching performance on all sources • All-source matching is better than linearly combining two-source matchings • The matching gives useful feedback to refine itself
Experiment Results on All Sources • The matching performance on all sources • The performance over iterations (converges within 3-4 iterations)
Experiment Results on Two Sources • Matching all sources also helps the matching of individual pairs of sources • Airfare: PairMatch: .77, CorpusMatch: .80, HoliMatch: .95 [Figure: “Matching of All” (Sources 1, 2, 3, 4, …) vs. “Matchings of Two” (1–2, 1–3, 1–4, 2–3, 2–4, 3–4)]
Experiment Results on Extracted Data • Observation • Better extraction => better matching
Conclusions • We proposed and developed • Problem: Address query-result integration through the concept of Holistic Schema Matching • Approach: Turn the matching of (N sources)×(N sources) into iterative matching of (N sources)×(1 domain model) • Evaluation: Conduct extensive experiments to show the feasibility of the approach
The End Thanks!!
Implementation • A field is modeled as a graph model [Figure: Label and Value nodes] • Examples: $35.07; Our Price: $34.07; Low Price: $23.05; ISBN: 012569586161; UPC# 014633147841 • Features used • Content Features: word, integer, float, date, time, punctuation • Structure Features • Field dist: exist • Positions: first, pos<, last, adjacent • Value comparison: num>, time>
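As an illustration of the content features, the sketch below maps a label or value string to a sequence of coarse token types. The tokenizer, the regular expressions (in particular the numeric date and time formats), and the function names are assumptions chosen for the example; the slides list only the feature names.

```python
import re

# Ordered (type, pattern) pairs; the first pattern that matches wins.
_PATTERNS = [
    ("date", re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),   # assumed m/d/y form
    ("time", re.compile(r"^\d{1,2}:\d{2}(am|pm)?$")),     # assumed h:mm form
    ("float", re.compile(r"^\d+\.\d+$")),
    ("integer", re.compile(r"^\d+$")),
    ("word", re.compile(r"^[A-Za-z]+$")),
]

_TOKEN = re.compile(
    r"\d+/\d+/\d+|\d+:\d+(?:am|pm)?|\d+\.\d+|\d+|[A-Za-z]+|[^\sA-Za-z0-9]"
)

def token_type(tok):
    """Return the content-feature type of one token."""
    for name, pat in _PATTERNS:
        if pat.match(tok):
            return name
    return "punctuation"

def content_features(text):
    """Tokenize a label or value string and type each token."""
    return [token_type(t) for t in _TOKEN.findall(text)]

for example in ["Our Price: $34.07", "ISBN: 012569586161", "UPC# 014633147841"]:
    print(example, "->", content_features(example))
```

Typed token sequences like these let fields with different surface values (for example, two different prices) still look alike to a similarity measure.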