An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web. Wensheng Wu (1), Clement Yu (2), AnHai Doan (1), Weiyi Meng (3). (1) University of Illinois at Urbana-Champaign; (2) University of Illinois at Chicago; (3) SUNY at Binghamton. June 2004, Paris, France.
Access Deep Web Sources united.com airtravel.com delta.com hotwire.com
Global Query Interface united.com airtravel.com delta.com hotwire.com
Constructing Global Query Interface • A unified query interface with these desired features: • Conciseness – Combine semantically similar fields over source interfaces • Completeness – Retain source-specific fields • User-friendliness – Highly related fields are close together • Two-phase integration • Interface Matching – Identify semantically similar fields • Interface Integration – Merge the source query interfaces
Interface Matching – Challenges • Field A in one interface may be semantically similar to field B in another interface even though the two appear to have nothing in common • If sim(A,B) = sim(A,C), which field should A match?
Interface Matching – Challenges (Cont’d) • 1:m mappings • How should the matching threshold be determined?
Common Limitations of Existing Approaches • Limitation 1: Non-hierarchical modeling • Limitation 2: Do not handle 1:m mappings, or handle them with low accuracy • Limitation 3: Do not allow limited user interactions • Detailed comparisons are given in the paper …
The IceQ Approach [SIGMOD-04] • Hierarchical modeling • Let’s get out of “flat” land • “Greedy” is good • Always start with the most confident matching (figure: of two candidate matches scored 0.8 and 0.5, the 0.8 match is picked) • Bridging effect • “a2” and “c2” might not look similar themselves, but they might both be similar to “b3” • 1:m mappings • Aggregate and is-a types • User interaction helps in: • Interactive learning of matching threshold • Resolution of uncertain mappings
Hierarchical Modeling • Each source query interface is represented as an ordered tree • Captures: ordering and grouping of fields
Field Similarity Function • Each field may have a label, a name and a set of values • Evaluate the similarity sim(A,B) between two fields, A and B, based on: • Linguistic similarity: label similarity, name similarity and name vs. label similarity, each measured by the Cosine function • Domain similarity: domain type similarity and domain value similarity
Find 1:1 Mappings via Clustering Interfaces: Initial similarity matrix: (Threshold = .3) After one merge: …, final clusters: {{a1,b1,c1}, {b2,c2},{a2},{b3}}
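A greedy clustering pass of this kind can be sketched as below. The similarity values are invented for illustration (the slide's matrix is not reproduced here), and two details are assumptions rather than statements about IceQ: cluster similarity is taken as the maximum pairwise field similarity, and two clusters holding fields from the same interface are never merged.

```python
def interface_of(field):
    """Assumed naming scheme: 'a1' belongs to interface 'a'."""
    return field[0]

def cluster_fields(fields, sim, threshold):
    """Repeatedly merge the most similar pair of clusters above the threshold."""
    clusters = [frozenset([f]) for f in fields]

    def pair_sim(x, y):
        return sim.get((x, y), sim.get((y, x), 0.0))

    def cluster_sim(c1, c2):
        # A 1:1 cluster may hold at most one field per interface.
        if {interface_of(f) for f in c1} & {interface_of(f) for f in c2}:
            return 0.0
        return max(pair_sim(x, y) for x in c1 for y in c2)

    while True:
        best, best_pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_sim(clusters[i], clusters[j])
                if s > best:
                    best, best_pair = s, (i, j)
        if best_pair is None:
            return set(clusters)
        i, j = best_pair
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

# Hypothetical similarities; unlisted pairs count as 0.
sim = {("a1", "b1"): 0.8, ("b1", "c1"): 0.6, ("a1", "c1"): 0.1,
       ("b2", "c2"): 0.5, ("a2", "b3"): 0.2}
fields = ["a1", "a2", "b1", "b2", "b3", "c1", "c2"]
result = cluster_fields(fields, sim, threshold=0.3)
print(sorted(sorted(c) for c in result))
```

With these made-up numbers the run ends in the same shape as the slide's final clusters: a1 and b1 merge first, c1 joins through its similarity to b1 even though sim(a1, c1) is below the threshold, and a2 and b3 stay separate because 0.2 is below .3.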
“Bridging” Effect • Observations: • It is difficult to match the “vehicle” field, A, with the “make” field, B • But A’s instances are similar to C’s, and C’s label is similar to B’s • Thus, C might serve as a “bridge” to connect A and B!
“Bridging” Effect (Cont’d) • Connections might also be made via labels (figure: fields from airtravel.com, hotfares.com and airtickets.com)
Field Ordering-based Tie Resolution • Figure: fields A1, A2 in one interface and B1, B2 in another, with every similarity equal to 0.35 • Question: sim(A1, B1) = sim(A1, B2), which one should A1 match? • Observation: the ordering of fields conveys semantics!
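One simple way to use field order to break such a tie is sketched below. The rule shown here (prefer the candidate whose position in its interface is closest to the position of the field being matched) is an assumption made for illustration; the slide only states that ordering conveys semantics. The name `resolve_tie` and the `position` table are hypothetical.

```python
def resolve_tie(field, tied_candidates, position):
    """Among equally similar candidates, pick the one at the closest position."""
    return min(tied_candidates, key=lambda c: abs(position[c] - position[field]))

# Zero-based position of each field within its own interface.
position = {"A1": 0, "A2": 1, "B1": 0, "B2": 1}

print(resolve_tie("A1", ["B1", "B2"], position))  # A1 is first, so B1 is preferred
print(resolve_tie("A2", ["B1", "B2"], position))  # A2 is second, so B2 is preferred
```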
Complex Mappings • Aggregate type – the contents of the fields on the many side are parts of the content of the field on the one side • Commonalities – (1) field proximity, (2) parent label similarity, and (3) value characteristics
Complex Mappings (Cont’d) • Is-a type – the content of the field on the one side is the sum/union of the contents of the fields on the many side • Commonalities – (1) field proximity, (2) parent label similarity, and (3) value characteristics
Complex Mappings (Cont’d) • The final 1:m phase infers new mappings: • Preliminary 1:m phase: a1 ↔ (b1, b2) • Clustering phase: b1 ↔ c1, b2 ↔ c2 • Final 1:m phase: a1 ↔ (c1, c2)
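The inference in the final phase amounts to composing a preliminary 1:m mapping with the 1:1 clusters. A minimal sketch, assuming field names start with a one-letter interface id and using hypothetical names (`infer_final_mappings`, `interface_of`):

```python
def interface_of(field):
    """Assumed naming scheme: 'a1' belongs to interface 'a'."""
    return field[0]

def infer_final_mappings(preliminary, clusters):
    """Carry each 1:m mapping over to other interfaces through the 1:1 clusters."""
    inferred = {}
    for one, many in preliminary.items():
        per_interface = {}
        for pos, m in enumerate(many):
            cluster = next((c for c in clusters if m in c), {m})
            for other in cluster:
                iface = interface_of(other)
                if iface not in (interface_of(m), interface_of(one)):
                    per_interface.setdefault(iface, {})[pos] = other
        for iface, found in per_interface.items():
            # Only infer a mapping when every many-side field has a counterpart.
            if len(found) == len(many):
                inferred[(one, iface)] = tuple(found[i] for i in range(len(many)))
    return inferred

preliminary = {"a1": ("b1", "b2")}        # from the preliminary 1:m phase
clusters = [{"b1", "c1"}, {"b2", "c2"}]   # from the clustering phase
final = infer_final_mappings(preliminary, clusters)
print(final)
```

For the slide's example this yields a mapping from a1 to (c1, c2) in interface c.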
Active Learning of Thresholds • Observation: in an ideal situation, • if field A matches some field X, then sim(A, X) > threshold T1 • if field A does not match any field, then max{sim(A, C)} < T2 over all fields C, where T2 < T1
• Example: initial bounds B = [0, .4]; drop rule: 50%
• List 1: .91 .8 .73 .62 .46 .2 .03
• List 2: .62 .53 .5 .48 .46 .32 .1
• List 3: .87 .82 .6 .53 .5 .33 .28
• On List 1: (1) question on .2, answer yes, update B = [0, .2], continue on List 1; (2) question on .03, answer no, update B = [.03, .2]
• On List 2: question on .1, answer yes, update B = [.03, .1]
• On List 3: no values within B
• Threshold set to any value between .03 and .1
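The walk-through above can be replayed with a short sketch. The exact question-selection rule is an assumption: here a question is asked on a value only if it lies strictly inside the current bounds B and is at least 50% lower than the value before it in the sorted list. The `is_match` callback stands in for the user, and all names are hypothetical.

```python
def learn_threshold_bounds(lists, is_match, initial=(0.0, 0.4), drop=0.5):
    """Narrow the bounds [lo, hi] on the matching threshold by asking questions."""
    lo, hi = initial
    questions = []
    for idx, sims in enumerate(lists):
        ordered = sorted(sims, reverse=True)
        for prev, cur in zip(ordered, ordered[1:]):
            if not (lo < cur < hi):
                continue                      # value is outside the current bounds
            if (prev - cur) / prev < drop:
                continue                      # no sharp drop, so no question
            questions.append((idx, cur))
            if is_match(idx, cur):
                hi = cur                      # a true match: threshold is at most cur
            else:
                lo = cur                      # a non-match: threshold is at least cur
                break                         # nothing lower in this list can match
    return (lo, hi), questions

lists = [
    [.91, .8, .73, .62, .46, .2, .03],
    [.62, .53, .5, .48, .46, .32, .1],
    [.87, .82, .6, .53, .5, .33, .28],
]
# Simulated user answers taken from the slide's example.
answers = {(0, .2): True, (0, .03): False, (1, .1): True}
bounds, asked = learn_threshold_bounds(lists, lambda i, v: answers[(i, v)])
print(bounds, asked)
```

Under these assumptions the sketch asks the same three questions as the slide and ends with the same bounds, from .03 to .1.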
Interactive Resolution of Uncertain Mappings • Resolve potential homonyms • Observation: two fields are possible homonyms if their labels are highly similar while their domains are not • Determine potential synonyms • Observation: two fields might still be similar if there are common values in their domains, even if their label/domain similarities are low
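Both observations translate into simple checks that flag field pairs for a user question. The sketch below is illustrative only: the cutoff values `high` and `low`, the function names, and the sample values are assumptions, not figures from the paper.

```python
def is_potential_homonym(label_sim, domain_sim, high=0.8, low=0.2):
    """Labels look alike but domains do not: ask the user before matching."""
    return label_sim >= high and domain_sim <= low

def is_potential_synonym(values_a, values_b, overall_sim, threshold):
    """Similarity is below the threshold, yet the domains share some values."""
    return overall_sim < threshold and bool(set(values_a) & set(values_b))

# Hypothetical inputs for illustration.
print(is_potential_homonym(label_sim=0.9, domain_sim=0.1))
print(is_potential_synonym(["economy", "business"], ["business", "first"],
                           overall_sim=0.15, threshold=0.3))
```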
Interactive Resolution of Uncertain Mappings (Cont’d) • Determine potential 1:m mappings • Observation: A might still match both B and C if (a) sim(A,B) is very close to sim(A,C); (b) B and C are adjacent; and (c) A is the only field in its interface that satisfies (a) and (b)
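Conditions (a) to (c) can be checked mechanically to produce candidate questions for the user. In this sketch, `epsilon` (how close the two similarities must be) and `floor` (a minimum similarity, so that two zero scores do not count as "close") are assumed parameters, condition (c) is interpreted per adjacent pair, and the field names and scores are hypothetical.

```python
def candidate_one_to_many(fields_a, fields_b, sim, epsilon=0.05, floor=0.1):
    """Return (A, (B, C)) candidates; fields_b must be in interface order."""
    hits = {}
    for b, c in zip(fields_b, fields_b[1:]):          # (b) adjacent fields only
        for a in fields_a:
            s_b, s_c = sim(a, b), sim(a, c)
            # (a) the two similarities are close and not negligible
            if min(s_b, s_c) >= floor and abs(s_b - s_c) <= epsilon:
                hits.setdefault((b, c), []).append(a)
    # (c) keep a pair only if exactly one field of the other interface qualifies
    return [(owners[0], pair) for pair, owners in hits.items() if len(owners) == 1]

scores = {("passengers", "adults"): 0.42, ("passengers", "children"): 0.40,
          ("class", "cabin"): 0.70}

def sim(x, y):
    return scores.get((x, y), 0.0)

candidates = candidate_one_to_many(["passengers", "class"],
                                   ["adults", "children", "cabin"], sim)
print(candidates)
```

Each returned candidate would be shown to the user for confirmation rather than accepted automatically.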
Empirical Evaluations • Accuracy with all user interactions • Accuracy with learned thresholds • Automatic field matching • Distribution of questions
Comparison of Component Contributions 7.3% 15.4% On average, 12.6% increase in recall
Summary • High accuracy of determining matching fields across multiple user interfaces • Limited use of user interactions
Future Research • Further improve the accuracy of determining matching fields • Decrease the number of user interactions • Produce a unified, user-friendly interface • Provide such a tool on the Web