
An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Wensheng Wu (1), Clement Yu (2), AnHai Doan (1), Weiyi Meng (3). (1) University of Illinois at Urbana-Champaign, (2) University of Illinois at Chicago, (3) SUNY at Binghamton. June 2004, Paris, France.


Presentation Transcript


  1. An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web. Wensheng Wu (1), Clement Yu (2), AnHai Doan (1), Weiyi Meng (3). (1) University of Illinois at Urbana-Champaign, (2) University of Illinois at Chicago, (3) SUNY at Binghamton. June 2004, Paris, France

  2. Access Deep Web Sources: united.com, airtravel.com, delta.com, hotwire.com

  3. Global Query Interface: united.com, airtravel.com, delta.com, hotwire.com

  4. Constructing Global Query Interface • A unified query interface with these desired features: • Conciseness – Combine semantically similar fields over source interfaces • Completeness – Retain source-specific fields • User-friendliness – Highly related fields are close together • Two-phase integration: • Interface Matching – Identify semantically similar fields • Interface Integration – Merge the source query interfaces

  5. Interface Matching – Challenges • Field A in one interface may be semantically similar to field B in another interface even though the two have nothing in common (example figure not preserved) • When sim(A,B) = sim(A,C), which field should A match? (example figure not preserved)

  6. Interface Matching – Challenges (Cont’d) • 1:m mappings (example figure not preserved) • How should the matching threshold be determined?

  7. Common Limitations of Existing Approaches • Limitation 1: Non-hierarchical modeling • Limitation 2: Do not handle 1:m mappings, or handle them with low accuracy • Limitation 3: Do not allow limited user interactions • Detailed comparisons are given in the paper

  8. The IceQ Approach [SIGMOD-04] • Hierarchical modeling • Let’s be out of “flat” land • “Greedy” is good • Always start with the most confident matching (figure residue: candidate scores 0.8 and 0.5, with “Pick this!” marking the chosen match) • Bridging effect • “a2” and “c2” might not look similar themselves, but they might both be similar to “b3” • 1:m mappings • Aggregate and is-a types • User interaction helps in: • Interactive learning of the matching threshold • Resolution of uncertain mappings

  9. Hierarchical Modeling • Each source query interface is shown with its ordered tree representation (figure not preserved) • Captures: ordering and grouping of fields

  10. Field Similarity Function • Each field may have a label, a name and a set of values (example figure not preserved) • Evaluate the similarity sim(A,B) between two fields, A and B, based on: • Linguistic similarity: label similarity, name similarity and name vs. label similarity, each measured by the Cosine function • Domain similarity: domain type similarity and domain value similarity
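A minimal sketch of how the similarity function on slide 10 could be computed. The slide names the ingredients (cosine-based linguistic similarity over labels and names, plus domain similarity) but not how they are combined, so the whitespace tokenization, the use of `max` over the linguistic scores, the Jaccard overlap for domain values, and the weights `w_ling` and `w_dom` are all assumptions made here for illustration. Domain type similarity is left out.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between the bags of lowercase words in two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def domain_value_similarity(vals_a, vals_b) -> float:
    """Assumed stand-in for domain value similarity: Jaccard overlap of values."""
    sa = {v.lower() for v in vals_a}
    sb = {v.lower() for v in vals_b}
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


def field_similarity(f, g, w_ling=0.6, w_dom=0.4) -> float:
    """Weighted sum of linguistic and domain similarity (weights are hypothetical)."""
    ling = max(
        cosine(f["label"], g["label"]),  # label vs. label
        cosine(f["name"], g["name"]),    # name vs. name
        cosine(f["label"], g["name"]),   # label vs. name
        cosine(f["name"], g["label"]),   # name vs. label
    )
    dom = domain_value_similarity(f["values"], g["values"])
    return w_ling * ling + w_dom * dom


# Hypothetical fields from two airfare interfaces.
a = {"label": "Departure city", "name": "from", "values": []}
b = {"label": "Leaving from", "name": "depcity", "values": []}
score = field_similarity(a, b)
```

Here the only linguistic evidence is the word "from" shared between the name of one field and the label of the other, which is the kind of name vs. label match the slide lists.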

  11. Find 1:1 Mappings via Clustering • Interfaces and initial similarity matrix (threshold = .3): figure not preserved • After one merge: … • Final clusters: {{a1,b1,c1}, {b2,c2}, {a2}, {b3}}
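The clustering on slide 11 can be sketched as a greedy agglomerative loop that always merges the most similar pair of clusters first. The similarity matrix from the slide did not survive, so the values below are hypothetical and were chosen only so that the result matches the final clusters listed on the slide. Two further assumptions: cluster similarity is taken as the maximum field-to-field similarity, and two clusters are never merged if they contain fields from the same interface (the interface is read from the first character of the field name).

```python
from itertools import combinations


def greedy_cluster(fields, sim, threshold=0.3):
    """Repeatedly merge the most similar compatible pair of clusters."""
    clusters = [frozenset([f]) for f in fields]

    def link(c1, c2):
        # Assumed linkage: best similarity between any two member fields.
        return max(sim.get(frozenset((x, y)), 0.0) for x in c1 for y in c2)

    def compatible(c1, c2):
        # A 1:1 cluster holds at most one field per interface.
        return not ({x[0] for x in c1} & {y[0] for y in c2})

    while True:
        best = None
        for c1, c2 in combinations(clusters, 2):
            if not compatible(c1, c2):
                continue
            s = link(c1, c2)
            if s > threshold and (best is None or s > best[0]):
                best = (s, c1, c2)
        if best is None:
            return set(clusters)
        _, c1, c2 = best
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]


fields = ["a1", "a2", "b1", "b2", "b3", "c1", "c2"]
# Hypothetical similarities; unlisted pairs count as 0.
sim = {
    frozenset(("a1", "b1")): 0.9,
    frozenset(("b1", "c1")): 0.8,
    frozenset(("a1", "c1")): 0.7,
    frozenset(("b2", "c2")): 0.6,
    frozenset(("a2", "b3")): 0.2,
    frozenset(("a2", "c2")): 0.1,
}
clusters = greedy_cluster(fields, sim)
```

With these values, a2 and b3 stay in singleton clusters because none of their similarities exceeds the .3 threshold.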

  12. “Bridging” Effect (figure: fields A, B and C, with the match between A and B in question) • Observations: • It is difficult to match the “vehicle” field, A, with the “make” field, B • But A’s instances are similar to C’s, and C’s label is similar to B’s • Thus, C might serve as a “bridge” to connect A and B!

  13. “Bridging” Effect (Cont’d) • Example interfaces: airtravel.com, hotfares.com, airtickets.com (figure not preserved) • Connections might also be made via labels

  14. Field Ordering-based Tie Resolution • Figure: fields A1 and A2 in one interface, B1 and B2 in another, with four similarity values of 0.35 • Question: sim(A1, B1) = sim(A1, B2), which one should A1 match? • Observation: the ordering of fields conveys semantics!

  15. Complex Mappings • Aggregate type – the contents of the fields on the many side are parts of the content of the field on the one side • Commonalities – (1) field proximity, (2) parent label similarity, and (3) value characteristics

  16. Complex Mappings (Cont’d) • Is-a type – the content of the field on the one side is the sum/union of the contents of the fields on the many side • Commonalities – (1) field proximity, (2) parent label similarity, and (3) value characteristics

  17. Complex Mappings (Cont’d) • The final 1-m phase infers new mappings: • Preliminary 1-m phase: a1 ↔ (b1, b2) • Clustering phase: b1 ↔ c1, b2 ↔ c2 • Final 1-m phase: a1 ↔ (c1, c2)
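The inference on slide 17 amounts to pushing a preliminary 1:m mapping through the 1:1 clusters. The sketch below is one way to express that, under two assumptions not stated on the slide: the interface of a field is its first character, and a new mapping is emitted only when every many-side field has exactly one counterpart in the target interface. The function name is hypothetical.

```python
def infer_final_one_to_many(preliminary, clusters):
    """Derive new 1:m mappings by following 1:1 clusters of the many-side fields."""
    cluster_of = {f: c for c in clusters for f in c}
    inferred = {}
    for one, many in preliminary.items():
        own = {one[0], many[0][0]}  # interfaces already covered by the mapping
        others = {p[0] for m in many for p in cluster_of.get(m, ())} - own
        for iface in sorted(others):
            image = []
            for m in many:
                peers = [p for p in cluster_of.get(m, ()) if p[0] == iface]
                if len(peers) != 1:
                    break  # no unambiguous counterpart, so skip this interface
                image.append(peers[0])
            else:
                inferred[(one, iface)] = tuple(image)
    return inferred


preliminary = {"a1": ("b1", "b2")}          # preliminary 1-m phase
clusters = [{"b1", "c1"}, {"b2", "c2"}]      # clustering phase
final = infer_final_one_to_many(preliminary, clusters)
```

For the slide's example this yields a single new mapping from a1 to (c1, c2) in interface c.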

  18. Active Learning of Thresholds • Observation: In an ideal situation, • if field A matches some field X, then sim(A, X) > threshold T1 • if field A does not match any field, then max{sim(A, C)} over all fields C is < T2, where T2 < T1 • Example (initial B: [0, .4]; drop rule: 50%): • List 1: .91 .8 .73 .62 .46 .2 .03 • List 2: .62 .53 .5 .48 .46 .32 .1 • List 3: .87 .82 .6 .53 .5 .33 .28 • List 1: (1) question on .2, answer yes, update B = [0, .2], continue on List 1; (2) question on .03, answer no, update B = [.03, .2] • List 2: question on .1, answer yes, update B = [.03, .1] • List 3: no values within B • Threshold set to any value between .03 and .1
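The worked example on slide 18 can be reproduced with a short loop. The slide gives only "Drop rule: 50%", so the reading used here is an assumption: a question is asked about a value when it lies within the current bounds B and is at most half of the preceding value in its descending list. A "yes" answer (the pair is a match) lowers the upper bound to that value and a "no" answer raises the lower bound. The `oracle` callable stands in for the user and is hypothetical.

```python
def learn_threshold_bounds(lists, oracle, bounds=(0.0, 0.4), drop=0.5):
    """Narrow the interval B that must contain the matching threshold."""
    lo, hi = bounds
    for sims in lists:
        ordered = sorted(sims, reverse=True)
        for prev, cur in zip(ordered, ordered[1:]):
            if not (lo <= cur <= hi):
                continue  # value outside B cannot refine the bounds
            if cur > prev * (1 - drop):
                continue  # no sharp drop from the previous value
            if oracle(cur):
                hi = cur  # a confirmed match: threshold is at most cur
            else:
                lo = cur  # a confirmed non-match: threshold is above cur
    return lo, hi


lists = [
    [.91, .8, .73, .62, .46, .2, .03],
    [.62, .53, .5, .48, .46, .32, .1],
    [.87, .82, .6, .53, .5, .33, .28],
]
answers = {.2: True, .03: False, .1: True}  # user answers from the slide
asked = []


def oracle(value):
    asked.append(value)
    return answers[value]


learned = learn_threshold_bounds(lists, oracle)
```

Under this reading the loop asks the same three questions as the slide (.2, .03, .1) and ends with B = [.03, .1].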

  19. Interactive Resolution of Uncertain Mappings • Resolve potential homonyms • Observation: Two fields are possible homonyms if their labels are highly similar while their domains are not • Determine potential synonyms • Observation: Two fields might still be similar if there are common values in their domains, even if their label/domain similarities are low
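A minimal sketch of how the two observations on slide 19 could be turned into checks that select pairs to show to the user. The cutoffs `high` and `low`, the per-pair score dictionary, and the example field names are assumptions; the slide does not say what counts as "highly similar" or "low".

```python
def flag_uncertain(pairs, high=0.8, low=0.2):
    """Split field pairs into possible homonyms and possible synonyms."""
    homonyms, synonyms = [], []
    for (a, b), s in pairs.items():
        if s["label"] >= high and s["domain"] <= low:
            # Labels look alike but the domains do not: ask before matching.
            homonyms.append((a, b))
        elif s["label"] <= low and s["domain"] <= low and s["shared_values"] > 0:
            # Little overall similarity, yet some values in common: ask as well.
            synonyms.append((a, b))
    return homonyms, synonyms


# Hypothetical pairs with precomputed scores.
pairs = {
    ("x.type", "y.type"): {"label": 1.0, "domain": 0.0, "shared_values": 0},
    ("x.class", "y.cabin"): {"label": 0.0, "domain": 0.1, "shared_values": 2},
    ("x.from", "y.from"): {"label": 1.0, "domain": 0.9, "shared_values": 40},
}
homonyms, synonyms = flag_uncertain(pairs)
```

Pairs that are clearly similar on both counts, such as the last one, are not flagged and need no question.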

  20. Interactive Resolution of Uncertain Mappings (Cont’d) • Determine potential 1:m mappings • Observation: A might still match both B and C if (a) sim(A,B) is very close to sim(A,C); (b) B and C are adjacent; and (c) A is the only field in its interface that satisfies (a) and (b)
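The three conditions on slide 20 can be checked mechanically before a question is put to the user. In the sketch below, `eps` (how close counts as "very close"), the requirement that both similarities be non-zero, and the example field names and scores are assumptions added for illustration.

```python
def one_to_many_candidates(sim, order, eps=0.05):
    """Find fields A that may map to two adjacent fields (B, C) of another interface.

    sim[a][b] is the similarity between field a of one interface and field b
    of the other; order lists the other interface's fields in display order.
    """
    hits = {}
    for b, c in zip(order, order[1:]):            # (b) B and C are adjacent
        for a, row in sim.items():
            sb, sc = row.get(b, 0.0), row.get(c, 0.0)
            if sb > 0 and sc > 0 and abs(sb - sc) <= eps:   # (a) near-equal
                hits.setdefault((b, c), []).append(a)
    # (c) keep a pair only when exactly one field qualifies for it
    return {pair: fs[0] for pair, fs in hits.items() if len(fs) == 1}


sim = {
    "passengers": {"adults": 0.31, "children": 0.30, "class": 0.0},
    "cabin": {"adults": 0.0, "children": 0.0, "class": 0.8},
}
order = ["adults", "children", "class"]
candidates = one_to_many_candidates(sim, order)
```

Each entry in the result is a candidate to confirm with the user, not an accepted mapping.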

  21. Empirical Evaluations (plots not preserved; series and panel titles: Accuracy with all user interactions; Accuracy with learned thresholds; Automatic field matching; Distribution of questions)

  22. Comparison of Component Contributions • Values shown in the figure: 7.3%, 15.4% (plot not preserved) • On average, 12.6% increase in recall

  23. Summary • High accuracy of determining matching fields across multiple user interfaces • Limited use of user interactions

  24. Future Research • Further improve the accuracy of determining matching fields • Decrease the number of user interactions • Produce a unified, user-friendly interface • Provide such a tool on the Web
