An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Wensheng Wu (1), Clement Yu (2), AnHai Doan (1), Weiyi Meng (3)
(1) University of Illinois at Urbana-Champaign, (2) University of Illinois at Chicago, (3) SUNY at Binghamton
June 2004, Paris, France

Access Deep Web Sources
• Example sources: united.com, airtravel.com, delta.com, hotwire.com

Global Query Interface
• One interface over the same sources: united.com, airtravel.com, delta.com, hotwire.com

Constructing Global Query Interface
• A unified query interface with these desired features:
  • Conciseness: combine semantically similar fields across source interfaces
  • Completeness: retain source-specific fields
  • User-friendliness: keep highly related fields close together
• Two-phase integration:
  • Interface Matching: identify semantically similar fields
  • Interface Integration: merge the source query interfaces

Interface Matching – Challenges
• Field A in one interface may be semantically similar to field B in another interface even though the two have nothing in common
• If sim(A,B) = sim(A,C), which field should A match?

Interface Matching – Challenges (Cont’d)
• 1:m mappings
• How should the matching threshold be determined?

Common Limitations of Existing Approaches
• Limitation 1: Non-hierarchical modeling
• Limitation 2: Do not handle 1:m mappings, or handle them with low accuracy
• Limitation 3: Do not allow limited user interaction
• Detailed comparisons given in paper …

The IceQ Approach [SIGMOD-04]
• Hierarchical modeling
  • Let’s get out of “flat” land
• “Greedy” is good
  • Always start with the most confident matching
• Bridging effect
  • “a2” and “c2” might not look similar themselves, but they might both be similar to “b3”
• 1:m mappings
  • Aggregate and is-a types
• User interaction helps in:
  • Interactive learning of the matching threshold
  • Resolution of uncertain mappings

Hierarchical Modeling
• Each source query interface is given an ordered tree representation
• The tree captures the ordering and grouping of fields

Field Similarity Function
• Each field may have a label, a name, and a set of values
• The similarity sim(A,B) between two fields A and B is evaluated from:
  • Linguistic similarity: label similarity, name similarity, and name vs. label similarity, each measured by the cosine function
  • Domain similarity: domain type similarity and domain value similarity
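As a rough illustration of the linguistic part of the field similarity, the sketch below computes a cosine score over bags of tokens for labels and names. The field representation (dicts with `label` and `name` token lists), the function names, and the choice to take the larger of the two cross comparisons for "name vs. label" are assumptions made for this example; the slides do not say how the components are weighted or combined, so the sketch returns them separately.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using term-frequency vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty label or name shares nothing with anything
    return dot / (norm_a * norm_b)

def linguistic_components(field_a, field_b):
    """Return the three linguistic scores named on the slide.

    Fields are hypothetical dicts with 'label' and 'name' token lists.
    Taking the max of the two cross comparisons is an assumption.
    """
    return {
        "label": cosine(field_a["label"], field_b["label"]),
        "name": cosine(field_a["name"], field_b["name"]),
        "name_vs_label": max(
            cosine(field_a["name"], field_b["label"]),
            cosine(field_a["label"], field_b["name"]),
        ),
    }

# Hypothetical fields from two airfare interfaces.
a = {"label": ["departure", "city"], "name": ["dep", "city"]}
b = {"label": ["from", "city"], "name": ["departure"]}
scores = linguistic_components(a, b)
```

Domain similarity (type and value overlap) would be computed separately and combined with these scores; that combination is left out because the slides do not specify it.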

Find 1:1 Mappings via Clustering
• [Figure: the source interfaces, the initial similarity matrix, and the matrix after one merge]
• Threshold = .3
• Final clusters: {{a1,b1,c1}, {b2,c2}, {a2}, {b3}}
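The greedy clustering idea can be sketched as follows. This is a minimal illustration, not the IceQ implementation: the similarity values are made up (the slide's matrix did not survive in the transcript), and two design choices are assumptions, namely that cluster similarity is the maximum pairwise field similarity and that clusters containing fields from the same interface are never merged.

```python
from itertools import combinations

def greedy_cluster(interface_of, sim, threshold):
    """Repeatedly merge the most similar pair of clusters until no pair
    reaches the threshold.

    interface_of: field -> interface id
    sim: frozenset({f, g}) -> similarity (missing pairs count as 0)
    """
    clusters = [frozenset([f]) for f in interface_of]

    def cluster_sim(c1, c2):
        # Assumption: a cluster holds at most one field per interface (1:1).
        if {interface_of[f] for f in c1} & {interface_of[f] for f in c2}:
            return 0.0
        # Assumption: cluster similarity is the best pairwise similarity.
        return max(sim.get(frozenset((f, g)), 0.0) for f in c1 for g in c2)

    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
        if cluster_sim(*best) < threshold:
            break  # the most confident remaining match is too weak
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return set(clusters)

# Hypothetical similarities chosen so the result matches the slide's clusters.
interface_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "b3": "B",
                "c1": "C", "c2": "C"}
sim = {
    frozenset(("a1", "b1")): 0.8, frozenset(("b1", "c1")): 0.7,
    frozenset(("a1", "c1")): 0.6, frozenset(("b2", "c2")): 0.5,
    frozenset(("a2", "c2")): 0.25, frozenset(("a2", "b3")): 0.2,
}
final_clusters = greedy_cluster(interface_of, sim, threshold=0.3)
```

Starting from the most confident pair means an early, reliable merge can constrain later ones: once b1 is clustered with a1, no other field of interface A or B can join that cluster.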

“Bridging” Effect
Observations:
- It is difficult to match the “vehicle” field, A, with the “make” field, B
- But A’s instances are similar to C’s, and C’s label is similar to B’s
- Thus, C might serve as a “bridge” to connect A and B!

“Bridging” Effect (Cont’d)
• Example interfaces: airtravel.com, hotfares.com, airtickets.com
• Connections might also be made via labels

Field Ordering-based Tie Resolution
• [Figure: fields A1, A2 in one interface and B1, B2 in another, with similarity values of 0.35]
• Question: sim(A1, B1) = sim(A1, B2), so which one should A1 match?
• Observation: the ordering of fields conveys semantics!

Complex Mappings
• Aggregate type: the contents of the fields on the many side are parts of the content of the field on the one side
• Commonalities: (1) field proximity, (2) parent label similarity, and (3) value characteristics

Complex Mappings (Cont’d)
• Is-a type: the content of the field on the one side is the sum/union of the contents of the fields on the many side
• Commonalities: (1) field proximity, (2) parent label similarity, and (3) value characteristics

Complex Mappings (Cont’d)
• The final 1:m phase infers new mappings:
  • Preliminary 1:m phase: a1 ↔ (b1, b2)
  • Clustering phase: b1 ↔ c1, b2 ↔ c2
  • Final 1:m phase: a1 ↔ (c1, c2)
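The inference in the final 1:m phase amounts to composing a preliminary 1:m mapping with the 1:1 clusters. The sketch below shows one way this could work, using the slide's example; the function name, the data layout, and the rule that every many-side field needs exactly one counterpart in the target interface are assumptions for illustration.

```python
def infer_one_to_many(one_to_many, clusters, interface_of):
    """Carry each 1:m mapping over to other interfaces through the 1:1 clusters.

    one_to_many: one-side field -> tuple of many-side fields
    clusters: list of sets of fields matched 1:1 across interfaces
    interface_of: field -> interface id
    """
    inferred = {}
    for one, many in one_to_many.items():
        already_covered = {interface_of[one]} | {interface_of[f] for f in many}
        targets = {interface_of[f] for c in clusters for f in c} - already_covered
        for target in sorted(targets):
            counterparts = []
            for m in many:
                cluster = next((c for c in clusters if m in c), set())
                match = [f for f in cluster if interface_of[f] == target]
                if len(match) != 1:
                    break  # no unique counterpart, so nothing is inferred
                counterparts.append(match[0])
            else:
                inferred[(one, target)] = tuple(counterparts)
    return inferred

# The example from the slide: a1 <-> (b1, b2), b1 <-> c1, b2 <-> c2.
interface_of = {"a1": "A", "b1": "B", "b2": "B", "c1": "C", "c2": "C"}
new_mappings = infer_one_to_many(
    {"a1": ("b1", "b2")},
    [{"b1", "c1"}, {"b2", "c2"}],
    interface_of,
)
```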

Active Learning of Thresholds
• Observation: in an ideal situation,
  • if field A matches some field X, then sim(A, X) > threshold T1
  • if field A does not match any field, then max{sim(A, C)} over all fields C is < T2, where T2 < T1
• Example (initial B: [0, .4]; drop rule: 50%):
  • List 1: .91 .8 .73 .62 .46 .2 .03
  • List 2: .62 .53 .5 .48 .46 .32 .1
  • List 3: .87 .82 .6 .53 .5 .33 .28
• List 1: (1) question on .2, answer yes, update B = [0, .2], continue on list 1; (2) question on .03, answer no, update B = [.03, .2]
• List 2: question on .1, answer yes, update B = [.03, .1]
• List 3: no values within B
• Threshold set to any value between .03 and .1
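The threshold example can be replayed with a short sketch. It encodes one plausible reading of the slide, which is an assumption: a question is asked about a similarity value when it falls by more than the drop rule (50%) relative to the previous value in its sorted list and lies strictly inside the current interval B; a "yes" (the pair is a match) lowers the upper bound, a "no" raises the lower bound. The `ask` callback stands in for the user.

```python
def learn_threshold_bounds(lists, ask, bounds=(0.0, 0.4), drop=0.5):
    """Narrow the interval B = [lo, hi] that must contain the threshold.

    lists: one list of similarity values per field
    ask: callable(value) -> True if the pair with that similarity is a match
    """
    lo, hi = bounds
    for sims in lists:
        ordered = sorted(sims, reverse=True)
        for prev, cur in zip(ordered, ordered[1:]):
            sharp_drop = cur < prev * (1 - drop)
            if sharp_drop and lo < cur < hi:
                if ask(cur):
                    hi = cur  # a match at `cur`: the threshold is no higher
                else:
                    lo = cur  # a non-match at `cur`: the threshold is no lower
    return lo, hi

# The three lists and the user's answers from the slide.
lists = [
    [.91, .8, .73, .62, .46, .2, .03],
    [.62, .53, .5, .48, .46, .32, .1],
    [.87, .82, .6, .53, .5, .33, .28],
]
answers = {.2: True, .03: False, .1: True}
asked = []

def ask(value):
    asked.append(value)
    return answers[value]

lo, hi = learn_threshold_bounds(lists, ask)
```

With these inputs the user is asked three questions, and the third list triggers none, which agrees with the slide's walkthrough.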

Interactive Resolution of Uncertain Mappings
• Resolve potential homonyms
  • Observation: two fields are possible homonyms if their labels are highly similar while their domains are not
• Determine potential synonyms
  • Observation: two fields might still be similar if there are common values in their domains, even if their label/domain similarities are low

Interactive Resolution of Uncertain Mappings (Cont’d)
• Determine potential 1:m mappings
  • Observation: A might still match with B and C if (a) sim(A,B) is very close to sim(A,C); (b) B and C are adjacent; and (c) A is the only field in its interface which satisfies (a) and (b)
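Conditions (a) to (c) for a potential 1:m mapping translate fairly directly into a check over adjacent field pairs. The sketch below is illustrative only: the tolerance `eps` for "very close", the floor `min_sim` that keeps two near-zero similarities from counting as close, and the field names are all assumptions not given in the slides.

```python
def potential_one_to_many(sim, fields_a, fields_b, eps=0.05, min_sim=0.3):
    """Return (A, B, C) triples worth asking the user about.

    sim: (field_in_a, field_in_b) -> similarity (missing pairs count as 0)
    fields_b: fields of the other interface, in their on-screen order
    """
    candidates = []
    for left, right in zip(fields_b, fields_b[1:]):       # (b) B and C adjacent
        close = []
        for a in fields_a:
            s_left = sim.get((a, left), 0.0)
            s_right = sim.get((a, right), 0.0)
            if min(s_left, s_right) >= min_sim and abs(s_left - s_right) <= eps:
                close.append(a)                           # (a) nearly equal sims
        if len(close) == 1:                               # (c) A is the only one
            candidates.append((close[0], left, right))
    return candidates

# Hypothetical example: one "passengers" field vs. "adults" and "children".
sim = {
    ("passengers", "adults"): 0.42,
    ("passengers", "children"): 0.40,
    ("from", "to"): 0.10,
}
questions = potential_one_to_many(
    sim, ["passengers", "from"], ["adults", "children", "to"]
)
```

Each returned triple would be posed to the user as a question rather than accepted automatically, in keeping with the interactive resolution described above.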

Empirical Evaluations
• Automatic field matching
• Accuracy with learned thresholds
• Accuracy with all user interactions
• Distribution of questions

Comparison of Component Contributions
• Chart labels: 7.3%, 15.4%
• On average, 12.6% increase in recall

Summary
• High accuracy in determining matching fields across multiple user interfaces
• Limited use of user interactions

Future Research
• Further improve the accuracy of determining matching fields
• Decrease the number of user interactions
• Produce a unified, user-friendly interface
• Provide such a tool on the Web
