WebIQ: Learning from the Web to Match Deep-Web Query Interfaces

Wensheng Wu, Database & Information Systems Group, University of Illinois, Urbana • Joint work with AnHai Doan & Clement Yu • ICDE, April 2006

Search Problems on the Deep Web • Example query: find round-trip flights from Chicago to New York under $500 • Sources: united.com, airtravel.com, delta.com

Solution: Build Data Integration Systems • Example query: find round-trip flights from Chicago to New York under $500 • Global query interface over united.com, delta.com, airtravel.com • Comparison shopping systems “on steroids”

Current State of Affairs • Very active in both research communities & industry • Research • multidisciplinary efforts: Database, Web, KDD & AI • 10+ research groups in US, Asia & Europe • focuses: • source discovery • schema matching & integration • query processing • data extraction • Industry • Transformic, Glenbrook Networks, WebScalers, PriceGrabber, Shopping.com, MySimon, Google, …

Key Task: Schema Matching • 1-1 match • Complex match

Schema Matching is Ubiquitous! • Fundamental problem in numerous applications • data integration • data warehousing • peer data management • ontology merging • view integration • personal information management • Schema matching across Web sources • 30+ papers generated in past few years • Washington [AAAI-03, ICDE-05], Illinois [SIGMOD-03, SIGMOD-04, ICDE-06], MSR [VLDB-04], Binghamton [VLDB-03], HKST [VLDB-04], Utah [WebDB-05], …

Schema Matching is Still Very Difficult • Must rely on properties of attributes, e.g., labels & instances • Matching attributes often have little in common • Many attributes do not even have instances! (Figure: examples of a 1-1 match and a complex match)

Matching Performance Greatly Hampered by Pervasive Lack of Attribute Instances • 28.1% to 74.6% of attributes have no instances • Extremely challenging to match these attributes • e.g., does departure city match from city or departure date? • Also difficult to match attributes with dissimilar instances • e.g., airline (with American airlines) vs. carrier (with European ones)

Our Solution: Exploit the Web • Discover instances from the Web • e.g., Chicago, New York, etc. for departure city & from city • Borrow instances from other attributes & validate via Web • e.g., check if Air Canada is an instance of carrier with the Web

Key Idea: Question-Answering from AI • Search Web via search engines, e.g., Google • … but search engines do not understand natural language questions • Idea: form extraction queries as sentences to be completed • “Trick” search engine to complete sentences with instances • Example: attribute label “departure city” gives extraction query “departure cities such as”

Key Idea: Question-Answering from AI • Search Google & obtain snippets: • e.g., “… other departure cities such as Boston, Chicago and LAX available …” • extraction query: “departure cities such as”; completion: “Boston, Chicago and LAX” • Extract instance candidates from snippets: • Boston, Chicago, LAX
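The query-and-completion idea on this slide can be sketched roughly as follows. This is a minimal illustration, not the WebIQ implementation: the function names, the naive pluralization rule, and the capitalized-word heuristic for picking out instances are all assumptions, and the snippet is passed in as a plain string rather than fetched from a search engine.

```python
import re

def extraction_query(label):
    """Turn an attribute label into a sentence-completion query.

    Assumption: a naive pluralization rule ("city" -> "cities").
    """
    plural = label[:-1] + "ies" if label.endswith("y") else label + "s"
    return f"{plural} such as"

def extract_candidates(snippet, query):
    """Return the list items that complete `query` inside `snippet`."""
    pos = snippet.lower().find(query.lower())
    if pos < 0:
        return []
    tail = snippet[pos + len(query):]
    candidates = []
    # Split the completion into items on commas and on "and" / "or".
    for item in re.split(r",|\band\b|\bor\b", tail):
        # Heuristic (assumption): an instance is a leading run of
        # capitalized words; stop at the first item that has none.
        m = re.match(r"\s*((?:[A-Z][\w.'-]*\s?)+)", item)
        if not m:
            break
        candidates.append(m.group(1).strip())
    return candidates

query = extraction_query("departure city")
snippet = "other departure cities such as Boston, Chicago and LAX available"
print(query)
print(extract_candidates(snippet, query))
```

Under these assumptions the slide's snippet yields the candidates Boston, Chicago and LAX. Real snippets would need more robust parsing.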

But Not Every Candidate is a True Instance • Reason 1: Extraction queries may not be perfect • Reason 2: Web content is inherently noisy • Example: • attribute: city • extraction query: “and other cities” • extracted candidate: 150 • So instance verification is needed

Instance Verification: Outlier Detection • Goal: Remove statistical outliers (among candidates) • Step 1: Pre-processing • recognize types of instances via pattern matching & 80% rule • types: numeric & string • discard all candidates not of the determined type • e.g., most instance candidates for city are strings, so remove 150 • Step 2: Type-specific detection • perform discordance tests • test statistics, e.g., • # of words: abnormal if a person name has more than 5 words • % of numeric characters: a US zip code contains only digits
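The two steps above might look like the sketch below. The slide does not define the 80% rule, so the code assumes it means "if at least 80% of candidates share a type, that type is the attribute's type". The word-count check is a fixed cutoff standing in for a proper discordance test, and all names are illustrative.

```python
import re

# Assumption: a candidate is "numeric" if it has at least one digit and
# consists only of digits and common numeric punctuation.
NUMERIC = re.compile(r"^(?=.*\d)[\d.,$%-]+$")

def dominant_type(candidates, threshold=0.8):
    """Return "numeric" or "string" if one type covers >= threshold of candidates."""
    if not candidates:
        return None
    numeric_share = sum(1 for c in candidates if NUMERIC.match(c)) / len(candidates)
    if numeric_share >= threshold:
        return "numeric"
    if 1.0 - numeric_share >= threshold:
        return "string"
    return None

def preprocess(candidates, threshold=0.8):
    """Step 1: discard candidates that are not of the dominant type."""
    kind = dominant_type(candidates, threshold)
    if kind is None:
        return list(candidates)
    want_numeric = kind == "numeric"
    return [c for c in candidates if bool(NUMERIC.match(c)) == want_numeric]

def drop_long_candidates(candidates, max_words=5):
    """Step 2 (simplified): flag candidates with too many words as outliers."""
    return [c for c in candidates if len(c.split()) <= max_words]

cities = ["Chicago", "Boston", "New York", "Denver", "Seattle", "150"]
print(dominant_type(cities))
print(drop_long_candidates(preprocess(cities)))
```

With the hypothetical candidate list shown, five of six candidates are strings, so the numeric candidate 150 is removed, matching the slide's city example.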

Instance Verification: Web Validation • Goal: Further semantic-level validation • Idea: Exploit co-occurrence statistics of label & instances • “Make: Honda; Model: Accord” • “a variety of makes such as Honda, Mitsubishi” • Form validation queries using validation patterns • e.g., “make Honda”, “makes such as Honda”, where “make” and “makes such as” are the validation phrases V

Instance Verification: Web Validation • Possible measure: NumHits(V+x) • e.g., NumHits(“cities such as Los Angeles”) = 26M • Potential problem: bias towards popular instances • Use PMI(V, x), point-wise mutual information: PMI(V, x) = NumHits(V+x) / (NumHits(V) * NumHits(x)) • Example: • V = “cities such as”, candidates: California, Los Angeles • NumHits(V+California) = 29 • PMI(V, Los Angeles) = 3000 * PMI(V, California)
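A small sketch of the hit-count ratio on this slide, PMI(V, x) = NumHits(V+x) / (NumHits(V) * NumHits(x)). The hit counts below are made-up placeholders, not measurements from any search engine, and the function name is illustrative.

```python
def pmi(hits_vx, hits_v, hits_x):
    """Hit-count PMI: NumHits(V+x) / (NumHits(V) * NumHits(x))."""
    if hits_v == 0 or hits_x == 0:
        return 0.0
    return hits_vx / (hits_v * hits_x)

# Hypothetical counts for V = "cities such as" (assumed values).
HITS_V = 10_000_000
candidates = {
    # candidate: (NumHits(V+x), NumHits(x))
    "Los Angeles": (2_000, 1_000_000),
    "California": (30, 5_000_000),
}

scores = {x: pmi(vx, HITS_V, hx) for x, (vx, hx) in candidates.items()}
for x, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{x}: {score:.3e}")
```

Dividing by NumHits(x) keeps a very popular term from scoring well merely because it appears on many pages.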

Validate Instances from Other Attributes • Method 1: Discover k more instances from the Web • then check whether the borrowed one is among them (e.g., Aer Lingus for Airline) • problem: very likely Aer Lingus is not among the discovered instances • Method 2: Compare its validation score with that of a known instance • problem: the score for Aer Lingus may be much lower, so how to decide? • Key observation: also compare to the scores of non-instances • e.g., Economy (with respect to Airline)

Train Validation-Based Instance Classifier • Naïve Bayes classifier with validation-based features • Validation phrases: V1 = “Airlines such as”, V2 = “Airline” • Thresholds: t1 = .45, t2 = .075 • P(C|X) ∝ P(C) P(X|C) • P(+) = P(-) = ½ • P(f1=1|+) = 3/4, P(f1=1|-) = 1/4, …
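One way to read this slide is sketched below. Each validation score is turned into a binary feature by comparing it against its threshold, then the features are combined with the Naïve Bayes rule P(C|X) ∝ P(C) P(X|C). The ">= threshold" convention, the function names, and the example score of 0.6 are assumptions. Only the f1 probabilities come from the slide, so the worked example uses f1 alone.

```python
def binarize(scores, thresholds):
    """f_i = 1 if the i-th validation score reaches its threshold (assumed rule)."""
    return [1 if s >= t else 0 for s, t in zip(scores, thresholds)]

def posterior_positive(features, prior_pos, p1_given_pos, p1_given_neg):
    """P(+ | features) under the Naive Bayes independence assumption.

    p1_given_pos[i] = P(f_i = 1 | +), p1_given_neg[i] = P(f_i = 1 | -).
    """
    pos, neg = prior_pos, 1.0 - prior_pos
    for f, p_pos, p_neg in zip(features, p1_given_pos, p1_given_neg):
        pos *= p_pos if f == 1 else 1.0 - p_pos
        neg *= p_neg if f == 1 else 1.0 - p_neg
    total = pos + neg
    return pos / total if total > 0 else 0.0

# Values from the slide: t1 = .45, P(+) = 1/2, P(f1=1|+) = 3/4, P(f1=1|-) = 1/4.
# The score 0.6 for V1 is a hypothetical input.
f = binarize([0.6], [0.45])
print(f, posterior_positive(f, 0.5, [0.75], [0.25]))
```

With equal priors and f1 = 1, the posterior is (1/2 · 3/4) / (1/2 · 3/4 + 1/2 · 1/4) = 3/4, so the borrowed candidate would be accepted as an instance.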

Validate Instances via Deep Web • Handle attributes that are difficult to validate via the Web, e.g., from • Disadvantage: ambiguity when no results are found

Architecture of Assisted Matching System • Pipeline (from figure): Source interfaces → Instance acquisition → Source interfaces with augmented instances → Interface matcher → Attribute matches

Empirical Evaluation • Five domains: • Experiments: • Baseline: IceQ [Wu et al., SIGMOD-04] • Web assistance • Performance metrics: • precision (P), recall (R), & F1 (= 2PR/(P+R))
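For reference, the three metrics can be computed from a set of predicted attribute matches and a gold-standard set as in the sketch below. The attribute pairs are invented for illustration and are not from the evaluation data.

```python
def precision_recall_f1(predicted, gold):
    """Return (P, R, F1) for predicted vs. gold-standard matches."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

# Hypothetical matches between attributes of two airfare interfaces.
gold = {("departure city", "from city"), ("airline", "carrier"),
        ("departure date", "depart on"), ("class", "cabin")}
predicted = {("departure city", "from city"), ("airline", "carrier"),
             ("departure date", "depart on"), ("class", "carrier")}

print(precision_recall_f1(predicted, gold))
```

Here three of the four predicted matches are correct and three of the four gold matches are found, so P, R and F1 all equal 0.75.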

Matching Accuracy • Web assistance boosts accuracy (F1) from 89.5 to 97.5

Overhead Analysis • Reasonable overhead: 6 to 11 minutes across domains

Conclusion • Search problems on the Deep Web are increasingly crucial! • Novel QA-based approach to learning attribute instances • Incorporation into a state-of-the-art matching system • Extensive evaluation over varied real-world domains • More details: search for Wensheng Wu on Google
