This paper discusses the challenges and solutions for schema matching in the context of deep-web query interfaces. The authors propose a method that utilizes the web to discover and validate attribute instances for better matching performance.
WebIQ: Learning from the Web to Match Deep-Web Query Interfaces Wensheng Wu Database & Information Systems Group University of Illinois, Urbana Joint work with AnHai Doan & Clement Yu ICDE, April 2006
Search Problems on the Deep Web Find round-trip flights from Chicago to New York under $500 united.com airtravel.com delta.com
Solution: Build Data Integration Systems Find round-trip flights from Chicago to New York under $500 Global query interface united.com delta.com airtravel.com comparison shopping systems “on steroids”
Current State of Affairs • Very active in both research communities & industry • Research • multidisciplinary efforts: Database, Web, KDD & AI • 10+ research groups in US, Asia & Europe • focuses: • source discovery • schema matching & integration • query processing • data extraction • Industry • Transformic, Glenbrook Networks, WebScalers, PriceGrabber, Shopping.com, MySimon, Google, …
Key Task: Schema Matching 1-1 match Complex match
Schema Matching is Ubiquitous! • Fundamental problem in numerous applications • data integration • data warehousing • peer data management • ontology merging • view integration • personal information management • Schema matching across Web sources • 30+ papers generated in past few years • Washington [AAAI-03, ICDE-05], Illinois [SIGMOD-03, SIGMOD-04, ICDE-06], MSR [VLDB-04], Binghamton [VLDB-03], HKUST [VLDB-04], Utah [WebDB-05], …
Schema Matching is Still Very Difficult • Must rely on properties of attributes, e.g., label & instances • Often there is little in common between matching attributes • Many attributes do not even have instances! 1-1 match Complex match
Matching Performance Greatly Hampered by Pervasive Lack of Attribute Instances • 28.1% to 74.6% of attributes have no instances • Extremely challenging to match these attributes • e.g., does departure city match from city or departure date? • Also difficult to match attributes with dissimilar instances • e.g., airline (with American airlines) vs. carrier (with European airlines)
Our Solution: Exploit the Web • Discover instances from the Web • e.g., Chicago, New York, etc. for departure city & from city • Borrow instances from other attributes & validate via Web • e.g., check if Air Canada is an instance of carrier with the Web
Key Idea: Question-Answering from AI • Search Web via search engines, e.g., Google • … but search engines do not understand natural language questions • Idea: form extraction queries as sentences to be completed • “Trick” search engine to complete sentences with instances • Example extraction query: “departure cities such as” attribute label: departure city
Key Idea: Question-Answering from AI • Search Google & obtain snippets: • e.g., “… other departure cities such as Boston, Chicago and LAX available …” (extraction query: “departure cities such as”; completion: “Boston, Chicago and LAX available …”) • Extract instance candidates from snippets: • e.g., Boston, Chicago, LAX
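The sentence-completion idea on the two slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the WebIQ implementation: the function names `extraction_query` and `extract_candidates` are hypothetical, the pluralization is deliberately naive, and the snippet is passed in as a string instead of being fetched from a search engine.

```python
import re

def extraction_query(label: str) -> str:
    """Turn an attribute label into a sentence to be completed.

    Assumption: naive pluralization of the last word ("city" -> "cities").
    """
    words = label.split()
    last = words[-1]
    plural = last[:-1] + "ies" if last.endswith("y") else last + "s"
    return " ".join(words[:-1] + [plural, "such as"])

def extract_candidates(snippet: str, query: str) -> list[str]:
    """Return the items that complete `query` inside `snippet`."""
    pos = snippet.lower().find(query.lower())
    if pos < 0:
        return []
    completion = snippet[pos + len(query):]
    # Keep only the text up to the first sentence-ending punctuation mark.
    completion = re.split(r"[.;:!?]", completion, maxsplit=1)[0]
    candidates = []
    for part in re.split(r",|\band\b|\bor\b", completion):
        # Keep the leading run of capitalized words: "LAX available" -> "LAX".
        match = re.match(r"\s*((?:[A-Z][\w'-]*\s?)+)", part)
        if match:
            candidates.append(match.group(1).strip())
    return candidates

query = extraction_query("departure city")
snippet = "other departure cities such as Boston, Chicago and LAX available ..."
candidates = extract_candidates(snippet, query)
```

Keeping only capitalized runs is a crude stand-in for proper noun-phrase extraction; it is one reason the candidates still need the verification step described next.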
But Not Every Candidate is a True Instance • Reason 1: Extraction queries may not be perfect • Reason 2: Web content is inherently noisy • Example: • attribute: city • extraction query: “and other cities” • extracted candidate: 150 • need to perform instance verification
Instance Verification: Outlier Detection • Goal: Remove statistical outliers (among candidates) • Step 1: Pre-processing • recognize types of instances via pattern matching & 80% rule • types: numeric & string • discard all candidates not of determined type • e.g., most instance candidates for city are strings, so remove 150 • Step 2: Type-specific detection • perform discordance tests • test statistics, e.g., • # of words: abnormal if more than 5 words in person name • % of numeric characters: US zip code contains only digits
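A minimal sketch of the two steps above, assuming the 80% rule means "at least 80% of candidates share one type". The regular expression, the function names, and the fixed word-count cutoff are illustrative assumptions; the slide's discordance tests are statistical, whereas this sketch uses a simple threshold in their place.

```python
import re

# Assumption: a "numeric" candidate is digits with optional $, sign, separators, or %.
NUMERIC = re.compile(r"^[$-]?\d[\d.,]*%?$")

def is_numeric(candidate: str) -> bool:
    return bool(NUMERIC.match(candidate.strip()))

def dominant_type(candidates, threshold=0.8):
    """Return 'numeric' or 'string' if one type reaches `threshold`, else None."""
    if not candidates:
        return None
    numeric_share = sum(is_numeric(c) for c in candidates) / len(candidates)
    if numeric_share >= threshold:
        return "numeric"
    if 1 - numeric_share >= threshold:
        return "string"
    return None

def filter_by_type(candidates, threshold=0.8):
    """Step 1: discard candidates that are not of the determined type."""
    kind = dominant_type(candidates, threshold)
    if kind is None:
        return list(candidates)  # no clear type, keep everything
    return [c for c in candidates if is_numeric(c) == (kind == "numeric")]

def word_count_outliers(candidates, max_words=5):
    """Step 2 (simplified): flag candidates with an abnormal number of words."""
    return [c for c in candidates if len(c.split()) > max_words]

city_candidates = ["Boston", "Chicago", "New York", "Denver", "Seattle", "150"]
kept = filter_by_type(city_candidates)
```

Here five of the six candidates are strings, so the type is determined to be string and `150` is dropped, mirroring the city example on the slide.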
Instance Verification: Web Validation • Goal: Further semantic-level validation • Idea: Exploit co-occurrence statistics of label & instances • “Make: Honda; Model: Accord” • “a variety of makes such as Honda, Mitsubishi” • Form validation queries using validation patterns • e.g., “make Honda”, “makes such as Honda” Validation phrase V
Instance Verification: Web Validation • Possible measure: NumHits(V+x) • e.g., NumHits(“cities such as Los Angeles”) = 26M • Potential problem: bias towards popular instances • Use PMI(V, x), point-wise mutual information: PMI(V, x) = NumHits(V+x) / (NumHits(V) * NumHits(x)) • Example: • V = “cities such as”, candidates: California, Los Angeles • NumHits(V, California) = 29 • PMI(V, Los Angeles) = 3000 * PMI(V, California)
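The PMI score, PMI(V, x) = NumHits(V+x) / (NumHits(V) * NumHits(x)), can be sketched as follows. The hit counts in the example are invented placeholders chosen only to show the normalization effect; they are not measurements, and a real system would obtain them from a search engine.

```python
def validation_query(phrase: str, candidate: str) -> str:
    """Form the validation query V+x, e.g. "cities such as Los Angeles"."""
    return f"{phrase} {candidate}"

def pmi(hits_vx: int, hits_v: int, hits_x: int) -> float:
    """PMI(V, x) = NumHits(V+x) / (NumHits(V) * NumHits(x))."""
    if hits_v == 0 or hits_x == 0:
        return 0.0  # no evidence either way
    return hits_vx / (hits_v * hits_x)

# Hypothetical counts: a very popular term vs. a rarer but better-fitting one.
hits_v = 1_000
popular = pmi(hits_vx=50, hits_v=hits_v, hits_x=1_000_000)
specific = pmi(hits_vx=40, hits_v=hits_v, hits_x=10_000)
```

With these placeholder counts the popular term has more raw co-occurrences (50 vs. 40) but a much lower PMI, which is the bias that dividing by NumHits(x) is meant to correct.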
Validate Instances from Other Attributes • Method 1: Discover k more instances from Web • then check for borrowed one (Aer Lingus for Airline) • problem: very likely Aer Lingus not among discovered instances • Method 2: Compare validation score with that of instance • problem: score for Aer Lingus may be much lower, how to decide? • Key observation: compare also to scores of non-instances • e.g., Economy (with respect to Airline)
Train Validation-Based Instance Classifier • Naïve Bayes classifier with validation-based features • Validation phrases: V1 = “Airlines such as”, V2 = “Airline” • Thresholds: t1 = .45, t2 = .075 • P(C|X) ∝ P(C) P(X|C) • P(+) = P(-) = ½, P(f1=1|+) = 3/4, P(f1=1|-) = 1/4, …
Validate Instances via Deep Web • Handle attributes that are difficult to validate via the Web, e.g., from • Disadvantage: ambiguity when no results found
Architecture of Assisted Matching System • Source interfaces → Instance acquisition → Source interfaces with augmented instances → Interface matcher → Attribute matches
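A minimal sketch of such a classifier, assuming each feature f_i is 1 when the validation score for phrase V_i reaches threshold t_i and 0 otherwise. The training examples below are hypothetical, the function names are illustrative, and Laplace smoothing is an assumption of this sketch; the slide does not say how the probabilities were estimated.

```python
def binarize(scores, thresholds):
    """f_i = 1 if the validation score for phrase V_i reaches threshold t_i."""
    return tuple(int(s >= t) for s, t in zip(scores, thresholds))

def train(examples, thresholds):
    """examples: list of (scores, label) pairs with label '+' or '-'."""
    n = len(thresholds)
    counts = {"+": 0, "-": 0}
    ones = {"+": [0] * n, "-": [0] * n}
    for scores, label in examples:
        counts[label] += 1
        for i, f in enumerate(binarize(scores, thresholds)):
            ones[label][i] += f
    total = sum(counts.values())
    prior = {c: counts[c] / total for c in counts}
    # Assumption: Laplace smoothing to avoid zero probabilities.
    likelihood = {c: [(ones[c][i] + 1) / (counts[c] + 2) for i in range(n)]
                  for c in counts}
    return prior, likelihood

def classify(scores, thresholds, prior, likelihood):
    """Pick the class maximizing P(C) * prod_i P(f_i | C)."""
    feats = binarize(scores, thresholds)
    best, best_p = None, -1.0
    for c in prior:
        p = prior[c]
        for i, f in enumerate(feats):
            p *= likelihood[c][i] if f else 1 - likelihood[c][i]
        if p > best_p:
            best, best_p = c, p
    return best

thresholds = (0.45, 0.075)  # t1, t2 from the slide
examples = [  # hypothetical (score_V1, score_V2) pairs, not real data
    ((0.9, 0.2), "+"), ((0.6, 0.1), "+"), ((0.5, 0.05), "+"), ((0.3, 0.09), "+"),
    ((0.1, 0.01), "-"), ((0.2, 0.02), "-"), ((0.5, 0.03), "-"), ((0.05, 0.08), "-"),
]
prior, likelihood = train(examples, thresholds)
```

Training on both instances and non-instances (e.g., Economy with respect to Airline) is what lets the classifier decide whether a borrowed value with a low score still looks more like an instance than not.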
Empirical Evaluation • Five domains: • Experiments: • Baseline: IceQ [Wu et al., SIGMOD-04] • Web assistance • Performance metrics: • precision (P), recall (R), & F1 (= 2PR/(P+R))
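For reference, the three metrics can be computed from a set of predicted matches and a set of gold-standard matches as sketched below. The attribute pairs are made up for illustration and are not taken from the evaluation.

```python
def precision_recall_f1(predicted: set, gold: set):
    """Return (P, R, F1) with F1 = 2PR / (P + R)."""
    true_positives = len(predicted & gold)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical matches between two flight-search interfaces.
gold = {("from", "departure city"), ("to", "arrival city"), ("carrier", "airline")}
predicted = {("from", "departure city"), ("to", "arrival city"), ("airline", "class")}
p, r, f1 = precision_recall_f1(predicted, gold)
```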
Matching Accuracy • Web assistance boosts accuracy (F1) from 89.5 to 97.5
Overhead Analysis • Reasonable overhead: 6~11 minutes across domains
Conclusion • Search problems on the Deep Web are increasingly crucial! • Novel QA-based approach to learning attribute instances • Incorporation into a state-of-the-art matching system • Extensive evaluation over varied real-world domains More details: Wensheng Wu on Google