Presented by: Bruce Vincent CSE-718 Seminar April 25, 2008

Light-weight Domain-based Form Assistant: Querying Web Databases On The Fly • Authors: Z. Zhang, B. He, K. C.-C. Chang (Univ. of Illinois at Urbana-Champaign) • Published in: Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005

Outline • Overview • Problem Description, Motivating Example • System Architecture • Design Approaches • Query Modeling and Translation • Dynamic Predicate Mapping • Implementation - Form Assistant Toolkit • Experiments • Related Work

Problem Description • “Deep Web” • Estimated to contain 450,000 online databases (2004) • Sometimes referred to as “Invisible Web” or “Hidden Web” • Much of this is accessible only by query forms instead of static URL links • Common domains such as: books, cars, airfares

Problem Description • Often it can be useful to query multiple alternative sources in the same domain • Automation of this entails several components • One key component is dynamic query translation • Software toolkit “Form Assistant” designed to provide potential translations of user queries for alternative sources • e.g., User-entered Amazon form query automatically translated to potential Barnes & Noble form query

Problem Description • Goals of query translator: • Source-generality • Built-in translation must generally cope with new or “unseen” sources • Domain-portability • Translator must be easily customizable with domain-specific knowledge, and thus deployable for new domains

Motivating Example • Source query Qs on source form S (e.g., Amazon) • Target query form T (e.g., Barnes & Noble)

Motivating Example • Query Translation: source query Qs on source form S is translated for target query form T • Union Query Qt*: [Figure: union (∪) of target form queries, each filled in with author “Tom Clancy”] • Filter: σ title contain “red storm” and price < 35 and age > 12

System Architecture • Inputs: source query Qs; target query form QI • Form Extractor • Form Assistant (FA): • Attribute Matcher: syntax-based schema matching (uses domain-specific thesaurus) • Predicate Mapper: type-based search-driven mapping (uses domain-specific type handlers) • Query Rewriter: constraint-based query rewriting • Output: target query Qt*

Design Approaches • Query Modeling • Vocabulary and Syntax • Query Translation • Dynamic Predicate Mapping

Query Modeling • Vocabulary • Predicate templates: { P1, P2, P3, P4, P5 } • Example: [Figure: query form with its fields labeled by predicate templates P1 to P5]

Query Modeling • Example Vocabulary (predicate templates) • P1 = [author; contain; $au] • P2 = [title; contain; $ti] • P3 = [subject; contain; $su] • P4 = [isbn; contain; $isbn] • P5 = [price; between; $s, $e] • Example Syntax (valid conjunctive forms) • F1 = P1 ∧ P5 • F2 = P2 ∧ P5 • F3 = P3 ∧ P5 • F4 = P4 ∧ P5 • F5 = P1 • F6 = P2 • F7 = P3 • F8 = P4

Query Modeling • Example Vocabulary Instantiations • p1 = [author; contain; Tom Clancy] • p2 = [title; contain; red storm] • p51 = [price; between; 0-25] • p52 = [price; between; 25-45] • Corresponding Form Queries: • f1 = p1 ∧ p51 • f2 = p1 ∧ p52 • Resultant Union Query: • Qt = f1 ∨ f2
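The vocabulary and syntax above can be sketched as plain data. This is a minimal illustration, not the Form Assistant's actual data model: the names `Predicate`, `TEMPLATES`, `VALID_FORMS`, `template_of`, and `is_valid_form` are hypothetical, and a form query is assumed to be a list of instantiated predicates combined by conjunction.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Predicate:
    """An instantiated predicate [attr; op; value] (hypothetical representation)."""
    attr: str
    op: str
    value: Tuple


# Vocabulary from the slide: template name -> (attribute, operator).
TEMPLATES = {
    "P1": ("author", "contain"),
    "P2": ("title", "contain"),
    "P3": ("subject", "contain"),
    "P4": ("isbn", "contain"),
    "P5": ("price", "between"),
}

# Syntax from the slide: valid conjunctive forms F1..F8 as sets of template names.
VALID_FORMS = [
    frozenset({"P1", "P5"}), frozenset({"P2", "P5"}),
    frozenset({"P3", "P5"}), frozenset({"P4", "P5"}),
    frozenset({"P1"}), frozenset({"P2"}),
    frozenset({"P3"}), frozenset({"P4"}),
]


def template_of(pred):
    """Return the name of the template that pred instantiates, or None."""
    for name, (attr, op) in TEMPLATES.items():
        if (attr, op) == (pred.attr, pred.op):
            return name
    return None


def is_valid_form(preds):
    """A conjunction is valid if its templates are distinct and form one of F1..F8."""
    names = [template_of(p) for p in preds]
    return (None not in names
            and len(set(names)) == len(names)
            and frozenset(names) in VALID_FORMS)


# Instantiations from the slide.
p1 = Predicate("author", "contain", ("Tom Clancy",))
p2 = Predicate("title", "contain", ("red storm",))
p51 = Predicate("price", "between", (0, 25))
p52 = Predicate("price", "between", (25, 45))

f1 = [p1, p51]   # f1 = p1 AND p51
f2 = [p1, p52]   # f2 = p1 AND p52
Qt = [f1, f2]    # union query: f1 OR f2
```

With this representation, `is_valid_form(f1)` holds because {P1, P5} is F1, while a conjunction such as `[p1, p2]` is rejected because no listed form combines author and title.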

Query Modeling • Syntax • Valid combinations of predicate templates: { F1, F2, F3, F4, F5, F6, F7, F8 } • Example (‘v’ indicates ‘valid’): [Figure: forms F1 and F2 filled in with author “Tom Clancy”]

Query Translation • Based on semantic closeness of query predicates: • Finds the minimal subsuming translation Cmin • Benefits of this approach: • No false positives • Minimizes false negatives • Has clear semantics, independent of DB content • Modular translation

Query Translation • Example (price ranges): • s: 0 to 35 • t1: 0 to 25 • t2: 25 to 45 • t3: 45 to 65 • t1 ∨ t2: 0 to 45 (Cmin) • t1 ∨ t2 ∨ t3: 0 to 65
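The range example can be worked through in a few lines. This is a sketch under stated assumptions, not the paper's algorithm: the function name `minimal_subsuming_ranges` is hypothetical, and the target ranges are assumed to be pairwise non-overlapping (apart from shared endpoints), as they are in the example. Under that assumption, every target range that overlaps the source range is needed, so picking exactly those gives the smallest cover.

```python
def minimal_subsuming_ranges(source, targets):
    """Pick the target ranges that together cover the source range.

    source:  (lo, hi) range requested by the source query
    targets: list of (lo, hi) ranges offered by the target form,
             assumed pairwise non-overlapping except at endpoints
    Returns the chosen ranges sorted by lower bound, or None if the
    target ranges cannot cover the source range.
    """
    s_lo, s_hi = source
    # Keep only ranges that overlap the source on more than a single point.
    chosen = sorted((lo, hi) for lo, hi in targets if lo < s_hi and hi > s_lo)

    # Walk left to right and make sure the chosen ranges leave no gap.
    covered_to = s_lo
    for lo, hi in chosen:
        if lo > covered_to:
            return None
        covered_to = max(covered_to, hi)
    return chosen if covered_to >= s_hi else None


targets = [(0, 25), (25, 45), (45, 65)]   # t1, t2, t3 from the slide
cmin = minimal_subsuming_ranges((0, 35), targets)
```

For the slide's values, `cmin` is t1 and t2, which together span 0 to 45. Adding t3 would still subsume the source range but would no longer be minimal.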

Query Translation • Definition: • Given source query Qs and target query form T, a query Qt* is a “minimal subsuming translation” w.r.t. T if: • 1. Qt* is a valid query w.r.t. T • 2. Qt* subsumes Qs • i.e., for any database instance Di, Qs(Di) ⊆ Qt*(Di) • 3. Qt* is minimal • i.e., there is no query Qt such that Qt satisfies (1.) and (2.) above and Qt* subsumes Qt

Query Translation • Example: • Consider source query Qs in first example and three target queries Qt1, Qt2, Qt3 • p1 = [author; contain; Tom Clancy] • p51 = [price; between; 0-25] • p52 = [price; between; 25-45] • Qt1 = (f1: p1 ∧ p51) ∨ (f2: p1 ∧ p52) • Qt2 = f2 • Qt3 = f3: p1 • Qt1 and Qt3 subsume Qs while Qt2 does not • Qt2 misses price range 0-25 • Thus it can’t be the best translation Cmin • Prune Qt3 because it subsumes Qt1 • That leaves Qt1 as Cmin
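The pruning argument can be checked mechanically on toy data. This is an illustrative sketch only: the definition quantifies over every database instance, whereas the code below tests subsumption on one small, made-up universe of tuples (`UNIVERSE`), and it simplifies Qs to the author and price conditions. The helper names `answers`, `subsumes`, and `minimal_subsuming` are hypothetical.

```python
from itertools import product

# Toy universe of book tuples (an assumption for illustration): two authors, prices 0..65.
UNIVERSE = [{"author": a, "price": p}
            for a, p in product(["Tom Clancy", "Jane Austen"], range(0, 66))]


def answers(query):
    """Indices of tuples matched by a union (list) of conjunctions (lists of predicates)."""
    return frozenset(i for i, t in enumerate(UNIVERSE)
                     if any(all(pred(t) for pred in conj) for conj in query))


def subsumes(q_big, q_small):
    """q_big subsumes q_small if it returns every tuple q_small returns (on UNIVERSE)."""
    return answers(q_small) <= answers(q_big)


def minimal_subsuming(source, candidates):
    """Names of candidates that subsume source and do not strictly subsume another such candidate."""
    keep = {name: q for name, q in candidates.items() if subsumes(q, source)}
    return [name for name, q in keep.items()
            if not any(answers(other) < answers(q) for other in keep.values())]


def p1(t):
    return "Tom Clancy" in t["author"]      # [author; contain; Tom Clancy]


def p51(t):
    return 0 <= t["price"] <= 25            # [price; between; 0-25]


def p52(t):
    return 25 <= t["price"] <= 45           # [price; between; 25-45]


def price_under_35(t):
    return t["price"] < 35                  # simplified source condition


Qs = [[p1, price_under_35]]
Qt1 = [[p1, p51], [p1, p52]]   # f1 OR f2
Qt2 = [[p1, p52]]              # f2 only
Qt3 = [[p1]]                   # f3: author only

best = minimal_subsuming(Qs, {"Qt1": Qt1, "Qt2": Qt2, "Qt3": Qt3})
```

On this universe, Qt2 is dropped because it does not subsume Qs, and Qt3 is dropped because it strictly subsumes Qt1, leaving `best` as just Qt1.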

Dynamic Predicate Mapping • Tasks: • Choose operator • Fill in values • Objective: • Minimal subsuming between source and target

Dynamic Predicate Mapping • Example: [Figure: an input source predicate and the output of predicate mapping, a union (∪) of target predicates]

System Architecture (reminder) • Inputs: source query Qs; target query form QI • Form Extractor • Form Assistant (FA): • Attribute Matcher: syntax-based schema matching (uses domain-specific thesaurus) • Predicate Mapper: type-based search-driven mapping (uses domain-specific type handlers) • Query Rewriter: constraint-based query rewriting • Output: target query Qt*

Implementation – Form Assistant Toolkit • Form Extractor • Parses HTML into query predicate templates [attr; op; val] • Details discussed in a different paper [3.] by the same research group • Attribute Matcher (1:1) • Identifies semantically corresponding attributes between forms • Customized with domain thesaurus (indexes synonyms for commonly used concepts) • Stems words (e.g., “children” -> “child”) and removes stop words (e.g., “the”) • Attributes matched by value type and synonyms • Predicate Mapper (discussed in previous slides) • Query Rewriter • Well-studied problem to find minimal subsuming query of given predicate-mapped query (uses approach of [5.] by Papakonstantinou, et al.)
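The Attribute Matcher steps (stop-word removal, stemming, thesaurus lookup, matching by value type) can be sketched as follows. Everything here is assumed for illustration: the stop-word list, the toy stemmer, the thesaurus entries, the example form labels, and the function names are hypothetical and are not taken from the Form Assistant toolkit.

```python
import re

STOP_WORDS = {"the", "of", "a", "an", "by", "for"}   # tiny illustrative list
IRREGULAR = {"children": "child"}                    # irregular stems
THESAURUS = {"writer": "author"}                     # synonym -> canonical concept


def stem(word):
    """Very rough stemmer: irregular lookup, then strip a plural 's'."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(label):
    """Lowercase, drop stop words, stem, and map synonyms to canonical concepts."""
    tokens = re.findall(r"[a-z]+", label.lower())
    stems = (stem(t) for t in tokens if t not in STOP_WORDS)
    return frozenset(THESAURUS.get(s, s) for s in stems)


def match_attributes(source, target):
    """1:1 matching: pair each source label with the unused target label of the
    same value type that shares the most normalized tokens."""
    matches, used = {}, set()
    for s_label, s_type in source.items():
        best, best_score = None, 0
        for t_label, t_type in target.items():
            if t_label in used or t_type != s_type:
                continue
            score = len(normalize(s_label) & normalize(t_label))
            if score > best_score:
                best, best_score = t_label, score
        if best is not None:
            matches[s_label] = best
            used.add(best)
    return matches


# Hypothetical form attributes: label -> value type.
source_form = {"Author": "text", "Title of the book": "text", "Price": "numeric"}
target_form = {"Writer": "text", "Book title": "text", "Price range": "numeric"}
matching = match_attributes(source_form, target_form)
```

Here “Author” pairs with “Writer” only because of the thesaurus entry, which mirrors why the slides treat the thesaurus as the domain-specific part of the matcher.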

Experiments • Datasets • 447 Deep Web sources (query forms) in 8 domains • 3 “Basic” domains – each with custom thesaurus in FA • Books, Airfares, Automobiles • 5 “New” domains (for tests, these don’t have thesaurus) • Car Rentals, Jobs, Hotels, Movies, Music/Records • Test Approach • Run the FA to translate 120 form queries • Each translation test corresponds to random pairing of sources within a domain • Count correct mappings in translation suggested by FA • Indicates amount of user effort the Form Assistant has saved

Experiments • Results: Accuracy Distributions • X: % correct predicate translations; Y: % tested query forms • Forms with all 1:1 mappings had 87% perfect accuracy for Basic dataset, 85% perfect for New dataset (good domain flexibility) • Forms having complex mapping: 76%, 70% “near perfect” (Y>80%) • FA did not attempt complex (n:m) mappings, such as a full name in source mapping to separate first and last names in target • [Figures: accuracy distribution for Basic dataset; accuracy distribution for New dataset]

Experiments • Accuracy ratio: correct results per 1:1 query • Raw: includes some forms whose input form extraction step had errors • Perfect: manually forces all correct form extractions • Avg. accuracy improves for perfectly correct extraction step: • for Basic dataset (3 domains), 90.4% improves to 96.1% • for New dataset (5 domains), 81.1% improves to 86.7%

Experiments • Example Error in Form Extraction • delta.com form has link to alternative reservation page • “One-way & multi-city reservations” • Wrongly interpreted by Form Extractor as input field label (attribute)

Experiments • Error Distribution • % of errors caused by each component: • Form Extraction: 40% • Attribute Matching: 18% • Predicate Mapping: 42% • Fewest errors are due to Attribute Matching • Most errors are due to Predicate Mapping • Cited reason for PM errors is insufficient domain knowledge • Example failure: source subject value “computer science” didn’t properly map to target subject value “programming languages” • Improvement could entail better domain-specific ontology and type handlers

Related Work • From the same research group: • Complex Matchings (n:m) • Defines “Type Recognizer” used in Form Assistant’s Attribute Matcher, and discusses complex n:m matchings not attempted by Form Assistant: • [1.] Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach. B. He, K. C.-C. Chang, and J. Han. In Proceedings of the 2004 ACM SIGKDD Conference (KDD 2004) (Full Paper), Seattle, Washington, August 2004 • MetaQuerier System • Fuller system for both exploring (to find) and integrating (to query) Deep Web databases: • [2.] Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web. K. C.-C. Chang, B. He, and Z. Zhang. In Proceedings of the Second Conference on Innovative Data Systems Research (CIDR 2005), Asilomar, California, January 2005

Related Work • From the same research group: • Form Extraction • As used by implementation of Form Assistant: • [3.] Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax. Z. Zhang, B. He, and K. C.-C. Chang. In Proceedings of the 2004 ACM SIGMOD Conference (SIGMOD 2004), Paris, France, June 2004 • 2007 thorough analysis of the Deep Web • Interesting survey of web databases and query interfaces: • [4.] Accessing the Deep Web: A Survey. B. He, M. Patel, Z. Zhang, and K. C.-C. Chang. Communications of the ACM (CACM), 50(5):94-101, May 2007 • Public Datasets • Cached real world query form web pages (used in experiments): • http://metaquerier.cs.uiuc.edu/repository/datasets/tel-8 • Additional Deep Web integration resources: • http://metaquerier.cs.uiuc.edu/repository

Related Work • Query Rewriting • As used by implementation of Form Assistant: • [5.] A Query Translation Scheme for Rapid Implementation of Wrappers. Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. Ullman. In Proceedings of the Fourth International Conference on Deductive and Object-Oriented Databases, Singapore, December 1995

Thank you!
