Light-weight Domain-based Form Assistant: Querying Web Databases On the Fly
Zhen Zhang, Bin He, and Kevin C. Chang
The Context: MetaQuerier @ UIUC. Exploring and integrating the deep Web
• Explorer (FIND sources): source discovery, source modeling, source indexing; builds a "db of dbs"
• Integrator (QUERY sources): source selection, schema integration, query mediation; exposes a unified query interface
• (Figure: example deep-Web sources Amazon.com, Cars.com, Apartments.com, 411localte.com feeding the unified query interface)
The Need: Querying alternative sources in the same domain
• Sources are proliferating in the same domain
  • A 2004 survey found that 10% of Web sites are "deep", totaling 450,000 databases on the Web
• Each query can often find many useful databases
• Different queries need different sources
• How can we query across such dynamic sources?
The Problem: Query translation on-the-fly
• Challenge: no pre-configured source-specific translation knowledge
• Requirements:
  • Within a domain: source generality
  • Across domains: domain portability
Dynamic query translation: Essential tasks
• Reconcile three levels of query heterogeneities (illustrated in the sketch below):
  • Attribute level: schema matching
  • Predicate level: predicate mapping
  • Query level: query rewriting
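As a concrete illustration of the three levels, the sketch below walks a hypothetical book search through them. The field names, the fixed price ranges, and the data shapes are assumptions made for illustration, not taken from any specific site or from the authors' code.

```python
# Illustrative data only: a hypothetical walk through the three heterogeneity
# levels for a book search. Field names and ranges are assumptions.

source_query = {"author": ("equal", "Tom Clancy"),
                "price":  ("<", 35)}

# Attribute level (schema matching): relate source attributes to target ones.
attribute_match = {"author": "writer", "price": "price range"}

# Predicate level (predicate mapping): "price < 35" has no direct counterpart
# on the target form, so it maps to a union of the form's fixed ranges.
predicate_mapping = {
    ("price", "<", 35): [("price range", "between", 0, 25),
                         ("price range", "between", 25, 45)],
}

# Query level (query rewriting): the mapped predicates are rewritten into
# queries the target form's syntax accepts, e.g. one range per submission.
target_queries = [
    {"writer": ("equal", "Tom Clancy"), "price range": ("between", 0, 25)},
    {"writer": ("equal", "Tom Clancy"), "price range": ("between", 25, 45)},
]
```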
Demo. Form Assistant to help navigate the deep Web.
Translation objective: Closest among the valid
• Input: source query Qs on source form S; target query form T
• Output: union query Qt* on T
• Two goals:
  • Syntactically valid
  • Semantically close
• Example filter: σ [title contain "red storm" and price < 35 and age > 12]
• (Figure: source and target book-search forms, with the author field filled in as "Tom Clancy")
What is valid? Each source has a query model (a small data-structure sketch follows)
• Vocabulary: predicate templates { P1, P2, P3, P4, P5 }
• Syntax: valid combinations of predicate templates { F1, F2, F3, F4, F5, F6, F7, F8 }
• (Figure: sample forms F5 and F6 built from templates P1..P5)
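A minimal data-structure sketch of such a query model, reusing the slide's P/F labels. The particular combinations listed are placeholders, since the actual forms appear only in the figure.

```python
# A query model as (vocabulary, syntax): which predicate templates exist and
# which combinations of them the form accepts. Combinations are placeholders.

QUERY_MODEL = {
    "vocabulary": {"P1", "P2", "P3", "P4", "P5"},
    "syntax": [{"P1"}, {"P2"}, {"P1", "P2"}, {"P1", "P3"},
               {"P2", "P4"}, {"P1", "P2", "P5"}, {"P3", "P5"}, {"P4", "P5"}],  # F1..F8
}

def is_valid(templates, model=QUERY_MODEL):
    """A candidate target query is valid iff its templates are drawn from the
    vocabulary and form one of the syntactically allowed combinations."""
    used = set(templates)
    return used <= model["vocabulary"] and used in model["syntax"]

print(is_valid({"P1", "P2"}))   # True: an allowed combination in this sketch
print(is_valid({"P2", "P3"}))   # False: not an allowed combination here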
What is close? Define semantic closeness
• Minimal subsuming mapping Cmin (worked sketch below):
  • Subsuming, no false negatives: miss no correct answer
  • Minimal, fewest false positives: contain the fewest extra answers
• Clear semantics: independent of database content
• Modular translation: reduces translation complexity
• Example: for source range s = [0, 35] and target templates t1 = [0, 25], t2 = [25, 45], t3 = [45, 65], the union t1 v t2 (covering [0, 45]) subsumes s with the fewest extra answers, so Cmin = t1 v t2; t1 v t2 v t3 (covering [0, 65]) also subsumes s but adds more.
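The slide's interval example can be checked mechanically. The brute-force helpers below are mine (a sketch, not the paper's algorithm); the intervals are exactly those on the slide.

```python
from itertools import combinations

def covers(union, s):
    """True if the union of closed intervals leaves no gap over s,
    i.e. the mapping misses no correct answer (subsuming)."""
    lo, hi = s
    point = lo
    for a, b in sorted(union):
        if a > point:
            return False            # gap before this interval starts
        point = max(point, b)
        if point >= hi:
            return True
    return point >= hi

def extra(union, s):
    """Total length of the union lying outside s: the extra answers.
    Overlaps are double-counted, which is fine for this sketch."""
    lo, hi = s
    return sum((b - a) - max(0, min(b, hi) - max(a, lo)) for a, b in union)

s = (0, 35)                                   # source predicate
templates = [(0, 25), (25, 45), (45, 65)]     # t1, t2, t3
subsuming = [c for r in range(1, 4)
             for c in combinations(templates, r) if covers(c, s)]
c_min = min(subsuming, key=lambda c: extra(c, s))
print(c_min)   # ((0, 25), (25, 45)): t1 v t2, as on the slide
```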
What mechanism?
• Conceptually: translate the source query to the target query by enumerating valid target queries and searching for the closest (Cmin)
• Realized as a pipeline: Attribute Matching, then Predicate Mapping, then Query Rewriting
System architecture: Modular & lightweight
• Modularized mechanism, driven by lightweight domain knowledge
• Form Extractor [ZhangHC-SIGMOD04]: parses the source query Qs and the target query form
• Attribute Matcher: syntax-based schema matching [RahmBernstein-VLDBJ01] [HeChang-SIGMOD03] [WuYDM-SIGMOD04], aided by a domain-specific thesaurus
• Predicate Mapper: type-based, search-driven mapping, aided by domain-specific type handlers
• Query Rewriter: constraint-based query rewriting [Halevy-VLDBJ01]
• Output: target query Qt*
The core challenge: Predicate mapping
• Objective: minimal subsuming mapping
• Tasks:
  • Choose the operator
  • Fill in the values
• Input: a source predicate; Output: the union of target predicates t*
Is source-specific translation applicable?
• Source-specific rules, e.g. (a runnable sketch follows):
  price < $t: if $t < 25 then [price: between: 0, 25] elseif $t < 45 then ...
• (Figure: per-source mapping tables relating templates such as adult = $t and passenger = $t)
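Reconstructed as runnable code, the slide's rule would look roughly like this; the range endpoints 25, 45, 65 are assumed for illustration only.

```python
# A hedged reconstruction of the per-source rule "price < $t": pick the
# target form's between-range that minimally subsumes price < t.
# The endpoints 25, 45, 65 are assumptions, not from the paper.

def map_price_less_than(t):
    for hi in (25, 45, 65):                  # this one form's fixed endpoints
        if t <= hi:
            return ("price", "between", 0, hi)
    return ("price", "between", 0, None)     # form offers no tighter bound

print(map_price_less_than(35))   # ('price', 'between', 0, 45)
```

Every such rule is tied to one particular pair of templates, which is exactly what the following slides argue does not scale.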
Enable source-generic predicate mapping?
• What is the right scope of translation?
• What is the right mechanism of translation?
The right scope? Surveying 150 sources for the correspondence matrix
• Correspondences occur within localities!
The right scope? Correspondence locality implies type-based translation
• Correspondences occur within localities
• Translate with a per-type handler (dispatch sketch below)
• (Figure: the Predicate Mapper runs a Type Recognizer over the source predicates and target template P, then routes to a Text, Numeric, Datetime, or Domain-specific Handler to produce the target predicate t*)
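A minimal dispatch sketch of the idea: recognize the type of a predicate's value and route it to the handler for that type. The recognition heuristics and handler bodies are placeholders, not the system's actual logic.

```python
import re
from datetime import datetime

def recognize_type(value):
    """Crude, assumed heuristics: numbers, ISO dates, otherwise text."""
    if isinstance(value, (int, float)) or re.fullmatch(r"[\d.]+", str(value)):
        return "numeric"
    try:
        datetime.strptime(str(value), "%Y-%m-%d")
        return "datetime"
    except ValueError:
        return "text"

HANDLERS = {
    "numeric":  lambda pred: ("numeric handler", pred),
    "datetime": lambda pred: ("datetime handler", pred),
    "text":     lambda pred: ("text handler", pred),
}

def map_predicate(attribute, operator, value):
    """Route one source predicate to the per-type handler that knows how to
    map it onto target templates of the same type."""
    return HANDLERS[recognize_type(value)]((attribute, operator, value))

print(map_predicate("price", "<", 35))                   # numeric handler
print(map_predicate("title", "contain", "red storm"))    # text handler
```

Because the knowledge lives per type rather than per source pair, the same handlers apply to any form in any domain that uses values of that type.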
The right mechanism? A pairwise rule-based mechanism does not scale
• Rule example: attr < $t: if $t < 25 then [attr: between: 0, 25] elseif $t < 45 then ...
• With n existing templates, adding one new template requires adding 2n rules (one per direction per existing template)
• And writing them requires knowledge of all the old templates
More extendable mechanism? Search-driven (evaluation sketch below)
• Templates of the same type (e.g. s, t, u) are evaluated over the values of the type, a "virtual database" spanning -infinity to +infinity
• The evaluation results, e.g. s = [0, 35], t1 = [0, 25], t2 = [25, 45], are then searched for the closest combination of templates
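A small sketch of evaluation over a virtual database, reusing the numeric example; the sample values are hand-picked here, so treat the whole thing as an assumption-laden illustration rather than the authors' procedure.

```python
# Evaluate source and target predicates over a "virtual database" of sample
# values of the numeric type, so closeness can be compared via answer sets
# rather than by reasoning over predicate logic.

virtual_db = [-10, 0, 10, 25, 30, 35, 40, 45, 55, 65, 100]

source  = lambda v: 0 <= v <= 35        # s  = [0, 35]
targets = {
    "t1": lambda v: 0 <= v <= 25,       # [0, 25]
    "t2": lambda v: 25 <= v <= 45,      # [25, 45]
    "t3": lambda v: 45 <= v <= 65,      # [45, 65]
}

s_answers = {v for v in virtual_db if source(v)}
answers   = {name: {v for v in virtual_db if p(v)} for name, p in targets.items()}

print(s_answers <= answers["t1"] | answers["t2"])   # True: t1 v t2 subsumes s
print(s_answers <= answers["t1"])                   # False: t1 alone misses answers
```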
Greedy search to construct the Cmin mapping (sketch below)
• Find the mapping iteratively
• In each iteration, greedily choose the template covering the most still-uncovered answers
• Example: for s = [0, 35] with t1 = [0, 25], t2 = [25, 45], t3 = [45, 65], the greedy search selects t1 and t2
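A runnable sketch of the greedy step, phrased over the answer sets computed on the previous slide. The set-cover-style helper is my framing of the iteration, not the authors' code.

```python
def greedy_cmin(s_answers, target_answers):
    """Iteratively pick the template covering the most still-uncovered source
    answers, stopping when s is fully covered or no template helps."""
    chosen, uncovered = [], set(s_answers)
    while uncovered:
        name, covered = max(target_answers.items(),
                            key=lambda kv: len(kv[1] & uncovered))
        if not covered & uncovered:      # no template covers anything new
            break
        chosen.append(name)
        uncovered -= covered
    return chosen

s_answers = {0, 10, 25, 30, 35}                  # s = [0, 35] over sample values
target_answers = {
    "t1": {0, 10, 25},                           # [0, 25]
    "t2": {25, 30, 35, 40, 45},                  # [25, 45]
    "t3": {45, 55, 65},                          # [45, 65]
}
print(greedy_cmin(s_answers, target_answers))    # ['t1', 't2'] -> t1 v t2
```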
Experiments
• Translated 120 queries in total, between randomly paired sources from 8 domains
• With a domain thesaurus but no type handler customization
• Accuracy measured as the ratio of correctly translated conditions per query
• Datasets: Basic (3 domains) and New (5 domains)
• (Figures: average accuracy per dataset; error distribution: extraction 40%, matching 18%, mapping 42%)
Conclusion
• System: a Form Assistant for querying Web databases
• Problem: dynamic query translation
• Contributions:
  • Framework: light-weight, domain-based architecture
  • Techniques: type-based, search-driven predicate mapping
Thank You! For more information: http://metaquerier.cs.uiuc.edu kcchang@cs.uiuc.edu
Experiment: Accuracy distribution
• (Figures: accuracy distributions for the Basic and New datasets)
Text handler: Search space
• Conceptually, the union of all target predicates
• Practically, a closed-world assumption limits the space
Text handler: Closeness estimation (sketch below)
• Ideally, logical reasoning
• Practically, evaluation-by-materialization: materialize the query against a "complete" database
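A minimal sketch of evaluation-by-materialization for text predicates, combining this slide with the closed-world search space of the previous one. The operators and the tiny "complete" database are assumptions for illustration.

```python
def matches(operator, field_value, keyword):
    """Assumed text operators: contain, start with, equal."""
    field, kw = field_value.lower(), keyword.lower()
    if operator == "contain":
        return kw in field
    if operator == "start with":
        return field.startswith(kw)
    if operator == "equal":
        return field == kw
    raise ValueError(operator)

# "Complete" database under a closed-world assumption: strings built from the
# words appearing in the source predicate itself.
database = ["red storm", "red", "storm", "red storm rising", "the red storm"]

source = [d for d in database if matches("contain", d, "red storm")]
target = [d for d in database if matches("start with", d, "red storm")]

# The target predicate misses "the red storm", so it does not subsume the source.
print(set(source) <= set(target))   # False
```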