Light-weight Domain-based Form Assistant: Querying Web Databases On the Fly
Zhen Zhang, Bin He, and Kevin C. Chang
The Context: MetaQuerier @ UIUC. Exploring and integrating the deep Web
• Explorer (FIND sources): source discovery, source modeling, source indexing; builds a "db of dbs"
• Integrator (QUERY sources): source selection, schema integration, query mediation; exposes a unified query interface
• (Figure: example deep-Web sources Amazon.com, Cars.com, Apartments.com, 411localte.com feeding the unified query interface)
The Need: Querying alternative sources in the same domain
• Sources are proliferating in the same domain
  • A 2004 survey found that 10% of Web sites are "deep", totaling 450,000 databases on the Web
• Each query can often find many useful databases
• Different queries need different sources
• How can we query across such dynamic sources?
The Problem: Query translation on-the-fly
• Challenge: no pre-configured source-specific translation knowledge
• Requirements:
  • Within a domain: source generality
  • Across domains: domain portability
Dynamic query translation: Essential tasks
• Reconcile three levels of query heterogeneities (illustrated in the sketch below):
  • Attribute level: schema matching
  • Predicate level: predicate mapping
  • Query level: query rewriting
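As a concrete illustration of the three levels, the sketch below walks a hypothetical book search through them. The field names, the fixed price ranges, and the data shapes are assumptions made for illustration, not taken from any specific site or from the authors' code.

```python
# Illustrative data only: a hypothetical walk through the three heterogeneity
# levels for a book search. Field names and ranges are assumptions.

source_query = {"author": ("equal", "Tom Clancy"),
                "price":  ("<", 35)}

# Attribute level (schema matching): relate source attributes to target ones.
attribute_match = {"author": "writer", "price": "price range"}

# Predicate level (predicate mapping): "price < 35" has no direct counterpart
# on the target form, so it maps to a union of the form's fixed ranges.
predicate_mapping = {
    ("price", "<", 35): [("price range", "between", 0, 25),
                         ("price range", "between", 25, 45)],
}

# Query level (query rewriting): the mapped predicates are rewritten into
# queries the target form's syntax accepts, e.g. one range per submission.
target_queries = [
    {"writer": ("equal", "Tom Clancy"), "price range": ("between", 0, 25)},
    {"writer": ("equal", "Tom Clancy"), "price range": ("between", 25, 45)},
]
```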
Demo. Form Assistant to help navigate the deep Web.
Translation objective: Closest among the valid
• Input: source query Qs on source form S; target query form T
• Output: union query Qt* on T
• Two goals:
  • Syntactically valid
  • Semantically close
• Example filter: σ [title contain "red storm" and price < 35 and age > 12]
• (Figure: source and target book-search forms, with the author field filled in as "Tom Clancy")
What is valid? Each source has a query model (a small data-structure sketch follows)
• Vocabulary: predicate templates { P1, P2, P3, P4, P5 }
• Syntax: valid combinations of predicate templates { F1, F2, F3, F4, F5, F6, F7, F8 }
• (Figure: sample forms F5 and F6 built from templates P1..P5)
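A minimal data-structure sketch of such a query model, reusing the slide's P/F labels. The particular combinations listed are placeholders, since the actual forms appear only in the figure.

```python
# A query model as (vocabulary, syntax): which predicate templates exist and
# which combinations of them the form accepts. Combinations are placeholders.

QUERY_MODEL = {
    "vocabulary": {"P1", "P2", "P3", "P4", "P5"},
    "syntax": [{"P1"}, {"P2"}, {"P1", "P2"}, {"P1", "P3"},
               {"P2", "P4"}, {"P1", "P2", "P5"}, {"P3", "P5"}, {"P4", "P5"}],  # F1..F8
}

def is_valid(templates, model=QUERY_MODEL):
    """A candidate target query is valid iff its templates are drawn from the
    vocabulary and form one of the syntactically allowed combinations."""
    used = set(templates)
    return used <= model["vocabulary"] and used in model["syntax"]

print(is_valid({"P1", "P2"}))   # True: an allowed combination in this sketch
print(is_valid({"P2", "P3"}))   # False: not an allowed combination here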
What is close? Define semantic closeness
• Minimal subsuming mapping Cmin (worked sketch below):
  • Subsuming, no false negatives: miss no correct answer
  • Minimal, fewest false positives: contain the fewest extra answers
• Clear semantics: independent of database content
• Modular translation: reduces translation complexity
• Example: for source range s = [0, 35] and target templates t1 = [0, 25], t2 = [25, 45], t3 = [45, 65], the union t1 v t2 (covering [0, 45]) subsumes s with the fewest extra answers, so Cmin = t1 v t2; t1 v t2 v t3 (covering [0, 65]) also subsumes s but adds more.
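The slide's interval example can be checked mechanically. The brute-force helpers below are mine (a sketch, not the paper's algorithm); the intervals are exactly those on the slide.

```python
from itertools import combinations

def covers(union, s):
    """True if the union of closed intervals leaves no gap over s,
    i.e. the mapping misses no correct answer (subsuming)."""
    lo, hi = s
    point = lo
    for a, b in sorted(union):
        if a > point:
            return False            # gap before this interval starts
        point = max(point, b)
        if point >= hi:
            return True
    return point >= hi

def extra(union, s):
    """Total length of the union lying outside s: the extra answers.
    Overlaps are double-counted, which is fine for this sketch."""
    lo, hi = s
    return sum((b - a) - max(0, min(b, hi) - max(a, lo)) for a, b in union)

s = (0, 35)                                   # source predicate
templates = [(0, 25), (25, 45), (45, 65)]     # t1, t2, t3
subsuming = [c for r in range(1, 4)
             for c in combinations(templates, r) if covers(c, s)]
c_min = min(subsuming, key=lambda c: extra(c, s))
print(c_min)   # ((0, 25), (25, 45)): t1 v t2, as on the slide
```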
What mechanism?
• Conceptually: translate the source query to the target query by enumerating valid target queries and searching for the closest (Cmin)
• Realized as a pipeline: Attribute Matching, then Predicate Mapping, then Query Rewriting
System architecture: Modular & lightweight
• Modularized mechanism, driven by lightweight domain knowledge
• Form Extractor [ZhangHC-SIGMOD04]: parses the source query Qs and the target query form
• Attribute Matcher: syntax-based schema matching [RahmBernstein-VLDBJ01] [HeChang-SIGMOD03] [WuYDM-SIGMOD04], aided by a domain-specific thesaurus
• Predicate Mapper: type-based, search-driven mapping, aided by domain-specific type handlers
• Query Rewriter: constraint-based query rewriting [Halevy-VLDBJ01]
• Output: target query Qt*
The core challenge: Predicate mapping
• Objective: minimal subsuming mapping
• Tasks:
  • Choose the operator
  • Fill in the values
• Input: a source predicate; Output: the union of target predicates t*
Is source-specific translation applicable?
• Source-specific rules, e.g. (a runnable sketch follows):
  price < $t: if $t < 25 then [price: between: 0, 25] elseif $t < 45 then ...
• (Figure: per-source mapping tables relating templates such as adult = $t and passenger = $t)
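Reconstructed as runnable code, the slide's rule would look roughly like this; the range endpoints 25, 45, 65 are assumed for illustration only.

```python
# A hedged reconstruction of the per-source rule "price < $t": pick the
# target form's between-range that minimally subsumes price < t.
# The endpoints 25, 45, 65 are assumptions, not from the paper.

def map_price_less_than(t):
    for hi in (25, 45, 65):                  # this one form's fixed endpoints
        if t <= hi:
            return ("price", "between", 0, hi)
    return ("price", "between", 0, None)     # form offers no tighter bound

print(map_price_less_than(35))   # ('price', 'between', 0, 45)
```

Every such rule is tied to one particular pair of templates, which is exactly what the following slides argue does not scale.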
Enable source-generic predicate mapping?
• What is the right scope of translation?
• What is the right mechanism of translation?
The right scope? Surveying 150 sources for the correspondence matrix
• Correspondences occur within localities!
The right scope? Correspondence locality implies type-based translation
• Correspondences occur within localities
• Translate with a per-type handler (dispatch sketch below)
• (Figure: the Predicate Mapper runs a Type Recognizer over the source predicates and target template P, then routes to a Text, Numeric, Datetime, or Domain-specific Handler to produce the target predicate t*)
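A minimal dispatch sketch of the idea: recognize the type of a predicate's value and route it to the handler for that type. The recognition heuristics and handler bodies are placeholders, not the system's actual logic.

```python
import re
from datetime import datetime

def recognize_type(value):
    """Crude, assumed heuristics: numbers, ISO dates, otherwise text."""
    if isinstance(value, (int, float)) or re.fullmatch(r"[\d.]+", str(value)):
        return "numeric"
    try:
        datetime.strptime(str(value), "%Y-%m-%d")
        return "datetime"
    except ValueError:
        return "text"

HANDLERS = {
    "numeric":  lambda pred: ("numeric handler", pred),
    "datetime": lambda pred: ("datetime handler", pred),
    "text":     lambda pred: ("text handler", pred),
}

def map_predicate(attribute, operator, value):
    """Route one source predicate to the per-type handler that knows how to
    map it onto target templates of the same type."""
    return HANDLERS[recognize_type(value)]((attribute, operator, value))

print(map_predicate("price", "<", 35))                   # numeric handler
print(map_predicate("title", "contain", "red storm"))    # text handler
```

Because the knowledge lives per type rather than per source pair, the same handlers apply to any form in any domain that uses values of that type.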
The right mechanism? A pairwise rule-based mechanism does not scale
• Rule example: attr < $t: if $t < 25 then [attr: between: 0, 25] elseif $t < 45 then ...
• With n existing templates, adding one new template requires adding 2n rules (one per direction per existing template)
• And writing them requires knowledge of all the old templates
More extendable mechanism? Search-driven (evaluation sketch below)
• Templates of the same type (e.g. s, t, u) are evaluated over the values of the type, a "virtual database" spanning -infinity to +infinity
• The evaluation results, e.g. s = [0, 35], t1 = [0, 25], t2 = [25, 45], are then searched for the closest combination of templates
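A small sketch of evaluation over a virtual database, reusing the numeric example; the sample values are hand-picked here, so treat the whole thing as an assumption-laden illustration rather than the authors' procedure.

```python
# Evaluate source and target predicates over a "virtual database" of sample
# values of the numeric type, so closeness can be compared via answer sets
# rather than by reasoning over predicate logic.

virtual_db = [-10, 0, 10, 25, 30, 35, 40, 45, 55, 65, 100]

source  = lambda v: 0 <= v <= 35        # s  = [0, 35]
targets = {
    "t1": lambda v: 0 <= v <= 25,       # [0, 25]
    "t2": lambda v: 25 <= v <= 45,      # [25, 45]
    "t3": lambda v: 45 <= v <= 65,      # [45, 65]
}

s_answers = {v for v in virtual_db if source(v)}
answers   = {name: {v for v in virtual_db if p(v)} for name, p in targets.items()}

print(s_answers <= answers["t1"] | answers["t2"])   # True: t1 v t2 subsumes s
print(s_answers <= answers["t1"])                   # False: t1 alone misses answers
```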
Greedy search to construct the Cmin mapping (sketch below)
• Find the mapping iteratively
• In each iteration, greedily choose the template covering the most still-uncovered answers
• Example: for s = [0, 35] with t1 = [0, 25], t2 = [25, 45], t3 = [45, 65], the greedy search selects t1 and t2
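A runnable sketch of the greedy step, phrased over the answer sets computed on the previous slide. The set-cover-style helper is my framing of the iteration, not the authors' code.

```python
def greedy_cmin(s_answers, target_answers):
    """Iteratively pick the template covering the most still-uncovered source
    answers, stopping when s is fully covered or no template helps."""
    chosen, uncovered = [], set(s_answers)
    while uncovered:
        name, covered = max(target_answers.items(),
                            key=lambda kv: len(kv[1] & uncovered))
        if not covered & uncovered:      # no template covers anything new
            break
        chosen.append(name)
        uncovered -= covered
    return chosen

s_answers = {0, 10, 25, 30, 35}                  # s = [0, 35] over sample values
target_answers = {
    "t1": {0, 10, 25},                           # [0, 25]
    "t2": {25, 30, 35, 40, 45},                  # [25, 45]
    "t3": {45, 55, 65},                          # [45, 65]
}
print(greedy_cmin(s_answers, target_answers))    # ['t1', 't2'] -> t1 v t2
```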
Experiments
• Translated 120 queries in total, between randomly paired sources from 8 domains
• With a domain thesaurus but no type handler customization
• Accuracy measured as the ratio of correctly translated conditions per query
• Datasets: Basic (3 domains) and New (5 domains)
• (Figures: average accuracy per dataset; error distribution: extraction 40%, matching 18%, mapping 42%)
Conclusion
• System: a Form Assistant for querying Web databases
• Problem: dynamic query translation
• Contributions:
  • Framework: light-weight, domain-based architecture
  • Techniques: type-based, search-driven predicate mapping
Thank You! For more information: http://metaquerier.cs.uiuc.edu kcchang@cs.uiuc.edu
Experiment: Accuracy distribution
• (Figures: accuracy distributions for the Basic and New datasets)
Text handler: Search space
• Conceptually, the union of all target predicates
• Practically, a closed-world assumption limits the space
Text handler: Closeness estimation (sketch below)
• Ideally, logical reasoning
• Practically, evaluation-by-materialization: materialize the query against a "complete" database
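A minimal sketch of evaluation-by-materialization for text predicates, combining this slide with the closed-world search space of the previous one. The operators and the tiny "complete" database are assumptions for illustration.

```python
def matches(operator, field_value, keyword):
    """Assumed text operators: contain, start with, equal."""
    field, kw = field_value.lower(), keyword.lower()
    if operator == "contain":
        return kw in field
    if operator == "start with":
        return field.startswith(kw)
    if operator == "equal":
        return field == kw
    raise ValueError(operator)

# "Complete" database under a closed-world assumption: strings built from the
# words appearing in the source predicate itself.
database = ["red storm", "red", "storm", "red storm rising", "the red storm"]

source = [d for d in database if matches("contain", d, "red storm")]
target = [d for d in database if matches("start with", d, "red storm")]

# The target predicate misses "the red storm", so it does not subsume the source.
print(set(source) <= set(target))   # False
```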