Understanding Web Query Interfaces: Best-Efforts Parsing with Hidden Syntax Zhen Zhang, Bin He and Kevin C. Chang
MetaQuerier Goals: Exploring and integrating the deep Web • FIND sources • QUERY sources • Integrator: source selection, schema integration, query mediation • Explorer: source discovery, source modeling, source indexing • The Deep Web: Databases on the Web (Cars.com, Amazon.com, 411localte.com, Apartments.com)
Problem: Source capability extraction, or query interface understanding. Examples: book sources; music sources
Form understanding: What are the essential tasks? • Output all the conditions; for each: • Group elements (into query conditions) • Tag elements with their “semantic roles”: attribute, operator, value
Demo summary: Query form understanding • Form structure • Multiple interpretations
Certainly not a trivial task: recall the “butterfly ballot” in the U.S. Election 2000. Even just grouping can be hard!
Baseline approach? The problem seems to be rather heuristic in nature… • There seem to be no clear criteria, only fuzzy heuristics • Grouping is hard; it is often n-ary • Heuristic: Group two elements if they are “close” • But … • Tagging is hard; there is no semantic labeling in HTML forms • Heuristic: Tag the closest text as the “attribute” • But … • We need many such heuristics! • Goal: A principled mechanism to encode and use the various heuristics systematically?
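To make the tagging heuristic concrete, here is a minimal Python sketch of "tag the closest text as the attribute". The `Element` class, the token kinds, and the coordinates are assumptions made up for illustration; they do not come from the slides. The layout is chosen so that the heuristic succeeds for one field and fails for the other, which is the kind of case the "But …" bullets hint at.

```python
from dataclasses import dataclass
from math import hypot

@dataclass(frozen=True)
class Element:
    kind: str   # hypothetical token kinds: "text", "textbox"
    label: str
    x: float    # centre coordinates from a rendered layout (assumed)
    y: float

def closest_text(field, elements):
    """Baseline heuristic: take the nearest text element as the attribute."""
    texts = [e for e in elements if e.kind == "text"]
    return min(texts, key=lambda t: hypot(t.x - field.x, t.y - field.y), default=None)

elements = [
    Element("text", "Author", 40, 10),
    Element("textbox", "author_box", 60, 10),
    Element("text", "Title", 40, 30),
    Element("textbox", "title_box", 60, 30),
    Element("text", "exact phrase", 62, 38),  # an operator label placed near the title box
]

author_attr = closest_text(elements[1], elements)  # picks "Author" (correct)
title_attr = closest_text(elements[3], elements)   # picks "exact phrase" (wrong)
```

The second lookup returns the operator label rather than "Title", because distance alone carries no notion of semantic role.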
Our observation: concerted structure of query interfaces (QI) • Condition patterns as building blocks • Convergence of condition patterns
Our insight: Cope with form complexity through “composition patterns.” • “Lego”-like building blocks: • Patterns of elements composed into conditions • Patterns of conditions composed into a form • So, how to realize our divide-and-conquer idea? Any computation paradigm? • Figure labels: Source Q-Form; “Lego” Building Blocks; ?; Semantic Structure
Query-form creation is guided by hidden syntax • Our hypothesis: existence of hidden syntax • Figure labels: Hidden Syntax (Grammar); Composer; Parser; Presentation (Query Interface); Semantic Structure (Query Conditions), e.g. Attr: title; Operator: title words, …; Value: string • Parsing is thus a principled mechanism for the inverse
This “language” paradigm enables a principled solution to a seemingly heuristic problem • Essential notions: Grammar and Parser • Grammar: Pattern specification • Declarative • No need to hard-code heuristics • Collective • Captures both micro and macro patterns • Parser: Pattern recognition • Global • Coherently interprets an entire query form • Systematic • Systematically assembles the building blocks
However, the hidden-syntax hypothesis itself entails challenges in its realization • Hidden syntax is only hypothetical • We must derive a grammar in its place • What should be captured in a “derived grammar”? • 2P-Grammar: Production + Preference • productions for patterns; preferences for their precedence • Derived grammar is secondary to any input • Inherently incomplete and ambiguous • What should be the machinery of a “soft parser”? • Best-effort Parser: • multiple, maximal-partial parse trees
Our Paradigm: Best-Effort Visual Language Parsing Framework • Input: HTML query form • Tokenizer (with HTML Layout Engine) • 2P Grammar: Productions, Preferences • BE-Parser: Ambiguity Resolution, Error Handling • Output: semantic structure
Grammar: Layout based • Traditional grammar (sequential, 1-D): presentation 3 * 5; E :- E * E, or E :- sequential(E, *, E) • Our grammar (layout-based, 2-D): TextCond :- [ left(TextAttr, TextVal) ∨ above(TextAttr, TextVal) ] ∧ above(TextVal, TextOp)
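The TextCond production can be read as a predicate over bounding boxes. The sketch below is one possible reading, not the authors' implementation: the `Token` class, the box coordinates, and the exact definitions of `left` and `above` (strict ordering on one axis plus overlap on the other) are assumptions.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Token:
    kind: str   # "TextAttr", "TextVal" or "TextOp"
    name: str
    x0: float   # bounding box: left, top, right, bottom (assumed pixel units)
    y0: float
    x1: float
    y1: float

def left(a, b):
    """a lies to the left of b and the two overlap vertically (assumed definition)."""
    return a.x1 <= b.x0 and a.y0 < b.y1 and b.y0 < a.y1

def above(a, b):
    """a lies above b and the two overlap horizontally (assumed definition)."""
    return a.y1 <= b.y0 and a.x0 < b.x1 and b.x0 < a.x1

def text_conds(tokens):
    """TextCond :- [left(Attr, Val) or above(Attr, Val)] and above(Val, Op)."""
    def of(kind):
        return [t for t in tokens if t.kind == kind]
    return [
        (a.name, v.name, o.name)
        for a, v, o in product(of("TextAttr"), of("TextVal"), of("TextOp"))
        if (left(a, v) or above(a, v)) and above(v, o)
    ]

tokens = [
    Token("TextAttr", "Title", 0, 0, 40, 10),
    Token("TextVal", "title_box", 50, 0, 150, 10),
    Token("TextOp", "title_ops", 50, 15, 150, 25),
    Token("TextAttr", "Author", 0, 40, 40, 50),  # too far down to match
]
conds = text_conds(tokens)
```

Only the Title label satisfies the spatial constraints here; the Author label is neither left of nor above the text box, so it is not grouped into the condition.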
Parser: Logic programming style • Traditional parsing • Scans input sequentially • Our parsing • Nonlinear input • Arbitrary constraints • Fix-point, iterative construction of parse trees from the tokenization (figure labels: Form, EnumSel, EnumRB, …)
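A fix-point construction of this kind can be sketched as a bottom-up loop that keeps applying productions until no new instance appears. Everything below is an illustrative assumption: the symbols `RBU` and `RBList`, the representation of an instance as a (symbol, covered-token-ids) pair, and the `adjacent` constraint, which stands in for a real spatial test.

```python
from itertools import product

def parse_to_fixpoint(tokens, productions):
    """Apply productions bottom-up until no new instance can be derived."""
    # An instance is (symbol, frozenset of covered token ids).
    instances = {(sym, frozenset([i])) for i, sym in enumerate(tokens)}
    changed = True
    while changed:
        changed = False
        for head, body, constraint in productions:
            pools = [[inst for inst in instances if inst[0] == sym] for sym in body]
            for combo in product(*pools):
                covers = [c[1] for c in combo]
                total = frozenset().union(*covers)
                if len(total) != sum(len(c) for c in covers):
                    continue  # components must not share tokens
                if not constraint(combo):
                    continue  # arbitrary (here: adjacency) constraint failed
                new = (head, total)
                if new not in instances:
                    instances.add(new)
                    changed = True
    return instances

def adjacent(combo):
    """Stand-in for a spatial constraint: second part starts right after the first."""
    return min(combo[1][1]) == max(combo[0][1]) + 1

productions = [
    ("RBU", ["radio", "text"], adjacent),      # a radio button with its label
    ("RBList", ["RBU"], lambda combo: True),   # a list of one unit
    ("RBList", ["RBList", "RBU"], adjacent),   # extend a list by one unit
]
instances = parse_to_fixpoint(["radio", "text", "radio", "text"], productions)
```

The loop terminates because there are finitely many (symbol, token-set) pairs; no left-to-right scan order is assumed, which is what allows nonlinear input.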
That’s not all: complications of hypothetical syntax • Hidden syntax is only hypothetical! • Grammar: ambiguous, incomplete • Parser: multiple parse trees, partial parse trees
Ambiguity • Example: TextCond: Below(Attr, Selection) vs. RButton: Left(radio, text) • Grammar: • Preferences to capture the conventional precedence • e.g., RButton ≥ TextCond • Parser: • Just-in-time pruning by preference • Multiple trees possible
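Preference-based pruning can be illustrated with a small sketch. It simplifies in two ways that should be read as assumptions: a preference is reduced to a (winner, loser) pair of symbols, and pruning runs as a separate pass rather than just-in-time during parsing. Token names are hypothetical.

```python
def prune(instances, preferences):
    """Drop any instance that shares a token with a preferred rival."""
    losers = set()
    for winner_sym, loser_sym in preferences:
        for w in instances:
            for l in instances:
                if w[0] == winner_sym and l[0] == loser_sym and w[1] & l[1]:
                    losers.add(l)
    return [inst for inst in instances if inst not in losers]

# The text "used" could label the radio button to its left (RButton)
# or act as the attribute of the selection list below it (TextCond).
instances = [
    ("RButton", frozenset({"radio_1", "text_used"})),
    ("TextCond", frozenset({"text_used", "select_price"})),
    ("TextCond", frozenset({"text_make", "select_make"})),
]
preferences = [("RButton", "TextCond")]  # RButton >= TextCond
kept = prune(instances, preferences)
```

Only the TextCond that competes for `text_used` is removed; the unrelated TextCond survives because it shares no token with an RButton.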
Incompleteness • Grammar: • Cannot capture all patterns • Parser: • Cannot interpret entire query interfaces • Interprets as much as possible • Greedily chooses the maximal parse trees • Reasoning: they look at the big picture and consider more context
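One way to read "maximal partial parse trees" is to keep every tree whose token coverage is not strictly contained in another tree's coverage. The sketch below assumes that reading and represents a tree only by its root symbol and covered token ids; both choices are illustrative, not taken from the slides.

```python
def maximal_trees(trees):
    """Keep trees whose coverage is not a strict subset of another tree's coverage."""
    return [t for t in trees if not any(t[1] < other[1] for other in trees)]

trees = [
    ("EnumRB", frozenset({0, 1})),
    ("EnumRB", frozenset({2, 3})),
    ("Form", frozenset({0, 1, 2, 3})),
    ("EnumSel", frozenset({4, 5})),
]
all_tokens = set(range(7))  # token 6 matches no pattern in the grammar

best = maximal_trees(trees)
covered = set().union(*(cover for _, cover in best))
uninterpreted = sorted(all_tokens - covered)
```

The result is partial on purpose: two maximal trees are returned and one token stays uninterpreted, rather than the whole parse failing.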
Error Handling: “Best-effort” parser can output multiple and partial parse trees • Union all the conditions interpreted by all the parse trees • Report both conflicts and missing errors • Figure labels: Form, EnumSel, EnumRB; Parsing; Union
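The union step can be sketched as a small merge function. The representation is assumed for illustration: each parse tree is flattened to a list of (condition name, tokens) pairs, a conflict is a token claimed by more than one distinct condition, and a missing error is a token claimed by none.

```python
def merge(trees, all_tokens):
    """Union the conditions of all parse trees; report conflicts and missing tokens."""
    conditions = []
    for tree in trees:
        for cond in tree:
            if cond not in conditions:
                conditions.append(cond)
    claimed = {}
    for name, toks in conditions:
        for tok in toks:
            claimed.setdefault(tok, []).append(name)
    conflicts = {tok: names for tok, names in claimed.items() if len(names) > 1}
    missing = sorted(set(all_tokens) - set(claimed))
    return conditions, conflicts, missing

# Two partial trees that agree on "author" but disagree about the text "t_title".
tree_a = [("author", ("t_author", "box_author")), ("title", ("t_title", "box_title"))]
tree_b = [("author", ("t_author", "box_author")), ("format", ("t_title", "radio_format"))]
all_tokens = ["t_author", "box_author", "t_title", "box_title", "radio_format", "t_price"]

conditions, conflicts, missing = merge([tree_a, tree_b], all_tokens)
```

Here the union holds three conditions, `t_title` is reported as a conflict between "title" and "format", and `t_price` is reported as missing.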
Experiment: How will a “global grammar” do? • Global grammar: • Derived from Basic; captures 21 patterns • 82 productions, 39 non-terminals, 16 terminals • Datasets: • Basic: 3 domains (Airfare, Autos, Books); 150 sources • NewSource: same domains, 30 sources • NewDomain: 6 new domains (Music, …), 42 sources • Random: 30 sources (from invisible-web.net) • Correctness judgment: • Number of correctly identified (grouped and tagged) conditions
Conclusion: Syntactic Parsing for Interface Understanding • Query interface understanding by syntactic parsing with hidden grammars • Insight: Exploit how semantics connects to presentation, in a syntactic way • Future work: • Constructing grammars automatically • Developing a more sophisticated preference framework • Extending the framework to other applications
Thank you! • For more information: • Online demo at the MetaQuerier project Web site: http://metaquerier.cs.uiuc.edu • We invite you to our MetaQuerier demo in the afternoon