
Understanding Web Query Interfaces: Best-Efforts Parsing with Hidden Syntax

Zhen Zhang, Bin He and Kevin C. Chang


Presentation Transcript


  1. Understanding Web Query Interfaces: Best-Efforts Parsing with Hidden Syntax. Zhen Zhang, Bin He and Kevin C. Chang

  2. MetaQuerier goals: exploring and integrating the deep Web. • FIND sources • QUERY sources • Integrator: source selection, schema integration, query mediation • Explorer: source discovery, source modeling, source indexing • The deep Web: databases on the Web (Cars.com, Amazon.com, 411localte.com, Apartments.com)

  3. Problem: source capability extraction, or query interface understanding. (Figure: example query forms from book sources and music sources.)

  4. Form understanding: what are the essential tasks? • Output all the conditions; for each condition: • Group elements (into query conditions) • Tag elements with their “semantic roles”: attribute, operator, value
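The output described on slide 4 can be pictured as one small record per condition. The following is a minimal sketch under that assumption; the class name `Condition` and the helper `describe` are hypothetical and not taken from the authors' system, and the example values are the ones shown on slide 10.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Condition:
    """One query condition: grouped form elements tagged with semantic roles."""
    attribute: str               # e.g. "title"
    operators: Tuple[str, ...]   # e.g. ("title words",)
    value: str                   # value domain, e.g. "string"


def describe(cond: Condition) -> str:
    """Render a condition as [attribute; {operators}; value]."""
    ops = ", ".join(cond.operators)
    return "[" + cond.attribute + "; {" + ops + "}; " + cond.value + "]"


# Example condition (hypothetical encoding of the title condition).
title_cond = Condition("title", ("title words",), "string")
print(describe(title_cond))
```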

  5. Demo summary: query form; understanding: form structure; multiple interpretations

  6. Certainly not a trivial task: recall the “butterfly ballot” in the 2000 U.S. election. Even just grouping can be hard!

  7. Baseline approach? The problem seems to be rather heuristic in nature… • There seem to be no clear criteria, only fuzzy heuristics • Grouping is hard; it is often n-ary • Heuristic: group two elements if they are “close” • But … • Tagging is hard; HTML forms carry no semantic labeling • Heuristic: tag the closest text as the “attribute” • But … • We need many such heuristics! • Goal: a principled mechanism to encode and use the various heuristics systematically?
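The tagging heuristic on slide 7 ("tag the closest text as the attribute") can be sketched in a few lines, which also makes it easy to see one way it might go wrong. The layout below is hypothetical: each label sits above its own field with tight vertical spacing, so the first field ends up closer to the next label than to its own, and both fields get tagged "To".

```python
import math


def closest_label(field, labels):
    """Baseline heuristic: tag the nearest text label as the field's attribute."""
    return min(labels, key=lambda lab: math.dist(field["pos"], lab["pos"]))


# Hypothetical layout (x, y in pixels): labels stacked above their fields.
labels = [{"text": "From", "pos": (0, 0)}, {"text": "To", "pos": (0, 40)}]
from_field = {"name": "from", "pos": (0, 25)}   # intended attribute: "From"
to_field = {"name": "to", "pos": (0, 65)}       # intended attribute: "To"

tags = {f["name"]: closest_label(f, labels)["text"]
        for f in (from_field, to_field)}
print(tags)
```

A fix for this one layout (for example, preferring labels above a field) would be yet another heuristic, which is the motivation the slide gives for a more principled mechanism.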

  8. Our observation: concerted structure of query interfaces (QI) • Condition patterns as building blocks • Convergence of condition patterns

  9. Our insight: cope with the complexity of forms through their “composition patterns.” • “Lego”-like building blocks: • Patterns of elements composed into conditions • Patterns of conditions composed into a form • So, how to realize our divide-and-conquer idea? Any computation paradigm? (Diagram labels: source Q-form, “Lego” building blocks, ?, semantic structure.)

  10. Query-form creation is guided by hidden syntax. Our hypothesis: existence of a hidden syntax. • Composer: guided by the hidden syntax (grammar), maps semantic structure (query conditions) to presentation (query interface) • Example condition: Attr: title; Operator: title words, …; Value: string • Parser: parsing is thus a principled mechanism for the inverse

  11. This “language” paradigm enables a principled solution to a seemingly heuristic problem. Essential notions: grammar and parser. • Grammar: pattern specification • Declarative: no need to hard-code heuristics • Collective: captures both micro and macro patterns • Parser: pattern recognition • Global: coherently interprets an entire query form • Systematic: systematically assembles the building blocks

  12. However, the hidden-syntax hypothesis itself entails challenges in its realization • Hidden syntax is only hypothetical • We must derive a grammar in its place • What should be captured in a “derived grammar”? • 2P grammar: productions + preferences • Productions for patterns; preferences for their precedence • Derived grammar is secondary to any input • Inherently incomplete and ambiguous • What should be the machinery of a “soft parser”? • Best-effort parser: multiple, maximal partial parse trees

  13. Our paradigm: best-effort visual language parsing framework. • Input: HTML query form • Components: tokenizer and HTML layout engine; 2P grammar (productions and preferences); BE-Parser; ambiguity resolution; error handling • Output: semantic structure

  14. Grammar: layout based. • Traditional grammar (sequence based, 1-D): presentation 3 * 5; grammar E :- E * E, or E :- sequential(E, *, E) • Our grammar (layout based, 2-D): TextCond :- [ left(TextAttr, TextVal) ∨ above(TextAttr, TextVal) ] ∧ above(TextVal, TextOp)
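The 2-D production on slide 14 can be read as a Boolean test over the bounding boxes of three tokens. The sketch below assumes tokens carry pixel bounding boxes; the `Token` class, the exact geometric definitions of `left` and `above`, and the coordinates are illustrative assumptions, not the authors' definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str   # e.g. "text", "textbox", "radiolist"
    x: int      # left edge
    y: int      # top edge
    w: int      # width
    h: int      # height


def left(a, b):
    """a ends before b starts horizontally, at roughly the same height."""
    return a.x + a.w <= b.x and abs(a.y - b.y) < max(a.h, b.h)


def above(a, b):
    """a ends before b starts vertically, and their columns overlap."""
    return a.y + a.h <= b.y and a.x < b.x + b.w and b.x < a.x + a.w


def is_text_cond(attr, val, op):
    """TextCond :- [left(attr, val) or above(attr, val)] and above(val, op)."""
    return (left(attr, val) or above(attr, val)) and above(val, op)


attr = Token("text", 0, 0, 50, 20)
val = Token("textbox", 60, 0, 100, 20)
op = Token("radiolist", 60, 25, 100, 20)       # directly under the textbox
far_op = Token("radiolist", 300, 25, 100, 20)  # same row, but far to the right

print(is_text_cond(attr, val, op), is_text_cond(attr, val, far_op))
```

Because the constraints are spatial rather than sequential, the same production matches whether the attribute label sits to the left of the text box or above it.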

  15. Parser: logic programming style. • Traditional parsing: scan input sequentially • Our parsing: nonlinear input, arbitrary constraints • Process: tokenization, then iterative construction to a fix-point, yielding parse trees (Diagram: Form nodes built over EnumSel and EnumRB instances.)
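The iterative construction on slide 15 can be sketched as a fix-point loop: start from the tokens, apply every production to every combination of existing instances, and stop when a full pass adds nothing new. Everything concrete here is an assumption made for illustration: the two productions (`Cond`, `Form`), the token kinds, and the grid positions used in place of real layout coordinates.

```python
from itertools import product

# An instance is (symbol, covered token ids, (row, col) anchor position).


def left(a, b):
    return a[2][0] == b[2][0] and a[2][1] < b[2][1]   # same row, a before b


def above(a, b):
    return a[2][1] == b[2][1] and a[2][0] < b[2][0]   # same column, a over b


# Hypothetical productions: head :- constraint(first symbol, second symbol).
PRODUCTIONS = [
    ("Cond", ("text", "box"), left),
    ("Form", ("Cond", "Cond"), above),
]


def parse(tokens):
    """Apply all productions repeatedly until no new instance appears."""
    instances = {(sym, frozenset([i]), pos)
                 for i, (sym, pos) in enumerate(tokens)}
    changed = True
    while changed:
        changed = False
        for head, (s1, s2), constraint in PRODUCTIONS:
            firsts = [inst for inst in instances if inst[0] == s1]
            seconds = [inst for inst in instances if inst[0] == s2]
            for a, b in product(firsts, seconds):
                if a[1] & b[1] or not constraint(a, b):
                    continue   # components overlap, or layout does not match
                new = (head, a[1] | b[1], a[2])
                if new not in instances:
                    instances.add(new)
                    changed = True
    return instances


tokens = [("text", (0, 0)), ("box", (0, 1)), ("text", (1, 0)), ("box", (1, 1))]
result = parse(tokens)
forms = [inst for inst in result if inst[0] == "Form"]
print(len(forms), sorted(forms[0][1]))
```

Unlike a left-to-right scan, the loop never assumes an input order: any pair of instances that satisfies a production's constraint can be combined, which is what lets the constraints be arbitrary spatial predicates.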

  16. That’s not all: complications of hypothetical syntax. Hidden syntax is only hypothetical! • Grammar: ambiguous, incomplete • Parser: multiple parse trees, partial parse trees

  17. Ambiguity. Example of competing patterns: TextCond: Below(Attr, Selection) vs. RButton: Left(radio, text) • Grammar: preferences to capture the conventional precedence, e.g. RButton ≥ TextCond • Parser: just-in-time pruning by preference • Multiple trees possible
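Pruning by preference, as on slide 17, can be sketched as follows: when two instances compete for the same token and the grammar prefers one symbol over the other, the dispreferred instance is dropped. The instance encoding and token names are hypothetical, and this shows only the pruning step; a just-in-time parser would presumably run such a step during construction rather than once at the end.

```python
# Hypothetical preference list: (preferred symbol, dispreferred symbol).
PREFERENCES = {("RButton", "TextCond")}


def prune(instances):
    """Drop any instance that overlaps a preferred instance on some token."""
    losers = set()
    for winner in instances:
        for loser in instances:
            if (winner[0], loser[0]) in PREFERENCES and winner[1] & loser[1]:
                losers.add(loser)
    return [inst for inst in instances if inst not in losers]


instances = [
    ("RButton", frozenset({"radio1", "text1"})),
    ("TextCond", frozenset({"text1", "select1"})),   # competes for text1
    ("TextCond", frozenset({"text2", "select2"})),   # no conflict
]
kept = prune(instances)
print([symbol for symbol, _ in kept])
```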

  18. Incompleteness • Grammar: cannot capture all patterns • Parser: cannot always interpret an entire query interface • Interpret as much as possible • Greedily choose the maximum parse trees • Reasoning: they look at the big picture and consider more context
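One reading of "choose the maximum parse trees" on slide 18 is to keep every partial tree whose token coverage is not strictly contained in another tree's coverage. The sketch below assumes that reading; the tree encoding, the root names, and the token ids are hypothetical.

```python
def maximal_trees(trees):
    """Keep trees whose coverage is not a proper subset of another's."""
    return [t for t in trees
            if not any(t["covers"] < other["covers"] for other in trees)]


tokens = set(range(1, 8))   # token ids 1..7 on a hypothetical form
trees = [
    {"root": "Form", "covers": {1, 2, 3, 4}},
    {"root": "EnumSel", "covers": {1, 2}},     # subsumed by the Form tree
    {"root": "EnumRB", "covers": {5, 6}},
]

best = maximal_trees(trees)
uncovered = tokens - set().union(*(t["covers"] for t in best))
print([t["root"] for t in best], uncovered)
```

Token 7 stays uninterpreted here, which is the "interpret as much as possible" outcome: the parser returns partial trees instead of failing on the whole form.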

  19. Error handling: the “best-effort” parser can output multiple and partial parse trees • Union all the conditions interpreted by all the parse trees • Report both conflicts and missing elements as errors (Diagram: parsing, then union, over Form trees of EnumSel and EnumRB instances.)
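The union-and-report step on slide 19 can be sketched as a small merge function. It assumes each parse tree is reduced to a list of (condition name, covered tokens) pairs; a conflict is taken to mean a token claimed by more than one condition, and a missing element a token claimed by none. These definitions and all names are assumptions for illustration.

```python
def merge(trees, all_tokens):
    """Union conditions across trees; report conflicting and missing tokens."""
    conditions = []
    for tree in trees:
        for cond in tree:
            if cond not in conditions:   # same condition found by two trees
                conditions.append(cond)

    owners = {}
    for name, covered in conditions:
        for token in covered:
            owners.setdefault(token, []).append(name)

    conflicts = {tok: names for tok, names in owners.items() if len(names) > 1}
    missing = set(all_tokens) - set(owners)
    return conditions, conflicts, missing


tree1 = [("author", frozenset({"t1", "t2"})), ("title", frozenset({"t3", "t4"}))]
tree2 = [("title", frozenset({"t3", "t4"})), ("price", frozenset({"t4", "t5"}))]
all_tokens = {"t1", "t2", "t3", "t4", "t5", "t6"}

conditions, conflicts, missing = merge([tree1, tree2], all_tokens)
print(len(conditions), conflicts, missing)
```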

  20. Experiment: how well does a “global grammar” do? • Global grammar: derived from Basic; captures 21 patterns; 82 productions, 39 non-terminals, 16 terminals • Datasets: • Basic: 3 domains (Airfare, Autos, Books); 150 sources • NewSource: same domains, 30 sources • NewDomain: 6 new domains (Music, …), 42 sources • Random: 30 sources (from invisible-web.net) • Correctness judgment: number of correctly identified (grouping and tagging) conditions

  21. Conclusion: syntactic parsing for interface understanding. Query interface understanding by syntactic parsing with hidden grammars • Insight: exploit how semantics connects to presentation, in a syntactic way • Future work: • Constructing the grammar automatically • Developing a more sophisticated preference framework • Extending the framework to other applications

  22. Thank you! • For more information: online demo at the MetaQuerier project Web site, http://metaquerier.cs.uiuc.edu • We invite you to our MetaQuerier demo in the afternoon
