Visual Web Information Extraction With Lixto

  1. Visual Web Information Extraction With Lixto Robert Baumgartner Sergio Flesca Georg Gottlob

  2. Overview • Introduction and Motivation • Wrapper Generation • Extraction Language/Mechanisms • Testing Lixto • Results • Strengths & Weaknesses • Current/Future Work

  3. HTML vs. XML • HTML & XML represent semi-structured data • HTML mainly presentation oriented • Web content typically formatted in HTML • HTML lacks data querying

  4. XML Advantages • XML structure/layout separation • XML provides suitable data representation • XML sets act as database • XML sets queried via XML-GL, XML-QL, XQuery

  5. eBay Example • Lack of data querying ability increases the cost and time needed to retrieve information from web pages • Example: watch interesting eBay offers of notebooks • Criteria: • Auction contains the word “notebook” • Current value between GBP 1500 and 3000 • Received at least 3 bids

  6. eBay Problems • eBay does not support complex queries • Similar sites do not offer restricted queries • Large number of results returned with no possibility to further restrict the results • Only one site can be queried at a time • Results from different queries cannot be compiled into a single structured file

  7. eBay Solution • Lixto introduces new ideas and programming language concepts for wrapper generation • Lixto translates HTML to XML • Resulting XML can then be queried and further processed • Wrappers applied automatically to extract information from changing web pages
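
To illustrate why the translated XML is easier to query than the HTML page, here is a minimal Python sketch that applies the three eBay criteria from slide 5 to a small XML document. The XML content is invented for illustration, and the element names `Description` and `Bids` are assumptions (only `Item` and `Price` appear in the slides). It uses the standard library rather than XQuery or any Lixto component.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper output; the values are made up for this sketch.
XML = """<document>
  <Item><Description>Sony notebook</Description><Price>1800</Price><Bids>5</Bids></Item>
  <Item><Description>Notebook bag</Description><Price>20</Price><Bids>7</Bids></Item>
  <Item><Description>Dell notebook</Description><Price>2500</Price><Bids>1</Bids></Item>
</document>"""

root = ET.fromstring(XML)

# Criteria from the eBay example: contains "notebook",
# price between GBP 1500 and 3000, at least 3 bids.
hits = [
    item.findtext("Description")
    for item in root.findall("Item")
    if "notebook" in item.findtext("Description").lower()
    and 1500 <= float(item.findtext("Price")) <= 3000
    and int(item.findtext("Bids")) >= 3
]
print(hits)
```

The same filter cannot be expressed against the original result page without first locating these fields in the HTML, which is the job of the wrapper.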

  8. Lixto Advantages • Easy to learn • Full visual and interactive UI provided • No fine tuning required • No knowledge of internal language necessary • No knowledge of HTML necessary • Graphical region marking and selection • Works directly on browser-display pages, no additional view necessary

  9. Lixto Advantages • Extraction of target patterns based on: • Surrounding landmarks • Actual content • HTML attributes • Order of appearance • Semantic and syntactic concepts • Extraction from flat strings possible • Semi-automatic wrapper generation

  10. Advanced Lixto Features • Disjunctive pattern definitions • Crawling page links during extraction • Recursive wrapping • Extracted data can have disjoint structure from HTML source page • Internal data structure language Elog

  11. Implemented Lixto System

  12. Architecture and Implementation • Lixto created with Java using Swing, OroMatcher and JDOM • Lixto toolkit contains three modules: • Interactive Pattern Builder • Extractor • XML Generator

  13. Creating Wrappers • Lixto wrappers created interactively using patterns in a hierarchical order • Pattern names act as default XML elements <Item> <Price> • Sub patterns express 1:* relationships • Each pattern characterizes one kind of information • Each pattern is defined by one or more filters
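
A minimal Python sketch of how a hierarchy of pattern instances could map to XML, with pattern names used as element names and lists modelling the 1:* relationship between a pattern and its sub patterns. The nested-dictionary layout, the `to_xml` helper, the `Description` pattern and the sample values are assumptions made for illustration; this is not Lixto's XML Generator.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction result: pattern name -> list of instances.
# An instance is either a string (leaf) or a dict of sub-pattern instances.
instances = {
    "Item": [
        {"Description": ["Sony notebook"], "Price": ["GBP 1800"]},
        {"Description": ["Dell notebook"], "Price": ["GBP 2500"]},
    ]
}

def to_xml(name, value):
    """Turn one pattern instance into an element named after its pattern."""
    element = ET.Element(name)
    if isinstance(value, str):
        element.text = value
    else:
        for child_name, child_values in value.items():
            for child_value in child_values:  # 1:* relationship
                element.append(to_xml(child_name, child_value))
    return element

root = to_xml("document", instances)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```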

  14. Filter Creation • User highlights desired target • Internally Elog rule created describing filter • Add restrictive conditions to filter • Goals added to Elog rule body • Filter conditions: • Before/after • Not before/not after • Internal • Range
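
The before/after and not before/not after conditions can be pictured as checks on what surrounds a candidate target. The following Python sketch works on a flat list of nodes in document order; the `Node` class, the `before` and `not_after` helpers and the distance parameter are hypothetical and only approximate the idea, not the actual Elog predicates.

```python
from dataclasses import dataclass

@dataclass
class Node:
    pos: int   # position in document order (assumed numbering)
    tag: str
    text: str

def before(nodes, target, landmark_text, max_dist):
    """True if a landmark occurs at most max_dist positions before target."""
    return any(
        n.text == landmark_text and 0 < target.pos - n.pos <= max_dist
        for n in nodes
    )

def not_after(nodes, target, landmark_text):
    """True if no landmark occurs anywhere after target."""
    return not any(
        n.text == landmark_text and n.pos > target.pos for n in nodes
    )

nodes = [
    Node(0, "td", "Price"),
    Node(1, "td", "GBP 1800"),
    Node(2, "td", "Bids"),
    Node(3, "td", "5"),
]

# Start from every <td> and narrow the matches with a condition.
candidates = [n for n in nodes if n.tag == "td"]
prices = [n.text for n in candidates if before(nodes, n, "Price", 1)]
tail = [n.text for n in candidates if not_after(nodes, n, "Bids")]
print(prices, tail)
```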

  15. Pattern Creation Algorithm • Loading initial document creates a <document> pattern • User highlights instance of the pattern • Lixto displays all matched instances of the pattern

  16. Pattern Creation Algorithm • User can add filters to limit the matched targets • The set of filters is added to the <document> pattern • Test if <document> pattern extracts exactly the desired set of data • If yes, save the pattern; if no, select a new instance of the pattern

  17. Generation of a New Pattern

  18. The Lixto Browser

  19. Conditional Generation

  20. Visual Interface • Visual tree pattern construction • Regular expression string patterns • XML visualization tool • Concept generator • Regular expression / database driven • Creates “isCity”, “isDate” • Requires no regular expression knowledge
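
As a rough picture of the two kinds of concepts mentioned above, the Python sketch below defines a regular-expression-driven `is_date` and a lookup-driven `is_city`. The date format, the city list and both function names are assumptions for illustration; an in-memory set stands in for the database.

```python
import re

# Assumed date format (day/month/year); real concepts may accept more forms.
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")

# Stand-in for a database table of known city names.
KNOWN_CITIES = {"vienna", "london", "rome"}

def is_date(text: str) -> bool:
    """Syntactic concept: decided by a regular expression."""
    return DATE_RE.fullmatch(text.strip()) is not None

def is_city(text: str) -> bool:
    """Semantic concept: decided by a lookup in known values."""
    return text.strip().lower() in KNOWN_CITIES

print(is_date("12/03/2001"), is_city("Vienna"), is_city("notebook"))
```

A user of the visual interface would only pick the concept by name, which is why no regular expression knowledge is needed.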

  21. Main Menu / Pattern Generation Menu

  22. Elog • Internal data storage language • Datalog-like syntax and semantics • Invisible to the user • Specifically designed for hierarchical and modular data extraction • Flexible, intuitive, easily extensible • Patterns stored as narrowing (logical and) and broadening (logical or) steps • Elog rules are implementations of the visually defined filters
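
The narrowing and broadening steps can be sketched in Python as follows: each filter is a conjunction of conditions, and a pattern is a disjunction of filters. This is plain Python, not Elog syntax, and the helper names and the two price filters are invented for illustration.

```python
def make_filter(conditions):
    """A filter accepts a candidate only if every condition holds (narrowing)."""
    return lambda candidate: all(cond(candidate) for cond in conditions)

def make_pattern(filters):
    """A pattern accepts a candidate if any of its filters does (broadening)."""
    return lambda candidate: any(f(candidate) for f in filters)

# Two hypothetical filters for one "price" pattern.
price_gbp = make_filter([
    lambda s: s.startswith("GBP "),
    lambda s: s[4:].isdigit(),
])
price_usd = make_filter([
    lambda s: s.startswith("$"),
    lambda s: s[1:].isdigit(),
])
price = make_pattern([price_gbp, price_usd])

candidates = ["GBP 1800", "$2500", "GBP n/a", "5 bids"]
matched = [c for c in candidates if price(c)]
print(matched)
```

Adding a condition to a filter can only shrink the set of matches, while adding a filter to a pattern can only grow it.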

  23. Elog Extraction Program for eBay Example

  24. Document Model • Brackets specify character offsets • Nodes numbered in depth-first left-to-right fashion • HTML tags refer to element sets containing attribute names and values • <body> tag contains attributes • {(name,body), (bgcolor,FFFFFF),(elementtext,…)}
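
A minimal Python sketch of this document model: nodes are visited depth-first and left to right, and each one is represented by its attribute names and values, including `name` and `elementtext`. The sample page, the zero-based numbering and the use of `xml.etree.ElementTree` on a well-formed snippet are assumptions; character offsets are left out.

```python
import xml.etree.ElementTree as ET

PAGE = '<body bgcolor="FFFFFF"><table><tr><td>a</td></tr></table><p>b</p></body>'
root = ET.fromstring(PAGE)

nodes = []
# Element.iter() walks the tree in document order (depth-first, left to right).
for number, element in enumerate(root.iter()):
    attributes = {"name": element.tag}
    attributes.update(element.attrib)
    attributes["elementtext"] = "".join(element.itertext())
    nodes.append((number, attributes))

for number, attributes in nodes:
    print(number, attributes)
```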

  25. HTML Example Page

  26. XML Translation

  27. Extraction Mechanisms • Tree extraction • Elements identified by tree path (*.table.*.tr) • Attribute constraints reduce matched elements • Element path definition (epd): tree path + attribute constraints • String extraction • Strings stored in ‘context’ nodes • Regular expression matching
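
The two mechanisms can be sketched in Python as below. The sketch assumes a tree path written as dot-separated steps, such as `*.table.*.tr`, where `*` stands for any (possibly empty) run of tags, and it models attribute constraints as exact matches. The `subelem` helper name, the sample page and the price regular expression are illustrative assumptions, not Lixto's implementation.

```python
import re
import xml.etree.ElementTree as ET

def path_to_regex(tree_path):
    """Compile a tree path into a regex over 'tag.tag.' path strings."""
    parts = []
    for step in tree_path.split("."):
        if step == "*":
            parts.append(r"(?:[^.]+\.)*")   # any run of tags, maybe empty
        elif step:
            parts.append(re.escape(step) + r"\.")
    return re.compile("".join(parts))

def walk(element, prefix=""):
    """Yield (path, element) for every descendant, in document order."""
    for child in element:
        path = prefix + child.tag + "."
        yield path, child
        yield from walk(child, path)

def subelem(source, tree_path, constraints=None):
    """Tree extraction: tree path plus attribute constraints."""
    regex = path_to_regex(tree_path)
    constraints = constraints or {}
    return [
        el for path, el in walk(source)
        if regex.fullmatch(path)
        and all(el.get(k) == v for k, v in constraints.items())
    ]

PAGE = ('<body><table><tr class="head"><td>Item</td></tr>'
        '<tr class="row"><td>notebook GBP 1800</td></tr></table></body>')
body = ET.fromstring(PAGE)

all_rows = subelem(body, "*.table.*.tr")
data_rows = subelem(body, "*.table.*.tr", {"class": "row"})

# String extraction: a regular expression applied to the element text.
text = "".join(data_rows[0].itertext())
price = re.search(r"GBP (\d+)", text).group(1)
print(len(all_rows), len(data_rows), price)
```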

  28. HTML Tree Extraction

  29. Lixto Test Sites

  30. Results

  31. Strengths & Weaknesses • Intuitive UI (If it needs a manual it’s not a good program) • Highly customizable • Supports crawling across web sites • No tree output after crawling • Slow • Extracts only one target type at a time

  32. Current/Future Work • Extend tree structure to support crawling across multiple sites (crawling is currently supported) • Server-based Lixto system • Automated heuristics • Support for multiple example targets at once • Embedding Lixto wrappers into information channel system
