Visual Web Information Extraction With Lixto Robert Baumgartner Sergio Flesca Georg Gottlob
Overview • Introduction and Motivation • Wrapper Generation • Extraction Language/Mechanisms • Testing Lixto • Results • Strengths & Weaknesses • Current/Future Work
HTML vs. XML • HTML & XML represent semi-structured data • HTML mainly presentation oriented • Web content typically formatted in HTML • HTML lacks data querying
XML Advantages • XML separates structure from layout • XML provides a suitable data representation • XML data sets can act as a database • XML data sets can be queried via XML-GL, XML-QL, or XQuery
eBay Example • The lack of data querying increases the cost and time of retrieving information from web pages • Example: watch interesting eBay offers of notebooks • Criteria: • Auction description contains the word “notebook” • Current value between GBP 1500 and 3000 • Received at least 3 bids
eBay Problems • eBay does not support complex queries • Similar sites do not offer restricted queries either • Large number of results returned, with no way to restrict them further • Only one site can be queried at a time • Results from different queries cannot be compiled into a single structured file
eBay Solution • Lixto introduces new ideas and programming language concepts for wrapper generation • Lixto translates HTML to XML • Resulting XML can then be queried and further processed • Wrappers applied automatically to extract information from changing web pages
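Once a wrapper has produced XML, the three criteria of the eBay example reduce to a short filter. The following is a minimal sketch in Python, not Lixto output: the element names (`item`, `description`, `price`, `bids`) and the sample data are invented for illustration, and the function `matching_offers` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper output; element names and values are invented for illustration.
SAMPLE_XML = """
<document>
  <item><description>Dell notebook, 14 inch</description><price>1750</price><bids>5</bids></item>
  <item><description>Notebook bag</description><price>25</price><bids>7</bids></item>
  <item><description>IBM Notebook T20</description><price>2100</price><bids>1</bids></item>
  <item><description>Desktop PC</description><price>1600</price><bids>4</bids></item>
</document>
"""

def matching_offers(xml_text, word="notebook", low=1500, high=3000, min_bids=3):
    """Return descriptions of items that satisfy all three criteria."""
    root = ET.fromstring(xml_text)
    hits = []
    for item in root.iter("item"):
        description = item.findtext("description", default="")
        price = float(item.findtext("price", default="nan"))
        bids = int(item.findtext("bids", default="0"))
        # All three conditions must hold: keyword, price window, minimum bids.
        if word in description.lower() and low <= price <= high and bids >= min_bids:
            hits.append(description)
    return hits

offers = matching_offers(SAMPLE_XML)
```

In practice the same restriction could be expressed in an XML query language such as XQuery; Python is used here only to keep the sketch self-contained.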
Lixto Advantages • Easy to learn • Fully visual and interactive UI provided • No fine tuning required • No knowledge of the internal language necessary • No knowledge of HTML necessary • Graphical region marking and selection • Works directly on browser-displayed pages, no additional view necessary
Lixto Advantages • Extraction of target patterns based on: • Surrounding landmarks • Actual content • HTML attributes • Order of appearance • Semantic and syntactic concepts • Extraction from flat strings possible • Semi-automatic wrapper generation
Advanced Lixto Features • Disjunctive pattern definitions • Crawling page links during extraction • Recursive wrapping • Extracted data can have disjoint structure from HTML source page • Internal data structure language Elog
Architecture and Implementation • Lixto is implemented in Java using Swing, OroMatcher and JDOM • The Lixto toolkit contains three modules: • Interactive Pattern Builder • Extractor • XML Generator
Creating Wrappers • Lixto wrappers are created interactively using patterns in a hierarchical order • Pattern names act as default XML element names, e.g. <Item>, <Price> • Sub-patterns express 1:* relationships • Each pattern characterizes one kind of information • Each pattern is defined by one or more filters
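The mapping from a pattern hierarchy to XML can be sketched as follows. This is an illustrative assumption about the shape of the data, not the Lixto XML Generator: extracted instances are modelled as nested dictionaries keyed by pattern name, and the helper `to_xml` and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

def to_xml(pattern_name, instance):
    """Turn one pattern instance into an XML element named after its pattern."""
    elem = ET.Element(pattern_name)
    if isinstance(instance, dict):
        # A parent instance maps each sub-pattern name to a list of child
        # instances, which models the 1:* relationship.
        for child_name, children in instance.items():
            for child in children:
                elem.append(to_xml(child_name, child))
    else:
        elem.text = str(instance)  # leaf pattern: the extracted string
    return elem

# Invented sample data: two Item instances, each with Description and Price.
extracted = {
    "Item": [
        {"Description": ["Notebook A"], "Price": ["1750"]},
        {"Description": ["Notebook B"], "Price": ["2100"]},
    ]
}
xml_text = ET.tostring(to_xml("document", extracted), encoding="unicode")
```

Using the pattern name as the tag keeps the output structure tied to the wrapper definition rather than to the layout of the HTML source.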
Filter Creation • User highlights desired target • Internally Elog rule created describing filter • Add restrictive conditions to filter • Goals added to Elog rule body • Filter conditions: • Before/after • Not before/not after • Internal • Range
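The effect of the filter conditions can be illustrated with a small sketch. It assumes a strongly simplified model in which a page is a flat list of strings and every condition inspects a candidate position in that list; the function names (`before`, `not_before`, `internal`, `apply_filter`) are hypothetical and do not reproduce Elog predicates.

```python
from typing import Callable, List, Optional, Tuple

# A condition inspects the whole element sequence and one candidate index.
Condition = Callable[[List[str], int], bool]

def before(landmark: str) -> Condition:
    """Holds when `landmark` occurs somewhere before the candidate."""
    return lambda elems, i: landmark in elems[:i]

def not_before(landmark: str) -> Condition:
    """Holds when `landmark` does not occur before the candidate."""
    return lambda elems, i: landmark not in elems[:i]

def internal(text: str) -> Condition:
    """Holds when the candidate itself contains `text`."""
    return lambda elems, i: text in elems[i]

def apply_filter(elems: List[str],
                 is_candidate: Callable[[str], bool],
                 conditions: List[Condition],
                 match_range: Optional[Tuple[int, int]] = None) -> List[str]:
    """Keep candidates satisfying every condition, then apply a 1-based range."""
    hits = [i for i, e in enumerate(elems)
            if is_candidate(e) and all(c(elems, i) for c in conditions)]
    if match_range is not None:
        first, last = match_range
        hits = hits[first - 1:last]
    return [elems[i] for i in hits]

ELEMS = ["header", "row: notebook A", "row: notebook B", "ad", "footer", "row: stray"]
is_row = lambda e: e.startswith("row")
conds = [before("header"), not_before("footer"), internal("notebook")]
rows = apply_filter(ELEMS, is_row, conds)
second_only = apply_filter(ELEMS, is_row, conds, match_range=(2, 2))
```

Each added condition can only shrink the set of matches, which mirrors how conditions restrict a filter.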
Pattern Creation Algorithm • Loading initial document creates a <document> pattern • User highlights instance of the pattern • Lixto displays all matched instances of the pattern
Pattern Creation Algorithm • User can add filters to limit the matched targets • The set of filters is added to the <document> pattern • Test whether the <document> pattern extracts exactly the desired set of data • If yes, save the pattern; if not, select a new instance of the pattern
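The loop described on these two slides can be sketched as below. This is a simplified model under stated assumptions: the interactive user is replaced by a scripted list of filters, a filter is a list of predicates that must all hold, and a pattern extracts whatever any of its filters accepts. The names `extract` and `build_pattern` are hypothetical.

```python
from typing import Callable, List, Set

Predicate = Callable[[str], bool]
Filter = List[Predicate]  # every predicate must hold for the filter to match

def extract(pattern: List[Filter], elements: List[str]) -> Set[str]:
    """An element is extracted when at least one filter accepts it."""
    return {e for e in elements if any(all(p(e) for p in f) for f in pattern)}

def build_pattern(elements: List[str], desired: Set[str],
                  scripted_filters: List[Filter]) -> List[Filter]:
    """Add filters one by one until the pattern extracts exactly `desired`."""
    pattern: List[Filter] = []
    for new_filter in scripted_filters:  # stands in for the user marking an instance
        pattern.append(new_filter)
        if extract(pattern, elements) == desired:
            return pattern  # test passed: the pattern can be saved
    raise ValueError("scripted filters never produced exactly the desired set")

# Invented sample data.
ELEMENTS = ["GBP 1500", "EUR 20", "$ 30", "free"]
DESIRED = {"GBP 1500", "EUR 20"}
FILTERS = [
    [lambda e: e.startswith("GBP")],
    [lambda e: e.startswith("EUR"), lambda e: any(ch.isdigit() for ch in e)],
]
pattern = build_pattern(ELEMENTS, DESIRED, FILTERS)
```

Here the first filter alone extracts too little, so the loop continues with a second filter, just as the user would select a further instance when the test fails.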
Visual Interface • Visual tree pattern construction • Regular expression string patterns • XML visualization tool • Concept generator • Regular expression / database driven • Creates “isCity”, “isDate” • Requires no regular expression knowledge
Elog • Internal data storage language • Datalog-like syntax and semantics • Invisible to the user • Specifically designed for hierarchical and modular data extraction • Flexible, intuitive, easily extensible • Patterns stored as narrowing (logical and) and broadening (logical or) steps • Elog rules are implementations of the visually defined filters
Document Model • Brackets specify character offsets • Nodes numbered in depth-first left-to-right fashion • HTML tags refer to element sets containing attribute names and values • <body> tag contains attributes • {(name,body), (bgcolor,FFFFFF),(elementtext,…)}
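A rough sketch of this document model, assuming a well-formed snippet that the Python standard library can parse: nodes are numbered in depth-first, left-to-right order, and each node carries a set of name/value pairs. Starting the numbering at 1 and taking `elementtext` to be the concatenated text below the element are assumptions made for illustration; the helper `number_nodes` is hypothetical.

```python
import xml.etree.ElementTree as ET

SNIPPET = ('<body bgcolor="FFFFFF"><table><tr><td>item</td><td>price</td></tr>'
           '</table><p>end</p></body>')

def number_nodes(markup):
    """Return (number, attribute dict) pairs in depth-first, left-to-right order."""
    root = ET.fromstring(markup)
    nodes = []
    # Element.iter() walks the tree in document order (depth-first pre-order).
    for number, elem in enumerate(root.iter(), start=1):
        attrs = {"name": elem.tag, **elem.attrib,
                 "elementtext": "".join(elem.itertext())}
        nodes.append((number, attrs))
    return nodes

nodes = number_nodes(SNIPPET)
names_in_order = [attrs["name"] for _, attrs in nodes]
```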
Extraction Mechanisms • Tree extraction • Elements identified by tree path (*.table.*.tr) • Attribute constraints reduce the set of matched elements • Element path definition (epd): tree path + attribute constraints • String extraction • Strings stored in ‘context’ nodes • Regular expression matching
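Both mechanisms can be approximated in a few lines. The sketch below assumes that `*` in a tree path stands for any chain of elements (possibly empty), that paths are taken relative to the element being searched, and that attribute constraints are exact matches; the helpers `path_regex` and `subelements` and the sample page are hypothetical, not part of Lixto.

```python
import re
import xml.etree.ElementTree as ET

def path_regex(tree_path):
    """Translate a dotted tree path with '*' wildcards into a regular expression."""
    parts = []
    for step in tree_path.split("."):
        # '*' matches any chain of tag names; a name matches exactly one step.
        parts.append(r"(?:[^.]+\.)*" if step == "*" else re.escape(step) + r"\.")
    return re.compile("".join(parts))

def subelements(root, tree_path, constraints=None):
    """Descendants of `root` matching the tree path and all attribute constraints."""
    pattern = path_regex(tree_path)
    constraints = constraints or {}
    found = []

    def walk(node, prefix):
        for child in node:
            path = prefix + child.tag + "."
            if pattern.fullmatch(path) and all(
                    child.get(k) == v for k, v in constraints.items()):
                found.append(child)
            walk(child, path)

    walk(root, "")
    return found

PAGE = ET.fromstring(
    '<body><table><tr class="head"><td>Item</td></tr>'
    '<tr class="row"><td>Notebook A</td><td>GBP 1750.00</td></tr>'
    '<tr class="row"><td>Notebook B</td><td>GBP 2100.50</td></tr></table></body>')

# Tree extraction: tree path plus an attribute constraint.
rows = subelements(PAGE, "*.table.*.tr", {"class": "row"})
# String extraction: a regular expression applied to each row's text.
prices = [re.search(r"GBP (\d+(?:\.\d+)?)", "".join(r.itertext())).group(1)
          for r in rows]
```

The tree step narrows the page down to candidate elements, and the string step then pulls a value out of the flat text inside each of them.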
Strengths & Weaknesses • Intuitive UI (If it needs a manual it’s not a good program) • Highly customizable • Supports crawling across web sites • No tree output after crawling • Slow • Extracts only one target type at a time
Current/Future Work • Extend the tree structure to support crawling across multiple sites (crawling is currently supported) • Server-based Lixto system • Automated heuristics • Support for multiple example targets at once • Embedding Lixto wrappers into an information channel system