myPortal: Robust Extraction and Aggregation of Web Content
Marek Kowalkiewicz, Tomasz Kaczmarek, Witold Abramowicz
Tomasz Kaczmarek
The Poznan University of Economics, Poland
Background
• Personalized access to information
• Dynamic content on web pages
• Various techniques for content extraction:
  • Based on unique ID
  • Using contextual information
  • Document tree analysis
myPortal vision
• Ability to extract content blocks from HTML pages
• Easy aggregation
• Client-side technology (no server-side investment necessary)
• Emphasis on:
  • Robustness
  • Ease of use
[Figure: "My portal" (label repeated on three panels)]
Extraction technique
• Extraction based on HTML DOM tree
• Absolute XPath
• Relative XPath
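The difference between the two query styles can be sketched in a few lines. This is a minimal illustration, not the myPortal implementation: the page, the "Weather" reference heading, and both path expressions are made up, and Python's standard-library ElementTree (which supports only a subset of XPath and needs well-formed markup) stands in for a real XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """
<html>
  <body>
    <div><h2>News</h2><p>Headline of the day</p></div>
    <div><h2>Weather</h2><p>Sunny, 21 C</p></div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Absolute-style query: a fixed chain of steps counted from the document root.
ABSOLUTE = "./body/div[2]/p"

# Relative-style query: find the reference element by its text,
# then follow a short path from it to the content block.
RELATIVE = ".//h2[.='Weather']/../p"


def extract(tree_root, path):
    """Return the text of the first matching node, or None if nothing matches."""
    node = tree_root.find(path)
    return None if node is None else node.text


absolute_result = extract(root, ABSOLUTE)
relative_result = extract(root, RELATIVE)
print(absolute_result)
print(relative_result)
```

On this unchanged page both queries select the same paragraph. They differ only in what they depend on: the absolute query on the position of every ancestor, the relative query on the reference element and its immediate surroundings.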
Visual query specification
• Reference element
• Extracted content
Functionality
Done:
• Extract content from any HTML page
• Record POST and GET parameters and cookies
• Access search results (via GET or POST) from search engines, as a subscription-like service
Work in progress:
• Deal with multi-stage login or query mechanisms, such as obtaining bank account information
• Deal with information from multiple DOM tree branches in a single query
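One way to picture "recording" a query is as a small data record that can be replayed later. The sketch below is an assumption about what such a record might hold. The `RecordedQuery` class, its field names, and the example URL are hypothetical and are not taken from myPortal. It only builds a request object with Python's standard library and does not contact any server.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode
from urllib.request import Request


@dataclass
class RecordedQuery:
    """Hypothetical record of everything needed to re-fetch a page."""
    url: str
    method: str = "GET"                          # "GET" or "POST"
    params: dict = field(default_factory=dict)   # recorded form/query parameters
    cookies: dict = field(default_factory=dict)  # recorded cookies
    xpath: str = ""                              # extraction query to run on the response

    def build_request(self) -> Request:
        """Rebuild the HTTP request from the recorded pieces (no network access)."""
        encoded = urlencode(self.params)
        headers = {}
        if self.cookies:
            headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in self.cookies.items())
        if self.method.upper() == "POST":
            # POST parameters travel in the request body.
            return Request(self.url, data=encoded.encode("ascii"),
                           headers=headers, method="POST")
        # GET parameters travel in the query string.
        url = f"{self.url}?{encoded}" if encoded else self.url
        return Request(url, headers=headers, method="GET")


# Example: a recorded search-engine query (made-up URL and values).
search = RecordedQuery(
    url="https://example.org/search",
    params={"q": "xpath"},
    cookies={"session": "abc"},
    xpath=".//h2[.='Results']/../p",
)
request = search.build_request()
print(request.get_method(), request.full_url)
```

Storing the extraction query next to the request data means that a single record can be replayed on every visit, which is what a subscription-like view of search results would need.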
Other (technical) problems
• HTML code quality: HTML Tidy
• WYSIWYG for aggregation
• Robustness
  • Multiple occurrences of reference element
  • Document structure changes between reference and extracted elements
  • Deletion / change in the reference element
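Two of the robustness problems listed above, multiple occurrences and deletion or change of the reference element, can be detected before any content is extracted. The following is a minimal sketch under assumptions: the `reference_status` helper and the sample pages are hypothetical, and the markup is assumed to have already been cleaned into well-formed XML so that the standard-library parser accepts it.

```python
import xml.etree.ElementTree as ET


def reference_status(root, reference_path):
    """Classify how many elements match the reference query (hypothetical helper)."""
    matches = root.findall(reference_path)
    if not matches:
        return "missing"      # reference element deleted or changed
    if len(matches) > 1:
        return "ambiguous"    # multiple occurrences of the reference element
    return "unique"


REFERENCE = ".//h2[.='Weather']"

# Made-up pages illustrating the three situations.
unique_page = ET.fromstring(
    "<html><body><div><h2>Weather</h2><p>Sunny</p></div></body></html>")
ambiguous_page = ET.fromstring(
    "<html><body><div><h2>Weather</h2><p>Sunny</p></div>"
    "<div><h2>Weather</h2><p>Archive</p></div></body></html>")
missing_page = ET.fromstring(
    "<html><body><div><h2>Forecast</h2><p>Sunny</p></div></body></html>")

for name, page in [("unique", unique_page),
                   ("ambiguous", ambiguous_page),
                   ("missing", missing_page)]:
    print(name, "->", reference_status(page, REFERENCE))
```

A check like this does not solve either problem. It only lets the aggregator report an unreliable query instead of silently showing the wrong block.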
Research on robustness
• Purpose: to check whether relative XPath expressions are more robust than absolute XPath expressions
Research method
• Empirical tests on multiple portals
• Manual preparation of absolute and relative queries
• Comparison of results in three categories:
  • Accurate extraction
  • Lack of result
  • Inaccurate extraction
• Based on historical versions of portal sites obtained from the Web Archive
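The comparison procedure can be sketched as a small test harness. Everything concrete here is an assumption made for illustration: the four toy snapshots, the expected texts, and both queries are invented, and the tallies say nothing about the actual study. The sketch only shows how each extraction falls into one of the three categories.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy snapshots (hypothetical), each paired with the text a human would expect.
SNAPSHOTS = [
    # original layout
    ("<html><body><div><h2>News</h2><p>Headline</p></div>"
     "<div><h2>Weather</h2><p>Sunny</p></div></body></html>", "Sunny"),
    # a new block inserted before the existing ones
    ("<html><body><div><p>Advert</p></div><div><h2>News</h2><p>Headline</p></div>"
     "<div><h2>Weather</h2><p>Rain</p></div></body></html>", "Rain"),
    # all blocks wrapped in an extra container
    ("<html><body><div><div><h2>News</h2><p>Headline</p></div>"
     "<div><h2>Weather</h2><p>Fog</p></div></div></body></html>", "Fog"),
    # reference heading renamed
    ("<html><body><div><h2>News</h2><p>Headline</p></div>"
     "<div><h2>Forecast</h2><p>Snow</p></div></body></html>", "Snow"),
]

QUERIES = {
    "absolute": "./body/div[2]/p",
    "relative": ".//h2[.='Weather']/../p",
}


def classify(extracted, expected):
    """Map one extraction outcome to one of the three result categories."""
    if extracted is None:
        return "lack of result"
    if extracted == expected:
        return "accurate extraction"
    return "inaccurate extraction"


def run(queries, snapshots):
    """Run every query on every snapshot and count outcomes per query."""
    tallies = {name: Counter() for name in queries}
    for html, expected in snapshots:
        root = ET.fromstring(html)
        for name, path in queries.items():
            node = root.find(path)
            extracted = None if node is None else node.text
            tallies[name][classify(extracted, expected)] += 1
    return tallies


tallies = run(QUERIES, SNAPSHOTS)
for name, counts in tallies.items():
    print(name, dict(counts))
```

The tallies for the two queries differ on these snapshots, but that reflects how the snapshots were constructed and is not a finding. In a real run the snapshots would be archived versions of each portal, and the expected text would be determined by manual inspection.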
Thank you! t.kaczmarek@kie.ae.poznan.pl