A Brief Survey of Web Data Extraction Tools (WDET)

Laender et al. Introduction: Web data is hard to query because much of it is unstructured. Wrappers can help extract data: a wrapper maps a page to a repository. There are several ways to generate wrappers.


Presentation Transcript


  1. A Brief Survey of Web Data Extraction Tools (WDET) Laender et al.

  2. Introduction • Web data is hard to query • Much of it is unstructured • Wrappers can help extract data • A wrapper maps a page to a repository • There are several ways to generate wrappers • This paper surveys tools for generating wrappers
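To make "a wrapper maps a page to a repository" concrete, here is a minimal hand-written sketch. It is not taken from the paper: the sample page, the `ROW` pattern, and the `wrap` function are all hypothetical, and the "repository" is simply a list of dictionaries.

```python
import re

# Hypothetical sample page: each record is a bold country name
# followed by an italic dialing code.
PAGE = """
<html><body>
<ul>
  <li><b>Congo</b> <i>242</i></li>
  <li><b>Egypt</b> <i>20</i></li>
</ul>
</body></html>
"""

# Hand-written extraction rule (assumed layout: <b>name</b> <i>code</i>).
ROW = re.compile(r"<b>(?P<country>.*?)</b>\s*<i>(?P<code>.*?)</i>")


def wrap(page):
    """Map a page to a list of records, one dict per matched row."""
    return [match.groupdict() for match in ROW.finditer(page)]


records = wrap(PAGE)
for record in records:
    print(record["country"], record["code"])
```

A rule like this breaks as soon as the page layout changes, which is one reason the tools surveyed here try to generate wrappers rather than have them written by hand.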

  3. Taxonomy of WDET • Languages for Wrapper Development • HTML-aware Tools • NLP-based Tools • Wrapper Induction Tools • Modeling-based Tools • Ontology-based Tools

  4. Overview of WDET • Languages for Wrapper Development: languages designed for writing wrappers (Minerva, TSIMMIS) • HTML-aware Tools: W4F, XWRAP, RoadRunner • NLP-based Tools: work on free text (RAPIER, SRV, WHISK)

  5. Taxonomy of WDET • Wrapper Induction Tools: generate wrappers from training examples (WIEN, SoftMealy, STALKER) • Modeling-based Tools: based on hierarchies of objects (NoDoSE, DEByE) • Ontology-based Tools: use conceptual models or ontologies (BYU tool)
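The wrapper induction idea can be sketched as learning a left and a right delimiter for each field from a labeled example page. The code below is a simplified illustration in the spirit of delimiter-based wrappers. It is not the actual algorithm of WIEN, SoftMealy, or STALKER, and the function names and sample pages are hypothetical.

```python
import os


def common_prefix(strings):
    # os.path.commonprefix compares character by character.
    return os.path.commonprefix(strings)


def common_suffix(strings):
    return common_prefix([s[::-1] for s in strings])[::-1]


def induce_lr(page, records):
    """Learn one (left, right) delimiter pair per field.

    `records` lists the labeled field values in page order.
    """
    n = len(records[0])
    spans, cursor = [], 0
    for record in records:
        for value in record:
            start = page.index(value, cursor)
            cursor = start + len(value)
            spans.append((start, cursor))
    befores = [[] for _ in range(n)]
    afters = [[] for _ in range(n)]
    for i, (start, end) in enumerate(spans):
        prev_end = spans[i - 1][1] if i > 0 else 0
        next_start = spans[i + 1][0] if i + 1 < len(spans) else len(page)
        befores[i % n].append(page[prev_end:start])
        afters[i % n].append(page[end:next_start])
    rules = [(common_suffix(befores[k]), common_prefix(afters[k]))
             for k in range(n)]
    if any(not left or not right for left, right in rules):
        raise ValueError("examples do not share usable delimiters")
    return rules


def extract(page, rules):
    """Apply the learned delimiters repeatedly until no record matches."""
    rows, cursor = [], 0
    while True:
        row = []
        for left, right in rules:
            at = page.find(left, cursor)
            if at == -1:
                return rows
            start = at + len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])
            cursor = end  # delimiters of adjacent fields may overlap
        rows.append(tuple(row))


TRAIN = ("<html><body><b>Congo</b> <i>242</i><br>"
         "<b>Egypt</b> <i>20</i><br></body></html>")
NEW = ("<html><body><b>Spain</b> <i>34</i><br>"
       "<b>Chile</b> <i>56</i><br></body></html>")

rules = induce_lr(TRAIN, [("Congo", "242"), ("Egypt", "20")])
rows = extract(NEW, rules)
print(rows)
```

The sketch assumes every record uses the same delimiters and that a delimiter never occurs inside a field value. Real induction tools check such conditions and handle more irregular layouts.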

  6. Qualitative Analysis • Degree of Automation • Support for Complex Objects • Page Contents: Semistructured data or text • Ease of Use • XML Output • Support for Non-HTML Sources • Resilience and Adaptiveness

  7. Conclusions

  8. Conclusions

  9. Questions
