1 / 21

Automating the Extraction of Domain Specific Information from the Web

Automating the Extraction of Domain Specific Information from the Web. A Case Study for the Genealogical Domain Troy Walker Thesis Proposal January 2004. Research funded by NSF. Genealogical Information on the Web. Hundreds of thousands of sites

hanh
Download Presentation

Automating the Extraction of Domain Specific Information from the Web

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Automating the Extraction of Domain Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Proposal January 2004 Research funded by NSF

  2. Genealogical Information on the Web • Hundreds of thousands of sites • Some professional (Ancestry.com, Familysearch.org) • Mostly hobbyist (203,200 indexed by Cyndislist.com) • Search engines • “Walker genealogy” on Google: 199,000 results • 1 page/minute = 5 months to go through • Why not enlist the help of a computer?

  3. Problems • No standard way of presenting data • Text formatted with HTML tags • Tables • Forms to access information • Sites have differing schemas

  4. Proposed Solution • Based on Ontos and other work done by the BYU Data Extraction Group (DEG) • Able to extract from: • Single-Record or Multiple Record Documents • Tables • Forms • Scalable and robust to changes in pages • Easily adaptable to other domains

  5. Text

  6. Tables

  7. Forms

  8. Forms

  9. URL List Single- or Multiple-Record Engine Document Retriever and Structure Recognizer User Query URL Selector Table Engine Data Constrainer Result Filter Result Presenter Form Engine Ontology System Overview

  10. User Query • Generated from ontology • Generated once per application domain

  11. User Query

  12. URL Listand URL Selector • Contains Genealogy URLs • Search each URL—too much time • Select likely URLs • Distribute document processing using DOGMA

  13. URL Listand Document Retriever

  14. Document Structure Recognizer • Requests analysis from each Data Extraction Engine • Selects appropriate method

  15. Data Extraction Engines • Text • Improved record-separation • Ability to handle single-record pages • Table • Forms

  16. Data Constrainer • Selects attribute/value pairs • Fits data to ontology

  17. Result Filter • Fits data to query • Returns to central Result Presenter

  18. Result Presenter • Creates XML Schema from Ontology • Presents results to user

  19. Result Presenter

  20. Evaluation • Scalability • Query on large URL list • Experiment on number of PCs • Precision and recall • Recall difficult to determine • Query on small URL list • Adaptability • Car ontology • Small URL list

  21. Conclusion • Integrates, builds on previous DEG work • Extracts from: • Single- or Multiple-Record Documents • Tables • Forms • Scalable • Only searches probable pages • Distributed with DOGMA • Robust to changes in pages • Ontology based—easily adapted to other domains

More Related