Automating the Extraction of Domain Specific Information from the Web: A Case Study for the Genealogical Domain • Troy Walker • Thesis Proposal, January 2004 • Research funded by NSF
Genealogical Information on the Web • Hundreds of thousands of sites • Some professional (Ancestry.com, Familysearch.org) • Mostly hobbyist sites (203,200 indexed by Cyndislist.com) • Search engines: “Walker genealogy” on Google returns 199,000 results • At 1 page per minute, that is about 5 months of reading • Why not enlist the help of a computer?
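The 5-month estimate follows directly from the numbers on this slide: 199,000 pages × 1 minute per page = 199,000 minutes ≈ 138 days, roughly 4.5 months of nonstop reading.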
Problems • No standard way of presenting data • Text formatted with HTML tags • Tables • Forms to access information • Sites have differing schemas
Proposed Solution • Based on Ontos and other work done by the BYU Data Extraction Group (DEG) • Able to extract from: • Single-Record or Multiple-Record Documents • Tables • Forms • Scalable and robust to changes in pages • Easily adaptable to other domains
System Overview • (Architecture diagram) Components: Ontology, User Query, URL List, URL Selector, Document Retriever and Structure Recognizer, Single- or Multiple-Record Engine, Table Engine, Form Engine, Data Constrainer, Result Filter, Result Presenter
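A minimal sketch of how these components might be wired together, assuming simple function and object interfaces; every name below (fetch, run_pipeline, engine.extract, and the rest) is an illustrative placeholder, not the actual Ontos/DEG interface.

```python
# Minimal wiring sketch of the system overview; names are hypothetical.
import urllib.request


def fetch(url):
    """Document retriever: download one page (no error handling shown)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def run_pipeline(query, url_list, ontology, url_selector, recognizer,
                 engines, constrainer, result_filter, presenter):
    """Drive one user query through the components in the system overview."""
    records = []
    for url in url_selector(query, url_list):          # URL Selector
        document = fetch(url)                          # Document Retriever
        engine = recognizer(document, engines)         # Structure Recognizer
        if engine is None:
            continue
        raw = engine.extract(document, ontology)       # Extraction Engine
        records.append(constrainer(raw))               # Data Constrainer
    kept = result_filter(query, records)               # Result Filter
    return presenter(kept)                             # Result Presenter
```

The component sketches on the following slides are written so they could plug into these slots.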
User Query • Generated from ontology • Generated once per application domain
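As a sketch of what “generated once per application domain” could look like, the query template below is derived from a toy genealogy ontology; the concept and attribute names are assumptions for illustration only.

```python
# Hypothetical ontology fragment: concept -> attribute names.
GENEALOGY_ONTOLOGY = {
    "Person": ["Name", "BirthDate", "DeathDate", "BirthPlace", "Spouse"],
}


def build_query_template(ontology):
    """Build an empty query (attribute -> constraint) once per domain."""
    return {attr: None for concept in ontology.values() for attr in concept}


# The user then fills in only the constraints they care about:
query = build_query_template(GENEALOGY_ONTOLOGY)
query["Name"] = "Walker"
query["BirthPlace"] = "Utah"
```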
URL List and URL Selector • URL List contains known genealogy URLs • Searching every URL would take too much time • Select only the likely URLs • Distribute document processing using DOGMA
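One plausible way to select likely URLs without downloading everything is to rank candidates by overlap with the query terms. This keyword heuristic is only a sketch, not the selection method proposed here, and handing the surviving URLs to DOGMA workers is omitted.

```python
# Sketch: rank candidate URLs by how many query terms they mention,
# and keep only a bounded number of plausible ones.
def select_urls(query, url_list, top_k=100):
    terms = {value.lower() for value in query.values() if value}

    def score(url):
        return sum(term in url.lower() for term in terms)

    ranked = sorted(url_list, key=score, reverse=True)
    return [url for url in ranked[:top_k] if score(url) > 0]
```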
Document Structure Recognizer • Requests analysis from each Data Extraction Engine • Selects appropriate method
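A sketch of the selection step, assuming each engine exposes a hypothetical analyze(document) method that returns a confidence score; the recognizer simply keeps the most confident engine.

```python
# Sketch: ask every data extraction engine how well it can handle the
# document and pick the most confident one. `analyze` is hypothetical.
def choose_engine(document, engines):
    if not engines:
        return None
    scored = [(engine.analyze(document), engine) for engine in engines]
    confidence, best = max(scored, key=lambda pair: pair[0])
    return best if confidence > 0 else None
```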
Data Extraction Engines • Text • Improved record-separation • Ability to handle single-record pages • Table • Forms
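For the multiple-record case, record separation can be illustrated with a deliberately naive heuristic: split the page on horizontal rules. The improved record separation proposed here relies on richer structural cues; this is only a sketch.

```python
import re


# Naive record separation for illustration only: treat <hr> tags as
# record boundaries and drop empty chunks.
def separate_records(html):
    chunks = re.split(r"<hr\s*/?>", html, flags=re.IGNORECASE)
    return [chunk.strip() for chunk in chunks if chunk.strip()]
```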
Data Constrainer • Selects attribute/value pairs • Fits data to ontology
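A sketch of constraint checking, assuming each ontology attribute carries a recognizer pattern; the two patterns below are illustrative stand-ins, not the ontology's actual recognizers.

```python
import re

# Hypothetical attribute recognizers: ontology attribute -> pattern.
RECOGNIZERS = {
    "Name": re.compile(r"^[A-Z][a-z]+(?: [A-Z][a-z]+)+$"),
    "BirthDate": re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b"),  # year 1500-2099
}


def constrain(raw_pairs):
    """Keep attribute/value pairs whose value satisfies the ontology's pattern."""
    kept = {}
    for attr, value in raw_pairs.items():
        pattern = RECOGNIZERS.get(attr)
        if pattern and pattern.search(value):
            kept[attr] = value
    return kept
```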
Result Filter • Fits data to query • Returns to central Result Presenter
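Fitting data to the query can be sketched as a simple containment check between the user's constraints and each record's extracted values; real matching would need to be more forgiving than this.

```python
# Sketch: keep only records whose values contain every query constraint.
def filter_results(query, records):
    constraints = {attr: value.lower() for attr, value in query.items() if value}
    return [record for record in records
            if all(needle in record.get(attr, "").lower()
                   for attr, needle in constraints.items())]
```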
Result Presenter • Creates XML Schema from Ontology • Presents results to user
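A sketch of the presentation step using Python's standard xml.etree.ElementTree: it serializes filtered records to XML with element names taken from the ontology attributes, while the XML Schema generation itself is omitted. The root and record element names are assumptions.

```python
import xml.etree.ElementTree as ET


# Sketch: wrap filtered records in XML whose element names come from the
# ontology's attribute names (schema generation not shown).
def present(records, root_name="Results", record_name="Person"):
    root = ET.Element(root_name)
    for record in records:
        element = ET.SubElement(root, record_name)
        for attr, value in record.items():
            ET.SubElement(element, attr).text = value
    return ET.tostring(root, encoding="unicode")
```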
Evaluation • Scalability • Query on large URL list • Experiment on number of PCs • Precision and recall • Recall difficult to determine • Query on small URL list • Adaptability • Car ontology • Small URL list
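The evaluation uses the standard precision and recall definitions, written out below for reference; recall is hard to measure because the total number of relevant facts on the Web is unknown, which is why the precision/recall query is restricted to a small, checkable URL list.

```python
# Standard definitions:
#   precision = correct extractions / all extractions made
#   recall    = correct extractions / all facts that should have been extracted
def precision(correct, extracted_total):
    return correct / extracted_total if extracted_total else 0.0


def recall(correct, relevant_total):
    return correct / relevant_total if relevant_total else 0.0
```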
Conclusion • Integrates and builds on previous DEG work • Extracts from: • Single- or Multiple-Record Documents • Tables • Forms • Scalable • Only searches probable pages • Distributed with DOGMA • Robust to changes in pages • Ontology-based, so easily adapted to other domains