Automating the Extraction of Domain Specific Information from the Web

Automating the Extraction of Domain Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Proposal January 2004 Research funded by NSF

Genealogical Information on the Web • Hundreds of thousands of sites • Some professional (Ancestry.com, Familysearch.org) • Mostly hobbyist (203,200 indexed by Cyndislist.com) • Search engines • “Walker genealogy” on Google: 199,000 results • 1 page/minute = 5 months to go through • Why not enlist the help of a computer?

Problems • No standard way of presenting data • Text formatted with HTML tags • Tables • Forms to access information • Sites have differing schemas

Proposed Solution • Based on Ontos and other work done by the BYU Data Extraction Group (DEG) • Able to extract from: • Single-Record or Multiple Record Documents • Tables • Forms • Scalable and robust to changes in pages • Easily adaptable to other domains

Text

Tables

Forms

URL List Single- or Multiple-Record Engine Document Retriever and Structure Recognizer User Query URL Selector Table Engine Data Constrainer Result Filter Result Presenter Form Engine Ontology System Overview

User Query • Generated from ontology • Generated once per application domain

User Query

URL Listand URL Selector • Contains Genealogy URLs • Search each URL—too much time • Select likely URLs • Distribute document processing using DOGMA

URL Listand Document Retriever

Document Structure Recognizer • Requests analysis from each Data Extraction Engine • Selects appropriate method

Data Extraction Engines • Text • Improved record-separation • Ability to handle single-record pages • Table • Forms

Data Constrainer • Selects attribute/value pairs • Fits data to ontology

Result Filter • Fits data to query • Returns to central Result Presenter

Result Presenter • Creates XML Schema from Ontology • Presents results to user

Result Presenter

Evaluation • Scalability • Query on large URL list • Experiment on number of PCs • Precision and recall • Recall difficult to determine • Query on small URL list • Adaptability • Car ontology • Small URL list

Conclusion • Integrates, builds on previous DEG work • Extracts from: • Single- or Multiple-Record Documents • Tables • Forms • Scalable • Only searches probable pages • Distributed with DOGMA • Robust to changes in pages • Ontology based—easily adapted to other domains

Automating the Extraction of Domain Specific Information from the Web

Automating the Extraction of Domain Specific Information from the Web

Presentation Transcript

Methods for Domain-Independent Information Extraction from the Web An Experimental Comparison

Information Extraction from Web Documents

Information Extraction from the World Wide Web

Open Information Extraction from the Web Oren Etzioni

Open Information Extraction from the Web

Towards Domain-Independent Information Extraction from Web Tables

Adapting Open Information Extraction to Domain-Specific Relations

Automating the Extraction of Genealogical Information from Historical Documents

Information Extraction from the World Wide Web

Information Extraction from the World Wide Web

Information Extraction on the Web

Information extraction from web pages using extraction ontologies

Information Extraction from Multimedia Content on the Social Web

Information Extraction from the World Wide Web

L-ISA Learning Domain Specific ISA relations from the WEB

Methods for Domain-Independent Information Extraction from the Web An Experimental Comparison

Information extraction from web pages using extraction ontologies