A Framework for Extraction Plans and Heuristics in an Ontology-Based Data-Extraction System Alan Wessman Brigham Young University MS Thesis Defense Based in part on research funded by the National Science Foundation.
Presentation Overview • Background of legacy Ontos • Assumptions, challenges, concerns • Framework as solution • Explain framework • Explain reference implementation • Evaluation of system • Future work and conclusion
Data Extraction • Goals of data extraction • Find relevant data in unstructured or semi-structured documents • Map extracted data to a formal structure • Approaches • Wrappers (ROADRUNNER, TSIMMIS) • NLP and machine learning (RAPIER, WHISK) • Ontologies (Ontos)
Ontos • Developed by Data Extraction Group (DEG) at BYU • Based on OSM ontologies and data frames • Focuses on multiple-record extraction • Good precision/recall • Resilient to document changes
Ontos Assumptions • OSML ontologies • Single- or multiple-record text documents • Each document/record relevant to domain • Heuristics produce accurate mappings • Output to relational database
Architectural Concerns • Variety of technologies • Different OSM representations • Highly coupled code • Difficult to install elsewhere • Difficult to upgrade or extend
Thesis Statement A framework for data extraction can give us a flexible and configurable platform for conducting data-extraction research. We can re-implement Ontos under the framework, which will let us adapt the system to particular research needs without ongoing massive rewrites.
Frameworks • Abstract architecture • Decouple independent functions • Define interfaces • Use abstract classes, interfaces, declarative configuration files • Allow quick adjustment of system settings without re-coding • Make a system customizable
Creating an Extraction Framework • Analyze systems • Generalize functionality • Define interfaces • Create supporting code • Document framework
Managing the Process
• DataExtractionEngine
  • Main class
  • Initialize, perform extraction, finalize
• ExtractionPlan
  • Defines order of steps in the extraction process
  • Can be imperative, declarative, or dynamic (like SQL execution plan)
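The split between engine and plan can be sketched in Java as follows. This is a minimal illustration, not the thesis code: the method names `execute` and `run`, the `FixedOrderPlan` class, and the string log are assumptions made for the example. Only the class names `DataExtractionEngine` and `ExtractionPlan` and the initialize/extract/finalize lifecycle come from the slides.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical signature: a plan decides the order of extraction steps. */
interface ExtractionPlan {
    void execute(List<String> log);
}

/** An imperative plan: the step order is fixed in code (illustrative only). */
class FixedOrderPlan implements ExtractionPlan {
    @Override
    public void execute(List<String> log) {
        log.add("retrieve");
        log.add("parse");
        log.add("filter");
        log.add("extract");
    }
}

/** Main class: initialize, perform extraction, finalize. */
class DataExtractionEngine {
    private final ExtractionPlan plan;
    private final List<String> log = new ArrayList<>();

    DataExtractionEngine(ExtractionPlan plan) {
        this.plan = plan;
    }

    /** Runs the lifecycle and returns the recorded step names in order. */
    List<String> run() {
        log.add("initialize");
        try {
            plan.execute(log);      // the plan, not the engine, owns step ordering
        } finally {
            log.add("finalize");    // finalization happens even if a step fails
        }
        return log;
    }
}
```

Because the engine depends only on the `ExtractionPlan` interface, a declarative or dynamic plan could be substituted without touching the engine.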
Handling Documents
• DocumentRetriever
  • Responsible for locating relevant documents
  • Search engine, local filesystem, CMS
• DocumentStructureRecognizer
  • Decides which DocumentStructureParser to use
• DocumentStructureParser
  • Breaks document into individual records or sub-documents
  • Record separator, table analyzer
• ContentFilter
  • Normalizes document text
  • Strips out unwanted markup, stopwords, etc.
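One way these four roles might fit together is sketched below. The interface names are taken from the slide, but every method signature, the `SeparatorParser` class, and the `DocumentPipeline` helper are assumptions for illustration. Documents are modeled as plain strings to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical signatures; the slides name the roles but not their methods.
interface DocumentRetriever { List<String> retrieve(); }
interface DocumentStructureParser { List<String> split(String document); }
interface DocumentStructureRecognizer { DocumentStructureParser chooseParser(String document); }
interface ContentFilter { String filter(String text); }

/** Illustrative parser: splits a document into records on a literal separator. */
class SeparatorParser implements DocumentStructureParser {
    private final String separator;

    SeparatorParser(String separator) {
        this.separator = separator;
    }

    @Override
    public List<String> split(String document) {
        List<String> records = new ArrayList<>();
        for (String part : document.split(Pattern.quote(separator))) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                records.add(trimmed);   // skip empty fragments between separators
            }
        }
        return records;
    }
}

/** Wires the roles together: retrieve, choose a parser, split, then filter. */
class DocumentPipeline {
    static List<String> process(DocumentRetriever retriever,
                                DocumentStructureRecognizer recognizer,
                                ContentFilter filter) {
        List<String> cleaned = new ArrayList<>();
        for (String document : retriever.retrieve()) {
            DocumentStructureParser parser = recognizer.chooseParser(document);
            for (String record : parser.split(document)) {
                cleaned.add(filter.filter(record));
            }
        }
        return cleaned;
    }
}
```

Each role has a single method here, so a test or experiment can supply a lambda in place of a full class.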
Extracting Values
• ValueRecognizer
  • Uses matching rules defined in ontology
  • Produces set of candidate matches (like data record table)
• ValueMapper
  • Accepts or rejects candidate matches
  • Assigns accepted matches to elements of the ontology (e.g., object sets)
• OntologyWriter
  • Emits ontology structure and/or extracted data in an output format (e.g., XML, SQL)
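The recognize-then-map division can be illustrated with a small Java sketch. Everything below is an assumption made for the example: the `Candidate` class, the regex-per-object-set rule table, and especially the "keep the earliest match" rule, which is a placeholder and not one of the thesis heuristics. The object set names `Year` and `City` used in the usage note are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One candidate match: which object set it might belong to, and where it was found. */
class Candidate {
    final String objectSet;
    final String value;
    final int start;

    Candidate(String objectSet, String value, int start) {
        this.objectSet = objectSet;
        this.value = value;
        this.start = start;
    }
}

/** Recognizer role: applies one regex per object set and collects every match. */
class RegexValueRecognizer {
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    void addRule(String objectSet, String regex) {
        rules.put(objectSet, Pattern.compile(regex));
    }

    List<Candidate> recognize(String text) {
        List<Candidate> candidates = new ArrayList<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(text);
            while (m.find()) {
                candidates.add(new Candidate(rule.getKey(), m.group(), m.start()));
            }
        }
        return candidates;
    }
}

/** Mapper role with a placeholder rule: keep the earliest candidate per object set. */
class EarliestMatchMapper {
    Map<String, String> map(List<Candidate> candidates) {
        Map<String, Candidate> accepted = new LinkedHashMap<>();
        for (Candidate c : candidates) {
            Candidate current = accepted.get(c.objectSet);
            if (current == null || c.start < current.start) {
                accepted.put(c.objectSet, c);   // later candidates are rejected
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Candidate c : accepted.values()) {
            result.put(c.objectSet, c.value);
        }
        return result;
    }
}
```

For example, with a `Year` rule and a `City` rule, the recognizer may return two year candidates from one obituary sentence, and the mapper then decides which one is assigned. Keeping the two stages separate is what lets a different mapping heuristic be swapped in without changing recognition.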
OSMX • Legacy Ontos: OSML • OntologyEditor: OSM.dtd • New standard is OSMX • XML Schema (better constraints; validation) • JAXB generates corresponding Java classes • Common language for DEG tools • Allows data to be stored inline with model
Managing the Process
• OntosEngine
  • Main class for Ontos system
  • Takes parameters from command line or configuration file
• OntosExtractionPlan
  • Sequentially retrieves, parses, filters, and extracts from individual documents
  • Imperative (hard-coded) algorithm
Handling Documents
• LocalDocumentRetriever
  • Retrieves documents from local filesystem
  • Filename filter excludes irrelevant files
• FanoutRecordSeparator
  • Implements DocumentStructureParser
  • Locates record boundaries and creates sub-documents
• HTMLFilter
  • Removes all HTML markup from documents
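Two of these pieces are simple enough to sketch in Java. The code below is an illustration under stated assumptions, not the Ontos source: the class names `HtmlMarkupFilter` and `LocalFiles` are hypothetical, the tag removal is a naive regex that ignores comments, scripts, and entities, and filtering by file extension is only one possible reading of "filename filter".

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Locale;

/** Naive markup stripper: replaces anything that looks like a tag with a space. */
class HtmlMarkupFilter {
    String filter(String html) {
        String withoutTags = html.replaceAll("<[^>]*>", " ");
        // Collapse the whitespace left behind by removed tags.
        return withoutTags.replaceAll("\\s+", " ").trim();
    }
}

/** Local retrieval helper: lists only files whose names pass the filter. */
class LocalFiles {
    /** Assumed relevance test: keep files ending in .html or .htm. */
    static FilenameFilter htmlOnly() {
        return (dir, name) -> {
            String lower = name.toLowerCase(Locale.ROOT);
            return lower.endsWith(".html") || lower.endsWith(".htm");
        };
    }

    /** Returns matching files, or an empty array if the directory is unreadable. */
    static File[] list(File directory) {
        File[] files = directory.listFiles(htmlOnly());
        return files != null ? files : new File[0];
    }
}
```

A production filter would more likely use an HTML parser than a regular expression; the regex is used here only to keep the sketch dependency-free.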
Recognizing Values: DataFrameMatcher
• Uses data frame enhancements:
  • Keyword affinity (left and right)
  • Require context for left, right, or both
  • Value phrase-specific keywords
  • Link matches back to specific patterns
• Other improvements:
  • Consistent regular expression handling
  • Unlimited recursive macro definition
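The idea of requiring context on one side of a value can be sketched as follows. This is one plausible reading, not the DataFrameMatcher algorithm: the class name `LeftContextMatcher`, the fixed character window, and the rule "accept a value only if a keyword appears within the window to its left" are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Accepts a value match only when a keyword occurs shortly before it. */
class LeftContextMatcher {
    private final Pattern value;
    private final Pattern keyword;
    private final int window;   // how many characters to the left are inspected

    LeftContextMatcher(String valueRegex, String keywordRegex, int window) {
        this.value = Pattern.compile(valueRegex);
        this.keyword = Pattern.compile(keywordRegex);
        this.window = window;
    }

    List<String> match(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = value.matcher(text);
        while (m.find()) {
            int from = Math.max(0, m.start() - window);
            String leftContext = text.substring(from, m.start());
            if (keyword.matcher(leftContext).find()) {
                hits.add(m.group());   // keyword found in the left window: accept
            }
        }
        return hits;
    }
}
```

With a date pattern as the value and `died` as the keyword, a sentence containing both a birth date and a death date yields only the date that follows "died". A right-hand or two-sided requirement would inspect the text after `m.end()` in the same way.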
Mapping Values: HeuristicBasedMapper • New algorithm • Fully recursive with respect to ontology structure • ContextualHeuristic generates objects • Connection-based heuristics (singleton, nested-group, etc.) generate relationships • See paper for additional details
Output • Human-readable HTML format • Easier to count correct, partial, incorrect mappings
Using the Framework and Reference Implementation
• Adding new features
  • Create new implementation classes
  • Extend (subclass) existing implementations
• Switching feature set
  • Change class name in config file
  • Override class on command line
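Switching implementations by class name is commonly done with reflection, and a minimal Java sketch of that approach follows. The property key `contentFilter.class`, the `PluginLoader` helper, and the two sample filters are hypothetical; the slides do not show the actual configuration format or loader.

```java
import java.util.Locale;
import java.util.Properties;

/** Declared here so the example stands alone; the method signature is assumed. */
interface ContentFilter {
    String filter(String text);
}

/** Sample implementation: trims surrounding whitespace. */
class TrimFilter implements ContentFilter {
    @Override
    public String filter(String text) {
        return text.trim();
    }
}

/** Sample implementation: upper-cases the text. */
class UpperCaseFilter implements ContentFilter {
    @Override
    public String filter(String text) {
        return text.toUpperCase(Locale.ROOT);
    }
}

/** Picks the implementation class from config, unless an override is supplied. */
class PluginLoader {
    static ContentFilter loadFilter(Properties config, String overrideClassName) {
        // An override (for example, from the command line) wins over the config file.
        String className = overrideClassName != null
                ? overrideClassName
                : config.getProperty("contentFilter.class");
        try {
            Class<?> type = Class.forName(className);
            // Requires a no-argument constructor on the implementation class.
            return (ContentFilter) type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load filter: " + className, e);
        }
    }
}
```

With this arrangement, adding a feature means writing a new class that implements the interface, and switching to it means changing one string rather than recompiling the engine.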
Evaluating the Framework • Input: • Obituaries ontology • 25 obituaries from two newspapers (Salt Lake Tribune and Arizona Daily Star) [Results table not preserved; it showed four of the eighteen object sets.]
Statistics about the System [Statistics table not preserved.] Table notes: * Includes comments and whitespace. ** JAXB-generated classes add 197 files and 62,888 lines of code.
Future Work
• Algorithm improvements
  • On-the-fly lexicons
  • Machine learning techniques
  • Confidence values
  • Canonicalization
  • Expected participation cardinality
  • Negative-indicator keywords
• Integration
  • Online search engines
  • Semantic Web annotator and query engine
  • Web interface to extraction engine
Contributions • Design and construction of a data-extraction framework • Reference implementation • Ontos upgrade • Pattern for future use of framework • OSMX • Standardized storage format • http://www.deg.byu.edu/xml/osmx.xsd
Contributions • Uniform codebase and language • OntologyEditor migration • New graphics classes • Extended data frame support • Modular heuristic-based mapper • Concept of extraction plans • Flexible research platform
Conclusion • Framework gives us the flexibility we need for further data-extraction research • Framework is capable of supporting Ontos functionality • OSMX and reference implementation provide solid base for future research applications