This research project, supported by NSF, proposes an abstract framework for extracting information from unstructured documents. The primary tasks are locating data of interest and mapping it to an ontology; heuristics then populate object and relationship sets, infer nonlexical objects, and satisfy ontology constraints. The Ontos algorithm combines this domain knowledge with data recognition rules ("data frames"). The current heuristics process object sets in order of appearance and make accept-or-reject decisions, so an early bad choice can prevent a better choice later.
An Abstract Framework for Extraction Plans and Heuristics in a Data Extraction System
Alan Wessman
Brigham Young University
Based on research supported by NSF
Data Extraction
• Goal: Find useful information in documents without known formal structure
• Primary tasks:
  • Locate data of interest to application
  • Map identified data to an ontology
Ontos
• BYU approach to data extraction
• Domain knowledge encoded as ontology
  • Defines target data structure
  • Contains data recognition rules ("data frames")
• Heuristics map extracted values to ontology
  • Populate sets of objects and relationships
  • Infer nonlexical objects
  • Satisfy ontology constraints
• Ontos algorithm puts it all together
Current Heuristics
• Object sets processed in order of appearance
• Accept-or-reject: Early bad choice prevents later better choices

--- OBITUARIES ONTOLOGY ---
Marriage Date matches [20]
  keyword "\bmarried\b";
end;
Funeral Date matches [20]
  keyword "\bfuneral\b";
end;
-- Deceased Person
Deceased Person [-> object];
Deceased Person [0:1] has Marriage Date [1:*];
Deceased Person [0:1] has Funeral [1];
...
-- Funeral
Funeral [0:1] is on Funeral Date [1:*];
...
-- Generalization/Specializations
Marriage Date, Funeral Date : Date;

Sample obituary:
Lemar K. Adamson, age 84, of Tucson, died September 30, 1998. He was born June 12, 1914 in Salt Lake City, Utah. He is survived by wife, Cindy; daughters, Elvia, Gloria, Irene, Isabel, Jewel, and Jessica; sons, Paul, John, Jeffery, and Louis; brothers, Kirk, Justin, Ivan, Hubert and Grover. Funeral service at 10:00 a.m. Monday, October 5, 1998 at Silverbell Ward, 1540 E. Linden. Burial in City Cemetery. Friends may call from 9:00 a.m. to 10:00 a.m. Monday, at the church. Arrangements by BRING'S MEMORIAL CHAPEL, 236 S. Scott
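The accept-or-reject problem can be illustrated with a small sketch. This is not the Ontos implementation: the scoring table, the threshold of 0.3, and the function names `greedy_accept` and `best_fit` are all hypothetical, chosen only to show how processing object sets in order can hand a value to the wrong set before a better-fitting set is considered.

```python
# Hypothetical scores: how well each object set fits a candidate value.
# These numbers are illustrative assumptions, not Ontos output.
SCORES = {
    ("Marriage Date", "October 5, 1998"): 0.4,
    ("Funeral Date", "October 5, 1998"): 0.9,
}


def score(object_set, value):
    return SCORES.get((object_set, value), 0.0)


def greedy_accept(object_sets, candidates, threshold):
    """Process object sets in order; each takes the first unclaimed
    candidate that meets the threshold (accept-or-reject)."""
    assignment, claimed = {}, set()
    for obj in object_sets:
        for value in candidates:
            if value not in claimed and score(obj, value) >= threshold:
                assignment[obj] = value
                claimed.add(value)
                break
    return assignment


def best_fit(object_sets, candidates, threshold):
    """Consider every (object set, value) pair, best score first."""
    pairs = sorted(
        ((score(o, v), o, v) for o in object_sets for v in candidates),
        key=lambda p: p[0],
        reverse=True,
    )
    assignment, claimed = {}, set()
    for s, obj, value in pairs:
        if s >= threshold and obj not in assignment and value not in claimed:
            assignment[obj] = value
            claimed.add(value)
    return assignment


object_sets = ["Marriage Date", "Funeral Date"]  # order of appearance
candidates = ["October 5, 1998"]

greedy = greedy_accept(object_sets, candidates, threshold=0.3)
ranked = best_fit(object_sets, candidates, threshold=0.3)
print(greedy)  # {'Marriage Date': 'October 5, 1998'}
print(ranked)  # {'Funeral Date': 'October 5, 1998'}
```

Under these assumed scores, the in-order pass gives the funeral date to Marriage Date because that set is processed first, leaving nothing for Funeral Date.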
Additional Problems
• Generalization/specialization
• Previously extracted data
• Complex document structure
• Overlapping value domains
• Tunable parameters and extraction algorithm
Previously Extracted Data

235. Foundations of Computer Science 1. (4:4:1) F, W, Sp, Su
Prerequisite: CS 142.
Iteration, induction, recursion, lists, trees, sets, relations, functions; mathematical analysis of algorithms and data models; object-oriented implementation of abstract data types.

236. Foundations of Computer Science 2. (4:4:1) F, W, Sp, Su
Prerequisite: CS 235.
Continuation of CS 235; relations, graphs, automata, grammars, propositional and predicate logic. Implementation of object-oriented algorithms.
Complex Document Structure
• Major sections with varying internal structures
• Nested lists with unstructured text
• Headings interspersed among records
• Icons, hyperlinks, etc.
Overlapping Value Domains

student at Lincoln High School, won the state
thought Lincoln himself was probably rolling over in his grave at the idea
drove all the way to Lincoln, where we ate at

When his history lesson about Abraham Lincoln finally ended, Steve left Lincoln High and drove his Lincoln Continental down to Lincoln, Nebraska.
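One way to see why overlapping value domains are hard is to try separating the "Lincoln" mentions by their immediate context. The sketch below is only an illustration: the rule table `CONTEXT_RULES`, its labels, and the function `classify_mentions` are hypothetical and hand-written for this one sentence; they are not data frames from the Ontos ontology.

```python
import re

# Hypothetical context rules for this example only. Each rule is a
# phrase that contains "Lincoln" and suggests one value domain.
CONTEXT_RULES = [
    ("School", re.compile(r"\bLincoln High\b")),
    ("Person", re.compile(r"\bAbraham Lincoln\b")),
    ("Car", re.compile(r"\bLincoln Continental\b")),
    ("City", re.compile(r"\bLincoln, Nebraska\b")),
]


def label_for(text, start, end):
    """Return the label of the first rule whose match covers text[start:end]."""
    for label, pattern in CONTEXT_RULES:
        for m in pattern.finditer(text):
            if m.start() <= start and end <= m.end():
                return label
    return "Unknown"


def classify_mentions(text):
    """Label every standalone occurrence of 'Lincoln' in reading order."""
    return [
        label_for(text, m.start(), m.end())
        for m in re.finditer(r"\bLincoln\b", text)
    ]


sentence = (
    "When his history lesson about Abraham Lincoln finally ended, "
    "Steve left Lincoln High and drove his Lincoln Continental "
    "down to Lincoln, Nebraska."
)
labels = classify_mentions(sentence)
print(labels)  # ['Person', 'School', 'Car', 'City']
```

The same string falls into four different domains, and fragments with less context (such as "drove all the way to Lincoln") would come back as "Unknown" under rules this narrow.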
Tunable Parameters & Algorithm
• Confidence values
  • Names: William = 0.9; Rose = 0.6; Spatula = 0.03
• Weighted heuristics
  • Empirically, heuristic A is 2.3 times better than heuristic B
• Acceptance thresholds
  • "If ConfidenceValue(Name) > 0.5, accept"
• Candidate ranking
  • Heuristics vote; combine results; order candidate values and accept top n
• Algorithm
  • When to retrieve, parse, extract, or populate target
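The parameters above can be combined in a short sketch. The names and the 0.5 threshold come from the slide, and heuristic A's votes reuse the slide's confidence values; heuristic B's votes, the weighted-average combination rule, and the function names are assumptions made for illustration.

```python
# Heuristic A is weighted 2.3 times heuristic B, as on the slide.
WEIGHTS = {"A": 2.3, "B": 1.0}

# Confidence that each candidate is a Name, per heuristic.
# Heuristic B's values are hypothetical.
CANDIDATE_VOTES = {
    "William": {"A": 0.9, "B": 0.8},
    "Rose": {"A": 0.6, "B": 0.7},
    "Spatula": {"A": 0.03, "B": 0.2},
}


def combine_votes(votes, weights):
    """Weighted average of the heuristics' confidence votes."""
    total_weight = sum(weights[h] for h in votes)
    return sum(weights[h] * conf for h, conf in votes.items()) / total_weight


def rank_candidates(candidate_votes, weights, threshold, top_n):
    """Drop candidates at or below the threshold, then keep the top n."""
    scored = [
        (combine_votes(votes, weights), cand)
        for cand, votes in candidate_votes.items()
    ]
    accepted = [pair for pair in scored if pair[0] > threshold]
    accepted.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in accepted[:top_n]]


names = rank_candidates(CANDIDATE_VOTES, WEIGHTS, threshold=0.5, top_n=2)
print(names)  # ['William', 'Rose']
```

Each of the numbers here (weights, threshold, n) is a knob the ontology designer might want to tune, which is the motivation for exposing them rather than hard-coding them.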
Our Approach
We can remedy deficiencies in the Ontos heuristics by defining an abstract framework that allows the ontology designer to:
• Implement more accurate and powerful heuristics (specific to the ontology's needs), and
• Control elements of the extraction plan (order in which documents are retrieved and parsed, heuristics are applied, etc.)
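As a rough sketch of what such a framework might look like, the code below separates pluggable heuristics from an ordered extraction plan. The slides do not specify the framework's interfaces, so every class, method, and step name here (`Heuristic`, `PrecedingKeyword`, `ExtractionPlan`, `find_dates`, `score_dates`) and the 40-character window are hypothetical.

```python
import re
from abc import ABC, abstractmethod


class Heuristic(ABC):
    """Hypothetical plug-in point: scores a value for an object set."""

    @abstractmethod
    def score(self, object_set, value, context):
        ...


class PrecedingKeyword(Heuristic):
    """Scores 1.0 if the object set's keyword appears shortly before the value."""

    def __init__(self, keywords, window):
        self.keywords = keywords
        self.window = window

    def score(self, object_set, value, context):
        keyword = self.keywords.get(object_set)
        pos = context.find(value)
        if keyword is None or pos < 0:
            return 0.0
        before = context[max(0, pos - self.window):pos].lower()
        return 1.0 if keyword in before else 0.0


class ExtractionPlan:
    """Runs designer-chosen steps in a designer-chosen order."""

    def __init__(self, steps):
        self.steps = list(steps)

    def run(self, state):
        for step in self.steps:
            state = step(state)
        return state


MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE = re.compile(r"(?:%s) \d{1,2}, \d{4}" % MONTHS)
heuristic = PrecedingKeyword({"Funeral Date": "funeral"}, window=40)


def find_dates(state):
    state["dates"] = DATE.findall(state["document"])
    return state


def score_dates(state):
    state["scores"] = {
        d: heuristic.score("Funeral Date", d, state["document"])
        for d in state["dates"]
    }
    return state


plan = ExtractionPlan([find_dates, score_dates])
result = plan.run({
    "document": "He was born June 12, 1914 in Salt Lake City, Utah. "
                "Funeral service at 10:00 a.m. Monday, October 5, 1998 "
                "at Silverbell Ward."
})
print(result["scores"])  # {'June 12, 1914': 0.0, 'October 5, 1998': 1.0}
```

Swapping in a different `Heuristic` subclass or reordering the steps changes the behavior without touching the plan runner, which is the kind of control the two bullets above describe.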
Progress
• Researched HMM-based heuristics
• Constructed XML Schema for ontologies
• Solidified specialization semantics
• Provided for directly populating ontology with extracted values
• Implementation is proceeding…