LREC 2010: Semi-Automatic Domain Ontology Creation from Text Resources
Authors: Mithun Balakrishna, Dan Moldovan, Marta Tatu, Marian Olteanu
Presented by Chris Irwin Davis
Jaguar Overview
• Jaguar builds ontologies and knowledge bases from the concepts found in text and the relationships between those concepts.
• Constituents of a knowledge base
  • Concepts/Vocabulary (“weapon”, “WMD”, “launcher”)
  • Relations (“anthrax” ISA “biological weapon”, “anthrax” CAU “death”)
  • 26 different semantic relation types extracted
• Organization of Relations
  • Hierarchical
  • Contextual
LREC 2010 - Semi-Automatic Domain Ontology Creation from Text Resources
Types of Knowledge
• Universal (or ontological)
  • Represented in hierarchies
  • Simple binary relations between concepts
  • “Chemical weapons such as nerve gas, …”
• Contextual
  • Represented in individual (semantic) contexts
  • Groups of relations centered on a common concept
  • “The forces launched a full-scale attack on Monday”
[Figure: a hierarchy link between “chemical weapon” and “nerve gas”; a context centered on “launch” with AGT “forces”, THM “full-scale attack”, TMP “monday”]
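The two kinds of knowledge can be pictured as two small data structures. The following is only a minimal sketch: the class names `BinaryRelation` and `Context`, and the choice to anchor each role on the central concept, are assumptions made for illustration and are not taken from Jaguar.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class BinaryRelation:
    """Universal knowledge: one typed link between two concepts."""
    source: str
    relation: str  # relation label, e.g. "ISA"
    target: str


@dataclass
class Context:
    """Contextual knowledge: relations grouped around one central concept."""
    center: str
    roles: Dict[str, str] = field(default_factory=dict)  # role label -> concept

    def as_relations(self) -> List[BinaryRelation]:
        # Flatten the context into binary relations anchored on the center
        # (the direction of each link is an assumption of this sketch).
        return [BinaryRelation(self.center, role, arg)
                for role, arg in self.roles.items()]


# "Chemical weapons such as nerve gas, ..."
universal = BinaryRelation("nerve gas", "ISA", "chemical weapon")

# "The forces launched a full-scale attack on Monday"
context = Context("launch", {"AGT": "forces",
                             "THM": "full-scale attack",
                             "TMP": "monday"})

for rel in context.as_relations():
    print(rel.source, rel.relation, rel.target)
```

A universal relation stands alone, whereas the three role links only make sense together, which is why the sketch keeps them inside one `Context` object.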
KB Constituents
[Figure: a Knowledge Base made up of an Ontology and Contextual Knowledge. The Ontology holds a Concept Set (e.g. “anthrax”, “biological weapon”, “political leader”) and a Hierarchy of isa and pw links between concepts. Contextual Knowledge holds contexts such as “assassinate” with AGT “rebel”, THM “political leader”, TMP “may 21”.]
Jaguar Overview
[Figure: Documents and Seeds are fed to Jaguar; the labeled products are Ontology (structured knowledge), Ontology + pointers to text, and Knowledge Base (ontology + contextual knowledge + pointers to text)]
• Functionality
  • Produce ontologies
  • Link concepts & relations to text
  • Visualize ontology
  • Edit ontology
  • Enhance an existing ontology
  • Merge two ontologies into a consistent ontology
  • Ontological search of documents (search documents using ontology)
Knowledge Bases
• Ontology/KB creation overview
  • Knowledge Extraction from Text
    • Pattern recognition; Semantic Parsing
  • Knowledge Representation and Storage
    • Contextual vs. Universal
    • XML; Relational Database
  • Knowledge Base Maintenance
    • Conflict Resolution; Ontology Merging
    • User Interaction; Ontology Modification
Jaguar – Process & Modules
Stages: Text Processing; Classification/Hierarchy Creation; Knowledge Base Maintenance
Text Processing
Input: Documents, Seeds (keyword list or ontology)
• Extract “concepts” of interest
• Extract binary relations (universal)
• Use Semantic Parser to obtain contextual knowledge
Output: Concepts, Contexts, Binary Relations
Example sentence: “The rebels had access to chemical weapons, such as nerve gas and other poisonous gases.”
Modules:
• PreProcessor: Text extraction from HTML, MS Word & PDF docs
• Chopshop: Tokenization
• Post: Part-of-speech Tagging
• Rose: Named Entity Recognition
• Relu: Syntactic Parsing
• Talbot: Word Sense Disambiguation
• ConceptTagger: Concept/Temporal Tagging
• Polaris: Semantic Parsing
Final products: Ontology + pointers to text; Knowledge Base (ontology + contextual knowledge + pointers to text)
Domain Ontology Creation
• Polaris: Extract semantic relations in text
  • Pattern matching and machine learning
  • Syntactic parse tree broken down into a number of syntactic patterns
  • Syntactic patterns include verbs and their arguments, complex nominals, adjective phrases, adjective clauses, and others.
• There are six primary pattern types discovered within noun phrases:
  • N-N and Adj-N (which comprise compound nominals)
  • ’s and of (genitive patterns)
  • Adjective Phrases
  • Adjective Clauses
  • The first five are further subdivided into nominalized and non-nominalized (giving a total of 11 patterns discovered within compound nominals)
• There are also five verb-argument-level patterns being discovered:
  • NP verb
  • verb NP
  • verb PP
  • verb ADVP
  • verb S
Domain Ontology Creation
Jaguar/KAT: Classification/Hierarchy Creation
Input: Concepts, Binary Relations
• Classify each concept against every other using defined procedures, obtaining a set of ISA relations
• Add all ISA and other binary relations to the hierarchy using conflict resolution
Output: Hierarchy of relations
Examples:
• “Scud missile” ISA “missile”
• “Squadron” PW “Platoon”
• “weapons inspection team” ISA “inspection team”
Domain Ontology Creation
• Classification Procedures:
  • Procedure 1: Classify a concept of the form [word, head] with respect to concept [head]
  • Procedure 2: Classify a concept [word1, head1] with respect to another concept [word2, head2]
  • Procedure 3: Classify a concept [word1, word2, head]
  • Procedure 4: Classify a concept [word1, head] with respect to a concept hierarchy under [head]
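The slides name the procedures without spelling out their rules. As a rough illustration of the simplest case only, the sketch below guesses that a multi-word concept is classified under the phrase left after dropping its leading modifier. The function name `classify_by_dropping_modifier` and the rule itself are assumptions, chosen because they reproduce the ISA examples “Scud missile” ISA “missile” and “weapons inspection team” ISA “inspection team”; they are not Jaguar's actual procedures.

```python
from typing import Optional, Tuple


def classify_by_dropping_modifier(concept: str) -> Optional[Tuple[str, str, str]]:
    """Guess an ISA relation for a multi-word concept.

    Assumption of this sketch: the first word is a modifier and the rest
    of the phrase names the more general concept.
    """
    words = concept.split()
    if len(words) < 2:
        return None  # single-word concept: nothing to strip
    parent = " ".join(words[1:])
    return (concept, "ISA", parent)


for phrase in ["Scud missile", "weapons inspection team", "missile"]:
    print(phrase, "->", classify_by_dropping_modifier(phrase))
```

A purely string-based rule like this would misfire on non-compositional phrases, so a real classifier presumably also consults word senses and the existing hierarchy.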
Domain Ontology Creation
Jaguar/KAT: Knowledge Base Maintenance
• Knowledge Base Merging
• Visualization
• Knowledge Base Editing
• User Interaction
• Modifications
Domain Ontology/KB Creation - Example
Conflict Resolution Algorithm
• Approach used: prevention
• Start from an empty hierarchy and an input relation set
• Add a relation from the input set to the hierarchy if:
  • It does not form a cycle
  • It is not redundant (does not duplicate a path)
• After the addition of any relation, jump-link removal is run to ensure that all jump links are removed
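The prevention approach can be sketched as a small graph routine. This is a minimal sketch under stated assumptions, not Jaguar's implementation: the `Hierarchy` class and its method names are hypothetical, only one relation type (child to parent) is modeled, and a "jump link" is taken to mean a direct link that is also implied by a longer path.

```python
from collections import defaultdict


class Hierarchy:
    """Child -> parent links with cycle and redundancy prevention."""

    def __init__(self):
        self.parents = defaultdict(set)  # child -> set of direct parents

    def reaches(self, start, goal, skip_edge=None):
        """True if goal is reachable from start via parent links."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            for parent in self.parents.get(node, ()):
                if (node, parent) != skip_edge:
                    stack.append(parent)
        return False

    def add(self, child, parent):
        """Add child -> parent unless it creates a cycle or duplicates a path."""
        if self.reaches(parent, child):   # would close a cycle (or child == parent)
            return False
        if self.reaches(child, parent):   # already implied by an existing path
            return False
        self.parents[child].add(parent)
        self._remove_jump_links()
        return True

    def _remove_jump_links(self):
        # Drop every direct link that a longer path already implies.
        for child in list(self.parents):
            for parent in list(self.parents[child]):
                if self.reaches(child, parent, skip_edge=(child, parent)):
                    self.parents[child].discard(parent)


h = Hierarchy()
h.add("capital market", "market")
h.add("financial market", "market")
h.add("capital market", "financial market")  # makes the first link a jump link
print(dict(h.parents))
```

In this run the direct link from "capital market" to "market" is removed once the path through "financial market" exists, and a later attempt to add "market" under "capital market" would be rejected as a cycle.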
Knowledge Base Merging
• Current Approach
  • Label the bigger ontology L1, and the other L2
  • Merge concepts (from those in L2 into those of L1)
  • Copy all contexts (from L2 to L1)
  • Add all relations (from the hierarchy of L2 to the hierarchy of L1) using the conflict resolution algorithm
  • Additionally, classify all concepts in L1’s hierarchy against concepts in L2’s hierarchy (form relation set R)
  • Add relations from R into L1’s hierarchy (conflict resolution)
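The merging steps can be outlined in code as follows. Everything here is illustrative: the `Ontology` container, measuring "bigger" by concept count, and the `accept` and `classify` callbacks are assumptions of the sketch. In particular, `accept_if_new` is a stand-in that only rejects duplicates, where the real system would apply its conflict resolution algorithm, and `toy_classify` is a toy suffix rule.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set, Tuple

Relation = Tuple[str, str, str]  # (source, relation type, target)


@dataclass
class Ontology:
    concepts: Set[str] = field(default_factory=set)
    contexts: List[dict] = field(default_factory=list)
    hierarchy: List[Relation] = field(default_factory=list)


def merge(a: Ontology, b: Ontology,
          accept: Callable[[List[Relation], Relation], bool],
          classify: Callable[[str, str], Optional[Relation]]) -> Ontology:
    """Merge the smaller ontology into the bigger one (modified in place)."""
    # Assumption: "bigger" means more concepts.
    l1, l2 = (a, b) if len(a.concepts) >= len(b.concepts) else (b, a)
    l1_original = set(l1.concepts)
    l1.concepts |= l2.concepts                 # merge concepts
    l1.contexts.extend(l2.contexts)            # copy all contexts
    for rel in l2.hierarchy:                   # add L2's relations
        if accept(l1.hierarchy, rel):
            l1.hierarchy.append(rel)
    for c1 in sorted(l1_original):             # classify L1 against L2
        for c2 in sorted(l2.concepts):
            rel = classify(c1, c2)
            if rel is not None and accept(l1.hierarchy, rel):
                l1.hierarchy.append(rel)
    return l1


def accept_if_new(hierarchy: List[Relation], rel: Relation) -> bool:
    # Stand-in for conflict resolution: only rejects exact duplicates.
    return rel not in hierarchy


def toy_classify(c1: str, c2: str) -> Optional[Relation]:
    # Toy rule: "x market" ISA "market".
    if c1 != c2 and c2.endswith(" " + c1):
        return (c2, "ISA", c1)
    if c1 != c2 and c1.endswith(" " + c2):
        return (c1, "ISA", c2)
    return None


big = Ontology(concepts={"market", "industry", "exchange"})
small = Ontology(concepts={"financial market", "capital market"},
                 hierarchy=[("capital market", "ISA", "financial market")])
merged = merge(big, small, accept_if_new, toy_classify)
print(merged.hierarchy)
```

Because the stand-in acceptance check knows nothing about paths, the result keeps “capital market” ISA “market” alongside the longer path through “financial market”; a proper conflict resolution step would treat that as a jump link.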
Merging Hierarchies
[Figure: two hierarchies to be merged, L1 and L2, over the concepts financial market, work_place, industry, money_market, exchange, market, capital market, stock_market, stock_exchange]
Merging Hierarchies
Simulating Classification
• “capital market” ISA “market”
• “financial market” ISA “market”
• “money_market” ISA “financial market”
• “stock_market” ISA “capital market”
• “capital market” ISA “financial market”
• “stock_market” SYN “stock_exchange”
[Figure: the merged hierarchy L1 over industry, work_place, market, exchange, financial market, money_market, capital market, with stock_market and stock_exchange combined into one node]
Semantic Relation Evaluation
• Training corpora:
  • Noun phrase patterns: Wall Street Journal (TreeBank 2), L.A. Times (TREC 9), and XWN 2.0
  • Verb argument patterns: FrameNet
• Three evaluation corpora to benchmark the Polaris semantic relations:
  • TreeBank: we manually annotated 500 random sentences from the Penn Treebank 3 corpus with 5879 semantic relations.
  • GlassBox Human: 51 random sentences from the NIMD corpus were manually POS-tagged, syntactically parsed, and semantically annotated with 706 semantic relations.
  • GlassBox Machine: the same 51 sentences used in the GlassBox Human evaluation corpus were POS-tagged and syntactically parsed by our NLP tools, and then manually annotated with 741 semantic relations.
Semantic Relation Evaluation
• For the TreeBank evaluation corpus:
  • Polaris discovered 5245 relations
  • 2212 were exact matches to the human annotations
  • 630 were partial matches
    • A partial match means that the relation type was correct and the argument bracketing at least overlapped, but there were some extra or missing tokens in the generated arguments
    • Partial matches are scored using precision, recall, and F-measure on the overlapping tokens
• For the GlassBox Human evaluation corpus:
  • Polaris discovered 449 relations
  • 311 were perfect matches to the human annotations
  • 56 were partial matches
• For the GlassBox Machine evaluation corpus:
  • Polaris discovered 464 relations
  • 249 were perfect matches to the human annotations
  • 71 were partial matches
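For orientation, the counts above can be turned into exact-match-only precision and recall. This is a back-of-the-envelope sketch, not the scoring used in the paper: it ignores partial matches entirely (those are scored on overlapping tokens, which cannot be recomputed from the counts alone), so the figures it prints should be read as a floor rather than as reported results. The gold totals are the annotation counts given for each evaluation corpus; the function name `exact_match_scores` is hypothetical.

```python
def exact_match_scores(discovered: int, exact: int, gold: int):
    """Precision, recall and F1 counting only exact matches as correct."""
    precision = exact / discovered
    recall = exact / gold
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# corpus -> (relations discovered, exact matches, gold annotations)
corpora = {
    "TreeBank": (5245, 2212, 5879),
    "GlassBox Human": (449, 311, 706),
    "GlassBox Machine": (464, 249, 741),
}

for name, counts in corpora.items():
    p, r, f = exact_match_scores(*counts)
    print(f"{name}: P={p:.3f} R={r:.3f} F1={f:.3f}")
```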
Domain Ontology Library Creation
• We use Jaguar to create an ontology library for the 33 topics defined in the National Intelligence Priorities Framework (NIPF) and 10 topics from the financial domain
• NIPF is the Director of National Intelligence’s (DNI’s) guidance to the Intelligence Community on the national intelligence priorities approved by the President of the United States of America
• For each topic, we collected 500 documents from the web and manually verified their relevance to the corresponding topic.
• For each topic, Jaguar is provided with an initial seed set containing on average 47 concepts of interest
Domain Ontology/KB Evaluation
• We evaluated the quality of 8 Jaguar ontologies by comparing them against manual gold annotations
• Our evaluations are focused on the
  • Lexical Level
  • Vocabulary, or Data Layer Level
  • Other Semantic Relations Level
• Viewing an ontology as a set of semantic relations between two concepts, the human annotators:
  • Labeled an entry Correct if the concepts and the semantic relation were correctly detected by the system, and Incorrect otherwise
  • Labeled a correct entry Irrelevant if any of the concepts or the semantic relation is irrelevant to the domain
  • Added new entries from the sentences where Jaguar had omitted the concepts and the semantic relation
NIPF Ontology/KB Evaluation - Metrics
• Nj(.) gives the counts from Jaguar’s output
• Ng(.) gives the counts from the user annotations
Domain Ontology/KB Evaluation - Results
Conclusions
• We presented a generalized and improved procedure to automatically extract deep semantic information from text resources
• We presented a methodology to rapidly create semantically rich domain ontologies while keeping manual intervention to a minimum
• We defined evaluation metrics to assess the quality of the ontologies and presented evaluation results for a subset of the intelligence and financial ontology libraries, semi-automatically created using freely available textual resources from the Web
• The results show that a substantial amount of knowledge can be extracted accurately while keeping manual intervention in the process to a minimum.