Unsupervised Ontology Acquisition from plain texts: The OntoGain method
Efthymios Drymonas, Kalliopi Zervanou, Euripides G.M. Petrakis
Intelligent Systems Laboratory, http://www.intelligence.tuc.gr
Technical University of Crete (TUC), Chania, Greece
OntoGain
• A platform for unsupervised ontology acquisition from text
• Application independent
• Builds an ontology of multi-word term concepts
• Adapts existing methods for taxonomy and relation acquisition to handle multi-word concepts
• Outputs the ontology in OWL
• Good results on medical and computer science corpora
Why multi-word term concepts?
• They make up the majority of terminological expressions
• They convey classificatory information, expressed as modifiers
  • e.g. “carotid artery disease” denotes a type of “artery disease”, which is a type of “disease”
• They lead to a more expressive and compact ontology lexicon
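The modifier example above can be sketched in a few lines of Python. This is only an illustration, assuming head-final English noun phrases without prepositions; the function name `broader_terms` is hypothetical and not part of OntoGain.

```python
def broader_terms(term):
    """Return the chain of broader terms obtained by dropping leading modifiers.

    Assumption: the head is the last word and every prefix word is a modifier.
    """
    words = term.split()
    # Dropping the first i words leaves a progressively more general term.
    return [" ".join(words[i:]) for i in range(1, len(words))]


chain = broader_terms("carotid artery disease")
print(chain)
```

Each entry in the returned list is a candidate broader concept of the one before it, which is the classificatory information a single-word lexicon would lose.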
Ontology Learning Steps
• Concept Extraction
  • C/NC-value
• Taxonomy Induction
  • Clustering, Formal Concept Analysis
• Non-taxonomic Relations
  • Association Rules, Probabilistic algorithm
The C/NC-Value method [Frantzi et al., 2000]
• Identifies multi-word term phrases denoting domain concepts
• Noun phrases are extracted first, using a part-of-speech filter
  • ((adj | noun)+ | ((adj | noun)* (noun prep)?) (adj | noun)*) noun
• C-Value: term validity criterion, relying on the hypothesis that multi-word terms tend to consist of other terms
• NC-Value: uses context information (valid terms tend to appear in specific contexts and co-occur with other terms)
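The part-of-speech filter above can be sketched as a regular expression over one-letter tag codes. The tag names (`adj`, `noun`, `prep`), the `TAG_CODES` mapping, and the two-word minimum are assumptions made for this illustration; a real tagger such as OpenNLP emits its own tag set, which would need mapping first.

```python
import re

# Hypothetical mapping from coarse POS tags to one-letter codes.
TAG_CODES = {"adj": "A", "noun": "N", "prep": "P"}

# ((adj | noun)+ | ((adj | noun)* (noun prep)?) (adj | noun)*) noun
NP_FILTER = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")


def is_candidate(tags):
    """Check whether a tag sequence passes the noun-phrase filter.

    Assumption: only multi-word sequences (two or more tokens) are kept.
    """
    # Any tag outside the mapping becomes "x", which the pattern never matches.
    codes = "".join(TAG_CODES.get(tag, "x") for tag in tags)
    return len(tags) >= 2 and NP_FILTER.fullmatch(codes) is not None


print(is_candidate(["adj", "noun", "noun"]))   # e.g. "carotid artery disease"
print(is_candidate(["noun", "prep", "noun"]))  # e.g. "disease of artery"
print(is_candidate(["noun", "verb"]))
```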
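C-Value: Statistical Part
• For candidate term a
  • f(a): total frequency of occurrence of a
  • Ta: the set of longer candidate terms that contain a
  • f(b): frequency of a longer term b in Ta; summed over Ta, this gives the frequency of a as part of longer terms
  • P(Ta): number of these longer terms
  • |a|: the length of the candidate string, in words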
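A minimal sketch of the C-value computation as defined in [Frantzi et al., 2000], using the symbols above: C-value(a) = log2|a| · f(a) when a is not nested in any longer term, and log2|a| · (f(a) − (1/P(Ta)) · Σ f(b)) otherwise, with the sum running over the longer terms b in Ta. The frequencies below are invented for illustration only, and representing terms as word tuples is an assumption of this sketch.

```python
import math


def c_value(term, freq):
    """C-value of `term`, given `freq`: candidate term (tuple of words) -> frequency."""

    def contains(longer, shorter):
        # True if `shorter` occurs as a contiguous word sequence inside `longer`.
        n = len(shorter)
        return len(longer) > n and any(
            longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
        )

    longer_terms = [b for b in freq if contains(b, term)]  # the set Ta
    weight = math.log2(len(term))                          # log2 |a|
    if not longer_terms:
        return weight * freq[term]
    nested = sum(freq[b] for b in longer_terms) / len(longer_terms)
    return weight * (freq[term] - nested)


# Illustrative counts, not taken from any corpus.
freq = {
    ("artery", "disease"): 10,
    ("carotid", "artery", "disease"): 4,
    ("coronary", "artery", "disease"): 2,
}
for term in freq:
    print(" ".join(term), round(c_value(term, freq), 3))
```

With these counts, "artery disease" is discounted by the average frequency of the two longer terms that contain it, while the three-word terms keep their full frequency weighted by log2(3).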
Ontology Learning Steps
• Preprocessing
• Concept Extraction
• Taxonomy Induction
• Non-taxonomic Relations
Taxonomy Induction
• Aims at organizing concepts into a hierarchical structure where each concept is related to its respective broader and narrower terms
• Two methods in OntoGain
  • Agglomerative clustering
  • Formal Concept Analysis (FCA)
Agglomerative Clustering
• Proceeds bottom-up: at each step, the most similar clusters are merged
• Initially each term is considered a cluster
• Similarity between all pairs of clusters is computed
• The most similar clusters are merged as long as they share terms with common heads
• Similarity measures: group average between clusters, a Dice-like formula between terms
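The steps above can be sketched as follows. This is not OntoGain's implementation: the plain Dice coefficient over word sets stands in for the Dice-like formula, the head is assumed to be the last word of a term, and the function names are hypothetical.

```python
from itertools import combinations


def dice(t1, t2):
    """Dice coefficient over the word sets of two terms (assumed term similarity)."""
    w1, w2 = set(t1.split()), set(t2.split())
    return 2 * len(w1 & w2) / (len(w1) + len(w2))


def group_average(c1, c2):
    """Average pairwise term similarity between two clusters."""
    return sum(dice(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def head(term):
    return term.split()[-1]  # assumption: head-final noun phrases


def cluster(terms):
    clusters = [[t] for t in terms]  # initially each term is its own cluster
    merges = []                      # merge history, bottom-up
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Only clusters that share a common head may be merged.
            if not {head(t) for t in clusters[i]} & {head(t) for t in clusters[j]}:
                continue
            sim = group_average(clusters[i], clusters[j])
            if best is None or sim > best[0]:
                best = (sim, i, j)
        if best is None:
            return clusters, merges
        _, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


final, history = cluster(
    ["hierarchical method", "agglomerative method", "web page", "clustering method"]
)
print(final)
```

The merge history records the order in which clusters were joined, which is the information a taxonomy would be read from; terms with no shared head, such as "web page" here, stay in their own cluster.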
Formal Concept Analysis (FCA) [Ganter et al., 1999]
• FCA relies on the idea that objects (terms) are associated with their attributes (verbs)
• Finds common attributes (verbs) between objects and forms object clusters that share common attributes
• Formal concepts are connected by the sub-concept relationship
FCA Example
• Takes as input a matrix showing associations between terms (concepts) and attributes (verbs)
FCA Taxonomy
• Formal concepts
  • ({hierarchical clustering, root node, single cluster}, {compute, search})
  • ({html form, web page}, {print, search})
• Not all dependencies (c, v) are interesting
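A minimal sketch of how formal concepts such as the two above can be derived from a term/verb context. The input context is an assumption chosen to be consistent with the listed concepts; it is not the original example matrix. The brute-force enumeration over object subsets is exponential and only suitable for toy inputs.

```python
from itertools import combinations


def formal_concepts(context):
    """Enumerate all formal concepts of `context` (object -> set of attributes)."""
    objects = sorted(context)
    all_attrs = set().union(*context.values())
    concepts = set()
    for r in range(len(objects) + 1):
        for subset in combinations(objects, r):
            # Intent: attributes shared by every object in the subset.
            intent = set(all_attrs)
            for obj in subset:
                intent &= context[obj]
            # Extent: every object that has all attributes of the intent.
            extent = frozenset(o for o in objects if intent <= context[o])
            concepts.add((extent, frozenset(intent)))
    return concepts


# Assumed context: which verbs occur with which terms.
context = {
    "hierarchical clustering": {"compute", "search"},
    "root node": {"compute", "search"},
    "single cluster": {"compute", "search"},
    "html form": {"print", "search"},
    "web page": {"print", "search"},
}
concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))
```

Besides the two concepts listed above, this context yields a top concept (all five terms, sharing only "search") and an empty bottom concept; ordering concepts by extent inclusion gives the sub-concept hierarchy.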
Non-Taxonomic Relations extraction phase
• Concept Extraction
• Taxonomy Induction
• Non-Taxonomic Relations
Non-Taxonomic Relations
• Concepts are also characterized by attributes and relations to other concepts in the hierarchy
• Typically expressed by a verb relating a pair of concepts
• Two approaches
  • Association rules
  • Probabilistic
Association Rules [Agrawal et al., 1993]
• Introduced to predict the purchase behavior of customers
• Extract terms connected by a subject-verb-object relation
• Enhance with general terms from the taxonomy
• Eliminate redundant relations
  • those with predictive accuracy < t
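The filtering step can be illustrated with the standard support and confidence measures of association rule mining. Treating each subject-verb-object triple as a transaction, reading a rule as "(subject, object) implies verb", and using confidence as the predictive accuracy measure are all assumptions of this sketch; the triples are invented for illustration.

```python
from collections import Counter


def relation_rules(triples, min_confidence):
    """Keep (subject, verb, object) relations whose confidence reaches the threshold.

    support    = count(s, v, o) / number of triples
    confidence = count(s, v, o) / count(s, o)
    """
    n = len(triples)
    triple_counts = Counter(triples)
    pair_counts = Counter((s, o) for s, _, o in triples)
    rules = {}
    for (s, v, o), count in triple_counts.items():
        support = count / n
        confidence = count / pair_counts[(s, o)]
        if confidence >= min_confidence:  # drop relations below the threshold t
            rules[(s, v, o)] = (support, confidence)
    return rules


# Illustrative triples, not taken from any corpus.
triples = (
    [("patient", "suffer_from", "ache")] * 3
    + [("patient", "report", "ache")]
    + [("doctor", "treat", "patient")]
)
rules = relation_rules(triples, min_confidence=0.5)
for relation, scores in rules.items():
    print(relation, scores)
```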
Probabilistic approach [Cimiano et al., 2006]
• Collect verbal relations from the corpus
• Find the most general relation with respect to the verb, using frequency of occurrence
  • Suffer_from(man, head_ache)
  • Suffer_from(woman, stomach_ache)
  • Suffer_from(patient, ache)
• Select relationships satisfying a conditional probability measure
• Associations > t are accepted
Evaluation
• Relevance judgments are provided by humans
• Measures: precision and recall
• We examined the 200 top-ranked concepts and their respective relations in 500 lines
• Results on the OhsuMed and Computer Science corpora
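For reference, precision and recall over a set of extracted items and a set of human-judged relevant items can be computed as below. The term lists are invented for illustration and do not reproduce the reported results.

```python
def precision_recall(extracted, relevant):
    """Precision = correct / extracted, recall = correct / relevant."""
    extracted, relevant = set(extracted), set(relevant)
    correct = extracted & relevant
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative data only.
extracted = ["artery disease", "web page", "root node", "the method"]
relevant = ["artery disease", "web page", "root node", "single cluster", "html form"]
print(precision_recall(extracted, relevant))
```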
Results
Comparison with Text2Onto [Cimiano & Völker, 2005]
• Text2Onto produces huge lists of plain single-word terms, and relations lacking semantic meaning
• Text2Onto cannot work with big texts
• Text2Onto cannot export results in OWL
Conclusions
• OntoGain
  • Multi-word term concepts
  • Exports ontology in OWL
  • Domain independent
• Results
  • C/NC-Value yields good results
  • Clustering outperforms FCA
  • Association Rules perform better than Verbal Expressions
Future Work
• Explore more methods / combinations
  • e.g., clustering, FCA
• Hearst patterns for discovering additional relation types (Part-of)
• Discover attributes and cardinality constraints
• Incorporate term similarity information from WordNet, MeSH
• Resolve term ambiguities
Thank you! Questions?
Preprocessing
• Tokenization, POS tagging, shallow parsing (OpenNLP suite)
• Lemmatization (WordNet Java Library)
• Applies to all steps of OntoGain
• Shallow parsing is used in relation acquisition for the detection of verbal dependencies
• Terms sharing a head tend to be similar
  • e.g. hierarchical method and agglomerative method are both methods
• Nested terms are related to each other
  • e.g. agglomerative clustering method and clustering method should be associated