Concept Hierarchy Induction by Philipp Cimiano
Objective • Structure information into categories • Provide a level of generalization to define relationships between data • Application: backbone of any ontology
Overview • Different approaches to acquiring concept hierarchies from a text corpus • Various clustering techniques • Evaluation • Related work • Conclusion
Machine Readable Dictionaries • Entries such as 'a tiger is a mammal' or 'mammals such as tigers, lions or elephants' • Exploit the regularity of dictionary entries • The head of the first NP in a definition is taken as the hypernym
Exceptions • is-a(corolla, part): NOT VALID • is-a(republican, member): NOT VALID • is-a(corolla, flower): NOT VALID • is-a(republican, political party): NOT VALID
Results using MRDs • Dolan et al.: 87% of the extracted hypernym relations are correct • Calzolari cites a precision of > 90% • Alshawi: precision of 77%
Strengths And Weaknesses • Correct, explicit knowledge • Robust basis for ontology learning Weakness: • Domain independent (general-purpose dictionaries are not tailored to a specific domain)
Lexico-Syntactic Patterns • Task: automatically learn hyponym relations from corpora • Example: 'Such injuries as bruises, wounds and broken bones' yields hyponym(bruise, injury), hyponym(wound, injury), hyponym(broken bone, injury)
Hearst patterns 'Such injuries as bruises, wounds and broken bones'
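To make the pattern idea concrete, here is a minimal Python sketch of a single Hearst-style pattern ("such NP as NP, NP and NP"). It is not the implementation behind these slides: noun phrases are approximated by runs of plain words, and the helpers `singular` and `extract_hyponyms` are hypothetical names introduced only for illustration.

```python
import re

# One Hearst-style pattern: "such NP as NP, NP and/or NP".
# Assumption of this sketch: a noun phrase is just a run of words;
# a real system would use a POS tagger or chunker instead.
SUCH_AS = re.compile(
    r"such (\w+) as ((?:[\w ]+, )*[\w ]+ (?:and|or) [\w ]+)",
    re.IGNORECASE,
)

def singular(np):
    """Naive singularization of the last word (illustrative only)."""
    if np.endswith("ies"):
        return np[:-3] + "y"
    if np.endswith("s") and not np.endswith("ss"):
        return np[:-1]
    return np

def extract_hyponyms(text):
    """Return (hyponym, hypernym) pairs found by the pattern."""
    pairs = []
    for m in SUCH_AS.finditer(text):
        hypernym = singular(m.group(1).lower())
        items = re.split(r", | and | or ", m.group(2).lower())
        pairs.extend((singular(item.strip()), hypernym) for item in items)
    return pairs

print(extract_hyponyms("Such injuries as bruises, wounds and broken bones"))
# [('bruise', 'injury'), ('wound', 'injury'), ('broken bone', 'injury')]
```

The word-run approximation is what keeps the pattern "recognizable with little or no pre-encoded knowledge", at the cost of over-matching on longer sentences.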
Requirements • Occur frequently in many text genres • Accurately indicate the relation of interest • Be recognizable with little or no pre-encoded knowledge
Strengths And Weaknesses • Patterns are easily identified and accurate Weakness: • Patterns appear rarely • Many is-a relations do not appear in Hearst-style patterns
Distributional Similarity • 'You shall know a word by the company it keeps' [Firth, 1957] • The semantic similarity of words is approximated by the similarity of their contexts
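A small sketch of the idea, assuming contexts are approximated by a fixed word window and compared with cosine similarity. Both choices are assumptions of this example, as are the toy sentences; the slides do not specify the context type or the similarity measure.

```python
import math
from collections import Counter

def context_vector(word, sentences, window=2):
    """Count the words that occur within `window` tokens of `word`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], lo) if j != i
                )
    return counts

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0

# Hypothetical toy corpus, for illustration only.
corpus = [
    "we booked a hotel room",
    "we booked a hostel room",
    "the car needs fuel",
]
hotel = context_vector("hotel", corpus)
hostel = context_vector("hostel", corpus)
car = context_vector("car", corpus)
print(cosine(hotel, hostel), cosine(hotel, car))
```

Here 'hotel' and 'hostel' share all their context words and score close to 1, while 'hotel' and 'car' share none and score 0.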
Strengths And Weaknesses • Produces a reasonable concept hierarchy Weakness: • Cluster tree lacks a clear and formal interpretation • Does not provide any intensional description of concepts • Similarities may be accidental (sparse data)
Evaluation • Semantic cotopy (SC). • Taxonomy overlap (TO)
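The two measures can be sketched as follows, assuming the commonly used definitions: the semantic cotopy of a concept is the concept together with all its super- and subconcepts, and the taxonomy overlap of a concept in two taxonomies is the ratio of shared to combined cotopy members. The exact variants used in the evaluation are not given here, and the two toy taxonomies are hypothetical.

```python
def ancestors(concept, parent):
    """All superconcepts of `concept` in a child -> parent map (a tree)."""
    out = set()
    while concept in parent:
        concept = parent[concept]
        out.add(concept)
    return out

def semantic_cotopy(concept, parent):
    """The concept itself plus all of its super- and subconcepts."""
    descendants = {c for c in parent if concept in ancestors(c, parent)}
    return {concept} | ancestors(concept, parent) | descendants

def taxonomy_overlap(concept, parent_a, parent_b):
    """Shared cotopy members divided by all cotopy members."""
    sc_a = semantic_cotopy(concept, parent_a)
    sc_b = semantic_cotopy(concept, parent_b)
    return len(sc_a & sc_b) / len(sc_a | sc_b)

# Hypothetical learned and reference taxonomies (child -> parent).
learned = {"hotel": "accommodation", "hostel": "accommodation",
           "accommodation": "root"}
reference = {"hotel": "accommodation", "hostel": "root",
             "accommodation": "root"}
print(taxonomy_overlap("accommodation", learned, reference))  # 0.75
```

In the example the learned taxonomy places 'hostel' under 'accommodation' while the reference does not, so three of the four cotopy members agree.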
Strengths And Weaknesses • FCA generates formal concepts • Provides an intensional description Weakness: • Size of the lattice can become exponential in the size of the formal context • Spurious clusters • Finding appropriate labels for the clusters
Problems with Unsupervised Approaches to Clustering • Data sparseness leads to spurious syntactic similarities • Produced clusters can’t be appropriately labeled
Guided Clustering • Hypernyms directly used to guide clustering • WordNet • Hearst • Agglomerative clustering
Similarity Computation Ten most similar terms of the tourism reference taxonomy
The Hypernym Oracle • Three sources • WordNet • Hearst patterns matched in a corpus • Hearst patterns matched in the World Wide Web • Record hypernyms and amount of evidence found in support of hypernyms.
WordNet • Collect the hypernyms found in any dominating synset of a synset containing the term t • Record the number of times each hypernym appears in a dominating synset
Hearst Patterns (Corpus) • Record the number of is-a relations found between two terms
Hearst Patterns (WWW) • Download 100 Google abstracts for each concept and clue:
Evidence • Total Evidence for Hypernyms: • time: 4 • vacation: 2 • period: 2
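One way to arrive at such totals is to keep one counter per source and add them up. The sketch below assumes evidence is simply summed across the three sources; the per-source counts are hypothetical and chosen only so that the totals match the slide.

```python
from collections import Counter

def total_evidence(*sources):
    """Sum hypernym evidence counts over all sources."""
    total = Counter()
    for source in sources:
        total.update(source)  # Counter.update adds counts
    return total

# Hypothetical per-source counts for one term (not from the slides).
wordnet = Counter({"time": 2, "period": 2})
corpus_patterns = Counter({"time": 1, "vacation": 1})
web_patterns = Counter({"time": 1, "vacation": 1})

evidence = total_evidence(wordnet, corpus_patterns, web_patterns)
best_hypernym = evidence.most_common(1)[0][0]
print(dict(evidence), best_hypernym)
```

With these counts the totals are time: 4, vacation: 2, period: 2, and 'time' is the hypernym with the greatest evidence.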
Clustering Algorithm • Input a list of terms • Calculate the similarity between each pair of terms and sort from highest to lowest • For each potential pair to be clustered consult the oracle.
Consulting the Oracle case 1 • If term 1 is a hypernym of term 2 or vice-versa: • Create appropriate subconcept relationship.
Consulting the Oracle case 2 • Find the common hypernym h of both terms with the greatest evidence. • If one term has already been classified under t’, three sub-cases arise: • t’ = h • h is a hypernym of t’ • t’ is a hypernym of h
Consulting the Oracle case 3 • Neither term has been classified: • Each term becomes a subconcept of the common hypernym.
Consulting the Oracle case 4 • The terms do not share a common hypernym: • Set aside the terms for further processing.
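The four cases can be put together in a rough Python sketch. It assumes the oracle is a map from each term to its hypernyms with evidence counts, and that term pairs arrive sorted by decreasing similarity. Case 2 is deliberately simplified: an unclassified term is placed under the common hypernym h and existing classifications are left untouched, so the sub-cases relating h to t’ are not modelled. All names and the toy oracle are hypothetical.

```python
def common_hypernym(t1, t2, oracle):
    """Shared hypernym with the greatest combined evidence, or None."""
    shared = oracle.get(t1, {}).keys() & oracle.get(t2, {}).keys()
    if not shared:
        return None
    return max(sorted(shared), key=lambda h: oracle[t1][h] + oracle[t2][h])

def guided_clustering(pairs, oracle):
    """Return (child -> parent map, pairs set aside for further processing)."""
    parent, deferred = {}, []
    for t1, t2 in pairs:
        if t1 in oracle.get(t2, {}):       # case 1: t1 is a hypernym of t2
            parent[t2] = t1
        elif t2 in oracle.get(t1, {}):     # case 1, the other way round
            parent[t1] = t2
        else:
            h = common_hypernym(t1, t2, oracle)
            if h is None:                  # case 4: no common hypernym
                deferred.append((t1, t2))
            elif t1 not in parent and t2 not in parent:
                parent[t1] = parent[t2] = h    # case 3
            else:                          # case 2, simplified (see lead-in)
                parent.setdefault(t1, h)
                parent.setdefault(t2, h)
    return parent, deferred

# Hypothetical oracle: term -> {hypernym: evidence}.
oracle = {
    "hotel": {"accommodation": 3},
    "hostel": {"accommodation": 2},
    "apartment": {"accommodation": 1},
    "capital": {"city": 1},
    "city": {"area": 2},
    "car": {"vehicle": 1},
    "beach": {"area": 1},
}
pairs = [("hotel", "hostel"), ("capital", "city"),
         ("hotel", "apartment"), ("car", "beach")]
parent, deferred = guided_clustering(pairs, oracle)
print(parent, deferred)
```

In this run 'hotel' and 'hostel' fall under case 3, 'capital' and 'city' under case 1, 'apartment' joins the already classified 'hotel' under 'accommodation' (case 2 with t’ = h), and the pair ('car', 'beach') is set aside (case 4).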
r-matches • For all unprocessed terms, check for r-matches (e.g. ‘credit card’ matches ‘international credit card’)
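A possible reading of an r-match, sketched in Python: the shorter term matches when its tokens form the right-hand end of the longer term. This interpretation is an assumption based on the 'credit card' example, and `r_matches` is a hypothetical helper name.

```python
def r_matches(short, long):
    """True if the tokens of `short` are a proper suffix of those of `long`."""
    s, l = short.lower().split(), long.lower().split()
    return 0 < len(s) < len(l) and l[-len(s):] == s

print(r_matches("credit card", "international credit card"))  # True
print(r_matches("credit", "credit card"))                      # False
```

Matching on whole tokens rather than characters avoids accidental hits such as 'card' against 'postcard'.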
Further Processing • If either term in a pair is already classified as t’, the other term is classified under t’ as well. • Otherwise place both terms under the hypernym of either term with the most evidence. • Any unclassified terms are added under the root concept.