Knowledge Discovery in Ontology Learning

Knowledge Discovery in Ontology Learning: A survey

Outline • Introduction • OL Data Input • OL Application Fields • OL Methods • OL Tools (practical session)

Introduction • Ontology Engineering is a time-consuming task • Ontology Learning (OL) is the semi-automatic process that supports ontology engineering • OL is a bottom-up, data-driven process • OL is an interdisciplinary field

OL Data Input • Pure NL text • Ontologies • KB (DB) instances • Schemata • DB schemata • Web schemata • Log files

OL Application Fields • OL can support Ontology Engineering (and management) in different phases • Ontology extraction: based on some input, the ontology engineer receives a proposed ontology • Ontology reuse: pruning existing domain ontologies for a specific application • Ontology interoperability (multiple ontology management): mapping discovery

OL Methods (outline) • Ontology Extraction (from text): weak ontology notion (Document Ontology extraction); strong ontology notion (association rules, conceptual clustering) • Ontology Reuse: ontology pruning • Ontology Learning for interoperability

Document Ontology extraction (1) • Extraction of concepts from a set of documents and identification of relationships between these concepts and individual terms [3] • No extraction of semantic relations • Only concepts are extracted (aggregation of the terms identified with the same concept) • Uses statistical analysis over a set of documents • Good for domain-specific applications

Document Ontology extraction (2) • Input (text documents) • Pre-processing • Normalization • LSI (using SVD) • Document Ontology Construction

Document Ontology extraction (3) • Singular Value Decomposition: $A = U S V^T$ • $A$: terms × documents ($m \times n$) • $U$: terms × concepts ($m \times r$) • $S$: $r \times r$ • $V^T$: concepts × documents ($r \times n$)
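A minimal sketch of the decomposition on this slide, assuming NumPy is available. The term list, the count matrix `A` and the choice `r = 2` are hypothetical illustration values, and the final grouping of terms by their largest loading is only one simple way to read concepts off `U`, not necessarily the procedure of [3].

```python
import numpy as np

# Hypothetical term-document count matrix A (m terms x n documents).
terms = ["car", "train", "vehicle", "ontology", "concept"]
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2  # number of latent concepts to keep (assumed for this example)
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]

# Rank-r approximation of A in the reduced concept space.
A_r = U_r @ np.diag(s_r) @ Vt_r

# Assign each term to the concept on which it has the largest absolute loading.
term_concept = np.argmax(np.abs(U_r), axis=1)
groups = {}
for term, concept in zip(terms, term_concept):
    groups.setdefault(int(concept), []).append(term)
```

With this toy matrix the two document groups share no terms, so `groups` separates the transport terms from the ontology terms. On real corpora the loadings are far less clean and `r` has to be tuned.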

Association Rules (1) • Make use of shallow text processing techniques [6] • No taxonomic relations • Assumption: syntactic relations ⇒ semantic relations

Association Rules (2) • Preprocess the text documents • Morphological analysis • Recognition of named entities • Retrieval of domain-specific concepts (if available) • Disambiguation using context information • Determine the set of concept pairs (CP) using several heuristics (either general or domain dependent) • NP-PP heuristic • Sentence heuristic • Title heuristic

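Association Rules (3) • Determine the transaction set $T = \{\{a_{i,1},\dots,a_{i,n}\} \mid (a_{i,1}, a_{i,2}) \in CP \wedge \forall m > 2\; ((a_{i,1}, a_{i,m}) \in H \vee (a_{i,2}, a_{i,m}) \in H)\}$ • Determine support and confidence for all association rules $X_k \Rightarrow Y_k$, where $|X_k| = |Y_k| = 1$ • $\mathrm{support}(X_k \Rightarrow Y_k) = \frac{|\{t_i \mid X_k \cup Y_k \subseteq t_i\}|}{n}$ • $\mathrm{confidence}(X_k \Rightarrow Y_k) = \frac{|\{t_i \mid X_k \cup Y_k \subseteq t_i\}|}{|\{t_i \mid X_k \subseteq t_i\}|}$ • Propose to the user only the rules that exceed user-defined thresholds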
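The support and confidence computation for single-concept rules can be sketched as follows. The transactions, the concept names and the threshold values are hypothetical, and the helper names (`support`, `confidence`, `propose_rules`) are illustrative only. The construction of the transactions from CP is not covered here.

```python
from itertools import permutations

# Hypothetical transactions: each one is a set of co-occurring concepts.
transactions = [
    {"hotel", "room"},
    {"hotel", "room", "price"},
    {"hotel", "city"},
    {"room", "price"},
]

def support(x, y, transactions):
    """Fraction of all transactions that contain both x and y."""
    both = sum(1 for t in transactions if {x, y} <= t)
    return both / len(transactions)

def confidence(x, y, transactions):
    """Fraction of the transactions containing x that also contain y."""
    both = sum(1 for t in transactions if {x, y} <= t)
    with_x = sum(1 for t in transactions if x in t)
    return both / with_x if with_x else 0.0

def propose_rules(transactions, min_support, min_confidence):
    """Return the rules x => y whose support and confidence exceed the thresholds."""
    concepts = sorted(set().union(*transactions))
    rules = []
    for x, y in permutations(concepts, 2):
        s = support(x, y, transactions)
        c = confidence(x, y, transactions)
        if s > min_support and c > min_confidence:
            rules.append((x, y, s, c))
    return rules

rules = propose_rules(transactions, min_support=0.4, min_confidence=0.7)
```

Note that support is symmetric in x and y while confidence is not, which is why both directions of every pair are checked.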

Conceptual Clustering (1) • A conceptual clustering approach [2,5] is used to extract a hierarchy of concepts and to learn subcategorization frames • Here, the examples to cluster are sets of words, each associated with the frequency of the corresponding instantiated frame in the corpora • A syntactic parser provides parsed sentences, including the attachments of noun phrases to verbs and clauses, e.g. <to travel> <subject: father> <by: car>; <to travel> <subject: neighbor> <by: train>; <to drive> <subject: friend> <by: car>; <to drive> <subject: colleague> <by: motor-bike>; <to drive> <subject: friend> <by: motor-bike> • Unambiguously parsed sentences are not required; noise is taken into account • The meaning of the concepts in the ontology is characterized by the subcategorization frames they appear in

Conceptual Clustering (2) • E.g., parsed frames: <to travel> <subject: father> <by: car>; <to travel> <subject: neighbor> <by: train>; <to drive> <subject: friend> <by: car>; <to drive> <subject: colleague> <by: motor-bike>; <to drive> <subject: friend> <by: motor-bike> • Aggregated per verb, with frequencies: <to travel> <subject: [father(1), neighbor(1)]> <by: [car(1), train(1)]>; <to drive> <subject: [friend(2), colleague(1)]> <by: [car(1), motor-bike(2)]> • Generalized: <to travel> <subject: human> <by: motorized vehicle>; <to drive> <subject: human> <by: motorized vehicle>
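The aggregation step of this example, from individual parsed frames to per-verb word sets with frequencies, can be sketched as below. The tuple representation of a frame and the function name `aggregate` are assumptions made for illustration. The later generalization to labels such as "human" or "motorized vehicle" is not attempted here.

```python
from collections import Counter, defaultdict

# The instantiated frames from the slide, as (verb, subject, by) tuples.
frames = [
    ("travel", "father", "car"),
    ("travel", "neighbor", "train"),
    ("drive", "friend", "car"),
    ("drive", "colleague", "motor-bike"),
    ("drive", "friend", "motor-bike"),
]

def aggregate(frames):
    """Collect, for each verb and argument slot, the filler words and their counts."""
    clusters = defaultdict(lambda: {"subject": Counter(), "by": Counter()})
    for verb, subject, by in frames:
        clusters[verb]["subject"][subject] += 1
        clusters[verb]["by"][by] += 1
    return clusters

clusters = aggregate(frames)
```

Each `Counter` corresponds to one bracketed list on the slide, for example `clusters["drive"]["by"]` holds car(1) and motor-bike(2).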

Conceptual Clustering (3) • Clusters with maximum overlap (that is, clusters that contain the same words with the same frequencies) are merged.

Ontology Pruning • Ontology pruning is a data-driven means of reusing existing (general) ontologies by tailoring them to a certain domain [4] • The approach uses data-oriented techniques based on word/concept frequencies • The idea is to compare the frequencies of words/concepts in two different corpora, one domain-specific and one generic • Words/concepts whose frequencies in the domain-specific corpus exceed their frequencies in the generic corpus by a certain percentage are accepted; the others are rejected
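A minimal sketch of the frequency comparison described on this slide. It assumes relative frequencies and a single `margin` parameter (0.5 meaning "50% higher than in the generic corpus"). The exact criterion used in [4] may differ, and the corpora and concept names are invented for the example.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def prune(concepts, domain_tokens, generic_tokens, margin=0.5):
    """Keep concepts whose domain frequency exceeds the generic one by `margin`."""
    dom = relative_frequencies(domain_tokens)
    gen = relative_frequencies(generic_tokens)
    kept = set()
    for concept in concepts:
        d = dom.get(concept, 0.0)
        g = gen.get(concept, 0.0)
        # Concepts absent from the domain corpus are always rejected.
        if d > 0 and d > (1 + margin) * g:
            kept.add(concept)
    return kept

# Hypothetical corpora, already tokenized and mapped to concept labels.
domain_tokens = "hotel room hotel booking price room hotel person".split()
generic_tokens = "person time person hotel thing time person price".split()
concepts = {"hotel", "room", "person", "time", "price"}

kept = prune(concepts, domain_tokens, generic_tokens, margin=0.5)
```

Here "hotel" and "room" survive, while "person", "time" and "price" are rejected because they are not markedly more frequent in the domain corpus.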

OL for Interoperability (1) • The key challenge is to find semantic mappings between similar elements of two ontologies [1] • First problem: how can we define a meaningful similarity measure? • Second problem: how can we compute such a measure from the available data? • The approach assumes that instances are available from which the concepts can be learned

OL for Interoperability (2) • Similarity measure • Many definitions are possible (the choice is task dependent) • Many similarity measures are based on the joint probability distribution of A and B: $P(A, B)$, $P(\neg A, B)$, $P(A, \neg B)$, $P(\neg A, \neg B)$ • Jaccard coefficient: $JC(A,B) = \frac{P(A \cap B)}{P(A \cup B)} = \frac{P(A, B)}{P(A, B) + P(\neg A, B) + P(A, \neg B)}$
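As a small worked example of the Jaccard coefficient expressed through the joint distribution, the sketch below takes the three relevant joint probabilities as plain numbers. The function name and the sample values are hypothetical.

```python
def jaccard(p_a_b, p_nota_b, p_a_notb):
    """JC(A,B) = P(A,B) / (P(A,B) + P(not A,B) + P(A,not B))."""
    union = p_a_b + p_nota_b + p_a_notb
    # If neither concept has any probability mass, define the similarity as 0.
    return p_a_b / union if union > 0 else 0.0

# Partially overlapping concepts (assumed probabilities).
partial = jaccard(0.2, 0.1, 0.1)
# Disjoint concepts: no instance belongs to both.
disjoint = jaccard(0.0, 0.3, 0.3)
# Identical concepts: every instance of one belongs to the other.
identical = jaccard(0.4, 0.0, 0.0)
```

The measure ranges from 0 for disjoint concepts to 1 for identical ones, and $P(\neg A, \neg B)$ plays no role in it.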

OL for Interoperability (3) • Distribution estimator • We assume a set of instances that is representative of the universe covered by the ontology • $N(U_i^{A,B})$ is the number of instances of the $i$-th ontology that belong to both A and B • $P(A, B) = \frac{N(U_1^{A,B}) + N(U_2^{A,B})}{N(U_1) + N(U_2)}$ • Problem: what if A and B do not belong to the same ontology? (This is exactly our case.)
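The estimator on this slide can be sketched as a simple count over both instance sets. The sketch assumes that membership in A and in B is already known for every instance, represented as a pair of booleans. How membership in a concept of the other ontology is obtained is exactly the open problem stated above, so the values below are invented.

```python
def estimate_joint(u1, u2):
    """Estimate P(A,B) as [N(U1^{A,B}) + N(U2^{A,B})] / [N(U1) + N(U2)].

    u1 and u2 are lists of (in_a, in_b) boolean pairs, one per instance.
    """
    total = len(u1) + len(u2)
    if total == 0:
        raise ValueError("at least one instance is required")
    n_both = sum(1 for in_a, in_b in u1 if in_a and in_b)
    n_both += sum(1 for in_a, in_b in u2 if in_a and in_b)
    return n_both / total

# Hypothetical membership labels for the instances of each ontology.
u1 = [(True, True), (True, False), (False, False), (True, True)]
u2 = [(True, True), (False, True), (False, True),
      (True, False), (False, False), (True, True)]

p_a_b = estimate_joint(u1, u2)  # 4 of the 10 instances are in both A and B
```

The other three joint probabilities can be estimated the same way by changing the condition inside the two sums.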

OL for Interoperability (4) • [Figure: a learner L is trained on the instances of the first ontology, split into $U_1^{A}$ and $U_1^{\neg A}$; the trained learner is then applied to the instances of the second ontology, splitting $U_2^{B}$ and $U_2^{\neg B}$ into $U_2^{A,B}$, $U_2^{\neg A,B}$, $U_2^{A,\neg B}$ and $U_2^{\neg A,\neg B}$]

OL Tools (KAON) • http://kaon.semanticweb.org • Open Source • Java based • Implements a modular framework • Text2Onto, module for OL from text (association rules, see Association Rules (1)) • Ontology Pruning implemented (simple filter on TF)

References
[1] A. Doan, J. Madhavan, P. Domingos, A. Halevy. Learning to map between ontologies on the Semantic Web. In Proc. of the 11th International World Wide Web Conference (WWW 2002), Hawaii, USA, May 2002.
[2] D. Faure, C. Nedellec. A corpus-based conceptual clustering method for verb frames and ontology acquisition. In 1st International Conference on Language Resources and Evaluation, Workshop on Adapting Lexical and Corpus Resources to Sublanguages and Applications, Granada, Spain, pages 1-8, 1998.
[3] G. R. Maddi, C. S. Velvadapu, S. Srivastava, J. Gil de Lamadrid. Ontology Extraction from text documents by Singular Value Decomposition.
[4] A. Maedche, R. Volz, R. Studer, B. Lauser. Pruning-based identification of a domain in ontologies. In Proc. of I-KNOW'03, Graz, Austria, July 2003.
[5] A. Maedche, V. Zacharias. Ontology-based Instance Clustering. In Proc. of ECML/PKDD. Springer, 2002.
[6] A. Maedche, S. Staab. Discovering Conceptual Relations from Text. In Proc. of ECAI-2000.
