140 likes | 263 Views
Learning Taxonomic Relations from Heterogeneous Evidence. Philipp Cimiano Aleksander Pivk Lars Schmidt-Thieme Steffen Staab (ECAI 2004). Purpose. To examine the possibility of learning taxonomic relations by considering various sources of evidence
E N D
Learning Taxonomic Relations from Heterogeneous Evidence Philipp Cimiano Aleksander Pivk Lars Schmidt-Thieme Steffen Staab (ECAI 2004)
Purpose • To examine the possibility of learning taxonomic relations by considering various sources of evidence • Main aim: • To gain insight into the behavior of different approaches to learn taxonomic relations • To provide a first step towards combining these different approaches • To establish a baseline model for further research
Introduction • Taxonomies or conceptual hierarchies are useful in many NLP applications. • However, the development of suitable ontologies is time-consuming. • Automatically acquiring ontological knowledge is required. • The approach proposed in this paper learns taxonomic relations (is-a relation) by considering four different evidences: • Hearst-patterns matched in a large corpus • Hearst-patterns matched in WWW • WordNet • The ‘vertical relations’-heuristic
Introduction • Goal: • Learning is-a relations in tourism domain • Training Corpus: • Domain-specific: • http://www.lonelyplanet.com • http://www.all-inall.de • General: • British National Corpus • The ontology for evaluation: • A tourism reference ontology modeled by ontology engineer. • A few abstract concepts are removed. • 272 concepts, 225 direct is-a relations, and 636 non-direct is-a relations
Hearst Patterns • Lexico-syntactic patterns proposed by Hearst (1992). • N such as N1, N2,… • such N as N1, N2,… • N1, N2,… and other N • N, (especially | including) N1, N2,… • From these patterns, we could derive is-a(Ni, N). • Numbers of Hearst-patterns between different terms are recorded and normalized to 0~1. • Different thresholds are set and experimented.
WordNet • WordNet is not “unstructured” source of evidence. • However, it is general and domain-independent. • One term may have several senses and there may be more than one hypernym relation between two terms. • Two different strategies are used: • Normalizing all hypernym paths between two terms: • Considering only the most frequent sense of t1
‘Vertical Relations’-Heuristic • Given t1 and t2, if t2 matches t1 and t1 is additionally modified by certain terms or adjectives, the relation is-a(t1, t2) is derived. • Ex. is-aHEURISTIC(international conference, conference)
World Wide Web • Google API (http://www.google.com/apis/) is used to count the matches of certain Hearst-patterns in the Web. • The sum of the number of Google hits over all patterns for a certain pair (t1, t2) is normalized by dividing through the number of hits returned for t1.
Conclusion and Further Work • A simple combination strategy improves the results. • It remains further work to find out if other sources of evidence could be integrated into this approach. • It could turn out to be useful to only consider domain-specific text collections instead of a general corpus such as the BNC and to consider only pages in the World Wide Web related to the domain. • It remains as a challenge to determine the optimal strategy to combine the different approaches. • In order to apply machine learning techniques for this purpose, it is necessary to cope with the high number of negative examples.