Chinese Core Ontology Construction from a Bilingual Term Bank Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University
Outline • Introduction • Related Works • Algorithm Design – COCA • Performance Evaluation • Conclusion
Introduction • What is a Core Ontology • A mid-level ontology • Bridges the gap between an upper ontology and a domain ontology
Concepts and Terminologies • Upper Ontology • A general ontology that ensures reusability across different domains (e.g., Computer Program in SUMO) • Domain Ontology • An ontology that conceptualizes a specific domain (e.g., Free Software in the IT domain) • More application dependent; more concerned with the extension (extent) of concepts • Mid-level Ontology (Core Concepts) • Basic concepts of a domain • More application independent; more concerned with the intension (intent) of concepts • Core ontology (e.g., Software) • Frequently used, able to form other concepts • Core Terms • Lexical units of core concepts
Related Works • Manually constructed ontologies • SUMO • Well-known upper-level ontology works based on lexicons • CoreLex (Buitelaar, P., 1998) • EuroWordNet (Rodríguez, 1998) • Ontology harmonization: core ontology • “Towards a Core Ontology for Information Integration” (M. Doerr, 2003) • The most similar work • “Enriching Core Ontology with Domain Thesaurus through Concept and Relation Classification” (Huang, 2007) • Uses concept and relation classification to enrich a core ontology
Our Previous Works • Chinese terminology extraction • Chinese core term extraction (Ji et al., 2007) • Preliminary work on automatic core ontology construction using an English-Chinese term bank (MRCOCA, Ontolex 2007; Chen, 2007) • Bilingual lexicon • Extended strings • Frequency information in synsets • Weights from extended strings are integrated into the final weight by simple addition • Mapping to synsets and SUMO achieves an accuracy of only about 50%
Issues • What kinds of concepts should be included? • How to identify core concepts? • If through core terms, disambiguation is needed • What relations to identify, and how? • Making use of available resources • Chinese NLP resources are scarce • English NLP resources are abundant
Requirements of Core Ontology • The concepts must be widely accepted and commonly referenced • The corresponding core terms must be highly used and productive • The concepts/terms can be mapped to an upper ontology, so that the core ontology can inherit the attributes provided by the upper ontology
Core Ontology Construction Algorithm (COCA) for Chinese • Extract Chinese core terms from a bilingual term bank • Map each core term Tc to English terms • Map the English terms to WordNet synsets • Map each synset to an upper ontology concept in SUMO
COCA - Resources Used • ITCTerm • A domain-specific core term list (Chen, 2007) • CETBank • Chinese-English bilingual term bank • The 1,500 most productive core terms extracted can serve as suffixes to form more than 50% of the terms in CETBank • WordNet • SUMO • Mappings between WordNet and SUMO
COCA – Statistical Translation Module • Translation ambiguity: each Chinese core term TC ∈ ITCTerm has a set of translations T_SetE, with TE ∈ T_SetE • Objective • To estimate the likelihood of every translation using the extended terms of TC • P(TE | TC) for all TE ∈ T_SetE
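The slides do not give the exact estimator for P(TE | TC), so the following is only a minimal sketch under one possible reading: treat every Chinese term that ends with TC as an extended term, count which candidate translation appears in its English side, and normalize the counts. The function name, the suffix test, the single-word matching and the toy term bank entries are all illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def estimate_translation_probs(core_term, candidates, term_bank):
    """Estimate P(TE | TC) as a relative frequency over extended terms.

    term_bank: list of (chinese_term, english_term) pairs.
    Assumption: an extended term is any longer Chinese term that ends
    with core_term. Candidates are matched as single English words.
    """
    counts = Counter()
    for zh, en in term_bank:
        if zh != core_term and zh.endswith(core_term):
            en_words = en.lower().split()
            for cand in candidates:
                if cand.lower() in en_words:
                    counts[cand] += 1
    total = sum(counts.values())
    if total == 0:
        # No evidence from extended terms: fall back to a uniform guess.
        return {c: 1.0 / len(candidates) for c in candidates}
    return {c: counts[c] / total for c in candidates}

# Toy, made-up term bank entries for illustration only.
toy_bank = [
    ("软件", "software"),
    ("自由软件", "free software"),
    ("系统软件", "system software"),
    ("应用软件", "application program"),
]
probs = estimate_translation_probs("软件", ["software", "program"], toy_bank)
print(probs)
```

On this toy data, two of the three extended terms support "software" and one supports "program", so the estimate favours "software".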
COCA - Sense Disambiguation Module • Mapping a given TC to a synset S through its translation set T_SetE(TC) • Mapping probability of an English term TE to take a synset S, using frequency information in WordNet • Mapping probability of TC to take a particular synset S via an English translation TE
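The two mapping probabilities above can be sketched as follows, assuming that P(S | TE) is the normalized sense frequency of TE and that one path TC → TE → S is scored as the product P(S | TE) · P(TE | TC). The exact formulas are not shown on the slides, and the synset labels and counts below are made-up placeholders rather than real WordNet data.

```python
def synset_given_english(sense_freqs):
    """P(S | TE): normalize the per-sense frequency counts of one English term."""
    total = sum(sense_freqs.values())
    if total == 0:
        # All counts are zero: treat the senses as equally likely.
        return {s: 1.0 / len(sense_freqs) for s in sense_freqs}
    return {s: f / total for s, f in sense_freqs.items()}

def path_probabilities(translation_probs, sense_freq_table):
    """Score every path TC -> TE -> S as P(S | TE) * P(TE | TC)."""
    paths = {}
    for te, p_te in translation_probs.items():
        for s, p_s in synset_given_english(sense_freq_table[te]).items():
            paths[(te, s)] = p_s * p_te
    return paths

# Hypothetical inputs: P(TE | TC) for one core term, and toy sense counts.
translation_probs = {"software": 0.75, "program": 0.25}
sense_freq_table = {
    "software": {"S_software": 4},
    "program": {"S_program_code": 1, "S_program_plan": 3},
}
paths = path_probabilities(translation_probs, sense_freq_table)
best_path = max(paths, key=paths.get)
print(best_path, paths[best_path])
```

Each entry of `paths` is one route from the Chinese core term to a synset; routes that reach the same synset through different translations are what the multi-path feature later merges.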
COCA - Concept Selection Module • Combining three features • multi-path feature • hypernyms feature • part-of-speech feature • Using Union Probability of Independent Events
Feature 1 – Multi-Paths to Synset • Multiple paths are the paths between a Chinese core term and a synset via different English translations • The feature merges the probabilities of the multiple paths
Feature 2 – Hyponyms in Domain • Incorporates information on all the extended strings • An extended string uses the core term as its headword and is a hyponym of the core term • Length ratio • Union probability of independent events
Feature 3 – Part of Speech • Probability of the POS tag pos(S) owned by a synset S given a core term Tc • POS tag estimation: heuristics on adjective, verb, and noun based on position
Integrate Features • Using Union Probability of Independent Events
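For independent events A1, ..., An, the union probability is 1 − ∏(1 − P(Ai)). A minimal sketch of combining three feature scores this way is given below; how each feature score is computed and scaled is not shown on the slides, so the input values are hypothetical.

```python
def union_probability(probs):
    """Probability that at least one of several independent events occurs."""
    none_occur = 1.0
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
        none_occur *= 1.0 - p
    return 1.0 - none_occur

# Hypothetical scores of one (core term, synset) pair for the three features:
# multi-path, extended-string (hyponym), and part-of-speech.
feature_scores = [0.5, 0.2, 0.1]
combined = union_probability(feature_scores)
print(round(combined, 4))
```

Unlike simple addition, the combined value always stays within [0, 1] and is never lower than the strongest single feature.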
Evaluation • Algorithm Output • A pair <Tc_i, Synset_i> for each Chinese core term, with the highest mapping weight • Evaluation Standard • For each Tc_i, whether its mapping to a synset is the best match with respect to this domain • Answer Preparation • Answers are prepared manually by two IT domain experts, each working independently on the same set of data
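The evaluation standard above amounts to a simple accuracy over the evaluated core terms. A minimal sketch follows, assuming a single agreed gold synset per term (the slides do not say how disagreements between the two experts are resolved); the terms and synset labels are made-up placeholders.

```python
def mapping_accuracy(predicted, gold):
    """Fraction of gold-standard core terms whose predicted synset matches."""
    if not gold:
        raise ValueError("gold standard is empty")
    correct = sum(
        1 for term, synset in gold.items() if predicted.get(term) == synset
    )
    return correct / len(gold)

# Hypothetical algorithm output and expert answers.
predicted = {"软件": "S_software", "网络": "S_network", "系统": "S_system_a"}
gold = {"软件": "S_software", "网络": "S_network", "系统": "S_system_b"}
accuracy = mapping_accuracy(predicted, gold)
print(f"{accuracy:.2%}")
```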
Performance • The evaluation is conducted on the top N most frequent core terms • COCA achieves 71% accuracy (N is 28 in this paper) • Compared to MRCOCA (Chen, 2007), which achieved only 50% • Two examples of core term to synset mappings generated by the algorithm are given for “软件” (software) and “网络” (network).
Conclusion • Evaluation of COCA, repeated on an English-Chinese bilingual term bank with more than 130K entries, shows that the algorithm is • Improved by 42% in accuracy compared to MRCOCA (our previous work) • The three features and the new probability-based algorithm account for the improvement
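The 42% figure is consistent with a relative improvement computed from the accuracies reported on the Performance slide (71% for COCA, about 50% for MRCOCA); this reading is an inference from those two numbers, not something the slides state explicitly:

```latex
\[
\frac{\mathrm{Acc}_{\mathrm{COCA}} - \mathrm{Acc}_{\mathrm{MRCOCA}}}{\mathrm{Acc}_{\mathrm{MRCOCA}}}
= \frac{0.71 - 0.50}{0.50} = 0.42
\]
```

In absolute terms the same comparison is a gain of 21 percentage points.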
A term bank can help to quickly construct a domain core ontology by selecting the concept nodes and relations used in the domain • A bilingual term bank can further introduce the second-language realization of the core ontology effectively and automatically
Future Works • Evaluation of the three features • How effective they are • How much they contribute to the final performance • Consideration of more features, such as abbreviations and the synset of the head word of a core term • Use of other resources
Q & A