Text-based Construction and Comparison of Domain Ontology: A study based on classical poetry
Chu-Ren Huang, Academia Sinica
Chu-Ren Huang. PACLIC 18, 2004. Waseda University, Tokyo, Japan.

Outline
• Motivation and Framework: Laying the foundation
• Basic Resources: The building blocks
• From General Ontology to Specific Ontology: Study of Su Shi's Poems
• Epilogue: From Specific Ontology to General Ontology
• Conclusion

Motivation and Framework: Laying the foundation
Knowledge Structure Discovery
Issues and Significance

Knowledge and Knowledge Structure Variation
Knowledge is Structured Information
• The most salient factors dictating variation in knowledge structures are time, space, and domain
• Language is both the product and the conduit of the conceptual structure of its speakers

Knowledge and Structure Mismatch: a historical example
盧家少婦鬱金香,海燕雙棲玳瑁梁。 (from Tang 300)
- Tulips (鬱金香) in Tang?
- No, the text refers to the fragrance of a ginger-like herb, 鬱金
‘Young lady Lu, as fresh and fragrant as ginger grass, Looks on the pair of seagulls resting on the beam inlaid with sea turtle shells.’

Accessing Knowledge Structure
• In order to become sharable and reusable knowledge, all extracted information must first be correctly situated in a knowledge structure
• The situated information must be allowed to transfer from knowledge structure to knowledge structure without losing its meaningful content

Research Goal
• Knowledge Structure Discovery
• Knowledge as situated information
• Language endows information with structure
• Text-based and Lexicon-driven Knowledge Structure Discovery
• General Ontology: the upper ontology shared by all domains (such as SUMO)
• Specific Ontology: an ontology specific to a domain, a historical period, an author, etc.

Research Issues
• Identification of Conceptual Atoms
• Re-construction and Verification of Conceptual Structure
• Knowledge Processing with Mismatched Knowledge Structures

Knowledge Inferred from the Ontology of Tang Animals
• No marsupials: found mainly in Australia, and discovered only much later
• No marine mammals, and a dominance of hoofed animals: Tang civilization's activities stayed mainly on land (a fascination with horses?)
• A large number of birds among vertebrates, and a dominance of insects 昆蟲 among invertebrates 無脊椎動物: Tang civilization's fascination with flying [Birds fly. And insects are the invertebrates that have wings.]

Research Methodology
• The Mental Lexicon Approach
• The Shakespearean-garden Approach
• The Ontology-merging as Ontology-discovery Approach

The Mental Lexicon Approach
• Concepts are stored in the mental lexicon
• The basic unit of mental lexicon organization and access is the lexical entry
• A complete list of lexical entries covers the complete list of conceptual atoms
• Lexical semantic relations mirror conceptual relations
Each Word is a Conceptual Atom

The Shakespearean-garden Approach
• A Shakespearean garden collects all the plants referred to in Shakespearean texts.
• The garden is used to illustrate the flora of Shakespearean England and gives scholars a context in which to interpret his work.
• There is a knowledge structure behind each corpus (i.e. a collection of texts with design criteria). For instance, the complete set of texts by an author, from a certain period, or in a certain domain
Lexicon as a Structured Inventory of Conceptual Atoms

The Ontology-merging as Ontology-discovery Approach I
• An ontology provides a structure in which knowledge can be situated
• However, there is a dilemma in the construction of a new ontology
• If no existing ontology is referred to: reinventing the wheel; it is difficult to start a structure from scratch without rules
• If an existing ontology is referred to: risk of being misled by the existing structure, which may be mismatched or erroneous

The Ontology-merging as Ontology-discovery Approach II
The Solution
• Map conceptual atoms to two (or more) reference ontologies
• Merge the two resultant ontologies
• Matched Mapping: Confirmation of knowledge structure
• Mismatched Mapping: Only one or neither is correct. May lead to the discovery of a new knowledge structure
• Complementary Mapping: Increases coverage

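The three mapping outcomes can be illustrated with a minimal Python sketch. It assumes each reference ontology has already been reduced to a dictionary from conceptual atom to a concept label in a shared vocabulary; the function names and the toy labels are hypothetical and are not taken from the talk.

```python
from typing import Dict, Optional


def classify_mapping(via_a: Optional[str], via_b: Optional[str]) -> str:
    """Compare where two reference ontologies place one conceptual atom."""
    if via_a is None and via_b is None:
        return "unmapped"
    if via_a is None or via_b is None:
        # Only one reference ontology covers the atom: coverage increases.
        return "complementary"
    if via_a == via_b:
        # Both references agree: the structure is confirmed.
        return "matched"
    # The references disagree: at most one is right, so inspect by hand.
    return "mismatched"


def merge_mappings(map_a: Dict[str, str], map_b: Dict[str, str]) -> Dict[str, str]:
    """Label every atom seen in either mapping with its outcome."""
    atoms = sorted(set(map_a) | set(map_b))
    return {atom: classify_mapping(map_a.get(atom), map_b.get(atom)) for atom in atoms}


# Toy, hand-made mappings (hypothetical labels, for illustration only).
sumo_map = {"馬": "Mammal", "杜鵑": "Bird", "椰葉": "Plant"}
wordnet_map = {"馬": "Mammal", "杜鵑": "Plant", "燕": "Bird"}

result = merge_mappings(sumo_map, wordnet_map)
```

In this toy run 馬 is matched, 杜鵑 is mismatched, and 椰葉 and 燕 are complementary. Comparing labels by string equality is a simplification; a real merge would need an alignment between the two reference ontologies.
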
Further Developments
• The Ontology of Chinese Characters: A common knowledge structure for East Asian cultures
• In contrast to the earlier study, which constructed specific ontologies based on a general ontology, the Chinese character ontology will be a crucial general ontology based on a specific ontology

Basic Resources: The building blocks
From Text to Lexicon
From Lexicon to Ontology

Resources used
• WordNet
• SUMO Ontology
• Academia Sinica Bilingual Ontological Wordnet (Sinica BOW)
• Domain Lexicon Management System: Segmentation, New Word Detection, Lexical Database

Resources
• Sinica BOW: SUMO + WordNet
  http://bow.sinica.edu.tw
  http://www.ontologyportal.org or http://ontology.teknowledge.com
  http://www.cogsci.princeton.edu/~wn/
• Segmentation Program etc.
  http://LingAnchor.sinica.edu.tw/

SUMO: Suggested Upper Merged Ontology
SUMO Atoms
• Concepts: around 1000. Note that concepts are not necessarily linguistically realized
• Relations (ISA): See SUMO Graph
• Axioms: for inference
• Open resource created under an initiative from the IEEE Standard Upper Ontology Working Group

Methodology
• From lexicon to ontology (from items to structure)
• Ontology discovery through ontology merging

WHY?
• We do not have the knowledge structure (ontology) of a new domain (historical period, field, etc.)
• But typical ontology discovery needs a framework to map to
• To solve the dilemma, we map the conceptual atoms to both SUMO and WN (as a linguistic ontology)

From General Ontology to Specific Ontology: Study of Su Shi's Poems
Research in collaboration with Feng-ju Luo, Sue-ming Chang, and Ru-Yng Chang

Opus: Su Shi 蘇軾
• Who is Su Shi (A.D. 1036-1101)?
• One of the most prominent scholars of the Song dynasty, very knowledgeable and well-traveled.
• 45 volumes (out of 50) of his work have already been digitized and segmented (by Feng-ju Luo)

How to build a domain ontology
• Word segmentation
• Match WordNet synset and SUMO concept automatically (resources: WordNet, SUMO)
• Use WordNet information to check results and extend concepts
• Transform into ontology browser format

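The first two steps of this flow can be sketched in Python as follows. This is only an illustration under stated assumptions: the segmenter is a simple greedy forward maximum-matching routine standing in for the actual segmentation program, and the lexicon, synset labels, and SUMO concept labels are small hypothetical lookup tables, not real WordNet or SUMO identifiers.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching; unknown characters become one-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy resources (labels invented for illustration).
LEXICON = {"置酒", "椰葉", "桄榔", "檳榔"}
WORD_TO_SYNSET = {"椰葉": "coconut_palm", "檳榔": "betel_palm"}
SYNSET_TO_SUMO = {"coconut_palm": "Plant", "betel_palm": "Plant"}


def map_tokens(tokens):
    """Split tokens into a suggestion list (mapped) and a missing list (unmapped)."""
    suggestions, missing = [], []
    for word in tokens:
        synset = WORD_TO_SYNSET.get(word)
        if synset is None:
            missing.append(word)
        else:
            suggestions.append((word, synset, SYNSET_TO_SUMO.get(synset)))
    return suggestions, missing


tokens = segment("置酒椰葉桄榔間", LEXICON)
suggestions, missing = map_tokens(tokens)
```

Words that land in the missing list are the ones a human editor would then check against WordNet by hand, which corresponds to the checking and extension step in the flow.
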
Distribution of Su Shi lexicon
• 98,430 words in volumes 1-45

The distribution of animal, plant, and artifact concepts in Su Shi’s poems

Comparing Two Ontologies: 300 Tang Poems and Collection of Su Shi’s Poems
• One conceptual node missing in both ontologies:
  • 有袋類 (marsupial)
• Concepts found in Su Shi’s but not in Tang 300
  • palm 棕櫚科植物 (plant -> woody plant -> tree -> palm): 椰葉 (coconut palm)、檳榔* (betel palm)
  • 無枝林 > 食檳榔 > 月照無枝林,
  • 椰葉 > 追餞正輔表兄至博羅,賦詩為別 > 置酒椰葉桄榔間。
Guangdong and Hainan Island

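A comparison of this kind reduces to set operations once each ontology is represented by the set of concept nodes it populates. The Python sketch below assumes that representation; apart from "marsupial" and "palm", the node names are hypothetical placeholders rather than data from the study.

```python
# Concept nodes of the reference ontology and the nodes populated by each corpus.
# Toy data: only "marsupial" and "palm" come from the slide; the rest is made up.
reference_nodes = {"bird", "hoofed_mammal", "insect", "palm", "marsupial"}
tang_nodes = {"bird", "hoofed_mammal", "insect"}
su_shi_nodes = {"bird", "hoofed_mammal", "insect", "palm"}

# Nodes that neither corpus populates.
missing_in_both = reference_nodes - (tang_nodes | su_shi_nodes)

# Nodes populated by Su Shi's poems but not by the 300 Tang Poems.
only_in_su_shi = su_shi_nodes - tang_nodes

# Nodes populated by the 300 Tang Poems but not by Su Shi's poems.
only_in_tang = tang_nodes - su_shi_nodes
```

With this toy data, `missing_in_both` is `{"marsupial"}` and `only_in_su_shi` is `{"palm"}`, mirroring the two findings listed above.
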
Comparing Two Ontologies: 300 Tang Poems and Collection of Su Shi’s Poems
• A word can stand for multiple concepts in the same source.

Example of WordNet lexical relation: 杜鵑 DuJuan
bird
  cuculiform_bird
    cuckoo
      ani
      roadrunner
      coucal
        Centropus_sinensis
        pheasant_coucal
shrub, bush
  rhododendron
    azalea

SUMO + WordNet: 杜鵑 DuJuan
SUMO
organism
  plant
    flowering plant
  animal
    invertebrate
    vertebrate
      warm blooded vertebrate
        mammal
        bird
WordNet
bird
  cuculiform_bird
    cuckoo
      ani
      roadrunner
      coucal
        Centropus_sinensis
        pheasant_coucal
shrub, bush
  rhododendron
    azalea

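The two senses of 杜鵑 can be traced upward through the combined hierarchy with a short Python sketch. The parent links are toy data read off the diagram; where the WordNet nodes attach to SUMO (bird under warm blooded vertebrate, shrub under flowering plant) is an assumption about the figure, and the function names are hypothetical.

```python
# Child -> parent links (toy data; the WordNet-to-SUMO attachment points are assumed).
PARENT = {
    "cuckoo": "cuculiform_bird",
    "cuculiform_bird": "bird",
    "bird": "warm_blooded_vertebrate",
    "warm_blooded_vertebrate": "vertebrate",
    "vertebrate": "animal",
    "animal": "organism",
    "azalea": "rhododendron",
    "rhododendron": "shrub",
    "shrub": "flowering_plant",
    "flowering_plant": "plant",
    "plant": "organism",
}

# One word, two senses in the same source.
SENSES = {"杜鵑": ["cuckoo", "azalea"]}


def hypernym_chain(node):
    """Walk from a node up to the root, returning every node on the way."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lowest_shared_node(a, b):
    """Return the first node on a's chain that also lies on b's chain."""
    on_b = set(hypernym_chain(b))
    for node in hypernym_chain(a):
        if node in on_b:
            return node
    return None


first, second = SENSES["杜鵑"]
meeting_point = lowest_shared_node(first, second)
```

Under these assumed links the two senses meet only at `organism`, which is one way to flag a word whose senses belong to different branches of the ontology.
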
What We Learned about Specific Ontology
Constructing an ontology from a larger corpus and comparing two specific ontologies
• Local information can be effectively mapped
• Global information offers deeper insights into the knowledge structure
☆ Human conceptualization of animals and plants has been relatively stable, but NOT that of artifacts.
☆ Regardless of the criteria for classification, genetically determined features (behaviors, appearances, etc.) do not vary greatly
☆ However, human technology is highly fluid. Our conceptualization of artifacts is highly dependent on the development of engineering and on our varying societal needs.

Towards a Workbench for Specific Ontology: Browser and Editor
User login, then Function menu (personal ontologies list), then Logout
• Browse an ontology
  • SUMO
  • SUMO + WordNet + concept map with lexicon
• Edit an ontology
  • Update lexical concepts
  • Update mapping between WordNet synset and lexicon
  • Edit other information in lexicon
• Add an ontology
  • Import text
  • Import lexicon
  • Word segmentation
  • Match concept and synset automatically
    • Suggestion list
    • Missing list

Constructing a Specific Ontology
• Import text, or domain lexicon
• Select style of writing
• Select category of word list for word segmentation
• Select reference ontologies to match SUMO and lexicon
• Information in the suggestion list
  • Candidate synset
  • Candidate synset synonyms
  • Explanation of candidate synset
  • Concept of candidate synset

Example of SUMO concept

http://bow.sinica.edu.tw/ont/SuShi_ont.html

Summary and Future Work
• Ontologies represent the knowledge structure of a domain or historical period
• We have provided an online interface to browse ontologies and lexica
• In the future, we will complete the online ontology editor and browser, which will
  • Map lexicon, WordNet and SUMO.
  • Integrate ontologies based on different texts.
  • Facilitate comparative studies of various domain ontologies.

From Specific Ontology to General Ontology
漢字知識本體 An introduction to Hanzi ontology
Research in collaboration with, and conducted by, Ya-Ming Zhou 周亞民

Outline
• Introduction
• The logographic features of Hanzi
• Semantic symbols of Hanzi
• The structure of lexicon relation
• The structure of Hanzi ontology
• Summary

Introduction (1/2)
• Ideograph: Each Chinese character (kanji) is a writing unit that also represents a pre-defined concept. The represented concept is independent of phonological variation, including language change and cross-lingual adaptation
• The complete Han writing system is expected to consist of 40,000-70,000 characters, each representing one or more concepts.

Logographic Features of Hanzi
• 馬 is a semantic symbol of horse
• Examples:
  • 驩:馬名 a kind of horse
  • 驫:眾馬 horses
  • 騎:騎馬 riding a horse
  • 驍:良馬 a good horse
  • 驚:馬驚 a scared horse

Semantic Symbols in Hanzi (1/3)
• The characteristics of Hanzi mainly come from semantic symbols.
• According to Xu Shen’s Shuowen Jiezi (100 A.D.), there are 540 semantic classes (radicals)
• These radicals represent the knowledge structure of Hanzi.

Semantic Symbols in Hanzi
• The 540 radicals are used to classify and represent all Chinese characters
• The semantic symbols about animals:
  • 鳥 (bird), 隹 (bird), 犬 (dog), 馬 (horse), 羊 (sheep), 虫 (insect)…
• The semantic symbols about plants:
  • 艸 (grass), 木 (tree), 竹 (bamboo), 禾 (grain)…
• The semantic symbols about religion:
  • 示

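The idea that radicals index characters by semantic class can be sketched in Python as a simple grouping. The radical table below is hand-entered toy data covering a few characters that appear in these slides; it is a hypothetical stand-in, not a real radical database, and the class labels follow the animal, plant, and religion groupings given above.

```python
from collections import defaultdict

# Hand-entered character -> radical table (toy data, not a real radical database).
RADICAL_OF = {
    "驩": "馬", "驫": "馬", "騎": "馬", "驍": "馬", "驚": "馬",
    "草": "艸", "菜": "艸", "苗": "艸",
}

# Radical -> semantic class, following the groupings on the slide.
RADICAL_CLASS = {"馬": "animal", "艸": "plant", "示": "religion"}


def group_by_radical(chars):
    """Group characters under their radical, skipping characters not in the table."""
    groups = defaultdict(list)
    for ch in chars:
        radical = RADICAL_OF.get(ch)
        if radical is not None:
            groups[radical].append(ch)
    return dict(groups)


def semantic_class(ch):
    """Look up the semantic class of a character through its radical."""
    return RADICAL_CLASS.get(RADICAL_OF.get(ch))


groups = group_by_radical("驩草驫菜騎")
```

Here `groups` maps 馬 to 驩, 驫, 騎 and 艸 to 草, 菜, while `semantic_class("驚")` returns `"animal"`.
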
The Classification of Hanzi with 艸 (艹)
Plants
• Name: 蕉蘭芒蒙菌蔓苦菊茱范荷茅蕈蔚菲草
• Description: 茲蒼芳落茸茂荒薄芬蒸莊
• Usage: 蕃藥蔬菜薪苑藩藉茭
• Parts: 萌莖芽茄苗蓮葉