Issues in Multilingual Thesauri
Managing Content • Managing Content relevant and related to an organization • Documentary Resources • Internally generated reports and other resources • Web Resources • CMS combine a variety of tools & technologies
Managing Content • Involves • Capturing • Storing • Managing • Preserving; and • Delivering Information
Managing Content • Document management • Collaboration • Web content management • Records management (long-term storage) • Need for vocabulary management: • Consistency in content representation • By creators (authors) • By indexers • By searchers • Thesauri are important tools for this purpose
LINGUISTIC DIVERSITY IN GLOBAL INFORMATION NETWORKS AND UNIVERSAL ACCESS TO INFORMATION IN CYBERSPACE ARE AT THE CORE OF CONTEMPORARY DEBATES AND CAN BE A DETERMINING FACTOR IN THE DEVELOPMENT OF A KNOWLEDGE-BASED SOCIETY • UNESCO
“… multilingual tools are getting importance as increasingly diverse groups from different cultural and linguistic backgrounds seek access to equally diverse pieces of information…” • Jorna & Davies • J.Doc.
Multilingual Thesauri • Multilingual Thesauri support, among other things: • Cross-walk between KO tools • Cross-cultural communication (including comparative studies) • Navigation between semantically related concepts (Terms) • Semantic navigation between concepts in a domain and related knowledge resources (bibliographical metadata, etc)
Multilingual Thesauri [Contd.] • Intelligent query expansion • Linguistic research • Future: • Improved natural language processing • Language recognition • Improved parsing • Concept resolution • Inferencing / Reasoning - Ontology
Background • Early DRTC interest in Thesaurus Building • F-Thes • OM Information System • The Present Project • Digital Library of Tamil Classics Characteristics: • More than one language • Culture-Specific Domains
F-THES • Subject coverage: Religious Mysticism • Time / period: No period restriction • Structure & presentation: Structure defined to generate independent language thesauri, if required; context-specifying elements used only occasionally
TAMTH • Subject coverage: Entire universe of subjects • Time / period: Sangam Period • Structure & presentation: Structure based on Tamil terms as the base / source (descriptor) with corresponding terms in the English language; context-specifying elements used for every descriptor
Background [Contd.] • The objective: • To employ the new thesaurus for vocabulary management: • In Indexing • User Interfaces • formulating search expressions and search strategies • Facilitating navigation between related terms (Narrower, Broader and other Related terms) • Value addition via links to relevant lexical tools
Issues • Humanities vis-à-vis Sciences • The Approach • Candidate Concepts & Terms • Issues Related to Script & Transliteration • Semantic Issues • Structural Issues • Management Issues • Handling NTs, BTs, RTs
Issues • Focus on: • Vocabulary management in bilingual and multilingual thesauri in culture-specific domains; • Special aspects of the Tamil language in this regard; • Alternative ways of linking descriptors to lengthy lists of NTs and RTs; • Advantages of integrated use of two or more knowledge organization tools • Many of the issues discussed here are unique to Thesauri in the domains of Humanities
The Approach • Combining existing thesauri • Merging two or more existing thesauri • Linking existing thesauri to each other • Translating an existing thesaurus into one or more other languages • Building a new thesaurus ‘bottom up’ • Starting with one language and adding another language or languages • Starting with more than one language simultaneously
The Approach • The candidate terms: • The corpus: both print-on-paper and electronic sources, e.g.:
1) Cologne online Tamil lexicon (COTL). [Based on Tamil Lexicon and supplement, 1924-1939]. http://webapps.uni-koeln.de/tamil/
2) Commemorative bibliography of the first 1008 books published by the South India Saiva Siddhanta Works Publishing Society / By S.R. Ranganathan and R. Muthukumaraswamy. Tirunelveli: The Society; 1961.
3) Periya puranam: a Tamil classic on the great Saiva Saints of South India / By Sekkizhaar. Condensed English version by G. Vanmikanathan and N. Mahalingam. Madras: Sri Ramakrishna Math; [1985].
4) Sub-forms of Tamil poetry and their classification / By S.R. Ranganathan and V. Thillainayagam. Annals of Library Science, 10(3); 1963; 175-185.
5) WordNet 2.1 (online).
6) Murugan, V. (200). Tolkappiam in English: Translation with the Tamil text transliteration in the Roman script, Introduction, glossary and illustrations / Project Director: Dr. G. John Samuel. Chennai: Institute of Asian Studies. ISBN 81-87892-05-6.
7) Tamil lexicon (1924-1939). Published under the authority of the University of Madras. Reprint 1982. v.I-VI + Supplement.
8) Thillainayagam, V. (1978). The cultural heritage of the Tamils: Library studies. Madras Institute of Tamil Studies, Seminar on Cultural Heritage of Tamils, 25-27 February 1978; p. 292-333. Also published in Pulamai, v.4, No.3-4; July-September 1978; p. 253-299.
The Approach [Contd.] • Records were first created in alphabetical order (from a to z); this was found to be tedious • The terms in the corpus were then grouped into broad categories, based on the Basic Classes of Colon Classification (C.C.) • The thesaurus is being maintained as a database (using WINISIS)
The Approach [Contd.] • Candidate concepts • Titles of Classics • Quasi classes: works that have attracted other works upon themselves
Script & Transliteration • Terms entered in the Roman script using the COTL scheme for transliteration (This is used by the Tamil Lexicon) • Supports automatic conversion to Tamil script • Records will eventually be in Tamil script
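The automatic conversion from Roman transliteration to Tamil script mentioned above can be illustrated with a minimal sketch. The mapping below is a hypothetical subset chosen only to cover the example term tAmarai; it is not the full COTL table, and the function name to_tamil is an assumption for illustration.

```python
# Hypothetical subset of a Roman-to-Tamil mapping (NOT the full COTL table).
CONSONANTS = {
    "k": "\u0b95",  # க
    "t": "\u0ba4",  # த
    "m": "\u0bae",  # ம
    "r": "\u0bb0",  # ர
    "l": "\u0bb2",  # ல
}
# Each vowel maps to (independent letter, dependent vowel sign).
VOWELS = {
    "ai": ("\u0b90", "\u0bc8"),  # ஐ / ை
    "A": ("\u0b86", "\u0bbe"),   # ஆ / ா  (upper case marks the long vowel)
    "a": ("\u0b85", ""),         # அ / inherent vowel, no sign needed
    "i": ("\u0b87", "\u0bbf"),   # இ / ி
}
PULLI = "\u0bcd"  # ் suppresses the inherent vowel of a consonant


def to_tamil(roman):
    """Convert a transliterated term to Tamil script (toy subset only)."""
    out = []
    i = 0
    after_consonant = False
    vowel_keys = sorted(VOWELS, key=len, reverse=True)  # try "ai" before "a"
    while i < len(roman):
        for key in vowel_keys:
            if roman.startswith(key, i):
                independent, sign = VOWELS[key]
                # A vowel after a consonant becomes a sign; otherwise a letter.
                out.append(sign if after_consonant else independent)
                after_consonant = False
                i += len(key)
                break
        else:
            ch = roman[i]
            if ch not in CONSONANTS:
                raise ValueError(f"unsupported character: {ch!r}")
            if after_consonant:
                out.append(PULLI)  # consonant cluster: close the previous one
            out.append(CONSONANTS[ch])
            after_consonant = True
            i += 1
    if after_consonant:
        out.append(PULLI)  # word-final bare consonant
    return "".join(out)


print(to_tamil("tAmarai"))  # தாமரை
```

Because upper and lower case carry different sounds in this kind of scheme, the conversion is case-sensitive; a real converter would need the complete table and rules for the remaining letters.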
Semantic Issues • Equivalence • Within the Language • A large number of synonyms in Tamil • Across Languages • Concepts unique to a culture (and so to the language); Non-Availability of terms in English for a large number of concepts • Near equivalent concepts • Use the original term
Semantic Issues [Contd.] Example • tAmarai (lotus) • mirunALam (Stalk of the Lotus) • tAmaraimuL (thorny portion of the stalk of the lotus)
Multiplicity of Synonyms • tAmarai: 82 synonyms in Tamil • Search term and number of records: • tAmarai: 327 entries with tAmarai as entry word or in the explanation • kamalam: 36 entries with kamalam as entry word or in the explanation • Lotus: 309 entries with Lotus as entry word or in the explanation
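The differing record counts suggest why a thesaurus-driven query expansion is useful: a searcher who enters only one synonym misses entries filed under the others. The sketch below assumes a hypothetical synonym ring holding just the three terms named on the slide, and toy records invented for illustration; it does not reproduce the lexicon's data.

```python
# Hypothetical synonym ring: only the terms named on the slide are included.
SYNONYM_RINGS = [{"tAmarai", "kamalam", "Lotus"}]


def expand_query(term):
    """Return the term together with all synonyms from its ring, sorted."""
    for ring in SYNONYM_RINGS:
        if term in ring:
            return sorted(ring)
    return [term]


def search(records, term):
    """Return records containing the term or any of its synonyms.

    Matching is case-sensitive because case is significant in the
    transliteration (e.g. 'A' and 'a' are different vowels).
    """
    wanted = expand_query(term)
    return [r for r in records if any(w in r for w in wanted)]


# Toy records, invented for illustration only.
records = [
    "tAmarai: entry word",
    "kamalam: entry word",
    "an entry explained with the word Lotus",
    "an unrelated entry",
]
hits = search(records, "kamalam")
print(len(hits))  # 3
```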
Semantic Issues [Contd.] • cAttunARRu = Young plants planted in place of the dead ones • aSTAgkaputti = Eight Kinds of Knowledge • cARvAkam = cAruvAka’s materialistic philosophy which says perception is the only source of knowledge
Semantic Issues [Contd.] • Homographs • tAmarai = Lotus plant; Lotus flower; Lotus as a shape (entities in the shape of a lotus); Lotus-like properties (e.g., soft like lotus petals) • appu = Thigh; Father; Loan; Debt; Domestic male servant; Water; Trumpet tree; Sixth division of day • May also have to do with the evolution in the meaning and connotation of terms in Tamil • kurinchi, mullai, marutam, neital, and palai
Semantic Issues [Contd.] • Homographs
Elam (spice)
SN Cardamom plant, Elettaria cardamomum; cardamom
UF ilAjncali (spice)
UF ilAjnci (spice)
UF kALintam (spice)
UF kaNmali (spice)
BT tAparavastu (plant)
BT2 tAparanUl (botany)
IlAjncali (spice)
Use Elam (spice)
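The record above can be modelled as a small data structure from which the reciprocal "Use" references and the chain of broader terms (BT, BT2) are derived. This is a sketch under assumptions: the class and function names are hypothetical and do not reflect the WINISIS database layout used in the project.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One thesaurus record (field names are hypothetical)."""
    descriptor: str
    scope_note: str = ""
    used_for: list = field(default_factory=list)  # UF: non-preferred terms
    broader: list = field(default_factory=list)   # BT: immediate broader terms


def use_references(record):
    """Derive the reciprocal 'Use' entry for every UF term."""
    return {uf: record.descriptor for uf in record.used_for}


def broader_chain(descriptor, records):
    """Follow the first BT link upwards, giving BT, BT2, ... in order."""
    chain, seen, current = [], set(), descriptor
    while current in records and records[current].broader and current not in seen:
        seen.add(current)  # guard against accidental cycles
        current = records[current].broader[0]
        chain.append(current)
    return chain


records = {
    "Elam (spice)": Record(
        descriptor="Elam (spice)",
        scope_note="Cardamom plant, Elettaria cardamomum; cardamom",
        used_for=["ilAjncali (spice)", "ilAjnci (spice)",
                  "kALintam (spice)", "kaNmali (spice)"],
        broader=["tAparavastu (plant)"],
    ),
    "tAparavastu (plant)": Record(
        descriptor="tAparavastu (plant)",
        broader=["tAparanUl (botany)"],
    ),
}

print(use_references(records["Elam (spice)"])["kaNmali (spice)"])  # Elam (spice)
print(broader_chain("Elam (spice)", records))
```

Storing only the immediate BT and computing BT2 by traversal keeps each record short, at the cost of a lookup per level.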
Homographs • The real meaning is to be understood in the context; Extensive use of Role Operators. Examples: • iTimpam (baby); iTimpam (castor); iTimpam (egg); iTimpam (misery); iTimpam (spleen) • Inverted Index will help users to select appropriate search term
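One way to read the inverted-index point is as an index from the bare term to every qualified descriptor that shares it, so a searcher typing iTimpam is shown all senses and picks one. The sketch below is an assumption about how such an index might be built; the function name and the "term (qualifier)" parsing rule are illustrative only.

```python
import re
from collections import defaultdict

# Descriptors taken from the slides, in "term (qualifier)" form.
DESCRIPTORS = [
    "iTimpam (baby)",
    "iTimpam (castor)",
    "iTimpam (egg)",
    "iTimpam (misery)",
    "iTimpam (spleen)",
    "Elam (spice)",
]


def build_index(descriptors):
    """Map each bare term to the qualified descriptors that share it."""
    index = defaultdict(list)
    for descriptor in descriptors:
        match = re.fullmatch(r"(.+?)\s*\((.+)\)", descriptor)
        base = match.group(1) if match else descriptor
        index[base].append(descriptor)
    return dict(index)


index = build_index(DESCRIPTORS)
for choice in index["iTimpam"]:
    print(choice)  # lists the five senses for the searcher to choose from
```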
Structural Issues • Hierarchy • Difficulties in developing corresponding hierarchies in two languages • Large Number NTs • Alternative Ways of Managing • Associative Relations • Links to Online lexical tools
Corresponding hierarchies in the two languages • tAmarai = Lotus • mirunALam = Stalk of the Lotus • tAmaraimuL = Thorny portion of the stalk of the lotus