Explore the importance of building corpora for minor Indian languages, their status, and the need for technological research and linguistic preservation. Learn about the categorization of languages and objectives for utilizing language technology to preserve endangered languages.
EACL 2003, Budapest: April 12 – 17, 2003
Computational Linguistics for South Asian Languages: Expanding Synergies with Europe
CORPORA IN MINOR LANGUAGES OF INDIA: SOME ISSUES
Dr. B. Mallikarjun
Central Institute of Indian Languages, Mysore 570 006, INDIA
mallikarjun@ciil.stpmy.soft.net
www.ciil.org/faculty/mallikarjun.html
www.ciilcorpora.net
Overview of the Presentation
1. Current status of corpora – major Indian languages
2. Current status of corpora – minor Indian languages
3. Importance of minor languages corpora
4. Objectives
5. Categorization of minor languages for corpora building
6. Minor languages: A sample
7. Issues in corpora building
8. Corpus processing tools – a. Basic b. Advanced
9. Conclusion and a mission
EACL 2003, CLSAL: Budapest – April 12 – 17, 2003
Major Indian Languages
India has 1652 mother tongues belonging to 4 language families. The 8th Schedule of the Constitution of India recognizes 18 languages, spoken by 96.29% of the population.
Current status of their corpora:
Assamese     2,622,836
Bengali      3,535,863
Gujarati
Hindi        3,003,004
Kannada      2,239,537
Kashmiri     2,266,588
Konkani
Malayalam    2,349,526
Manipuri
Marathi      2,213,241
Nepali
Oriya        2,727,670
Punjabi      1,966,260
Sanskrit
Sindhi
Tamil        3,381,525
Telugu       3,967,926
Urdu         1,64,125
* The corpora differ in quantum.
* They are of comparable quality.
* Quantum and coverage are inadequate for wider NLP activities.
* The corpora need to be augmented with wider coverage.
* Attempts at enhancement have run into problems that need immediate solutions.
Minor Indian languages and status of corpora
* 1634 are minor languages, spoken by 3.71% of the population.
* The Indo-Aryan and Dravidian language families have both major and minor languages.
* Almost all the languages of the other two families, Munda and Tibeto-Burman, are “minor” languages.
* Text corpora building has not taken place in these languages.
Importance of minor languages corpora
• Minor languages hardly attract the attention of policy makers anywhere in the world.
• They are endangered in Indian social, educational and linguistic contexts.
• Linguists take great interest in studying the richness of languages and try to save endangered languages from extinction.
• Minor languages hardly attract, or become a source for, technological research.
• Technology has made it possible to empower all languages, whether major or minor.
• Creating corpora in minor languages, especially those with little or no written literature, has certain critical advantages for linguistic computing.
• Experimentation with corpora designs and standards is easier in these languages because the quantum of data is manageable.
Objectives
• Archival and cross-linguistic comparison within a language family and across language families.
• Utilize language technology for their preservation and continued use.
• Fine-tune language analysis where grammatical analysis is available.
• Use the machine-readable form of the texts to produce as precise an analysis of the language as possible wherever such analysis is not available.
• Also use some of the minor language corpora for machine translation purposes.
• Speech corpora too have more significance in minor languages, since most of them exist in spoken form and many are yet to be rendered into written form.
• Indigenous knowledge systems: Most of the minor languages are resources of cultural heritage and a treasure house of indigenous knowledge systems. Once this knowledge is available in machine-readable form, it can be made available to the universal knowledge base by using UNL.
Categorization of minor languages
Minor languages can be classified into 3 groups on the basis of the issues to be tackled while building corpora.
First category: Languages other than the 18 major languages that have a good amount of literary and other texts and are also used in wider domains, like Bodo, Kurukh, Maithili, Santhali, Tripuri etc.
Second category: Languages with a limited quantity of written texts that are not widely used in different domains such as education, administration etc., like Kodava, Tulu, etc.
Third category: Languages available only in spoken form and yet to be rendered into written form, like Toda, Kota, Yerava, etc.
Minor languages: A sample
Name                Lg. family    Text                           Script        No. of speakers
Maithili            Indo-Aryan    Yes                            Devanagari    77,66,597
Kodava or Coorgi    Dravidian     Very little                    Kannada       97,011
Yerava              Dravidian     Indigenous Knowledge System    No script     13,689
These languages are representative of the ground linguistic reality in India.
Issues in corpora building
Issue: Text
  Major language: Sampling - domains, period
  Minor language: All available text / all transcribed speech: Maithili, Kodava and Yerava
Issue: Technical (keyboard, input and storage)
  Major language: Standard software based on the grammar of the concerned script and UNICODE for Kannada: - 1, 2, 3, 4.
  Minor language: Incompatibility: the adopted software does not accommodate all the features of Maithili, Kodava and Yerava
Corpus processing tools
Basic tools for statistical analysis
Frequency count of words and syllables: The facilities created for languages like Hindi and Kannada are available; wherever necessary, language-specific modifications are made before they are used.
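A word frequency count of the kind mentioned above can be sketched as follows. This is only an illustration, not the CIIL tool: it assumes whitespace tokenization, the function name word_frequencies is hypothetical, and the sample string is a placeholder rather than corpus data. Syllable counts would additionally need a script-specific syllable segmenter, which is not shown.

```python
from collections import Counter


def word_frequencies(text):
    """Count how often each whitespace-separated word form occurs."""
    # Assumption: splitting on whitespace is enough to separate words.
    # Real corpora also need script-aware handling of punctuation.
    return Counter(text.split())


# Placeholder tokens for illustration only, not real corpus data.
sample = "ka ra ka ru ka ra"
freq = word_frequencies(sample)

# List word forms from most to least frequent.
for word, count in freq.most_common():
    print(word, count)
```

Language-specific modifications would mostly go into the tokenization step, leaving the counting unchanged.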
Statistical Analysis
Comparison of Maithili, Kodava and Yerava Corpora
Statistical distribution    Maithili    Kodava    Yerava
Corpus size                 328146      9432      3881
Word types                  51902       6050      3030
Most frequent syllable      ka          ra        ru
Average word length %       3.52        5.70      3.10
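The rows of such a table can be computed from a tokenized corpus roughly as shown below. This is a minimal sketch under stated assumptions: the function name corpus_statistics is hypothetical, and word length is measured in Unicode code points, which may not be the unit used for the figures in these slides.

```python
def corpus_statistics(tokens):
    """Return corpus size, number of word types and average word length."""
    size = len(tokens)        # corpus size: running words (tokens)
    types = len(set(tokens))  # word types: distinct word forms
    # Assumption: length is counted in Unicode code points. For Indic
    # scripts a syllable-based count would give different values.
    avg_len = sum(len(t) for t in tokens) / size if size else 0.0
    return {
        "corpus_size": size,
        "word_types": types,
        "avg_word_length": avg_len,
    }


# Placeholder tokens for illustration only, not real corpus data.
stats = corpus_statistics(["aa", "bbb", "aa", "c"])
print(stats)
```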
Comparison of Maithili and Hindi Corpora
Statistical distribution    Maithili    Hindi (CIIL)    Hindi (Naiduniya)    Hindi (India Today)    Hindi (Premchand)
Corpus size                 328146      2327129         3140729              1566779                671171
Word types                  51902       189860          71953                47640                  24745
Most frequent syllable      ka          ka              ka                   ka                     ka
Average word length %       3.52        4.96            4.96                 4.71                   4.36
Comparison of Kodagu, Yerava, Kannada and Malayalam Corpora
Statistical distribution     Kodagu    Yerava    Kannada    Malayalam
Corpus size                  9432      3881      1977987    2119935
Word types                   6050      3030      346850     526802
Average word length %        5.70      3.10      8.68       10.25
Average sentence length %    4.64      4.36      8.42       6.93
Most frequent syllable       ra        ru        ru         ra
Basic tools for retrieval
• Key Word in Context
• Search by required word
• Sorting and indexing
The facilities created for languages like Hindi and Kannada are available; wherever necessary, language-specific modifications can be made before they are used.
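A Key Word in Context search with sorting can be sketched as follows. This is an illustration only: the function name kwic and the window parameter are hypothetical, and the corpus is assumed to be a list of already separated word tokens.

```python
def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            # Take up to `window` tokens on each side of the match.
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    # Sort the concordance lines by their right-hand context.
    return sorted(lines, key=lambda line: line[2])


# Placeholder tokens for illustration only, not real corpus data.
tokens = "a b c d b e".split()
concordance = kwic(tokens, "b", window=1)
for left, word, right in concordance:
    print(f"{left:>10} [{word}] {right}")
```

Sorting by the left context instead only requires changing the sort key.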
Advanced tools for analysis
• Part-of-speech tagging
• Morphological analyzer
Part of speech tagging
1. Non-availability of a standard basic tag set is one of the major drawbacks.
2. Each institution/group of scholars uses its own notation: CLAWS, Research institution in IT, CIIL (Maj. lg.), CIIL (Min. lg.).
3. The tagging tools being developed even for major languages are at different stages of development.
4. The POS tagging tool developed for Hindi can be tried out first on Maithili to test its viability. Hindi itself does not yet have a fully working POS tagging tool.
5. Due to the limited data in Kodava and Yerava, manual tagging is preferred.
Morphological analyzer
The Morphological Analyzers designed for the minor languages of India should be sensitive enough to take care of their specific features.
• Tagged lexicon
• Rules to cover the processes of:
  • Inflection: suffixing is normally based on the word ending
  • Derivation: both prefixing and suffixing are possible, depending on the lexical item
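The combination of a tagged lexicon with inflection rules that depend on the word ending can be sketched as below. Everything in the example is an assumption made for illustration: the stems, suffixes, tags and the vowel/consonant condition are invented and do not describe Maithili, Kodava or Yerava.

```python
# Hypothetical tagged lexicon: stem -> part-of-speech tag (invented forms).
LEXICON = {"tako": "NOUN", "mir": "NOUN"}

# Hypothetical inflection rules: (required stem ending, suffix, feature).
RULES = [
    ("vowel", "na", "PLURAL"),
    ("consonant", "ina", "PLURAL"),
]

VOWELS = set("aeiou")


def analyze(word):
    """Return every (stem, tag, feature) analysis licensed by the rules."""
    analyses = []
    if word in LEXICON:
        # The bare stem is itself a valid, uninflected word form.
        analyses.append((word, LEXICON[word], None))
    for ending, suffix, feature in RULES:
        if not word.endswith(suffix):
            continue
        stem = word[:-len(suffix)]
        if not stem:
            continue  # the word consists of the suffix alone
        # The choice of suffix depends on how the stem ends.
        stem_class = "vowel" if stem[-1] in VOWELS else "consonant"
        if stem in LEXICON and stem_class == ending:
            analyses.append((stem, LEXICON[stem], feature))
    return analyses


print(analyze("takona"))
print(analyze("mirina"))
print(analyze("mirna"))
```

Derivation rules with prefixes would need a further rule type conditioned on the lexical item, which this sketch leaves out.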
Semantics and Pragmatics
The Yerava word ‘-ati’ has three meanings, ‘to sweep’, ‘wind blow’ and ‘bottom’, and the appropriate meaning has to be chosen depending on the context. In such cases the morphological analyzer demands a semantic tool.
The Kodava word bappe has the meaning ‘I am coming’, but when it is used in the context of leave taking, it means ‘I am leaving.’ Cultural nuances in the context of leave taking do not allow one to use the word poope ‘going or leaving’, because it would only mean that the person is saying the ultimate good-bye to this world. It is possible to judge the meaning of such words only with knowledge of the culture represented by a language.
Disambiguation
Ambiguities are seen in three senses: word sense, pronoun sense and structural sense. Word sense ambiguities arise from words having multiple meanings and are found in all languages. With regard to the second, pronominal and adjectival anaphora are also sources of ambiguity.
In English, disambiguation tools have been developed. After the inception of a few lexical databases such as WordNet, EuroWordNet, etc., researchers seem to have overcome the ambiguity problem to a certain extent. In the case of Indian languages, however, in the absence of such a sensitive tool, one has to work manually in order to disambiguate, even in the case of major languages. Minor languages need better linguistic analysis to arrive at tangible and usable disambiguation procedures.
Conclusion
• India abounds in endangered languages. Technology can actually help maintain a language.
• Technology should immediately take into account the concerns of minority languages. In particular, major language technologies of the region should accommodate the needs of the minor languages too.
• Corpora building in minor languages poses new challenges and calls for novel ways to accommodate and adequately describe the distinctive features of these languages.
• Comparison of corpora studies within a family of languages, across families of languages and at the international level will be helpful in bringing out a standard module for developing corpora.
Thank You
8.1 Kannada Code Chart
Demography; Astrology; Criminology; Physical Education / Sports; Health and Family Welfare; Forestry; Sexology; Culture & Anthropology
Commerce; Banking; Accountancy; Industry & Handicrafts; Finance; Textile Technology
Official and Media Languages; Mass Media; Legislative; Administrative
Translated Material; Literature; Scientific; Legal; Administration
Translated; Psychology; Film Technology; Photography; Marine Biology; Fisheries; Textile Technology
Social Sciences; Sociology; Linguistics; Psychology; Anthropology; History, Archeology, Epigraphy; Political Science; Home Science; Library Science; Religion, Philosophy; Economics; Logic; Journalism; Folklore/Mythology; Public Administration; Law; Business Management; Education; Text Books - Social Science
Natural, Physical and Professional Sciences; Botany; Zoology; Geology; Geography; Bio Chemistry; Micro Biology; Physics; Chemistry; Mathematics; Statistics; Computer Sciences; Astronomy; Text Book (Science); Medicine; Ayurveda; Homeopathy; Yoga; Naturopathy; Engineering; Architecture; Oceanology; Agriculture; Veterinary
Aesthetics; Literature; Novel; Short Story; Essays; Criticism; Humour; Children's Literature; Biographies & Autobiographies; Travelogues; Letters/Diaries/Speeches; Plays; Science Fiction; Folk Tales; Text Books (School); Social Sciences
Fine Arts; Music; Dance/Impersonations; Drawing; Sculpture; Musical Instruments; Hobbies