XML-Unicode environment for creating and accessing of Indian language theses: Vidyanidhi experiences
Shalini R. Urs
Vidyanidhi Digital Library, University of Mysore, Mysore, India
shalini@vidyanidhi.org.in
Indo-US Workshop, June 25, 2003
Vidyanidhi Digital Library
• Vidyanidhi began as a pilot project in 2000
  • Supported by NISSAT, DSIR, GOI
• The objective was to demonstrate the feasibility of an Electronic Thesis and Dissertation (ETD) initiative in the Indian context
• It is now evolving into a national effort
  • Supported by the Ford Foundation
Vidyanidhi: Vision
To evolve into an information infrastructure that strengthens the research capacities of Indian universities by:
• Developing accessible digital libraries of theses and dissertations
• Sensitizing and training doctoral research students in scholarly writing, e-publishing and ETDs
• Developing appropriate policies
• Developing or making available the requisite tools and resources
Vidyanidhi: Strategies
• Policy framework: through meetings, liaison, participation
• Education and training
• Content building: full text and metadata
• Resources and tools (software, interfaces…)
Indian Academic Research Output
• Large system of higher education
• More than 300 universities: a reservoir of extensive doctoral research work
• Doctoral research output: around 30,000 theses annually
• English is the predominant language
• Increasing vernacularisation: 20-25% of the output is in Indian languages
• This trend is growing, resulting in more and more research output in Indian languages
Language Interoperability
• The Vidyanidhi approach has been guided by the language interoperability factor
• Our choice of technology and tools has to be interoperable across languages
Indian Languages: Diversity
• The diversity of Indian languages and scripts is overwhelming
• India is made up of a number of separate linguistic communities, each of which shares a common language and culture
• Number of languages listed for India: 418
  • 407 are living languages
  • 11 are extinct
• Many languages have no script of their own
Eighteen Indian languages
Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Sindhi, Tamil, Telugu, Urdu
Language Families of Indian Languages
• Indo-European: North and Central India
• Dravidian: South India
• Mon-Khmer: Assam and some eastern parts of India
• Sino-Tibetan: Northern Himalayan and Burmese border area
Indian Scripts
• Although the languages belong to four different language families, Indian scripts have a common root
• Scripts of all Indian languages are derived from Brahmi
• As a result, there is greater uniformity in the arrangement of the alphabets
Indian Alphabet: Characteristics
• Consonants
  • Five vargs (groups)
  • Non-varg
  • Have an implicit vowel
• Anuswar (a nasal consonant)
• Chandrabindu (a nasalisation sign)
• Visarg
• Vowels and vowel signs
• Vowel omission sign (halant)
• Conjuncts
Indian Languages and scripts
• Indic scripts are syllable-oriented and phonetically based, with imprecise character sets
• The scripts look different (different shapes), but their alphabet bases and script grammars are largely similar, with subtle differences
Indian Languages and scripts: Issues
• Indic characters consist of consonants, vowels, dependent vowels (called 'matras'), or a combination of any or all of them (called conjuncts)
• Collation (sorting) is a contentious issue because the script is phonetically based rather than alphabet based
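How a conjunct is built from separate characters can be illustrated with a small Python sketch. It assumes the Unicode encoding of Devanagari, and the conjunct chosen ("ksha") and the helper name `describe` are illustrative only, not part of the Vidyanidhi tools.

```python
import unicodedata

# Devanagari conjunct "ksha": consonant KA + halant (virama) + consonant SSA.
# The renderer shows one shape, but the stored text is three code points.
conjunct = "\u0915\u094D\u0937"

def describe(text):
    """Return the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]

names = describe(conjunct)
for ch, name in zip(conjunct, names):
    print(f"U+{ord(ch):04X}  {name}")
```

The halant suppresses the implicit vowel of the first consonant, which is what lets the two consonants join into a single displayed shape.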
Handling Indian Languages: Possible approaches
• Transliteration: glyph-based approach
  • Indic characters are encoded in either ASCII or some proprietary encoding
  • Glyph technologies are used to display and print Indic scripts
  • Currently the most popular approach for desktop publishing
Handling Indian Languages: Possible approaches
• Develop an encoding system covering every possible character and combination, running to nearly 13,000 characters per language, with any new combination requiring a new character; this approach was developed and adopted by the IIT Madras development team
• Adopt the ISCII/Unicode encoding
ISCII: Indian Script Code for Information Interchange
• ISCII-91 is a BIS standard, IS 13194:1991
• An outcome of the efforts of the Govt. of India, DOE, MIT, C-DAC and many other institutions
• It is an 8-bit code
• It is an extension of the 7-bit ASCII code
• The upper 128 characters cater to the 10 Indian scripts
Unicode
• The Unicode Consortium has encoded all of the world's scripts
• Unicode represents a carefully thought-out, technically impressive and full-featured attempt at encoding Indic scripts
• Unicode has unique code points for all of the Indic scripts
Script       Unicode Range      Major Languages
Devanagari   U+0900 to U+097F   Hindi, Marathi, Sanskrit
Bengali      U+0980 to U+09FF   Bengali, Assamese
Gurmukhi     U+0A00 to U+0A7F   Punjabi
Gujarati     U+0A80 to U+0AFF   Gujarati
Oriya        U+0B00 to U+0B7F   Oriya
Tamil        U+0B80 to U+0BFF   Tamil
Telugu       U+0C00 to U+0C7F   Telugu
Kannada      U+0C80 to U+0CFF   Kannada
Malayalam    U+0D00 to U+0D7F   Malayalam
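Because each script has its own block, the script of a character can be read off its code point. A minimal Python sketch, using only the ranges from the table; the names `INDIC_BLOCKS`, `script_of` and `scripts_in` are hypothetical and chosen for illustration:

```python
# Block ranges copied from the table (both ends inclusive).
INDIC_BLOCKS = [
    ("Devanagari", 0x0900, 0x097F),
    ("Bengali", 0x0980, 0x09FF),
    ("Gurmukhi", 0x0A00, 0x0A7F),
    ("Gujarati", 0x0A80, 0x0AFF),
    ("Oriya", 0x0B00, 0x0B7F),
    ("Tamil", 0x0B80, 0x0BFF),
    ("Telugu", 0x0C00, 0x0C7F),
    ("Kannada", 0x0C80, 0x0CFF),
    ("Malayalam", 0x0D00, 0x0D7F),
]

def script_of(ch):
    """Return the Indic script whose block contains ch, or None."""
    cp = ord(ch)
    for name, lo, hi in INDIC_BLOCKS:
        if lo <= cp <= hi:
            return name
    return None

def scripts_in(text):
    """Return the set of Indic scripts that occur in text."""
    return {s for s in map(script_of, text) if s is not None}

# U+0915 is a Devanagari letter, U+0C95 a Kannada letter.
print(scripts_in("\u0915 \u0C95 abc"))
```

A check like this could, for example, route a metadata record to the table for its script, though the slides do not say how Vidyanidhi makes that decision.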
Unicode implementation for Indic scripts
• Despite its robustness, technical soundness and practical viability, Unicode implementation for Indic scripts is almost non-existent
• Our search of the major databases (LISA, INSPEC, WOS) did not turn up any initiative in this direction
• Vidyanidhi is an example of a successful implementation of Unicode for Indic scripts
Vidyanidhi approaches
• Taking Indian language theses to the Web
  • Full text
  • Metadata
MS Word to XML
• Template for thesis in MS Word
• Student submits thesis in Word
• Convert to XML using the RTF to XML Converter
• Take them to the Web
Full Text
• Vidyanidhi provides tools for the creation of theses in Indian languages
• Our approach is to:
  • Provide a style sheet/template online
  • Convert the thesis, once submitted, into XML encoded in Unicode
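What "XML encoded in Unicode" means for an Indic-language record can be sketched in Python. This is not the RTF to XML Converter mentioned in the slides; the element names (`thesis`, `title`, `author`), the function `thesis_to_xml` and the sample values are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

def thesis_to_xml(title, author, language):
    """Serialise a minimal thesis record as UTF-8 encoded XML bytes."""
    thesis = ET.Element("thesis", attrib={"lang": language})  # hypothetical schema
    ET.SubElement(thesis, "title").text = title
    ET.SubElement(thesis, "author").text = author
    return ET.tostring(thesis, encoding="utf-8")

# Placeholder title: the word "Kannada" written in Kannada script.
kannada_title = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"
data = thesis_to_xml(kannada_title, "Sample Author", "kn")

# The Kannada text survives a round trip through the XML bytes.
parsed = ET.fromstring(data)
print(parsed.find("title").text == kannada_title)
```

Because the text is stored as Unicode code points rather than font-specific glyph codes, any Unicode-aware XML tool can read it back without knowing which font the thesis was typed in.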
English Thesis Template
Kannada Thesis Template
Vidyanidhi database approach
• Each script/language has its own table. Currently there are three separate tables for the three scripts: one each for Roman, Hindi (Devanagari) and Kannada
• A thesis in an Indic language has two records: one in Roman script (transliterated) and one in the vernacular. A thesis in English has only one record (in English)
Vidyanidhi database approach
• The two records are linked by the ThesisID number, a unique id for the record
• The bibliographic description of Vidyanidhi follows the ThesisMS Dublin Core standard adopted by the NDLTD and OCLC
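The one-table-per-script layout with records linked by ThesisID can be sketched as follows. The sketch uses Python's built-in `sqlite3` rather than MS SQL; the table names, column names and sample rows are invented for illustration and are not the actual Vidyanidhi schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE roman   (thesis_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE kannada (thesis_id INTEGER PRIMARY KEY, title TEXT);
""")

# Thesis 1 is in Kannada: a transliterated record plus a vernacular record.
conn.execute("INSERT INTO roman VALUES (1, 'Kannada')")
conn.execute("INSERT INTO kannada VALUES (1, ?)",
             ("\u0c95\u0ca8\u0ccd\u0ca8\u0ca1",))
# Thesis 2 is in English: a single record in the Roman table.
conn.execute("INSERT INTO roman VALUES (2, 'Sample English thesis')")

def records_for(thesis_id):
    """Return (roman_title, vernacular_title); the latter is None for English theses."""
    return conn.execute(
        """SELECT r.title, k.title
           FROM roman r LEFT JOIN kannada k ON k.thesis_id = r.thesis_id
           WHERE r.thesis_id = ?""",
        (thesis_id,),
    ).fetchone()

print(records_for(1))
print(records_for(2))
```

The `LEFT JOIN` keeps the Roman record even when no vernacular record exists, which matches the rule that English theses have only one record.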
Vidyanidhi: Platform
• Microsoft
  • Windows XP supports all 10 Indic scripts
  • Uses Windows glyph processing:
    • OpenType font format
    • Uniscribe (Unicode Script Processor)
    • OpenType Layout Services library
Vidyanidhi: Platform
• MS SQL 2000
  • A truly multilingual-capable SQL
  • Achieves satisfactory collation
• Front end: ASP
  • JavaScript
Vidyanidhi: Accessing and Searching
• One can search the Vidyanidhi database in either of two ways:
  • In English (Roman script): the integrated (master) database has metadata records for theses in all languages
  • In the vernacular: the vernacular database has records of the specific language only
Two approaches: differences
• One affords search in English, the other in the vernacular
• The first approach returns records in Roman script for all theses that satisfy the query, with an option to view the vernacular-script record for theses written in the vernacular
• The second approach searches only the vernacular database and is therefore limited to records in that language
• However, it allows the search itself to be in the vernacular language and script
Unicode and Indic Scripts
• The Vidyanidhi implementation dispels certain misconceptions about Unicode
• Supposed problems:
  • Data input
  • Display and printing
  • Collation
Data input/Keyboard layout
Our test bed and comparison with other methods:
• The Unicode layout is as easy as the others in terms of speed
• In number of keystrokes there is no difference, and sometimes the Unicode method involves fewer keystrokes
• Data input productivity was almost comparable to that for English records