Shalini R. Urs, Vidyanidhi Digital Library, University of Mysore, Mysore, India

Presentation Transcript


  1. XML-Unicode environment for creating and accessing of Indian language theses: Vidyanidhi experiences. Shalini R. Urs, Vidyanidhi Digital Library, University of Mysore, Mysore, India. shalini@vidyanidhi.org.in. Indo-US Workshop, June 25, 2003

  2. Vidyanidhi Digital Library • Vidyanidhi began as a pilot project in 2000 • Supported by NISSAT, DSIR, GOI • The objective was to demonstrate the feasibility of an Electronic Thesis and Dissertation (ETD) initiative in the Indian context • It is now evolving into a national effort • Supported by the Ford Foundation

  3. Vidyanidhi: Vision. To evolve into an information infrastructure that strengthens the research capacities of Indian universities by: • Developing accessible digital libraries of theses and dissertations • Sensitizing and training doctoral research students in scholarly writing, e-publishing and ETDs • Developing appropriate policies • Developing or making available the requisite tools and resources

  4. Vidyanidhi: Strategies • Policy framework, through meetings, liaison and participation • Education and training • Content building: full text and metadata • Resources and tools (software, interfaces…)

  5. Indian Academic Research Output • Large system of higher education • More than 300 universities: a reservoir of extensive doctoral research work • Doctoral research output: around 30,000 annually • English is the predominant language • Increasing vernacularisation: 20-25% is in Indian languages • This trend is growing, so more and more research output appears in Indian languages

  6. Language Interoperability • The Vidyanidhi approach has been guided by the need for language interoperability • Our choice of technology and tools has to be interoperable across languages

  7. Indian Languages: Diversity • The diversity of Indian languages and scripts is overwhelming • India is made up of a number of separate linguistic communities, each of which shares a common language and culture • The number of languages listed for India is 418 • 407 are living languages • 11 are extinct • Many languages have no script of their own

  8. Eighteen Indian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Sindhi, Tamil, Telugu, Urdu

  9. Language Families of Indian Languages • Indo-European: North and Central India • Dravidian: South India • Mon-Khmer: Assam and some eastern parts of India • Sino-Tibetan: northern Himalayan and Burmese border areas

  10. Indian Scripts • Although the languages belong to four different language families, Indian scripts have a common origin • The scripts of all Indian languages are derived from Brahmi • This gives greater uniformity in the arrangement of the alphabets

  12. Indian Alphabet: Characteristics • Consonants • Five vargs (groups) • Non-varg • Have an implicit vowel • Anuswar (a nasal consonant) • Chandrabindu (a nasalisation sign) • Visarg • Vowels and vowel signs • Vowel omission sign (halant) • Conjuncts

  13. Indian Languages and Scripts • Indic scripts are syllable-oriented and phonetically based, with imprecise character sets • The scripts look different (different shapes) but have largely similar, yet subtly different, alphabet bases and script grammars

  14. Indian Languages and Scripts: Issues • Indic characters consist of consonants, vowels, dependent vowel signs called 'matras', or combinations of any or all of these, called conjuncts • Collation (sorting) is a contentious issue because the scripts are phonetically based rather than alphabet based
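
The character model described in slides 12 and 14 can be illustrated with a few Devanagari code points. This is a minimal sketch using Python's standard `unicodedata` module; the choice of letters and the helper name `describe` are assumptions made for illustration and do not come from the presentation.

```python
import unicodedata

# Devanagari code points chosen for illustration only.
KA = "\u0915"      # consonant KA, carries an implicit vowel
VIRAMA = "\u094D"  # vowel omission sign (halant)
SSA = "\u0937"     # consonant SSA
SIGN_I = "\u093F"  # dependent vowel sign (matra) I

def describe(text):
    """Return the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]

# A conjunct is stored as consonant + halant + consonant,
# even though it is usually displayed as a single shape.
conjunct = KA + VIRAMA + SSA

# A consonant followed by a matra replaces the implicit vowel.
syllable = KA + SIGN_I

for name in describe(conjunct):
    print(name)
```

The point of the sketch is that one displayed conjunct corresponds to several stored code points, which is why display and collation need script-aware processing.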

  15. Handling Indian Languages: Possible Approaches • Transliteration: a glyph-based approach • Indic characters are encoded in either ASCII or some proprietary encoding • Glyph technologies are used to display and print Indic scripts • Currently the most popular approach for desktop publishing

  16. Handling Indian Languages: Possible Approaches • Develop an encoding for every possible character and combination, running to nearly 13,000 characters in each language, with the possibility that a new combination leads to a new character; this approach was developed and adopted by the IIT Madras development team • Adopt the ISCII or Unicode encoding

  17. ISCII: Indian Script Code for Information Interchange • ISCII-91 is a BIS standard, IS 13194:1991 • An outcome of the efforts of the Govt. of India, DOE, MIT, C-DAC and many other institutions • It is an 8-bit code • It is an extension of the 7-bit ASCII code • The top 128 characters cater to the 10 Indian scripts
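
The layout described above (7-bit ASCII in the lower half, Indic characters in the top 128 positions) can be sketched as a simple split on the high bit. This does not decode ISCII; the function name `split_halves` and the sample byte values are assumptions for illustration and are not taken from the standard's code table.

```python
def split_halves(data: bytes):
    """Separate the bytes of an 8-bit stream into the ASCII half and the upper half.

    Bytes 0x00-0x7F are plain 7-bit ASCII; bytes 0x80-0xFF fall in the
    upper 128 positions that an 8-bit extension such as ISCII uses.
    """
    ascii_part = bytes(b for b in data if b < 0x80)
    upper_part = bytes(b for b in data if b >= 0x80)
    return ascii_part, upper_part

# Arbitrary sample bytes: two ASCII letters followed by two upper-half bytes.
sample = b"AB\xa4\xb3"
ascii_part, upper_part = split_halves(sample)
print(ascii_part, upper_part)
```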

  18. Unicode • The Unicode Consortium has encoded all of the world's scripts • Unicode represents a carefully thought out, technically impressive and full-featured attempt at encoding Indic scripts • Unicode has unique code points for all of the Indic scripts

  19. Script: Unicode range (major languages)
     • Devanagari: U+0900 to U+097F (Hindi, Marathi, Sanskrit)
     • Bengali: U+0980 to U+09FF (Bengali, Assamese)
     • Gurmukhi: U+0A00 to U+0A7F (Punjabi)
     • Gujarati: U+0A80 to U+0AFF (Gujarati)
     • Oriya: U+0B00 to U+0B7F (Oriya)
     • Tamil: U+0B80 to U+0BFF (Tamil)
     • Telugu: U+0C00 to U+0C7F (Telugu)
     • Kannada: U+0C80 to U+0CFF (Kannada)
     • Malayalam: U+0D00 to U+0D7F (Malayalam)
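
Because each script occupies its own block, the script of a character can be read off its code point. The following is a minimal sketch built from the ranges in slide 19; the names `INDIC_BLOCKS`, `script_of` and `scripts_in` are hypothetical helpers, not part of Vidyanidhi.

```python
# (script, first code point, last code point), as listed in slide 19.
INDIC_BLOCKS = (
    ("Devanagari", 0x0900, 0x097F),
    ("Bengali", 0x0980, 0x09FF),
    ("Gurmukhi", 0x0A00, 0x0A7F),
    ("Gujarati", 0x0A80, 0x0AFF),
    ("Oriya", 0x0B00, 0x0B7F),
    ("Tamil", 0x0B80, 0x0BFF),
    ("Telugu", 0x0C00, 0x0C7F),
    ("Kannada", 0x0C80, 0x0CFF),
    ("Malayalam", 0x0D00, 0x0D7F),
)

def script_of(ch):
    """Return the Indic script whose block contains ch, or None."""
    cp = ord(ch)
    for name, first, last in INDIC_BLOCKS:
        if first <= cp <= last:
            return name
    return None

def scripts_in(text):
    """Return the set of Indic scripts that occur in text."""
    return {s for s in map(script_of, text) if s is not None}

print(scripts_in("ಕನ್ನಡ and हिन्दी"))
```

A lookup of this kind could, for example, decide which script-specific table a record belongs to.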

  20. Unicode Implementation for Indic Scripts • Despite its robustness, technical soundness and practical viability, Unicode implementation for Indic scripts is almost nonexistent • Our search of the major databases (LISA, INSPEC, WOS) did not turn up any initiative in this direction • Vidyanidhi is an example of a successful implementation of Unicode for Indic scripts

  21. Vidyanidhi Approaches • Taking Indian language theses to the Web • Full text • Metadata

  22. MS Word to XML • Template for the thesis in MS Word • Student submits the thesis in Word • Convert to XML using the RTF to XML converter • Take it to the Web

  23. Full Text • Vidyanidhi provides tools for the creation of theses in Indian languages • Our approach is to: • provide a style sheet/template online • convert the submitted thesis into XML encoded in Unicode
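
The final step of this workflow, producing XML encoded in Unicode, can be sketched with Python's standard `xml.etree.ElementTree`. This is not the RTF to XML converter mentioned in slide 22; the element names (`thesis`, `title`, `author`, `body`, `para`), the function `thesis_to_xml` and the sample values are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def thesis_to_xml(title, author, language, paragraphs):
    """Serialize thesis text as UTF-8 encoded XML (hypothetical element names)."""
    root = ET.Element("thesis", {"lang": language})
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "author").text = author
    body = ET.SubElement(root, "body")
    for text in paragraphs:
        ET.SubElement(body, "para").text = text
    # encoding="utf-8" returns bytes; Indic characters are written as UTF-8.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)

# Invented sample values, including a Kannada title.
data = thesis_to_xml("ಕನ್ನಡ ಸಾಹಿತ್ಯ", "A. Author", "kn", ["First paragraph."])
print(data.decode("utf-8"))
```

Since the output is plain UTF-8, the same routine works unchanged for a thesis in any of the scripts listed in slide 19.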

  24. English Thesis Template

  30. Kannada Thesis Template

  36. Vidyanidhi Database: Approach… • Each script/language has one table. Currently there are three separate tables for the three scripts: one each for Roman, Hindi (Devanagari) and Kannada • Theses in Indic languages have two records: one in the Roman script (transliterated) and the other in the vernacular. Theses in English have only one record (in English)

  37. Vidyanidhi Database: Approach… • The two records are linked by the ThesisID number, a unique ID for the record • The bibliographic description in Vidyanidhi follows the ThesisMS Dublin Core standard adopted by the NDLTD and OCLC
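
The two-record design in slides 36 and 37 can be sketched as one table per script, joined on the thesis ID. The presentation uses MS SQL 2000; this sketch substitutes Python's built-in `sqlite3` so that it runs anywhere. The table names, column names and sample rows are invented for illustration.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a Roman table and a Kannada table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE roman_records (
            thesis_id INTEGER PRIMARY KEY,
            title     TEXT NOT NULL,
            language  TEXT NOT NULL
        );
        CREATE TABLE kannada_records (
            thesis_id INTEGER PRIMARY KEY REFERENCES roman_records(thesis_id),
            title     TEXT NOT NULL
        );
        """
    )
    # Invented sample data: an English thesis has one record,
    # a Kannada thesis has a transliterated record and a vernacular one.
    conn.executemany(
        "INSERT INTO roman_records VALUES (?, ?, ?)",
        [(1, "Digital Libraries in India", "English"),
         (2, "Kannada Sahitya", "Kannada")],
    )
    conn.execute("INSERT INTO kannada_records VALUES (?, ?)", (2, "ಕನ್ನಡ ಸಾಹಿತ್ಯ"))
    conn.commit()
    return conn

def linked_titles(conn, thesis_id):
    """Return (roman_title, vernacular_title or None), or None if the ID is unknown."""
    return conn.execute(
        """
        SELECT r.title, k.title
        FROM roman_records AS r
        LEFT JOIN kannada_records AS k ON k.thesis_id = r.thesis_id
        WHERE r.thesis_id = ?
        """,
        (thesis_id,),
    ).fetchone()

conn = build_demo_db()
print(linked_titles(conn, 2))
```

The `LEFT JOIN` reflects the rule that an English thesis has only the Roman record, so its vernacular title comes back as `None`.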

  38. Vidyanidhi: Platform • Microsoft • Windows XP supports all 10 Indic scripts • Uses Windows glyph processing: • OpenType font format • Uniscribe (Unicode Script Processor) • OpenType Layout Services library

  39. Vidyanidhi: Platform • MS SQL 2000 • A truly multilingual-capable SQL • Achieves satisfactory collation • Front end: ASP • JavaScript
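
As a rough illustration of why Unicode text can be sorted at all, the sketch below orders Devanagari words by code point after normalization. This is a stand-in technique, not the language-aware collation that the database server applies, and it will not match a proper collation in every case; the function name and sample words are assumptions.

```python
import unicodedata

def codepoint_sort(words):
    """Sort words by Unicode code point after NFC normalization.

    Plain code point order is only an approximation of
    language-specific collation rules.
    """
    return sorted(unicodedata.normalize("NFC", w) for w in words)

# Sample Devanagari words beginning with GA, KA and KHA respectively.
words = ["गज", "कमल", "खग"]
print(codepoint_sort(words))
```

For these words the result follows the usual consonant order (KA, KHA, GA) because the Devanagari block lists the consonants in that sequence.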

  41. Vidyanidhi: Accessing and Searching • One can search the Vidyanidhi database either in English (Roman script) or in the vernacular • The integrated (master) database has metadata records for theses in all languages • Each vernacular database has records of the specific language only

  42. Two Approaches: Differences • One affords search in the English language and the other in the vernacular • The first approach lets one view, in Roman script, the records of all theses that satisfy the query, with an option to view the vernacular-script record for theses in the vernacular

  43. The second approach enables one to search only the vernacular database and is thus limited to records in that language • However, this approach enables the search to be in the vernacular language and script
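
The difference between the two search approaches in slides 41 to 43 can be sketched with plain in-memory records. The record layout, the function names `search_master` and `search_vernacular`, and the sample data are assumptions for illustration; the actual system queries database tables through an ASP front end.

```python
# Master (Roman script) records exist for theses in all languages.
MASTER = [
    {"thesis_id": 1, "title": "Digital Libraries in India", "language": "English"},
    {"thesis_id": 2, "title": "Kannada Sahitya", "language": "Kannada"},
]

# Vernacular records exist only for theses written in that language.
VERNACULAR = {
    "Kannada": [{"thesis_id": 2, "title": "ಕನ್ನಡ ಸಾಹಿತ್ಯ"}],
}

def search_master(term):
    """First approach: Roman-script search across theses in all languages."""
    term = term.casefold()
    return [r["thesis_id"] for r in MASTER if term in r["title"].casefold()]

def search_vernacular(language, term):
    """Second approach: vernacular-script search within one language only."""
    return [r["thesis_id"] for r in VERNACULAR.get(language, []) if term in r["title"]]

print(search_master("kannada"))
print(search_vernacular("Kannada", "ಸಾಹಿತ್ಯ"))
```

Both searches return thesis IDs, so a hit from either approach can be used to fetch the linked record in the other script.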

  49. Unicode and Indic Scripts • The Vidyanidhi implementation dispels certain misconceptions and misconstructions about Unicode • Supposed problems: • Data input • Display and printing • Collation

  50. Data Input/Keyboard Layout. Our test bed and comparison with other methods: • The Unicode layout is as easy as the others in terms of speed • In terms of the number of keystrokes there is no difference, and sometimes the Unicode method involves fewer keystrokes • Data input was almost comparable to English records in terms of productivity
