Localization and Language Technology Standards Kavi Narayana Murthy University of Hyderabad ELITEX - 2007 New Delhi, 10-11 January 2007
Outline • Character Encoding Standards • Fonts, Glyphs, Mapping Standards • OS/Browser Support, Drivers • Transliteration, Romanization • Translation, Linguistic Resources • Speech and OCR Technologies • Enforcement Kavi Narayana Murthy UoH
Goals • Functionality • Whatever we can do with English, we must be able to do with our own languages and scripts with equal ease • Inter-operability, Platform Independence • All Applications must work seamlessly on all hardware and software platforms • Language and Script Independence • Multi-lingual, Multi-Script Support
Standards • Even a poor standard is better than no standard • Standards save us a lot in the long run • Commercial forces promoting non-standard, proprietary, secret systems must not be allowed to succeed • Let us not say “Let the Market Decide”!!!
Character Encoding Standards • ISCII and Unicode • ISCII is a BIS Standard, Unicode is not • Unicode is based on ISCII • In some sense, Unicode is a step in the backward direction • Let us understand ISCII first
Language and Script • Do not confuse one for the other • Many-to-Many • Script is neither language nor font • Script and SuperScript • Phonetic Basis • Common SuperScript for all ILs • Script Grammar
Language and Script • Sanskrit is written in Devanagari, Telugu, Kannada, Bangla etc. scripts • Devanagari is used for writing Sanskrit, Hindi, Marathi, etc. • English words are often written (transliterated) in local language scripts
Phonetic Basis • Words: Meanings, Sounds, Written Symbols • Meanings are supreme but difficult to quantify and encode • Sounds are the next best • A ‘ka’ sound is a ‘ka’ sound, whatever be the language – Hence ‘Universal’ • No need for ‘Spellings’ • What we write is what we speak - directly
Orthography • Written symbols correspond with phonemes – basic sound units • Minor variations in sounds (allophones, co-articulation effects etc.) are not depicted in orthography • t: Mountain, tea, truck, spilt, little • Special Symbols not to be confused with basic Characters
What is a Character? • Indian Languages: • No ‘alphabet’, not letters, no spellings • Phoneme-based • Units are syllable-like: called ‘akshara’-s • akshara-s very large in number • Corpus studies not sufficient • Made up of vowels, consonants etc. • Not all sequences valid
Script Grammar • A Grammar for Scripts • Allows all valid sequences, only valid sequences • No need to code all possible akshara-s • Script grammar must be part of standards: ISCII includes one. Unicode? • Script Grammar to be enforced by s/w
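A script grammar of this kind can be sketched as a small finite state machine. The Python below is only an illustration under stated assumptions: it works on Unicode Devanagari code points rather than ISCII bytes, it uses a deliberately simplified set of character classes and transitions (not the grammar given in the ISCII standard), and the names `char_class`, `TRANSITIONS` and `is_valid` are hypothetical.

```python
# Simplified character classes for Devanagari (an assumption for illustration):
#   V = independent vowel, C = consonant, M = dependent vowel sign (matra),
#   H = halant (virama), D = modifier (candrabindu, anusvara, visarga)
def char_class(ch):
    cp = ord(ch)
    if 0x0901 <= cp <= 0x0903:
        return "D"
    if 0x0905 <= cp <= 0x0914:
        return "V"
    if 0x0915 <= cp <= 0x0939:
        return "C"
    if 0x093E <= cp <= 0x094C:
        return "M"
    if cp == 0x094D:
        return "H"
    return None  # anything else is outside this toy grammar


# State = class of the last character read. A missing entry means "invalid".
TRANSITIONS = {
    "START": {"V": "V", "C": "C"},
    "V": {"V": "V", "C": "C", "D": "D"},
    "C": {"V": "V", "C": "C", "H": "H", "M": "M", "D": "D"},
    "H": {"C": "C"},
    "M": {"V": "V", "C": "C", "D": "D"},
    "D": {"V": "V", "C": "C"},
}


def is_valid(text):
    """Return True if text is a non-empty sequence accepted by the toy grammar."""
    state = "START"
    for ch in text:
        state = TRANSITIONS[state].get(char_class(ch))
        if state is None:
            return False
    return state != "START"


print(is_valid("\u0915\u093f\u0924\u093e\u092c"))  # C M C M C
print(is_valid("\u093f\u0915"))                    # starts with a matra
```

The point of the sketch is that validity is decided by a handful of classes and transitions, so the very large set of possible akshara-s never has to be listed explicitly.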
SuperScript • ILs: 10 Scripts with a nearly common sound system – all derived from the ancient ‘braahmi’ script • => SuperScript • Super Set of all Phonemes • Common encoding: ISCII • Extendable to all languages of the world
ISCII: (BIS – 1991: IS 13194) • 128 codes more than sufficient • Uses the upper half (128-255) of the 8-bit code space; the lower, ASCII half is untouched – allows mixing with English • SuperScript: Transliteration built-in • Long Standing: ISCII 1988, 1991 • Well thought and well designed
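Because the two halves of the 8-bit range never overlap, mixed English and Indian-language text can be separated by looking at one bit per byte. The following Python is a minimal sketch of that idea; the function name `split_runs` is hypothetical, and the upper-half bytes in the usage line are arbitrary placeholders, not a claim about specific ISCII code assignments.

```python
def split_runs(data: bytes):
    """Split bytes into runs: 'ascii' for 0x00-0x7F, 'upper' for 0x80-0xFF."""
    runs = []
    for b in data:
        kind = "ascii" if b < 0x80 else "upper"
        if runs and runs[-1][0] == kind:
            runs[-1][1].append(b)      # extend the current run
        else:
            runs.append((kind, bytearray([b])))  # start a new run
    return [(kind, bytes(buf)) for kind, buf in runs]


# Arbitrary example bytes: two ASCII letters, two upper-half bytes, then ASCII.
for kind, chunk in split_runs(b"ab\xb3\xda cd"):
    print(kind, chunk)
```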
Why did ISCII fail to catch on? • Silent on Character-to-Font mapping • A complex many-to-many mapping • Fonts not standardized, fonts not available • Not registered, no OS/Browser Support • (BIS – 1991: IS 13194) • Rationale not explained • Not publicized, not enforced
History • Proprietary, non-standard, secret font based encoding schemes • Promoted by commercial companies • Near Zero Inter-operability • Ad-hoc ISCII-to-font mapping schemes • Mapping schemes not made public • To be made Illegal and Punishable • Put India back by at least a decade!
Improving ISCII • Register - To get OS/Browser Support • Remove encoding of allophones, allographs • Script Grammar: FSM enough, CFG - not needed • Include Rationale, explanatory notes • Remove Attribute/Extension codes • Standardize ISCII-to-Font Mapping Scheme • Promote, Enforce
Character-to-Font Mapping • Complex scripts – not linear • Glyphs: shape units convenient for rendering • Poor correspondence with sound units • Many-to-Many mappings • Glyph selection, scaling, positioning • No Glyph Encoding Standard
From Character to Font • Must be provably complete and 100% consistent • Current systems are all ad-hoc – neither complete nor consistent • Finite State Transducers: • Necessary and Sufficient • Without restricting Creativity and Flexibility • Simple, Efficient, Re-Usable
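To make the transducer idea concrete, here is a toy Python sketch of one well-known non-linear case: in Devanagari the short 'i' vowel sign is typed after its consonant but drawn before it. Everything here is an assumption for illustration: the glyph names, the five-character table `GLYPH`, and the function `to_glyphs` are hypothetical, input is Unicode rather than ISCII, and a real mapping would also have to handle conjuncts, positioning and many other cases.

```python
# Hypothetical glyph names for a handful of Devanagari characters.
GLYPH = {
    "\u0915": "ka",       # consonant KA
    "\u0924": "ta",       # consonant TA
    "\u092c": "ba",       # consonant BA
    "\u093e": "aa_sign",  # vowel sign AA (drawn after the consonant)
    "\u093f": "i_sign",   # vowel sign I (drawn before the consonant)
}
CONSONANTS = {"\u0915", "\u0924", "\u092c"}
I_SIGN = "\u093f"


def to_glyphs(text):
    """Map characters to glyph names, holding back one consonant so that
    a following I sign can be emitted in front of it.
    Raises KeyError for characters outside the toy table."""
    out = []
    pending = None  # transducer state: consonant glyph read but not yet emitted
    for ch in text:
        glyph = GLYPH[ch]
        if ch == I_SIGN and pending is not None:
            out.extend([glyph, pending])  # reorder: sign first, then consonant
            pending = None
            continue
        if pending is not None:
            out.append(pending)
            pending = None
        if ch in CONSONANTS:
            pending = glyph
        else:
            out.append(glyph)
    if pending is not None:
        out.append(pending)
    return out


print(to_glyphs("\u0915\u093f\u0924\u093e\u092c"))
```

The state is a single held-back glyph, so the machine stays finite while still producing output in visual rather than phonetic order.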
Encoding Standards: Unicode • For Language/Script/SuperScript? • CJK. Why not for ILs? • Script Grammar? • Character-to-Font: • relegated to font level • font effects • ISCII-88 Based, Has Errors • Once added, cannot be deleted!
ISCII or Unicode? • Unicode: • To be with the World, to know and be known • ‘Correcting’ Mistakes, Improving Standards • Support (OS, Fonts, etc.), Education, Training • Converting Legacy Data – A Huge Task • ISCII-to-Unicode is not trivial • Ignore BIS Standard and embrace what is not yet ‘standardized’? • Why not co-exist? – Internal and External Views
Keyboard Layouts, Drivers • Several de-facto standards and many variations in use • To select a few and standardize • So called Roman Phonetic Typing • ILs through English! • OK for oldies, not for future! • INSCRIPT: ISCII Standard, Good for newcomers • To strictly enforce Script Grammar
Document Encoding Standards • Plain Text: pure ISCII/UNICODE • Mono-lingual Plain Text? • Annotated Text (Ex. Word Processors) • XML Style, Open, Readable formats to be encouraged • Proprietary, secret, non-standard encodings must be discouraged
Transliteration • Widely used, part of our Tradition • Sanskrit texts in local scripts • English, Hindi, Urdu words in local scripts • Music Compositions • Automatic in ISCII. Unicode? • Quality of transliteration • To and From English?
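On the "Unicode?" question: the Unicode blocks for the major Indian scripts are each 128 code points wide and follow an ISCII-derived layout, so a rough script-to-script transliteration can be sketched as a block offset shift. The Python below is a hedged illustration, not a complete method. The function name `transliterate` is hypothetical, and the sketch ignores two known problems: not every position is assigned in every block, and some characters in the Devanagari block (such as the danda punctuation marks) are shared across scripts and should not be shifted.

```python
# First code point of each 128-wide Unicode block.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def transliterate(text, src, dst):
    """Shift every character of the src block to the same offset in dst.
    Characters outside the src block are copied unchanged.
    No check is made that the target code point is actually assigned."""
    s, d = BLOCK_START[src], BLOCK_START[dst]
    out = []
    for ch in text:
        cp = ord(ch)
        if s <= cp < s + 0x80:
            out.append(chr(cp - s + d))
        else:
            out.append(ch)
    return "".join(out)


# Devanagari KA + AA sign, shifted into the Telugu and Kannada blocks.
print(transliterate("\u0915\u093e", "devanagari", "telugu"))
print(transliterate("\u0915\u093e", "devanagari", "kannada"))
```

A production converter would need per-script exception tables on top of the offset shift, which bears on the "Quality of transliteration" bullet above.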
Romanization • Need: • Where there is no support for local languages • English dailies, posters, advertisements etc. • Lack of support: OS/Browser/Fonts etc. • Where users prefer Roman • A variety of ad-hoc schemes in use • iTRANS, RTS, W-X, etc. • Standards badly wanted
Romanization • Multi-dimensional optimization problem • Case Mix-up • 26 Letters not sufficient • 52 nearly sufficient • Not always supported • Storage space, Ease of Typing, Aesthetics • Scientific/Logical Design/Naturalness • English-like – for the oldies: a, ee, oo, a, oa ??? • Futuristic: aa/ii/uu/ee/oo
Romanization • Clashes: a+u/au, k+h/kh, s’ • Two way conversion, cyclic check • Ex. Long Vowels: • a: -clashes with colon • diacritic –not supported • ipa –not understood –not supported • A +single char. +saves space –ugly –difficult to type –case-mix-up • aa +logical (like ee) +easy to type
Romanization: An Example • a aa i ii u uu R RR e ee ai o oo au M H • k kh g gh n~ • c ch j jh n` • T TH D DH N • t th d dh n • p ph b bh m • y r l v s’ S s h L
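The clashes and the cyclic check mentioned on the earlier Romanization slide can be tried out on this symbol table. The Python below is a minimal sketch under stated assumptions: it uses greedy longest-match tokenization, which is one possible policy and not necessarily the one intended here; it writes s’ with a plain ASCII apostrophe; and the names `tokenize` and `round_trips` are hypothetical.

```python
# Symbol table copied from the slide (s' written with an ASCII apostrophe).
SYMBOLS = (
    "a aa i ii u uu R RR e ee ai o oo au M H "
    "k kh g gh n~ c ch j jh n` T TH D DH N "
    "t th d dh n p ph b bh m y r l v s' S s h L"
).split()
SYMSET = set(SYMBOLS)
MAXLEN = max(len(s) for s in SYMBOLS)


def tokenize(roman):
    """Split a romanized string into symbols, longest match first."""
    tokens, i = [], 0
    while i < len(roman):
        for n in range(MAXLEN, 0, -1):
            piece = roman[i:i + n]
            if piece in SYMSET:
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError("no symbol matches at position %d" % i)
    return tokens


def round_trips(tokens):
    """Cyclic check: writing the symbols out and reading them back
    must give the same symbol sequence."""
    return tokenize("".join(tokens)) == list(tokens)


print(tokenize("kha"))
print(round_trips(["k", "aa"]))  # survives the cycle
print(round_trips(["a", "u"]))   # a+u is read back as the single symbol au
```

Sequences that fail `round_trips` are exactly the clashes a romanization standard has to resolve, for example with an explicit separator.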
Translation • Create Material Afresh • Translate by Hand • Automatic/Machine Translation • Machine Aided Translation • English – Local Language Translation • Local – Local Language Translation
Translation • Resource Intensive • Manpower, Time, Cost • Quality/Uniformity • Standards, Bench-Mark Data, Testing and Evaluation Procedures • Dictionaries, Terminology Databases • Pan-Indian Terms/Sanskritize/Localize
Linguistic Resources • Dictionaries – General, Domain Specific • Terminological Databases • Thesauri, WordNets, Ontologies • Morphological Analyzers, Generators • Spell/Grammar/Style Checkers • Annotated Text and Speech Corpora
India: Future is in Speech • One Billion People, A Sixth of the World • More than 150 Languages, 22 Recognized • 95% not comfortable with English • Computers, Current, Connectivity • Info Revolution benefits: Majority Deprived • 10 M Computers, 100 M Phones • Future is in Speech
Speech • Natural • Easy, Fast • Hands-Free • No need to Learn • Technology • Language • Available to all
Text and Speech • Speech is Natural • Reading/Writing is learnt, Artificial • Some never learn – Illiterates • Oral Tradition • Speech is more permanent than Text! • “I did not steal that ring of gold” • Trust Yourself!
Speech Technologies • Speech Recognition: Speech to Text • Speech Synthesis: Text to Speech • Speaker Recognition, Verification, ID • Speech Coding/Decoding, Compression • Slow down, Speed up • Speech as Evidence
Applications • Telephone Dialing • Form Filling • Dictation Machine • Command and Control • Voice enabled Web • OCR+WP+TTS • MT: Cross-Lingual IR, S2S
OCR • OCR in Local Scripts Needed • To digitize and save legacy data • To compile/process/edit/refine data • For Printed Texts/Manuscripts • Old Data • deterioration of paper • old type fonts, problems of type-setting
Multi-Modal Interfaces • To Reach out to 1 Billion People, we must get the best of many worlds: • Speech Recognition and Synthesis • Graphics and iconic Interfaces • OCR Technologies • Translation, CLIR • Camera, Gestures, Touch Screen
Balance • Between Backward Compatibility and Future-Proof Designs • Quick Fix Solutions and Long Haul • One Standard or Several? • Economics and Business Sense versus Social Responsibilities • Acceptance versus Enforcement
The 3 Most Important Things 1. Develop/Refine/Update Standards • Detailed Documentation • Including rationale, issues, evaluation, etc. 2. Education and Training 3. Enforcement • Make use of non-standard methods illegal and punishable under law • Technical Workshops for detailing
Thank You! Visit www.LanguageTechnologies.ac.in