Unicode • Mark Davis • President, Unicode Consortium • Chief SW Globalization Architect, IBM • 2003-09-24
Unique number for every character • Universal Character Encoding • …
Unifies all Languages • 96 thousand characters, so far • All characters accessible at the same time, in the same document: A, Ž, Ш, Δ, ش, क, க, ಔ, … か, 上, 각, …
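To make "a unique number for every character" concrete, the sketch below prints the code point of a few characters from different scripts. The helper name code_point and the sample list are chosen here for illustration only; they are not part of the original slides.

```python
def code_point(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    return f"U+{ord(ch):04X}"


# Sample characters picked for illustration: Latin, Cyrillic, Greek,
# Arabic, Devanagari, Hiragana, Han and Hangul, all in one list.
samples = ["A", "Ž", "Ш", "Δ", "ش", "क", "か", "上", "각"]

for ch in samples:
    # Each character maps to exactly one number, whatever its script.
    print(ch, code_point(ch))
```

Because every character has its own number, the same string can hold all of these scripts at once without any mode switching.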
Lingua Franca for Computers • Developed & supported by industry leaders: • Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys, … • Required by modern standards: • XML, HTML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, Perl, etc. • Implemented in: • All modern operating systems, browsers, and other products
International Domain Names • Approved - Unicode-Based • Examples: • http://Юникод.com • http://Βαλκανίων.com • http://हमसब.com
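As a rough illustration of how Unicode domain names travel over ASCII-only DNS, the sketch below uses Python's built-in "idna" codec, which follows the IDNA 2003 rules. The helper names to_ascii and to_unicode are made up for this example, and only the host-name part of the URLs above is used.

```python
def to_ascii(hostname: str) -> bytes:
    """Convert a Unicode host name to its ASCII ("xn--") form."""
    # Python's "idna" codec applies nameprep (including case folding)
    # and Punycode to each non-ASCII label.
    return hostname.encode("idna")


def to_unicode(ascii_hostname: bytes) -> str:
    """Convert an ASCII ("xn--") host name back to Unicode."""
    return ascii_hostname.decode("idna")


for name in ["Юникод.com", "हमसब.com"]:
    wire = to_ascii(name)
    # The round trip is expected to return the name in lowercase,
    # because nameprep folds case before encoding.
    print(name, "->", wire.decode("ascii"), "->", to_unicode(wire))
```

Later revisions of the IDNA rules differ in some details, so a production system would likely use a dedicated library rather than this codec.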
Standard Resources • www.unicode.org • Online Standard • Technical Reports • FAQs • General Information • Discussion Forums, Conferences
Programming Resources • System APIs: • Windows, Java, Unix, Oracle, DB2, Sybase, Mac, Linux, … • Languages • Java, JavaScript, C#, Perl 5.6.0, C, C++, SQL, … • Cross-platform libraries: • ICU, Rosette, …
Stability • Developers / other standards need absolute stability • Characters are never moved or deleted • Ordering of characters is by collation, not binary order. See UTS #10: Unicode Collation Algorithm • Characters may be deprecated (discouraged). • Characters never change names • Annotations are used to clarify usage • See Unicode Policies
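The point that ordering comes from collation rather than binary order can be seen with a small experiment. The sketch below sorts three words by raw code point and then by a deliberately crude key that drops accents and ignores case. The function crude_sort_key is a hypothetical stand-in written for this example; it is not the Unicode Collation Algorithm, which needs a library such as ICU.

```python
import unicodedata


def crude_sort_key(text: str) -> str:
    """Illustrative key only: strip combining marks, then fold case.

    This is NOT UTS #10; it merely shows that a useful order differs
    from code point order.
    """
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()


words = ["Zebra", "apple", "\u00c9clair"]  # "\u00c9" is É

# Binary (code point) order: uppercase Z sorts before lowercase a,
# and É sorts after every ASCII letter.
binary_order = sorted(words)

# Approximate dictionary order using the crude key.
approx_order = sorted(words, key=crude_sort_key)

print(binary_order)
print(approx_order)
```

Since sorting never relies on code point values, characters can stay at their assigned positions forever without harming language-appropriate ordering.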
Indic Support in Unicode • ISCII is the basis for the characters and their allocation • The Consortium is actively engaged with the Indian Government, which is a member • The Consortium welcomes additions of missing characters (e.g. Vedic) and clarifications or corrections of usage
Structural Similarities with ISCII • Within script, layout and contents nearly identical • Independent + dependent vowels • Halant model for representing conjuncts • conjuncts / half-forms not directly encoded • represented by sequences instead • Phonetic sequence – order in syllables
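The halant model can be illustrated with one Devanagari conjunct. In the sketch below, the conjunct usually written as a single glyph is stored as three characters: a consonant, the halant (named VIRAMA in Unicode), and a second consonant. The helper describe is a name invented for this example.

```python
import unicodedata

# KA + VIRAMA (halant) + SSA: rendered by a font as one conjunct glyph,
# but stored as a sequence of three characters.
conjunct = "\u0915\u094d\u0937"


def describe(text: str) -> list:
    """List (code point label, character name) for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in text]


for label, name in describe(conjunct):
    print(label, name)
```

Shaping the sequence into the conjunct form is left to the rendering layer (for example an OpenType font), not to the encoding.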
Structural Differences with ISCII • Unicode is stateless: • No shifting to get different scripts • Each character has a unique number • Unicode is uniform: • No extension bytes necessary • All characters coded in the same space
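A small sketch of what "stateless" means in practice: each character in a mixed-script string identifies its own script, with no preceding shift code. The function script_of is a hypothetical helper that takes the first word of the character name as a rough proxy for the script; it is an assumption made for illustration and not the formal Unicode Script property.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Rough script label: the first word of the character's name.

    Illustrative shortcut only, not the Unicode Script property.
    """
    return unicodedata.name(ch).split()[0]


# Devanagari KA, Tamil KA and Kannada AU side by side in one string.
mixed = "\u0915\u0b95\u0c94"

# Each character can be classified on its own; nothing earlier in the
# string has to be consulted to know which script it belongs to.
print([script_of(c) for c in mixed])
```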
Additional Characters • Indian Government is developing proposals for: • Additions of missing characters: • Vedic • Individual characters for certain scripts • Annotations and Descriptions
Global Applications now support languages of India • Companies supporting Indic with Unicode • OpenType fonts • Font support for Indic • Microsoft Windows • Java (IBM contributed ICU Indic Layout) • Linux • …
Benefits for India • All documents, anywhere in the world, can have Indic text • Allows seamless multilingual documents in India • including scriptures and minority languages • Opens up software export market, beyond English • Connects India to the world
How India Can Contribute • Effective Communication with the Unicode Consortium • Provide Resources for Development • Descriptions of Usage • Descriptions of Character Shaping • Transliteration Tables from Script to Script • Collation Information • OpenType fonts • …
What Developers Can Do • Interwork with existing ISCII systems • Move to Unicode for future developments • Java, Windows, Linux, …
The Future • The world is moving rapidly to Unicode • Unicode makes India open to the world • The world comes to you, and • You go to the world • You can help
Multiple Forms • UTF-8: maximal compatibility with 8-bit systems • UTF-16: good storage, interoperability with Windows/Java • UTF-32: simplest processing • Fast, lossless conversion • See Forms of Unicode
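The trade-offs between the three forms show up in byte counts. The sketch below encodes a few characters in each form and checks that conversion is lossless. The function names are invented for this example, and the little-endian codec variants are used only so that no byte order mark is added to the counts.

```python
# Little-endian variants avoid a byte order mark in the output.
FORMS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text occupies in each form."""
    return {form: len(text.encode(form)) for form in FORMS}


def round_trips(text: str) -> bool:
    """Check that encoding then decoding returns the same text."""
    return all(text.encode(form).decode(form) == text for form in FORMS)


# ASCII letter, Devanagari KA, and a character beyond U+FFFF.
for sample in ["A", "\u0915", "\U00010400"]:
    print(f"U+{ord(sample):04X}", encoded_sizes(sample), round_trips(sample))
```

UTF-8 keeps ASCII at one byte, UTF-16 stores most characters in two bytes and the rest in four, and UTF-32 always uses four bytes per character.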