1. What's New in Globalization? Mark Davis
President & Cofounder, The Unicode Consortium
2. The Unicode Standard, Version 5.0 Hard copy versions of the Unicode Standard have been among the most crucial and most heavily used reference books in my personal library for years. Donald E. Knuth
For more than a decade, Unicode has been a foundation for many Microsoft products and technologies; Unicode Standard Version 5.0 will help us deliver important new benefits to users. Bill Gates
The path W3C follows to making text on the Web truly global is Unicode. Sir Tim Berners-Lee, KBE
Without Unicode, Java wouldn't be Java, and the Internet would have a harder time connecting the people of the world. James Gosling
Foreword to The Unicode Standard, Version 5.0
Without much fanfare, Unicode has completely transformed the foundation of software and communications over the past decade. Whenever you read or write anything on a computer, you're using Unicode. Whenever you search on Google, Yahoo!, MSN, Wikipedia, or many other Web sites, you're using Unicode. Unicode 5.0 marks a major milestone in providing people everywhere the ability to use their own languages on computers.
We began Unicode with a simple goal: to unify the many hundreds of conflicting ways to encode characters, replacing them with a single, universal standard. Those existing legacy character encodings were both incomplete and inconsistent: Two encodings could use the same internal codes for two different characters and use different internal codes for the same characters; none of the encodings handled any more than a small fraction of the world's languages. Whenever textual data was converted between different programs or platforms, there was a substantial risk of corruption. Programs were hard-coded to support particular encodings, making development of international versions expensive, testing a nightmare, and support costs prohibitive. As a result, product launches in foreign markets were expensive and late, unsatisfactory both for companies and their customers. Developing countries were especially hard-hit; it was not feasible to support smaller markets. Technical fields such as mathematics were also disadvantaged; they were forced to use special fonts to represent arbitrary characters, but when those fonts were unavailable, the content became garbled.
Unicode changed that situation radically. Now, for all text, programs only need to use a single representation, one that supports all the world's languages. Programs could be easily structured with all translatable material separated from the program code and put into a single representation, providing the basis for rapid deployment in multiple languages. Thus, multiple-language versions of a program can be developed almost simultaneously at a much smaller incremental cost, even for complex programs like Microsoft Office or OpenOffice.
The assignment of characters is only a small fraction of what the Unicode Standard and its associated specifications provide. They give programmers extensive descriptions and a vast amount of data about how characters function: how to form words and break lines; how to sort text in different languages; how to format numbers, dates, times, and other elements appropriate to different languages; how to display languages whose written form flows from right to left, such as Arabic and Hebrew, or whose written form splits, combines, and reorders, such as languages of South Asia; and how to deal with security concerns regarding the many look-alike characters from alphabets around the world. Without the properties, algorithms, and other specifications in the Unicode Standard and its associated specifications, interoperability between different implementations would be impossible.
With the rise of the Web, a single representation for text became absolutely vital for seamless global communication. Thus the textual content of HTML and XML is defined in terms of Unicode: every program handling XML must use Unicode internally. The search engines all use Unicode for good reason; even if a Web page is in a legacy character encoding, the only effective way to index that page for searching is to translate it into the lingua franca, Unicode. All of the text on the Web thus can be stored, searched, and matched with the same program code. Since all of the search engines translate Web pages into Unicode, the most reliable way to have pages searched is to have them be in Unicode in the first place.
This edition of The Unicode Standard, Version 5.0, supersedes and obsoletes all previous versions of the standard. The book is smaller in size, less expensive, and yet has hundreds of pages of new material and hundreds more of revised material. Like any human enterprise, Unicode is not without its flaws, of course. This book will help you work around some of the gotchas introduced into Unicode over the course of its development. Importantly, it will help you to understand which features may change in the future, and which cannot, so that you can appropriately optimize your implementations. You will also find a wealth of other information on the Unicode Web site (www.unicode.org). If you are interested in having a voice in determining directions for future development of Unicode, or want to follow closely the ongoing work, you will find information there on joining the Consortium.
What you have in your hands is the culmination of many years of experience from experts around the globe. I am sure you will find it very useful.
MARK DAVIS, Ph.D., President, The Unicode Consortium
3. The Unicode Standard, Version 5.0 Obsoletes previous versions
Basis for Microsoft's Vista; in upgrade plans for Google, Yahoo!, and ICU, to name but a few.
Hundreds of pages of new information; thousands of revised pages; all Unicode Standard Annexes
Systematic framework for improved text processing
Improvements to the Unicode Encoding Model for UTF-8,
Rigorous stability of case folding and identifiers
Improved interoperability and backward compatibility
Enabling additional new ways to optimize code
4. U5.0 Unicode Character Database Unicode: far more than a list of characters
Properties: key to how characters function
Changes in 5.0
Scripts: Unassigned code points → Zzzz
Casing Stability: Upper ? folded
BIDI: Consistent Bidi_Mirrored
Now Normative: kIICore
Line Break: SE Asian → Complex_Context
New Properties: Normative_Name_Alias, Deprecated, 3 Unihan provisional properties
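As a rough illustration of how properties describe character behavior, the sketch below reads a few of them with Python's standard `unicodedata` module. The helper name `describe` is invented for this example, and the values come from whatever Unicode version the Python interpreter bundles, which is not necessarily 5.0.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few Unicode Character Database properties for one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),          # General_Category
        "bidi_class": unicodedata.bidirectional(ch),   # Bidi_Class
        "mirrored": bool(unicodedata.mirrored(ch)),    # Bidi_Mirrored
    }


# Latin capital A, an opening parenthesis, and Hebrew alef.
for ch in ("A", "(", "\u05D0"):
    print(describe(ch))
```

The parenthesis reports `mirrored` as true and the Hebrew letter reports a right-to-left bidi class, which is the kind of data a bidirectional layout implementation consumes.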
5. U5.0 Conformance Stable Case-Folded
Upper ? Lower
Much clearer encoding / property model
Stable Approved Named Character Sequences
Bengali, Gurmukhi, Tamil changes
Combining grapheme joiner clarified
Disunification of Diacritics
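To make the case-folding point concrete, here is a minimal sketch of caseless comparison using Python's built-in `str.casefold`. The function name `caseless_equal` is an assumption for illustration. A full caseless match would normally also normalize the strings, which this sketch leaves out.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after case folding (no normalization applied)."""
    return a.casefold() == b.casefold()


# Lowercasing alone keeps the sharp s, so the two spellings stay different;
# case folding maps it to "ss" and the comparison succeeds.
print("Straße".lower() == "STRASSE".lower())
print(caseless_equal("Straße", "STRASSE"))
```

Stability of the folded form matters because identifiers that were compared this way under one Unicode version should still compare the same way under a later one.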
6. 5.0 Annexes: Core UAX #9: Bidirectional Algorithm
Tightened conformance requirements
UAX #15: Unicode Normalization Forms
New Stream-Safe Text Format
Appendix of characters requiring special handling
Expanded info on stability guarantees
Additional detailed figures, guidelines
UAX #31: Identifier and Pattern Syntax
Added profiles & information on usage
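The normalization forms covered by UAX #15 can be tried out with Python's standard `unicodedata.normalize`. The sketch below is only an illustration: `same_text` is a hypothetical helper, and the Stream-Safe Text Format mentioned above is not shown.

```python
import unicodedata

composed = "\u00E9"      # é as one precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Treat canonically equivalent strings as equal by comparing NFC forms."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)            # raw code points differ
print(same_text(composed, decomposed))   # canonically equivalent
print(len(unicodedata.normalize("NFD", composed)))
```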
7. U5.0 Annexes: Boundaries UAX #14: Line Breaking Properties
Rules modified to improve behavior
Now Normative (conformance clauses reorganized)
UAX #29: Text Boundaries
Edge cases improved
Tailorings for text boundaries now in Unicode CLDR
Format of the rules changed to ease implementation
Additional guidelines on regex, identifiers,
8. U5.0 Characters by Script
9. Unicode Character Timeline
10. Unicode Guide for Programmers Adjunct to Standard
Concise Guide for Software Globalization
Crucial Concepts
Key Gotchas
Recognize and Avoid
Details on
Encoding & conversions:
UTF-8, 16, 32 & BOM
Using character properties
Text Operations
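The encoding forms and the byte order mark listed above can be compared side by side. This is a small sketch using only Python's standard codecs; the sample string is chosen for illustration so that each character needs a different number of UTF-8 bytes.

```python
import codecs

# A, é, €, and a musical symbol outside the Basic Multilingual Plane.
text = "A\u00E9\u20AC\U0001D11E"

utf8 = text.encode("utf-8")       # 1 + 2 + 3 + 4 bytes
utf16 = text.encode("utf-16-le")  # 2 + 2 + 2 + 4 bytes (surrogate pair)
utf32 = text.encode("utf-32-le")  # 4 bytes per code point
print(len(text), len(utf8), len(utf16), len(utf32))

# A UTF-8 byte order mark is three bytes at the start of the data.
with_bom = codecs.BOM_UTF8 + utf8
naive = with_bom.decode("utf-8")      # BOM survives as U+FEFF
clean = with_bom.decode("utf-8-sig")  # BOM is stripped
print(repr(naive[0]), clean == text)
```

One gotcha this shows: decoding with plain `utf-8` leaves the BOM in the text as an invisible leading character, so string comparisons can fail unexpectedly.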
11. Unicode Common Locale Data Repository: CLDR Key locale data for world languages
Most extensive standard repository of locale data
XML format
12. Unicode CLDR 1.4 121 languages and 142 territories; 360 locales in all
25% more locale data; over 17,000 new/modified items
Repository separated into language vs locale data
Language-specific segmentation (word/line breaks)
Transliterations (e.g. Ελληνικά → Elleniká)
Data for lenient date/time formatting and parsing
Programmer asks for numeric day + abbreviated month
Best format pattern returned, e.g. dd.MMM
+ Quarters in dates (e.g. 2006Q1)
BCP 47 compatibility + extensions
13. BCP 47 Language Tags Usage: HTTP, HTML, XML; CLDR Locale IDs
RFC 4646; Obsoletes RFCs 1766, 3066
Addresses problems in RFC3066
ISO standards: stability / accessibility / ambiguity
Parseability, Extensibility; Registration speed
Identification of script (where necessary):
Traditional Chinese (zh-Hant), Serbian in Latin (sr-Latn), Azerbaijani (Cyrillic) az-Cyrl, etc.
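The tags above follow a language-script-region shape that is easy to pick apart. The sketch below is a deliberately simplified parser, not the full RFC 4646 grammar: it ignores extended language subtags, variants, extensions, private use, and grandfathered tags, and the names `parse_tag` and `_SIMPLE_TAG` are invented for this example.

```python
import re

# Simplified shape: language[-Script][-REGION]
_SIMPLE_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_tag(tag: str):
    """Split a simple language tag into subtags, using conventional casing."""
    m = _SIMPLE_TAG.match(tag)
    if m is None:
        return None
    script = m.group("script")
    region = m.group("region")
    return {
        "language": m.group("language").lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
    }


for tag in ("zh-Hant", "sr-Latn", "az-Cyrl", "sr-latn-rs", "es-419"):
    print(tag, parse_tag(tag))
```

Because the script subtag is always four letters and the region is two letters or three digits, the subtags can be told apart by length alone, which is part of the parseability improvement over RFC 3066.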
14. Unicode Security Examples:
Visual Confusables: paypal.com with Cyrillic a
Non visual problems: buffer overflows, non-shortest form,
UTR #36: Unicode Security Considerations
Guidelines & Recommendations
UTS #39: Unicode Security Mechanisms
Algorithms & Data
Limitations on Repertoire
Testing for Confusables
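Both kinds of problem on this slide can be demonstrated in a few lines. The sketch below is a crude heuristic, not the UTS #39 mechanism: it approximates a character's script by the first word of its Unicode name instead of using the Script property or the confusables data, and the helper names are hypothetical.

```python
import unicodedata


def scripts_in(label: str) -> set:
    """Approximate the scripts used in a label from character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # Assumption: the first word of the name stands in for the script.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    return len(scripts_in(label)) > 1


def decodes_as_utf8(data: bytes) -> bool:
    """Return True only if the bytes are well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A
print(scripts_in("paypal"), scripts_in(spoofed), is_mixed_script(spoofed))

# C0 AF is a non-shortest-form encoding of "/"; a conformant decoder rejects it.
print(decodes_as_utf8(b"/"), decodes_as_utf8(b"\xc0\xaf"))
```

A mixed-script check catches this particular spoof but not whole-script confusables, which is why the data files in UTS #39 go further.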
15. Internationalized Domain Names One instance of broad problem
Many RFCs use Nameprep limited to Unicode 3.2
Unicode recommendations
Narrow the repertoire: exclude symbols, punctuation
Expand the coverage: currently only Unicode 3.2.
IETF idn-nextsteps published
Some positive developments, but misreads Unicode, needs more work
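For a feel of what the Nameprep-based approach does to a name, Python's built-in `idna` codec can be used; it implements the original IDNA RFCs, whose Nameprep step is tied to Unicode 3.2. The wrapper names `to_ascii` and `to_unicode` and the sample domain are assumptions for this sketch.

```python
def to_ascii(domain: str) -> bytes:
    """Convert an internationalized domain name to its ASCII (Punycode) form."""
    return domain.encode("idna")


def to_unicode(domain: bytes) -> str:
    """Convert an ASCII-compatible domain name back to Unicode."""
    return domain.decode("idna")


encoded = to_ascii("bücher.example")
print(encoded)
print(to_unicode(encoded))
```

Each non-ASCII label is prepared and then encoded with an `xn--` prefix, while ASCII labels pass through unchanged.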
16. URL → IRI Internationalized Resource Identifier (IRI)
UTF-8, %-escaped
Example:
http://w3.org/International/articles/idn-and-iri/JP??/??????.html
http://w3.org/International/articles/idn-and-iri/JP%E7%B4%8D... %E8%B1%86.html
See http://ietf.org/rfc/rfc3987.txt
17. Ideographic Variation Database U+82A6 ashi: multiple forms
First occurrence: any glyph is acceptable
Second occurrence: in the name of the town Ashiya, customarily displayed with form #4
Registration for variants
18. Ideographic Variation Database Variation Selector
Identifies a restriction on the appearance of a character
Character + Variation Selector = Variation Sequence
Han ideographs
Impossible to build a single collection for everyone: requirements from scholars, governments and publishers
Instead, registration of multiple independent collections
Unicode Ideographic Variation Database
A given variation sequence is used in at most one collection
Makes interchange of variation sequences reliable.
Registration, not Assessment
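In plain text, a variation sequence is just two code points in a row. The sketch below pairs each base character with a following variation selector; the helper names are invented, and the specific pairing of U+82A6 with VARIATION SELECTOR-17 is used purely as an illustration, with no claim about which registered glyph form, if any, it denotes.

```python
def is_variation_selector(ch: str) -> bool:
    """True for U+FE00..U+FE0F and U+E0100..U+E01EF."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def split_variation_sequences(text: str) -> list:
    """Return (base, selector-or-None) pairs for each character in text."""
    result = []
    for ch in text:
        if is_variation_selector(ch) and result and result[-1][1] is None:
            base, _ = result[-1]
            result[-1] = (base, ch)       # attach selector to preceding base
        else:
            result.append((ch, None))     # ordinary character or stray selector
    return result


# U+82A6 followed by a selector, then U+5C4B: the two ideographs of "Ashiya".
text = "\u82A6\U000E0100\u5C4B"
for base, selector in split_variation_sequences(text):
    print(f"U+{ord(base):04X}", f"U+{ord(selector):04X}" if selector else None)
```

A process that does not know the sequence can ignore the selector and still recover the base character, which is what keeps interchange safe.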
19. ICU 3.6 Mature, portable C/C++/Java intl libraries
Unicode 5.0, UCA 5.0, CLDR 1.4
ICU4C
Charset Detection
Improved: Time Zones, Thai word break, UText (64 bit), Performance, Data Management,
ICU4J
Globalization Preferences
Flexible date/time formats*, Charset conversion*
20. Near-Term Issues Unicode 5.0.1, Unicode 5.1
CLDR / BCP 47bis
LDAP
Collation Registry
IANA Charset Registry
21. Unicode 5.1 - possibilities Characters
CJK Unified Ideographs Extension C
Minority Scripts: Cham and Lanna
Malayalam chillu
Properties/Behavior
Normalization process for stable strings
22. CLDR 1.5 / BCP 47bis CLDR 1.5
Data Submission Starting November
New structures / data
BCP 47
Adding ~7,000 (!) new language subtags
Possibly other changes
23. LDAP Now has definitive comparison (good)
Stuck at Unicode 3.2 (bad)
http://www.ietf.org/rfc/rfc4518.txt
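To see what being pinned to Unicode 3.2 can mean in practice, Python happens to ship a frozen copy of the 3.2 character database as `unicodedata.ucd_3_2_0`. The sketch below only checks whether a code point was assigned in that version; it does not implement the RFC 4518 string preparation, and `assigned_in_3_2` is a made-up helper name.

```python
import unicodedata

old = unicodedata.ucd_3_2_0  # frozen Unicode 3.2 database


def assigned_in_3_2(ch: str) -> bool:
    """True if the code point had an assigned category in Unicode 3.2."""
    return old.category(ch) != "Cn"


balinese_akara = "\u1B05"  # Balinese was added in Unicode 5.0
print(old.unidata_version)
print(assigned_in_3_2("A"), assigned_in_3_2(balinese_akara))
print(unicodedata.category(balinese_akara))
```

Characters from scripts added after 3.2 look unassigned to such a protocol, so text in those scripts cannot be prepared or compared reliably.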
24. Collation Registry Nearing approval
Adds ability to register comparisons
Workable for basic cases
http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-14.txt
25. IANA Charset registry Currently limited usefulness
Ill-defined
Missing mapping tables
Incomplete
Inaccurate
Regime Change
Hope for future improvements!
26. What's New in Globalization? Mark Davis
President & Cofounder, The Unicode Consortium