
What’s New in Globalization


Presentation Transcript


    1. What’s New in Globalization? Mark Davis, President & Cofounder, The Unicode Consortium

    2. The Unicode Standard, Version 5.0

       “Hard copy versions of the Unicode Standard have been among the most crucial and most heavily used reference books in my personal library for years.” (Donald E. Knuth)

       “For more than a decade, Unicode has been a foundation for many Microsoft products and technologies; Unicode Standard Version 5.0 will help us deliver important new benefits to users.” (Bill Gates)

       “The path W3C follows to making text on the Web truly global is Unicode.” (Sir Tim Berners-Lee, KBE)

       “Without Unicode, Java wouldn't be Java, and the Internet would have a harder time connecting the people of the world.” (James Gosling)

       Foreword to The Unicode Standard, Version 5.0

       Without much fanfare, Unicode has completely transformed the foundation of software and communications over the past decade. Whenever you read or write anything on a computer, you’re using Unicode. Whenever you search on Google, Yahoo!, MSN, Wikipedia, or many other Web sites, you’re using Unicode. Unicode 5.0 marks a major milestone in providing people everywhere the ability to use their own languages on computers.

       We began Unicode with a simple goal: to unify the many hundreds of conflicting ways to encode characters, replacing them with a single, universal standard. Those existing legacy character encodings were both incomplete and inconsistent: Two encodings could use the same internal codes for two different characters and use different internal codes for the same characters; none of the encodings handled any more than a small fraction of the world’s languages.

       Whenever textual data was converted between different programs or platforms, there was a substantial risk of corruption. Programs were hard-coded to support particular encodings, making development of international versions expensive, testing a nightmare, and support costs prohibitive. As a result, product launches in foreign markets were expensive and late, unsatisfactory both for companies and their customers. Developing countries were especially hard-hit; it was not feasible to support smaller markets. Technical fields such as mathematics were also disadvantaged; they were forced to use special fonts to represent arbitrary characters, but when those fonts were unavailable, the content became garbled.

       Unicode changed that situation radically. Now, for all text, programs only need to use a single representation, one that supports all the world’s languages. Programs could be easily structured with all translatable material separated from the program code and put into a single representation, providing the basis for rapid deployment in multiple languages. Thus, multiple-language versions of a program can be developed almost simultaneously at a much smaller incremental cost, even for complex programs like Microsoft Office or OpenOffice.

       The assignment of characters is only a small fraction of what the Unicode Standard and its associated specifications provide. They give programmers extensive descriptions and a vast amount of data about how characters function: how to form words and break lines; how to sort text in different languages; how to format numbers, dates, times, and other elements appropriate to different languages; how to display languages whose written form flows from right to left, such as Arabic and Hebrew, or whose written form splits, combines, and reorders, such as languages of South Asia; and how to deal with security concerns regarding the many “look-alike” characters from alphabets around the world. Without the properties, algorithms, and other specifications in the Unicode Standard and its associated specifications, interoperability between different implementations would be impossible.

       With the rise of the Web, a single representation for text became absolutely vital for seamless global communication. Thus the textual content of HTML and XML is defined in terms of Unicode; every program handling XML must use Unicode internally. The search engines all use Unicode for good reason; even if a Web page is in a legacy character encoding, the only effective way to index that page for searching is to translate it into the lingua franca, Unicode. All of the text on the Web thus can be stored, searched, and matched with the same program code. Since all of the search engines translate Web pages into Unicode, the most reliable way to have pages searched is to have them be in Unicode in the first place.

       This edition of The Unicode Standard, Version 5.0, supersedes and obsoletes all previous versions of the standard. The book is smaller in size, less expensive, and yet has hundreds of pages of new material and hundreds more of revised material.

       Like any human enterprise, Unicode is not without its flaws, of course. This book will help you work around some of the “gotchas” introduced into Unicode over the course of its development. Importantly, it will help you to understand which features may change in the future, and which cannot, so that you can appropriately optimize your implementations. You will also find a wealth of other information on the Unicode Web site (www.unicode.org). If you are interested in having a voice in determining directions for future development of Unicode, or want to follow closely the ongoing work, you will find information there on joining the Consortium.

       What you have in your hands is the culmination of many years of experience from experts around the globe. I am sure you will find it very useful.

       MARK DAVIS, Ph.D., President, The Unicode Consortium

    3. The Unicode Standard, Version 5.0. Obsoletes previous versions. Basis for Microsoft's Vista; in upgrade plans for Google, Yahoo!, and ICU, to name but a few. Hundreds of pages of new information; thousands of revised pages; all Unicode Standard Annexes. Systematic framework for improved text processing. Improvements to the Unicode Encoding Model for UTF-8, … Rigorous stability of case folding and identifiers. Improved interoperability and backward compatibility. Enabling additional new ways to optimize code.

    4. U5.0 Unicode Character Database. Unicode: far more than a list of characters. Properties: key to how characters function. Changes in 5.0: Scripts (unassigned code points → Zzzz); Casing Stability (Upper → folded); BIDI (consistent Bidi_Mirrored); now normative (kIICore); Line Break (SE Asian → Complex_Context); new properties (Normative_Name_Alias, Deprecated, 3 Unihan provisional properties).
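
To make "properties: key to how characters function" concrete, here is a minimal sketch that reads a few UCD properties through Python's standard `unicodedata` module. The helper name `describe` is hypothetical, and the values come from whatever UCD version ships with the interpreter, which will normally be later than 5.0.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few Unicode Character Database properties for one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<no name>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "bidi_mirrored": bool(unicodedata.mirrored(ch)),
    }


if __name__ == "__main__":
    # A Latin letter, a mirrored punctuation mark, and a Hebrew letter.
    for ch in ["A", "(", "\u05D0"]:
        print(describe(ch))
    # The UCD version bundled with this Python build (an implementation detail).
    print("UCD version:", unicodedata.unidata_version)
```

The same character list can carry very different behavior: the parenthesis is mirrored in right-to-left text, and the Hebrew letter has a right-to-left bidi class, even though all three are "just characters".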

    5. U5.0 Conformance. Stable Case-Folded ~ Upper → Lower. Much clearer encoding / property model. Stable approved Named Character Sequences. Bengali, Gurmukhi, Tamil changes. Combining grapheme joiner clarified. Disunification of diacritics.
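
As a rough illustration of why case folding is treated separately from plain lowercasing, the sketch below compares strings with Python's built-in `str.casefold()`. The function name `caseless_equal` is hypothetical, and this is only the simple default caseless match (no normalization step), not a statement of the conformance text itself.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Default caseless match: compare the case-folded forms of both strings."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # Lowercasing keeps the sharp s, so the two spellings stay different.
    print("Stra\u00dfe".lower() == "STRASSE".lower())
    # Case folding maps the sharp s to "ss", so the spellings match.
    print(caseless_equal("Stra\u00dfe", "STRASSE"))
    # Final sigma folds to the ordinary small sigma.
    print("\u03c2".casefold() == "\u03c3")
```

Stability matters here because identifiers that have been stored in folded form must keep matching when a later Unicode version is adopted.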

    6. 5.0 Annexes: Core. UAX #9: Bidirectional Algorithm (tightened conformance requirements). UAX #15: Unicode Normalization Forms (new Stream-Safe Text Format; appendix of characters requiring special handling; expanded info on stability guarantees; additional detailed figures, guidelines). UAX #31: Identifier and Pattern Syntax (added profiles & information on usage).
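
The normalization forms of UAX #15 can be tried out with Python's standard `unicodedata.normalize`. The following is a small sketch, not part of the original slides; the helper name `canonically_equal` is an assumption made for illustration.

```python
import unicodedata

composed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(composed == decomposed)                   # raw code points differ
    print(canonically_equal(composed, decomposed))  # equal once normalized
    # NFKC additionally folds compatibility characters such as the "fi" ligature.
    print(unicodedata.normalize("NFKC", "\uFB01"))
```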

    7. U5.0 Annexes: Boundaries. UAX #14: Line Breaking Properties (rules modified to improve behavior; now normative, conformance clauses reorganized). UAX #29: Text Boundaries (edge cases improved; tailorings for text boundaries now in Unicode CLDR; format of the rules changed to ease implementation; additional guidelines on regex, identifiers, …).
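
To show why text boundaries are not the same as code point boundaries, here is a deliberately simplified sketch that attaches nonspacing and enclosing marks to the preceding character. It is an approximation for illustration only, not the UAX #29 rules (it ignores Hangul syllables, CR LF, and many other cases), and the name `rough_clusters` is hypothetical.

```python
import unicodedata


def rough_clusters(text: str) -> list:
    """Group each base character with the combining marks (Mn, Me) that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch   # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


if __name__ == "__main__":
    word = "re\u0301sume\u0301"  # "résumé" written with combining accents
    print(len(word), len(rough_clusters(word)))
```

A production implementation would use a library that follows the full rules and the CLDR tailorings mentioned on the slide, such as ICU.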

    8. U5.0 Characters by Script

    9. Unicode Character Timeline

    10. Unicode Guide for Programmers. Adjunct to the Standard. Concise guide for software globalization. Crucial concepts. Key “gotchas”: recognize and avoid. Details on: encoding & conversions (UTF-8, 16, 32 & BOM); using character properties; text operations.
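
The encoding forms and the BOM listed above can be explored with Python's built-in codecs. This is a minimal sketch; the helper names `encoded_sizes` and `decode_utf8_dropping_bom` are hypothetical.

```python
import codecs


def encoded_sizes(text: str) -> dict:
    """Byte length of the same text in the three Unicode encoding forms."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


def decode_utf8_dropping_bom(data: bytes) -> str:
    """Decode UTF-8, removing a leading byte order mark if one is present."""
    return data.decode("utf-8-sig")


if __name__ == "__main__":
    # 'A', EURO SIGN, and MUSICAL SYMBOL G CLEF (outside the BMP).
    sample = "A\u20AC\U0001D11E"
    print(encoded_sizes(sample))

    with_bom = codecs.BOM_UTF8 + "hi".encode("utf-8")
    # A plain UTF-8 decode keeps the BOM as U+FEFF, a classic gotcha.
    print(repr(with_bom.decode("utf-8")))
    print(repr(decode_utf8_dropping_bom(with_bom)))
```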

    11. Unicode Common Locale Data Repository (CLDR). Key locale data for world languages. Most extensive standard repository of locale data. XML format.

    12. Unicode CLDR 1.4. 121 languages and 142 territories, 360 locales in all. 25% more locale data; over 17,000 new/modified items. Repository separated into language vs locale data. Language-specific segmentation (word/line breaks…). Transliterations (e.g. Ελληνικά → Elleniká). Data for lenient date/time formatting and parsing: programmer asks for “numeric day” + “abbreviated month”; best format pattern returned, e.g. “dd.MMM”. Quarters in dates (e.g. 2006Q1). BCP 47 compatibility + extensions.

    13. BCP 47 Language Tags. Usage: HTTP, HTML, XML; CLDR locale IDs… RFC 4646; obsoletes RFCs 1766, 3066. Addresses problems in RFC 3066: ISO standards (stability / accessibility / ambiguity); parseability, extensibility; registration speed. Identification of script (where necessary): Traditional Chinese (zh-Hant), Serbian in Latin (sr-Latn), Azerbaijani in Cyrillic (az-Cyrl), etc.
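
The parseability point can be illustrated with a small sketch that splits a tag into language, script, and region by subtag shape alone. It covers only that subset of RFC 4646 (no extlang, variants, extensions, private use, or grandfathered tags) and does not check the subtags against the registry; the name `parse_simple_tag` is hypothetical.

```python
import re

# language (2-3 letters), optional script (4 letters),
# optional region (2 letters or 3 digits)
_SIMPLE_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_simple_tag(tag: str) -> dict:
    """Split a simple language tag into its subtags, using conventional casing."""
    match = _SIMPLE_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a simple language-script-region tag: {tag!r}")
    script = match["script"]
    region = match["region"]
    return {
        "language": match["language"].lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
    }


if __name__ == "__main__":
    for tag in ["zh-Hant", "sr-Latn", "az-Cyrl", "sr-latn-rs", "es-419"]:
        print(tag, parse_simple_tag(tag))
```

Because each subtag type has a distinct length and shape, the script subtag can be recognized without consulting any table, which was not possible under RFC 3066.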

    14. Unicode Security. Examples: visual confusables (“paypal.com” with Cyrillic ‘a’…); non-visual problems (buffer overflows, non-shortest form, …). UTR #36: Unicode Security Considerations (guidelines & recommendations). UTS #39: Unicode Security Mechanisms (algorithms & data; limitations on repertoire; testing for confusables).
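
Both kinds of problem can be sketched with the Python standard library. The mixed-script check below is a crude heuristic, assuming the first word of a character's name stands in for its script, because the standard library does not expose the Script property; it is not the UTS #39 confusable test. The function names are hypothetical.

```python
import unicodedata


def letter_scripts(label: str) -> set:
    """Heuristic: take the first word of each letter's name as its script."""
    scripts = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed_script(label: str) -> bool:
    """Flag labels whose letters appear to come from more than one script."""
    return len(letter_scripts(label)) > 1


def rejects_as_invalid_utf8(data: bytes) -> bool:
    """True if a strict UTF-8 decoder refuses the byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    spoof = "p\u0430ypal.com"  # second letter is CYRILLIC SMALL LETTER A
    print(looks_mixed_script("paypal.com"), looks_mixed_script(spoof))
    # 0xC0 0xAF is a non-shortest-form encoding of "/", which must be rejected.
    print(rejects_as_invalid_utf8(b"\xc0\xaf"))
```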

    15. Internationalized Domain Names. One instance of a broad problem: many RFCs use Nameprep, which is limited to Unicode 3.2. Unicode recommendations: narrow the repertoire (exclude symbols, punctuation); expand the coverage (currently only Unicode 3.2). IETF idn-nextsteps published: some positive developments, but misreads Unicode, needs more work.
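
Python's built-in `idna` codec implements the original IDNA (RFC 3490) with Nameprep, and its `stringprep` tables are based on Unicode 3.2, so it makes a convenient way to see the limitation described above. The sketch below is illustrative only; the helper names and the example domain are assumptions.

```python
import stringprep


def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII (Punycode) form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (Punycode) domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


def unassigned_in_unicode_3_2(ch: str) -> bool:
    """True if Nameprep's table A.1 treats the character as unassigned."""
    return stringprep.in_table_a1(ch)


if __name__ == "__main__":
    print(to_ascii("b\u00fccher.example"))
    print(to_unicode("xn--bcher-kva.example"))
    # U+1E9E LATIN CAPITAL LETTER SHARP S was added after Unicode 3.2.
    print(unassigned_in_unicode_3_2("\u1E9E"))
```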

    16. URL ? IRI International Resource Identifier (IRI) UTF-8, %-escaped Example: http://w3.org/International/articles/idn-and-iri/ JP??/??????.html http://w3.org/International/articles/idn-and-iri/ JP%E7%B4%8D... %E8%B1%86.html See http://ietf.org/rfc/rfc3987.txt
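
The "UTF-8, %-escaped" step can be reproduced with `urllib.parse.quote`, which encodes non-ASCII characters as UTF-8 and percent-escapes each byte. This is a minimal sketch of the path conversion only (host names are handled by IDNA instead); the helper name and the short path are assumptions, using the two characters whose escapes appear on the slide.

```python
from urllib.parse import quote, unquote


def iri_path_to_uri_path(path: str) -> str:
    """Percent-escape the UTF-8 bytes of non-ASCII characters, keeping '/'."""
    return quote(path, safe="/")


if __name__ == "__main__":
    # U+7D0D and U+8C46, the characters behind %E7%B4%8D and %E8%B1%86.
    iri_path = "/JP\u7d0d\u8c46/index.html"
    uri_path = iri_path_to_uri_path(iri_path)
    print(uri_path)
    print(unquote(uri_path) == iri_path)
```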

    17. Ideographic Variation Database. U+82A6 ashi: multiple forms. The first occurrence: any glyph. Second occurrence is in the name of the town Ashiya, customarily displayed with form #4. Registration for variants.

    18. Ideographic Variation Database. Variation Selector: identifies a restriction on the appearance of a character. Character + Variation Selector = Variation Sequence. Han ideographs: impossible to build a single collection for everyone (requirements from scholars, governments and publishers…); instead, registration of multiple independent collections. Unicode Ideographic Variation Database: a given variation sequence is used in at most one collection, which makes interchange of variation sequences reliable. Registration, not assessment.
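
"Character + Variation Selector = Variation Sequence" can be sketched in a few lines. The code below only recognizes the two variation selector ranges U+FE00..U+FE0F and U+E0100..U+E01EF and pairs each selector with the single character before it; it does not consult the Ideographic Variation Database, and the sample sequence is purely illustrative rather than a claim about what is registered. The function names are hypothetical.

```python
def is_variation_selector(ch: str) -> bool:
    """True for VS1..VS16 (U+FE00..U+FE0F) and VS17..VS256 (U+E0100..U+E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def split_variation_sequences(text: str) -> list:
    """Split text into units, keeping base + variation selector together."""
    units = []
    for ch in text:
        if units and is_variation_selector(ch) and len(units[-1]) == 1:
            units[-1] += ch   # base character + selector form one sequence
        else:
            units.append(ch)
    return units


if __name__ == "__main__":
    # U+82A6 followed by VARIATION SELECTOR-17, then U+5C4B (as in "Ashiya").
    text = "\u82a6\U000E0100\u5c4b"
    for unit in split_variation_sequences(text):
        print([f"U+{ord(c):04X}" for c in unit])
```

A renderer that does not know the sequence can simply ignore the selector and show any glyph for the base character, which is what keeps interchange safe.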

    19. ICU 3.6. Mature, portable C/C++/Java int’l libraries. Unicode 5.0, UCA 5.0, CLDR 1.4. ICU4C: Charset Detection. Improved: Time Zones, Thai word break, UText (64 bit), Performance, Data Management, … ICU4J: Globalization Preferences. Flexible date/time formats*, Charset conversion*

    20. Near-Term Issues. Unicode 5.0.1, Unicode 5.1. CLDR / BCP 47bis. LDAP. Collation Registry. IANA Charset Registry.

    21. Unicode 5.1 possibilities. Characters: CJK Unified Ideographs Extension C; minority scripts (Cham and Lanna); Malayalam chillu; … Properties/Behavior: normalization process for stable strings; …

    22. CLDR 1.5 / BCP 47bis. CLDR 1.5: data submission starting November; new structures / data. BCP 47: adding ~7,000 (!) new language subtags; possibly other changes…

    23. LDAP. Now has definitive comparison (good). Stuck at Unicode 3.2 (bad). See http://www.ietf.org/rfc/rfc4518.txt

    24. Collation Registry. Nearing approval. Adds ability to register comparisons. Workable for basic cases. See http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-14.txt

    25. IANA Charset Registry. Currently limited usefulness: ill-defined, missing mapping tables, incomplete, inaccurate. Regime change: hope for future improvements!

    26. What’s New in Globalization? Mark Davis, President & Cofounder, The Unicode Consortium
