
From UCS-2 to UTF-16



  1. From UCS-2 to UTF-16 Discussion and practical example for the transition of a Unicode library from UCS-2 to UTF-16

  2. Why is this an issue? • The concept of the Unicode standard changed during its first few years • Unicode 2.0 (1996) expanded the code point range from 64k to 1.1M • APIs and libraries need to follow this change and support the full range • Upcoming character assignments (Unicode 3.1, 2001) fall into the added range

  3. “Unicode is a 16-bit character set” • Concept: 16-bit, fixed-width character set • Saving space by not including precomposed, rarely-used, obsolete, … characters • Compatibility, transition strategies, and acceptance forced loosening of these principles • Unicode 3.1: >90k assigned characters

  4. 16-bit APIs • APIs developed for Unicode 1.1 used 16-bit characters and strings: UCS-2 • Assuming 1:1 character:code unit • Examples: Win32, Java, COM, ICU, Qt/KDE • Byte-based UTF-8 (1993) mostly for MBCS compatibility and transfer protocols

  5. Extending the range • Set aside two blocks of 1k 16-bit values, “surrogates”, for extension • 1k × 1k = 1M = 100000₁₆ additional code points using a pair of code units • 16-bit form is now the variable-width UTF-16 • “Unicode scalar values” 0..10ffff₁₆ • Proposed: 1994; part of Unicode 2.0 (1996)
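The pair arithmetic on this slide can be sketched in C as follows. This is a minimal illustration, not code from any library: the function names are hypothetical, and the surrogate block boundaries used (lead units D800..DBFF, trail units DC00..DFFF, hex) come from the Unicode standard rather than from the slide.

```c
#include <stdint.h>

/* Hypothetical helper names, for illustration only. */

/* Encodes code point c as UTF-16 into out[].
 * Returns the number of code units written (1 or 2),
 * or 0 if c is out of range or is itself a surrogate value. */
static int utf16_encode(uint32_t c, uint16_t out[2]) {
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
        return 0;
    }
    if (c < 0x10000) {
        out[0] = (uint16_t)c;                 /* one unit, same as UCS-2 */
        return 1;
    }
    c -= 0x10000;                             /* 20-bit offset: 0..0xFFFFF */
    out[0] = (uint16_t)(0xD800 + (c >> 10));  /* lead: upper 10 bits */
    out[1] = (uint16_t)(0xDC00 + (c & 0x3FF)); /* trail: lower 10 bits */
    return 2;
}

/* Combines a lead and a trail surrogate into one code point.
 * The caller must already know that the two units form a valid pair. */
static uint32_t utf16_decode_pair(uint16_t lead, uint16_t trail) {
    return 0x10000 + (((uint32_t)(lead - 0xD800) << 10)
                      | (uint32_t)(trail - 0xDC00));
}
```

For example, U+1D11E encodes as the units D834 DD1E, and the last code point, U+10FFFF, encodes as DBFF DFFF.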

  6. Parallel with ISO-10646 • ISO-10646 uses 31-bit codes: UCS-4 • UCS-2: 16-bit codes for subset 0..ffff₁₆ • UTF-16: transformation of subset 0..10ffff₁₆ • UTF-8 covers all 31 bits • Private Use areas above 10ffff₁₆ slated for removal from ISO-10646 for UTF interoperability and synchronization with Unicode

  7. 21-bit code points • Code points (“Unicode scalar values”) up to 10ffff₁₆ use 21 bits • 16-bit code units still good for strings: variable-width like MBCS • Default string unit size not big enough for code points • Dual types for programming?

  8. C: char/wchar_t dual types • C/C++ standards: dual types • Strings mostly with char units (8 bits) • Code points: wchar_t, 8..32 bits • Typical use in I18N-ed programs: (8-bit) char strings but (16/32-bit) wchar_t (or 32-bit int) characters; code point type is implementation-dependent

  9. Unicode: dual types, too? • Strings could continue with 16-bit units • Single code points could get 32-bit data type • Dual-type model like C/C++ MBCS

  10. Alternatives to dual 16/32 types • UTF-32: all types 32 bits wide, fixed-width • UTF-8: same complexity after range extension beyond just the BMP, closer to C/C++ model – byte-based • Use pairs of 16-bit units • Use strings for everything • Make string unit size flexible 8/16/32 bits

  11. UCS-2 to UTF-32 • Fixed-width, single base type for strings and code points • UCS-2 programming assumptions mostly intact • Wastes at least 33% space, typically 50% • Performance bottleneck CPU - memory

  12. UCS-2 to UTF-8 • UCS-2 programming assumes many characters in single code units • Breaks a lot of code • Same question of type for code points; follow C model, 32-bit wchar_t? – More difficult transition than other choices

  13. Surrogate pairs for single chars • Caller avoids code point calculation • But: caller and callee need to detect and handle pairs: caller choosing argument values, callee checking for errors • Harder to use with code point constants because they are published as scalar values • Significant change for caller from using scalars
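To make the "callee checking for errors" burden concrete, here is a sketch of the validation a pair-based API would need on every call. The calling convention (two 16-bit arguments, with a second unit of 0 meaning "no second unit") and all names are assumptions made for this illustration. They are not taken from ICU or any other library.

```c
#include <stdint.h>

/* Hypothetical result codes for a pair-based character argument. */
typedef enum {
    PAIR_OK_SINGLE,         /* one BMP code unit, no second unit */
    PAIR_OK_SUPPLEMENTARY,  /* well-formed lead + trail pair */
    PAIR_ERROR              /* malformed combination */
} PairStatus;

/* Lead surrogates are D800..DBFF, trail surrogates DC00..DFFF (hex). */
static int is_lead(uint16_t u)  { return (u & 0xFC00) == 0xD800; }
static int is_trail(uint16_t u) { return (u & 0xFC00) == 0xDC00; }

/* Checks the (first, second) arguments of a hypothetical API that
 * receives one character as a pair of code units.
 * Assumed convention: second == 0 means "no second unit". */
static PairStatus check_pair_argument(uint16_t first, uint16_t second) {
    if (is_lead(first)) {
        return is_trail(second) ? PAIR_OK_SUPPLEMENTARY : PAIR_ERROR;
    }
    if (is_trail(first)) {
        return PAIR_ERROR;  /* a trail unit cannot start a character */
    }
    return second == 0 ? PAIR_OK_SINGLE : PAIR_ERROR;
}
```

With a scalar argument, none of these states can arise: the callee only has to check that the value is at most 10ffff₁₆.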

  14. Strings for single chars • Always pass in string (and offset) • Most general, handles graphemes in addition to code points • Harder to use with code point constants because they are published as scalar values • Significant change for caller from using scalars

  15. UTF-flexible • In principle, if the implementation can handle variable-width, MBCS-style strings, could it handle any UTF-size as a compile-time choice? • Adds interoperability with UTF-8/32 APIs • Almost no assumptions possible • Complexity of transition even higher than of transition to pure UTF-8, performance?

  16. Interoperability • Break existing API users no more than necessary • Interoperability with other APIs: Win32, Java, COM, now also XML DOM • UTF-16 is Unicode default: good compromise (speed/ease/space) • String units should stay 16 bits wide

  17. Does everything need to change? • String operations: search, substring, concatenation, … work with any UTF without change • Character property lookup and similar: need to support the extended range • Formatting: should handle more code points or even graphemes • Careful evaluation of all public APIs
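The first bullet can be illustrated with a plain substring search that compares 16-bit code units and knows nothing about surrogates. It is a naive sketch with a hypothetical function name. It still finds a supplementary character, because a well-formed needle's lead and trail units can only match the same units in the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first occurrence of needle in hay,
 * or -1 if there is none. Compares code units only; the same loop
 * works for UCS-2 and UTF-16 text. */
static ptrdiff_t u16_find(const uint16_t *hay, size_t hay_len,
                          const uint16_t *needle, size_t needle_len) {
    if (needle_len > hay_len) {
        return -1;
    }
    for (size_t i = 0; i + needle_len <= hay_len; ++i) {
        size_t j = 0;
        while (j < needle_len && hay[i + j] == needle[j]) {
            ++j;
        }
        if (j == needle_len) {
            return (ptrdiff_t)i;  /* offset counted in code units */
        }
    }
    return -1;
}
```

Searching the units { 'a', D834, DD1E, 'b' } for the pair { D834, DD1E } returns index 1. The result is an offset in code units, not in code points, so it can be used for substring and concatenation without conversion.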

  18. ICU: some of all • Strings: UTF-16, UChar type remains 16-bit • New UChar32 for code points • Provide macros for C to deal with all UTFs: iteration, random access, … • C++ CharacterIterator: many new functions • Property lookup/low-level: UChar32 • Formatting: strings for graphemes
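As a rough picture of the iteration helpers mentioned here, the sketch below reads one code point at a time from a string of 16-bit units. It is hand-written for illustration and is not ICU's macro code: the type and function names are stand-ins, and returning an unpaired surrogate unchanged is an assumption about error handling.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t Unit16;     /* stand-in for a 16-bit string unit type */
typedef int32_t  CodePoint;  /* stand-in for a 32-bit code point type */

/* Reads the code point starting at s[*i] and advances *i past it
 * (by one or two units). Requires *i < length.
 * Assumption: an unpaired surrogate is returned as-is. */
static CodePoint next_code_point(const Unit16 *s, size_t length, size_t *i) {
    CodePoint c = s[(*i)++];
    if ((c & 0xFC00) == 0xD800 &&             /* lead surrogate... */
        *i < length &&
        (s[*i] & 0xFC00) == 0xDC00) {         /* ...followed by a trail */
        c = 0x10000 + ((c - 0xD800) << 10) + (s[(*i)++] - 0xDC00);
    }
    return c;
}

/* Counts code points (not code units) in s[0..length). */
static size_t count_code_points(const Unit16 *s, size_t length) {
    size_t i = 0, n = 0;
    while (i < length) {
        next_code_point(s, length, &i);
        ++n;
    }
    return n;
}
```

The string type stays 16 bits wide. Only the value handed to property lookup and similar functions is widened to 32 bits.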

  19. Scalar code points: property lookup • Old, 16-bit: UChar u_tolower(UChar c) { u[v[c(15..7)] + c(6..0)]; } • New, 21-bit: UChar32 u_tolower(UChar32 c) { u[v[w[c(20..10)] + c(9..4)] + c(3..0)]; } • Notation: c(m..n) denotes bits m through n of c
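The 21-bit lookup splits a code point into bits 20..10, 9..4 and 3..0 and uses them to walk three arrays. The toy below follows that split, but everything else is an assumption for illustration: the table contents are hand-built and cover only 'A'..'Z' and the capitals U+10400..U+10427, the arrays store a delta to add rather than a final property value, and the function is named toy_tolower to make clear it is not ICU's implementation. It needs C99 designated initializers.

```c
#include <stdint.h>

/* Stage 1: indexed by bits 20..10; holds the start of a 64-entry block in v.
 * Entry 0 covers U+0000..U+03FF, entry 0x41 covers U+10400..U+107FF.
 * Every other entry is 0 and shares the all-zero block at v[0..63]. */
static const uint16_t w[0x110000 >> 10] = { [0x00] = 64, [0x41] = 128 };

/* Stage 2: indexed by block start + bits 9..4; holds the start of a
 * 16-entry block in u. Unlisted entries are 0 (the all-zero block). */
static const uint16_t v[3 * 64] = {
    [64 + 4]  = 16,   /* U+0040..U+004F */
    [64 + 5]  = 32,   /* U+0050..U+005F */
    [128 + 0] = 48,   /* U+10400..U+1040F */
    [128 + 1] = 48,   /* U+10410..U+1041F, same block reused */
    [128 + 2] = 64    /* U+10420..U+1042F */
};

/* Stage 3: indexed by block start + bits 3..0; holds the lowercase delta. */
static const int8_t u[5 * 16] = {
    /*  0: shared block for code points without a mapping */
    0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
    /* 16: '@' has no mapping, 'A'..'O' map to +32 */
    0, 32, 32, 32,  32, 32, 32, 32,  32, 32, 32, 32,  32, 32, 32, 32,
    /* 32: 'P'..'Z' map to +32, the remaining five have no mapping */
    32, 32, 32, 32,  32, 32, 32, 32,  32, 32, 32, 0,  0, 0, 0, 0,
    /* 48: sixteen capitals in a row, each +40 */
    40, 40, 40, 40,  40, 40, 40, 40,  40, 40, 40, 40,  40, 40, 40, 40,
    /* 64: U+10420..U+10427 map to +40, the rest have no mapping */
    40, 40, 40, 40,  40, 40, 40, 40,  0, 0, 0, 0,  0, 0, 0, 0
};

/* Three-stage lookup: u[v[w[c(20..10)] + c(9..4)] + c(3..0)]. */
static int32_t toy_tolower(int32_t c) {
    if (c < 0 || c > 0x10FFFF) {
        return c;  /* outside the code point range: leave unchanged */
    }
    return c + u[v[w[c >> 10] + ((c >> 4) & 0x3F)] + (c & 0xF)];
}
```

Sharing blocks is what keeps the structure compact: all unassigned or caseless ranges point at the same zero blocks, so the tables grow with the amount of distinct data rather than with the size of the 21-bit range.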

  20. Formatting: grapheme strings • Old: void setDecimalSymbol(UChar c); • New: void setDecimalSymbol(const UnicodeString &s);

  21. Codepage conversion • To Unicode: results are one or two UTF-16 code units, surrogates stored directly in the conversion table • From Unicode: triple-stage compact array access from 21-bit code points like property lookup • Single-character-conversion to Unicode now returns UChar32 values

  22. API first… • Tools and basic functions and classes are in place (property lookup, conversion, iterators, BiDi) • Public APIs reviewed and changed (“luxury” of early project stage) or deprecated and superseded by new versions • Higher-level implementations to follow before Unicode 3.1 published

  23. More implementations follow… • Collation: need to prepare for >64k primary keys • Normalization and Transliteration • Word/Sentence break iteration • Etc. • No non-BMP data before Unicode 3.1 is stable

  24. Other libraries • Java: planning stage for transition • Win32: rendering and UniScribe API largely UTF-16-ready • Linux: standardizing on 32-bit Unicode wchar_t, has UTF-8 locales like other Unixes for char* APIs • W3C: standards assume full UTF-16 range

  25. Summary • Transition from UCS-2 to UTF-16 gains importance four years after UTF-16 entered the standard • APIs for single characters need change or new versions • String APIs: no change • Implementations need to handle 21-bit code points • Range of options

  26. Resources • Unicode FAQ: http://www.unicode.org/unicode/faq/ • Unicode on IBM developerWorks: http://www.ibm.com/developer/unicode/ • ICU: http://oss.software.ibm.com/icu/
