From UCS-2 to UTF-16 Discussion and practical example for the transition of a Unicode library from UCS-2 to UTF-16
Why is this an issue? • The concept of the Unicode standard changed during its first few years • Unicode 2.0 (1996) expanded the code point range from 64k to 1.1M • APIs and libraries need to follow this change and support the full range • Upcoming character assignments (Unicode 3.1, 2001) fall into the added range
“Unicode is a 16-bit character set” • Concept: 16-bit, fixed-width character set • Saving space by not including precomposed, rarely-used, obsolete, … characters • Compatibility, transition strategies, and acceptance forced loosening of these principles • Unicode 3.1: >90k assigned characters
16-bit APIs • APIs developed for Unicode 1.1 used 16-bit characters and strings: UCS-2 • Assuming 1:1 character:code unit • Examples: Win32, Java, COM, ICU, Qt/KDE • Byte-based UTF-8 (1993) mostly for MBCS compatibility and transfer protocols
Extending the range • Set aside two blocks of 1k 16-bit values, “surrogates”, for extension • 1k x 1k = 1M = 0x100000 additional code points using a pair of code units • 16-bit form now variable-width UTF-16 • “Unicode scalar values” 0..0x10FFFF • Proposed: 1994; part of Unicode 2.0 (1996)
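The surrogate arithmetic on this slide can be sketched in C as follows. This is a minimal illustration, not code from any particular library: the function names are hypothetical, and the block ranges 0xD800..0xDBFF (first unit) and 0xDC00..0xDFFF (second unit) are the ones defined for UTF-16.

```c
#include <assert.h>
#include <stdint.h>

/* Each surrogate block holds 0x400 (1k) values. */
#define LEAD_MIN          0xD800u   /* first unit of a pair:  0xD800..0xDBFF */
#define TRAIL_MIN         0xDC00u   /* second unit of a pair: 0xDC00..0xDFFF */
#define SUPPLEMENTARY_MIN 0x10000u  /* first code point beyond 16 bits */

/* Hypothetical helpers: test which block a 16-bit unit falls into. */
static int is_lead(uint16_t unit)  { return (unit & 0xFC00u) == LEAD_MIN; }
static int is_trail(uint16_t unit) { return (unit & 0xFC00u) == TRAIL_MIN; }

/* Combine a pair of code units into a code point in 0x10000..0x10FFFF. */
static uint32_t pair_to_code_point(uint16_t lead, uint16_t trail) {
    assert(is_lead(lead) && is_trail(trail));
    return SUPPLEMENTARY_MIN
         + ((uint32_t)(lead - LEAD_MIN) << 10)
         + (uint32_t)(trail - TRAIL_MIN);
}

/* Split a code point in 0x10000..0x10FFFF into its pair of code units. */
static void code_point_to_pair(uint32_t c, uint16_t *lead, uint16_t *trail) {
    assert(c >= SUPPLEMENTARY_MIN && c <= 0x10FFFFu);
    c -= SUPPLEMENTARY_MIN;                       /* 20-bit offset */
    *lead  = (uint16_t)(LEAD_MIN  + (c >> 10));   /* upper 10 bits */
    *trail = (uint16_t)(TRAIL_MIN + (c & 0x3FFu)); /* lower 10 bits */
}
```

Because the two blocks do not overlap each other or any other 16-bit value, a single unit can be classified without looking at its neighbors.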
Parallel with ISO-10646 • ISO-10646 uses 31-bit codes: UCS-4 • UCS-2: 16-bit codes for subset 0..0xFFFF • UTF-16: transformation of subset 0..0x10FFFF • UTF-8 covers all 31 bits • Private Use areas above 0x10FFFF slated for removal from ISO-10646 for UTF interoperability and synchronization with Unicode
21-bit code points • Code points (“Unicode scalar values”) up to 0x10FFFF use 21 bits • 16-bit code units still good for strings: variable-width like MBCS • Default string unit size not big enough for code points • Dual types for programming?
C: char/wchar_t dual types • C/C++ standards: dual types • Strings mostly with char units (8 bits) • Code points: wchar_t, 8..32 bits • Typical use in I18N-ed programs: (8-bit) char strings but (16/32-bit) wchar_t (or 32-bit int) characters; code point type is implementation-dependent
Unicode: dual types, too? • Strings could continue with 16-bit units • Single code points could get 32-bit data type • Dual-type model like C/C++ MBCS
Alternatives to dual 16/32 types • UTF-32: all types 32 bits wide, fixed-width • UTF-8: same complexity after range extension beyond just the BMP, closer to C/C++ model – byte-based • Use pairs of 16-bit units • Use strings for everything • Make string unit size flexible 8/16/32 bits
UCS-2 to UTF-32 • Fixed-width, single base type for strings and code points • UCS-2 programming assumptions mostly intact • Wastes at least 33% space, typically 50% • Performance bottleneck between CPU and memory
UCS-2 to UTF-8 • UCS-2 programming assumes many characters in single code units • Breaks a lot of code • Same question of type for code points; follow C model, 32-bit wchar_t? – More difficult transition than other choices
Surrogate pairs for single chars • Caller avoids code point calculation • But: caller and callee need to detect and handle pairs: caller choosing argument values, callee checking for errors • Harder to use with code point constants because they are published as scalar values • Significant change for caller from using scalars
Strings for single chars • Always pass in string (and offset) • Most general, handles graphemes in addition to code points • Harder to use with code point constants because they are published as scalar values • Significant change for caller from using scalars
UTF-flexible • In principle, if the implementation can handle variable-width, MBCS-style strings, could it handle any UTF-size as a compile-time choice? • Adds interoperability with UTF-8/32 APIs • Almost no assumptions possible • Complexity of transition even higher than of transition to pure UTF-8, performance?
Interoperability • Break existing API users no more than necessary • Interoperability with other APIs: Win32, Java, COM, now also XML DOM • UTF-16 is Unicode default: good compromise (speed/ease/space) • String units should stay 16 bits wide
Does everything need to change? • String operations: search, substring, concatenation, … work with any UTF without change • Character property lookup and similar: need to support the extended range • Formatting: should handle more code points or even graphemes • Careful evaluation of all public APIs
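To illustrate why plain string operations carry over unchanged, here is a minimal sketch of a substring search that compares 16-bit code units only and knows nothing about surrogates. The function name `find_units` is hypothetical. The sketch assumes both strings are well-formed UTF-16; under that assumption, a needle that starts and ends on code point boundaries can only match on code point boundaries, because lead surrogates, trail surrogates, and single units occupy disjoint value ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive search for `needle` in `hay`, comparing 16-bit code units only.
 * Returns the index (in code units) of the first match, or -1 if none. */
static ptrdiff_t find_units(const uint16_t *hay, size_t hay_len,
                            const uint16_t *needle, size_t needle_len) {
    assert(hay != NULL && needle != NULL);
    if (needle_len > hay_len) {
        return -1;
    }
    for (size_t i = 0; i + needle_len <= hay_len; ++i) {
        size_t j = 0;
        while (j < needle_len && hay[i + j] == needle[j]) {
            ++j;
        }
        if (j == needle_len) {
            return (ptrdiff_t)i;   /* all units of the needle matched */
        }
    }
    return -1;
}
```

For example, searching the units `0x0061 0xD834 0xDD1E 0x0062` for the pair `0xD834 0xDD1E` finds it at index 1 without any surrogate-specific logic.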
ICU: some of all • Strings: UTF-16, UChar type remains 16-bit • New UChar32 for code points • Provide macros for C to deal with all UTFs: iteration, random access, … • C++ CharacterIterator: many new functions • Property lookup/low-level: UChar32 • Formatting: strings for graphemes
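The dual-type model (16-bit string units, 32-bit code points) can be sketched with a small iteration helper. This is not ICU's macro set: the type names `Unit16` and `CodePoint` and the functions below are hypothetical stand-ins, and returning an unpaired surrogate as its own value is just one possible error policy chosen for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t Unit16;     /* hypothetical 16-bit string unit type */
typedef int32_t  CodePoint;  /* hypothetical 32-bit code point type  */

/* Read the code point starting at s[*i] and advance *i past it.
 * Precondition: *i < length. An unpaired surrogate is returned as-is
 * (an assumption of this sketch, not a statement about any library). */
static CodePoint next_code_point(const Unit16 *s, size_t length, size_t *i) {
    assert(*i < length);
    CodePoint c = s[(*i)++];
    if ((c & 0xFC00) == 0xD800 && *i < length) {      /* lead surrogate? */
        Unit16 trail = s[*i];
        if ((trail & 0xFC00) == 0xDC00) {             /* followed by trail? */
            ++*i;
            c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
        }
    }
    return c;
}

/* Count code points (not code units) in a UTF-16 string. */
static size_t count_code_points(const Unit16 *s, size_t length) {
    size_t i = 0, n = 0;
    while (i < length) {
        (void)next_code_point(s, length, &i);
        ++n;
    }
    return n;
}
```

Callers keep passing 16-bit strings and lengths in code units; only the value handed to per-character functions widens to 32 bits.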
Scalar code points: property lookup • Old, 16-bit: UChar u_tolower(UChar c) { return u[v[c >> 7] + (c & 0x7f)]; } • New, 21-bit: UChar32 u_tolower(UChar32 c) { return u[v[w[c >> 10] + ((c >> 4) & 0x3f)] + (c & 0xf)]; }
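The three-stage lookup for 21-bit code points can be sketched as a self-contained example. The bit split (bits 20..10, 9..4, 3..0) follows the slide; everything else is an assumption made for illustration: the table names reuse the slide's `w`, `v`, `u`, the tables store a signed delta rather than whatever the real data holds, they are filled at run time by a hypothetical `set_delta` helper instead of being generated offline, and the demo data is deliberately tiny (ASCII A..Z plus U+10400).

```c
#include <assert.h>
#include <stdint.h>

enum {
    STAGE1_LEN = 0x110000 >> 10, /* one entry per value of bits 20..10 */
    V_BLOCK    = 64,             /* bits 9..4 select within a v block  */
    U_BLOCK    = 16,             /* bits 3..0 select within a u block  */
    MAX_BLOCKS = 32              /* capacity of this demo only         */
};

/* Zero-initialized statics: block 0 of v and u is a shared "no mapping" block. */
static uint16_t w[STAGE1_LEN];            /* -> start of a v block    */
static uint16_t v[MAX_BLOCKS * V_BLOCK];  /* -> start of a u block    */
static int16_t  u[MAX_BLOCKS * U_BLOCK];  /* -> delta to add to c     */
static unsigned v_used = V_BLOCK, u_used = U_BLOCK;

/* Hypothetical builder: record that lowercase(c) == c + delta. */
static void set_delta(uint32_t c, int16_t delta) {
    assert(c <= 0x10FFFFu);
    uint16_t *top = &w[c >> 10];
    if (*top == 0) {                       /* still the shared block: allocate */
        assert(v_used + V_BLOCK <= MAX_BLOCKS * V_BLOCK);
        *top = (uint16_t)v_used;
        v_used += V_BLOCK;
    }
    uint16_t *mid = &v[*top + ((c >> 4) & 0x3Fu)];
    if (*mid == 0) {
        assert(u_used + U_BLOCK <= MAX_BLOCKS * U_BLOCK);
        *mid = (uint16_t)u_used;
        u_used += U_BLOCK;
    }
    u[*mid + (c & 0xFu)] = delta;
}

/* Demo data only: ASCII capitals and one supplementary code point. */
static void build_demo_tables(void) {
    for (uint32_t c = 'A'; c <= 'Z'; ++c) {
        set_delta(c, 0x20);
    }
    set_delta(0x10400u, 0x28);             /* U+10400 -> U+10428 */
}

/* Three array accesses, no branches on the code point's range. */
static uint32_t to_lower_sketch(uint32_t c) {
    if (c > 0x10FFFFu) {
        return c;                          /* outside the code point range */
    }
    int16_t delta = u[v[w[c >> 10] + ((c >> 4) & 0x3Fu)] + (c & 0xFu)];
    return (uint32_t)((int32_t)c + delta);
}
```

Code points without an entry fall through the shared all-zero blocks and map to themselves, which is what keeps such tables compact when most of the range has no mapping.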
Formatting: grapheme strings • Old: void setDecimalSymbol(UChar c); • New: void setDecimalSymbol(const UnicodeString &s);
Codepage conversion • To Unicode: results are one or two UTF-16 code units, surrogates stored directly in the conversion table • From Unicode: triple-stage compact array access from 21-bit code points like property lookup • Single-character-conversion to Unicode now returns UChar32 values
API first… • Tools and basic functions and classes are in place (property lookup, conversion, iterators, BiDi) • Public APIs reviewed and changed (“luxury” of early project stage) or deprecated and superseded by new versions • Higher-level implementations to follow before Unicode 3.1 published
More implementations follow… • Collation: need to prepare for >64k primary keys • Normalization and Transliteration • Word/Sentence break iteration • Etc. • No non-BMP data before Unicode 3.1 is stable
Other libraries • Java: planning stage for transition • Win32: rendering and UniScribe API largely UTF-16-ready • Linux: standardizing on 32-bit Unicode wchar_t, has UTF-8 locales like other Unixes for char* APIs • W3C: standards assume full UTF-16 range
Summary • Transition from UCS-2 to UTF-16 gains importance now that UTF-16 has been part of the standard for four years • APIs for single characters need change or new versions • String APIs: no change • Implementations need to handle 21-bit code points • Range of options
Resources • Unicode FAQ: http://www.unicode.org/unicode/faq/ • Unicode on IBM developerWorks: http://www.ibm.com/developer/unicode/ • ICU: http://oss.software.ibm.com/icu/