ICU Character Conversion API
Markus Scherer, ICU Team, IBM Cupertino
First ICU Developer Workshop

Codepages
• Codepages, character sets, etc. are collections of coded characters
• For text exchange they are byte-serialized: characters must be converted to and from a byte stream
• One character may need 1..4 bytes
• Some are stateful, switching modes with SI/SO/ESC…

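To make "stateful" concrete, here is a minimal sketch of a toy stream in which SO (0x0E) switches to double-byte mode and SI (0x0F) switches back to single-byte mode. The function name `count_chars_si_so` is hypothetical and the format is deliberately simplified; it is not an ICU function and does not model any particular real codepage exactly.

```c
#include <assert.h>
#include <stddef.h>

enum { SO = 0x0E, SI = 0x0F };  /* shift-out / shift-in control bytes */

/* Counts the characters in a toy stateful byte stream.
 * The stream starts in single-byte mode; SO switches to double-byte mode,
 * SI switches back. Returns -1 if the stream ends inside a double-byte
 * character. */
long count_chars_si_so(const unsigned char *bytes, size_t length) {
    assert(bytes != NULL || length == 0);
    int double_byte = 0;  /* the decoder state: current mode */
    long count = 0;
    size_t i = 0;
    while (i < length) {
        unsigned char b = bytes[i];
        if (b == SO) {
            double_byte = 1;
            ++i;
        } else if (b == SI) {
            double_byte = 0;
            ++i;
        } else if (double_byte) {
            if (i + 1 >= length) {
                return -1;  /* truncated double-byte character */
            }
            i += 2;
            ++count;
        } else {
            ++i;
            ++count;
        }
    }
    return count;
}
```

The same byte value means different things depending on the bytes that came before it, which is why such a stream cannot be decoded starting from an arbitrary position.
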
Unicode
• Unicode is a coded character set
• Its repertoire is a superset of most other codepages’ repertoires
• Several encoding schemes for exchange
• ICU internal: UTF-16 with platform endianness
• Internal use is based on 16-bit units (UChars), not bytes (a character encoding form)

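The difference between 16-bit units in memory (encoding form) and bytes on the wire (encoding scheme) can be sketched as follows. `utf16_to_bytes` is a hypothetical helper written for this illustration, not part of ICU; it serializes code units in an explicitly chosen byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Serializes 16-bit code units into bytes in the requested byte order.
 * dst must have room for 2 * count bytes. Returns the number of bytes
 * written. */
size_t utf16_to_bytes(const uint16_t *units, size_t count,
                      int big_endian, unsigned char *dst) {
    assert(units != NULL && dst != NULL);
    for (size_t i = 0; i < count; ++i) {
        unsigned char hi = (unsigned char)(units[i] >> 8);
        unsigned char lo = (unsigned char)(units[i] & 0xFF);
        dst[2 * i]     = big_endian ? hi : lo;
        dst[2 * i + 1] = big_endian ? lo : hi;
    }
    return 2 * count;
}
```

For the units 0x0041 0x20AC this yields the bytes 00 41 20 AC in big-endian order and 41 00 AC 20 in little-endian order.
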
ICU Codepage Conversion
• ICU has one conversion API with several implementations
• Each is optimized for a type of encoding, transparent to the user
• All conversions are between internal UTF-16 UChars and external codepage bytes
• Non-Unicode codepages need mapping tables: .ucm sources -> .cnv binary

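What a mapping table does for a single-byte codepage can be sketched with a tiny lookup table. The `Mapping` struct, the function names and the table layout are assumptions made for this illustration and say nothing about the actual .ucm or .cnv formats; the sample entries are a few windows-1252 assignments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    unsigned char byte;   /* codepage byte */
    uint16_t unicode;     /* Unicode code point (BMP only in this sketch) */
} Mapping;

/* A few windows-1252 entries, sorted by byte value. */
static const Mapping kSampleTable[] = {
    {0x80, 0x20AC}, {0x85, 0x2026}, {0x91, 0x2018}, {0x92, 0x2019}, {0x99, 0x2122}
};
static const size_t kSampleLength = sizeof kSampleTable / sizeof kSampleTable[0];

/* Codepage byte -> Unicode. Bytes below 0x80 map to themselves.
 * Returns -1 if the byte is unassigned in the table. */
long table_to_unicode(const Mapping *table, size_t length, unsigned char b) {
    assert(table != NULL);
    if (b < 0x80) {
        return b;
    }
    size_t lo = 0, hi = length;
    while (lo < hi) {  /* binary search over the sorted table */
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].byte == b) {
            return table[mid].unicode;
        } else if (table[mid].byte < b) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return -1;
}

/* Unicode -> codepage byte. Returns -1 if the table has no mapping. */
int table_from_unicode(const Mapping *table, size_t length, uint16_t c) {
    assert(table != NULL);
    if (c < 0x80) {
        return (int)c;
    }
    for (size_t i = 0; i < length; ++i) {  /* linear scan is enough here */
        if (table[i].unicode == c) {
            return table[i].byte;
        }
    }
    return -1;
}
```

A production table would cover the whole codepage and use a faster reverse lookup; the point here is only that both directions and the "unassigned" case come from data, not code.
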
ICU Conversion Capabilities
• Support for several Unicode encodings: UTF-8, UTF-16 (either endianness)
• Support for mapping tables for general encodings with 1..4 bytes per character
• Support for mappings to/from surrogate pairs (Unicode above U+FFFF)
• Stateful: ISO-2022, EBCDIC MBCS
• Lotus LMBCS

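The surrogate-pair arithmetic behind "Unicode above U+FFFF" is short enough to show in full. The helper names `utf16_encode` and `utf16_combine` are hypothetical, chosen for this sketch; the arithmetic itself is the standard UTF-16 definition.

```c
#include <assert.h>
#include <stdint.h>

/* Writes one or two UTF-16 code units for code point c.
 * Returns the number of units written. */
int utf16_encode(uint32_t c, uint16_t out[2]) {
    assert(c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF));
    if (c <= 0xFFFF) {
        out[0] = (uint16_t)c;
        return 1;
    }
    c -= 0x10000;                                /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (c >> 10));     /* lead surrogate: high 10 bits */
    out[1] = (uint16_t)(0xDC00 + (c & 0x3FF));   /* trail surrogate: low 10 bits */
    return 2;
}

/* Combines a lead and a trail surrogate back into one code point. */
uint32_t utf16_combine(uint16_t lead, uint16_t trail) {
    assert(lead >= 0xD800 && lead <= 0xDBFF);
    assert(trail >= 0xDC00 && trail <= 0xDFFF);
    return 0x10000 + ((uint32_t)(lead - 0xD800) << 10) + (uint32_t)(trail - 0xDC00);
}
```

For example, U+1F600 becomes the pair D83D DE00, and combining that pair gives U+1F600 again.
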
General Limitations
• ICU converters only map each code point from one encoding to a code point in another encoding
• No reordering or other transformations for different character models: directionality, composing characters, localized digits, vowel reordering, …
• For such transformations, additionally use the BiDi/Transliteration/Shaping APIs

ICU 1.6 Limitations
• Missing Unicode encodings: UTF-32
• SCSU is in a separate API, not a regular converter
• ISO-2022: only the “JP” country variant

Codepage Names and Aliases
• Most codepages have several names
• MIME, IANA: name lists for the Internet
• IBM, MS: numeric names
• Many OSes use their own names
• ICU: internal name + aliases
• See icu/data/convrtrs.txt

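The "internal name + aliases" idea amounts to a lookup from any known spelling to one canonical name. The sketch below uses a small hypothetical table and a hypothetical function `canonical_name`; the entries are illustrative only and are not taken from convrtrs.txt, and the canonical names are not necessarily ICU's internal names.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char kLatin1[] = "ISO-8859-1";
static const char kUtf8[]   = "UTF-8";

typedef struct {
    const char *alias;
    const char *canonical;
} Alias;

/* Illustrative entries only. */
static const Alias kAliases[] = {
    {"ISO-8859-1", kLatin1}, {"latin1", kLatin1},
    {"ibm-819", kLatin1},    {"cp819", kLatin1},
    {"UTF-8", kUtf8},        {"utf8", kUtf8}
};

/* Compares two names while ignoring ASCII case. */
static int equal_ignore_case(const char *a, const char *b) {
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) {
            return 0;
        }
        ++a;
        ++b;
    }
    return *a == *b;  /* both strings must end at the same place */
}

/* Returns the canonical name for an alias, or NULL if it is unknown. */
const char *canonical_name(const char *name) {
    assert(name != NULL);
    for (size_t i = 0; i < sizeof kAliases / sizeof kAliases[0]; ++i) {
        if (equal_ignore_case(name, kAliases[i].alias)) {
            return kAliases[i].canonical;
        }
    }
    return NULL;
}
```

With this table, "LATIN1" and "cp819" resolve to the same canonical name, while an unknown name yields NULL.
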
ICU Conversion API
• 3 main kinds of functions:
  • Streaming: ucnv_toUnicode(), ucnv_fromUnicode()
  • Forward character iteration: ucnv_getNextUChar()
  • Convenience functions for all-in-one conversion: ucnv_to/fromUChars(), etc.

Buffer Management
• Streaming functions modify the source & target pointers and try to read the entire source & fill the target
• They allow converting a stream in chunks with multiple calls; the converter object keeps state
• Target full: U_BUFFER_OVERFLOW_ERROR
• Source empty: no error, or U_TRUNCATED_CHAR_FOUND (at end of stream)

C++ Conversion API
• Wrapper around most of the C API
• Provides the same basic streaming functions
• Convenience functions for UnicodeString
• More convenient from C++
• No getNextUChar(), no custom callbacks

Basic vs Convenience Functions
• Basic streaming functions allow
  • Conversion of arbitrarily large text with limited buffers
  • Offset mappings: corresponding source-target characters
• Convenience functions are easier to use for single-buffer conversions, but provide no offsets

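An offset mapping records, for each unit written to the target, which source position it came from. The sketch below shows the idea for an ISO-8859-1 to UTF-8 conversion; `latin1_to_utf8_with_offsets` is a hypothetical function for this illustration and does not reproduce the parameter layout of the ICU streaming functions.

```c
#include <assert.h>
#include <stddef.h>

/* Converts ISO-8859-1 bytes to UTF-8 and records, for every target byte,
 * the index of the source byte it came from.
 * Returns the target length, or 0 if the capacity is too small
 * (or the source is empty). */
size_t latin1_to_utf8_with_offsets(const unsigned char *src, size_t src_len,
                                   unsigned char *dst, int *offsets,
                                   size_t capacity) {
    assert(src != NULL && dst != NULL && offsets != NULL);
    size_t n = 0;
    for (size_t i = 0; i < src_len; ++i) {
        size_t need = src[i] < 0x80 ? 1 : 2;
        if (n + need > capacity) {
            return 0;
        }
        if (need == 1) {
            dst[n] = src[i];
            offsets[n++] = (int)i;
        } else {
            dst[n] = (unsigned char)(0xC0 | (src[i] >> 6));    /* lead byte */
            offsets[n++] = (int)i;
            dst[n] = (unsigned char)(0x80 | (src[i] & 0x3F));  /* trail byte */
            offsets[n++] = (int)i;
        }
    }
    return n;
}
```

For the source bytes 61 E9 62 the target is 61 C3 A9 62 with offsets 0, 1, 1, 2. Such a mapping lets an application point back to the right place in the original input, for example when reporting a parse error found in the converted text.
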
Callbacks for Exceptions
• Callback functions are called when the source is malformed (illegal sequence) or does not encode a character (unassigned)
• Several callbacks are provided by ICU for stopping with an error, replacing with a substitution character (default), …
• User-customizable: set a user callback function and handle exceptions

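The callback mechanism can be sketched with a toy UTF-16 to ASCII converter that delegates every unmappable unit to a function pointer. The callback type, the three sample callbacks and `to_ascii` are all hypothetical; they do not use the ICU callback signatures, and the '?' substitution byte is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: handles one unit that has no ASCII mapping.
 * Returns the number of bytes placed in out (0..4), or -1 to stop. */
typedef int (*UnassignedCallback)(uint16_t unit, char out[4]);

/* Stop the conversion with an error. */
int callback_stop(uint16_t unit, char out[4]) {
    (void)unit; (void)out;
    return -1;
}

/* Replace the unit with a substitution character. */
int callback_substitute(uint16_t unit, char out[4]) {
    (void)unit;
    out[0] = '?';
    return 1;
}

/* Silently drop the unit. */
int callback_skip(uint16_t unit, char out[4]) {
    (void)unit; (void)out;
    return 0;
}

/* Converts 16-bit units to ASCII bytes, calling cb for every unit >= 0x80.
 * Returns the number of bytes written, or -1 if the callback stopped the
 * conversion or dst is too small. */
long to_ascii(const uint16_t *src, size_t count,
              char *dst, size_t capacity, UnassignedCallback cb) {
    assert(src != NULL && dst != NULL && cb != NULL);
    size_t length = 0;
    for (size_t i = 0; i < count; ++i) {
        char buffer[4];
        int produced;
        if (src[i] < 0x80) {
            buffer[0] = (char)src[i];
            produced = 1;
        } else {
            produced = cb(src[i], buffer);
            if (produced < 0) {
                return -1;
            }
        }
        if (length + (size_t)produced > capacity) {
            return -1;
        }
        for (int k = 0; k < produced; ++k) {
            dst[length++] = buffer[k];
        }
    }
    return (long)length;
}
```

The converter itself stays unaware of the policy: swapping `callback_substitute` for `callback_stop` changes the behavior on unmappable input without touching the conversion loop.
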
Fallback Mappings
• Character sets often have different repertoires
• Sometimes, if no precise mapping exists, a “good-enough” fallback mapping is OK
• ucnv_setFallback() (default: no fallbacks)
• For XML/HTML, for example, it is better to use an escape sequence like &#xx…x;

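For the XML/HTML case, the escape is a hexadecimal numeric character reference. A minimal sketch of producing one is shown below; `write_char_reference` is a hypothetical helper, not an ICU function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Writes the reference "&#xHHHH;" for code point c into buf.
 * Returns the number of characters written (without the terminating NUL),
 * or -1 if buf is too small. */
int write_char_reference(uint32_t c, char *buf, size_t capacity) {
    assert(buf != NULL && c <= 0x10FFFF);
    int n = snprintf(buf, capacity, "&#x%lX;", (unsigned long)c);
    if (n < 0 || (size_t)n >= capacity) {
        return -1;
    }
    return n;
}
```

A character that the target codepage cannot represent, such as U+20AC in an ASCII document, is then written as `&#x20AC;` and survives the conversion without loss.
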
Default Converter
• ICU default converter: name of the ICU converter that matches the system codepage
• Can be changed; do so as early as possible
• A mismatch with the system is problematic, and a change may not affect default converter instances that already exist

Invariant Characters
• Special encoding: common subset of the codepages of a family
• ASCII vs. EBCDIC
• About 84 characters have the same encoding within each family
• Limited use: internal and syntactic strings where this restriction is acceptable
• Fast: no converter object

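Before treating a string as "invariant", it is useful to check that it stays inside the safe subset. The sketch below assumes a particular set (letters, digits, space and a handful of punctuation marks); that list is an assumption made for this illustration, not ICU's authoritative definition, and `is_invariant_string` is a hypothetical helper.

```c
#include <assert.h>
#include <string.h>

/* Assumed invariant set for this sketch: letters, digits, space and some
 * punctuation. Not an authoritative list. */
static const char kInvariant[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    " \"%&'()*+,-./:;<=>?_";

/* Returns 1 if every character of s is in the assumed invariant set. */
int is_invariant_string(const char *s) {
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (strchr(kInvariant, *s) == NULL) {
            return 0;
        }
    }
    return 1;
}
```

Names such as file names, locale IDs or keywords usually pass this check; strings with characters like '@' or '[' do not and need a real converter.
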
SCSU: Unicode Compression
• Described in Unicode TR 6
• Byte-based, stateful, compact
• Can approximate the text size of a special codepage
• IANA-registered charset
• ICU: separate API, very similar to the streaming conversion functions
• But: reads/writes only complete sequences

Buffering II
• The general conversion API will consume the entire source if there is enough space in the target, or fill the entire target if there is enough source
• This holds even if source/target characters are split across buffers
• The SCSU API consumes and writes only whole units; it may leave the source non-empty and the target non-full

Streaming Conversion Loop
while(source available) {
    source buffer is empty, fill it
    do {
        to/fromUnicode();
        write contents of target
    } while(buffer overflow);
    if(failure other than buffer overflow) {
        report error
    }
}

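The loop above can be turned into compilable C with a toy stand-in for the converter. In this sketch `toy_to_unicode` (ISO-8859-1 bytes to 16-bit units) and `convert_stream` are hypothetical, the buffer sizes are deliberately tiny to force overflows, and "reading" and "writing" are simulated with memory copies. The toy converter cannot fail, so the error branch of the pseudocode has no counterpart here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { CHUNK_SIZE = 4, TARGET_SIZE = 3 };

/* Toy stand-in for a streaming converter: ISO-8859-1 bytes -> 16-bit units.
 * Advances *source and *target; returns 1 on target overflow, else 0. */
static int toy_to_unicode(uint16_t **target, const uint16_t *target_limit,
                          const unsigned char **source,
                          const unsigned char *source_limit) {
    while (*source < source_limit) {
        if (*target == target_limit) {
            return 1;
        }
        *(*target)++ = *(*source)++;
    }
    return 0;
}

/* Converts input in CHUNK_SIZE pieces through a TARGET_SIZE buffer and
 * appends the result to out. Returns the number of units written. */
size_t convert_stream(const unsigned char *input, size_t length,
                      uint16_t *out) {
    assert(input != NULL && out != NULL);
    unsigned char source_buffer[CHUNK_SIZE];
    uint16_t target_buffer[TARGET_SIZE];
    size_t consumed = 0, written = 0;

    while (consumed < length) {                      /* source available */
        size_t n = length - consumed;
        if (n > CHUNK_SIZE) {
            n = CHUNK_SIZE;
        }
        memcpy(source_buffer, input + consumed, n);  /* fill empty source buffer */
        consumed += n;

        const unsigned char *source = source_buffer;
        int overflow;
        do {
            uint16_t *target = target_buffer;
            overflow = toy_to_unicode(&target, target_buffer + TARGET_SIZE,
                                      &source, source_buffer + n);
            size_t count = (size_t)(target - target_buffer);
            memcpy(out + written, target_buffer,     /* write contents of target */
                   count * sizeof(uint16_t));
            written += count;
        } while (overflow);
    }
    return written;
}
```

Note that the source pointer is not reset inside the inner loop: after an overflow the converter simply continues where it stopped, with a freshly emptied target buffer.
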
Streaming Loop with SCSU
while(source available) {
    source buffer may not be empty, append to it
    [de]compress();
    write contents of target
    if(failure other than buffer overflow) {
        report error
    }
    move rest of current source to start of buffer
}

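The difference from the regular loop is that the caller, not the converter, has to keep unconsumed source bytes. The sketch below shows that carry-over pattern with a toy whole-unit function that converts only complete big-endian byte pairs. `toy_whole_units` and `convert_with_carry` are hypothetical and have nothing to do with the real SCSU format; the target is sized so that it cannot overflow, and a dangling byte at the very end of the input is dropped without an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BUFFER_SIZE = 5 };

/* Toy stand-in for a whole-unit API: converts only complete 2-byte
 * (big-endian) units and keeps no internal state.
 * Returns the number of source bytes consumed. */
static size_t toy_whole_units(const unsigned char *source, size_t length,
                              uint16_t *target, size_t *target_count) {
    size_t i = 0;
    *target_count = 0;
    for (; i + 1 < length; i += 2) {
        target[(*target_count)++] =
            (uint16_t)((source[i] << 8) | source[i + 1]);
    }
    return i;  /* an odd trailing byte is left unconsumed */
}

/* Feeds input through a small buffer, carrying unconsumed bytes over to
 * the next round. Returns the number of units written to out. */
size_t convert_with_carry(const unsigned char *input, size_t length,
                          uint16_t *out) {
    assert(input != NULL && out != NULL);
    unsigned char buffer[BUFFER_SIZE];
    uint16_t target[BUFFER_SIZE / 2];
    size_t buffered = 0, taken = 0, written = 0;

    while (taken < length) {                       /* source available */
        size_t room = BUFFER_SIZE - buffered;
        size_t n = length - taken;
        if (n > room) {
            n = room;
        }
        memcpy(buffer + buffered, input + taken, n);  /* append to the buffer */
        buffered += n;
        taken += n;

        size_t count;
        size_t used = toy_whole_units(buffer, buffered, target, &count);
        memcpy(out + written, target, count * sizeof(uint16_t));
        written += count;

        memmove(buffer, buffer + used, buffered - used);  /* move rest to start */
        buffered -= used;
    }
    return written;
}
```

With a 5-byte buffer, the fifth byte of the first round is the first half of a unit; it is moved to the start of the buffer and completed by the next read.
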
API Changes for ICU 1.6
• Streaming functions: at a full target, they now set U_BUFFER_OVERFLOW_ERROR instead of U_INDEX_OUTOFBOUNDS_ERROR, which is still used for insufficient input to ucnv_getNextUChar()
• Callback API changed: new function signatures, hiding internal structures (UConverter!), new helper functions

Future Enhancements
• UTF-32 (either endianness)
• SCSU as a regular converter
• More country variants for ISO-2022
• Collecting more precise mapping and alias tables