ICU Character Conversion API

Markus Scherer
ICU Team, IBM Cupertino
First ICU Developer Workshop

Codepages
• Codepages, character sets, etc. are collections of coded characters
• For text exchange: byte-serialized; characters must be converted to and from a byte stream
• One character may need 1..4 bytes
• Some are stateful, with SI/SO/ESC…

Unicode
• Unicode is a coded character set
• Its repertoire is a superset of most other codepages' repertoires
• Several encoding schemes for exchange
• ICU internal: UTF-16 with platform endianness
• Internal use is based on 16-bit units (UChars), not bytes (a character encoding form)
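
The distinction between 16-bit units and serialized bytes can be sketched in plain C. This is a minimal illustration that does not use ICU; the function names `utf16_encode` and `utf16_serialize` are hypothetical and exist only for this example. The first function produces the UTF-16 code units for one code point (one unit, or a surrogate pair above U+FFFF); the second shows that byte order only enters the picture when a unit is written out as bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16 code units.
 * Returns the number of units written (1 or 2), or 0 if cp is not a
 * valid Unicode scalar value (out of range or a lone surrogate). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;            /* single unit */
        return 1;
    }
    cp -= 0x10000;                        /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* lead surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* trail surrogate */
    return 2;
}

/* Serialize one 16-bit unit as two bytes in the requested byte order. */
void utf16_serialize(uint16_t unit, int big_endian, uint8_t out[2])
{
    if (big_endian) {
        out[0] = (uint8_t)(unit >> 8);
        out[1] = (uint8_t)(unit & 0xFF);
    } else {
        out[0] = (uint8_t)(unit & 0xFF);
        out[1] = (uint8_t)(unit >> 8);
    }
}
```

For example, U+1F600 becomes the two units 0xD83D 0xDE00, and the unit 0x20AC serializes as the bytes 20 AC (big-endian) or AC 20 (little-endian).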

ICU Codepage Conversion
• ICU has one conversion API with several implementations
• Each is optimized for a type of encoding, transparent to the user
• All conversions are between internal UTF-16 UChars and external codepage bytes
• Non-Unicode codepages need mapping tables: .ucm sources -> .cnv binary

ICU Conversion Capabilities
• Support for several Unicode encodings: UTF-8, UTF-16 (either endianness)
• Support for mapping tables for general encodings with 1..4 bytes per character
• Support for mappings to/from surrogate pairs (Unicode above U+FFFF)
• Stateful: ISO-2022, EBCDIC MBCS
• Lotus LMBCS

General Limitations
• ICU converters only map each code point in one encoding to a code point in the other encoding
• No reordering or other transformations for different character models: directionality, composing characters, localized digits, vowel reordering, …
• For such transformations, additionally use the BiDi/Transliteration/Shaping APIs

ICU 1.6 Limitations
• Missing Unicode encodings: UTF-32
• SCSU is in a separate API, not a regular converter
• ISO-2022: only the "JP" country variant

Codepage Names and Aliases
• Most codepages have several names
• MIME, IANA: name lists for the Internet
• IBM, MS: numeric names
• Many OSes use their own names
• ICU: internal name + aliases
• See icu/data/convrtrs.txt

ICU Conversion API
• 3 main functions:
  • Streaming: ucnv_toUnicode(), ucnv_fromUnicode()
  • Forward character iteration: ucnv_getNextUChar()
• Convenience functions for all-in-one conversion: ucnv_toUChars(), ucnv_fromUChars(), etc.

Buffer Management
• Streaming functions modify the source & target pointers and try to read the entire source & fill the target
• They allow a stream to be converted in chunks with multiple calls; the converter object keeps state between calls
• Target full: U_BUFFER_OVERFLOW_ERROR
• Source empty: no error, or U_TRUNCATED_CHAR_FOUND (at end of stream)

C++ Conversion API
• Wrapper around most of the C API
• Provides the same basic streaming functions
• Convenience functions for UnicodeString
• More convenient from C++
• No getNextUChar(), no custom callbacks

Basic vs Convenience Function
• Basic streaming functions allow
  • Conversion of arbitrarily large text with limited buffers
  • Offset mappings: corresponding source-target characters
• Convenience functions are easier to use for single-buffer conversions; no offsets

Callbacks for Exceptions
• Callback functions are called when the source is malformed (illegal sequence) or does not encode a character (unassigned)
• Several callbacks are provided by ICU for stopping with an error, replacing with a substitution character (default), …
• User-customizable: set a user callback function and handle exceptions
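
The callback idea can be sketched with a toy converter in plain C. Nothing below is ICU API: `ToyStatus`, `ToyCallback`, `toy_callback_stop`, `toy_callback_substitute`, and `toy_to_latin1` are hypothetical names, and the toy only converts UTF-16 units to Latin-1 bytes. It shows the mechanism described above: the converter hands every unit it cannot represent to a function pointer, and that function decides whether to stop with an error or write a substitute and continue.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_OK = 0, TOY_UNASSIGNED, TOY_OVERFLOW } ToyStatus;

/* Hypothetical callback type: invoked with *status == TOY_UNASSIGNED.
 * A callback may write replacement bytes and reset *status to TOY_OK. */
typedef void ToyCallback(uint16_t unit, unsigned char **target,
                         const unsigned char *targetLimit, ToyStatus *status);

/* "Stop" policy: leave the error in place so the converter returns. */
void toy_callback_stop(uint16_t unit, unsigned char **target,
                       const unsigned char *targetLimit, ToyStatus *status)
{
    (void)unit; (void)target; (void)targetLimit; (void)status;
}

/* "Substitute" policy: write '?' (an assumption of this sketch) and go on. */
void toy_callback_substitute(uint16_t unit, unsigned char **target,
                             const unsigned char *targetLimit, ToyStatus *status)
{
    (void)unit;
    if (*target < targetLimit) {
        *(*target)++ = '?';
        *status = TOY_OK;
    } else {
        *status = TOY_OVERFLOW;
    }
}

/* Toy converter: UTF-16 units to Latin-1 bytes. Returns bytes written. */
size_t toy_to_latin1(const uint16_t *src, size_t srcLen,
                     unsigned char *dst, size_t dstCap,
                     ToyCallback *onUnassigned, ToyStatus *status)
{
    unsigned char *target = dst;
    const unsigned char *targetLimit = dst + dstCap;
    *status = TOY_OK;
    for (size_t i = 0; i < srcLen; ++i) {
        if (src[i] <= 0xFF) {
            if (target == targetLimit) { *status = TOY_OVERFLOW; break; }
            *target++ = (unsigned char)src[i];
        } else {
            *status = TOY_UNASSIGNED;       /* not in Latin-1 */
            onUnassigned(src[i], &target, targetLimit, status);
            if (*status != TOY_OK) break;   /* callback chose to stop */
        }
    }
    return (size_t)(target - dst);
}
```

Converting the units 0x48, 0x20AC, 0x69 with `toy_callback_substitute` yields the bytes "H?i"; with `toy_callback_stop` the conversion ends after "H" and reports `TOY_UNASSIGNED`.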

Fallback Mappings
• Character sets often have different repertoires
• Sometimes, if no precise mapping exists, a "good-enough" fallback mapping is ok
• ucnv_setFallback() (default: no fallbacks)
• In XML/HTML, for example, it is better to use an escape sequence like &#xx…x;
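
The escape-sequence alternative for XML/HTML can be sketched without ICU. The function name `escape_to_ascii` is hypothetical, and the target charset is assumed to be plain ASCII for simplicity: every code point outside ASCII is written as a hexadecimal numeric character reference instead of being replaced by a look-alike. The sketch does not escape markup characters such as `&` or `<`.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Write code points as ASCII text, replacing every non-ASCII code point
 * with an XML/HTML numeric character reference of the form &#xHHHH;.
 * Returns the number of characters written (excluding the terminating
 * NUL), or -1 if dst is too small. */
int escape_to_ascii(const uint32_t *cps, size_t n, char *dst, size_t cap)
{
    size_t used = 0;
    if (cap == 0) {
        return -1;
    }
    dst[0] = '\0';
    for (size_t i = 0; i < n; ++i) {
        int w;
        if (cps[i] < 0x80) {
            w = snprintf(dst + used, cap - used, "%c", (int)cps[i]);
        } else {
            w = snprintf(dst + used, cap - used, "&#x%lX;",
                         (unsigned long)cps[i]);
        }
        /* snprintf reports the length it needed; reject truncation. */
        if (w < 0 || (size_t)w >= cap - used) {
            return -1;
        }
        used += (size_t)w;
    }
    return (int)used;
}
```

With the input 'a', U+20AC, U+1F600 the result is the 18-character string `a&#x20AC;&#x1F600;`, which any XML or HTML parser turns back into the original characters regardless of the document's charset.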

Default Converter
• ICU default converter: name of the ICU converter that matches the system codepage
• Can be changed; do so as early as possible
• A mismatch with the system is problematic, and a change may not affect existing default converter instances

Invariant Characters
• Special encoding: common subset of the codepages of a family
• ASCII vs. EBCDIC
• About 84 characters have the same encoding within each family
• Limited use for internal and syntactic strings where this is ok
• Fast: no converter object

SCSU: Unicode Compression
• Described in Unicode TR 6
• Byte-based, stateful, compact
• Can approximate the text size of a special codepage
• IANA-registered charset
• ICU: separate API, very similar to the streaming conversion functions
• But: reads/writes only complete sequences

Buffering II
• The general conversion API will consume the entire source if there is enough space in the target, or fill the entire target if there is enough source
• It does so even if source/target characters are split across buffers
• The SCSU API consumes and writes only whole units, and may leave the source non-empty and the target non-full

Streaming Conversion Loop
while(source available) {
    source buffer is empty, fill it
    do {
        to/fromUnicode();
        write contents of target
    } while(buffer overflow);
    if(failure other than buffer overflow) {
        report error
    }
}
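
The loop above can be made concrete with a toy converter in plain C. This is a sketch, not ICU code: `toy_toUnicode` is a hypothetical stand-in that only accepts ASCII bytes, but it follows the same contract as the streaming functions (it advances the source and target pointers and reports a full target through a status value). `ToyStatus` and `convert_stream` are likewise invented for this example, and the tiny buffer sizes are chosen only to force several overflow rounds.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { TOY_OK = 0, TOY_BUFFER_OVERFLOW, TOY_ILLEGAL_BYTE } ToyStatus;

/* Toy stand-in for a streaming converter: ASCII bytes to 16-bit units.
 * Advances *source and *target; stops when the target is full. */
void toy_toUnicode(uint16_t **target, const uint16_t *targetLimit,
                   const unsigned char **source,
                   const unsigned char *sourceLimit, ToyStatus *status)
{
    *status = TOY_OK;
    while (*source < sourceLimit) {
        if (**source >= 0x80) { *status = TOY_ILLEGAL_BYTE; return; }
        if (*target == targetLimit) { *status = TOY_BUFFER_OVERFLOW; return; }
        *(*target)++ = *(*source)++;
    }
}

/* Convert `in` through small fixed buffers, appending the result to `out`.
 * Returns the number of units written, or -1 on error. */
long convert_stream(const unsigned char *in, size_t inLen,
                    uint16_t *out, size_t outCap)
{
    unsigned char srcBuf[4];
    uint16_t dstBuf[3];
    size_t inPos = 0, outLen = 0;

    while (inPos < inLen) {                    /* while(source available) */
        /* source buffer is empty, fill it */
        size_t chunk = inLen - inPos < sizeof srcBuf ? inLen - inPos
                                                     : sizeof srcBuf;
        memcpy(srcBuf, in + inPos, chunk);
        inPos += chunk;

        const unsigned char *source = srcBuf;
        ToyStatus status;
        do {
            uint16_t *target = dstBuf;
            toy_toUnicode(&target, dstBuf + 3, &source, srcBuf + chunk,
                          &status);
            /* write contents of target */
            size_t produced = (size_t)(target - dstBuf);
            if (outLen + produced > outCap) return -1;
            memcpy(out + outLen, dstBuf, produced * sizeof dstBuf[0]);
            outLen += produced;
        } while (status == TOY_BUFFER_OVERFLOW);

        if (status != TOY_OK) return -1;       /* failure other than overflow */
    }
    return (long)outLen;
}
```

Because the converter consumes the whole source buffer before the outer loop continues, the source buffer is always empty when it is refilled, which is the property the pseudocode relies on.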

Streaming Loop with SCSU
while(source available) {
    source buffer may not be empty, append to it
    [de]compress();
    write contents of target
    if(failure other than buffer overflow) {
        report error
    }
    move rest of current source to start of buffer
}
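
The buffer handling in this second loop can be sketched in plain C as well. The code below does not implement SCSU and does not use ICU: `toy_decompress` is a hypothetical stand-in that turns big-endian byte pairs into 16-bit units, chosen only because it shares the relevant property of consuming whole sequences and leaving any incomplete remainder in the source. `ToyStatus` and `decode_stream` are also invented names. One assumption goes beyond the pseudocode: the loop keeps running while complete sequences are still held in the buffer, so that input left behind by a full target is not lost at the end of the stream.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { TOY_OK = 0, TOY_BUFFER_OVERFLOW } ToyStatus;

/* Toy stand-in for a whole-sequence decoder: consumes only complete
 * 2-byte big-endian sequences; a trailing odd byte stays in the source. */
void toy_decompress(uint16_t **target, const uint16_t *targetLimit,
                    const unsigned char **source,
                    const unsigned char *sourceLimit, ToyStatus *status)
{
    *status = TOY_OK;
    while (sourceLimit - *source >= 2) {
        if (*target == targetLimit) { *status = TOY_BUFFER_OVERFLOW; return; }
        *(*target)++ = (uint16_t)(((*source)[0] << 8) | (*source)[1]);
        *source += 2;
    }
}

/* Decode `in` through small fixed buffers, appending to `out`.
 * Returns the number of units written, or -1 on error. */
long decode_stream(const unsigned char *in, size_t inLen,
                   uint16_t *out, size_t outCap)
{
    unsigned char srcBuf[8];
    uint16_t dstBuf[2];
    size_t inPos = 0, held = 0, outLen = 0;

    while (inPos < inLen || held >= 2) {
        /* source buffer may not be empty, append to it */
        size_t room = sizeof srcBuf - held;
        size_t chunk = inLen - inPos < room ? inLen - inPos : room;
        memcpy(srcBuf + held, in + inPos, chunk);
        inPos += chunk;
        held += chunk;

        const unsigned char *source = srcBuf;
        uint16_t *target = dstBuf;
        ToyStatus status;
        toy_decompress(&target, dstBuf + 2, &source, srcBuf + held, &status);

        /* write contents of target */
        size_t produced = (size_t)(target - dstBuf);
        if (outLen + produced > outCap) return -1;
        memcpy(out + outLen, dstBuf, produced * sizeof dstBuf[0]);
        outLen += produced;

        /* failure other than buffer overflow (the toy has none) */
        if (status != TOY_OK && status != TOY_BUFFER_OVERFLOW) return -1;

        /* move rest of current source to start of buffer */
        held = (size_t)(srcBuf + held - source);
        memmove(srcBuf, source, held);
    }
    return held == 0 ? (long)outLen : -1;      /* leftover byte: truncated */
}
```

Unlike the general streaming loop, there is no inner do/while here: after each call the unconsumed bytes are moved to the front of the source buffer and new input is appended behind them.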

API Changes for ICU 1.6
• Streaming functions: a full target used to set U_INDEX_OUTOFBOUNDS_ERROR, which is still used for insufficient input to ucnv_getNextUChar()
• Callback API changed: new function signatures, hiding internal structures (UConverter!), new helper functions

Future Enhancements
• UTF-32 (either endianness)
• SCSU as a regular converter
• More country variants for ISO-2022
• Collecting more precise mapping and alias tables
