Bits of Unicode: Data structures for a large character set • Mark Davis, IBM Emerging Technologies
☢ Caution ☢ • “Character” is ambiguous; it can mean any of: • Graphemes: “x̣” (also “ch”, …) • Code points: 0078 0323 • Code units: 0078 0323 in UTF-16 (or 78 CC A3 in UTF-8) • For programmers: • Unicode associates codepoints (or sequences of codepoints) with properties • See UTR #17
The Problem • Programs often have to do <key,value> lookups • Look up properties by codepoint • Map codepoints to values • Test codepoints for inclusion in set • e.g. value == true/false • Easy with 256 codepoints: just use array
Size Matters • Not so easy with Unicode! • Unicode 3.0 • subset (except PUA) • up to FFFF₁₆ = 65,535₁₀ • Unicode 3.1 • full range • up to 10FFFF₁₆ = 1,114,111₁₀
Array Lookup • With ASCII • Simple • Fast • Compact • codepoint ➠ bit: 32 bytes • codepoint ➠ short: ½ K • With Unicode • Simple • Fast • Huge (esp. v3.1) • codepoint ➠ bit: 136 K • codepoint ➠ short: 2.2 M
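The sizes on this slide follow from simple arithmetic. A minimal sketch that reproduces them, assuming a flat array with one entry per codepoint; the class and method names are made up for illustration:

```java
final class ArrayLookupSizes {
    /** Bytes needed for a flat array holding one entry of bitsPerEntry bits per codepoint. */
    static long tableBytes(long codepoints, int bitsPerEntry) {
        return (codepoints * bitsPerEntry + 7) / 8; // round up to whole bytes
    }

    public static void main(String[] args) {
        long small = 0x100;    // 256 codepoints, the case the slides call easy
        long full = 0x110000;  // 0..10FFFF, the full Unicode 3.1 range
        System.out.println("256 codepoints, bit:    " + tableBytes(small, 1) + " bytes");
        System.out.println("256 codepoints, short:  " + tableBytes(small, 16) + " bytes");
        System.out.println("full range, bit:        " + tableBytes(full, 1) + " bytes");
        System.out.println("full range, short:      " + tableBytes(full, 16) + " bytes");
    }
}
```

The full-range results are 139,264 bytes (136 K) for a bit per codepoint and 2,228,224 bytes (about 2.2 million) for a short per codepoint.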
Further complications • Mappings, tests, properties often must be for sequences of codepoints. • Human languages don’t just use single codepoints. • “ch” in Spanish, Slovak; etc.
First step: Avoidance • Properties from libraries often suffice • Test for (Character.getType(c) == Nd) instead of a long list of codepoints • Easier • Automatically updated with new versions • Data structures from libraries often suffice • Java Hashtable • ICU (Java or C++) CompactArray • JavaScript properties • Consult http://www.unicode.org
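In Java, the Nd test from this slide can be written against the standard library. A small sketch; the class and method names are illustrative, and `Character.DECIMAL_DIGIT_NUMBER` is the Java constant for the Nd general category:

```java
final class DigitCheck {
    /** True if the codepoint has general category Nd (decimal digit number). */
    static boolean isDecimalDigit(int codePoint) {
        return Character.getType(codePoint) == Character.DECIMAL_DIGIT_NUMBER;
    }

    public static void main(String[] args) {
        System.out.println(isDecimalDigit('7'));     // ASCII digit
        System.out.println(isDecimalDigit(0x0663));  // ARABIC-INDIC DIGIT THREE
        System.out.println(isDecimalDigit('x'));     // a letter
    }
}
```

No codepoint list is hard-coded, so the result tracks whatever Unicode version the runtime supports.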
Data structures: criteria • Speed • Read (static) • Write (dynamic) • Startup • Memory footprint • RAM • Disk • Multi-threading
Hashtables • Advantages • Easy to use out-of-the-box • Reasonably fast • General • Disadvantages • High overhead • Discrete (no range lookup) • Much slower than array lookup
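Overhead: char1 ➠ char2 • [Figure: memory layout of one hashtable entry for a char1 ➠ char2 mapping. Labeled parts: per-object overhead, next, hash, key, value, and the separately stored char1 and char2, each with its own overhead.]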
Trie • Advantages • Nearly as fast as array lookup • Much smaller than arrays or Hashtables • Take advantage of repetition • Disadvantages • Not suited for rapidly changing data • Best for static, preformed data
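Trie structure • [Figure: the codepoint is split into two bit fields, M1 and M2. M1 selects an entry in Index, which points to a block in Data; M2 selects the value within that block.]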
Trie code • 5 Operations • Shift, Lookup, Mask, Add, Lookup
    v = data[index[c >> S1] + (c & M2)]
• [Figure: the codepoint split into M1 (high bits, shifted down by S1) and M2 (low bits)]
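The lookup above can be sketched end to end. This is a minimal illustration, not ICU's CompactArray: the 7-bit block size, the `SimpleTrie` name, and the build-from-flat-array constructor are all assumptions. Identical blocks are stored once, which is where the compaction comes from.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class SimpleTrie {
    static final int S1 = 7;              // assumed: 128 values per block
    static final int BLOCK = 1 << S1;
    static final int M2 = BLOCK - 1;      // mask for the low bits

    final int[] index;   // start offset in data for each block of codepoints
    final byte[] data;   // distinct blocks laid end to end

    /** Builds from a flat per-codepoint array whose length is a multiple of BLOCK. */
    SimpleTrie(byte[] flat) {
        index = new int[flat.length >> S1];
        Map<ByteBuffer, Integer> seen = new HashMap<>(); // block contents -> offset
        byte[] packed = new byte[flat.length];
        int used = 0;
        for (int b = 0; b < index.length; b++) {
            ByteBuffer block = ByteBuffer.wrap(flat, b << S1, BLOCK);
            Integer start = seen.get(block);
            if (start == null) {            // first time this block content appears
                start = used;
                System.arraycopy(flat, b << S1, packed, used, BLOCK);
                used += BLOCK;
                seen.put(block, start);
            }
            index[b] = start;
        }
        data = Arrays.copyOf(packed, used);
    }

    /** Shift, lookup, mask, add, lookup. */
    byte get(int c) {
        return data[index[c >> S1] + (c & M2)];
    }
}
```

For a property that is zero almost everywhere, nearly all index entries point at the same all-zero block, so `data` stays tiny while `get` costs only two array reads.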
Trie: double indexed • Double, for more compaction: • Slightly slower than single index • Smaller chunks of data, so more compaction
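Trie: double indexed • [Figure: the codepoint is split into three bit fields, M1, M2 and M3. M1 selects an entry in Index1; that entry plus M2 selects an entry in Index2; that entry plus M3 selects the value in Data.]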
Trie code: double indexed
    b1 = index1[ c >> S1 ]
    b2 = index2[ b1 + ((c >> S2) & M2) ]
    v = data[ b2 + (c & M3) ]
• [Figure: the codepoint split into M1, M2 and M3, with shifts S1 and S2]
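A sketch of the double-indexed variant, under assumed parameters (S1 = 11, S2 = 5, so the codepoint splits into 10 + 6 + 5 bits). The `TwoStageTrie` name and the `compact` helper are hypothetical; the same block-sharing step is applied first to the data and then to the second-level index.

```java
import java.nio.IntBuffer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class TwoStageTrie {
    static final int S1 = 11, S2 = 5;                 // assumed bit split
    static final int M2 = (1 << (S1 - S2)) - 1;       // middle 6 bits
    static final int M3 = (1 << S2) - 1;              // low 5 bits

    final int[] index1, index2, data;

    /** Builds from a flat per-codepoint array whose length is a multiple of 1 << S1. */
    TwoStageTrie(int[] flat) {
        int[] stage2 = new int[flat.length >> S2];    // uncompacted second-level index
        data = compact(flat, 1 << S2, stage2);
        index1 = new int[flat.length >> S1];
        index2 = compact(stage2, 1 << (S1 - S2), index1);
    }

    /** Packs the distinct blocks of flat together; starts[b] receives the offset of block b. */
    static int[] compact(int[] flat, int blockSize, int[] starts) {
        Map<IntBuffer, Integer> seen = new HashMap<>();
        int[] packed = new int[flat.length];
        int used = 0;
        for (int b = 0; b < starts.length; b++) {
            IntBuffer block = IntBuffer.wrap(flat, b * blockSize, blockSize);
            Integer start = seen.get(block);
            if (start == null) {
                start = used;
                System.arraycopy(flat, b * blockSize, packed, used, blockSize);
                used += blockSize;
                seen.put(block, start);
            }
            starts[b] = start;
        }
        return Arrays.copyOf(packed, used);
    }

    int get(int c) {
        int b1 = index1[c >> S1];
        int b2 = index2[b1 + ((c >> S2) & M2)];
        return data[b2 + (c & M3)];
    }
}
```

The data blocks are 32 entries rather than 128, so more of them coincide, at the cost of one extra array read per lookup.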
Inversion List • Compaction of set of codepoints • Advantages • Simple • Very compact • Faster write than trie • Very fast boolean operations • Disadvantages • Slower read than trie or hashtable
Inversion List Structure • Structure • Index (optional) • List of codepoints in ascending order • Example Set [ 0020-0061, 0135, 19A3-201B ]
    Index  Codepoint  Meaning
    0:     0020       in
    1:     0062       out
    2:     0135       in
    3:     0136       out
    4:     19A3       in
    5:     201C       out
Inversion List Example • Find smallest i such that c < data[i] • If no i, i = length • Then c ∈ List ↔ odd(i) • Examples: • In: 0023, 0135 • Out: 001A, 0136, A357 • Data (index: codepoint): 0: 0020, 1: 0062, 2: 0135, 3: 0136, 4: 19A3, 5: 201C
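The membership rule can be sketched with the standard library's binary search. The `InversionList` class name is illustrative; `EXAMPLE` is the set from the slides.

```java
import java.util.Arrays;

final class InversionList {
    /** The slide's set [0020-0061, 0135, 19A3-201B] as range boundaries. */
    static final int[] EXAMPLE = { 0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C };

    /** Finds the smallest i with c < list[i] (or list.length); c is in the set iff i is odd. */
    static boolean contains(int[] list, int c) {
        int pos = Arrays.binarySearch(list, c);
        // Exact hit at pos: the first larger entry is pos + 1.
        // Miss: binarySearch returns -(insertionPoint) - 1.
        int i = pos >= 0 ? pos + 1 : -pos - 1;
        return (i & 1) == 1;
    }
}
```

With `EXAMPLE`, `contains` is true for 0023 and 0135 and false for 001A, 0136 and A357, matching the slide.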
Inversion List Operations • Fast Boolean Operations • Example: Negation
    Index  Before      Index  After
    0:     0020    ➠   0:     0000
    1:     0062        1:     0020
    2:     0135        2:     0062
    3:     0136        3:     0135
    4:     19A3        4:     0136
    5:     201C        5:     19A3
                       6:     201C
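Negation only touches the front of the list: every "in" boundary becomes an "out" boundary once a leading 0000 is added or removed. A minimal sketch, with a made-up class name:

```java
import java.util.Arrays;

final class InversionListNegation {
    /** Returns the complement of an inversion list over the whole codepoint range. */
    static int[] complement(int[] list) {
        if (list.length > 0 && list[0] == 0) {
            // The set started at 0000; dropping that boundary flips every range.
            return Arrays.copyOfRange(list, 1, list.length);
        }
        int[] out = new int[list.length + 1];   // out[0] stays 0, the new first boundary
        System.arraycopy(list, 0, out, 1, list.length);
        return out;
    }
}
```

Applying `complement` twice gives back the original list.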
Inversion List: Binary Search • from Programming Pearls • Completely unrolled, precalculated parameters
    int index = startIndex;
    if (x >= data[auxStart]) {
        index += auxStart;
    }
    switch (power) {
    case 21: if (x < data[t = index-0x10000]) index = t;
    case 20: if (x < data[t = index-0x8000]) index = t;
    …
Inversion Map • Inversion List plus • Associated Values • Lookup index just as in Inversion List • Take corresponding value
    Codepoints (index: codepoint): 0: 0020, 1: 0062, 2: 0135, 3: 0136, 4: 19A3, 5: 201C
    Values (index: value):         0: 0, 1: 5, 2: 3, 3: 9, 4: 8, 5: 3, 6: 0
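A sketch of the inversion map lookup, reusing the inversion list search and then indexing a parallel value array. The class name is illustrative, and the value array is read off the slide as 0, 5, 3, 9, 8, 3, 0, which is an interpretation of the flattened figure.

```java
import java.util.Arrays;

final class InversionMap {
    final int[] starts;   // range boundaries, ascending
    final int[] values;   // one value per range: values.length == starts.length + 1

    InversionMap(int[] starts, int[] values) {
        if (values.length != starts.length + 1) {
            throw new IllegalArgumentException("need one more value than boundaries");
        }
        this.starts = starts;
        this.values = values;
    }

    /** Finds the smallest i with c < starts[i] (or starts.length) and returns values[i]. */
    int get(int c) {
        int pos = Arrays.binarySearch(starts, c);
        int i = pos >= 0 ? pos + 1 : -pos - 1;
        return values[i];
    }
}
```

Adjacent codepoints with the same value share one entry, so the map stays small whenever values come in runs.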
Key ➠ String Value • Problem • Often almost all values are 1 codepoint • But, must map to strings in a few cases • Don’t want overhead for strings always • Solution • Exception values indicate extra processing • Can use same solution for UTF-16 code units
Example • Get a character ch • Find its value v • If v is in [D800..E000], may be string • check v2 = valueException[v - D800] • if v2 not null, process it, continue • Process v
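The steps above can be sketched with a toy uppercase mapping. Everything here is an assumption for illustration: the table covers only keys below 0100, a plain array stands in for the trie, and the single exception is the sharp s (00DF), whose uppercase form is the two-letter string "SS". Values in D800 to DFFF are surrogates, which never occur as real one-character results, so they are free to act as markers.

```java
final class ExceptionValueMap {
    static final int EXC_START = 0xD800, EXC_END = 0xE000; // marker range for exceptions

    final char[] values = new char[0x100];                  // toy table: keys below 0100 only
    final String[] valueException = new String[EXC_END - EXC_START];

    ExceptionValueMap() {
        for (char c = 0; c < values.length; c++) {
            values[c] = Character.toUpperCase(c);           // ordinary one-codepoint values
        }
        values[0xDF] = (char) EXC_START;                    // marker: see valueException[0]
        valueException[0] = "SS";
    }

    /** Maps ch (which must be below 0100 in this toy table) to its value as a string. */
    String map(char ch) {
        char v = values[ch];
        if (v >= EXC_START && v < EXC_END) {                // may be a string
            String v2 = valueException[v - EXC_START];
            if (v2 != null) {
                return v2;
            }
        }
        return String.valueOf(v);                           // the common case
    }
}
```

Only the rare string-valued keys pay for a String object; every other entry is a single char.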
String Key ➠ Value • Problem • Often almost all keys are 1 codepoint • Must have string keys in a few cases • Don’t want overhead for strings always • Solution • Exception values indicate possible follow-on codepoints • Can use same solution for UTF-16 code units • Use key closure!
Closure • If (X + Y) is a key, then X is a key
    Before          After
    s ➠ x           s ➠ x
    sh ➠ y          sh ➠ y
    shch ➠ z        shch ➠ z
    c ➠ w           c ➠ w
                    shc ➠ yw
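Why Closure? • [Figure: matching the input “s h c h a …” one codepoint at a time: s ➠ x, sh ➠ y, shc ➠ yw, shch ➠ z, then shcha is not found, so use the last value found.]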
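A sketch of why closure matters, using the table from the slides. The matcher extends the candidate key one char at a time and stops at the first prefix that is not a key; that early stop is only safe when every prefix of a key is itself a key. The class name and the use of a `HashMap` are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class ClosureLookup {
    /** The slide's table; closed = true adds the extra key shc required by closure. */
    static Map<String, String> table(boolean closed) {
        Map<String, String> t = new HashMap<>();
        t.put("s", "x");
        t.put("sh", "y");
        t.put("shch", "z");
        t.put("c", "w");
        if (closed) {
            t.put("shc", "yw");
        }
        return t;
    }

    /** Returns the longest run starting at start whose every prefix is a key. */
    static String longestMatch(Map<String, String> t, String text, int start) {
        int end = start;
        while (end < text.length() && t.containsKey(text.substring(start, end + 1))) {
            end++;
        }
        return text.substring(start, end);
    }
}
```

On the input "shcha", the closed table matches "shch" (value z). Without the shc entry the scan stops at "sh" and never reaches the longer key.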
Bitpacking • Squeeze information into value • Example: Character Properties • category: 5 bits • bidi: 4 bits (+ exceptions) • canonical category: 6 bits + expansion
    compressCanon = (bits >> SHIFT) & MASK;
    canon = expansionArray[compressCanon];
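A sketch of the packing and unpacking. The field order (category in the low bits, then bidi, then the compressed canonical index) and all names other than `compressCanon` and `expansionArray` are assumptions; the slides give only the field widths.

```java
final class PackedProps {
    static final int CATEGORY_MASK = 0x1F;               // 5 bits, at bit 0
    static final int BIDI_SHIFT = 5, BIDI_MASK = 0xF;    // 4 bits, at bit 5
    static final int CANON_SHIFT = 9, CANON_MASK = 0x3F; // 6 bits, at bit 9

    static int pack(int category, int bidi, int compressCanon) {
        return (category & CATEGORY_MASK)
             | ((bidi & BIDI_MASK) << BIDI_SHIFT)
             | ((compressCanon & CANON_MASK) << CANON_SHIFT);
    }

    static int category(int bits) {
        return bits & CATEGORY_MASK;
    }

    static int bidi(int bits) {
        return (bits >> BIDI_SHIFT) & BIDI_MASK;
    }

    /** The 6-bit field is an index; the real value lives in the expansion array. */
    static int canon(int bits, int[] expansionArray) {
        int compressCanon = (bits >> CANON_SHIFT) & CANON_MASK;
        return expansionArray[compressCanon];
    }
}
```

Three properties fit in 15 bits this way, so a single short-sized trie value can carry all of them.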
Statetables • Classic:
    repeat
        entry = stateTable[ state, ch ];
        state = entry.state;
        doSomethingWith( entry.action );
    until (state < 0);
Statetables • Unicode:
    repeat
        type = trie[ch];
        entry = stateTable[ state, type ];
        state = entry.state;
        doSomethingWith( entry.action );
    until (state < 0);
• Also, String Key ➠ Value
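A sketch of the Unicode variant: each codepoint is first reduced to a small type, and the state table is indexed by that type instead of by the codepoint. This toy scanner finds the longest leading match of digits, optionally followed by a dot and more digits. The states, the types, and the use of `Character.getType` in place of a trie lookup are all assumptions, and the per-entry action is simplified to an accepting flag per state.

```java
final class NumberScanner {
    static final int DIGIT = 0, DOT = 1, OTHER = 2;   // character types (columns)

    // Rows are states: 0 start, 1 integer part, 2 just after dot, 3 fraction part.
    // A negative next state stops the scan.
    static final int[][] STATE_TABLE = {
        { 1, -1, -1 },
        { 1,  2, -1 },
        { 3, -1, -1 },
        { 3, -1, -1 },
    };
    static final boolean[] ACCEPTING = { false, true, false, true };

    /** Stands in for type = trie[ch]. */
    static int type(int cp) {
        if (Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER) {
            return DIGIT;
        }
        return cp == '.' ? DOT : OTHER;
    }

    /** Returns the length in chars of the longest numeric prefix of s. */
    static int matchLength(String s) {
        int state = 0, i = 0, lastAccept = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            state = STATE_TABLE[state][type(cp)];
            if (state < 0) {
                break;
            }
            i += Character.charCount(cp);
            if (ACCEPTING[state]) {
                lastAccept = i;        // remember the last complete match
            }
        }
        return lastAccept;
    }
}
```

The table has three columns instead of one per codepoint, and any Nd digit in any script takes the same path through it.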
Sample Data Structures: ICU • Trie: CompactArray • Customized for each datatype • Automatic expansion • Compact after setting • Character Properties • use CompactArray, Bitpacking • Inversion List: UnicodeSet • Boolean Operations
Sample Usage #1: ICU • Collation • Trie lookup • Expanding character: String Key ➠ Value • Contracting character: Key ➠ String Value • Break Iterators • For grapheme, word, line, sentence break • Statetable
Sample Usage #2: ICU • Transliteration • Requires • Mapping codepoints in context to others • Rearranging codepoints • Controlling the choice of mapping • Character Properties • Inversion List • Exception values
Sample Usage #3: ICU • Character Conversion • From Unicode to bytes • Trie • From bytes to Unicode • Arrays for simple maps • Statetables for complex maps • recognizes valid / invalid mappings • provides compaction • Complications • Invalid vs. Valid mapped vs. Valid unmapped • Fallbacks
References • Unicode Open Source: ICU • http://oss.software.ibm.com/icu • ICU4J: Java API • ICU4C: C and C++ APIs • Other references: see Mark’s website • http://www.macchiato.com