Michael S. Kaplan
Software Design Engineer
Trigeminal Software, Inc.
Surrogate Support in Microsoft Products
Surrogate Support in Microsoft Products, IUC 18 (Hong Kong)

What are surrogates?
• "a coded character representation for a single abstract character that consists of a sequence of two code units, where the first unit of the pair is a high surrogate and the second is a low surrogate"

High/low surrogate?
• High: U+D800 - U+DBFF
• Low: U+DC00 - U+DFFF
• Terminology:
  • "surrogate pair" preferred over "surrogate character"

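The two ranges above are enough to classify any 16-bit code unit. A minimal sketch follows; the function names are illustrative only and are not taken from any Microsoft API.

```python
def is_high_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit lies in U+D800..U+DBFF."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit lies in U+DC00..U+DFFF."""
    return 0xDC00 <= unit <= 0xDFFF


def is_surrogate_pair(first: int, second: int) -> bool:
    """True only for a high surrogate immediately followed by a low surrogate."""
    return is_high_surrogate(first) and is_low_surrogate(second)
```

Order matters: a low surrogate followed by a high surrogate is not a pair, so `is_surrogate_pair(0xDC0A, 0xD841)` is false.
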
Conversion example #1
• Example #1:
• The first character in the surrogate range (D800, DC00) as UTF-32:
  1. D800: binary 1101100000000000 (lower ten bits: 0000000000)
  2. DC00: binary 1101110000000000 (lower ten bits: 0000000000)
  3. Concatenate 0000000000 + 0000000000 = x00000
  4. Add x10000
• Result: U+10000. This makes sense, since the first character encoded with a surrogate pair follows immediately after the last character in the 16-bit Unicode range (U+FFFF).

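The four steps above can be sketched in a few lines of Python. The function name `decode_surrogate_pair` is hypothetical and chosen only for this illustration.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one code point (UTF-32 value)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    high_bits = high & 0x3FF          # steps 1-2: keep the lower ten bits of each unit
    low_bits = low & 0x3FF
    combined = (high_bits << 10) | low_bits   # step 3: concatenate into 20 bits
    return combined + 0x10000                 # step 4: add x10000


print(f"U+{decode_surrogate_pair(0xD800, 0xDC00):04X}")  # U+10000
```

The same function maps the last possible pair (DBFF, DFFF) to U+10FFFF, the top of the Unicode code space.
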
Conversion example #2
• Example #2:
• You have a Unicode character such as U+2040A (a CJK character in Plane 2) and wish to encode it in UTF-16:
  1. Subtract x10000. Result: 1040A
  2. Split into two ten-bit pieces: 0001000001 0000001010
  3. Add 1101100000000000 (D800) to the high ten-bit piece (0001000001). Result: 1101100001000001 (D841)
  4. Add 1101110000000000 (DC00) to the low ten-bit piece (0000001010). Result: 1101110000001010 (DC0A)
• Your surrogate pair: D841, DC0A

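The reverse direction can be sketched the same way. As before, `encode_surrogate_pair` is a hypothetical name used only for this example.

```python
from typing import Tuple


def encode_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = code_point - 0x10000        # step 1: subtract x10000
    high = 0xD800 + (offset >> 10)       # steps 2-3: upper ten bits added to D800
    low = 0xDC00 + (offset & 0x3FF)      # steps 2, 4: lower ten bits added to DC00
    return high, low


high, low = encode_surrogate_pair(0x2040A)
print(f"{high:04X}, {low:04X}")  # D841, DC0A
```
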
UTF-8 conversions
• Illegal conversions: six-byte UTF-8 (two surrogate code points of UTF-16, converted separately)
• Legal conversions: four-byte UTF-8 (one UTF-32 code point)

UTF-8 example
• Unicode surrogate pair: aaaabbbbbbcccccc, zzzzyyyyyyxxxxxx
• becomes incorrect UTF-8 totaling 6 bytes: 1110aaaa 10bbbbbb 10cccccc 1110zzzz 10yyyyyy 10xxxxxx
• Instead, you should take a Unicode surrogate pair: 110110wwwwzzzzyy, 110111yyyyxxxxxx
• and convert it to UTF-8 totaling 4 bytes (below, uuuuu = wwww + 1): 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx

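To make the difference concrete, here is a minimal sketch that produces both forms for the pair D841, DC0A (U+2040A). Both function names are hypothetical; the six-byte form is shown only to illustrate the mistake.

```python
def surrogate_pair_to_utf8(high: int, low: int) -> bytes:
    """Correct conversion: combine the pair first, then emit four UTF-8 bytes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    cp = 0x10000 + ((high & 0x3FF) << 10) + (low & 0x3FF)
    return bytes([
        0xF0 | (cp >> 18),            # 11110uuu
        0x80 | ((cp >> 12) & 0x3F),   # 10uuzzzz
        0x80 | ((cp >> 6) & 0x3F),    # 10yyyyyy
        0x80 | (cp & 0x3F),           # 10xxxxxx
    ])


def incorrect_six_byte_form(high: int, low: int) -> bytes:
    """Incorrect conversion: each surrogate is encoded separately as three bytes."""
    out = bytearray()
    for unit in (high, low):
        out += bytes([
            0xE0 | (unit >> 12),
            0x80 | ((unit >> 6) & 0x3F),
            0x80 | (unit & 0x3F),
        ])
    return bytes(out)


print(surrogate_pair_to_utf8(0xD841, 0xDC0A).hex())   # f0a0908a
print(incorrect_six_byte_form(0xD841, 0xDC0A).hex())  # eda181edb08a
```

The four-byte result matches what Python's own encoder produces for `chr(0x2040A)`, while Python's strict UTF-8 decoder rejects the six-byte sequence.
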
Encoding choices for MS
• UTF-16, mostly
• Occasionally UTF-8
• Even more occasionally, UTF-32
REASONS:
• There was already a well-tested set of APIs that support UCS-2, which is a subset of UTF-16.
• A completely new API set was not required.
• A move to UTF-32 would require twice as much space for every BMP character.
• A move to UTF-8 would require more space in many cases (three bytes instead of two for most East Asian text).

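The space trade-off can be checked directly with Python's built-in codecs. This is only a quick measurement sketch; the helper name `encoded_sizes` and the sample characters are chosen for illustration.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes for each encoding form (no BOM)."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}


samples = [
    ("ASCII letter A", "A"),
    ("BMP ideograph U+4E2D", "\u4e2d"),
    ("supplementary U+2040A", "\U0002040A"),
]
for label, sample in samples:
    print(label, encoded_sizes(sample))
```

The ASCII letter takes 1, 2, and 4 bytes in UTF-8, UTF-16, and UTF-32 respectively; the BMP ideograph takes 3, 2, and 4; the supplementary character takes 4 bytes in all three.
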
The products...
• Mostly the new generation of products:
  • Windows 2000/XP
  • Office XP (some support in Office 2000)
• Most of these products supported Unicode already
  • a little bit of extra work needed for surrogate pairs
  • usually just UTF-8 support needed

Windows 2000/XP
• Uniscribe/GDI+ support for rendering
• Each surrogate pair is a single grapheme
• APIs like CharPrev/CharNext not changed
• Extensions to fallback fonts in XP
• Font CMAP extensions in XP
• Lots of UTF-8 issues fixed in XP
• No specific surrogate font/IME (yet)

Collation for Supplementary characters
• All Plane 1 (non-ideographic) characters sort after all the other non-ideographic scripts but before the ideographs.
• All Plane 2 (ideographic) characters will be sorted after all the ideographs on the BMP.
• All Plane 3-14 characters (currently not assigned) will be treated like any other unassigned characters (this includes the Plane 14 language tags).
• All characters encoded in Plane 15-16 (private use) will be sorted after all other characters.

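Since each rule above depends only on the plane, the plane number (the code point shifted right by 16 bits) selects the rule. The sketch below merely restates the slide's rules as a lookup; the function names and returned strings are illustrative and do not represent any actual Windows collation code.

```python
def plane_of(code_point: int) -> int:
    """Plane number: 0 is the BMP, 1-16 are the supplementary planes."""
    return code_point >> 16


def supplementary_collation_rule(code_point: int) -> str:
    """Describe which of the slide's rules applies to a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    plane = plane_of(code_point)
    if plane == 1:
        return "after other non-ideographic scripts, before ideographs"
    if plane == 2:
        return "after all BMP ideographs"
    if plane <= 14:
        return "treated like unassigned characters"
    return "after all other characters (private use)"


print(plane_of(0x2040A), supplementary_collation_rule(0x2040A))
```
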
Other system components
• MLang
• Internet Explorer
• IIS 5.0/6.0

The downlevel story
• No good support for Unicode, let alone supplementary characters
• Uniscribe/RichEdit do improve the downlevel story, at least for display purposes
• Officially, no surrogate support on Win9x

The Office suite
• Word
• FrontPage
• Excel/Access
• Outlook
• RichEdit 4.0

Specific Features
• Insertion/Deletion of text - All
• Cursor movement - All
• Font linking/fallback - All (Word's is best)
• UTF-8 issues fixed - All
• Enhanced word breaking - All (Word/RichEdit)
• Vertical text - Word/PowerPoint/Publisher/RichEdit
• Direct entry (Alt+nnnnnn, hhhhh + Alt+x) - Word/RichEdit

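Cursor movement and deletion both come down to never stopping between the two halves of a surrogate pair. The sketch below illustrates that idea over a list of UTF-16 code units; it is an assumption about the general technique, not a description of how Word or RichEdit implement it, and the function names are hypothetical.

```python
def next_position(units: list, index: int) -> int:
    """Move the caret one character forward, skipping a whole surrogate pair."""
    if index >= len(units):
        return len(units)
    if (0xD800 <= units[index] <= 0xDBFF
            and index + 1 < len(units)
            and 0xDC00 <= units[index + 1] <= 0xDFFF):
        return index + 2
    return index + 1


def prev_position(units: list, index: int) -> int:
    """Move the caret one character backward, skipping a whole surrogate pair."""
    if index <= 0:
        return 0
    if (index >= 2
            and 0xDC00 <= units[index - 1] <= 0xDFFF
            and 0xD800 <= units[index - 2] <= 0xDBFF):
        return index - 2
    return index - 1


# "A", U+2040A as the pair D841 DC0A, then "B"
units = [0x0041, 0xD841, 0xDC0A, 0x0042]
print(next_position(units, 1))  # 3
print(prev_position(units, 3))  # 1
```

Deleting the character at the caret then means removing the units between `index` and `next_position(units, index)`, so a pair is always removed as a whole.
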
CHS/CHT/CHP Office
• The product and the langpacks support an extended Unicode IME that handles supplementary characters
• An Extension B font is also included

Visual Studio[.NET]
• String class and globalization namespace
  • StringInfo
  • GetTextElementEnumerator
    • Handles supplementary characters
    • Also handles composite characters
• GDI+
• IDE support

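The idea of a text element, where a surrogate pair or a base character plus its combining marks counts as one unit, can be approximated outside .NET. The Python sketch below is a rough analogue written for illustration; it is not the .NET algorithm and `text_elements` is a hypothetical name. Python strings are already indexed by code point, so supplementary characters stay whole automatically, and only combining marks need grouping.

```python
import unicodedata


def text_elements(text: str) -> list:
    """Group each base character with the combining marks that follow it."""
    elements = []
    for ch in text:
        # General categories Mn, Mc, and Me are combining marks.
        if elements and unicodedata.category(ch).startswith("M"):
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements


# "a" + combining acute accent, the supplementary character U+2040A, then "b"
sample = "a\u0301\U0002040Ab"
print(len(sample.encode("utf-16-le")) // 2)  # 5 UTF-16 code units
print(len(text_elements(sample)))            # 3 text elements
```
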
SQL Server
• Past - no support
• Present - surrogate "safe" (neutral)
• Future - surrogate aware

Items not supported
• Character Map
• Graph 10
• Outlook 10 mail headers
• Collations for supplementary characters
• Fonts/IMEs

Questions?