
Supplementary Character Support in Microsoft Products

Michael S. Kaplan, Globalization Infrastructure and Font Technology, Windows International, Microsoft.


  1. Michael S. Kaplan
     Globalization Infrastructure and Font Technology
     Windows International, Microsoft
     Supplementary Character Support in Microsoft Products

  2. What are supplementary characters?
     • "a coded character representation for a single abstract character that consists of a sequence of two code units, where the first unit of the pair is a high surrogate and the second is a low surrogate"
     Prague, Czech Republic (IUC23)

  3. High/low surrogate?
     • High: U+D800 - U+DBFF
     • Low: U+DC00 - U+DFFF
     • Terminology:
       • "surrogate pair" preferred over "surrogate character"
     • See http://www.trigeminal.com/16to32AndBack.asp
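
The two ranges above can be checked with simple comparisons. This is a minimal sketch, not code from the presentation; the function names are made up for illustration.

```python
# Ranges taken from the slide; all names here are illustrative only.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def is_high_surrogate(unit: int) -> bool:
    """True if a UTF-16 code unit is in the high surrogate range."""
    return HIGH_START <= unit <= HIGH_END


def is_low_surrogate(unit: int) -> bool:
    """True if a UTF-16 code unit is in the low surrogate range."""
    return LOW_START <= unit <= LOW_END


def is_surrogate_pair(first: int, second: int) -> bool:
    """A valid pair is a high surrogate followed by a low surrogate."""
    return is_high_surrogate(first) and is_low_surrogate(second)
```

Order matters: a low surrogate followed by a high surrogate is not a pair.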

  4. Conversion example #1
     • Example #1: the first character in the surrogate range (D800, DC00) as UTF-32:
       1. D800: binary 1101100000000000 (lower ten bits: 0000000000)
       2. DC00: binary 1101110000000000 (lower ten bits: 0000000000)
       3. Concatenate 0000000000+0000000000 = x00000
       4. Add x10000
     • Result: U+10000. This makes sense, since the first character encoded as a surrogate pair follows immediately after the last character in the 16-bit Unicode range (U+FFFF).
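
The four steps of example #1 can be sketched in a few lines. The function name is hypothetical and the range checks are an addition that the slide does not mention.

```python
def surrogate_pair_to_code_point(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one code point (UTF-32 value)."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first unit is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second unit is not a low surrogate")
    high_bits = high & 0x3FF   # steps 1-2: keep the lower ten bits of each unit
    low_bits = low & 0x3FF
    combined = (high_bits << 10) | low_bits   # step 3: concatenate to 20 bits
    return combined + 0x10000                 # step 4: add x10000


print(hex(surrogate_pair_to_code_point(0xD800, 0xDC00)))  # 0x10000
```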

  5. Conversion example #2
     • Example #2: you have a Unicode character such as U+2040A (a CJK character in Plane 2) and wish to encode it in UTF-16:
       1. Subtract x10000. Result: x1040A
       2. Split into two ten-bit pieces: 0001000001 0000001010
       3. Add 1101100000000000 (D800) to the high ten-bit piece (0001000001). Result: 1101100001000001 (D841)
       4. Add 1101110000000000 (DC00) to the low ten-bit piece (0000001010). Result: 1101110000001010 (DC0A)
     • Your surrogate pair: D841, DC0A
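
The reverse direction from example #2 can be sketched the same way. The function name is an assumption made for illustration.

```python
from typing import Tuple


def code_point_to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # step 1: subtract x10000
    high_bits = offset >> 10             # step 2: upper ten bits
    low_bits = offset & 0x3FF            #         lower ten bits
    return 0xD800 + high_bits, 0xDC00 + low_bits   # steps 3-4


high, low = code_point_to_surrogate_pair(0x2040A)
print(f"{high:04X}, {low:04X}")  # D841, DC0A
```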

  6. UTF-8 conversions
     • Illegal conversions: six-byte UTF-8 (two surrogate code points of UTF-16, converted separately)
     • Legal conversions: four-byte UTF-8 (one UTF-32 code point)
     • CESU-8 is the inverse of the above: it uses the six-byte form

  7. UTF-8 example
     • Unicode surrogate pair: aaaabbbbbbcccccc, zzzzyyyyyyxxxxxx
     • becomes incorrect UTF-8 totaling 6 bytes: 1110aaaa 10bbbbbb 10cccccc 1110zzzz 10yyyyyy 10xxxxxx
     • Instead, you should take a Unicode surrogate pair: 110110wwwwzzzzyy, 110111yyyyxxxxxx
     • and convert it to UTF-8 totaling 4 bytes (below, uuuuu is defined as wwww+1): 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
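
The two bit layouts above can be compared side by side. The helper names in this sketch are hypothetical, and the six-byte function exists only to show what the incorrect form looks like.

```python
def utf8_from_surrogate_pair(high: int, low: int) -> bytes:
    """Correct form: combine the pair first, then emit four UTF-8 bytes."""
    code_point = (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000
    return bytes([
        0xF0 | (code_point >> 18),           # 11110uuu
        0x80 | ((code_point >> 12) & 0x3F),  # 10uuzzzz
        0x80 | ((code_point >> 6) & 0x3F),   # 10yyyyyy
        0x80 | (code_point & 0x3F),          # 10xxxxxx
    ])


def six_byte_form(high: int, low: int) -> bytes:
    """Incorrect as UTF-8: each surrogate converted separately (3 + 3 bytes)."""
    def three_bytes(unit: int) -> bytes:
        return bytes([
            0xE0 | (unit >> 12),
            0x80 | ((unit >> 6) & 0x3F),
            0x80 | (unit & 0x3F),
        ])
    return three_bytes(high) + three_bytes(low)


print(utf8_from_surrogate_pair(0xD841, 0xDC0A).hex())  # f0a0908a
print(len(six_byte_form(0xD841, 0xDC0A)))              # 6
```

Python's own UTF-8 codec rejects lone surrogates by default, which is consistent with the slide calling the six-byte form illegal.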

  8. Encoding choices for MS
     • UTF-16, mostly
     • Occasionally UTF-8
     • Even more occasionally, UTF-32
     REASONS:
     • There was already an existing, well-tested set of APIs that support UCS-2, which is a subset of UTF-16.
     • A completely new API set was not required.
     • A move to UTF-32 would require twice as much space for all BMP characters.
     • A move to UTF-8 would require up to 50% more space in many cases (three bytes instead of two for BMP characters at U+0800 and above, including all CJK text).
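
The space trade-off between the three encodings can be measured directly for any sample text. This is a small sketch using Python's built-in codecs; the function name and the sample characters are choices made for illustration.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte count of text in each encoding (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# An ASCII letter, a BMP ideograph, and a Plane 2 ideograph.
for sample in ("A", "\u4e2d", "\U0002040A"):
    print(f"U+{ord(sample):04X}", encoded_sizes(sample))
```

How much space each encoding costs in practice depends on the mix of characters in the text being stored.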

  9. The products...
     • Mostly the new generation of products:
       • Windows 2000/XP
       • Office XP (some support in Office 2000)
       • Visual Studio.Net
       • .NET’s Common Language Runtime (CLR)
     • Most (all) of these products supported Unicode already
       • a little bit of extra work needed for supplementary characters
       • usually just UTF-8 changes were needed

  10. Windows 2000
     • Uniscribe support for rendering
     • Each surrogate pair is a single grapheme
     • APIs like CharPrev/CharNext not changed
     • No specific surrogate font/IME
     • Must be turned on: http://msdn.microsoft.com/library/en-us/intl/unicode_192r.asp
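
Because CharPrev/CharNext were not changed, code that walks UTF-16 text one unit at a time has to step over surrogate pairs itself. The sketch below shows one way to do that. These helpers are hypothetical Python stand-ins, not the Windows APIs and not their signatures.

```python
from typing import List


def utf16_units(text: str) -> List[int]:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def _is_high(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def _is_low(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def next_index(units: List[int], index: int) -> int:
    """Index of the next character, treating a surrogate pair as one step."""
    if index >= len(units):
        return len(units)
    if _is_high(units[index]) and index + 1 < len(units) and _is_low(units[index + 1]):
        return index + 2
    return index + 1


def prev_index(units: List[int], index: int) -> int:
    """Index of the previous character, treating a surrogate pair as one step."""
    if index <= 0:
        return 0
    if index >= 2 and _is_low(units[index - 1]) and _is_high(units[index - 2]):
        return index - 2
    return index - 1
```

An unpaired surrogate is stepped over as a single unit here; a real application may prefer to flag it as an error.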

  11. Windows XP
     • *.* from Windows 2000
     • Turned on by default!
     • GDI+ support for rendering
     • Font CMAP extensions
     • Lots of UTF-8 issues fixed
     • No specific surrogate font/IME (yet)
     • Extensions to fallback fonts [limited]:
       HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane1
       HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane2
       HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane3
       (etc.)

  12. Other system components
     • MLang
     • Internet Explorer http://i18nWithVB.com/surrogate_ime/
     • IIS 5.0/6.0

  13. The downlevel story
     • No good support for Unicode, let alone supplementary characters
     • Uniscribe/RichEdit does improve the downlevel story for display purposes
     • Officially, no support on Win9x

  14. The Office suite
     • Word
     • Frontpage
     • Excel/Access
     • Outlook
     • RichEdit 4.0

  15. Office - Specific Features
     • Insertion/Deletion of text - All
     • Cursor movement - All
     • Font linking/fallback - All (Word's is best)
     • UTF-8 issues fixed - All
     • Enhanced word breaking - All (Word/RichEdit)
     • Vertical text - Word/PowerPoint/Publisher/RichEdit
     • Direct entry (Alt+nnnnnn, hhhhh + Alt+x) - Word/RichEdit

  16. CHS/CHT/CHP Office
     • The product and the langpacks support an extended Unicode IME that handles supplementary characters
     • An Extension B font is also included

  17. .NET CLR/Visual Studio.NET
     • String class and globalization namespace
       • StringInfo
       • GetTextElementEnumerator
         • Handles supplementary characters
         • Also handles composite characters
     • GDI+
     • VS IDE support
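
The idea behind text-element enumeration can be sketched in Python. This is a rough analogue only, not the .NET API: the function name is hypothetical, and treating categories Mn, Mc, and Me as "attaches to the previous character" is an assumption about what counts as a composite character. Python strings are already sequences of code points, so a supplementary character arrives as a single item rather than as a surrogate pair.

```python
import unicodedata
from typing import List

# Assumption: marks in these general categories join the preceding base character.
COMBINING_CATEGORIES = ("Mn", "Mc", "Me")


def text_elements(text: str) -> List[str]:
    """Group each base character with the combining marks that follow it."""
    elements: List[str] = []
    for ch in text:
        if elements and unicodedata.category(ch) in COMBINING_CATEGORIES:
            elements[-1] += ch      # attach the mark to the previous element
        else:
            elements.append(ch)     # start a new text element
    return elements


# "e" + combining acute accent, a Plane 2 ideograph, and "x": three elements.
print(len(text_elements("e\u0301\U0002040Ax")))  # 3
```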

  18. SQL Server
     • Past - no support (for Unicode, even!)
     • Present - surrogate "safe" (neutral)
     • Future - surrogate "aware"

  19. Items not [currently] supported
     • Character Map
     • Graph 10
     • Outlook 10 mail headers
     • Fonts/IMEs
     • "Collations" for supplementary characters

  20. Collation plan for supplementary characters in the UCA?
     • All Plane 1 (non-ideographic) characters sort after all the other non-ideographic scripts but before the ideographs.
     • All Plane 2 (ideographic) characters will be sorted after all the ideographs on the BMP.
     • All Plane 3-14 (currently not assigned) will be treated like any other unassigned characters.
     • Plane 14 language tags will be treated as if they were unassigned.
     • All characters encoded in Plane 15-16 (private use) will be sorted after all other characters.
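
The ordering described in this plan can be sketched as a coarse sort key. This illustrates the plane-level ordering only and is not the UCA itself. Two parts are assumptions that the slide does not state: which BMP ranges count as ideographs (only CJK Unified Ideographs and Extension A are used here), and where unassigned characters fall relative to the other groups.

```python
def plan_group(code_point: int) -> int:
    """Return a coarse sort group for a code point, following the plan above."""
    plane = code_point >> 16
    if plane == 0:
        # Assumption: only these two BMP blocks are treated as ideographs.
        if 0x3400 <= code_point <= 0x4DBF or 0x4E00 <= code_point <= 0x9FFF:
            return 2   # BMP ideographs
        return 0       # other BMP characters
    if plane == 1:
        return 1       # after non-ideographic scripts, before ideographs
    if plane == 2:
        return 3       # after the BMP ideographs
    if plane <= 14:
        return 4       # unassigned; this position is an assumption
    return 5           # Planes 15-16 (private use): after everything else


samples = ["\U000F0000", "\u4e2d", "\U0002040A", "\U00010400", "a"]
ordered = sorted(samples, key=lambda ch: plan_group(ord(ch)))
print([f"U+{ord(ch):04X}" for ch in ordered])
# ['U+0061', 'U+10400', 'U+4E2D', 'U+2040A', 'U+F0000']
```

A real collation would also need an order within each group; this sketch leaves characters in the same group in their input order.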

  21. Questions?

  22. Supplementary Character Support in Microsoft Products
     Don’t forget to fill out your evals!
