Compact Encodings of Unicode

Presentation Transcript


  1. Compact Encodings of Unicode
     Markus W. Scherer
     Unicode/G11N Software Engineer
     IBM Globalization Center of Competency

  2. Agenda
     • Encodings in files and protocols
       • Not: Processing encoding forms
     • Unicode “is too big”
       • Issues and non-issues
     • How to reduce size of Unicode text
       • Choice of encoding
       • Optional compression
     • Examples and comparisons
     22nd International Unicode Conference

  3. What is ICU?
     • Internationalization libraries for C, C++, Java*
     • Open source – non-viral
     • Sponsored by IBM
     • Sun’s Java licenses an earlier ICU version; ICU4J updates it.
     • Unicode standard compliant
       • Full supplementary support
     • Cross-platform; extensible and customizable
     • High performance and thread-safe
       • Multiple locales in same thread – simultaneously
     • Converters for all Unicode charsets & hundreds of legacy codepages
     • http://oss.software.ibm.com/icu/

  4. Encodings of Unicode
     • Common Unicode character set
     • External encodings
       • Files and protocols
       • Almost always byte-serialized
       • Character Encoding Schemes/charsets
     • Processing encodings
       • Character Encoding Forms, often 16/32-bit
       • Different requirements
       • Topic for different presentation…

  5. Unicode “is too big”?
     • Perceived large size of Unicode text
       • Compared with legacy codepages
     • Size matters
       • Low-speed connections (dial-up, mobile)
       • Little memory (PDA, cell phone, embedded)
     • Size does not matter when…
       • Images & other binaries swamp text size
       • High-speed network
       • Temporary documents
       • Large amounts of memory

  6. How big is it? • Size depends on language/script • Bytes/char for some language groups:

  7. Legacy codepages
     • Compact because
       • Designed for single/few languages
       • Few characters compared with Unicode
     • Conversion problems
       • Fallback/substitution of unmappable chars
       • Mapping table differences
       • Loss of parts of text common
       • Large number/size of mapping tables

  8. Reduce Unicode text size
     • Choice of encoding
       • Encodings designed for different purposes
       • Compactness vs. direct applicability vs. software support etc.
     • General-purpose compression
       • Best on top of compact encoding
       • Not available in all applications

  9. UTF-8/16
     • Designed for processing but all-purpose
     • UTF-8:
       • Byte-based, ASCII-compatible
       • BMP: up to 3 bytes/char
     • UTF-16 (BE/LE):
       • Byte-serialization of 16-bit form, not ASCII-compatible
       • BE/LE forms or Byte Order Mark
       • BMP: always 2 bytes/char
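
The bytes/char figures above can be checked with a minimal sketch using Python's built-in codecs. The sample strings and the helper name `bytes_per_char` are illustrative assumptions, not part of the presentation.

```python
def bytes_per_char(text: str, encoding: str) -> float:
    """Average number of encoded bytes per Unicode code point."""
    return len(text.encode(encoding)) / len(text)


# Illustrative samples only (all BMP characters).
samples = {
    "English": "Unicode",
    "Greek": "αβγδ",
    "Japanese": "日本語",
}

for name, text in samples.items():
    utf8 = bytes_per_char(text, "utf-8")
    # "utf-16-be" fixes the byte order, so no Byte Order Mark is written.
    utf16 = bytes_per_char(text, "utf-16-be")
    print(f"{name}: UTF-8 {utf8:.2f}, UTF-16 {utf16:.2f} bytes/char")
```

ASCII text takes 1 byte/char in UTF-8, while the CJK sample takes 3; UTF-16 stays at 2 bytes/char for every BMP sample.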

  10. UTF-7
     • 7-bit encoding designed for email
       • Obsolete: email now 8-bit-safe
     • Partially ASCII-compatible
     • BMP: 2.67 bytes/char plus overhead
       • Base64-encoded UTF-16BE
     • Stateful
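
The 2.67 bytes/char figure follows from base64 carrying 6 bits per output byte, so a 16-bit unit needs 16/6 ≈ 2.67 bytes. A small sketch with Python's built-in `utf-7` codec shows this together with the shift overhead; the sample string is an assumption chosen for illustration.

```python
text = "日本語"  # illustrative sample, three BMP characters

encoded = text.encode("utf-7")

# Every output byte is 7-bit safe.
seven_bit = all(b < 0x80 for b in encoded)

# Base64 payload: 16 bits per UTF-16 unit, 6 bits per base64 byte.
payload_per_char = 16 / 6

print(encoded, len(encoded), seven_bit)
print(f"payload bytes/char: {payload_per_char:.2f}")
```

The bytes beyond the base64 payload are the shift-in and shift-out markers, which is the "plus overhead" and also why the encoding is stateful.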

  11. SCSU & BOCU-1
     • About as compact as legacy codepages
       • 1 byte/char for small scripts, 2 for CJK; stateful
       • Compress short strings better than LZW (zip) etc.
     • SCSU:
       • Limited* ASCII compatibility (initial state)
       • Complex state, many encoding choices
       • Indeterministic; arbitrary byte values
       • Established encoding, supported in
         • Various tools & editors (SC UniPad), ICU, Symbian OS (cell phones/PDAs)

  12. BOCU-1
     • BOCU-1:
       • Delta-encoding; avoids control codes
       • MIME text-compatible but not ASCII
       • Deterministic
       • Preserves binary order (for sorting, databases)
       • New encoding; supported by ICU
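
To illustrate why delta-encoding suits small scripts, here is a toy sketch that only computes differences between successive code points. It is not the BOCU-1 algorithm: the real encoding chooses its reference value and byte layout differently, and the names `code_point_deltas`, `share_small` and the `start` and `limit` values are assumptions made for this example.

```python
def code_point_deltas(text: str, start: int = 0x40) -> list:
    """Differences between successive code points (toy model, NOT BOCU-1)."""
    deltas = []
    prev = start  # arbitrary initial reference value for this sketch
    for ch in text:
        cp = ord(ch)
        deltas.append(cp - prev)
        prev = cp
    return deltas


def share_small(deltas: list, limit: int = 64) -> float:
    """Fraction of deltas small enough that one byte could plausibly hold them."""
    return sum(abs(d) < limit for d in deltas) / len(deltas)


for sample in ("abc", "αβγδ", "日本語"):
    d = code_point_deltas(sample)
    print(sample, d, f"{share_small(d):.2f}")
```

Within one small script the deltas after the first character stay small, while CJK text jumps across a large range, which matches the 1 byte/char versus 2 bytes/char split described for these encodings.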

  13. SCSU & BOCU-1 text sizes • Average bytes/char relative to UTF-8

  14. Encoding vs. compression • For example: BOCU-1 with WinZip
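
A rough sketch of how general-purpose compression behaves on short versus long text. It uses Python's `zlib` (DEFLATE) on UTF-8 as a stand-in, since neither WinZip nor a BOCU-1 codec is in the standard library; it does not reproduce the slide's measurements, and the sample strings are assumptions.

```python
import zlib


def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size (values above 1.0 mean growth)."""
    return len(zlib.compress(data)) / len(data)


short = "Compact encodings".encode("utf-8")
# Highly repetitive long text, chosen only to show the opposite extreme.
long_text = ("Compact encodings of Unicode. " * 200).encode("utf-8")

print(f"short: {len(short)} bytes, ratio {compressed_ratio(short):.2f}")
print(f"long:  {len(long_text)} bytes, ratio {compressed_ratio(long_text):.2f}")
```

On very short strings the fixed header and checksum outweigh any savings, which is the case where a compact encoding alone does better than a compressor.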

  15. Performance
     • Converter performance
       • Roundtrip to/from UTF-16 with ICU:
         • SCSU: 45%..125% of UTF-8 roundtrip time
         • BOCU-1: 40%..160% of UTF-8 roundtrip time
       • Depends on encoding ratio
       • Fast for small scripts, 1 byte/char
     • Separate compression adds to I/O time
     • Conversion time typically swamped by
       • Transmission (low-bandwidth connections)
         • Shorter texts transmit faster!
       • Parsing/processing

  16. Further considerations
     • In-document encoding declarations require ASCII readability (XML, HTML)
     • Protocol may limit byte values (SMTP)
     • TES (Transfer Encoding Syntax) required for some encodings
       • base64 for SCSU or UTF-16 in emails
       • Increases text size
     • Compression removes ASCII readability and uses arbitrary byte values
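
The size increase from a transfer encoding can be sketched with Python's `base64` module applied to UTF-16BE bytes. The helper name `tes_size` and the sample string are assumptions for illustration.

```python
import base64


def tes_size(text: str, encoding: str = "utf-16-be") -> tuple:
    """Return (raw byte count, base64-wrapped byte count) for the encoded text."""
    raw = text.encode(encoding)
    wrapped = base64.b64encode(raw)
    return len(raw), len(wrapped)


raw_len, wrapped_len = tes_size("日本語のテキスト")
print(raw_len, wrapped_len, f"{wrapped_len / raw_len:.2f}x")
```

Base64 emits 4 output bytes for every 3 input bytes, so the wrapped text grows by about a third, plus padding on short inputs.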

  17. Conclusion
     • UTF-8 and/or UTF-16 work in most cases
     • Size of text often not critical
     • When small text size needed:
       • Use SCSU or BOCU-1
       • Consider compression
       • Make sure receiver can handle it

  18. References
     • Forms of Unicode: http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/
     • Character Encoding Model: UTR #17 http://www.unicode.org/reports/tr17/
     • SCSU: UTS #6 http://www.unicode.org/reports/tr6/
     • BOCU-1: http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/conversion/bocu1/bocu1.html
     • ICU homepage: http://oss.software.ibm.com/icu/
     • Unicode Consortium: http://www.unicode.org/
     • IBM developerWorks: http://www.ibm.com/developerworks/unicode/
