Compact Encodings of Unicode Markus W. Scherer Unicode/G11N Software Engineer IBM Globalization Center of Competency
Agenda • Encodings in files and protocols • Not: Processing encoding forms • Unicode “is too big” • Issues and non-issues • How to reduce size of Unicode text • Choice of encoding • Optional compression • Examples and comparisons 22nd International Unicode Conference
What is ICU? • Internationalization libraries for C, C++, Java* • Open source – non-viral • Sponsored by IBM • Sun’s Java licenses an earlier ICU version; ICU4J updates it. • Unicode standard compliant • full supplementary support • Cross-platform; extensible and customizable • High performance and thread-safe • Multiple locales in same thread – simultaneously • Converters for all Unicode charsets & hundreds of legacy codepages • http://oss.software.ibm.com/icu/
Encodings of Unicode • Common Unicode character set • External encodings • Files and protocols • Almost always byte-serialized • Character Encoding Schemes/charsets • Processing encodings • Character Encoding Forms, often 16/32-bit • Different requirements • Topic for a different presentation…
Unicode “is too big”? • Perceived large size of Unicode text • Compared with legacy codepages • Size matters • Low-speed connections (dial-up, mobile) • Little memory (PDA, cell phone, embedded) • Size does not matter when… • Images & other binaries swamp text size • High-speed network • Temporary documents • Large amounts of memory
How big is it? • Size depends on language/script • Bytes/char for some language groups:
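To make the dependence on script concrete, here is a minimal Python sketch that measures average bytes per code point for a few sample strings. The sample strings and the `bytes_per_char` helper are illustrative assumptions; they are not the data behind the slide.

```python
# Sample strings are illustrative assumptions, not the slide's data.
SAMPLES = {
    "English": "Unicode text",
    "Russian": "Текст в Юникоде",
    "Greek": "Κείμενο Unicode",
    "Japanese": "ユニコードのテキスト",
}


def bytes_per_char(text, encoding):
    """Average number of encoded bytes per code point."""
    return len(text.encode(encoding)) / len(text)


for language, text in SAMPLES.items():
    utf8 = bytes_per_char(text, "utf-8")
    # The "-be" codec variant writes no byte order mark.
    utf16 = bytes_per_char(text, "utf-16-be")
    print(f"{language:10} UTF-8 {utf8:.2f}  UTF-16 {utf16:.2f}")
```

Short samples like these only hint at the trend; real averages depend on how much ASCII punctuation, markup, and whitespace a document contains.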
Legacy codepages • Compact because • Designed for a single language or a few languages • Few characters compared with Unicode • Conversion problems • Fallback/substitution of unmappable chars • Mapping table differences • Partial loss of text is common • Large number/size of mapping tables
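The substitution problem can be reproduced with Python's built-in codecs. This is a small sketch, not ICU behavior: the sample text is an assumption, and the second half shows a related hazard (the same byte decoded under two similar code pages) rather than a difference between two vendors' tables for one code page.

```python
text = "naïve café, 5 €"

# ISO-8859-1 has no euro sign, so a lenient encoder substitutes "?".
lossy = text.encode("iso-8859-1", errors="replace")
restored = lossy.decode("iso-8859-1")
print(restored)          # the euro sign has been lost
print(restored == text)  # the round trip is not lossless

# A strict encoder refuses instead of substituting.
try:
    text.encode("iso-8859-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The same byte means different characters in two similar code pages.
byte = b"\xa4"
as_latin1 = byte.decode("iso-8859-1")    # U+00A4 CURRENCY SIGN
as_latin9 = byte.decode("iso-8859-15")   # U+20AC EURO SIGN
print(as_latin1, as_latin9)
```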
Reduce Unicode text size • Choice of encoding • Encodings designed for different purposes • Compactness vs. direct applicability vs. software support etc. • General-purpose compression • Best on top of compact encoding • Not available in all applications
UTF-8/16 • Designed for processing but all-purpose • UTF-8: • Byte-based, ASCII-compatible • BMP: up to 3 bytes/char • UTF-16 (BE/LE): • Byte-serialization of 16-bit form, not ASCII-compatible • BE/LE forms or Byte Order Mark • BMP: always 2 bytes/char
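The byte-level differences listed above can be inspected directly with Python's standard codecs. This is a minimal sketch; the two-character sample string is an assumption chosen to show one ASCII and one non-ASCII BMP character.

```python
import codecs

text = "A€"  # U+0041 and U+20AC, both in the BMP

utf8 = text.encode("utf-8")         # ASCII byte stays as is; the euro sign takes 3 bytes
utf16_be = text.encode("utf-16-be")  # 2 bytes per BMP character, big-endian
utf16_le = text.encode("utf-16-le")  # same code units, bytes swapped

print(utf8.hex(" "), utf16_be.hex(" "), utf16_le.hex(" "), sep="\n")

# Without an explicit BE/LE label, a byte order mark tells the decoder
# which serialization was used.
with_bom_be = codecs.BOM_UTF16_BE + utf16_be
with_bom_le = codecs.BOM_UTF16_LE + utf16_le
assert with_bom_be.decode("utf-16") == text
assert with_bom_le.decode("utf-16") == text
```

Note that the UTF-16 forms contain zero bytes for ASCII characters, which is why they are not ASCII-compatible.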
UTF-7 • 7-bit encoding designed for email • Obsolete: email now 8-bit-safe • Partially ASCII-compatible • BMP: 2.67 bytes/char plus overhead • Base64-encoded UTF-16BE • Stateful
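The 2.67 bytes/char figure follows from base64 carrying 6 bits per output byte: a 16-bit code unit needs 16/6 ≈ 2.67 bytes, and the shift characters that open and close each base64 run are the extra overhead. The sketch below uses Python's built-in `utf-7` codec; the sample string is an assumption.

```python
# Each base64 character carries 6 bits, each UTF-16 code unit has 16 bits.
BYTES_PER_BMP_CHAR = 16 / 6
print(round(BYTES_PER_BMP_CHAR, 2))

text = "Hi 日本語"
encoded = text.encode("utf-7")
print(encoded)

# Every output byte fits in 7 bits, and the ASCII prefix is left readable.
all_seven_bit = all(byte < 0x80 for byte in encoded)

# The encoding is stateful: "+" switches into base64, and a decoder
# must track that state to recover the text.
decoded = encoded.decode("utf-7")
```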
SCSU & BOCU-1 • About as compact as legacy codepages • 1 byte/char for small scripts, 2 for CJK; stateful • Compress short strings better than LZW (zip) etc. • SCSU: • Limited* ASCII compatibility (initial state) • Complex state, many encoding choices • Non-deterministic; arbitrary byte values • Established encoding, supported in • Various tools & editors (SC UniPad), ICU, Symbian OS (cell phones/PDAs)
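The "1 byte/char for small scripts" claim rests on windowing: bytes below 0x80 stand for ASCII, and bytes from 0x80 up stand for offsets into a 128-character window placed over the script in use. The following is a toy sketch of that idea only. It is not SCSU: the function names are hypothetical, the window is fixed by the caller, and real SCSU also emits tag bytes to select and define windows and has a separate mode for large scripts.

```python
def toy_window_encode(text, window_base):
    """Encode ASCII as is and window characters as 0x80 + offset (toy model)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif window_base <= cp < window_base + 0x80:
            out.append(0x80 + (cp - window_base))
        else:
            # Real SCSU would switch windows or modes here.
            raise ValueError(f"U+{cp:04X} is outside the window")
    return bytes(out)


def toy_window_decode(data, window_base):
    """Inverse of toy_window_encode for the same window."""
    return "".join(
        chr(b) if b < 0x80 else chr(window_base + (b - 0x80)) for b in data
    )


CYRILLIC_BASE = 0x0400  # start of the Cyrillic block
sample = "Привет, мир"
packed = toy_window_encode(sample, CYRILLIC_BASE)
print(len(packed), len(sample.encode("utf-8")))
```

Because the encoder may choose among several windows and modes for the same text, two conforming SCSU encoders can produce different byte sequences, which is the non-determinism the slide mentions.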
BOCU-1 • BOCU-1: • Delta-encoding; avoids control codes • MIME text-compatible but not ASCII-compatible • Deterministic • Preserves binary order (for sorting, databases) • New encoding; supported by ICU
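Delta-encoding works because consecutive characters in running text usually come from the same script, so the difference between neighboring code points is small and fits in one byte. The sketch below only computes those differences. It is not BOCU-1: the helper name and the starting value are assumptions, and the real encoding adjusts the reference value and maps each difference to a carefully chosen byte sequence.

```python
def code_point_deltas(text, start=0):
    """Differences between successive code points (illustration only)."""
    previous = start
    deltas = []
    for ch in text:
        cp = ord(ch)
        deltas.append(cp - previous)
        previous = cp
    return deltas


sample = "привет"
deltas = code_point_deltas(sample)
print(deltas)

# After the first jump into the Cyrillic block, every step is small.
small_steps = all(abs(d) < 64 for d in deltas[1:])
print(small_steps)
```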
SCSU & BOCU-1 text sizes • Average bytes/char relative to UTF-8
Encoding vs. compression • For example: BOCU-1 with WinZip
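The interaction between encoding and compression can be explored with Python's `zlib` module. This is a sketch under stated assumptions: zlib stands in for WinZip, UTF-8 and UTF-16 stand in for the encodings compared on the slide (the Python standard library has no BOCU-1 codec), and the sample texts are invented.

```python
import zlib


def sizes(text):
    """Map encoding name to (raw size, zlib-compressed size) in bytes."""
    result = {}
    for encoding in ("utf-8", "utf-16-le"):
        raw = text.encode(encoding)
        result[encoding] = (len(raw), len(zlib.compress(raw, 9)))
    return result


short_text = "Привет"
long_text = "Привет, мир! " * 200

for label, text in (("short", short_text), ("long", long_text)):
    for encoding, (raw, packed) in sizes(text).items():
        print(f"{label:5} {encoding:10} raw={raw:5} compressed={packed:5}")
```

On the short string the compressor's fixed header and checksum outweigh any savings, which is the case where a compact encoding helps most; on the long, repetitive text general-purpose compression pays off regardless of encoding.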
Performance • Converter performance • Roundtrip to/from UTF-16 with ICU: • SCSU: 45%..125% of UTF-8 roundtrip time • BOCU-1: 40%..160% of UTF-8 roundtrip time • Depends on encoding ratio • Fast for small scripts, 1 byte/char • Separate compression adds to I/O time • Conversion time typically swamped by • Transmission (low-bandwidth connections) • Shorter texts transmit faster! • Parsing/processing
Further considerations • In-document encoding declarations require ASCII readability (XML, HTML) • Protocol may limit byte values (SMTP) • Transfer encoding syntax (TES) required for some encodings • base64 for SCSU or UTF-16 in emails • Increases text size • Compression removes ASCII readability and uses arbitrary byte values
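The size cost of a transfer encoding is easy to quantify: base64 turns every 3 input bytes into 4 output bytes. A minimal sketch with Python's `base64` module, using an invented sample string and UTF-16BE as the payload encoding:

```python
import base64

text = "Unicode テキスト"
raw = text.encode("utf-16-be")    # contains zero bytes and bytes >= 0x80
wrapped = base64.b64encode(raw)   # restricted to 7-bit-safe ASCII characters

print(len(raw), len(wrapped))     # base64 output is about 4/3 the input size

# The receiver reverses both layers to recover the text.
recovered = base64.b64decode(wrapped).decode("utf-16-be")
```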
Conclusion • UTF-8 and/or UTF-16 work in most cases • Size of text often not critical • When small text size is needed: • Use SCSU or BOCU-1 • Consider compression • Make sure the receiver can handle it
References • Forms of Unicode: http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/ • Character Encoding Model (UTR #17): http://www.unicode.org/reports/tr17/ • SCSU (UTS #6): http://www.unicode.org/reports/tr6/ • BOCU-1: http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/conversion/bocu1/bocu1.html • ICU homepage: http://oss.software.ibm.com/icu/ • Unicode Consortium: http://www.unicode.org/ • IBM developerWorks: http://www.ibm.com/developerworks/unicode/