Compact Encodings of Unicode

Markus W. Scherer
Unicode/G11N Software Engineer
IBM Globalization Center of Competency

Agenda
• Encodings in files and protocols
  • Not: Processing encoding forms
• Unicode “is too big”
  • Issues and non-issues
• How to reduce size of Unicode text
  • Choice of encoding
  • Optional compression
• Examples and comparisons
22nd International Unicode Conference

What is ICU?
• Internationalization libraries for C, C++, Java*
• Open source – non-viral
• Sponsored by IBM
• *Sun’s Java licenses an earlier ICU version; ICU4J updates it.
• Unicode standard compliant
  • Full supplementary support
• Cross-platform; extensible and customizable
• High performance and thread-safe
  • Multiple locales in same thread – simultaneously
• Converters for all Unicode charsets & hundreds of legacy codepages
• http://oss.software.ibm.com/icu/

Encodings of Unicode
• Common Unicode character set
• External encodings
  • Files and protocols
  • Almost always byte-serialized
  • Character Encoding Schemes/charsets
• Processing encodings
  • Character Encoding Forms, often 16/32-bit
  • Different requirements
  • Topic for a different presentation…

Unicode “is too big”?
• Perceived large size of Unicode text
  • Compared with legacy codepages
• Size matters
  • Low-speed connections (dial-up, mobile)
  • Little memory (PDA, cell phone, embedded)
• Size does not matter when…
  • Images & other binaries swamp text size
  • High-speed network
  • Temporary documents
  • Large amounts of memory

How big is it?
• Size depends on language/script
• Bytes/char for some language groups:
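The dependence of size on script can be checked with a few lines of Python. This is a minimal sketch: the sample strings are hypothetical stand-ins chosen for illustration, not the data behind the figures in the talk, and `bytes_per_char` is a helper name introduced here.

```python
# Hypothetical sample strings, one per script (not the talk's data).
SAMPLES = {
    "English": "Hello, world",
    "Russian": "Привет, мир",
    "Japanese": "こんにちは世界",
}


def bytes_per_char(text, encoding):
    """Average number of encoded bytes per code point."""
    return len(text.encode(encoding)) / len(text)


for name, text in SAMPLES.items():
    utf8 = bytes_per_char(text, "utf-8")
    utf16 = bytes_per_char(text, "utf-16-be")  # explicit byte order, no BOM
    print(f"{name:9} UTF-8 {utf8:.2f}  UTF-16 {utf16:.2f}")
```

ASCII text takes 1 byte/char in UTF-8, while the CJK sample takes 3 bytes/char in UTF-8 and 2 in UTF-16; mixed text such as the Russian sample (Cyrillic plus ASCII punctuation) lands in between.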

Legacy codepages
• Compact because
  • Designed for single/few languages
  • Few characters compared with Unicode
• Conversion problems
  • Fallback/substitution of unmappable chars
  • Mapping table differences
  • Loss of parts of text common
  • Large number/size of mapping tables
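Two of the conversion problems listed above can be reproduced with Python's built-in codecs. This is an illustrative sketch only: the sample text is made up, and ISO 8859-1 and windows-1252 are picked here simply as familiar legacy codepages.

```python
text = "Price: 10 € for Ωmega"

# ISO 8859-1 has neither the euro sign nor Greek letters, so a
# substituting conversion silently replaces them with "?".
lossy = text.encode("latin-1", errors="replace")
print(lossy)
print(lossy.decode("latin-1") == text)  # the original text is not recoverable

# A strict conversion reports the problem instead of losing text.
try:
    text.encode("latin-1")
except UnicodeEncodeError as exc:
    print("unmappable:", repr(exc.object[exc.start:exc.end]))

# Mapping table differences: the same byte means different things
# in two closely related codepages.
print(repr(b"\x80".decode("latin-1")))  # a C1 control character
print(repr(b"\x80".decode("cp1252")))   # the euro sign
```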

Reduce Unicode text size
• Choice of encoding
  • Encodings designed for different purposes
  • Compactness vs. direct applicability vs. software support etc.
• General-purpose compression
  • Best on top of compact encoding
  • Not available in all applications

UTF-8/16
• Designed for processing but all-purpose
• UTF-8:
  • Byte-based, ASCII-compatible
  • BMP: up to 3 bytes/char
• UTF-16 (BE/LE):
  • Byte-serialization of 16-bit form, not ASCII-compatible
  • BE/LE forms or Byte Order Mark
  • BMP: always 2 bytes/char
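The byte counts above follow directly from the code point ranges. The sketch below uses a hypothetical helper, `utf8_len`, and checks it against Python's own encoder; the last sample character lies outside the BMP and is included only to show where the 3-byte and 2-byte limits stop applying.

```python
import codecs


def utf8_len(cp):
    """Number of UTF-8 bytes needed for one code point."""
    if cp < 0x80:
        return 1  # ASCII
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3  # rest of the BMP
    return 4      # supplementary code points


for ch in ("A", "\u00e9", "\u20ac", "\U00010400"):
    cp = ord(ch)
    assert utf8_len(cp) == len(ch.encode("utf-8"))
    utf16 = len(ch.encode("utf-16-be"))
    print(f"U+{cp:04X}: UTF-8 {utf8_len(cp)} bytes, UTF-16 {utf16} bytes")

# Without an explicit BE/LE form, Python's "utf-16" codec writes a
# Byte Order Mark first so that the reader can detect the byte order.
data = "A".encode("utf-16")
print(data[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))

# UTF-16 is not ASCII-compatible: "A" becomes two bytes.
print("A".encode("utf-16-be"))
```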

UTF-7
• 7-bit encoding designed for email
  • Obsolete: email now 8-bit-safe
• Partially ASCII-compatible
• BMP: 2.67 bytes/char plus overhead
  • Base64-encoded UTF-16BE
• Stateful
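The 2.67 bytes/char figure is the ratio of 16 bits per UTF-16 code unit to 6 bits per base64 character. A small sketch with Python's built-in `utf-7` codec makes the structure visible; the sample string is an arbitrary choice for illustration.

```python
import base64

text = "日本語"  # three BMP characters

utf7 = text.encode("utf-7")

# The same payload built by hand: UTF-16BE bytes, base64-encoded,
# with the "=" padding dropped.
payload = base64.b64encode(text.encode("utf-16-be")).rstrip(b"=")

print(utf7)     # payload wrapped in "+" ... "-" shift markers
print(payload)

print(len(payload) / len(text))  # base64 characters per text character
print(16 / 6)                    # the same ratio from the bit widths
```

The `+` and `-` markers are the "plus overhead" part: they are paid on every switch between directly encoded ASCII and a base64 run, which is also what makes the encoding stateful.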

SCSU & BOCU-1
• About as compact as legacy codepages
  • 1 byte/char for small scripts, 2 for CJK; stateful
  • Compress short strings better than LZ-based compression (zip) etc.
• SCSU:
  • Limited* ASCII compatibility (initial state)
  • Complex state, many encoding choices
  • Non-deterministic; arbitrary byte values
  • Established encoding, supported in various tools & editors (SC UniPad), ICU, Symbian OS (cell phones/PDAs)
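Why general-purpose compression does poorly on short strings can be seen with `zlib` from the Python standard library. This is only a stand-in: `zlib` implements DEFLATE, the method used in zip files, and the standard library has no SCSU or BOCU-1 codec to compare against, so the sketch shows the compressor's fixed overhead rather than a head-to-head result.

```python
import zlib

short = "Привет"                 # 6 characters, 12 bytes in UTF-8
raw = short.encode("utf-8")
packed = zlib.compress(raw, 9)

# The short string grows: header, checksum and block overhead
# outweigh any savings when there is almost nothing to repeat.
print(len(raw), len(packed))

# On a long and (here, artificially) repetitive input the same
# compressor does very well.
long_raw = (short * 200).encode("utf-8")
long_packed = zlib.compress(long_raw, 9)
print(len(long_raw), len(long_packed))
```

An encoding that spends about 1 byte/char on a small script would need roughly 6 bytes for the same word, with no minimum length before it starts to pay off.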

BOCU-1
• BOCU-1:
  • Delta-encoding; avoids control codes
  • MIME text-compatible but not ASCII
  • Deterministic
  • Preserves binary order (for sorting, databases)
  • New encoding; supported by ICU
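The delta-encoding idea can be illustrated without reproducing the BOCU-1 byte format. The sketch below is a simplification and not BOCU-1 itself: it only computes differences between consecutive code points, whereas the real encoding adjusts its reference value and maps differences to byte sequences in its own way. The function names and the start value are assumptions made for this illustration.

```python
def to_deltas(text, start=0x40):
    """Differences between consecutive code points (illustration only)."""
    prev = start
    out = []
    for ch in text:
        cp = ord(ch)
        out.append(cp - prev)
        prev = cp
    return out


def from_deltas(deltas, start=0x40):
    """Inverse of to_deltas: rebuild the text from the differences."""
    prev = start
    chars = []
    for d in deltas:
        prev += d
        chars.append(chr(prev))
    return "".join(chars)


word = "Привет"
deltas = to_deltas(word)
print(deltas)
print(from_deltas(deltas) == word)
```

After the first jump into the Cyrillic block, every difference is small, which is why text in a small script can be stored in about one byte per character.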

SCSU & BOCU-1 text sizes
• Average bytes/char relative to UTF-8

Encoding vs. compression
• For example: BOCU-1 with WinZip

Performance
• Converter performance
  • Roundtrip to/from UTF-16 with ICU:
    • SCSU: 45%..125% of UTF-8 roundtrip time
    • BOCU-1: 40%..160% of UTF-8 roundtrip time
  • Depends on encoding ratio
    • Fast for small scripts, 1 byte/char
• Separate compression adds to I/O time
• Conversion time typically swamped by
  • Transmission (low-bandwidth connections)
    • Shorter texts transmit faster!
  • Parsing/processing

Further considerations
• In-document encoding declarations require ASCII readability (XML, HTML)
• Protocol may limit byte values (SMTP)
• Transfer encoding syntax (TES) required for some encodings
  • base64 for SCSU or UTF-16 in emails
  • Increases text size
• Compression removes ASCII readability and uses arbitrary byte values
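The size cost of a transfer encoding is easy to quantify: base64 turns every 3 bytes into 4 ASCII characters. A minimal sketch, using an arbitrary sample string and UTF-16BE as the payload encoding:

```python
import base64

payload = "Unicode text".encode("utf-16-be")  # 12 characters, 24 bytes
wrapped = base64.b64encode(payload)

print(len(payload), len(wrapped))

# Every group of up to 3 input bytes becomes 4 output characters.
expected = 4 * ((len(payload) + 2) // 3)
print(len(wrapped) == expected)
print(len(wrapped) / len(payload))  # growth factor, about 1.33
```

For this ASCII sample the wrapped UTF-16 form costs 32 bytes where plain UTF-8 would need 12, so the choice of encoding and the transfer encoding have to be considered together.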

Conclusion
• UTF-8 and/or UTF-16 work in most cases
  • Size of text often not critical
• When small text size needed:
  • Use SCSU or BOCU-1
  • Consider compression
  • Make sure receiver can handle it

References
• Forms of Unicode: http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/
• Character Encoding Model: UTR #17 http://www.unicode.org/reports/tr17/
• SCSU: UTS #6 http://www.unicode.org/reports/tr6/
• BOCU-1: http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/conversion/bocu1/bocu1.html
• ICU homepage: http://oss.software.ibm.com/icu/
• Unicode Consortium: http://www.unicode.org/
• IBM developerWorks: http://www.ibm.com/developerworks/unicode/
