ISO/IEC 10646 and Unicode

• A coded character set (codeset)
• Designed for text processing and exchange
• Features:
  • Universal: covers the characters of almost all national standards
  • Framework: the coding architecture is fixed first, and code points can be filled in later
  • Uniform and efficient: fixed-width encoding, so there is no need to determine the length of each code as in mixed-width schemes (ASCII, Big5, GB)
  • Unambiguous: any given 16-bit (or 32-bit) value always represents the same character

UCS-4 (canonical form of ISO 10646)
• Fixed 32-bit (actually 31-bit) code assignment
  • 00 00 00 00 to 7F FF FF FF
• Each plane: $2^{16}$ = 65,536 code points
• BMP (the Basic Multilingual Plane)
  • Both Group No. and Plane No. are 00 (the first two bytes are zero)
  • Before ISO 10646 Part 2 came out (end of 2001), only the BMP contained characters
• Four-byte layout: Group No. (total: 128) | Plane No. (total: 256) | High Byte (total: 256) | Low Byte (total: 256)
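The four-byte layout above can be illustrated with a minimal Python sketch. The function names `split_ucs4` and `in_bmp` are hypothetical, chosen for this example only; the field widths follow the totals listed on the slide.

```python
def split_ucs4(code_point):
    """Split a UCS-4 value into (group, plane, high byte, low byte)."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("UCS-4 values span 00000000 to 7FFFFFFF")
    group = (code_point >> 24) & 0x7F   # 7 bits: 128 groups
    plane = (code_point >> 16) & 0xFF   # 256 planes per group
    high = (code_point >> 8) & 0xFF     # high byte within the plane
    low = code_point & 0xFF             # low byte within the plane
    return group, plane, high, low


def in_bmp(code_point):
    """True when both Group No. and Plane No. are 00."""
    group, plane, _, _ = split_ucs4(code_point)
    return group == 0 and plane == 0


print(split_ucs4(0x4E2D))   # a BMP code point
print(in_bmp(0x20000))      # a code point in Plane 2
```

A value is in the BMP exactly when its first two bytes are zero, which is what `in_bmp` checks.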

Code Architecture of UCS-4
[Figure: 128 groups (Group 0 to Group 127), 256 planes per group; Plane 00 of Group 0 is the BMP]

UCS-2: 2-byte representation of UCS-4
• Covers the Basic Multilingual Plane (BMP)
• Switching mechanism that uses a reserved code range of the BMP to access another 16 planes (surrogate pairs; strictly, UCS-2 extended with surrogate pairs is the encoding form called UTF-16)
• BMP zones:
  • A-Zone: alphabets, symbols, CJK miscellaneous
  • I-Zone: CJK ideographs
  • O-Zone: Hangul
  • S-Zone: surrogates
  • R-Zone: private use, compatibility, Arabic presentation forms

Unicode
• Unicode is the implementation of ISO 10646 with a 16-bit representation using UCS-2
• Defines the actions associated with certain characters
  • Control character behavior
  • Rendering behavior: combining characters
• Examples
  • The control character bell <BEL> should cause the system to make a sound
  • Typing U+0061 (a) followed by U+0300 (combining grave accent) is rendered as the single symbol à
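The combining-character example can be reproduced in Python. Rendering itself happens in the display layer, so this sketch instead uses NFC normalization from the standard `unicodedata` module, a related but separate mechanism, to show that the two-code-point sequence and the single precomposed character are treated as equivalent.

```python
import unicodedata

# U+0061 (a) followed by U+0300 (combining grave accent)
decomposed = "\u0061\u0300"

# NFC normalization composes the sequence into one precomposed code point
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), [f"U+{ord(c):04X}" for c in decomposed])
print(len(composed), [f"U+{ord(c):04X}" for c in composed])

# A non-zero combining class marks U+0300 as a combining character
print(unicodedata.combining("\u0300"))
```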

Extension of ISO 10646
• Extension A (in the BMP) has 6,582 characters, published in ISO/IEC 10646-1 Second Edition (2000)
• Extension B:
  • All characters in the Kangxi Dictionary (康熙字典) and the Hanyu Da Zidian (漢語大字典), plus other characters such as those in the HK Supplementary Character Set
  • ISO/IEC 10646-2 (2001), a total of 43,253 characters
  • Located in Plane 2 of UCS-4
• How would Extension B be supported in UCS-2? => By using an encoding scheme (surrogate pairs)
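As a rough illustration of how a Plane 2 code point can be carried in 16-bit units, the sketch below splits a supplementary code point into a high and a low surrogate. The function name `to_surrogate_pair` is hypothetical; the arithmetic is the standard UTF-16 split, and U+20000 is used simply because it is the first code point of Plane 2.

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into (high, low) 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points 10000 to 10FFFF need a pair")
    offset = code_point - 0x10000       # 20-bit offset past the BMP
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x20000)
print(f"{high:04X} {low:04X}")

# Python's built-in UTF-16 codec produces the same two units
print("\U00020000".encode("utf-16-be").hex())
```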

Surrogate Pairs
• Two UCS-2 code values, H followed by L, written <H,L>, where
  • H is in the range D800 to DBFF
  • L is in the range DC00 to DFFF
• For a given UCS-2 code (or code pair) U, the corresponding UCS-4 code-point value N (scalar value) is
  • $N = U$ if U is a single, non-surrogate value
  • $N = (H - \mathrm{D800}_{16}) \times 400_{16} + (L - \mathrm{DC00}_{16}) + 10000_{16}$ if U is a surrogate pair <H,L>
  • Undefined for any other U in UCS-2
• N is in the range 0 to $\mathrm{10FFFF}_{16}$
• <D800,DC00> => N = $10000_{16}$
• <DBFF,DFFF> => ?
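The scalar-value formula can be checked with a small Python sketch. The function name `scalar_value` and the choice to pass the code values as a list are assumptions made for this example; running it on the two pairs listed above gives the boundary cases.

```python
def scalar_value(units):
    """Map one UCS-2 code value, or a surrogate pair <H,L>, to its scalar value N."""
    if len(units) == 1:
        u = units[0]
        if 0xD800 <= u <= 0xDFFF:
            raise ValueError("a lone surrogate has no scalar value")
        return u  # N = U for a single, non-surrogate value
    if len(units) == 2:
        h, l = units
        if 0xD800 <= h <= 0xDBFF and 0xDC00 <= l <= 0xDFFF:
            # N = (H - D800) * 400 + (L - DC00) + 10000, all in hexadecimal
            return (h - 0xD800) * 0x400 + (l - 0xDC00) + 0x10000
    raise ValueError("undefined for this input")


for pair in ([0xD800, 0xDC00], [0xDBFF, 0xDFFF]):
    print(f"{scalar_value(pair):X}")
```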

UTF: UCS Transformation Format
• Allows the UCS code values that correspond to some other coding standard (e.g. ASCII) to be transmitted exactly as they would be in that standard, a property known as transparency, while other code values are represented through an escape mechanism
• Variable-length encoding to achieve greater efficiency

UTF-8: 8-bit encoding for 8-bit UNIX environments
• ASCII transparent
• The first byte indicates the number of bytes in the sequence
• Shortest-encoding principle, for invertible (bijective) encoding and decoding
• Saves storage space for ASCII and other non-ideographic characters
• Example: Unicode A324 0430 0023 8A43 => UTF-8:
• Example: UTF-8 24 38 58 CE 82 => UCS-4:
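The two exercises can be worked through with a minimal hand-written encoder. This is a sketch, not a reference implementation: the name `utf8_encode` is hypothetical, and it assumes the modern limit of 10FFFF (at most four bytes) rather than the full 31-bit UCS-4 range. Decoding uses Python's built-in codec.

```python
def utf8_encode(cp):
    """Encode one code point using the shortest UTF-8 form."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code values are not encoded in UTF-8")
    if cp < 0x80:           # 1 byte: 0xxxxxxx (ASCII transparent)
        return bytes([cp])
    if cp < 0x800:          # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:      # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


# First exercise: encode each code point and show the bytes in hex
for cp in (0xA324, 0x0430, 0x0023, 0x8A43):
    print(f"U+{cp:04X} ->", utf8_encode(cp).hex(" ").upper())

# Second exercise: decode a UTF-8 byte sequence back to code points
decoded = bytes.fromhex("24 38 58 CE 82").decode("utf-8")
print([f"{ord(ch):08X}" for ch in decoded])
```

The number of leading 1 bits in the first byte gives the sequence length, which is why a decoder never needs to look ahead to find character boundaries.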

Character vs. glyph
• Character: the smallest component of written language that has semantic value
• Glyph: the shape that a character can have when it is rendered or displayed
• Example: the letter A set in different typefaces is the same character and has the same code. The concrete shapes can be very different, yet they are given one code point.
• Coding of variants

ISO 10646/Unicode Features for Chinese
• Han Unification (Chinese, Japanese and Korean)
• Unification problems:
  • Different sources, non-cognate characters
• Three-dimensional conceptual model: semantics (x), abstract shape (y), actual shape (z)
• Examples

Unification Rules (認同規則)
• R1: Source Separation Rule: if two ideographs are distinct in a primary source standard, then they are not unified. Why?
• R2: Non-cognate (非同源) Rule: in general, if two ideographs are unrelated in historical derivation (non-cognate characters), then they are not unified.
• R3: By means of a two-level classification, the abstract shape of each ideograph is determined. Any two ideographs that possess the same abstract shape are unified unless disallowed by R1 or R2.

Example: • Component structure analysis

Sources of Unified Han Characters

Wide characters vs. multi-byte characters
• Text information needs to be represented by the right data types
• Multi-byte characters: data are processed on a per-byte basis: Big5, GB, EUC, even UTF-8
• Wide characters: fixed-width encoding, so no testing of the high bit is needed
• Processing representation for wide characters:
  • Big endian vs. little endian
  • Data type dependent
  • System architecture dependent
  • Distinction: the byte order mark U+FEFF is stored as the bytes FE FF in big endian and FF FE in little endian (a reader that sees 0xFFFE knows the byte order is the opposite of what it assumed)
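The byte-order distinction can be seen directly in Python. The helper `detect_byte_order` below is a hypothetical name for this sketch, and it only inspects the first two bytes of 16-bit data; the codec names `utf-16-be` and `utf-16-le` are Python's built-in encodings.

```python
def detect_byte_order(data):
    """Guess the byte order of 16-bit text from a leading byte order mark."""
    if data[:2] == b"\xfe\xff":
        return "big"
    if data[:2] == b"\xff\xfe":
        return "little"
    return "unknown"  # no mark present; the order must be known some other way


text = "\ufeffA"  # byte order mark followed by the letter A

big = text.encode("utf-16-be")
little = text.encode("utf-16-le")

print(big.hex(" "), detect_byte_order(big))
print(little.hex(" "), detect_byte_order(little))
```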
