
ISO/IEC 10646 and Unicode

Learn about ISO/IEC 10646 and Unicode, the universal coded character set designed for text processing and exchange. Explore the fixed-width encoding, unambiguous representation, BMP, UCS-2, Unicode extensions, and UTF-8 encoding advances. Understand the character vs glyph concept, Han unification, and multi-byte vs wide character processing. Gain insights into the efficiency, compatibility, and standards of modern text representation.

eyler

Presentation Transcript


  1. ISO/IEC 10646 and Unicode • It is a coded character set (codeset) • Designed for text processing and exchange • Features: • Universal: covers the characters in almost all national standards • Framework: the coding architecture is fixed, and code points can be filled in later • Uniform and efficient: fixed-width encoding, so there is no need to determine the length of each code (as there is when ASCII is mixed with Big5 or GB) • Unambiguous: any given 16-bit (32-bit) value always represents the same character

  2. UCS-4 (Canonical form of ISO 10646) • Fixed 32-bit (actually 31-bit) coding assignment • 00 00 00 00 to 7F FF FF FF • Four octets per code point: Group No. (total: 128), Plane No. (total: 256), High Byte (total: 256), Low Byte (total: 256) • Each plane: $2^{16}$ = 65,536 code points • BMP (the Basic Multilingual Plane) • Both Group No. and Plane No. are 00 (the first two bytes are zero) • Before ISO 10646 Part 2 came out (end of 2001), only the BMP contained characters
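The four-octet layout on this slide can be sketched in a few lines of Python. This is only an illustration under the slide's own definitions; the function names `ucs4_octets` and `in_bmp` are hypothetical and not part of any standard API.

```python
def ucs4_octets(code_point: int) -> tuple[int, int, int, int]:
    """Split a UCS-4 value into (group, plane, high byte, low byte)."""
    # UCS-4 uses only 31 bits: 00 00 00 00 to 7F FF FF FF.
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS-4 range")
    group = (code_point >> 24) & 0x7F   # 128 groups
    plane = (code_point >> 16) & 0xFF   # 256 planes per group
    high = (code_point >> 8) & 0xFF     # 256 rows per plane
    low = code_point & 0xFF             # 256 cells per row
    return group, plane, high, low


def in_bmp(code_point: int) -> bool:
    """True when both the group and plane octets are 00."""
    group, plane, _, _ = ucs4_octets(code_point)
    return group == 0 and plane == 0


print(ucs4_octets(0x4E2D))   # a BMP ideograph
print(ucs4_octets(0x20000))  # first code point of Plane 2
```

The first call reports group 0 and plane 0, so the character lies in the BMP; the second reports plane 2, which is where Extension B is placed.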

  3. Code Architecture of UCS-4 • [Figure: Groups 0 to 127, 256 planes per group; Plane 00 of Group 0 is the BMP]

  4. UCS-2: 2-byte representation of UCS-4 • Basic Multilingual Plane (BMP) • Switching mechanism that uses a code range of the BMP to access another 16 planes (surrogate pairs) • Zones of the BMP: • A-Zone: alphabets, symbols, CJK miscellaneous • I-Zone: CJK ideographs • O-Zone: Hangul • S-Zone: surrogates • R-Zone: private use, compatibility, Arabic presentation forms

  5. Unicode • Unicode is the implementation of ISO 10646 with a 16-bit representation using UCS-2 • Defines the actions associated with certain characters • Control character behavior • Rendering behavior: combining characters • Examples • The control character bell <BEL> should cause a sound in the system • Typing U+0061 (a) followed by U+0300 (combining grave accent) is rendered as the single symbol à
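The combining-character example can be checked with Python's standard `unicodedata` module. Rendering itself is the job of the display layer, so this sketch uses NFC normalization only as a stand-in: it shows that the two-code-point sequence is canonically equivalent to the single precomposed character U+00E0.

```python
import unicodedata

decomposed = "\u0061\u0300"  # 'a' followed by COMBINING GRAVE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

# The decomposed form is two code points; the composed form is one.
print(len(decomposed), len(composed))
print(unicodedata.name(decomposed[1]))
print(composed == "\u00e0")
```

A non-zero value from `unicodedata.combining()` marks a character as combining, which is how software can tell that U+0300 attaches to the preceding base character.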

  6. Extension of ISO 10646 • Extension A (BMP) has 6,582 characters, published in ISO/IEC 10646-1 Second Edition (2000) • Extension B: • All characters in 康熙字典 (Kangxi Dictionary) and 漢語大字典 (Hanyu Da Zidian), plus other characters such as those in the HK Supplementary Character Set • ISO/IEC 10646-2 (2001), a total of 43,253 characters • In Plane 2 of UCS-4 • How would Extension B be supported in UCS-2? => Using some encoding scheme

  7. Surrogate Pairs • Two UCS-2 codes, H followed by L, written <H,L>, where • H is in the range D800 - DBFF • L is in the range DC00 - DFFF • For a given UCS-2 code (or code pair) U, the corresponding UCS-4 code-point value N (scalar value) is • $N = U$ if U is a single, non-surrogate value • $N = (H - \mathrm{D800}_{16}) \times 400_{16} + (L - \mathrm{DC00}_{16}) + 10000_{16}$ if U is a surrogate pair <H,L> • Undefined for any other U in UCS-2 • N is in the range 0 to $\mathrm{10FFFF}_{16}$ • <D800, DC00> => $N = 10000_{16}$ • <DBFF, DFFF> => ?
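The surrogate-pair formula on this slide translates directly into code. The following is a minimal sketch; the function names `surrogates_to_scalar` and `scalar_to_surrogates` are hypothetical, and the second function is simply the algebraic inverse of the slide's formula.

```python
def surrogates_to_scalar(h: int, l: int) -> int:
    """Apply N = (H - D800) * 400 + (L - DC00) + 10000 (all hexadecimal)."""
    if not (0xD800 <= h <= 0xDBFF and 0xDC00 <= l <= 0xDFFF):
        raise ValueError("not a valid <H,L> surrogate pair")
    return (h - 0xD800) * 0x400 + (l - 0xDC00) + 0x10000


def scalar_to_surrogates(n: int) -> tuple[int, int]:
    """Inverse mapping for scalar values above the BMP."""
    if not 0x10000 <= n <= 0x10FFFF:
        raise ValueError("scalar value does not need a surrogate pair")
    offset = n - 0x10000               # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits select H
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select L
    return high, low


print(hex(surrogates_to_scalar(0xD800, 0xDC00)))
print([hex(u) for u in scalar_to_surrogates(0x20000)])
```

Calling `surrogates_to_scalar(0xDBFF, 0xDFFF)` evaluates the pair the slide leaves open, and `scalar_to_surrogates(0x20000)` gives the pair that reaches the first code point of Plane 2.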

  8. UTF: UCS Transformation Format • Allows a certain number of code values in UCS that correspond to some other coding standard (e.g. ASCII) to be transmitted exactly as they would be in that standard, a property known as transparency, while other code values are represented through an escape mechanism • Variable-length encoding to achieve greater efficiency

  9. UTF-8: 8-bit encoding for the 8-bit UNIX environment • ASCII transparent • The first byte indicates the number of bytes in the character's encoding • Shortest-encoding principle for invertible (bijective) encoding/decoding • Saves storage space for ASCII and non-ideographic characters • Example: Unicode A324 0430 0023 8A43 => UTF-8: ? • Example: UTF-8 24 38 58 CE 82 => UCS-4: ?
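The two examples on this slide can be worked through with a small hand-written encoder. This is a sketch, not a reference implementation: the name `utf8_encode` is hypothetical, and it assumes the modern limit of four bytes (scalar values up to 10FFFF) rather than the longer sequences originally defined for the full 31-bit UCS-4 range.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one scalar value using the shortest UTF-8 form."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:       # 0xxxxxxx: ASCII passes through unchanged
        return bytes([cp])
    if cp < 0x800:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# First example: code points to UTF-8 bytes.
encoded = b"".join(utf8_encode(cp) for cp in (0xA324, 0x0430, 0x0023, 0x8A43))
print(encoded.hex(" ").upper())

# Second example: UTF-8 bytes back to code points, using the built-in decoder.
decoded = [ord(ch) for ch in bytes.fromhex("24 38 58 CE 82").decode("utf-8")]
print([f"{cp:04X}" for cp in decoded])
```

The leading bits of the first byte (0, 110, 1110, 11110) give the sequence length, which is the property the slide describes; the result can be cross-checked against Python's own `str.encode("utf-8")`.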

  10. Character vs. glyph • Character: the smallest component of written language that has semantic value • Glyph: the shape a character can have when it is rendered or displayed • Example: the letter A set in different typefaces is the same character with the same code; the concrete shapes can be very different, yet they share one code point • Coding of variants

  11. ISO 10646/Unicode Features for Chinese • Han Unification (Chinese, Japanese and Korean) • Unification problems: • Different sources, non-cognate characters • Three-dimensional conceptual model: semantics (x), abstract shape (y), actual shape (z)

  12. Unification Rules (認同規則) • R1: Source Separation Rule: if two ideographs are distinct in a primary source standard, then they are not unified. Why? • R2: Non-cognate (非同源) Rule: in general, if two ideographs are unrelated in historical derivation (non-cognate characters), then they are not unified • R3: The abstract shape of each ideograph is determined by means of a two-level classification. Any two ideographs that possess the same abstract shape are unified unless disallowed by R1 or R2.

  13. Example: • Component structure analysis

  14. Sources of Unified Han Characters

  15. Wide characters vs. multi-byte characters • Text information needs to be represented by the right data types • Multi-byte characters: data are processed on a per-byte basis: Big5, GB, EUC, even UTF-8 • Wide characters: fixed-width encoding, so no testing of the high bit is needed • Processing representation for wide characters: • Big endian vs. little endian • Data type dependent • System architecture dependent • Distinction: the byte order mark U+FEFF is stored as the bytes FE FF in big endian and FF FE in little endian
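The byte-order distinction can be illustrated with Python's built-in UTF-16 codecs and the BOM constants in the standard `codecs` module. This is a sketch only; the helper name `detect_utf16_byte_order` is hypothetical, and 16-bit code units stand in here for the wide-character type, whose actual width depends on the platform.

```python
import codecs

text = "A\u4e2d"  # one ASCII letter and one CJK ideograph

big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first
print(big.hex(" ").upper())
print(little.hex(" ").upper())


def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order from a leading byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    return "unknown"


print(detect_utf16_byte_order(codecs.BOM_UTF16_BE + big))
print(detect_utf16_byte_order(codecs.BOM_UTF16_LE + little))
```

Both encodings use two bytes per character for this text, so no per-byte high-bit test is needed, but a reader must know the byte order; the generic `"utf-16"` decoder in Python reads the mark and strips it.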
