
Unicode Normalization

Mark Davis, www.macchiato.com. Normalization guarantees uniqueness: two equivalent strings have precisely the same normalized form. This allows fast binary comparison and accurate digital signatures. Normalization is recommended for XML, JavaScript and other standards.


Presentation Transcript


  1. Unicode Normalization Mark Davis www.macchiato.com

  2. Normalization • Uniqueness: two equivalent strings have precisely the same normalized form • Fast binary comparison, accurate digital signatures • Recommended for XML, JavaScript and other standards

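  3. Canonical Equivalence • Fundamental equivalence • Indistinguishable to users, when correctly rendered • Includes: combining sequences, Hangul, singletons • Examples: Ç (U+00C7) ≡ C + ◌̧ (U+0327); 가 (U+AC00) ≡ ᄀ (U+1100) + ᅡ (U+1161); Ω (U+2126, ohm sign) ≡ Ω (U+03A9, Greek capital omega)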

  4. Compatibility Equivalence • Formatting differences • Font variants (ℌ) • Breaking differences (-) • Cursive forms (ﻦ ﻨ ﻧ ﻥ) • Circled (⑪) • Width, size, rotated (カ ﹠ ︷) • Super/subscripts (₉ ⁹) • Squared characters (㌀) • Fractions (⅚) • Others (dž) • Examples: ｶ (U+FF76, halfwidth) ≈ カ (U+30AB); ㎏ (U+338F) ≈ kg; ﬁ (U+FB01) ≈ fi

  5. UTR #15: Unicode Normalization Forms

  6. Normalization Requirement • Uniqueness: two equivalent strings will have precisely the same normalized form • If two strings x and y are canonical equivalents, then C(x) = C(y) and D(x) = D(y) • If two strings x and y are compatibility equivalents, then KC(x) = KC(y) and KD(x) = KD(y)
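
The uniqueness requirement can be checked with any conformant implementation. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `equivalent` and the sample strings are illustrative choices, not part of the presentation.

```python
import unicodedata

def equivalent(x: str, y: str, form: str) -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, x) == unicodedata.normalize(form, y)

# Canonical equivalents: precomposed Ç versus C + combining cedilla.
precomposed = "\u00C7"
combining = "C\u0327"

# Compatibility equivalents: the fi ligature versus the two letters f, i.
ligature = "\uFB01"
plain = "fi"

# C(x) = C(y) and D(x) = D(y) for canonical equivalents.
canonical_results = [equivalent(precomposed, combining, f) for f in ("NFC", "NFD")]

# KC(x) = KC(y) and KD(x) = KD(y) for compatibility equivalents.
compat_results = [equivalent(ligature, plain, f) for f in ("NFKC", "NFKD")]

# The ligature is only a compatibility equivalent, so NFC keeps it distinct.
ligature_differs_under_nfc = not equivalent(ligature, plain, "NFC")
```

Once both sides are normalized to the same form, a plain binary comparison is enough, which is the "fast binary comparison" benefit named earlier in the deck.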

  7. Affected Characters • None of the forms affect text with only ASCII characters (U+0000 to U+007F) • None of the forms generate compatibility characters that were not in the source text • Both KD and KC replace compatibility characters • Both D and C maintain compatibility characters
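
A short sketch of these properties, again assuming Python's `unicodedata` as the implementation. The helper `normalize_all` and the choice of U+338F (SQUARE KG) as the compatibility character are illustrative.

```python
import unicodedata

FORMS = ("NFD", "NFC", "NFKD", "NFKC")

def normalize_all(text: str) -> dict:
    """Return the text under each of the four normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}

# Every ASCII character, U+0000 to U+007F, in one string.
ascii_text = "".join(chr(cp) for cp in range(0x80))
ascii_unchanged = all(v == ascii_text for v in normalize_all(ascii_text).values())

# A compatibility character: D and C keep it, KD and KC replace it.
kg = normalize_all("\u338F")
```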

  8. Cautions: Decomposition • Requires decomposition mappings from the Unicode Character Database • Those decomposition mappings must be applied recursively • The string must be put into canonical order • The mappings applied are either canonical only (D) or canonical plus compatibility (KD)
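
The three steps above can be sketched as follows. This is a simplified illustration, not the reference implementation: it reads mappings through Python's `unicodedata.decomposition`, and it deliberately omits the algorithmic decomposition of Hangul syllables, so it will not match a real normalizer on Hangul text. The function names are hypothetical.

```python
import unicodedata

def decompose(ch: str, compat: bool = False) -> str:
    """Recursively apply the decomposition mapping of a single character."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return ch
    parts = mapping.split()
    if parts[0].startswith("<"):
        # A tag such as <compat> marks a compatibility mapping.
        if not compat:
            return ch
        parts = parts[1:]
    return "".join(decompose(chr(int(p, 16)), compat) for p in parts)

def canonical_order(s: str) -> str:
    """Stable-sort each run of non-starters by canonical combining class."""
    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

def to_decomposed(s: str, compat: bool = False) -> str:
    """Sketch of form D (compat=False) or KD (compat=True); no Hangul support."""
    return canonical_order("".join(decompose(ch, compat) for ch in s))
```

For example, U+01D5 maps to U+00DC U+0304, and U+00DC maps in turn to U+0055 U+0308, which is why a single pass over the mappings is not enough.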

  9. Cautions: Composition • Decomposition required first! • Then canonical composition • Composition data: fixed at Unicode 3.0.0 • Some characters are excluded from composition • Form C and Form KC can still have combining characters! • Required for Indic, Arabic, Hebrew, &c.
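
To illustrate the point that Form C output can still contain combining characters, here is a small check using Python's `unicodedata`. The sample strings are illustrative: é has a precomposed form, while q with dot below does not.

```python
import unicodedata

def has_combining_marks(s: str) -> bool:
    """True if any character has a non-zero canonical combining class."""
    return any(unicodedata.combining(ch) != 0 for ch in s)

# e + combining acute composes to the single character U+00E9.
composed = unicodedata.normalize("NFC", "e\u0301")

# q + combining dot below has no precomposed form, so the mark remains.
leftover = unicodedata.normalize("NFC", "q\u0323")
```

Code that assumes "NFC means one code point per user-perceived character" will therefore fail on inputs like the second one.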

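  10. Caution: Both C & D • No normalization form is closed under string concatenation. Example: • "…a◌̰" and "◌̀…" are each in NFC and in NFD • Their concatenation "…a◌̰◌̀…" is not in NFC, because the grave accent can now compose with the a • NFC of the concatenation: "…à◌̰…" • For NFD, swap the marks: "…a◌̀" + "◌̰…" gives "…a◌̀◌̰…", which is not in canonical order; its NFD is "…a◌̰◌̀…" • Exceptions are easy to test for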
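
A sketch of the concatenation problem, using Python's `unicodedata` and the marks U+0330 (tilde below, combining class 220) and U+0300 (grave, combining class 230). The helper `is_normalized` is defined locally for illustration.

```python
import unicodedata

def is_normalized(form: str, s: str) -> bool:
    """True if normalizing s to the given form leaves it unchanged."""
    return unicodedata.normalize(form, s) == s

# NFC case: both pieces are normalized, the joined string is not.
left, right = "a\u0330", "\u0300"
joined = left + right
nfc_pieces_ok = is_normalized("NFC", left) and is_normalized("NFC", right)
nfc_joined_ok = is_normalized("NFC", joined)
nfc_of_joined = unicodedata.normalize("NFC", joined)

# NFD case: the marks arrive in the wrong canonical order.
left_d, right_d = "a\u0300", "\u0330"
joined_d = left_d + right_d
nfd_pieces_ok = is_normalized("NFD", left_d) and is_normalized("NFD", right_d)
nfd_joined_ok = is_normalized("NFD", joined_d)
nfd_of_joined = unicodedata.normalize("NFD", joined_d)
```

A simple, if not the fastest, way to stay safe is to normalize again after joining.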

  11. Composition Process • Decompose (D or KD) • Combine unblocked characters with the previous starter, if possible*

  12. Composition Exclusions • Script specifics: क (U+0915) + ◌़ (U+093C) ⇏ क़ (U+0958) • Futures: G + ◌̣ ⇏ G̣ • Singletons*: Ω (U+03A9) ⇏ Ω (U+2126) • Non-starter sequences*: ◌̈ (U+0308) + ◌́ (U+0301) ⇏ ◌̈́ (U+0344)
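
The effect of the exclusions is visible from the outside: Form C never produces an excluded character, even when the input already contains it. A minimal check with Python's `unicodedata`; the dictionary layout is just an illustrative way to hold the three cases.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

# Each entry: excluded precomposed character -> what Form C yields instead.
cases = {
    "script-specific": ("\u0958", "\u0915\u093C"),  # Devanagari QA
    "singleton": ("\u2126", "\u03A9"),              # OHM SIGN -> GREEK OMEGA
    "non-starter": ("\u0344", "\u0308\u0301"),      # dialytika tonos
}

results = {name: nfc(excluded) == expected
           for name, (excluded, expected) in cases.items()}

# The decomposed Devanagari sequence is also left uncomposed.
ka_nukta_stays = nfc("\u0915\u093C") == "\u0915\u093C"
```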

  13. Legacy Encoding • Legacy text is ‘normalized’ if it maps 1:1 to normalized Unicode text • Legacy sets: • Prenormalized: e.g. ISO 8859-1 • Normalizable: e.g. ISO 2022 (ISO 5426/ISO 8859-1/…) • Unnormalizable: e.g. ISO 5426
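
As a small illustration of a prenormalized legacy set, the sketch below decodes every ISO 8859-1 byte and checks that the result is already in Form C and maps back 1:1. It uses Python's built-in `iso-8859-1` codec; reading "prenormalized" as "already in Form C" is an assumption made here, since the slide does not name a form.

```python
import unicodedata

# All 256 byte values of ISO 8859-1, decoded to Unicode.
legacy_bytes = bytes(range(256))
text = legacy_bytes.decode("iso-8859-1")

# Already in Form C: normalizing changes nothing.
already_nfc = unicodedata.normalize("NFC", text) == text

# The mapping is 1:1, so encoding returns the original bytes.
round_trips = text.encode("iso-8859-1") == legacy_bytes
```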

  14. Programming Identifiers • Closed under all Normalization Forms, if minor changes incorporated • Modified syntax: • identifier := start ( start | extend )* • start := [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}] - irregulars - combining_like • extend := [{Mn}{Mc}{Nd}{Pc}{Cf}] - irregulars + combining_like + mid_dot • (Almost) closed under Case Mappings • see SpecialCasing.txt
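
The base grammar can be sketched with general categories alone. This is a deliberately simplified illustration: the `irregulars`, `combining_like` and `mid_dot` adjustments are not defined in this transcript, so they are left out, and the function name `is_identifier` is hypothetical.

```python
import unicodedata

# General categories from the slide, without the irregulars,
# combining_like and mid_dot adjustments (not defined in the transcript).
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND = {"Mn", "Mc", "Nd", "Pc", "Cf"}

def is_identifier(s: str) -> bool:
    """identifier := start ( start | extend )*"""
    if not s or unicodedata.category(s[0]) not in START:
        return False
    allowed = START | EXTEND
    return all(unicodedata.category(ch) in allowed for ch in s[1:])

# The same word stays a valid identifier in composed and decomposed form.
word_nfc = "caf\u00E9"
word_nfd = unicodedata.normalize("NFD", word_nfc)
```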

  15. Resources • Reference version on Unicode Site • Production Version • http://oss.software.ibm.com/icu • ICU: C/C++ and Java Versions • Open Source, with IBM Public License • Free commercial use and distribution: Not Viral! • Panel Later today • Other companies also providing: ask!

  16. Normalization • Uniqueness: two equivalent strings have precisely the same normalized form • Fast binary comparison, accurate digital signatures • Recommended for XML, JavaScript and other standards

  17. Q & A

  18. Backup Slides

  19. Definition: Starter • S is a starter if: • It has a canonical class of zero in the Unicode Character Database • It can start a composition • Examples: • Starters: ‘a’, ‘ق’, ‘Θ’, ‘क’, plus spacing marks and some non-spacing marks (‘ी’, ‘◌ै’) • Non-starters: most non-spacing marks (‘◌̀’, ‘◌̊’, ‘◌̽’, ‘◌̥’)
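
The canonical class test is a one-liner in most libraries. A minimal sketch with Python's `unicodedata.combining`, which returns the canonical combining class; the helper name `is_starter` is illustrative.

```python
import unicodedata

def is_starter(ch: str) -> bool:
    """A starter has canonical combining class zero."""
    return unicodedata.combining(ch) == 0

# Letters, a spacing mark (U+0940) and a class-zero non-spacing mark (U+0948).
starters = ["a", "\u0642", "\u0398", "\u0915", "\u0940", "\u0948"]

# Ordinary non-spacing marks: grave, ring above, x above, ring below.
non_starters = ["\u0300", "\u030A", "\u033D", "\u0325"]
```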

  20. Definition: Blocked • C is blocked from S if there is some character B between S and C, and either • B is a starter, or • B has the same canonical class as C • Examples • “ABC”: B blocks C from A • “A◌̀◌̊”: ◌̀ blocks ◌̊ from A • “A◌̥◌̊”: ◌̥ doesn’t block ◌̊ from A
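
The definition translates directly into a loop over the characters between S and C. This sketch follows the wording on the slide, using Python's `unicodedata.combining` for the canonical class; the function name and index-based signature are hypothetical.

```python
import unicodedata

def is_blocked(s: str, starter_index: int, char_index: int) -> bool:
    """True if s[char_index] is blocked from s[starter_index].

    Some character B strictly between the two must be a starter
    or have the same canonical class as the target character.
    """
    ccc = unicodedata.combining
    target_class = ccc(s[char_index])
    for b in s[starter_index + 1:char_index]:
        if ccc(b) == 0 or ccc(b) == target_class:
            return True
    return False

abc = is_blocked("ABC", 0, 2)                    # B is a starter
same_class = is_blocked("A\u0300\u030A", 0, 2)   # both marks have class 230
diff_class = is_blocked("A\u0325\u030A", 0, 2)   # class 220 versus class 230
```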

  21. Testing Conformance: Canonical

  22. Unicode Normalization • Introduction • Normalization forms • Design goals • Specification • Excluded characters • Versions • Legacy encodings • Applications

  23. Characters and Encoding Forms

      Abstract   Encoded   Serialized (UTF-16BE)   Serialized (UTF-8)
      Å          C5        00 C5                   C3 85
      Å          212B      21 2B                   E2 84 AB
      Å          F0000     DB 80 DC 00             F3 B0 80 80
      A °        61 30A    00 61 03 0A             61 CC 8A
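
The serialized bytes on this slide can be reproduced with Python's built-in codecs. The helper `hex_bytes` is an illustrative name; the four inputs are the encoded values from the slide.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and format the bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

# U+00C5, U+212B, U+F0000 (private use), and U+0061 U+030A.
encoded = ["\u00C5", "\u212B", "\U000F0000", "a\u030A"]

table = [(hex_bytes(t, "utf-16-be"), hex_bytes(t, "utf-8")) for t in encoded]
```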
