Unicode 4.0 Mark Davis President, The Unicode Consortium Note: slides differ from proceedings
Overview • New Characters • Conformance • UAX: Unicode Standard Annexes • UCD: Unicode Character Database • UTS: Unicode Technical Standards • Not part of the Standard, but can claim conformance
Properties and Behavior • Unicode is not just a list of characters • Properties and behavior are crucial • With them, new characters can work “out of the box” • Some are part of the standard (BIDI, Normalization), others are associated (Collation, Regular Expressions)
New Characters: 1,228 • Modern Scripts • (additions to) Indic, Khmer, Latin, Greek, Arabic, Syriac • (minority scripts) Limbu, Tai Le, Osmanya • Historic Scripts • Linear B, Cypriot, Ugaritic, Shavian, Aegean Numbers • Symbols • Monograms, digrams, tetragrams, other symbols • modifier & combining characters
New Characters (cont.) • Special Characters • additional variation selectors (for future CJK variants), double-diacritics for dictionary use • For a detailed list, see Derived Age in the UCD 4.0, and the beta Charts. • Character repertoire corresponds to ISO/IEC 10646:2003.
Conformance • Substantially improved specification of conformance requirements • Incorporated UTR #17: Character Encoding Model, clearly separating encoding forms and encoding schemes • Tightened definitions of UTF-8, UTF-16, UTF-32 • Separate definition of Unicode String • Clarified conformance status of Unicode Standard Annexes • Formal definitions of properties & algorithms • Provisional properties
UTF vs. Unicode String • Important Distinction • UTF • Unique representation for Code Point • All else illegal • C0 80 • D800 0061 • Unicode String • Sequence of code units • Internal Processing, not interchange • Not necessarily valid UTF • C0 A0 • D800 0061
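The distinction between a well-formed UTF and a looser Unicode string can be sketched in Python as follows. This is a minimal illustration, not part of the original slides: the helper names `is_valid_utf8` and `is_valid_utf16` are hypothetical, and the UTF-16 check only looks for unpaired surrogates in a sequence of 16-bit code units.

```python
from typing import Sequence


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes are well-formed UTF-8 (Python's strict decoder)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_valid_utf16(units: Sequence[int]) -> bool:
    """Return True if the 16-bit code units contain no unpaired surrogate."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A lead surrogate must be followed directly by a trail surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= unit <= 0xDFFF:
            # A trail surrogate with no preceding lead surrogate.
            return False
        i += 1
    return True


# The byte and code unit sequences shown on the slide are not valid UTF.
print(is_valid_utf8(b"\xc0\x80"))        # non-shortest form
print(is_valid_utf8(b"\xc0\xa0"))        # also rejected
print(is_valid_utf16([0xD800, 0x0061]))  # lone lead surrogate
print(is_valid_utf16([0xD800, 0xDC00]))  # a proper surrogate pair
```

A program may still hold `[0xD800, 0x0061]` in memory as a Unicode string during internal processing; it only has to avoid emitting it as UTF-16 in interchange.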
Conformance (cont.) • Formalized policies for stability of the standard • Clarification of semantics of important characters, including BOM • Revised scope of enclosing combining marks • Revised semantics of ZWJ for cursive scripts • Normalization Corrections • U+2F868; U+2F874; U+2F91F; U+2F95F; U+2F9BF • All corrections subject to strict stability constraints: • For 3.2 repertoire, NFC3.2(X) = NFC4.0(X)
Textual Clarifications • Major changes to Chapters 2, 3, 6, 14 and 15 • Definitive terminology for code points: • graphic, format, control, private-use • = assigned characters • surrogate, noncharacter, reserved • not characters • Substantial improvements to many character block descriptions, especially Indic
Programming language identifiers • Now backwards-compatible • Once a Unicode identifier, • Always a Unicode identifier • Alternate definition for complete stability • Fix set of allowed characters • Allow all reserved code points • + Complete stability • - “Odd” characters • Also see new UTR on Syntax Characters
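As one concrete case of Unicode-based identifiers, Python 3 defines its own identifier syntax in terms of the XID_Start and XID_Continue properties (plus the underscore). The sketch below only probes that one language's rule through `str.isidentifier()`; the sample strings are made up for illustration, and other languages may tailor the default definition differently.

```python
# Candidate identifier strings (hypothetical samples, not from the slides).
samples = ["na\u00efve", "\u03bb", "\u53d8\u91cf", "1abc", "a-b"]

for name in samples:
    # str.isidentifier() applies Python's own Unicode identifier rule.
    print(repr(name), name.isidentifier())
```

The first three samples start with a letter and are accepted; `1abc` starts with a digit and `a-b` contains a hyphen, so both are rejected.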
Case mappings now normative (but tailorable) • Clearer definition of string functions: • isUpper(), isLower(), isTitle(), isFold() • toUpper(), toLower(), toTitle(), toFold() • Definition of titlecase uses word boundaries • Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
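The default (untailored) string case functions can be tried out with Python's built-in string methods, which apply full case mappings. This is a rough sketch: Python's `title()` does not use the word-boundary definition mentioned above, and `turkic_upper` is a hypothetical helper showing what a Turkic tailoring has to change, not a complete tailoring.

```python
def to_upper(s: str) -> str:
    return s.upper()        # full uppercase mapping, may change length


def to_lower(s: str) -> str:
    return s.lower()


def to_fold(s: str) -> str:
    return s.casefold()     # case folding for caseless matching


def turkic_upper(s: str) -> str:
    """Hypothetical Turkic tailoring: dotted i uppercases to U+0130."""
    return s.replace("i", "\u0130").upper()


print(to_upper("stra\u00dfe"))       # sharp s expands to "SS"
print(to_fold("Stra\u00dfe"))        # folds to "strasse"
print("\u01c6".title() == "\u01c5")  # dz digraph has a distinct titlecase form
print(to_upper("istanbul"))          # default mapping: plain "I"
print(turkic_upper("istanbul"))      # tailored mapping: U+0130 first
print(len(to_lower("\u0130")))       # default lowercase is "i" + U+0307
```

The last line shows why tailoring matters: without it, lowercasing U+0130 yields two code points rather than a plain "i".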
UAX #9: BIDI • BIDI: Arabic/Hebrew Display • HTML, all modern word processors, OSs,… • New: • canonical equivalence now preserved • data change, not algorithm • shaping is done after reordering • but not across directional boundaries • clarifications of: • ZWJ, ZWNJ • intermediate level processing
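The Bidi algorithm is driven by per-character Bidi_Class values from the UCD. As a small taste of that data, the sketch below picks a paragraph's base direction from its first strong character. It is a deliberate simplification of one early step only (no embedding levels, no reordering), and `base_direction` is a hypothetical helper name.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Guess the base direction from the first strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):   # Hebrew-type and Arabic-type letters
            return "rtl"
    return "ltr"                         # no strong character: default to LTR


print(base_direction("abc"))
print(base_direction("123 \u05e9\u05dc\u05d5\u05dd"))  # digits, then Hebrew
print(base_direction("\u0645\u0631\u062d\u0628\u0627"))  # Arabic
```

Digits and spaces are not strong characters, so the second example is decided by the Hebrew letters that follow them.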
UAX #15: Normalization • Unique form for text comparison • W3C Character Model, International Domain Names, Network File System,… • New: • Description of Stable Code Points • Notation NFC(x) and isNFC(x), defined in the Notation section • Added pointer to UTN #5 Canonical Equivalences in Applications • Rewrote Annex 12: Corrigenda for clarity, and to describe the use of Normalization Corrections • Added Annex 13: Canonical Equivalence
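The NFC(x) and isNFC(x) notation maps directly onto code. The sketch below uses Python's `unicodedata` module, which ships whatever Unicode version the interpreter was built with (not necessarily 4.0); the function names `nfc` and `is_nfc` are illustrative choices that mirror the notation.

```python
import unicodedata


def nfc(x: str) -> str:
    """NFC(x): canonical decomposition followed by canonical composition."""
    return unicodedata.normalize("NFC", x)


def is_nfc(x: str) -> bool:
    """isNFC(x): true when x is already in NFC, i.e. NFC(x) == x."""
    return nfc(x) == x


decomposed = "e\u0301"   # e + combining acute accent
composed = "\u00e9"      # precomposed e with acute

print(nfc(decomposed) == composed)   # both spellings share one normal form
print(is_nfc(decomposed), is_nfc(composed))
print(nfc("\u212b") == "\u00c5")     # Angstrom sign normalizes to A with ring
```

Comparing `nfc(a) == nfc(b)` rather than `a == b` is what makes canonically equivalent strings compare equal.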
UAX #14: Line Breaking • Line-Break (word-wrap) all Unicode text • Customizable for different languages • New: • Negative numbers and dates with hyphens will not break across lines • Word-Joiner will link any characters (except hard line breaks) • Behavior of soft hyphen clarified • marks opportunity for breaking, not specific graphic appearance. • Rules for GL relaxed: SP and ZW override • New Property Values: NL, WJ
UAX #29: Text Boundaries • Default “User Character”, Word, Sentence boundaries • Customizable for different languages • Word, sentence: tailoring expected • New: • Extracted from 3.0, but significantly revised • Grapheme cluster (“user character”) • Hangul Syllable or other Base • plus (optionally) any number of NSMs
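The "base plus any number of NSMs" rule can be approximated in a few lines. This sketch is an assumption-laden stand-in for the real default grapheme cluster rules: it treats General_Category Mn and Me as the non-spacing marks, and it ignores Hangul syllable composition, CR LF, and tailoring. `simple_clusters` is a hypothetical name.

```python
import unicodedata


def simple_clusters(text: str) -> list:
    """Split text into a base character plus any following non-spacing marks."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if clusters and is_mark:
            clusters[-1] += ch      # attach the mark to the current cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


print(simple_clusters("e\u0301a"))         # two clusters
print(simple_clusters("a\u0308\u0301b"))   # two marks stay with their base
```

A cursor-movement or selection routine would step over each returned cluster as one "user character".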
No Substantive Changes • UAX #11: East Asian Width • Guidelines for choosing character width • UAX #24: Script Names • Default script assignment • Used in regular expressions • Now a UAX (was UTR)
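The East Asian Width property is commonly used to estimate how many terminal columns a string occupies. The sketch below is one rough convention, assumed here for illustration: Wide (W) and Fullwidth (F) characters count as two columns and everything else as one, with combining marks and ambiguous-width characters not treated specially. `display_columns` is a hypothetical helper.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Estimate column count: W and F characters take two columns."""
    total = 0
    for ch in text:
        width_class = unicodedata.east_asian_width(ch)
        total += 2 if width_class in ("W", "F") else 1
    return total


print(display_columns("abc"))
print(display_columns("\u65e5\u672c\u8a9e"))  # three wide ideographs
print(display_columns("\uff21"))              # fullwidth Latin A
```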
Superseded UAXes • Incorporated into and thus superseded by Unicode Version 4.0: • UAX #13: Unicode Newline Guidelines • UAX #19: UTF-32 • UAX #21: Case Mappings • UAX #27: Unicode 3.1 • UAX #28: Unicode 3.2
Unicode Character Database • Crucial Component of Unicode • Documentation coalesced into UCD.html. • New properties and values • Hangul_Syllable_Type, Unicode_Radical_Stroke • CJK numeric values added. • PropertyValueAliases adds block names • UCD fallback props more precisely defined. • for code points not explicitly in data files • New Characters • Appropriate properties assigned
UCD4.0 (cont.) • Modifier letters • The general category of 02B9..02BA, 02C6..02CF changed to general category Lm. • Khmer • Two Khmer characters are deprecated; four others strongly discouraged. • Decimal Digits • Numeric_Type=decimal digit now aligned with General_Category=Nd • Braille • Added script value
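Two of these property changes can be spot-checked against the UCD data bundled with Python. Note the assumption: Python's `unicodedata` reflects a later Unicode version than 4.0, so this only shows that the values described above are still in place. `general_categories` is a hypothetical helper.

```python
import unicodedata


def general_categories(first: int, last: int) -> set:
    """Collect the General_Category values found in an inclusive range."""
    return {unicodedata.category(chr(cp)) for cp in range(first, last + 1)}


# Modifier letters: both ranges should consist only of Lm.
print(general_categories(0x02B9, 0x02BA))
print(general_categories(0x02C6, 0x02CF))

# Decimal digits: a decimal value goes together with General_Category Nd.
arabic_indic_three = "\u0663"
print(unicodedata.category(arabic_indic_three),
      unicodedata.decimal(arabic_indic_three))

# Superscript two is a digit-like symbol (No), so it has no decimal value.
superscript_two = "\u00b2"
print(unicodedata.category(superscript_two),
      unicodedata.decimal(superscript_two, None))
```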
UCD4.0 (cont. 2) • Case Mapping • Fixed for Turkish, Lithuanian • Default Ignorables • Hangul Filler characters • Soft-Hyphen, CGJ, ZWS • Arabic End of Ayah and Syriac Abbreviation Mark no longer DI, shaping classes fixed. • Grapheme_Extend • removes halfwidth katakana marks, most Mc (except as needed for canonical equivalence)
Unicode Technical Standard • UTS: separate standard • independent conformance requirements • UTR: information and guidelines • Documents may move from UTR status to UTS
UTS #10: Unicode Collation • Significance: • String comparison, matching, searching • Compares all Unicode characters • Handles linguistic features • Accents, Case, Punctuation,… • Contextual weighting,… • Tailor for different languages • Version 4.0.0 due Sept. 2003 • From now on, to be sync'ed in repertoire and version with the Unicode Standard.
UTS #18: Regular Exp. • Significance: • Crucial to many applications: web, XML,… • Unicode adds significant requirements • Level 1: Basic Support • Perl • Level 2: Extended Support • Level 3: Tailored Support • New: • Recently approved as UTS (was UTR) • Adds clearer conformance requirements • Flexible list of features • Partial conformance claims
UTS #6: SCSU • Simple Unicode Compression • Added suitability for XML • See also Technical Note on BOCU • Main difference: preserves binary order • x < y => BOCU(x) < BOCU(y)
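The order-preservation property x < y => f(x) < f(y) can be illustrated without implementing BOCU. The sketch below is not BOCU or SCSU: it tests the same property on two standard encodings, where UTF-8 keeps code point order under byte comparison and UTF-16 does not once supplementary characters are involved. `same_order` is a hypothetical helper.

```python
def same_order(encoding: str, a: str, b: str) -> bool:
    """True if byte-wise comparison of the encoded forms agrees with
    code point comparison of the original strings."""
    return (a < b) == (a.encode(encoding) < b.encode(encoding))


bmp_char = "\uff5e"             # a BMP character above the surrogate range
supplementary = "\U00010000"    # first supplementary code point

# By code point, U+FF5E sorts before U+10000.
print(bmp_char < supplementary)

print(same_order("utf-8", bmp_char, supplementary))      # order preserved
print(same_order("utf-16-be", bmp_char, supplementary))  # order not preserved
```

In UTF-16 the supplementary character starts with the code unit D800, which compares below FF5E, so a binary sort of UTF-16 data disagrees with a code point sort.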
New UTRs • Draft UTR #23: Character Properties • Draft Character Property Model • Character Folding • Hiragana-Katakana, Case, … • Programming Language IDs, Syntax characters
Q & A • Other talks here: • Common Locale Data • interchange of language-specific data for sorting, dates, times, currencies • ICU • premier Unicode enablement library • full-featured, x-platform • C, C++, Java
Unicode 3.2 (March, 2002) • New Characters: 1,016 • Symbols • Large collection of mathematical symbols, especially targeted at MathML, recycling symbols, ornamental brackets. • Special Characters • combining grapheme joiner, word joiner, invisible operators for math, variation selectors • Modern Scripts • minority scripts of the Philippines
Conformance • Eliminates irregular UTF-8 • Defines variation sequences • Replaces ZWNBSP with Word Joiner • Clarifies scope of combining marks (further revised in 4.0) • Clarifications of conjoining jamo behavior, hangul syllable structure, decomposables
Textual Clarifications • Combined vowels in Khmer, characters discouraged in Khmer • Use of dingbats
Unicode Standard Annexes • UAX #21: Case Mappings (was UTR)
Unicode Character Database • New properties: • IDS_Binary_Operator, IDS_Trinary_Operator, Radical, Unified_Ideograph • Default_Ignorable_Code_Point, Deprecated, Soft_Dotted, Logical_Order_Exception • Grapheme_Base, Grapheme_Extend, Grapheme_Link • DerivedAge • Normalization Corrections • Added Property & Property Value Aliases • Adds StandardizedVariants.html
Related Items • UTS #10: Unicode Collation Algorithm • Ignorable character handling, dual versioning, more conditions on well-formed weights, separate weights for CJK and unassigned characters, non-characters • Note: base version still U3.1 • UTR #26: CESU-8 • Unicode Technical Notes • Updated Character Encoding Stability Policy • Added Public Review process • Updated Glossary
Unicode 3.1 (March, 2001) • New Characters: 44,946 • First supplementaries encoded! • Modern scripts • CJK Ideographs (now totaling 71,039) • Historic scripts • Old Italic, Gothic, Deseret, Byzantine Musical Symbols • Symbols • Mathematical Alphanumeric Symbols, (Western) Musical Symbols
Conformance • Non-shortest-form UTF-8 excluded • Clarification of the stability of the standard, • code units vs. code points, non-characters, normative properties, informative properties, normative references • Revisions of guidelines: • wchar_t, unassigned code points, identifiers • Major revision of Georgian • Use of ZWNJ and ZWJ for ligatures • Language tag characters encoded • but discouraged
Unicode Standard Annexes • UAX #19: UTF-32
Unicode Character Database • Major revision of PropList properties: • White_Space, Bidi_Control, Join_Control, Hex_Digit • Alphabetic, Ideographic, Lowercase, Uppercase, ID_Start, ID_Continue, XID_Start, XID_Continue, Noncharacter_Code_Point • Quotation_Mark, Terminal_Punctuation, Math, Dash, Hyphen, Diacritic, Extender • New properties: Case folding, Scripts • Added DerivedProperties, NormalizationTest
Related Items • Documented Character Encoding Stability Policy • UTS #10: Unicode Collation Algorithm • Merged data files; updated to base version 3.1 • UTR #18: Unicode Regular Expression Guidelines • UTR #20: Unicode in XML and other Markup Languages • UTR #22: Character Mapping Tables • UTR #24: Script Names
Schedule • 2003, April: UCD/UAXes • Final data files available • Implementation can proceed • 2003, September: • Book Available