Unicode 4.0 Mark Davis President, The Unicode Consortium
Schedule • 2003, April: UCD/UAXes • Final data files available • Implementation can proceed • 2003, September: • Book Available
New Characters: 1,228 • Modern Scripts • (additions to) Indic, Khmer, Latin, Greek, Arabic, Syriac • (minority scripts) Limbu, Tai Le, Osmanya • Historic Scripts • Linear B, Cypriot, Ugaritic, Shavian, Aegean Numbers • Symbols • Monograms, digrams, tetragrams, other symbols • modifier & combining characters
New Characters (cont.) • Special Characters • additional variation selectors (for future CJK variants), double-diacritics for dictionary use • For a detailed list, see Derived Age in the UCD 4.0, and the beta Charts. • Character repertoire corresponds to ISO/IEC 10646:2003.
Conformance • Substantially improved specification of conformance requirements • Incorporated UTR #17: Character Encoding Model, clearly separating encoding forms and encoding schemes • Tightened definitions of UTF-8, UTF-16, UTF-32 • Separate definition of Unicode String • Clarified conformance status of Unicode Standard Annexes • Formal definitions of properties & algorithms • Provisional properties: draft, NRFPT
UTF vs Unicode String • UTF • Unique representation for Code Point • All else illegal • C0 80 • D800 0061 • Unicode String • Sequence of code units • Internal Processing, not interchange • Not necessarily valid UTF • C0 A0 • D800 0061
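The distinction on this slide can be sketched in Python. This is only an illustration: the helper names `is_valid_utf8` and `is_valid_utf16` are hypothetical, and a Python `str` is a sequence of code points rather than 16-bit code units, so it stands in only loosely for a "Unicode String". The byte sequence C0 80 and the sequence D800 0061 are the slide's own examples.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_valid_utf16(text: str) -> bool:
    """Return True if `text` can be serialized as well-formed UTF-16."""
    try:
        text.encode("utf-16-be")
        return True
    except UnicodeEncodeError:
        return False


# C0 80 is a non-shortest-form encoding of U+0000, so it is not valid UTF-8.
overlong_nul = bytes([0xC0, 0x80])

# D800 0061: an unpaired high surrogate followed by "a".
# Python can hold this for internal processing, but it is not valid UTF-16.
lone_surrogate = "\ud800" + "\u0061"

print(is_valid_utf8(overlong_nul))     # False
print(len(lone_surrogate))             # 2
print(is_valid_utf16(lone_surrogate))  # False
```

The string with the lone surrogate is usable in memory, which matches the slide's "internal processing, not interchange" point; it is rejected only when it has to be turned into a UTF.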
Conformance (cont.) • Formalized policies for stability of the standard • Clarification of semantics of important characters, including BOM • Revised scope of enclosing combining marks • Revised semantics of ZWJ for cursive scripts • Normalization Corrections • U+2F868; U+2F874; U+2F91F; U+2F95F; U+2F9BF
Textual Clarifications • Major changes to Chapters 2, 3, 6, 14 and 15 • Definitive terminology for code points: • graphic, format, control, private-use • = assigned characters • surrogate, noncharacter, reserved • not characters • Substantial improvements to many character block descriptions, especially Indic
Programming language identifiers • Now backwards-compatible • Once a Unicode identifier, • Always a Unicode identifier • Alternate definition for complete stability • Fix set of allowed characters • Allow all reserved code points • + Complete stability • - “Odd” characters
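As a rough illustration of property-based identifiers, Python's own identifier rule can be queried with `str.isidentifier()`. This is Python's rule (built on the XID_Start and XID_Continue properties plus the underscore), not the Unicode 4.0 definition verbatim, and the sample names are made up for the example.

```python
# Hypothetical sample names; only the first, second, and last are identifiers.
candidates = ["naïve", "x1", "1x", "a-b", "Ωmega"]

# str.isidentifier() applies Python's identifier syntax to each string.
results = {name: name.isidentifier() for name in candidates}

for name, ok in results.items():
    print(f"{name!r}: {ok}")
```

A digit cannot start an identifier and a hyphen cannot appear in one, while non-ASCII letters such as "ï" and "Ω" are accepted.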
Case mappings now normative (but tailorable) • Clearer definition of string functions: • isUpper(), isLower(), isTitle(), isFold() • toUpper(), toLower(), toTitle(), toFold() • Definition of titlecase uses word boundaries • Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
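The string functions named on this slide have close counterparts in Python's `str` methods, which apply the default (untailored) full case mappings. The sketch below assumes that correspondence; `is_fold` is a hypothetical helper, and Python's `str.title()` uses its own simple word detection rather than the word boundaries the standard specifies.

```python
def is_fold(s: str) -> bool:
    """Hypothetical isFold(): True if s is unchanged by case folding."""
    return s == s.casefold()


word = "Straße"

print(word.upper())     # STRASSE  (toUpper: one character maps to two)
print(word.casefold())  # strasse  (toFold)
print(is_fold(word))    # False

# Default lowercase of U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE is
# "i" followed by U+0307 COMBINING DOT ABOVE. A Turkic tailoring would map it
# to a plain "i" instead; Python's str.lower() has no such tailoring.
dotted_capital_i = "\u0130"
print([hex(ord(c)) for c in dotted_capital_i.lower()])  # ['0x69', '0x307']
```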
UAX #9: The Bidirectional Algorithm • canonical equivalence now preserved • data change, not algorithm • shaping is done after reordering • but not across directional boundaries • clarifications of: • ZWJ, ZWNJ • intermediate level processing
UAX #14: Line Breaking Properties • Negative numbers and dates with hyphens will not break across lines • Word-Joiner will link any characters (except hard line breaks) • Behavior of soft hyphen clarified • marks opportunity for breaking, not specific graphic appearance. • Rules for GL relaxed • SP and ZW override GL • New Property Values: NL, WJ
UAX #15: Unicode Normalization Forms • Description of Stable Code Points. • Notation NFC(x) and isNFC(x), in Notation. • Added pointer to UTN #5 Canonical Equivalences in Applications • Rewrote Annex 12: Corrigenda for clarity, and to describe the use of Normalization Corrections. • Added Annex 13: Canonical Equivalence.
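The NFC(x) and isNFC(x) notation can be tried out with Python's `unicodedata` module. This is a minimal sketch: the wrapper names `nfc` and `is_nfc` are chosen here to mirror the notation, and the module uses whatever Unicode version ships with the interpreter, not 4.0.

```python
import unicodedata


def nfc(x: str) -> str:
    """NFC(x): the Normalization Form C of x."""
    return unicodedata.normalize("NFC", x)


def is_nfc(x: str) -> bool:
    """isNFC(x): True when x is already in NFC, i.e. NFC(x) == x."""
    return nfc(x) == x


decomposed = "e\u0301"  # e + COMBINING ACUTE ACCENT
composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE

print(nfc(decomposed) == composed)  # True
print(is_nfc(decomposed))           # False
print(is_nfc(composed))             # True
```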
UAX #29: Text Boundaries • New: extracted from 3.0, but significantly revised • Default definitions • Word, sentence: tailoring expected • Grapheme cluster (“user character”) • Hangul Syllable or other Base • plus (optionally) any number of NSMs
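The "base plus any number of NSMs" idea can be sketched as below. This is a deliberate simplification and not the UAX #29 algorithm: it only attaches characters whose General_Category is Mn or Me to the preceding character, and it does not handle conjoining jamo sequences or other cases. The function name `simple_clusters` is hypothetical.

```python
import unicodedata


def simple_clusters(text: str) -> list[str]:
    """Group each base character with the nonspacing marks that follow it."""
    clusters: list[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if clusters and is_mark:
            clusters[-1] += ch   # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster at a base character
    return clusters


# "e" + acute, "a" + diaeresis + dot below, then a bare "x"
sample = "e\u0301a\u0308\u0323x"
print(len(sample))                   # 6 code points
print(len(simple_clusters(sample)))  # 3 clusters
```

A precomposed Hangul syllable such as U+D55C is a single code point, so it comes out as one cluster here as well.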
No Substantive Changes • UAX #11: East Asian Width • UAX #24: Script Names • except now UAX!
Superseded UAXes • Incorporated into and thus superseded by Unicode Version 4.0: • UAX #13: Unicode Newline Guidelines • UAX #19: UTF-32 • UAX #21: Case Mappings • UAX #27: Unicode 3.1 • UAX #28: Unicode 3.2
Unicode Character Database • Documentation coalesced into UCD.html. • New properties and values • Hangul_Syllable_Type, Unicode_Radical_Stroke • CJK numeric values added. • PropertyValueAliases adds block names • UCD fallback props more precisely defined. • for code points not explicitly in data files • New Characters • Appropriate properties assigned
UCD 4.0 (cont.) • Modifier letters • The general category of 02B9..02BA, 02C6..02CF changed to general category Lm. • Khmer • Two Khmer characters are deprecated; four others strongly discouraged. • Decimal Digits • Numeric_Type=decimal digit now aligned with General_Category=Nd • Braille • Added script value
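The alignment of decimal digits with General_Category=Nd can be spot-checked with `unicodedata`. The sketch assumes the Unicode data bundled with a current Python interpreter (a later version than 4.0), and `describe` is a hypothetical helper.

```python
import unicodedata


def describe(ch: str):
    """Return (General_Category, decimal value or None, digit value or None)."""
    return (
        unicodedata.category(ch),
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
    )


print(describe("5"))       # ('Nd', 5, 5)     ASCII digit
print(describe("\u0663"))  # ('Nd', 3, 3)     ARABIC-INDIC DIGIT THREE
print(describe("\u00b2"))  # ('No', None, 2)  SUPERSCRIPT TWO
```

SUPERSCRIPT TWO has a digit value but no decimal value, and its category is No rather than Nd, which is the kind of distinction this alignment makes explicit.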
UCD 4.0 (cont. 2) • Case Mapping • Fixed for Turkish, Lithuanian • Default Ignorables • Hangul Filler characters • Soft-Hyphen, CGJ, ZWS • Arabic End of Ayah and Syriac Abbreviation Mark no longer DI, shaping classes fixed. • Grapheme_Extend • removes halfwidth katakana marks, most Mc (except as needed for canonical equivalence)
Related Items • UTS #10: Unicode Collation Algorithm • Not part of Unicode 4.0, but closely related • From 4.0 on, to be sync'ed in repertoire and version with the Unicode Standard. • UTS #6: SCSU • Added suitability for XML • Draft UTS #18: Unicode Regular Expressions • Draft as UTS with conformance requirements • Draft UTR #23: Character Properties • Draft Character Property Model
Unicode 3.2 (March, 2002) • New Characters: 1,016 • Symbols • Large collection of mathematical symbols, especially targeted at MathML, recycling symbols, ornamental brackets. • Special Characters • combining grapheme joiner, word joiner, invisible operators for math, variation selectors • Modern Scripts • minority scripts of the Philippines
Conformance • Eliminates irregular UTF-8 • Defines variation sequences • Replaces ZWNBSP with Word Joiner • Clarifies scope of combining marks (further revised in 4.0) • Clarifications of conjoining jamo behavior, hangul syllable structure, decomposables
Textual Clarifications • Combined vowels in Khmer, characters discouraged in Khmer • Use of dingbats
Unicode Standard Annexes • UAX #21: Case Mappings (was UTR)
Unicode Character Database • New properties: • IDS_Binary_Operator, IDS_Trinary_Operator, Radical, Unified_Ideograph • Default_Ignorable_Code_Point, Deprecated, Soft_Dotted, Logical_Order_Exception • Grapheme_Base, Grapheme_Extend, Grapheme_Link • DerivedAge • Normalization Corrections • Added Property & Property Value Aliases • Adds StandardizedVariants.html
Related Items • UTS #10: Unicode Collation Algorithm • Ignorable character handling, dual versioning, more conditions on well-formed weights, separate weights for CJK and unassigned characters, non-characters • Note: base version still U3.1 • UTR #26: CESU-8 • Unicode Technical Notes • Updated Character Encoding Stability Policy • Added Public Review process • Updated Glossary
Unicode 3.1 (March, 2001) • New Characters: 44,946 • First supplementaries encoded! • Modern scripts • CJK Ideographs (now totaling 71,039) • Historic scripts • Old Italic, Gothic, Deseret, Byzantine Musical Symbols • Symbols • Mathematical Alphanumeric Symbols, (Western) Musical Symbols
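To see what a supplementary character looks like in each encoding form, the sketch below encodes one Gothic letter, U+10330, with Python's built-in codecs and recomputes its UTF-16 surrogate pair by hand. The helper name `to_surrogates` is hypothetical; the arithmetic is the standard UTF-16 formula.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


ch = "\U00010330"  # GOTHIC LETTER AHSA, encoded in Unicode 3.1

print(ch.encode("utf-8").hex(" "))      # f0 90 8c b0  (four 8-bit code units)
print(ch.encode("utf-16-be").hex(" "))  # d8 00 df 30  (two 16-bit code units)
print(ch.encode("utf-32-be").hex(" "))  # 00 01 03 30  (one 32-bit code unit)
print([hex(u) for u in to_surrogates(ord(ch))])  # ['0xd800', '0xdf30']
```

One code point therefore takes four, two, or one code units depending on the encoding form, which is the code unit versus code point distinction that Unicode 3.1 clarified.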
Conformance • Non-shortest-form UTF-8 excluded • Clarification of the stability of the standard, • code units vs. code points, non-characters, normative properties, informative properties, normative references • Revisions of guidelines: • wchar_t, unassigned code points, identifiers • Major revision of Georgian • Use of ZWNJ and ZWJ for ligatures • Language tag characters encoded • but discouraged
Unicode Standard Annexes • UAX #19: UTF-32
Unicode Character Database • Major revision of PropList properties: • White_Space, Bidi_Control, Join_Control, Hex_Digit • Alphabetic, Ideographic, Lowercase, Uppercase • ID_Start, ID_Continue, XID_Start, XID_Continue • Noncharacter_Code_Point • Quotation_Mark, Terminal_Punctuation, Math, Dash, Hyphen, Diacritic, Extender • New properties: Case folding, Scripts • Added DerivedProperties, NormalizationTest
Related Items • Documented Character Encoding Stability Policy • UTS #10: Unicode Collation Algorithm • Merged data files; updated to base version 3.1 • UTR #18: Unicode Regular Expression Guidelines • UTR #20: Unicode in XML and other Markup Languages • UTR #22: Character Mapping Tables • UTR #24: Script Names