ISO/IEC 10646 & The Unicode® Standard • Mike Ksar, Senior Program Manager, International Standards Strategy, Microsoft Corporation; JTC1/SC2/WG2 Convener • Screenplay by Asmus Freytag
Outline • Background • Relation between Unicode and ISO/10646 • What is the same • What is different • What is being merged • Synchronization • Shared Process and Policies • Aligned Program of Work • Common publication resources • Beyond character coding • Character properties & Collation • Internationalization • Products and Standards • Summary
The Internet • The internet pushes the envelope on internationalization • Users have easy access to documents worldwide, in any character set • Servers can be accessed by users from anywhere, speaking any language • Software can no longer be targeted to a single national market • The need for a single character set standard was never greater. Why do we have two?
Common Charter Develop a standard of graphic character repertoire and coding for an international graphic character set ... of the written form of the languages of the world.
[Organization chart] • ISO/IEC JTC 1: Information Technology • SC 2: Codes and Character Sets • WG 2: ISO/IEC 10646 • IRG: Ideographic Rapporteur Group • SC 22: Programming Languages • WG 20: Internationalization • National Bodies: ANSI (US) … and other National Bodies • INCITS: Information Technology • L2: Codes, Character Sets, and Internationalization • The Unicode Consortium • UTC: Unicode Technical Committee • Bidi and other subcommittees • Liaison Organizations • Relationship labels in the chart: NB, Member
ISO Framework • Basis for other standards: ISO, JTC1, ECMA, IETF, CEN/TC304 & W3C • Well established and recognized ISO development process of standardization • Worldwide expertise through national standards bodies, industry and liaison organizations • Identified as one of the standards for procurement requirements by major organizations and agencies
Unicode Framework • Consortium with open membership • Industry backing • Direct support from key implementers • Open to academic and user input • Cooperation with ISO, JTC1, ECMA, IETF, CEN/TC304 & W3C • Unicode Technical Committee (UTC)
Development Path [diagram] • ASCII, ISO 646 • ISO 8859-x: Part-1, Part-2, Part-... • Industry 8-bit codes: IBM, Windows, other • National/industry multibyte codes • 1 universal code: 10646 & Unicode
Sources of Characters • International standards • JTC1/SC2 coded character sets • JTC1/SC18 text formatting and presentation • ISO TC46 bibliographic community • National standards and committees • China (GB 2312), Japan (JIS X 0208), Korea (KS C 5601) and many others • Widely supported vendor character sets • Regional standards committees • ASMO, ECMA ATG & Bidi & SC2/WG2/IRG • Liaison organizations: • Unicode, Inc., ECMA, ITU-TS, AFII, TCA, W3C, CEN/TC304 and others • User communities • STIX
ISO/IEC 10646 Milestones • 1984: ISO starts developing • 1991: Convergence with Unicode • 1993: ISO/IEC 10646-1, First edition • Architecture & Basic Multilingual Plane • Equivalent to Unicode 1.1 • 1998: ISO/IEC TR 15285, An operational model for characters and glyphs • 1995-1999: Technical amendments • UTF-8, UTF-16, Korean, Tibetan, Braille, etc. • Unicode 2.0 is equivalent through amendment 7 • 2000: ISO/IEC 10646-1, Second edition • 3 technical corrigenda • 31 amendments since 10646-1:1993, first edition • Equivalent to Unicode 3.0 • 2001: ISO/IEC 10646-2 for Planes 1, 2 & 14 • Unicode 3.1 includes repertoires of both 10646-1 and 10646-2 plus two additional characters • 2002: Amd-1 to part 1 • Equivalent to Unicode 3.2
Unicode: 14 Years (1988-2002) • 1988: First use of name Unicode • 1991: Unicode Consortium founded • 1991: Unicode, Version 1 • 1991: First Implementers' Workshop • 1991: Convergence with ISO/IEC 10646 • Liaison to ISO/IEC 10646 Working Group • 1992: First Unicode Technical Reports • 1993: Unicode, Version 1.1 • 1996: Version 2.0 published • 2000: Version 3.0 published • Dramatic increase in number and scope of Unicode-based implementations • 2001: Version 3.1 published • 2002: Version 3.2 • 2002: 20th International Unicode Conference
Outline • Background • Relation between Unicode and ISO/10646 • What is the same • What is different • What is being merged
Code Space & Structure • ISO/IEC 10646 Parts 1 and 2 • Only use code space in planes 0 to 16 • Define characters only in planes 0 (BMP), 1, 2 & 14 so far • Reserve planes 15 and 16 for private use • [Diagram: stack of planes, from Plane 00 (BMP), Plane 01, Plane 02, . . . Plane 14, up to Plane 15 (Private Use) and Plane 16 (Private Use)]
A Plane in 10646 • A plane is the basic division of code space in ISO/IEC 10646 • The first plane (Plane 0) is the Basic Multilingual Plane (BMP) • Unicode 3.1 matches planes 0-16 • [Diagram: a plane (16 bits) divided into rows and cells, 65,536 characters]
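The plane, row and cell structure can be sketched as simple bit arithmetic on a code point. This is a minimal illustration, not text from the standard: the function name `split_code_point` is hypothetical, and it assumes the usual 10646 layout in which the 16 bits inside a plane split into a row octet and a cell octet (256 rows of 256 cells).

```python
def split_code_point(cp):
    """Return (plane, row, cell) for a code point in planes 0 to 16."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside planes 0 to 16")
    plane = cp >> 16          # each plane holds 65,536 code positions
    row = (cp >> 8) & 0xFF    # assumed layout: 256 rows per plane
    cell = cp & 0xFF          # assumed layout: 256 cells per row
    return plane, row, cell


# U+0041 LATIN CAPITAL LETTER A lives in the BMP (plane 0).
print(split_code_point(0x0041))
# U+1D11E MUSICAL SYMBOL G CLEF lives in plane 1.
print(split_code_point(0x1D11E))
```

Reading the result for U+1D11E as plane 1, row D1, cell 1E matches how the hexadecimal digits of the code point are normally grouped.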
Basic Multilingual Plane [layout diagram] • C0 Controls • C1 Controls • Alphabets, Symbols, CJK Auxiliary, Hangul, . . . • Unified Chinese, Japanese, Korean Ideographs • Reserved for accessing code points outside BMP (2048) • Private Use (6K), Compatibility Area, Arabic Presentation Forms, . . . (8190)
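Two of the BMP zones named on this slide have sizes that are easy to check in code. The sketch below is illustrative only: the ranges D800 to DFFF (reserved for reaching code points outside the BMP) and E000 to F8FF (private use) come from the published standards rather than from the slide, and `bmp_zone` is a hypothetical helper name.

```python
# Assumed ranges, taken from the published standard rather than the slide.
SURROGATE_ZONE = range(0xD800, 0xE000)    # reserved for access beyond the BMP
PRIVATE_USE_ZONE = range(0xE000, 0xF900)  # BMP private use area


def bmp_zone(cp):
    """Classify a BMP code point into one of the two zones above, or 'other'."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("not a BMP code point")
    if cp in SURROGATE_ZONE:
        return "surrogate"
    if cp in PRIVATE_USE_ZONE:
        return "private use"
    return "other"


# Zone sizes: 2,048 reserved positions and 6,400 ("6K") private use positions.
print(len(SURROGATE_ZONE), len(PRIVATE_USE_ZONE))
print(bmp_zone(0xD834), bmp_zone(0xE000), bmp_zone(0x4E2D))
```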
Adopted Form • ISO/IEC 10646 is a 16-bit or 32-bit code • UCS-2: for accessing code points in the BMP, 2 bytes (16 bits) • UCS-4: canonical form for accessing any code point using 4 bytes (32 bits) • Transformation formats • UTF-8: for use in 8-bit environments (e.g. HTML, XML); variable-length code, 1 to 6 bytes per character over the full UCS-4 range, at most 4 bytes for planes 0 to 16 • UTF-16: for use with UCS-2 to access sixteen additional planes beyond the BMP • Note: Unicode 3.2 supports UTF-8, UTF-16 and UTF-32 • UTF-32 is equivalent to UCS-4, with an upper limit of 10FFFF (hexadecimal)
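To make the encoding forms concrete, the sketch below computes the UTF-16 pair for a character outside the BMP by hand and compares all three forms using Python's built-in codecs. The helper name `to_surrogate_pair` is hypothetical; the arithmetic is the standard UTF-16 mapping, shown here as an illustration rather than as normative text.

```python
def to_surrogate_pair(cp):
    """Map a code point in planes 1 to 16 to its two UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only planes 1 to 16 need a surrogate pair")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


g_clef = chr(0x1D11E)  # MUSICAL SYMBOL G CLEF, plane 1

print([hex(unit) for unit in to_surrogate_pair(0x1D11E)])
print("UTF-8 :", g_clef.encode("utf-8").hex())
print("UTF-16:", g_clef.encode("utf-16-be").hex())
print("UTF-32:", g_clef.encode("utf-32-be").hex())
```

The same character takes four bytes in each form here, but a BMP character such as "A" would take one byte in UTF-8, two in UTF-16 and four in UTF-32.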
Implementation Levels • Implementation level for combining sequences • Level 1: only precomposed characters • Level 2: restricted combining sequences • Level 3: unrestricted combining sequences • Unicode has no formal restrictions on combining sequences • An implementation may choose to support a subset of characters that omits some or all combining characters
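As a rough illustration of the difference between Level 1 and Level 3 text, the sketch below checks whether a string contains any combining marks. It is an approximation under a stated assumption: it treats every character whose Unicode general category starts with "M" as a combining character, whereas ISO/IEC 10646 defines its own list. The function name `uses_combining_marks` is hypothetical.

```python
import unicodedata


def uses_combining_marks(text):
    """Return True if any character is a mark (general category Mn, Mc or Me)."""
    return any(unicodedata.category(ch).startswith("M") for ch in text)


precomposed = "caf\u00e9"    # e with acute as a single precomposed character
combining = "cafe\u0301"     # e followed by COMBINING ACUTE ACCENT

print(uses_combining_marks(precomposed))
print(uses_combining_marks(combining))
```

Under this approximation the first string would fit a Level 1 implementation, while the second needs support for combining sequences.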
Collections for Subsets • Collections of coded graphic characters • The collections listed below are ordered by collection number. An * in the “positions” column indicates that the collection is a fixed collection.
No.  Collection name             Positions
1    BASIC LATIN                 0020 - 007E *
2    LATIN-1 SUPPLEMENT          00A0 - 00FF *
3    LATIN EXTENDED-A            0100 - 017F *
4    LATIN EXTENDED-B            0180 - 024F
5    IPA EXTENSIONS              0250 - 02AF
6    SPACING MODIFIER LETTERS    02B0 - 02FF
Etc.
• The Unicode declared subset is the whole of the BMP plus planes 1-16 accessible through UTF-16
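A subset declared through collections can be checked mechanically. The sketch below encodes only the six collections listed on the slide and reports which of them a piece of text draws on; the names `COLLECTIONS`, `collection_of` and `collections_used` are hypothetical, and a real implementation would need the full collection list from the standard.

```python
# Excerpt from the slide: collection number -> (name, first, last position).
COLLECTIONS = {
    1: ("BASIC LATIN", 0x0020, 0x007E),
    2: ("LATIN-1 SUPPLEMENT", 0x00A0, 0x00FF),
    3: ("LATIN EXTENDED-A", 0x0100, 0x017F),
    4: ("LATIN EXTENDED-B", 0x0180, 0x024F),
    5: ("IPA EXTENSIONS", 0x0250, 0x02AF),
    6: ("SPACING MODIFIER LETTERS", 0x02B0, 0x02FF),
}


def collection_of(ch):
    """Return the collection number containing ch, or None if not in the excerpt."""
    cp = ord(ch)
    for number, (_name, first, last) in COLLECTIONS.items():
        if first <= cp <= last:
            return number
    return None


def collections_used(text):
    """Return the set of excerpt collections that text draws on."""
    found = {collection_of(ch) for ch in text}
    found.discard(None)  # characters outside the excerpt are ignored here
    return found


print(sorted(collections_used("Zo\u00eb")))
```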
Unicode Implements • BMP plus next 16 planes • Three encoding forms • UTF-8 • UTF-16 • UTF-32 (0 to 10FFFF) • Implementation level 3 • No subsets • Unicode encourages transparency so that implementations can at least retransmit every character undamaged, but the level of support is otherwise explicitly left to the implementation
Unicode - 10646 Relationship • ISO/IEC 10646 is a character encoding standard • Unicode is code-for-code compatible with ISO/IEC 10646 • Unicode defines additional specifications about behavior and use of characters such as bidi algorithm, ordering, mappings, equivalence algorithm and other semantics • Conformant implementations of Unicode are conformant implementations of ISO/IEC 10646
Unicode: Beyond 10646 In addition to character codes Unicode specifies: • Behavior and use of characters • A complete bidi algorithm • An equivalence algorithm • Normalization • Additional character properties and semantics for spacing, zero-width space, combining characters, numeric, case and casing, directionality, letters, math operators etc
Unicode: Beyond 10646 (Cont.) • Which combining marks are non-spacing marks • Order and use of double-diacritic non-spacing marks • A mapping for compatibility characters • Default shaping behavior of cursive scripts • Default mapping tables for conversion to and from other character set standards • Rendering for Indic characters • Line breaking
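Several of these additions are visible through Python's standard `unicodedata` module, which exposes data from the Unicode Character Database. The sketch below is illustrative only: it shows canonical equivalence through normalization, a compatibility mapping, and the directionality property that the bidi algorithm consumes. The sample characters are arbitrary choices.

```python
import unicodedata

composed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # e + COMBINING ACUTE ACCENT

# The two spellings differ code for code but are canonically equivalent.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

# Compatibility mapping: the "fi" ligature maps to the two letters f and i.
print(unicodedata.normalize("NFKC", "\ufb01"))

# Directionality property used by the bidi algorithm.
print(unicodedata.bidirectional("A"), unicodedata.bidirectional("\u05d0"))
```

None of this behavior follows from the code assignments alone, which is the point the two slides make about what Unicode adds.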
Outline • Background • Relation between Unicode and ISO/10646 • What is the same • What is different • What is being merged • Synchronization • Shared Process and Policies • Aligned Program of Work • Common publication resources
Continued Cooperation • Architecture changes: • UTF-32 (Proposed Amendment) • Restricts UCS-4 to planes 0 to 16 • Future editorial and technical corrigenda to second edition of ISO/IEC 10646-1:2000 (will be part of Unicode 3.2) • Repertoire extensions (included in Unicode 3.2) • ISO/IEC 10646-2 (planes 1, 2 & 14) • Plane 1, mathematics, hieroglyphs, music symbols, etc. • Plane 2, CJKV ideographic extensions • Plane 14, language tags • Support current and future implementers • Increase awareness and provide technical help • Continued synchronization of future editions of ISO/IEC 10646 and the Unicode Standard
Going in the Same Direction • One standard • No dialects • Common usage • Common Encoding Forms • UTF-8 • UTF-16 • UTF-32/UCS-4 • Cooperation with ISO • Examples: UTF-8, UTF-16, UTF-32, EURO, collation, tags • Incorporation into other standards • IETF • WWW Consortium (W3C) • Shared expertise for lesser-used and obscure scripts
WG2 Program of Work • 1st Amendment to 10646-1:2000: March 2002 • 2nd Amendment to 10646-1:2000: December 2002 • 1st Amendment to 10646-2:2001: 2003 • WG2 future meetings: • Meeting 42: Dublin, Ireland, May 2002 • Meeting 43: Tokyo, Japan, December 2002
Outline • Background • Relation between Unicode and ISO/10646 • What is the same • What is different • What is being merged • Synchronization • Shared Process and Policies • Aligned Program of Work • Common publication resources • Beyond character coding • Character properties & Collation • Internationalization • Products and Standards
Collation & Character Properties • ISO/IEC 14651 Collation Standard • Produced by SC22/WG20 Internationalization • Matches Unicode Collation Algorithm • Unicode Technical Standard (UTS) #10 • Unicode Character Database • Collection of character classification and properties • Geared towards the needs of implementers • Supports Internationalization • http://www.unicode.org/Public/UNIDATA
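For a sense of what the Unicode Character Database holds, the sketch below looks up a few properties through Python's `unicodedata` module, which ships a copy of that data. The helper `describe` is a hypothetical name, and the set of properties shown is a small sample chosen for illustration.

```python
import unicodedata


def describe(ch):
    """Collect a few Unicode Character Database properties for one character."""
    return {
        "code": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "bidi": unicodedata.bidirectional(ch),
        "numeric": unicodedata.numeric(ch, None),
    }


for ch in ("A", "\u0663", "\u05d0"):
    print(describe(ch))
```

Collation is a separate matter: Python's default string comparison orders by code point, so language-aware sorting in the sense of ISO/IEC 14651 or UTS #10 needs a dedicated library.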
Products Are Here! • Phase 1: Deliver a full set of products: browsers, development tools, fonts, word processors, etc. • Phase 2: Increased functionality: more scripts, combining characters, etc. • [Chart: types of products (up to the full set) against increased function of products, 1993 to 2000 and beyond]
Products Are HERE! • Microsoft: Windows XP, Office XP, Internet Explorer 6.0, ECMAScript, C#/CLI • Compaq: Tru64 Unix • HP: HP-UX & Printers • Netscape: Communicator 6.0, JavaScript, ECMAScript • Sun: Solaris & Java • Apple: Cyberdog, Mac OS X • Lotus: Lotus Suite • Asian solutions: JustSystems (Ichitaro) and Star+Globe (MASS) • Databases: Software AG, Sybase, Oracle, DB2, NCR Teradata, Progress Software • SAP platform • Fonts: Adobe, Agfa/Monotype, Apple Advanced Typography, Bitstream, OpenType • Tools and libraries: several vendors
Version 3.2 Is Here! • Version 3.2 is in sync with both parts of ISO/IEC 10646 and 1st amendment to 10646-1 • Total repertoire of 95156 characters • Completed math repertoire for MathML and other uses • Further restriction on ill-formed UTF-8 • http://www.unicode.org/unicode/reports/tr28
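The restriction on ill-formed UTF-8 can be observed with any strict decoder. The sketch below uses Python's built-in UTF-8 codec, which follows current Unicode definitions rather than the 3.2 text specifically; the helper name `is_well_formed_utf8` and the sample byte sequences are illustrative choices.

```python
def is_well_formed_utf8(data):
    """Return True if the bytes decode under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "two-byte e acute": b"\xc3\xa9",
    "overlong form of '/'": b"\xc0\xaf",
    "encoded surrogate D800": b"\xed\xa0\x80",
    "beyond 10FFFF": b"\xf4\x90\x80\x80",
}

for label, data in samples.items():
    print(label, is_well_formed_utf8(data))
```

Rejecting such sequences matters for security as well as interchange, since an overlong form can otherwise smuggle a character past a filter that only checks the shortest encoding.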
Outline • Background • Relation between Unicode and ISO/10646 • What is the same • What is different • What is being merged • Synchronization • Shared Process and Policies • Aligned Program of Work • Common publication resources • Beyond character coding • Character properties & Collation • Internationalization • Products and Standards • Summary
Common Repertoire • The character repertoires of Unicode and ISO/IEC 10646 are identical • Three matching encoding forms • There are minor differences in • Terminology • Publication format • Any conformant Unicode implementation conforms to ISO/IEC 10646
Unicode Extends... • Character semantics • “Discover and catalogue” • Canonical and compatibility equivalence • Relate characters to their established use • Technical reports with implementation guidelines • Normalization • Script behavior such as bi-directional algorithm • Active promotion of the standard
What Do 10646 and Unicode Do for You? • Global interoperability - write once, run everywhere: one source code, one binary, with user-installable/callable locales • Simplified software - one application with one code set versus multiple applications and managing different code sets • Data stability - a single common and widely adopted format • Reduced costs - development, maintenance, training
Great Expectations • Enhance global interoperability • Enhance data interchange • Permit easier development of localizable products • Reduce development cost of localized application software • Replace retrofitting with concurrent development
Recommendations • Buy the international standard (including all published amendments) as well as the Unicode Standard • Watch for updates on the web, including Unicode technical reports and ISO amendments • Join the Unicode Consortium, W3C, your national body standards committee or another organization to influence standards development processes • Define your needs and communicate them to your vendors • Build products that support ISO/IEC 10646 and The Unicode Standard