
ISO/IEC 10646 & The Unicode ® Standard

Mike Ksar, Senior Program Manager, International Standards Strategy, Microsoft Corporation; JTC1/SC2/WG2 Convener. Screenplay by Asmus Freytag.


Presentation Transcript


  1. ISO/IEC 10646 & The Unicode® Standard • Mike Ksar, Senior Program Manager, International Standards Strategy, Microsoft Corporation; JTC1/SC2/WG2 Convener • Screenplay by Asmus Freytag

  2. Outline • Background • Relation between Unicode and ISO/10646 • What is the same • What is different • What is being merged • Synchronization • Shared Process and Policies • Aligned Program of Work • Common publication resources • Beyond character coding • Character properties & Collation • Internationalization • Products and Standards • Summary

  3. The Internet • The internet pushes the envelope on internationalization • Users have easy access to documents worldwide, in any character set • Servers can be accessed by users from anywhere, speaking any language • Software can no longer be targeted to a single national market • The need for a single character set standard was never greater. Why do we have two?

  4. Common Charter Develop a standard of graphic character repertoire and coding for an international graphic character set ... of the written form of the languages of the world.

  5. [Organization chart] • ISO/IEC JTC 1: Information Technology • SC 2: Codes and Character Sets • WG 2: ISO/IEC 10646 • IRG: Ideographic Rapporteur Group • SC 22: Programming Languages • WG 20: Internationalization • ANSI (US) … and other National Bodies • INCITS: Information Technology NB • L2: Codes, Character Sets, and Internationalization • The Unicode Consortium (Member) • UTC: Unicode Technical Committee • Bidi and other subcommittees • Liaison Organizations

  6. ISO Framework • Basis for other standards: ISO, JTC1, ECMA, IETF, CEN/TC304 & W3C • Well established and recognized ISO development process of standardization • Worldwide expertise through national standards bodies, industry and liaison organizations • Identified as one of the standards for procurement requirements by major organizations and agencies

  7. Unicode Framework • Consortium with open membership • Industry backing • Direct support from key implementers • Open to academic and user input • Cooperation with ISO, JTC1, ECMA, IETF, CEN/TC304 & W3C • Unicode Technical Committee (UTC)

  8. Development Path [Diagram: earlier codes converging on one universal code] • ASCII, ISO 646 • ISO 8859-x (Part-1, Part-2, Part-...) • Industry 8-bit codes (IBM, Windows, other) • National/industry multibyte codes • 1 universal code: 10646 & Unicode

  9. Sources of Characters • International standards • JTC1/SC2 coded character sets • JTC1/SC18 text formatting and presentation • ISO TC46 bibliographic community • National standards and committees • China (GB2312), Japan (JIS 208), Korea (KSC 5601) and many others • Widely supported vendor character sets • Regional standards committees • ASMO, ECMA ATG & Bidi & SC2/WG2/IRG • Liaison organizations • Unicode, Inc., ECMA, ITU-TS, AFII, TCA, W3C, CEN/TC304 and others • User communities • STIX

  10. ISO/IEC 10646 Milestones • 1984: ISO starts developing • 1991: Convergence with Unicode • 1993: ISO/IEC 10646-1, First edition • Architecture & Basic Multilingual Plane • Equivalent to Unicode 1.1 • 1998: ISO/IEC TR 15285, An operational model for characters and glyphs • 1995 – 1999: Technical amendments • UTF-8, UTF-16, Korean, Tibetan, Braille, etc. • Unicode 2.0 is equivalent through amendment 7 • 2000: ISO/IEC 10646-1, Second edition • 3 technical corrigenda • 31 amendments since 10646-1:1993 first edition • Equivalent to Unicode 3.0 • 2001: ISO/IEC 10646-2 for Planes 1, 2 & 14 • Unicode 3.1 includes repertoires of both 10646-1 and 10646-2 plus two additional characters • 2002: Amd-2 to part 1 • Equivalent to Unicode 3.2

  11. Unicode 14 Years (1988-2002) • 1988: First use of name Unicode • 1991: Unicode Consortium founded • 1991: Unicode, Version 1 • 1991: First Implementers' Workshop • 1991: Convergence with ISO/IEC 10646 • Liaison to ISO/IEC 10646 Working Group • 1992: First Unicode Technical Reports • 1993: Unicode, Version 1.1 • 1996: Version 2.0 published • 2000: Version 3.0 published • Dramatic increase in number and scope of Unicode-based implementations • 2001: Version 3.1 published • 2002: Version 3.2 • 2002: 20th International Unicode Conference

  12. Outline • Background • Relation between Unicode and ISO/10646 • What is the same • What is different • What is being merged

  13. Code Space & Structure [Diagram: planes stacked from Plane 00 (BMP), Plane 01, Plane 02, ... up to Plane 14, Plane 15 (Private Use) and Plane 16 (Private Use)] ISO/IEC 10646 Parts 1 and 2 • Only use code space in planes 0 to 16 • Define characters only in planes 0 (BMP), 1, 2 & 14 so far • Reserve planes 15, 16 for private use
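The plane layout on this slide can be illustrated with a little arithmetic. The following is a minimal sketch, assuming each plane holds 65,536 code positions so that the plane number is simply the code point shifted right by 16 bits; the function names are hypothetical and not taken from either standard.

```python
MAX_CODE_POINT = 0x10FFFF  # last code position of plane 16

def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) containing code_point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"outside planes 0-16: {code_point:#x}")
    return code_point >> 16  # each plane spans 0x10000 positions

def is_in_private_use_plane(code_point: int) -> bool:
    """True for planes 15 and 16, which the slide lists as private use."""
    return plane_of(code_point) in (15, 16)

for cp in (0x0041, 0x1D11E, 0x20000, 0xE0001, 0xF0000, 0x10FFFF):
    print(f"U+{cp:04X} -> plane {plane_of(cp)}")
```

Code positions above 10FFFF are rejected here because the slide restricts the usable code space to planes 0 to 16.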

  14. A Plane in 10646 • A plane is the basic division of code space in ISO/IEC 10646 • The first plane (Plane 0) is the Basic Multilingual Plane (BMP) • Unicode 3.1 matches planes 0-16 [Diagram: a plane (16 bits) laid out as rows and cells, 65,536 characters]
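The row and cell labels in the diagram suggest how a 16-bit position inside a plane is subdivided. As a sketch, assuming a plane is 256 rows of 256 cells (which gives the 65,536 positions quoted on the slide), a code point can be split as follows; `split_code_point` is a hypothetical helper name.

```python
def split_code_point(code_point: int) -> tuple:
    """Split a code point into (plane, row, cell) octets."""
    plane = code_point >> 16          # which 65,536-position plane
    row = (code_point >> 8) & 0xFF    # which of the 256 rows in that plane
    cell = code_point & 0xFF          # which of the 256 cells in that row
    return plane, row, cell

# EURO SIGN, U+20AC, sits in plane 0, row 0x20, cell 0xAC.
print(split_code_point(0x20AC))
```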

  15. Basic Multilingual Plane [Diagram: BMP layout] • C0 Controls, C1 Controls • Alphabets, Symbols, CJK Auxiliary, Hangul, . . . • Unified Chinese, Japanese, Korean Ideographs • Reserved for accessing code points outside BMP (2048) • Private Use (6K) • Compatibility Area, Arabic Presentation Forms, . . . (8190)

  16. Adopted Form • ISO/IEC 10646 is a 16-bit or 32-bit code • UCS-2: for accessing code points in the BMP, 2 bytes (16 bits) • UCS-4: canonical form for accessing any code point using 4 bytes (32 bits) • Transformation formats • UTF-8: for use in 8-bit environments (e.g. HTML, XML) (variable-length code, 1 to 6 bytes/character) • UTF-16: for use with UCS-2 to access sixteen additional planes beyond the BMP • Note: Unicode 3.2 supports UTF-8, UTF-16 and UTF-32 • UTF-32 is equivalent to UCS-4, with an upper limit of 10FFFF (hexadecimal)
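To make the relationship between the encoding forms concrete, here is a minimal sketch of how a code point beyond the BMP is expressed as two 16-bit units in UTF-16, together with a size comparison across the three forms. The helper names are hypothetical. Note one assumption: Python's built-in `utf-8` codec follows the Unicode definition, which never needs more than 4 bytes for code points up to 10FFFF; the 5- and 6-byte forms mentioned on the slide apply only to the larger UCS-4 range.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code unit(s) for one code point."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are reserved for UTF-16 itself")
    if code_point < 0x10000:
        return [code_point]                    # BMP: one 16-bit unit, as in UCS-2
    if code_point > 0x10FFFF:
        raise ValueError("beyond plane 16")
    offset = code_point - 0x10000              # 20 bits spread over two units
    high = 0xD800 + (offset >> 10)             # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)            # bottom 10 bits
    return [high, low]

def encoded_sizes(code_point: int) -> dict:
    """Byte length of one code point in each encoding form."""
    ch = chr(code_point)
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-be")),
        "UTF-32": len(ch.encode("utf-32-be")),
    }

for cp in (0x0041, 0x20AC, 0x1D11E):
    units = " ".join(f"{u:04X}" for u in utf16_code_units(cp))
    print(f"U+{cp:04X}: UTF-16 units {units}, sizes {encoded_sizes(cp)}")
```

The 2048 positions D800 to DFFF are exactly the ones the BMP reserves for reaching code points outside the BMP, which is why they are rejected as input.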

  17. Implementation Levels • Implementation level for combining sequences • Level 1: only precomposed characters • Level 2: restricted combining sequences • Level 3: unrestricted combining sequences • Unicode has no formal restrictions on combining sequences • An implementation may choose to support a subset of characters which does not contain any or all combining characters
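The difference between a precomposed character and a combining sequence can be shown with Python's standard `unicodedata` module. This is only a rough sketch: it treats any character with a non-zero canonical combining class as a combining character, which is an assumption and only approximates the list of combining characters that ISO/IEC 10646 itself defines for its implementation levels.

```python
import unicodedata

def has_combining_characters(text: str) -> bool:
    """Approximate check: does text contain any combining character?"""
    return any(unicodedata.combining(ch) != 0 for ch in text)

precomposed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
sequence = "e\u0301"        # e followed by COMBINING ACUTE ACCENT

print(has_combining_characters(precomposed))  # False
print(has_combining_characters(sequence))     # True
```

Under this approximation, text for which the check returns False would fit a level 1 style restriction, while the second string needs support for combining sequences.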

  18. Collections for Subsets • Collections of coded graphic characters. The collections listed below are ordered by collection number. An * in the “positions” column indicates that the collection is a fixed collection.
      Collection number and name        Positions
      1  BASIC LATIN                    0020 - 007E *
      2  LATIN-1 SUPPLEMENT             00A0 - 00FF *
      3  LATIN EXTENDED-A               0100 - 017F *
      4  LATIN EXTENDED-B               0180 - 024F
      5  IPA EXTENSIONS                 0250 - 02AF
      6  SPACING MODIFIER LETTERS       02B0 - 02FF
      Etc.
      • The Unicode declared subset is the whole of the BMP plus planes 1-16 accessible through UTF-16
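A collection is essentially a named range of code positions, so subset membership reduces to a range lookup. The sketch below uses only the six collections quoted on the slide; the table layout and the `collection_of` helper are hypothetical illustrations, not an API defined by the standard.

```python
# (number, name, first, last, fixed) for the six collections shown on the slide
COLLECTIONS = [
    (1, "BASIC LATIN", 0x0020, 0x007E, True),
    (2, "LATIN-1 SUPPLEMENT", 0x00A0, 0x00FF, True),
    (3, "LATIN EXTENDED-A", 0x0100, 0x017F, True),
    (4, "LATIN EXTENDED-B", 0x0180, 0x024F, False),
    (5, "IPA EXTENSIONS", 0x0250, 0x02AF, False),
    (6, "SPACING MODIFIER LETTERS", 0x02B0, 0x02FF, False),
]

def collection_of(code_point: int):
    """Return (number, name) of the listed collection holding code_point, or None."""
    for number, name, first, last, _fixed in COLLECTIONS:
        if first <= code_point <= last:
            return number, name
    return None

print(collection_of(ord("A")))   # (1, 'BASIC LATIN')
print(collection_of(0x0259))     # (5, 'IPA EXTENSIONS')
print(collection_of(0x007F))     # None: not in any listed collection
```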

  19. Unicode Implements • BMP plus next 16 planes • Three encoding forms • UTF-8 • UTF-16 • UTF-32 (0 to 10FFFF) • Implementation level 3 • No subsets • Unicode encourages transparency so that implementations can at least retransmit every character undamaged, but the level of support is otherwise explicitly left to the implementation

  20. Unicode - 10646 Relationship • ISO/IEC 10646 is a character encoding standard • Unicode is code-for-code compatible with ISO/IEC 10646 • Unicode defines additional specifications about behavior and use of characters such as bidi algorithm, ordering, mappings, equivalence algorithm and other semantics • Conformant implementations of Unicode are conformant implementations of ISO/IEC 10646

  21. Unicode: Beyond 10646 In addition to character codes Unicode specifies: • Behavior and use of characters • A complete bidi algorithm • An equivalence algorithm • Normalization • Additional character properties and semantics for spacing, zero-width space, combining characters, numeric, case and casing, directionality, letters, math operators etc
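The equivalence and normalization items above can be demonstrated with Python's standard `unicodedata` module, which implements the Unicode normalization forms. The sketch below shows that two different code point sequences can be canonically equivalent; `canonically_equivalent` is a hypothetical helper name, and the data reflects whatever Unicode version the running Python ships with rather than Unicode 3.2.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

composed = "\u00e9"       # one code point: e with acute
decomposed = "e\u0301"    # two code points: e + combining acute

print(composed == decomposed)                                  # False: different code points
print(canonically_equivalent(composed, decomposed))            # True
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
```

A plain code-for-code comparison, which is all a character coding standard on its own provides, reports these strings as different; the equivalence algorithm is what lets an implementation treat them as the same text.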

  22. Unicode: Beyond 10646 (Cont.) • Which combining marks are non-spacing marks • Order and use of double-diacritic non-spacing marks • A mapping for compatibility characters • Default shaping behavior of cursive scripts • Default mapping tables for conversion to and from other character set standards • Rendering for Indic characters • Line breaking
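Two of these items, the classification of non-spacing marks and the mapping for compatibility characters, are exposed as character data that can be queried directly. The following sketch uses Python's standard `unicodedata` module; the helper names are hypothetical, and the values come from the Unicode version bundled with the running Python.

```python
import unicodedata

def is_non_spacing_mark(ch: str) -> bool:
    """General category Mn marks a combining mark as non-spacing."""
    return unicodedata.category(ch) == "Mn"

def compatibility_mapping(ch: str) -> str:
    """Replace a compatibility character by its NFKC equivalent."""
    return unicodedata.normalize("NFKC", ch)

print(is_non_spacing_mark("\u0301"))        # COMBINING ACUTE ACCENT: True
print(is_non_spacing_mark("\u0903"))        # DEVANAGARI SIGN VISARGA (spacing mark): False
print(unicodedata.decomposition("\ufb01"))  # '<compat> 0066 0069'
print(compatibility_mapping("\ufb01"))      # 'fi'
```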

  23. Outline • Background • Relation between Unicode and ISO/10646 • What is the same • What is different • What is being merged • Synchronization • Shared Process and Policies • Aligned Program of Work • Common publication resources

  24. Continued Cooperation • Architecture changes: • UTF-32 (Proposed Amendment) • Restricts UCS-4 to planes 0 to 16 • Future editorial and technical corrigenda to second edition of ISO/IEC 10646-1: 2000 (will be part of Unicode 3.2) • Repertoire extensions (included in Unicode 3.2) • ISO/IEC 10646-2 (planes 1, 2 & 14) • Plane 1, mathematics, hieroglyphs, music symbols, etc • Plane 2, CJKV ideographic extensions • Plane 14, language tags • Support current and future implementers • Increase awareness and provide technical help • Continued synchronization of future editions of ISO/IEC 10646 and the Unicode Standard

  25. Going in the Same Direction • One standard • No dialects • Common usage • Common Encoding Forms • UTF-8 • UTF-16 • UTF-32/UCS-4 • Cooperation with ISO • Examples: UTF-8, UTF-16, UTF-32, EURO, collation, tags • Incorporation into other standards • IETF • WWW Consortium (W3C) • Shared expertise for lesser-used and obscure scripts

  26. WG2 Program of Work • 1st Amendment 10646-1: 2000 (March 2002) • 2nd Amendment 10646-1: 2000 (December 2002) • 1st Amendment 10646-2: 2001 (2003) • WG2 future meetings: • Meeting 42 – Dublin, Ireland (May 2002) • Meeting 43 – Tokyo, Japan (December 2002)

  27. Outline • Background • Relation between Unicode and ISO/10646 • What is the same • What is different • What is being merged • Synchronization • Shared Process and Policies • Aligned Program of Work • Common publication resources • Beyond character coding • Character properties & Collation • Internationalization • Products and Standards

  28. Collation & Character Properties • ISO/IEC 14651 Collation Standard • Produced by SC22/WG20 Internationalization • Matches Unicode Collation Algorithm • Unicode Technical Standard (UTS) #10 • Unicode Character Database • Collection of character classification and properties • Geared towards the needs of implementers • Supports Internationalization • http://www.unicode.org/Public/UNIDATA
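The Unicode Character Database is the kind of resource implementers usually consume through a library. As a hedged sketch, Python's standard `unicodedata` module exposes a subset of UCD properties; `describe` is a hypothetical helper, and the values reflect the Unicode version bundled with the running Python. Collation itself (ISO/IEC 14651 and UTS #10) is not part of the Python standard library and is not shown.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Collect a few UCD properties for a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),          # e.g. Lu, Mn, Sc
        "bidi": unicodedata.bidirectional(ch),         # e.g. L, R, AL
        "combining": unicodedata.combining(ch),        # canonical combining class
        "numeric": unicodedata.numeric(ch, None),      # None if not numeric
    }

for ch in ("A", "\u05d0", "\u0627", "\u00bd", "\u20ac"):
    print(f"U+{ord(ch):04X}", describe(ch))
```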

  29. Language Innovation

  30. Products Are Here! [Chart: types of products against increased function of products, on a timeline from 1993 through 2000 and beyond] • Phase 1: Deliver a full set of products (browsers, development tools, fonts, word processors, etc.) • Phase 2: Increased functionality (more scripts, combining characters, etc.)

  31. Products Are HERE! • Microsoft: Windows XP, Office XP, Internet Explorer 6.0, ECMAScript, C#/CLI • Compaq: Tru64 Unix • HP: HP-UX & Printers • Netscape: Communicator 6.0, JavaScript, ECMAScript • Sun: Solaris & Java • Apple: Cyberdog, Mac OS X • Lotus: Lotus Suite • Asian solutions: JustSystems (Ichitaro) and Star+Globe (MASS) • Databases: Software AG, Sybase, Oracle, DB2, NCR Teradata, Progress Software • SAP platform • Fonts: Adobe, Agfa/Monotype, Apple Advanced Typography, Bitstream, OpenType • Tools and libraries: several vendors

  32. Version 3.2 Is Here! • Version 3.2 is in sync with both parts of ISO/IEC 10646 and 1st amendment to 10646-1 • total repertoire of 95156 characters • completed math repertoire for MathML and other uses • Further restriction on ill-formed UTF-8 • http://www.unicode.org/unicode/reports/tr28
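What "ill-formed UTF-8" means in practice can be seen by asking a strict decoder to reject bad byte sequences. The sketch below relies on Python 3's built-in `utf-8` codec, which follows the current Unicode definition of well-formed UTF-8; the slide does not say which sequences Version 3.2 newly ruled out, so the examples are illustrative assumptions rather than a restatement of that change. `is_well_formed_utf8` is a hypothetical helper name.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """True if the strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

samples = {
    "EURO SIGN (E2 82 AC)": b"\xe2\x82\xac",
    "overlong '/' (C0 AF)": b"\xc0\xaf",
    "overlong '/' (E0 80 AF)": b"\xe0\x80\xaf",
    "encoded surrogate (ED A0 80)": b"\xed\xa0\x80",
}
for label, data in samples.items():
    print(f"{label}: {is_well_formed_utf8(data)}")
```

Rejecting overlong and surrogate encodings matters for security as well as interchange, since it guarantees that each character has exactly one byte representation.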

  33. Outline • Background • Relation between Unicode and ISO/10646 • What is the same • What is different • What is being merged • Synchronization • Shared Process and Policies • Aligned Program of Work • Common publication resources • Beyond character coding • Character properties & Collation • Internationalization • Products and Standards • Summary

  34. Common Repertoire • The character repertoires of Unicode and ISO/IEC 10646 are identical • Three matching encoding forms • There are minor differences in • Terminology • Publication format • Any conformant Unicode implementation conforms to ISO/IEC 10646

  35. Unicode Extends... • Character semantics • “Discover and catalogue” • Canonical and compatibility equivalence • Relate characters to their established use • Technical reports with implementation guidelines • Normalization • Script behavior such as bi-directional algorithm • Active promotion of the standard

  36. What Do 10646 and Unicode Do for You? • Global interoperability - write once, run everywhere; one source code, one binary, with user-installable/callable locales • Simplified software - one application with one code set versus multiple applications and managing different code sets • Data stability - a single common and widely adopted format • Reduced costs - development, maintenance, training

  37. Great Expectations • Enhance global interoperability • Enhance data interchange • Permit easier development of localizable products • Reduce development cost of localized application software • Replace retrofitting with concurrent development

  38. Recommendations • Buy the international standard (including all published amendments) as well as the Unicode Standard • Watch for updates on the web, including Unicode technical reports and ISO amendments • Join the Unicode Consortium, W3C, your national body standards committee or other organization to influence standards development processes • Define your needs and communicate them to your vendors • Build products that support ISO/IEC 10646 and The Unicode Standard

  39. Thank You!
