Unicode and Internationalization
Draft only meaningful with voiceover!
mark.davis@us.ibm.com
Internationalization ≠ Translation
• localized = foreign language not required
  • displayed text is translated
  • all native conventions used
    • dates, times, numbers, etc.
    • display, editing, GUI
• internationalized = localizable without code changes
Internationalization Levels
• Different levels of support for different environments
• Server Side
  • low-level
  • no display
• Client Side
  • high-level
  • display and editing
Server Side
• Strings
  • storage and manipulation
  • character set conversion
  • collation, normalization, char/word boundaries
• Locales
  • Formatting/parsing numbers, currencies, date/times, messages
• Message cataloging (resources)
Client Side
• Displaying, printing and editing Unicode text
  • BIDI display (Arabic, Hebrew…)
  • character shaping (Arabic, Indic,...)
• Inputting text (Japanese)
• Full incorporation into the windowing and desktop interface
Unicode
• Key to modern internationalization
• Enables robust interchange of text data
• Encompasses all world characters
• Supports legacy data
Unicode Design Principles
• Unambiguous
  • same code unit = same interpretation
• Universal
  • all national standards, new extensions
  • Unicode 0041 FF21 = “AＡ” = SJIS 41 82 60
• Efficient
  • no code-switching (ISO 2022)
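The mapping on the "Universal" bullet can be checked with a short Python sketch. It assumes Python's built-in `shift_jis` codec is an acceptable stand-in for "SJIS"; the variable names are illustrative only.

```python
# U+0041 LATIN CAPITAL LETTER A followed by U+FF21 FULLWIDTH LATIN CAPITAL LETTER A
text = "A\uFF21"

# Unicode view: one code point per character, written as four hex digits
code_points = [f"{ord(ch):04X}" for ch in text]

# Legacy view: Shift-JIS uses one byte for ASCII "A" and two bytes for the fullwidth form
sjis_bytes = text.encode("shift_jis")
sjis_hex = " ".join(f"{b:02X}" for b in sjis_bytes)

print(code_points)  # Unicode code points
print(sjis_hex)     # Shift-JIS bytes

# The conversion round-trips, which is what "supports legacy data" relies on
assert sjis_bytes.decode("shift_jis") == text
```

Both characters survive the round trip, but the two encodings disagree on how many storage units each one takes.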
Skipping Advantages
• In the interests of time, we are skipping the details of Unicode advantages, and jumping on to...
Not a Magic Wand
• Code required for Client Side, Server Side
• Complex languages require special support
• Detecting “hotspots”
• The rest of the discussion covers items to watch for in XML
Multiple Representations
Endians
• Big vs. Little: “a” = 00 61 vs 61 00
• UTF-16: BOM (FE FF vs. FF FE)
• UTF-16BE, UTF-16LE
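A minimal Python sketch of the byte orders named on this slide, using only the standard `codecs` module. The variable names are illustrative.

```python
import codecs

# The same character serialized in both byte orders
be = "a".encode("utf-16-be")  # big-endian:    00 61
le = "a".encode("utf-16-le")  # little-endian: 61 00

# The byte order mark is U+FEFF serialized in the stream's own byte order
bom_be = codecs.BOM_UTF16_BE  # FE FF
bom_le = codecs.BOM_UTF16_LE  # FF FE

# The generic "utf-16" decoder reads the BOM to pick the byte order and then drops it
assert (bom_be + be).decode("utf-16") == "a"
assert (bom_le + le).decode("utf-16") == "a"

# The explicit UTF-16BE / UTF-16LE labels carry the byte order themselves,
# so no BOM is needed when decoding with them
assert be.decode("utf-16-be") == "a"
assert le.decode("utf-16-le") == "a"
```

Decoding with the wrong explicit label does not fail here; it silently yields a different character, which is why the label or the BOM has to be trusted.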
Character Conversion
• Many legacy sets
• Names for sets not standard
• IANA most accepted, but not comprehensive
• JIS “¥” overloaded with “\”
• Private Use Characters
• “Best fit” mappings
Ambiguous Term: “Character”
• UTF-16: “Character” can mean:
  • Code Units: 16 bits
  • Code Point: 1 or 2 code units
  • Graphemes: 1+ code units
    • Combining sequences
    • Hangul Jamo
    • Indic clusters
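The three senses of "character" give three different lengths for the same string. The sketch below is a rough illustration in Python: the grapheme count is a deliberate simplification (a base character plus any following combining marks) and does not handle Hangul Jamo or Indic clusters, for which a full text-segmentation implementation would be needed. All function names are hypothetical.

```python
import unicodedata

def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units in the UTF-16 form of s."""
    return len(s.encode("utf-16-le")) // 2

def code_points(s: str) -> int:
    """Number of Unicode code points (Python strings index by code point)."""
    return len(s)

def approx_graphemes(s: str) -> int:
    """Crude grapheme count: every non-combining code point starts a new grapheme.
    Assumption: this ignores Hangul Jamo, Indic clusters and other segmentation rules."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

# "e" + COMBINING ACUTE ACCENT, then MUSICAL SYMBOL G CLEF (outside the BMP)
sample = "e\u0301\U0001D11E"

print(utf16_code_units(sample), code_points(sample), approx_graphemes(sample))
```

An index or length in an interface is only meaningful once it says which of these three counts it uses.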
Comparison/Indexing
• Index by which sense of character?
• Canonical equivalence
• Normalization
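As a small illustration of why canonical equivalence matters for comparison and indexing, the Python sketch below compares a precomposed and a decomposed spelling of “ä” with the standard `unicodedata` module. The helper name `canonically_equal` is hypothetical.

```python
import unicodedata

composed = "\u00E4"     # ä as a single precomposed code point
decomposed = "a\u0308"  # a + COMBINING DIAERESIS

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the same form (NFC here)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# A binary comparison treats the two spellings as different strings...
print(composed == decomposed)
# ...and they also disagree on length, so indices into them do not line up
print(len(composed), len(decomposed))
# Normalizing first makes the comparison match what a reader sees
print(canonically_equal(composed, decomposed))
```

Whether to normalize on input, on comparison, or both is a design decision that the slide leaves open; the sketch only shows the comparison case.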
Collation
• Large character sets
• Incompatible languages, versions
• Weak Equivalents: “a” ~ “ä”, “a” ~ “A”
• Ignorable characters: “black-bird”
• Contracting characters: “ch”
• Expanding characters: “ä”
• Separate key fields for phonetics
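The weak-equivalent and ignorable-character bullets can be sketched as a multi-level sort key. This is a toy illustration in Python, not the Unicode Collation Algorithm and not any particular language's ordering: the `IGNORABLE` set and the three-level key are assumptions made for the example, and contracting or expanding characters are not handled.

```python
import unicodedata

IGNORABLE = {"-"}  # assumption: treat the hyphen as ignorable, as in "black-bird"

def sort_key(s: str):
    """Toy three-level key: base letters first, then accents, then case."""
    chars = [ch for ch in unicodedata.normalize("NFD", s) if ch not in IGNORABLE]
    base = [ch for ch in chars if unicodedata.combining(ch) == 0]
    primary = "".join(ch.lower() for ch in base)                        # letters only
    secondary = tuple(ch for ch in chars if unicodedata.combining(ch))  # accents
    tertiary = tuple(ch.isupper() for ch in base)                       # case
    return (primary, secondary, tertiary)

words = ["b", "ä", "A", "a", "black-bird", "blackbird"]

print(sorted(words))                # code point order: "ä" lands after everything else
print(sorted(words, key=sort_key))  # "a", "A", "ä" group together ahead of "b"
```

Accent and case only break ties here, so “a”, “A” and “ä” differ at the weaker levels while sharing the same primary key.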
Case Conversion
• May be 1 to many: “ß” → “SS”
• May be locale-sensitive: “i” → “İ”
• Does not round-trip: “vederLa”
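The three bullets can be observed with Python's built-in string methods. Python's `str.upper` is locale-independent, so the Turkish mapping is shown with a hypothetical helper, `upper_turkish`, written only for this sketch.

```python
# One-to-many: a single lowercase letter becomes two uppercase letters
sharp_s_upper = "ß".upper()

# No round trip: lowercasing the result does not give back the original
sharp_s_round_trip = sharp_s_upper.lower()

# Mixed-case text loses its capitalization pattern on the way through
mixed_round_trip = "vederLa".upper().lower()

def upper_turkish(s: str) -> str:
    """Hypothetical helper: uppercase with the Turkish dotted/dotless i rules.
    Assumption: only the i/ı special cases differ from the default mapping."""
    return s.replace("i", "\u0130").replace("\u0131", "I").upper()

print(sharp_s_upper, sharp_s_round_trip, mixed_round_trip)
print("i".upper(), upper_turkish("i"))  # default mapping vs. Turkish mapping
```

Code that assumes uppercasing preserves string length, or that case conversion is reversible, is one of the "hotspots" these bullets point at.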
Formatting/Parsing
• Different separators:
  • “1’234,56”, “1,234.56”
• Different order:
  • “2/23/99”, “23.2.99”
  • “Can’t find ” + X, X + “ n’existe pas”
• Different text:
  • “$”, “¥”
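A minimal sketch of keeping separators, field order and word order out of the code, in Python. The table names (`MESSAGES`, `DATE_PATTERNS`) and the style labels are hypothetical and stand in for a real message catalog and locale data; the sample values are the ones on the slide.

```python
from datetime import date

# Message patterns carry the placeholder position, so word order lives in the data
MESSAGES = {
    "en": "Can't find {name}",
    "fr": "{name} n'existe pas",
}

# Date patterns decide field order and separator (labels are illustrative)
DATE_PATTERNS = {
    "month_first": "{m}/{d}/{yy:02d}",
    "day_first": "{d}.{m}.{yy:02d}",
}

def format_number(value: float, group_sep: str, decimal_sep: str) -> str:
    """Format with two decimals, then swap in the requested separators."""
    text = f"{value:,.2f}"  # e.g. "1,234.56"
    # Use a temporary marker so the two replacements cannot collide
    return text.replace(",", "\0").replace(".", decimal_sep).replace("\0", group_sep)

def format_date(d: date, style: str) -> str:
    return DATE_PATTERNS[style].format(m=d.month, d=d.day, yy=d.year % 100)

when = date(1999, 2, 23)
print(format_number(1234.56, ",", "."), format_number(1234.56, "\u2019", ","))
print(format_date(when, "month_first"), format_date(when, "day_first"))
print(MESSAGES["en"].format(name="X"), "|", MESSAGES["fr"].format(name="X"))
```

Building the message by concatenating “Can’t find ” + X fixes the English word order in code; the pattern form lets a translator move the placeholder instead.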
Display: Orientation
• Characters: left-right, right-left, top-bottom,…
• Lines: top-bottom, left-right, right-left
• Specials: Japanese Ruby, etc.
• GUI: scrollbars, menus, etc.
Display: Glyphs ≠ Characters
• Shaping: “X” “Y”
  • contextual forms
  • ligatures
• Indexing: m characters ↔ n glyphs
Display: Editing
• Editing: mapping glyph ↔ char indices
• Line breaking
• Justification
• Hyphenation
• Hanging punctuation
• Optical alignment
• Baseline alignment
Input
• Large character sets
• Different keyboard mappings
• Keys ≠ characters
• Typing events contain strings, not just singletons
• Input methods
  • GUI for options
  • interaction with text editing
Current Issues for W3C
• BOM
• Use of “character”
• Indexing/comparing
• Normalization
• Versions of Unicode (Euro,...)
• Stateful format codes
• Datatypes (date, time,…)
• High-level layout (CSS…)
Summary
• Internationalization ≠ Translation
• Unicode provides foundation
• But not a magic wand!
• Watch for hotspots
• Work with int’l experts