Internationalization: An Introduction. Presenter and Presentation. Addison Phillips Globalization Architect This Presentation “Internationalization and Unicode Conference” Tutorial Covers Internationalization and basic concepts, such as character encodings. Who is this guy?.
Internationalization: An Introduction
Presenter and Presentation • Addison Phillips • Globalization Architect • This Presentation • “Internationalization and Unicode Conference” Tutorial • Covers Internationalization and basic concepts, such as character encodings
Who is this guy? • Globalization Architect, Lab126 (you know us as “Amazon Kindle”) • Chair, W3C Internationalization Core WG • Editor, IETF LTRU-WG
Internationalization is: • the design and development of a product that is enabled for target audiences that vary in culture, region, or language. [W3C] • a fundamental architectural approach to software development
Related Concepts Localization: creation of a product tailored to a particular target market Translation: process of converting text from one language to another Globalization: unified approach to creating global products, especially those that support multiple geographies simultaneously
Mystic Numbering (M4C N7G) • I N T E R N A T I O N A L I Z A T I O N: 18 letters sit between the first I and the last N, hence I18N • Localization = L10N • Globalization = G11N • Canonicalization = C14N • Opinions differ on capitalization (C12N); choose from: i18N I18n I18n I18N • Very geeky; not very internationalized (I19G?)
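The numbering rule on this slide is mechanical, so it can be sketched in a few lines. The helper name `numeronym` is an assumption made for illustration; it keeps the first and last letters and counts everything in between.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + number of inner letters + last letter."""
    if len(word) < 4:
        # Too short to be worth abbreviating; return it unchanged (uppercased).
        return word.upper()
    return f"{word[0]}{len(word) - 2}{word[-1]}".upper()


for term in ("internationalization", "localization",
             "globalization", "canonicalization"):
    print(term, "->", numeronym(term))
```

Applied to the slide title itself, `numeronym("mystic")` and `numeronym("numbering")` give the two halves of "M4C N7G".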
A Global Approach • Internationalization turns technical problems into business decisions • Balance priorities based on real user distribution/requirements • Consider global user population as a whole • Consider specific market requirements on an equal footing • Potential markets for the product
Buy In: The Key to Success • For internationalization to be a success over time, there must be commitment: • Management • Product Team • Development Team • All developers, not a splinter group
Globalized Product Development Internationalization turns technical problems into business decisions. • Localization: Choose which markets to translate the user interface or documentation for, with no additional engineering. • Deployment: Choose whether to serve applications from a single site, a cluster of sites, or in each target market. • Development: Add content and features to products as necessary in each target market. • Integration and Interoperability: Servers and products can work together around the world, so customers can truly create “Enterprise” solutions.
Aspects of Internationalization Enabling—the same code supports multiple regions or cultures. Sometimes called a “global binary”. Externalization—plan for localizability by separating “content” from code. This makes localization for specific languages, regions, or cultures easy, fast, and cheap. Customization—add culturally specific functionality, presentation, or content to an application.
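Externalization is easier to picture with a small sketch. Everything here is hypothetical: the `CATALOGS` table, the `fallback_chain` and `message` helpers, and the locale tags are invented for illustration, and a real product would load catalogs from resource files rather than define them in source.

```python
# Hypothetical message catalogs. In a real product these live in resource
# files that translators can edit without touching the code.
CATALOGS = {
    "en": {"greeting": "Hello, {name}!", "farewell": "Goodbye."},
    "fr": {"greeting": "Bonjour, {name} !"},
    "fr-CA": {},
}


def fallback_chain(locale_tag: str, default: str = "en") -> list[str]:
    """'fr-CA' -> ['fr-CA', 'fr', 'en']: most specific first, default last."""
    parts = locale_tag.split("-")
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    if default not in chain:
        chain.append(default)
    return chain


def message(key: str, locale_tag: str, **params: str) -> str:
    """Look up a message by key, walking the fallback chain until one is found."""
    for tag in fallback_chain(locale_tag):
        template = CATALOGS.get(tag, {}).get(key)
        if template is not None:
            return template.format(**params)
    raise KeyError(key)


print(message("greeting", "fr-CA", name="Zoé"))
print(message("farewell", "fr-CA"))
```

The code never contains user-visible text, so adding a language means adding a catalog, not changing the program.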
What, me worry? We (wrote it in Java/C#, used Unicode, etc.), so it is internationalized. We made the assumption that the product would only ever have English screens: all our users understand it anyway. A localized product is internationalized. An internationalized product is slow/slower. It takes longer to write internationalized code. We can’t read the screens/it is too hard to test. We have no intention of localizing, so no need to internationalize. We don’t have any customers there. The users in (some country) never complained, so it must work. This product is 100% fully internationalized.
Development Methodologies • Independent of development methodology • Agile? Waterfall? You make the choice. • Encompasses the full development cycle: • Design • Development • QC • Release • Support
The Customization Approach • “Internationalization is something remedial” • “Didn’t we do internationalization in the last release?!?” • Internationalization involves a lot of arcane knowledge (“we don’t know what to do”) • “It will interrupt or slow down development.” • “International features are not important to our U.S. customers—and they represent our largest market.” • “The guys in-country have always figured it out before.” • “Let’s outsource it” • “We’ll get to it next time”
How That Model Really Looks [Diagram: a timeline. The Main Line carries releases 1.0, 1.0a, and 2.0 with bug fixes and sexy new features. An International Branch forks off toward the 1.0i International Release, linked back to the Main Line by Merges and Fixes. Annotations: functionality gaps (intl users waiting for 2.0i now); lots more people and cost; lost $ and opportunity; lots of cost to get there.]
The Problem with Customization Code forks. (double, triple coding) Lag time for international releases. Non-adoption of localized release. Full regression of every language. Quality or commitment perception. Lack of data exchange between language versions. Difficult to repeat (every version is a repeat) Proliferation of bugs and of support problems. International features are cancelled. Core product still doesn’t work/can’t address similar markets. Loss of market share.
The Internationalization Approach Gather requirements globally Enable Externalize Customize Test and support globally Localize
Analyzing and Developing a Design Large Animal Pictures
Global Code Resources Large Animal Pictures Software Component Output Input I/O
Enterprise Animal Pictures [Diagram: clients, a Front End, APIs, Business Logic, Data Stores, a data feed, and a partner or provider, each running in its own Operating Env.]
Internationalization Issues • Text Processing • Character encodings, including Unicode, spelling, word breaks, collation, and so on • Language • Of the software (localization) • Of solutions built using the software (localizability, data) • Locale-affected formats • dates, numbers and the like • Regionally-affected formats • names, addresses, currency, and the like • Time-related issues • time zone, calendar, holidays, work rules and the like • Cultural adaptation • presentation, style, position, color use, and the like • Legal requirements • accessibility, SOX, DRM, moderation, security, content, and the like
“Well, it depends…” Making Good Design Decisions • Generalize designs • Locale independent data structures • Locale sensitive display • Externalize cultural or linguistic variations • Customize as a last resort
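"Locale independent data structures, locale sensitive display" can be sketched as follows. The `FORMATS` table and both helper functions are assumptions made for illustration; a real product would take these conventions from maintained locale data rather than a hand-written table.

```python
from datetime import date

# Hypothetical, hand-written conventions for two locales.
FORMATS = {
    "en-US": {"group": ",", "decimal": ".", "date": "%m/%d/%Y"},
    "de-DE": {"group": ".", "decimal": ",", "date": "%d.%m.%Y"},
}


def format_number(value: float, locale_tag: str) -> str:
    """Render a stored number with the locale's separators."""
    fmt = FORMATS[locale_tag]
    neutral = f"{value:,.2f}"  # e.g. "1,234,567.89"
    # Swap both separators in one pass so they do not overwrite each other.
    swap = str.maketrans({",": fmt["group"], ".": fmt["decimal"]})
    return neutral.translate(swap)


def format_date(value: date, locale_tag: str) -> str:
    """Render a stored date in the locale's field order."""
    return value.strftime(FORMATS[locale_tag]["date"])


# The stored values are locale independent; only the display varies.
amount = 1234567.891
when = date(2024, 3, 5)
for tag in FORMATS:
    print(tag, format_number(amount, tag), format_date(when, tag))
```

The stored `amount` and `when` never change; a string such as "03/05/2024" would be ambiguous as data, which is why formatting is deferred to display time.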
Levels of Enablement • Not Enabled • Single-Language-at-a-Time (SLAAT) All components run in the same language and encoding environment correctly. • Multi-Locale Unicode support; components run in different locales, languages, encodings, and time zones
Test Your Assumptions • Gender: • Male • Female
Enabling Making Code Aware of Culture
What is “enabling”? • Enabled software: adapts the display, processing, validation, storage, and transmission of data according to the cultural, linguistic, and regional needs of the users • Text, Characters, and Encodings • Locale Awareness • Times and Time Zones A “global binary” is a single object-code version that is used in all markets, regardless of localization.
The Biggest Source of Woe “Character encodings consume more than 80% of my work day. They are the source of more mis-information and confusion than any other single thing. And developers aren’t getting any better educated.” ~Glen Perkins, Globalization Architect
A lot of jargon Real and bogus jargon you might encounter: Real Jargon Multibyte Variable width Wide character Character encoding Coded character set Bidi or bidirectional Glyph, character, code unit Unicode Potentially Bogus Jargon kanji double-byte language extended ASCII ANSI encoding agnostic
How the computer sees the world “bits”: 010000010101101101101000 “byte” or “octet”: 01000001 (0x41) • code unit: a unit of physical storage and information interchange • represent numbers • come in various sizes (e.g. 7, 8, 16, 32, 64 bits) • how do we map text to the numbers used by computers?
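As a small illustration of the point that code units only represent numbers, the bit string from this slide can be split into octets. Treating the result as ASCII at the end is an assumption added here to show that text appears only once an encoding is chosen.

```python
bits = "010000010101101101101000"  # the bit string shown on the slide

# Split into 8-bit code units (octets) and read each one as a number.
octets = [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]
print([hex(o) for o in octets])

# The numbers become text only after an encoding is chosen (ASCII here).
print(bytes(octets).decode("ascii"))
```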
From text to bits: À → U+00C0 → … 0xC3 0x80 … Glyphs • A “glyph” is a screen unit of text: it’s a picture of what users think of as a character. • A “grapheme” is a single visual unit of text. Characters • A “character” is a single logical unit of text. • A “character set” is a set of characters. • A “code point” is a number assigned to a character in a character set. • A “coded character set” is a character set where each character has a code point. Bytes • A “character encoding” maps a sequence of code points (“characters”) to a sequence of code units (such as bytes). • A “code unit” is a single logical unit of storage.
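The layers on this slide can be inspected directly for the example character À. This is a minimal sketch using Python's standard library; the choice of UTF-8 as the encoding matches the bytes shown on the slide, and the decomposition step is an extra added here to show that one grapheme is not always one code point.

```python
import unicodedata

text = "\u00c0"  # À, written as an escape so the source file's encoding cannot matter

code_point = ord(text)        # the number assigned in the coded character set
utf8 = text.encode("utf-8")   # the code units (bytes) produced by one encoding
print(f"U+{code_point:04X}", unicodedata.name(text), utf8.hex(" "))

# The same grapheme can also be written as two code points:
# a base letter followed by a combining accent.
decomposed = unicodedata.normalize("NFD", text)
print(len(text), len(decomposed))
```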
Coded Character Set • Collection (repertoire) of characters, that is: a set. • Organized so that each character has a unique numeric (typically integer) value (code point). • Examples: • Unicode • ASCII (ANSI X3.4) • ISO 646 • JIS X 208 • Latin-1 (ISO 8859-1) Character sets are often associated with a particular language or writing system.
Character Encoding (U+00C0 → 0xC3 0x80) • Maps a sequence of code points (characters) to a sequence of code units (e.g. bytes). • Some encodings use another unit instead of the byte. For example, some encodings use a 16-bit, 32-bit, or 64-bit code unit.
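To make the mapping concrete, here is a short sketch that encodes the same code point, U+00C0, with several encodings. The particular set of encodings is chosen for illustration; the big-endian variants are used so that no byte order mark is added.

```python
char = "\u00c0"  # À

# One code point, several encodings, different sequences of code units.
for encoding in ("utf-8", "utf-16-be", "utf-32-be", "latin-1"):
    print(f"{encoding:10} {char.encode(encoding).hex(' ')}")
```

UTF-8 needs two 8-bit code units here, UTF-16 needs one 16-bit unit, and UTF-32 needs one 32-bit unit.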
All text has a character encoding: in memory, on disk, on the network, etc. (usually the most important slide in this entire presentation) When things go wrong, start by asking what the encoding is, what encoding you expected it to be, and whether the bytes match the encoding.
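The last question, whether the bytes match the encoding, can be partly automated. The helper `matches_encoding` below is a hypothetical name, and the check is only a rough one: it confirms that the bytes are well-formed in the given encoding, not that the encoding is the right one.

```python
def matches_encoding(data: bytes, encoding: str) -> bool:
    """Return True if the bytes are well-formed in the given encoding."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(matches_encoding(b"\xc3\x80", "utf-8"))   # well-formed UTF-8 for À
print(matches_encoding(b"\xc0", "utf-8"))       # a lone 0xC0 is never valid UTF-8
print(matches_encoding(b"\xc0", "latin-1"))     # Latin-1 accepts every byte value
```

Because single-byte encodings such as Latin-1 accept any byte sequence, a `True` result there proves very little; a `False` result for UTF-8 is the more useful signal.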
Common Encoding Problems • Mojibake: garbage characters • Question marks: conversion not supported • Tofu: hollow boxes
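Two of these problems can be reproduced in a few lines; tofu cannot, because it depends on the fonts available at display time. This is a sketch with characters chosen for illustration.

```python
# Mojibake: UTF-8 bytes for À viewed as if they were Windows code page 1252.
garbled = "\u00c0".encode("utf-8").decode("cp1252")
print(garbled)

# Question marks: the target encoding has no code for the character,
# so the converter substitutes "?" and the original is lost.
lost = "\u00c0\u4e0f".encode("latin-1", errors="replace")
print(lost)
```

Mojibake is often reversible because the original bytes are still there; a substituted question mark is not.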
Tofu Can appear as either hollow boxes (empty glyph) or as question marks (Firefox, for example) Not usually a bug: it’s a display problem Can mask or masquerade as character corruption.
When Good Characters Go Bad Mojibake
Sources of Mojibake • View text using the wrong encoding • Apply a transfer encoding and forget to remove it • Convert to an encoding twice • Convert to or from the wrong encoding • Overzealous escaping • Conversion to entities (“entitization”) • Multiple conversions
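Two of these sources, converting twice and escaping twice, are sketched below. The sample strings are assumptions for illustration; the repair step at the end works only because the wrong intermediate encoding (Latin-1) happens to be lossless.

```python
import html

original = "\u00c0"  # À

# Convert to UTF-8 twice: the UTF-8 bytes are misread as Latin-1 text,
# then that text is encoded to UTF-8 again.
once = original.encode("utf-8")
twice = once.decode("latin-1").encode("utf-8")
print(once.hex(" "), "->", twice.hex(" "))

# Undo the extra conversion by reversing the steps in order.
repaired = twice.decode("utf-8").encode("latin-1").decode("utf-8")

# Overzealous escaping: the same text is entity-escaped twice.
escaped_twice = html.escape(html.escape("R&D"))
print(escaped_twice)
```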
ASCII • 7 bits = 2^7 = 128 characters • Enough for “U.S. English”
Latin-1 (ISO 8859-1) • ASCII for characters 0x00 through 0x7F • Accented letters and other symbols 0x80 through 0xFF
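The relationship between ASCII and Latin-1 can be checked with Python's built-in codecs. This is a small sketch; the sample character é is chosen for illustration.

```python
all_bytes = bytes(range(256))

# The lower half of Latin-1 is exactly ASCII.
lower_matches = all_bytes[:128].decode("latin-1") == all_bytes[:128].decode("ascii")

# Every Latin-1 byte value equals the code point it decodes to.
direct_mapping = all_bytes.decode("latin-1") == "".join(chr(i) for i in range(256))

# The upper half is outside ASCII: é is one byte in Latin-1, an error in ASCII.
e_acute = "\u00e9".encode("latin-1")
print(lower_matches, direct_mapping, e_acute.hex())
```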
One character, many encodings!
char | Cp1252 | Cp437 | Cp850
È | 0xC8 | ? | 0xD4
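The table can be reproduced with Python's codecs for these three code pages. The helper name `encode_or_none` is an assumption; it returns `None` where the code page has no code for the character.

```python
from typing import Optional


def encode_or_none(char: str, encoding: str) -> Optional[bytes]:
    """Encode one character, or return None if the encoding cannot represent it."""
    try:
        return char.encode(encoding)
    except UnicodeEncodeError:
        return None


for code_page in ("cp1252", "cp437", "cp850"):
    result = encode_or_none("\u00c8", code_page)  # È
    print(code_page, result.hex() if result else "?")
```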
Windows Code Pages Windows encodings (called “code pages”) are generally based on standard encodings, plus some additional characters. Example: • CP 1252 is based on ISO 8859-1, but includes 27 “extra” characters in the C1 control range (0x80-0x9F)
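The "extra" characters can be listed by decoding each byte in the C1 range. This sketch relies on Python's own `cp1252` codec table, which leaves a few bytes in that range unassigned, so the count reflects that table.

```python
extras = {}
for value in range(0x80, 0xA0):
    try:
        extras[value] = bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        pass  # this byte is unassigned in Python's cp1252 codec

print(len(extras), "printable characters in 0x80-0x9F")
print("".join(extras.values()))

# In ISO 8859-1 the same byte values decode to C1 control characters instead.
as_latin1 = bytes(extras).decode("latin-1")
```

The practical consequence is that text containing curly quotes or the euro sign is usually CP 1252 even when it is labeled ISO 8859-1.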
Code Page • Originally an IBM character encoding term. IBM numbered their character sets with “CCSIDs” (coded character set ids) and numbered the corresponding character encodings as “code pages”. • Microsoft borrowed code pages to create PC-DOS. • Microsoft defines two kinds of code pages: “ANSI” code pages are the ones used by Windows GUI programs; “OEM” code pages are the ones used by command shell/command line programs. • Neither “ANSI” nor “OEM” refers to a particular encoding standard or standards body in this context. • Avoid the use of ANSI and OEM when referring to encodings.
Beyond Single Byte Encodings • So far we’ve been looking at single-byte encodings: • one byte per character • 1 byte = 1 character (= 1 glyph?) • 256 character maximum • Good enough for most alphabetic languages • Some languages need more characters. • What about the “double-byte” languages? • Don’t those take two bytes per character? 丏丣並 À
Methods of reaching beyond single-byte • Escape sequences to select another character set. Example: ISO 2022 uses escape sequences to select various encodings. • Use a larger code unit (“wide” character encoding). Example: IBM DBCS code pages or Unicode UTF-16. 2^16 = 64K characters; 2^32 ≈ 4.3 billion characters. • Use a variable-width encoding. Variable-width encodings use different numbers of code units to represent different types of characters within the same encoding.
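The first two methods can be observed with Python's codecs. This is a sketch; the sample text 日本語 and the choice of the `iso2022_jp` codec as the ISO 2022 example are assumptions made for illustration.

```python
ESC = b"\x1b"

# Escape sequences: ISO-2022-JP switches character sets in the middle of the stream.
japanese = "\u65e5\u672c\u8a9e"  # 日本語
encoded = japanese.encode("iso2022_jp")
print(encoded)
print(encoded.count(ESC), "escape sequences around", len(japanese), "characters")

# A larger code unit: UTF-16 stores this ideograph in a single 16-bit unit.
wide = "\u4e0f".encode("utf-16-be")
print(len(wide), "bytes =", len(wide) * 8, "bits")

# Capacity of 16-bit and 32-bit code units.
print(2 ** 16, 2 ** 32)
```

Plain ASCII text passes through ISO-2022-JP unchanged, so the escape sequences appear only where the character set actually changes.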
Multibyte Encodings • Multibyte Encoding: Any “variable-width” encoding that uses the byte as its code unit. • One or more bytes per character • 1 byte != 1 character • May use 1, 2, 3, or 4 bytes per character • May use shift or escape sequences • May encode more than one character set • In fact, single-byte encodings are a special case of multibyte!
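UTF-8 is a convenient multibyte encoding for seeing "1 byte != 1 character" in practice. This is a sketch; the four sample characters are chosen so that each needs a different number of bytes.

```python
samples = ["A", "\u00c0", "\u4e0f", "\U0001F600"]  # A, À, 丏, and an emoji

# Number of UTF-8 bytes used by each character.
widths = {char: len(char.encode("utf-8")) for char in samples}
for char, width in widths.items():
    print(f"U+{ord(char):04X} uses {width} byte(s)")

# Character count and byte count diverge as soon as non-ASCII text appears.
text = "".join(samples)
print(len(text), "characters,", len(text.encode("utf-8")), "bytes")
```

Any code that allocates buffers, truncates strings, or validates field lengths needs to know which of the two counts it is working with.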