Software Localization (L10N) and Internationalization (I18N)

• Localization: customizing software for a particular language or market
• Class discussion: What needs to be customized when Microsoft Word is changed from English to Chinese?

Example: Good Morning

    public class GoodMorning {
        public static void main(String[] s) {
            System.out.println("Good morning!");
        }
    }

• What if you want to do this for Hong Kong?
• What if you want to do this for China and other places?
• Think of a way to write it without needing to change the source code.

Revised: Good Morning

    import java.util.*;

    public class GoodMorning {
        public static void main(String[] s) {
            try {
                // Load MyData.properties (or a locale-specific variant) from the classpath.
                ResourceBundle resources = ResourceBundle.getBundle("MyData");
                System.out.println(resources.getString("Hi"));
            } catch (MissingResourceException mre) {
                System.err.println("MyData.properties not found");
            }
        }
    }
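The revised program only works if a MyData resource exists for each target locale. The following is a minimal sketch of what that data might look like, assuming one default file (MyData.properties) and one Hong Kong file (MyData_zh_HK.properties), each with the key Hi. To keep the example self-contained, the file contents are held in strings and loaded with PropertyResourceBundle; the class GreetingBundles and its methods are hypothetical names, and the Chinese greeting is only illustrative. With real files on the classpath, ResourceBundle.getBundle("MyData", locale) performs this selection itself.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class GreetingBundles {
    // Stand-in for the contents of MyData.properties (default text).
    static final String DEFAULT_PROPS = "Hi=Good morning!\n";
    // Stand-in for MyData_zh_HK.properties; the escaped code points are an
    // illustrative Chinese greeting written in properties-file escape syntax.
    static final String ZH_HK_PROPS = "Hi=\\u65e9\\u6668\n";

    // Picks the language data for a locale; the program logic never changes.
    static ResourceBundle bundleFor(Locale locale) {
        boolean hongKong = "zh".equals(locale.getLanguage())
                && "HK".equals(locale.getCountry());
        String text = hongKong ? ZH_HK_PROPS : DEFAULT_PROPS;
        try {
            return new PropertyResourceBundle(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String greeting(Locale locale) {
        return bundleFor(locale).getString("Hi");
    }

    public static void main(String[] args) {
        System.out.println(greeting(Locale.US));
        System.out.println(greeting(Locale.forLanguageTag("zh-HK")));
    }
}
```

Supporting another market then means adding another properties file, not editing or recompiling GoodMorning.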

Internationalization (I18N)
• I18N: a software methodology that avoids writing separate application software for different language/cultural environments.
• The language environment can change without changing the programming logic (no need to modify source code).
• Why I18N:
    • More complicated software design and implementation
    • But it saves development cost for the global market
    • Minimizes localization work
    • Minimizes exposure of source code

Principles of I18N:
• Do not hard-code any language-related data or elements (language data) in a program
• Design a well-defined interface to access language data from external sources (files, databases, or even programs)
• Provide clear instructions for localization

How to write an I18N program
• Analyze the language-related elements in the application program and make sure they are not hard-coded in the program
• Design or use a language interface specification (routines to access the language data in a well-defined way)
• Prepare localization instructions, or follow a standard (they must be precise so that data can be prepared by following the instructions)

Example
• Bank ATM machines in Hong Kong
• Traditional program:
    • Display alternate screens
    • Insert card and input password
    • Get preferred display
    • If English, execute English program
    • Else execute Chinese program
• What do the English and the Chinese programs have in common?
• What if we need to add Simplified Chinese?

I18N-conscious program:
• Display alternate screens
• Insert card and input password
• Get preferred display
• Open preferred display file
• Execute ATM program
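The I18N-conscious flow above can be sketched as one ATM routine that receives its display text as data. This is only an illustration: the class AtmSketch, the message keys, and the in-memory tables standing in for "display files" are all assumptions, and the non-English entries are placeholders rather than real translations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AtmSketch {
    // Each inner map stands in for one "preferred display file".
    static final Map<String, Map<String, String>> DISPLAY_FILES = new HashMap<>();

    static void addDisplayFile(String language, String enterPassword, String takeCard) {
        Map<String, String> messages = new HashMap<>();
        messages.put("enter.password", enterPassword);
        messages.put("take.card", takeCard);
        DISPLAY_FILES.put(language, messages);
    }

    static {
        addDisplayFile("en", "Please enter your password", "Please take your card");
        // Placeholder text: a real file would hold the translated messages.
        addDisplayFile("zh_HK", "[zh_HK] enter password", "[zh_HK] take card");
        // Adding Simplified Chinese is one more data entry, not another program.
        addDisplayFile("zh_CN", "[zh_CN] enter password", "[zh_CN] take card");
    }

    // The single ATM program: identical logic for every language.
    static List<String> runSession(String preferredDisplay) {
        Map<String, String> messages =
                DISPLAY_FILES.getOrDefault(preferredDisplay, DISPLAY_FILES.get("en"));
        List<String> screens = new ArrayList<>();
        screens.add(messages.get("enter.password"));
        screens.add(messages.get("take.card"));
        return screens;
    }

    public static void main(String[] args) {
        for (String screen : runSession("zh_CN")) {
            System.out.println(screen);
        }
    }
}
```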

Discussion on this example:

    public class GoodMorning {
        public static void main(String[] s) {
            // The language name is expected as the first command-line argument.
            String language = (s.length > 0) ? s[0] : "";
            int country = 0;
            if (language.equals("English")) {
                country = 1;
            } else if (language.equals("Chinese_HK")) {
                country = 2;
            }
            switch (country) {
                case 1: System.out.println("Good Morning!"); break;
                case 2: System.out.println("早上好！"); break;
                default: System.out.println("Good Morning");
            }
        }
    }

Data for User Interface vs. Data for Manipulation
• Data for the user interface: kept in resource files
• Data for manipulation may not be in the same language/script as the data displayed in the user interface.
    • Example: using an English UI of Microsoft Word to compose a Chinese article, or vice versa
    • Such data is not necessarily in resource files

Language/Culture Related Issues: display and processing (basic to all applications)
• Internal representation: codeset
• Different classes of the subgroups in a codeset
• Input: encoding of input strings into internal code
• Output: association of internal code with glyphs (display)
• Date expression
• Currency symbols
• Fractions and large numbers
• etc.
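The number and currency items in the list above can be illustrated with Java's locale-aware formatting classes. This is a small sketch: the class name LocaleFormats and the sample value are made up for illustration, while NumberFormat, Currency, and Locale are standard java.text / java.util classes.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

class LocaleFormats {
    // Formats a number with the locale's decimal and grouping separators.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Returns the ISO 4217 code of the currency used in the locale's country.
    static String currencyCode(Locale locale) {
        return Currency.getInstance(locale).getCurrencyCode();
    }

    public static void main(String[] args) {
        double sample = 1234567.891;
        Locale[] locales = {
            Locale.US, Locale.GERMANY, Locale.forLanguageTag("zh-HK")
        };
        for (Locale locale : locales) {
            System.out.println(locale + ": " + formatNumber(sample, locale)
                    + " " + currencyCode(locale));
        }
    }
}
```

The same value comes out with the decimal and thousands separators swapped between the US and German locales, which is why such separators should never be hard-coded.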

I18N Issues on Language Related Applications
• Handling of messages in applications (not system messages):
    • Write the menu items and messages in resource files
    • Provide a language parameter for the application, or use the locale value, to open the appropriate file (stored either in different directories or under different file names)
• Certain language-specific applications (e.g. spell checking):
    • Expose the function as an API so that different algorithms can be (dynamically) linked to the application
• Data format:
    • Example: address formats vary by location
        USA: Flat No. (incl. bldg), Street, City, ZipCode (incl. State)
        HK: Flat, Floor, Bldg, Estate, Street (may be optional), District
    • Database table design is not straightforward.
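One way to handle the address example is to treat the field order as data per region rather than as program logic. The sketch below assumes a hypothetical AddressFormats class with field names taken from the two formats listed above; the sample address values are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AddressFormats {
    // Field order per region, mirroring the USA and HK layouts on the slide.
    static final Map<String, List<String>> FIELD_ORDER = new HashMap<>();
    static {
        FIELD_ORDER.put("US", Arrays.asList("flat", "street", "city", "zip"));
        FIELD_ORDER.put("HK", Arrays.asList(
                "flat", "floor", "building", "estate", "street", "district"));
    }

    // Joins the fields that are present, in the region's order.
    // Missing optional fields (e.g. the HK street) are simply skipped.
    static String format(String region, Map<String, String> fields) {
        List<String> order = FIELD_ORDER.get(region);
        if (order == null) {
            throw new IllegalArgumentException("no address format for " + region);
        }
        List<String> parts = new ArrayList<>();
        for (String name : order) {
            String value = fields.get(name);
            if (value != null && !value.isEmpty()) {
                parts.add(value);
            }
        }
        return String.join(", ", parts);
    }

    public static void main(String[] args) {
        Map<String, String> hk = new HashMap<>();
        hk.put("flat", "Flat A");
        hk.put("floor", "12/F");
        hk.put("building", "Block 3");
        hk.put("estate", "Example Estate");
        hk.put("district", "Example District");
        System.out.println(format("HK", hk));
    }
}
```

A database table that must hold both layouts needs the union of these fields, most of them optional, which is part of why the table design is not straightforward.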

• Measurement scales:
    • Imperial system vs. metric system: conversion can cause rounding problems
    • Paper sizes
• Chinese language specific:
    • Segmentation
    • Lack of morphological rules to indicate tense (time), active/passive voice, etc.
    • No need for morphological rules in searching
    • More complicated sorting algorithms due to the multiple features of Chinese characters
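The imperial/metric rounding problem mentioned above can be shown with a round trip. In this sketch (the class UnitRounding and the choice of whole-unit rounding are assumptions for illustration), a length stored as whole centimetres is converted to whole inches and back.

```java
class UnitRounding {
    static final double CM_PER_INCH = 2.54;

    // Converts centimetres to the nearest whole inch.
    static long cmToWholeInches(long cm) {
        return Math.round(cm / CM_PER_INCH);
    }

    // Converts inches to the nearest whole centimetre.
    static long inchesToWholeCm(long inches) {
        return Math.round(inches * CM_PER_INCH);
    }

    // Metric -> imperial -> metric, rounding at each step.
    static long roundTripCm(long cm) {
        return inchesToWholeCm(cmToWholeInches(cm));
    }

    public static void main(String[] args) {
        for (long cm = 1; cm <= 6; cm++) {
            System.out.println(cm + " cm -> " + cmToWholeInches(cm)
                    + " in -> " + roundTripCm(cm) + " cm");
        }
    }
}
```

Here 4 cm is about 1.57 in, which rounds to 2 in, which is 5.08 cm and rounds to 5 cm, so the stored value silently changes. Keeping one canonical unit internally and converting only for display avoids this drift.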

Internationalization Facilities: POSIX
• POSIX: Portable Operating System Interface
• NLS: National Language Support
• Locale: a particular localization setting, e.g. the C locale, zh_TW, etc.

    /home/staff/csluqin:> date
    Thur Feb 24 15:38:25 CST 2005
    :> setenv LANG zh_TW
    :> echo $LANG
    zh_TW
    :>
    /usr/openwin/lib/locale:> env LANG=zh_TW.BIG5 date
    中華民國 94年 02月26日 15時38分 27秒 CST
    :>
    /usr/openwin/lib/locale:> env LANG=fr date
    mercredi, 3 avril 2002, 14:30:51 HKT (not available now)

POSIX Locale Categories
• LC_CTYPE: controls the behavior of character handling functions, such as isalpha()
• LC_TIME: date and time formats and functions
• LC_MONETARY: currency symbol, related functions, etc.
• LC_NUMERIC: decimal separator and thousands separator
• LC_COLLATE: controls sorting order and string conversion/comparison
• LC_MESSAGES: controls the choice of message catalogs (user message translation)

    :> env LANG=zh_TW LC_MESSAGES=C
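The effect of a collation category such as LC_COLLATE can be illustrated in Java, where java.text.Collator plays a comparable role. This is an analogy rather than the POSIX mechanism itself, and the class name CollationDemo and the word list are made up for the example.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollationDemo {
    // Sorts by raw character codes, as a locale-unaware comparison would.
    static String[] sortByCode(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts with the collation rules of the given locale.
    static String[] sortForLocale(Locale locale, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"cherry", "Banana", "apple"};
        System.out.println(Arrays.toString(sortByCode(words)));
        System.out.println(Arrays.toString(sortForLocale(Locale.ENGLISH, words)));
    }
}
```

Sorting by code puts "Banana" first because uppercase letters have smaller codes than lowercase ones, while the English collator orders the words alphabetically regardless of case.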

• Character class test functions: isalpha(c), isupper(c), islower(c), isdigit(c), isxdigit(c), isalnum(c), isspace(c), ispunct(c), isprint(c), iscntrl(c), isascii(c), isgraph(c)
• Character conversion functions: toupper(c), tolower(c)
• Wide characters vs. multibyte characters
• Multibyte/wide character conversion functions: mblen(), mbtowc(), wctomb(), mbstowcs(), wcstombs()
• National Profile: data prepared for POSIX functions in a particular locale. Example of NP.GB
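For comparison, here is a sketch of the Java counterparts of the classification and conversion functions above. Unlike the C functions, Java's Character tests follow Unicode properties and do not depend on a locale setting, while String case conversion accepts an explicit Locale. The class name CharClassDemo is hypothetical.

```java
import java.util.Locale;

class CharClassDemo {
    // Locale-sensitive case conversion, roughly what toupper() does under LC_CTYPE.
    static String upper(String text, String languageTag) {
        return text.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        char ideograph = '\u5b78'; // a CJK ideograph
        System.out.println(Character.isLetter('A'));        // like isalpha()
        System.out.println(Character.isLetter(ideograph));  // ideographs count as letters
        System.out.println(Character.isDigit('7'));         // like isdigit()
        System.out.println(Character.isWhitespace(' '));    // like isspace()
        // In Turkish, the uppercase of "i" is a dotted capital I (U+0130).
        System.out.println(upper("i", "en").equals("I"));
        System.out.println(upper("i", "tr").equals("\u0130"));
    }
}
```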

NLS and Symbolic Names
• A national profile is written using symbolic names
• Each locale has a separate file called charmap, which maps the symbolic name of each character to its actual code in that locale

    Symbolic Name    Encoding
    <A>              \x41
    <two>            \x32
    <semicolon>      \x3b
    <GB16-01>        \xb0\xa1    /* 啊 */

• Why symbolic names:
    • Less error-prone
    • Flexibility
    • Language/cultural conventions may differ while the codeset is the same
    • The language/cultural convention may be the same while the codesets differ
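The charmap idea can be sketched as a lookup table from symbolic names to byte sequences. The four entries below are the ones shown in the table above; the class CharmapSketch and its encode method are hypothetical and only illustrate how a profile written in symbolic names is turned into codes for one locale.

```java
import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

class CharmapSketch {
    // Charmap for one locale: symbolic name -> actual code in that codeset.
    static final Map<String, byte[]> CHARMAP = new LinkedHashMap<>();
    static {
        CHARMAP.put("<A>", new byte[] {0x41});
        CHARMAP.put("<two>", new byte[] {0x32});
        CHARMAP.put("<semicolon>", new byte[] {0x3b});
        CHARMAP.put("<GB16-01>", new byte[] {(byte) 0xb0, (byte) 0xa1});
    }

    // Resolves a sequence of symbolic names into the locale's byte encoding.
    static byte[] encode(String... symbolicNames) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String name : symbolicNames) {
            byte[] code = CHARMAP.get(name);
            if (code == null) {
                throw new IllegalArgumentException("no charmap entry for " + name);
            }
            out.write(code, 0, code.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : encode("<A>", "<GB16-01>", "<semicolon>")) {
            System.out.printf("%02x ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Moving the same profile to a different codeset would mean swapping the table, not rewriting the profile.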

Making Portable Software for Different Encodings (codeset independent)

    char s[100];
    char *p;
    fgets(s, sizeof(s), stdin);   /* get a line of input */
    p = strchr(s, 'A');           /* find letter A */
    if (p != NULL)                /* if found, */
        *p = '\0';                /* replace with null byte */

• What is the problem with this program?
    • 'A' is fine in EUC encoding: 0x41 (ASCII code) never appears inside a multibyte character. But if this program is ported to a PC Big5 system, 0x41 can also be the second byte of an ideographic character.
    • 乙丕再你杗呸服隹括耍唧涉… are all xx41 in Big5!
• C language standard guarantee:
    • 0x00 is not part of any multibyte character; it only marks the end of a string
• Use of wide characters
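The Big5 problem can be reproduced at the byte level. The sketch below uses the bytes 0xA4 0x41 (the Big5 code of 乙, one of the xx41 characters listed above) followed by a real letter A. The class Big5Search is hypothetical, and the rule "any byte of 0x81 or above starts a two-byte character" is a simplifying assumption about Big5 lead bytes, not a full decoder.

```java
class Big5Search {
    // Byte-by-byte search, equivalent to strchr() on the raw bytes.
    static int naiveIndexOf(byte[] text, byte target) {
        for (int i = 0; i < text.length; i++) {
            if (text[i] == target) {
                return i;
            }
        }
        return -1;
    }

    // Search that steps over two-byte characters so that a trail byte
    // is never mistaken for a single-byte character.
    static int boundaryAwareIndexOf(byte[] text, byte target) {
        int i = 0;
        while (i < text.length) {
            int unsigned = text[i] & 0xFF;
            if (unsigned >= 0x81) {
                i += 2;          // lead byte: skip the whole character
            } else if (text[i] == target) {
                return i;
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // One two-byte character (0xA4 0x41) followed by the letter A (0x41).
        byte[] line = {(byte) 0xA4, 0x41, 0x41};
        System.out.println(naiveIndexOf(line, (byte) 'A'));
        System.out.println(boundaryAwareIndexOf(line, (byte) 'A'));
    }
}
```

The naive search reports offset 1, the middle of the ideograph, so overwriting that byte would corrupt the character. The boundary-aware search reports offset 2, the actual letter.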

Wide Characters vs. Multibyte Characters
• The terms may refer to the nature of codesets or to data types in programming languages
• Multibyte characters: character length varies from character to character. The term can refer to characters in a single codeset (Taiwan's CNS) or to characters in multiple codesets (Big5 with ASCII), stored for example as char in C
• Wide characters: fixed-length character encoding, such as wchar_t in C and characters in Java, which are all Unicode (wide characters)

Multibyte examples (Big5):
    學習ABC => total of 7 bytes
    學習普通話 => total of 10 bytes
• Problems:
    • String length cannot be derived directly from byte length (it is context sensitive); character boundaries must be detected.
    • It is difficult to jump to an arbitrary position in a string and know whether it is the first byte of a character
    • Conversion between MBC and WC is needed
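The two problems above can be sketched as follows. Both methods rest on a simplified model of Big5 (ASCII characters take one byte, every other character takes two, and any byte of 0x81 or above is a lead byte); the class MultibyteLength is hypothetical, and a real program would use the platform's encoder instead of this rule.

```java
class MultibyteLength {
    // Byte length of a string under the simplified Big5 model:
    // 1 byte for ASCII, 2 bytes for everything else.
    static int big5Length(String text) {
        int bytes = 0;
        for (int i = 0; i < text.length(); i++) {
            bytes += (text.charAt(i) < 0x80) ? 1 : 2;
        }
        return bytes;
    }

    // Answers "is this offset the first byte of a character?".
    // There is no shortcut: the scan has to start at the beginning.
    static boolean isCharacterStart(byte[] text, int offset) {
        int i = 0;
        while (i < offset) {
            i += ((text[i] & 0xFF) >= 0x81) ? 2 : 1;
        }
        return i == offset;
    }

    public static void main(String[] args) {
        // Two ideographs followed by "ABC", written with Unicode escapes.
        String mixed = "\u5b78\u7fd2ABC";
        System.out.println(mixed.length() + " characters, "
                + big5Length(mixed) + " bytes");
        // One two-byte character (0xA4 0x41) followed by the letter A.
        byte[] line = {(byte) 0xA4, 0x41, 0x41};
        System.out.println(isCharacterStart(line, 1));
        System.out.println(isCharacterStart(line, 2));
    }
}
```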

Conversion of MBC to WC
• Note: Unicode is a WC encoding, but a WC is not necessarily Unicode

When to use MBC
• Copying data only
• Comparing for equality
• Searching for control characters
• Single-byte data only: if MB_CUR_MAX = 1
When to use WC
• Collation: sorting
• Parsing characters: searching and processing
• String editing

• MB_LEN_MAX: <limits.h>, locale independent
• MB_CUR_MAX: <stdlib.h>, locale dependent
• Use wide characters:

    char s[100];
    wchar_t ws[100];
    size_t n;
    char *p;
    wchar_t *wcp;
    fgets(s, sizeof(s), stdin);   /* get a line of input */
    mbstowcs(ws, s, 100);         /* convert s to ws */
    wcp = wcschr(ws, L'A');       /* find "A" */
    ……..
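The same convert-then-search pattern can be sketched in Java, where decoding bytes into a String plays the role of mbstowcs() and String.indexOf plays the role of wcschr(). Note the substitution: UTF-8 is used here as the multibyte encoding because it is available on every Java platform, whereas the slides discuss Big5. The class DecodeThenSearch is hypothetical.

```java
import java.nio.charset.StandardCharsets;

class DecodeThenSearch {
    // Converts multibyte input to Java characters, then searches
    // character by character instead of byte by byte.
    static int indexOfLetterA(byte[] multibyteText) {
        String wide = new String(multibyteText, StandardCharsets.UTF_8);
        return wide.indexOf('A');
    }

    public static void main(String[] args) {
        // Two ideographs followed by "ABC", encoded as multibyte UTF-8.
        byte[] input = "\u5b78\u7fd2ABC".getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes: " + input.length);
        System.out.println("index of A in characters: " + indexOfLetterA(input));
    }
}
```

After conversion, positions are counted in characters, so the search can never land in the middle of a multibyte sequence.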
