MC419 Character Sets, Collations and National Languages in SQL Anywhere

Steven McDowell • Senior Manager • International and Sustaining Eng • Sybase iAnywhere Solutions • Steven.McDowell@sybase.com

Topics • national language support • character sets • collations • character set conversion • locales • programming tips

Definitions • Localization (L10N) • translation of software and materials to another language or locale • Internationalization (I18N) • writing software to support more than one language • Multilingualization (M17N) • writing software to support more than one language at the same time • ASA is not multilingualized • Globalization (G11N) • ?

Topics Supported languages

National language support: Fully localized • Software • servers • development, administration and deployment tools • run-time messages • Documentation • on-line • books • Packaging

National language support: Fully localized • EN - English • DE - German • FR - French • JA - Japanese

National language support: Deployment localization • Software • run-time messages • servers • stand-alone tools • SQL Remote • No documentation • No development tools • Free!

National language support: Deployment localization • ES - Spanish • IT - Italian • PL - Polish • PT - Portuguese (Brazilian) • KO - Korean • TW - Traditional Chinese (Taiwan, Hong Kong) • ZH - Simplified Chinese (People’s Republic of China)

Topics Character sets

Character set - definition • mapping between internal representation (hexadecimal) and a visual symbol (glyph) • usually specific to one or a few languages • may be more than one for a language

Character sets • single-byte • multi-byte • Unicode

Character sets - ANSI and OEM • ANSI (or ISO) character sets are international standards • used by Windows • OEM character sets are from DOS • used by command prompts (DOS boxes) in Windows • an application running on Windows might use either an ANSI or an OEM character set (or both)

Character sets - single-byte • English, European, and some other languages can represent their entire set of characters using a single byte for each character • entire char set is at most 256 characters • standard C/C++ string processing works

Single-byte example - German • all chars are one byte • consider ö • in CP1252 (ANSI), its value is 0xF6 (246) • in CP850 (OEM), its value is 0x94 (148)
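The German example can be checked with a short Python sketch. It relies only on Python's built-in `cp1252` and `cp850` codecs; the variable names are illustrative.

```python
# Encode the same character under the two code pages named on the slide.
ch = "ö"
ansi = ch.encode("cp1252")  # Windows ANSI code page
oem = ch.encode("cp850")    # DOS OEM code page

print(f"CP1252: 0x{ansi[0]:02X} ({ansi[0]})")  # CP1252: 0xF6 (246)
print(f"CP850:  0x{oem[0]:02X} ({oem[0]})")    # CP850:  0x94 (148)

# Reading OEM bytes as if they were ANSI produces a different glyph:
# 0x94 in CP1252 is the right double quotation mark, not ö.
misread = oem.decode("cp1252")
print(repr(misread))
```

The last line shows why a byte value alone is meaningless without knowing which character set produced it.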

Character sets - multi-byte • Asian languages have too many symbols to represent using one byte • one solution is “multi-byte” char sets, where each symbol is represented by one or more bytes • programs must be written carefully to handle these char sets! • standard C/C++ string processing does NOT work!

Multi-byte example - Japanese • contains chars made of one or two bytes • “lead byte” of 0x81-0x9F, 0xE0-0xEF indicates two-byte char • “follow byte” is in range 0x40-0xFC • all other “lead byte” values indicate one-byte char

Multi-byte example - Japanese • follow bytes include 7-bit ASCII values (0x40-0x7E)! • this includes the entire English alphabet, plus @ [ \ ] ^ _ ` { | } ~ • processing right-to-left is difficult • single-byte scanning is dangerous • filename processing on Windows is especially difficult because \ is a valid follow-byte
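The filename hazard can be sketched in Python. The helper names below are hypothetical, and the lead-byte ranges are simply the ones given on the slide; this is a minimal illustration of lead-byte-aware scanning, not production path handling.

```python
def is_lead_byte(b):
    """True if b starts a two-byte character (ranges taken from the slide)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF

def split_chars(data):
    """Split a byte string into one- and two-byte characters, left to right."""
    chars = []
    i = 0
    while i < len(data):
        width = 2 if is_lead_byte(data[i]) and i + 1 < len(data) else 1
        chars.append(data[i:i + width])
        i += width
    return chars

def last_separator(path):
    """Byte offset of the last real backslash, or -1 if there is none."""
    pos = -1
    offset = 0
    for ch in split_chars(path):
        if ch == b"\\":
            pos = offset
        offset += len(ch)
    return pos

# Katakana "ソ" encodes as 0x83 0x5C in CP932; its follow byte equals "\".
path = "C:\\data\\ソ.txt".encode("cp932")

print(path.rfind(b"\\"))     # 9: naive byte scan lands inside the character
print(last_separator(path))  # 7: the real directory separator
```

Scanning from the left is what makes this work: a byte can only be classified as a follow byte once the preceding lead byte is known, which is also why right-to-left processing is hard.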

Character sets - Unicode • universal character set • UCS-2 versus UTF-8 • wide characters (16 bits, 65536 chars) • can store all chars used by all languages • used by Java • Windows NT/2000 uses Unicode internally! • ODBC 3.51 supports Unicode
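To make the UCS-2 versus UTF-8 distinction concrete, the sketch below encodes three sample characters both ways. One assumption to note: Python has no codec named UCS-2, so `utf-16-be` stands in for it here; the two agree for the 16-bit characters shown.

```python
# Compare fixed-width 16-bit encoding with variable-width UTF-8.
samples = ["A", "ö", "日"]

sizes = {}
for ch in samples:
    wide = ch.encode("utf-16-be")  # stand-in for UCS-2 (assumption, see above)
    utf8 = ch.encode("utf-8")
    sizes[ch] = (len(wide), len(utf8))
    print(ch, wide.hex(), utf8.hex())

# "A" -> 0041 / 41
# "ö" -> 00f6 / c3b6
# "日" -> 65e5 / e697a5
```

Every character takes two bytes in the 16-bit form, while UTF-8 uses one byte for ASCII and two or three bytes for the others.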

Character sets - examples • OEM (DOS) • CP437, CP850 • ANSI (Windows) • CP1252 • OEM and ANSI • CP932 (Japanese)

Difficult character sets • CP857 and CP920, Turkish • in Turkish, there are a number of accented letters that look like ‘I’ and ‘i’ • lower-case of ‘I’ is ‘ı’ • upper-case of ‘i’ is ‘İ’ • ‘I’ and ‘İ’ are NOT equivalent! • so ‘I’ and ‘i’ are NOT equivalent! • Yikes!!!!
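A small Python sketch of the Turkish case problem. Python's built-in `str.lower()` and `str.upper()` are not locale-aware, so they apply the English rule; the `turkish_lower` and `turkish_upper` helpers are hypothetical and handle only the dotted and dotless I pairs before falling back to the default mapping.

```python
# Turkish pairs: I <-> ı (dotless) and İ <-> i (dotted).
_TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})
_TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})

def turkish_lower(s):
    """Lower-case using Turkish rules for the two I letters."""
    return s.translate(_TR_LOWER).lower()

def turkish_upper(s):
    """Upper-case using Turkish rules for the two I letters."""
    return s.translate(_TR_UPPER).upper()

print("I".lower())            # i  (default rule, wrong for Turkish)
print(turkish_lower("I"))     # ı
print(turkish_upper("i"))     # İ
print(turkish_lower("ISIK"))  # ısık
```

Any code that folds case to compare identifiers or keywords has the same exposure, which is why the locale has to be an explicit input.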

Difficult character sets • CP932 (Shift-JIS), Japanese, and CP950, Traditional Chinese • follow-bytes include 0x40-0x7E • A-Z a-z @ [ \ ] ^ _ ` { | } ~ • single-byte scanning for delimiters must be avoided • filename processing is a particular problem

Difficult character sets • CP949, Korean • follow-bytes include 0x41-0x5A, 0x61-0x7A • A-Z and a-z
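The Korean case can be explored with Python's built-in `cp949` codec. The sketch below searches the Hangul syllable block for characters whose follow byte falls in the A-Z or a-z ranges from the slide; the function names are hypothetical.

```python
def is_ascii_letter(b):
    """True if the byte value is in A-Z or a-z (ranges from the slide)."""
    return 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A

def syllables_with_letter_trail(limit=5):
    """Return up to `limit` Hangul syllables whose CP949 follow byte is a letter."""
    hits = []
    for cp in range(0xAC00, 0xD7A4):  # Unicode Hangul syllable block
        encoded = chr(cp).encode("cp949")
        if len(encoded) == 2 and is_ascii_letter(encoded[1]):
            hits.append((chr(cp), encoded))
            if len(hits) == limit:
                break
    return hits

for syllable, encoded in syllables_with_letter_trail():
    print(syllable, encoded.hex(), chr(encoded[1]))
```

Each hit is a character that a byte-wise search for an English letter would wrongly match.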

Topics Collations

Collation - definition • character set + ordering, or, character set + language (1250POL) • represents a particular language • >= 1 collations per character set (CP850) • >= 1 collations per language (English) • >= 1 languages per collation (1252LATIN1) • each database has one collation

Collations - how they work • each character is mapped to a numeric value • one or more characters may map to the same numeric value • the numeric values are compared

Collations - equivalent characters • in many languages, different glyphs can be sorted as if they are the same • example: ‘a’ and ‘A’ in a case-insensitive database • example: ‘o’ and ‘ö’ in a German database • ordering within equivalent characters is arbitrary
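The mapping idea can be sketched as a weight table in Python. The weights are invented for illustration and do not reproduce any real SQL Anywhere collation; `WEIGHTS`, `sort_key` and `collation_equal` are hypothetical names.

```python
import string

# Each character maps to a numeric weight; equivalent characters share one.
WEIGHTS = {}
for rank, letter in enumerate(string.ascii_lowercase):
    WEIGHTS[letter] = rank
    WEIGHTS[letter.upper()] = rank         # case-insensitive: 'a' == 'A'
WEIGHTS["ö"] = WEIGHTS["Ö"] = WEIGHTS["o"]  # German-style: 'o' == 'ö'

def sort_key(s):
    """Replace every character with its weight; unmapped characters sort last."""
    return [WEIGHTS.get(ch, 1000 + ord(ch)) for ch in s]

def collation_equal(a, b):
    """Two strings compare equal when their weight sequences match."""
    return sort_key(a) == sort_key(b)

print(collation_equal("schon", "SCHÖN"))  # True
print(sorted(["zebra", "Öl", "apple", "ol"], key=sort_key))
```

Because "Öl" and "ol" have identical weights, nothing in the collation decides which comes first; here they simply keep their input order because Python's sort is stable.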

Collations - not supported • secondary sort attributes (sort ‘o’ before ‘ö’, but they compare equal) • ligatures (sort ‘ö’ as ‘oe’) • multi-byte: ordering chars with same lead byte • order by binary value of second byte • works OK for most multi-byte char sets • UTF-8 is a problem with this restriction

Choosing a collation • default is based on system information about char set and language • SELECT PROPERTY( 'DefaultCollation' ) • collation can be specified explicitly

Collations - examples • 1252LATIN1 • Code page 1252, Latin 1 ordering, Western Europe • 932JPN • Code page 932, Japanese multi-byte encoding • UTF8 • universal transformation format, 8-bit, Unicode multi-byte encoding

Choosing a collation - Windows

Choosing a collation - Unix

Collations - identifiers • case is always preserved • identifiers are always case-insensitive • example: SYS.SYSDOMAIN table can be referenced as sys.sysdomain • exception: Turkish! • must specify identifiers with correct case, at least for ‘I’ or ‘i’

Topics Character set conversion

Character set conversion • server • ODBC translation DLL • ODBC Unicode support • jConnect/JDBC • application • MobiLink

Server character set conversion • -ct command-line switch • client app specifies char set it wants • conversion occurs only if char set of client app is different from char set of collation

ODBC Translation DLL • default is “off” (no translation) • standard MS DLL does OEM-ANSI conversion • not really needed anymore

ODBC Unicode Support • ODBC 3.51 supports Unicode API • ASA ODBC driver supports Unicode API • ODBC driver manager converts all single- and multi-byte chars to/from Unicode and uses driver’s Unicode API

jConnect/JDBC character set conversion • jConnect/JDBC determines database char set and translates characters between Unicode and DB char set

Application character set conversion • only required if two parts of app require different char sets • example: database char set is 850 (OEM), app is windowed using 1252 (ANSI) • OemToAnsi/AnsiToOem • MultiByteToWideChar/WideCharToMultiByte • can and should be avoided, if possible!
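The 850-to-1252 example can be sketched in Python by decoding from one code page and encoding to the other. This is an analogue of what OemToAnsi does on Windows, not a wrapper around it; `convert` is a hypothetical helper.

```python
def convert(data, source, target):
    """Re-encode bytes from one character set to another.

    No work is done when both sides already use the same character set.
    """
    if source.lower() == target.lower():
        return data
    return data.decode(source).encode(target)

oem_bytes = "Köln".encode("cp850")                  # as stored in the database
ansi_bytes = convert(oem_bytes, "cp850", "cp1252")  # as needed by the window

print(oem_bytes.hex())   # 4b946c6e
print(ansi_bytes.hex())  # 4bf66c6e
```

Conversion can fail or lose data when the target set has no equivalent character (CP850 line-drawing characters have no CP1252 counterpart, for example), which is one reason to avoid it where possible.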

MobiLink character set conversion • upload and download take place in char set of remote database (Unicode for Windows CE) • MobiLink communicates with consolidated database using ODBC’s Unicode API • data received from client is converted to Unicode • data sent to client is converted from Unicode to client’s char set

Character set for connection strings • collation of database is not known until connection is established • no char set conversion occurs during connection • if all machines use same char set, then no problems • otherwise, must use “compatible” characters • safest to use 7-bit ASCII, English letters, digits and some special chars
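A simple guard for the "7-bit ASCII" advice, sketched in Python. The function name and the sample connection strings are illustrative only.

```python
def is_portable(conn):
    """True if every character in the connection string is 7-bit ASCII."""
    return all(ord(ch) < 128 for ch in conn)

safe = "UID=dba;PWD=sql;ENG=demo"
risky = "UID=jörg;PWD=geheim"

print(is_portable(safe))   # True
print(is_portable(risky))  # False

# Why it matters: the same user name is different bytes on different machines.
print("jörg".encode("cp850") == "jörg".encode("cp1252"))  # False
```

With no conversion during connection, the server would see whichever bytes the client machine happened to produce.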

Topics Locales

Locales • language • character set • collation label (used by server only when creating new database)

Locales - language

Locales - character set • examples: CP1252, CP932 • see documentation for complete list

Locales - collation label • examples: 1252LATIN1, SJIS2 • see documentation for complete list

SQLLOCALE environment variable • CS=charset;LANG=langcode;LABEL=label • force a language: set sqllocale=lang=de • force a char set: set sqllocale=cs=cp850 • only works if server is using -ct
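The SQLLOCALE format can be illustrated with a small parser. This is a hypothetical sketch for reading the `CS=...;LANG=...;LABEL=...` layout shown above, not the parser the product uses; treating keys as case-insensitive is an assumption based on the slide mixing upper- and lower-case forms.

```python
def parse_sqllocale(value):
    """Split a SQLLOCALE-style string into a dict of upper-cased keys."""
    result = {}
    for part in value.split(";"):
        key, sep, val = part.partition("=")
        if sep and key.strip():
            result[key.strip().upper()] = val.strip()
    return result

print(parse_sqllocale("CS=cp850;LANG=de"))
print(parse_sqllocale("lang=de"))
```

In a real program the input would come from the environment, for example `os.environ.get("SQLLOCALE", "")`.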

Topics Programming tips

Internationalization • isolate all locale-dependent parts of your code • dates and times • currency and numeric formatting • carefully write all string processing code • many built-in string processing functions (strcmp, strlwr) cannot be used because they are not locale-aware • resources: strings, dialogs, menus, accelerators • lots of comments • documentation

String resources • each string should consist of a complete thought • do not build sentences from phrases • substitution is OK, but allow for specifying position • be careful of maximum string length (German >> English!)
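The substitution advice can be sketched in Python with named placeholders, which let each translation put the values wherever its grammar needs them. The message catalog and the German wording are illustrative, not taken from any product.

```python
# One complete sentence per language; placeholders are named, not positional.
MESSAGES = {
    "en": "Cannot copy {source} to {target}.",
    "de": "Kopieren nach {target} fehlgeschlagen (Quelle: {source}).",
}

def render(lang, **values):
    """Look up the whole sentence for a language and fill in the values."""
    return MESSAGES[lang].format(**values)

print(render("en", source="a.db", target="b.db"))
print(render("de", source="a.db", target="b.db"))

# Anti-pattern: gluing phrases together fixes the English word order.
# "Cannot copy " + source + " to " + target
```

The German sentence mentions the target before the source, which concatenated phrases could not express.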
