
MC419 Character Sets, Collations and National Languages in SQL Anywhere

MC419 Character Sets, Collations and National Languages in SQL Anywhere. Steven McDowell Senior Manager International and Sustaining Eng Sybase iAnywhere Solutions Steven.McDowell@sybase.com. Topics. national language support character sets collations character set conversion locales



  1. MC419 Character Sets, Collations and National Languages in SQL Anywhere • Steven McDowell • Senior Manager • International and Sustaining Eng • Sybase iAnywhere Solutions • Steven.McDowell@sybase.com

  2. Topics • national language support • character sets • collations • character set conversion • locales • programming tips

  3. Definitions • Localization (L10N) • translation of software and materials to another language or locale • Internationalization (I18N) • writing software to support more than one language • Multilingualization (M17N) • writing software to support more than one language at the same time • ASA is not multilingualized • Globalization (G11N) • ?

  4. Topics Supported languages

  5. National language support: Fully localized • Software • servers • development, administration and deployment tools • run-time messages • Documentation • on-line • books • Packaging

  6. National language support: Fully localized • EN - English • DE - German • FR - French • JA - Japanese

  7. National language support: Deployment localization • Software • run-time messages • servers • stand-alone tools • SQL Remote • No documentation • No development tools • Free!

  8. National language support: Deployment localization • ES - Spanish • IT - Italian • PL - Polish • PT - Portuguese (Brazilian) • KO - Korean • TW - Traditional Chinese (Taiwan, Hong Kong) • ZH - Simplified Chinese (People’s Republic of China)

  9. Topics Character sets

  10. Character set - definition • mapping between internal representation (hexadecimal) and a visual symbol (glyph) • usually specific to one or a few languages • may be more than one for a language

  11. Character sets • single-byte • multi-byte • Unicode

  12. Character sets - ANSI and OEM • ANSI (or ISO) character sets are international standards • used by Windows • OEM character sets are from DOS • used by command prompts (DOS boxes) in Windows • an application running on Windows might use either an ANSI or an OEM character set (or both)

  13. Character sets - single-byte • English, European, and some other languages can represent their entire set of characters using a single byte for each character • the entire char set has at most 256 characters • standard C/C++ string processing works

  14. Single-byte example - German • all chars are one byte • consider ö • in CP1252 (ANSI), its value is 0xF6 (246) • in CP850 (OEM), its value is 0x94 (148)
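
The ANSI/OEM difference on this slide can be checked with a short sketch. It uses Python's built-in `cp1252` and `cp850` codecs as stand-ins for the Windows ANSI and DOS OEM code pages; the variable names are illustrative only.

```python
# Encode the same character in the ANSI and OEM code pages from the slide.
ch = "ö"
ansi = ch.encode("cp1252")  # Windows ANSI code page
oem = ch.encode("cp850")    # DOS OEM code page

print(ansi.hex(), oem.hex())

# Reading OEM bytes as if they were ANSI does not fail; it silently
# produces a different glyph, which is why mixing the two is risky.
misread = oem.decode("cp1252")
print(repr(misread))
```

The same byte value means different things in different single-byte character sets, so the character set must always travel with the data.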

  15. Character sets - multi-byte • Asian languages have too many symbols to represent using one byte • one solution is “multi-byte” char sets, where each symbol is represented by one or more bytes • programs must be written carefully to handle these char sets! • standard C/C++ string processing does NOT work!

  16. Multi-byte example - Japanese • contains chars made of one or two bytes • “lead byte” of 0x81-0x9F, 0xE0-0xEF indicates two-byte char • “follow byte” is in range 0x40-0xFC • all other “lead byte” values indicate one-byte char

  17. Multi-byte example - Japanese • follow bytes include 7-bit ASCII values (0x40-7E)! • this includes the entire English alphabet, plus @ [ \ ] ^ _ ` { | } ~ • processing right-to-left is difficult • single-byte scanning is dangerous • filename processing on Windows is especially difficult because \ is a valid follow-byte
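
The filename problem described above can be illustrated with a small sketch. It assumes the lead-byte ranges given on the previous slide (0x81-0x9F and 0xE0-0xEF) and uses Python's `cp932` codec; `is_lead_byte` and `find_single_byte` are hypothetical helper names, not part of any SQL Anywhere or Windows API.

```python
def is_lead_byte(b: int) -> bool:
    """Lead-byte test using the ranges quoted on the slide (an assumption)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF


def find_single_byte(data: bytes, target: int) -> list[int]:
    """Offsets where `target` occurs as a character of its own,
    skipping follow bytes of two-byte characters."""
    hits = []
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]) and i + 1 < len(data):
            i += 2  # skip lead byte and its follow byte together
        else:
            if data[i] == target:
                hits.append(i)
            i += 1
    return hits


# The kanji in this path has 0x5C (backslash) as its follow byte.
path = "C:\\表\\a.txt".encode("cp932")

naive = [i for i, b in enumerate(path) if b == 0x5C]
safe = find_single_byte(path, 0x5C)
print(naive, safe)
```

The naive scan reports an extra path separator in the middle of the kanji; the lead-byte-aware scan does not. Scanning must always start from a known character boundary, which is also why right-to-left processing is hard.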

  18. Character sets - Unicode • universal character set • UCS-2 versus UTF-8 • wide characters (16 bits, 65536 chars) • can store all chars used by all languages • used by Java • Windows NT/2000 uses Unicode internally! • ODBC 3.51 supports Unicode
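
To make the UCS-2 versus UTF-8 distinction concrete, the sketch below compares encoded sizes for three characters. Python has no codec named UCS-2, so `utf-16-le` is used as a stand-in; the two produce identical bytes for characters that fit in 16 bits. `encoded_sizes` is a hypothetical helper.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (bytes in 16-bit wide-char form, bytes in UTF-8) for one character."""
    wide = ch.encode("utf-16-le")  # stand-in for UCS-2 (assumption)
    utf8 = ch.encode("utf-8")
    return len(wide), len(utf8)


for ch in ["A", "ö", "表"]:
    wide_len, utf8_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: wide={wide_len} bytes, UTF-8={utf8_len} bytes")
```

The wide form is fixed width for these characters, while UTF-8 is a multi-byte encoding whose length varies from one to three bytes here.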

  19. Character sets - examples • OEM (DOS) • CP437, CP850 • ANSI (Windows) • CP1252 • OEM and ANSI • CP932 (Japanese)

  20. Difficult character sets • CP857 and CP920, Turkish • in Turkish, there are a number of accented letters that look like ‘I’ and ‘i’ • lower-case of ‘I’ is ‘ı’ • upper-case of ‘i’ is ‘İ’ • ‘I’ and ‘İ’ are NOT equivalent! • so ‘I’ and ‘i’ are NOT equivalent! • Yikes!!!!
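
The Turkish casing rules above can be sketched as follows. Python's built-in `str.upper` and `str.lower` are not locale-aware, so the sketch applies the two Turkish-specific mappings first; `turkish_upper` and `turkish_lower` are hypothetical helpers, not library functions.

```python
# Turkish-specific case pairs: i <-> İ (dotted) and ı <-> I (dotless).
_TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})
_TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})


def turkish_upper(s: str) -> str:
    """Upper-case using Turkish rules for the dotted and dotless i."""
    return s.translate(_TR_UPPER).upper()


def turkish_lower(s: str) -> str:
    """Lower-case using Turkish rules for the dotted and dotless I."""
    return s.translate(_TR_LOWER).lower()


print("istanbul".upper())         # locale-unaware result
print(turkish_upper("istanbul"))  # Turkish result
print(turkish_lower("ISPARTA"))
```

Any code that compares identifiers or keywords after case folding has to decide which of these two behaviors it wants.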

  21. Difficult character sets • CP932 (Shift-JIS), Japanese, and CP950, Traditional Chinese • follow-bytes include 0x40-0x7E • A-Z a-z @ [ \ ] ^ _ { | } ~ • single-byte scanning for delimiters must be avoided • filename processing is a particular problem

  22. Difficult character sets • CP949, Korean • follow-bytes include 0x41-0x5A, 0x61-0x7A • A-Z and a-z

  23. Topics Collations

  24. Collation - definition • character set + ordering, or, character set + language (1250POL) • represents a particular language • >= 1 collations per character set (CP850) • >= 1 collations per language (English) • >= 1 languages per collation (1252LATIN1) • each database has one collation

  25. Collations - how they work • each character is mapped to a numeric value • one or more characters may map to the same numeric value • the numeric values are compared

  26. Collations - equivalent characters • in many languages, different glyphs can be sorted as if they are the same • example: ‘a’ and ‘A’ in a case-insensitive database • example: ‘o’ and ‘ö’ in a German database • ordering within equivalent characters is arbitrary
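
The mapping-and-compare model from the last two slides can be sketched in a few lines. The weight function below is an illustrative assumption (strip accents, ignore case); it is not the weight table of any actual SQL Anywhere collation.

```python
import unicodedata


def weight(ch: str) -> int:
    """Map a character to a numeric weight; accented and upper-case
    variants share the weight of their unaccented lower-case base."""
    base = unicodedata.normalize("NFD", ch)[0]  # drop combining marks
    return ord(base.lower())


def collation_key(s: str) -> list[int]:
    """A string compares by the sequence of its character weights."""
    return [weight(ch) for ch in s]


words = ["zebra", "Öl", "apple", "ol"]
print(sorted(words, key=collation_key))
print(collation_key("Öl") == collation_key("ol"))
```

Because 'Öl' and 'ol' produce identical keys, they compare equal, and their relative order in the sorted output depends only on the sort algorithm, not on the collation.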

  27. Collations - not supported • secondary sort attributes (sort ‘o’ before ‘ö’, but they compare equal) • ligatures (sort ‘ö’ as ‘oe’) • multi-byte: ordering chars with same lead byte • order by binary value of second byte • works OK for most multi-byte char sets • UTF-8 is a problem with this restriction

  28. Choosing a collation • default is based on system information about char set and language • SELECT PROPERTY( 'DefaultCollation' ) • collation can be specified explicitly

  29. Collations - examples • 1252LATIN1 • Code page 1252, Latin 1 ordering, Western Europe • 932JPN • Code page 932, Japanese multi-byte encoding • UTF8 • UCS Transformation Format, 8-bit, Unicode multi-byte encoding

  30. Choosing a collation - Windows

  31. Choosing a collation - Unix

  32. Collations - identifiers • case is always preserved • identifiers are always case-insensitive • example: SYS.SYSDOMAIN table can be referenced as sys.sysdomain • exception: Turkish! • must specify identifiers with correct case, at least for ‘I’ or ‘i’

  33. Topics Character set conversion

  34. Character set conversion • server • ODBC translation DLL • ODBC Unicode support • jConnect/JDBC • application • MobiLink

  35. Server character set conversion • -ct command-line switch • client app specifies char set it wants • conversion occurs only if char set of client app is different from char set of collation
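
The rule that conversion happens only when the two character sets differ can be sketched like this. It models the idea with Python codecs and is not the server's implementation; `to_database_charset` is a hypothetical name.

```python
import codecs


def to_database_charset(data: bytes, client_cs: str, db_cs: str) -> bytes:
    """Convert client bytes to the database character set, but only
    when the two character sets actually differ."""
    # Compare canonical codec names so aliases are treated as equal.
    if codecs.lookup(client_cs).name == codecs.lookup(db_cs).name:
        return data  # same character set: pass bytes through untouched
    return data.decode(client_cs).encode(db_cs)


print(to_database_charset(b"\xf6", "cp1252", "cp850").hex())
print(to_database_charset(b"\xf6", "cp1252", "windows-1252").hex())
```

Skipping the conversion when the character sets match avoids both the cost and any risk of altering bytes that are already correct.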

  36. ODBC Translation DLL • default is “off” (no translation) • standard MS DLL does OEM-ANSI conversion • not really needed anymore

  37. ODBC Unicode Support • ODBC 3.51 supports Unicode API • ASA ODBC driver supports Unicode API • ODBC driver manager converts all single- and multi-byte chars to/from Unicode and uses driver’s Unicode API

  38. jConnect/JDBC character set conversion • jConnect/JDBC determines database char set and translates characters between Unicode and DB char set

  39. Application character set conversion • only required if two parts of app require different char sets • example: database char set is 850 (OEM), app is windowed using 1252 (ANSI) • OemToAnsi/AnsiToOem • MultiByteToWideChar/WideCharToMultiByte • can and should be avoided, if possible!
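
One reason to avoid application-level conversion is that it can lose data. The sketch below is a Python stand-in for an OEM-to-ANSI conversion, assuming CP850 and CP1252 as on the slide; it does not call the Win32 `OemToAnsi` function, whose handling of unmappable characters may differ.

```python
def oem_to_ansi(data: bytes, oem: str = "cp850", ansi: str = "cp1252") -> bytes:
    """Convert OEM bytes to ANSI bytes, substituting '?' for any
    character the ANSI code page cannot represent."""
    return data.decode(oem).encode(ansi, errors="replace")


print(oem_to_ansi(b"\x94").hex())  # ö exists in both code pages
print(oem_to_ansi(b"\xc9"))        # a box-drawing char has no CP1252 equivalent
```

Once a character has been replaced, converting back cannot recover it, so a round trip through a second character set is not guaranteed to be lossless.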

  40. MobiLink character set conversion • upload and download take place in char set of remote database (Unicode for Windows CE) • MobiLink communicates with consolidated database using ODBC’s Unicode API • data received from client is converted to Unicode • data sent to client is converted from Unicode to client’s char set

  41. Character set for connection strings • collation of database is not known until connection is established • no char set conversion occurs during connection • if all machines use same char set, then no problems • otherwise, must use “compatible” characters • safest to use 7-bit ASCII, English letters, digits and some special chars
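
A simple guard for the 7-bit ASCII advice above might look like the following sketch. The helper name and the sample connection strings are hypothetical; the check itself relies on `str.isascii`, available in Python 3.7 and later.

```python
def is_portable_connection_string(conn_str: str) -> bool:
    """True if every character is 7-bit ASCII, so the bytes are the
    same in any ASCII-compatible client or server character set."""
    return conn_str.isascii()


ok = "UID=dba;PWD=sql"
risky = "UID=jörg;PWD=sql"

print(is_portable_connection_string(ok))
print(is_portable_connection_string(risky))

# The non-ASCII name produces different bytes on ANSI and OEM machines.
print(risky.encode("cp1252") != risky.encode("cp850"))
```

Since no conversion happens before the connection exists, a string that encodes differently on the two machines will simply not match.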

  42. Topics Locales

  43. Locales • language • character set • collation label (used by server only when creating new database)

  44. Locales - language

  45. Locales - character set • examples: CP1252, CP932 • see documentation for complete list

  46. Locales - collation label • examples: 1252LATIN1, SJIS2 • see documentation for complete list

  47. SQLLOCALE environment variable • CS=charset;LANG=langcode;LABEL=label • force a language: set sqllocale=lang=de • force a char set: set sqllocale=cs=cp850 • only works if server is using -ct
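
The `CS=charset;LANG=langcode;LABEL=label` format can be parsed with a few lines of code. This is a sketch of the syntax shown on the slide only; `parse_sqllocale` is a hypothetical helper, and treating keys as case-insensitive is an assumption based on the slide using both upper- and lower-case forms.

```python
def parse_sqllocale(value: str) -> dict[str, str]:
    """Split a SQLLOCALE-style string into a dict of its settings."""
    settings = {}
    for part in value.split(";"):
        if not part.strip():
            continue  # tolerate empty segments such as a trailing ';'
        key, sep, val = part.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=value, got {part!r}")
        settings[key.strip().upper()] = val.strip()  # keys assumed case-insensitive
    return settings


print(parse_sqllocale("cs=cp850;lang=de"))
print(parse_sqllocale("CS=cp1252;LANG=en;LABEL=1252LATIN1"))
```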

  48. Topics Programming tips

  49. Internationalization • isolate all locale-dependent parts of your code • dates and times • currency and numeric formatting • carefully write all string processing code • many built-in string processing functions (strcmp, strlwr) cannot be used because they are not locale-aware • resources: strings, dialogs, menus, accelerators • lots of comments • documentation

  50. String resources • each string should consist of a complete thought • do not build sentences from phrases • substitution is OK, but allow for specifying position • be careful of maximum string length (German >> English!)
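
The advice about complete thoughts and positional substitution can be sketched with named placeholders. The message table and its translations are invented for illustration; a real application would load them from its resource files.

```python
# Each entry is one complete sentence; translators may reorder placeholders.
MESSAGES = {
    "en": "Copied {count} files to {folder}.",
    "de": "In {folder} wurden {count} Dateien kopiert.",
}


def copied_message(lang: str, count: int, folder: str) -> str:
    """Format the whole sentence from a single translatable string."""
    return MESSAGES[lang].format(count=count, folder=folder)


print(copied_message("en", 3, "backup"))
print(copied_message("de", 3, "backup"))
```

Because the placeholders are named rather than appended in a fixed order, the German sentence can put the folder first without any change to the calling code. Note also that the German string is longer than the English one, which matters for fixed-size buffers and dialog layouts.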
