Introduction to Character Encodings, Java and You
Agenda • Defining the problem • Where webMethods products encounter character set problems. • What the symptoms look like. • Understand core concepts • What is a character set? What’s an encoding? • What is Unicode, really? • Code Examples to avoid problems Private and Confidential
Confusion Reigns • Generally, the most confusing aspect of internationalization. • Many, many standards to choose from. • Arcane terminology. • American programmers rarely seem to encounter it head-on. • We’re presenting this because many of our products are encountering this problem now.
Problem Domain • webMethods products interface with: • non-Java systems (for example, in the adapters) • non-Java environments (file systems, databases, libraries, email, ftp, http, etc.).
Java’s Text Representation • Java provides a convenient text processing architecture centered on the Java String object. • A Java String is basically an array of Java characters.
Java Characters • Each Java Character object represents a Unicode character. • (Currently) a 16-bit unsigned integer value between 0 and 65,535. • Character class provides access to character properties. • UPPER, lower, and Titlecase mapping • Comparison • Directionality • Compatibility • C-TYPE values such as ‘alpha-ness’, ‘digit-ness’, ‘alphanumeric-ness’
Non-Java Text • Non-Java files, applications, filesystems, databases, et al. typically do not use Unicode. Java sees them as an array of bytes (byte[]).
Three Problems
Bad Conversion • The target character set doesn’t have this character in it, so Java replaces each such character with a “?”. • Input String: 日本語 • Output String: ??? • Typically: • Using the default encoding when we meant to specify one. • Writing on a device (such as System.out) whose legacy encoding doesn’t support the characters.
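The “???” symptom can be reproduced with a short round trip. This is a minimal sketch, not code from the original deck: the class and method names are made up for illustration, and ISO-8859-1 stands in for any legacy encoding that lacks the Japanese characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BadConversionDemo {
    // Encodes the text into the given charset and decodes it again with the
    // same charset, so any damage comes from the encoding step alone.
    static String roundTrip(String text, Charset charset) {
        // getBytes replaces unmappable characters with the charset's replacement byte
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // 日本語, written as escapes so the source file encoding does not matter
        String nihongo = "\u65E5\u672C\u8A9E";
        // Latin-1 has no Japanese characters: prints ???
        System.out.println(roundTrip(nihongo, StandardCharsets.ISO_8859_1));
        // UTF-8 can represent every character: prints true
        System.out.println(roundTrip(nihongo, StandardCharsets.UTF_8).equals(nihongo));
    }
}
```

Once the characters have been replaced, the original text cannot be recovered from the bytes.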
“No Glyph” • Java knows what the character is and is handling it properly, but doesn’t have a picture of it to show you (in the current Font selected). • Input String: 日本語 • Output String: □□□ (empty boxes or a similar placeholder glyph) • Typically: • Nothing is wrong, just using the wrong Font.
Random Trash • A byte[] was converted using the wrong character encoding. Bytes were mapped to the wrong characters. • Input String: 日本語 • Output String: ú{ê • Typically: • Using the wrong encoding, the underlying bytes are mapped to different, random-seeming characters.
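Examples • Same byte sequence (0xE0, 0x41, 0x83, 0x70), different results: • Read as Shift JIS: 0xE041, 0x8370 = “漓パ” • Read as Latin-1: 0xE0, 0x41, 0x83, 0x70 = “àAp” (0x83 is an invisible C1 control character) • Read as UTF-16 (Java String): 0xE041, 0x8370 = U+E041 (a private-use code point with no standard glyph) followed by “荰” • The intended Java String: “漓パ” = U+6F13 U+30D1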
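The same four bytes can be decoded three ways in a few lines of Java. This is an illustrative sketch; the class and helper names are invented here, and it assumes the runtime ships the Shift_JIS charset (standard JDK distributions do). Results are printed as code points so the output does not depend on the console font or encoding.

```java
import java.nio.charset.Charset;

class WrongEncodingDemo {
    // The four bytes from the slide above.
    static final byte[] BYTES = { (byte) 0xE0, 0x41, (byte) 0x83, 0x70 };

    // Decodes the same bytes with whichever charset the caller names.
    static String decode(String charsetName) {
        return new String(BYTES, Charset.forName(charsetName));
    }

    // Formats each UTF-16 code unit as U+XXXX.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The intended reading: two Japanese characters.
        System.out.println("Shift_JIS : " + codeUnits(decode("Shift_JIS")));
        // One character per byte, including an invisible C1 control (U+0083).
        System.out.println("ISO-8859-1: " + codeUnits(decode("ISO-8859-1")));
        // Each byte pair taken as one 16-bit code unit.
        System.out.println("UTF-16BE  : " + codeUnits(decode("UTF-16BE")));
    }
}
```

None of the three decodings raises an error, which is why this class of bug usually surfaces only when someone looks at the text.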
What is a Character? • A character is a single, atomic unit of text. • The definition has a different meaning according to the writing system and context.
Abstract characters • Some abstract characters include: • A Roman Letter Capital A • ` Combining Accent Grave • に Hiragana character “ni” • 語 CJK Ideograph • ي Arabic letter • 앚 Hangul syllable • Ａ Fullwidth compatibility letter A
What is a Character Set? • A character set is a “set”--- a collection of characters, usually organized in some fashion. • You’re probably most familiar with ASCII: • 0x41 ‘A’ • 0x42 ‘B’ • Etc.
What is a Character Encoding? • Character set: a collection of characters, basically, a bucket. • Character encoding: the specific ones and zeroes assigned to a character set. • Character Set: ‘A’ == 0x41 • Character Encoding: ‘A’ == 0x41
Eight Bit Encodings • 8-bit encodings allow for 256 characters: • 128 ASCII • 32 ‘C1’ controls • 96 extended
Latin-1 • The standard for Western Europe is generally ISO-8859-1 • AKA “Latin-1” • Used by UNIX systems and the Web. • Extended version used by Microsoft for Windows.
Let a Thousand Encodings Bloom… • Each language has its own character set… • Everywhere: ASCII* • Western European (like German or French): Latin-1 • Eastern European (like Polish or Slovak): Latin-2 • Simplified Chinese: GB2312
Actually, many for each language…
Other Writing Systems • Writing systems vary around the world (in order of increasing complexity, more or less): • Latin-based alphabets • (ABCDEFG…) English • Cyrillic and Greek-based alphabets • (АБВГДЕЖЩ...) Russian • Ideographic writing systems have thousands of characters • (一丁勺両亀困...) Japanese • Bi-directional (RTL) languages go right to left • (...זוהדגבא) Hebrew • Complex scripts (everything else): • (ऋऌऍऎ )Devanagari
Expanded Character Sets • Most languages have alphabetic or phonetic writing systems: • Russian, Greek, Slavic, (many) Native American, Bahasa, Hebrew, Arabic, Semitic, etc.: alphabetic • Indian (subcontinent), Thai, Japanese kana, Korean: phonetic writing systems • 8 bits is enough for all of the above (with some tricks) • Some languages use scripts based on Chinese ideographic writing (“Han” or “Hanja”): • Chinese • Korean • Vietnamese (traditional) • Japanese Kanji
“Double-Byte” • 8-bit character encodings use eight bits per character. • 2^8 = 256 characters • Must “double-byte” character sets use 2 bytes per character? • 2^16 = 65,536 characters • Should actually be called “multi-byte” (MBCS). • Each character can be ONE, TWO, THREE and sometimes FOUR bytes in length. • MAY involve shift states.
Multibyte Encodings • A typical Japanese character set: JIS X 0208 (example text: 漢字) • Character encodings of JIS X 0208: • Shift-JIS (CP932): 0x8A 0xBF 0x8E 0x9A • EUC-JP: 0xB4 0xC1 0xBB 0xFA • ISO 2022-JP: 0x1B 0x24 0x42 0x34 0x41 0x3B 0x7A 0x1B 0x28 0x4A • Non-legacy: UTF-16: 0x6F22 0x5B57
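The byte sequences above can be checked by encoding the same two characters under each charset name. This is a rough sketch with made-up class and method names, assuming the JDK in use includes the Japanese charsets. The ISO-2022-JP output is printed without comment because the closing escape sequence a given encoder emits may differ from the one on the slide.

```java
import java.nio.charset.Charset;

class EncodingBytesDemo {
    // Encodes the text with the named charset and returns the bytes as hex.
    static String hex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // mask to 0..255 so negative bytes format as two hex digits
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String kanji = "\u6F22\u5B57"; // 漢字
        String[] names = { "Shift_JIS", "EUC-JP", "ISO-2022-JP", "UTF-16BE", "UTF-8" };
        for (String name : names) {
            System.out.println(name + ": " + hex(kanji, name));
        }
    }
}
```

One character set (JIS X 0208) yields three different legacy byte sequences, which is the distinction between a character set and a character encoding.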
An MBCS Example: Shift-JIS • Character set used by DOS, Windows, Macs, and a few UNIX-like systems for Japanese. • Code Page 932 • JIS X 0208:1997
Shift-JIS • In order to reach more characters, double-byte values start with a limited range of “lead bytes”. • These can be followed by any “trail byte” value in the range 0x40 to 0xFC (except 0x7F).
Shift-JIS • Each “lead byte” provides a “window” onto additional characters.
Shift-JIS • Problems: • Lead byte values are also valid as trail bytes. • Common special characters (“\”!!) are valid trail bytes.
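The trail-byte problem is easiest to see with katakana ソ (U+30BD), whose Shift-JIS form is 0x83 0x5C: the second byte is the same value as an ASCII backslash. The sketch below is illustrative only. The names are invented, the lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) are the ones commonly given for CP932, and the sample bytes are written by hand so the example does not depend on any charset being installed.

```java
class ShiftJisScanDemo {
    // Lead-byte ranges commonly given for Shift-JIS / CP932.
    static boolean isLeadByte(int b) {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    // Naive scan: counts every 0x5C byte, even when it is a trail byte.
    static int naiveBackslashCount(byte[] sjis) {
        int count = 0;
        for (byte b : sjis) {
            if ((b & 0xFF) == 0x5C) {
                count++;
            }
        }
        return count;
    }

    // Encoding-aware scan: skips the trail byte that follows each lead byte.
    static int realBackslashCount(byte[] sjis) {
        int count = 0;
        for (int i = 0; i < sjis.length; i++) {
            int b = sjis[i] & 0xFF;
            if (isLeadByte(b)) {
                i++; // the next byte belongs to this character
            } else if (b == 0x5C) {
                count++;
            }
        }
        return count;
    }

    // "C:\" in ASCII followed by katakana SO (0x83 0x5C in Shift-JIS).
    static byte[] sample() {
        return new byte[] { 0x43, 0x3A, 0x5C, (byte) 0x83, 0x5C };
    }

    public static void main(String[] args) {
        System.out.println("naive: " + naiveBackslashCount(sample())); // naive: 2
        System.out.println("real : " + realBackslashCount(sample()));  // real : 1
    }
}
```

Byte-oriented code that treats 0x5C as a path separator or escape character will corrupt such text; decoding to a String first avoids the issue.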
Han • CJK scripts require up to 100,000 unique characters for complete representation. • Four major variants: • Traditional Chinese • Simplified Chinese • Japanese Kanji • Korean (non-Hangul)
“Kanji” • Sometimes you hear Japanese called “kanji”. • Kanji is actually one of four writing systems used in Japan. • Kanji should be avoided as a generic term for DBCS. • Kanji (“Han” or Chinese writing): 日本語 • Hiragana (phonetic for Japanese words): にほんご • Katakana (phonetic for “foreign” words): ニホンゴ • Romaji (“Roman script”): nihongo
Chinese • Upper two are Traditional. • Lower character is the Simplified variant.
Hangul • Korean Hangul is a syllabic phonetic system, which has thousands of combinations. • Hangul is not related to Han ideographic writing.
Code Page Hell • With hundreds of encodings and character sets to choose from, making internationalized code work in the late 1980s and early 1990s was “hellish”. • Internationalization folks referred to this as “code page hell”.
Unicode and Java To the Rescue
Unicode (ISO 10646-2) • Unicode is a character set that supports all of the world’s languages and writing systems.* • Originally designed as a “wide character set”: every character was represented by 16 bits. This allowed for 65,536 potential characters. • Extended to allow about 1.1 million characters. • Unicode is maintained by an industry consortium. ISO 10646-2 is maintained by WG2. The two define identical characters and code points.
It’s a character set? • Unicode is a character set. It has these encodings: • UTF-32 (BE/LE) • A 32-bit encoding. All characters are 32 bits. • UTF-16 (BE/LE) • A 16-bit encoding. Characters in the “Basic Multilingual Plane” (U+0000 to U+FFFF) take one 16-bit unit. • Characters above 0xFFFF require a pair of special “surrogate” code units. • UTF-8 • An 8-bit variable-width encoding. Characters are 1, 2, 3 or 4 bytes long. No byte-order (endian) variants. • ASCII == ASCII • All other characters have a special bit pattern
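The surrogate mechanism can be sketched by splitting a code point above 0xFFFF by hand and comparing against what a Java String holds. The class and method names are hypothetical, and U+10400 is just a convenient sample character outside the Basic Multilingual Plane.

```java
class SurrogateDemo {
    // Splits a supplementary code point (above U+FFFF) into its UTF-16
    // surrogate pair: 10 high bits go to the first unit, 10 low bits to the second.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;
        char high = (char) (0xD800 + (offset >> 10));
        char low = (char) (0xDC00 + (offset & 0x3FF));
        return new char[] { high, low };
    }

    // Number of 16-bit code units the code point occupies in a Java String.
    static int codeUnitLength(int codePoint) {
        return new String(Character.toChars(codePoint)).length();
    }

    public static void main(String[] args) {
        int cp = 0x10400; // a sample character outside the BMP
        String s = new String(Character.toChars(cp));
        System.out.println(s.length());                      // 2 (code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (character)
        char[] pair = toSurrogates(cp);
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1])); // D801 DC00
    }
}
```

The practical consequence is that String.length() counts code units, not characters, so code that indexes by char can split a surrogate pair.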
UTF-8 Bit Pattern • ASCII == ASCII • 0x41 == ‘A’ • All other characters are multibyte. • 110xxxxx == two bytes • 1110xxxx == three bytes • 11110xxx == four bytes • 10xxxxxx == trail byte • U+00C0 == À == 0xC3 0x80 (11000011 10000000)
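The bit patterns above translate directly into shifts and masks. The following is a teaching sketch rather than production code: the names are invented, the input is assumed to be a valid code point that is not a surrogate, and in real code the JDK's own UTF-8 encoder should be used. The helper `matchesJdk` compares the hand-rolled result against that encoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8EncodeDemo {
    // Hand-rolled UTF-8 encoder for one code point (assumed valid, non-surrogate).
    static byte[] encode(int cp) {
        if (cp < 0x80) {                 // 0xxxxxxx: ASCII stays ASCII
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {         // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    // Checks the sketch against the JDK's UTF-8 encoder.
    static boolean matchesJdk(int cp) {
        byte[] expected = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(expected, encode(cp));
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xC0, 0x65E5, 0x10400 };
        for (int cp : samples) {
            System.out.println(String.format("U+%04X: %d byte(s), matches JDK: %b",
                    cp, encode(cp).length, matchesJdk(cp)));
        }
    }
}
```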
Convenience Method for UTF8 • Almost True: readUTF and writeUTF allow direct access to UTF-8 DataInput/DataOutputStreams. • This is not really UTF-8, but a Sun specialized version. • Use InputStreamReader/OutputStreamWriter to do proper conversions.
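The difference between writeUTF output and real UTF-8 shows up with the NUL character and with characters outside the Basic Multilingual Plane. A minimal sketch, with illustrative class and method names: writeUTF prefixes a two-byte length, writes NUL as 0xC0 0x80, and writes each surrogate separately as three bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Demo {
    // Returns the exact bytes DataOutputStream.writeUTF produces for the string.
    static byte[] viaWriteUTF(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bos);
            dos.writeUTF(s);
            dos.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\0";
        System.out.println(hex(viaWriteUTF(nul)));                      // 00 02 C0 80
        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8)));  // 00

        String supplementary = new String(Character.toChars(0x10400));
        // 2 length bytes + 3 bytes per surrogate = 8 bytes
        System.out.println(viaWriteUTF(supplementary).length);
        // standard UTF-8 uses a single 4-byte sequence
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Data written with writeUTF should therefore be read back with readUTF, not handed to another system as UTF-8.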
Java Uses Unicode • Every character in every Java String object is encoded as UTF-16 Unicode. • Every string is converted from a legacy encoding, either by the compiler or by the String class. • This is the reason for native2ascii and -encoding switches. • Once you have a String object, everything is Unicode UTF-16.
“Special” encodings • There are two encodings that the system treats as special: • file.encoding • ISO-8859-1 • All basic conversion functions use your system default encoding. • Most servlet conversion functions use ISO-8859-1 as the default.
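The default-encoding dependency can be made visible by printing the platform settings and comparing an implicit conversion with an explicit one. This is a small sketch with invented names; what it prints for the default charset, and whether the implicit and explicit results agree, depends on the machine and JDK version it runs on.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultEncodingDemo {
    // Explicit conversion: the result is the same on every platform.
    static String decodeExplicit(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        // The UTF-8 bytes for 日 (U+65E5)
        byte[] utf8 = { (byte) 0xE6, (byte) 0x97, (byte) 0xA5 };

        // Implicit conversion: uses whatever the platform default happens to be.
        String implicit = new String(utf8);
        String explicit = decodeExplicit(utf8);

        System.out.println("implicit matches explicit: " + implicit.equals(explicit));
    }
}
```

On a machine whose default is a legacy code page, the last line reports false, which is the “works on my machine” form of this bug.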
Two File Encodings • Windows systems generally have two different file encodings: • “ANSI” encoding is the Windows default code page for GUI applications. • “OEM” encoding is the code page used by the ‘cmd’ or ‘command’ interpreter shells.
Stream Readers and Writers • InputStreamReader and OutputStreamWriter classes perform controlled conversion between byte[] and String. • Always pass the encoding as a variable. • Use the IANA preferred name for the encoding, if possible (see ftp://ftp.isi.edu/in-notes/iana/assignments/) • Prefer UTF-8 for on-the-wire transport.
Code Sample
import java.io.*;
// use with any type of InputStream class
InputStream is = new FileInputStream(file);
InputStreamReader isr = new InputStreamReader(is, encoding);
// use BufferedReader for efficiency
BufferedReader br = new BufferedReader(isr);
StringBuffer sb = new StringBuffer();
int chr;
// read() returns -1 at end of stream
while ((chr = br.read()) > -1) {
    // cast to char so the character, not its numeric value, is appended
    sb.append((char) chr);
}
* Note: Try blocks eliminated for clarity.
OutputStreamWriter Code Sample
import java.io.*;
// use with any type of OutputStream class
OutputStream os = new FileOutputStream(file);
OutputStreamWriter osw = new OutputStreamWriter(os, encoding);
osw.write(myString, 0, myString.length());
// flush so buffered characters are converted and written out
osw.flush();
* Note: Try blocks eliminated for clarity.
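The two samples can be combined into a self-contained round trip with the try blocks restored. This is one possible arrangement, not the deck's UnicodeDemo program: the class and method names are invented, and in-memory streams replace files so the example runs anywhere.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

class StreamRoundTripDemo {
    // Converts a String to bytes using the named encoding.
    static byte[] write(String text, String encoding) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (Writer w = new OutputStreamWriter(bos, encoding)) {
                w.write(text);
            } // closing the writer flushes any buffered characters
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Converts bytes back to a String using the named encoding.
    static String read(byte[] bytes, String encoding) {
        try (Reader r = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), encoding))) {
            StringBuilder sb = new StringBuilder();
            int chr;
            while ((chr = r.read()) != -1) {
                sb.append((char) chr);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u65E5\u672C\u8A9E"; // 日本語
        byte[] bytes = write(text, "UTF-8");
        System.out.println(bytes.length);                           // 9
        System.out.println(read(bytes, "UTF-8").equals(text));      // true
        System.out.println(read(bytes, "ISO-8859-1").equals(text)); // false
    }
}
```

The encoding name is passed in as a parameter on both sides, following the advice on the Stream Readers and Writers slide.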
Character Class • Provides access to Unicode character properties. • UnicodeBlock inside class • Character getType (defined types) • isDigit • isLetter • isLetterOrDigit • isUpperCase/isLowerCase/isTitleCase • toUpperCase/toLowerCase/toTitleCase • isSpace/isWhitespace • isISOControl/isJavaIdentifierStart/isJavaIdentifierPart
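A few of these methods applied to non-ASCII characters show why they are preferable to hand-written range checks such as `c >= '0' && c <= '9'`. The sketch below uses an invented class name and sample characters chosen only for illustration.

```java
class CharacterPropsDemo {
    // Summarizes a handful of Unicode properties for one character.
    static String describe(char c) {
        return String.format("U+%04X block=%s letter=%b digit=%b upper=%b",
                (int) c,
                Character.UnicodeBlock.of(c),
                Character.isLetter(c),
                Character.isDigit(c),
                Character.isUpperCase(c));
    }

    public static void main(String[] args) {
        System.out.println(describe('A'));       // Latin capital letter
        System.out.println(describe('\u0663'));  // ARABIC-INDIC DIGIT THREE
        System.out.println(describe('\u65E5'));  // CJK ideograph 日

        // Numeric value of a non-ASCII digit
        System.out.println(Character.digit('\u0663', 10));       // 3

        // Titlecase differs from uppercase for some characters:
        // U+01C6 (dž) has titlecase U+01C5 (Dž) and uppercase U+01C4 (DŽ)
        System.out.println(Character.toTitleCase('\u01C6') == '\u01C5'); // true
        System.out.println(Character.toUpperCase('\u01C6') == '\u01C4'); // true
    }
}
```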
Normalization • Many characters have two (or more) representations in Unicode. • Normalization makes the sequences the same. • Simplifies user input parsing and validation.
ICU4J Normalizer Class • Four forms of Normalization: • Form C (canonical composed) • Form D (canonical decomposed) • Form KC (compatibility composed) • Form KD (compatibility decomposed) • Special handling for Hangul characters! • Note that there is a private class java.text.Normalizer in the JDK.
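The four forms can be tried out without ICU on newer JDKs. The sketch below assumes Java 6 or later, where java.text.Normalizer is a public API; it is not the ICU class named on the slide, and the class and helper names are made up for illustration.

```java
import java.text.Normalizer;

class NormalizerDemo {
    // Normalizes using a form name: "NFC", "NFD", "NFKC" or "NFKD".
    static String normalize(String s, String form) {
        return Normalizer.normalize(s, Normalizer.Form.valueOf(form));
    }

    // Formats each UTF-16 code unit as U+XXXX.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'e' + combining acute accent composes to a single character in Form C
        System.out.println(codeUnits(normalize("e\u0301", "NFC")));   // U+00E9
        // ...and the precomposed character decomposes again in Form D
        System.out.println(codeUnits(normalize("\u00E9", "NFD")));    // U+0065 U+0301

        // The "fi" ligature is a compatibility character:
        // the canonical forms keep it, the K forms replace it
        System.out.println(codeUnits(normalize("\uFB01", "NFC")));    // U+FB01
        System.out.println(codeUnits(normalize("\uFB01", "NFKC")));   // U+0066 U+0069

        // Hangul syllables decompose algorithmically into jamo
        System.out.println(codeUnits(normalize("\uD55C", "NFD")));    // U+1112 U+1161 U+11AB
    }
}
```

Normalizing both sides to the same form before comparing or validating user input is what makes visually identical strings compare equal.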
Demo Programs • UnicodeDemo – a Java program that demonstrates the byte sequences of different encodings and also provides some code that shows ISR and OSW in action. • Charsets – a Windows program by my buddy Bill Hall for playing with encodings. • http://www.inter-locale.com -- my personal website, with examples and demos of certain Java I18n things.
Questions? Addison Phillips aphillips@webmethods.com