齊來探討 GNU/Linux 中文化. Let's Explore Chinese internationalization and localization on GNU/Linux!. 霍東靈,即時系統科研有限公司 Anthony Fok, ThizLinux Laboratory Ltd. HKLUG Linux Talk, 13 April 2002. 概覽 Overview. 中文字符集及編碼簡介 Introduction to Chinese charsets and encodings GB 18030-2000 和 HKSCS-2001
齊來探討 GNU/Linux 中文化 Let's Explore Chinese internationalization and localization on GNU/Linux! 霍東靈,即時系統科研有限公司 Anthony Fok, ThizLinux Laboratory Ltd. HKLUG Linux Talk, 13 April 2002
概覽 Overview • 中文字符集及編碼簡介 Introduction to Chinese charsets and encodings • GB 18030-2000 和 HKSCS-2001 (GB 18030-2000 and HKSCS-2001) • GNU/Linux 系統上的中文 i18n/L10n 架構 Chinese i18n/L10n infrastructure on GNU/Linux • 如何參與中文化的工作 Participating in Chinese i18n/L10n • 待辦工作及未來展望 Todo list and future developments
中文字符集及編碼簡介 Chinese character sets and encodings
在起初,只有 0 和 1 In the beginning, there were only 0 and 1 • A computer sees all data as 0s and 1s • Each “on-off switch” unit is a “bit” (位元、比特) • 8 bits make up 1 “byte” or “octet” (位元組、字節) • 0000 0000 to 1111 1111 (0x00 to 0xFF) make up 256 code points • Initially, each character was stored in 1 byte • ASCII (ISO 646 IRV) • ISO 8859-1 to ISO 8859-16 (Latin1, Latin2, Greek, Hebrew, Thai, Cyrillic, etc.) • 256 code points are NOT enough for Chinese!
萬「碼」奔騰:眾多中文編碼標準 So many charsets and encodings! • The number of Chinese (Han) characters that have ever existed exceeds 100,000 • Unicode 3.2 / ISO 10646 includes over 70,000 • CCCII includes over 75,000 • Invented in China; adopted by Japan, Korea, and Vietnam: “CJKV” • Sources include: • 漢語大字典 (Hanyu Da Zidian) • 康熙字典 (Kangxi Zidian) • Regional Standards (GB, CNS, HKSCS, JIS, KSC)
1 byte not enough? Let's use more! • If all bits are available: • 1 byte, 8 bits, 2^8 = 256 (0x00..0xFF) • 2 bytes, 16 bits, 2^16 = 65,536 (0x0000..0xFFFF) • 3 bytes, 24 bits, 2^24 = 16,777,216 (0x000000..0xFFFFFF) • 4 bytes, 32 bits, 2^32 = 4,294,967,296 (0x00000000..0xFFFFFFFF) • Most legacy encodings must remain ASCII-compatible, so they cannot use all of that space
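The arithmetic on this slide can be checked with a short Python sketch. The function name `codespace_size` is made up for illustration, and the last line assumes an EUC-style scheme in which both bytes stay in 0xA1..0xFE so they never collide with ASCII.

```python
def codespace_size(num_bytes: int) -> int:
    """Number of distinct values in num_bytes bytes when every bit is usable."""
    return 2 ** (8 * num_bytes)


for n in range(1, 5):
    size = codespace_size(n)
    # Print the count plus the lowest and highest value, zero-padded to 2*n hex digits.
    print(f"{n} byte(s): {size:,} values (0x{0:0{2 * n}X}..0x{size - 1:0{2 * n}X})")

# Assumption: an ASCII-compatible 2-byte scheme that keeps both bytes in
# 0xA1..0xFE (94 values each), leaving far less than 65,536 cells.
euc_cells = (0xFE - 0xA1 + 1) ** 2
print(f"ASCII-compatible 94x94 grid: {euc_cells:,} cells")
```

The gap between 65,536 and the 94x94 grid is the price of ASCII compatibility that the last bullet refers to.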
GB 2312-80 • GB2312 is a national standard (國標, Guobiao) of mainland China • 《信息技術─信息交換用漢字編碼字符集─基本集》, published in 1980 • 2-byte, {0xA1-0xFE}{0xA1-0xFE}, or 94x94, for a total of 8,836 possible 2-byte code points • 6,763 Han characters, for a total of 7,445 characters • Sidenote: GB/T 12345 provides a Traditional Chinese charset encoded in the same space as GB 2312-80 • Called zh_CN.GB2312 or zh_CN.EUC-CN on GNU/Linux • Too few characters! (朱鎔基 -> 朱容基)
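The “too few characters” problem can be reproduced with Python's built-in codecs, which stand in here for the GNU/Linux locale encodings. This is a minimal sketch: the helper `encodable` is a made-up name, and the test string uses the simplified spelling 朱镕基 of the name on the slide.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


name = "朱镕基"      # simplified spelling; 镕 (U+9555) is the problem character
fallback = "朱容基"  # the substitute spelling people were forced to use

for enc in ("gb2312", "gbk", "gb18030"):
    print(enc, "name:", encodable(name, enc), "fallback:", encodable(fallback, enc))
```

The later encodings accept the real name because their repertoires are supersets of GB 2312-80.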
GBK 規範 Specification • China actively participates in ISO 10646 • GB13000.1 = Unicode 2.1 (ISO 10646-1993) • Too many legacy GB2312 applications • A migration plan with an intermediate solution was needed • GBK is the first step in that direction (1995) • Includes the repertoire of the CJK Unified Ideographs in GB13000.1 / Unicode 2.1 • U+4E00 to U+9FA5, over 20,000 Han ideographs • Backward compatible with GB2312 • Implemented in Simplified Chinese Windows 95 (CP936) • {0x81-0xFE}{0x40-0x7E, 0x80-0xFE}
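The byte ranges quoted for GBK and GB2312 can be turned into two small predicates to show how large the GBK codespace is and why it stays backward compatible. This is a sketch built only from the ranges on the slides; the function names are made up, and it checks byte structure only, not whether a cell is actually assigned.

```python
def is_gb2312_pair(lead: int, trail: int) -> bool:
    """True if (lead, trail) lies in the GB2312 94x94 grid."""
    return 0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE


def is_gbk_pair(lead: int, trail: int) -> bool:
    """True if (lead, trail) lies in the GBK 2-byte codespace (trail skips 0x7F)."""
    return 0x81 <= lead <= 0xFE and (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE)


all_pairs = [(lead, trail) for lead in range(256) for trail in range(256)]
gb2312_cells = sum(is_gb2312_pair(l, t) for l, t in all_pairs)
gbk_cells = sum(is_gbk_pair(l, t) for l, t in all_pairs)

# Every GB2312 pair is also a structurally valid GBK pair.
backward_compatible = all(is_gbk_pair(l, t) for l, t in all_pairs if is_gb2312_pair(l, t))

print(f"GB2312 cells: {gb2312_cells:,}")
print(f"GBK cells:    {gbk_cells:,}")
print("GB2312 is a subset of GBK:", backward_compatible)
```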
Big-5 「五大碼」 • A “round-table” standard made up by the “Big-5” companies in Taiwan • Implemented by all major Chinese OS's • 倚天 (ETen), 零一, 國喬, Traditional Chinese Windows, etc. • Not very well designed; the choice of characters was not properly standardized • Two characters are duplicated • Missing 「」 and other chars used in HK • In Taiwan, attempts to fix/extend Big5 basically failed (CMEX's Big-5+, Big-5E...)
First steps beyond Big-5 • 倚天 ETen added some characters (Hiragana, Katakana, 「裏、銹」, etc.); some call it Big5-ETen. It is the de facto Big5 standard on GNU/Linux • Microsoft Code Page 950 includes 「裏、銹」 etc., but not all of ETen's extensions • User-Defined Areas (UDA), Vendor-Defined Areas (VDA), EUDC (End-User Defined Characters), Private User Areas (PUA) • Different people use EUDC differently... a messy situation • The demise of CMEX's Big-5+ standard
Unicode / ISO 10646 • Unicode Consortium (Industry) • ISO/IEC 10646 (Academic/Int'l Standard) • The two joined efforts to produce Unicode / UCS • Universal Multiple-Octet Coded Character Set • ISO: Design, adding characters to repertoire • Unicode Consortium: Technical implementation • Code range: U+0000 to U+10FFFF • 1,114,112 possible code points
Unicode / ISO 10646 • Think “integers”: UCS-2, UCS-4 • Think “strings” • UTF-7 • UTF-8 • Variable width, 1 to 4 bytes (up to • UTF-16 • 16-bit code units; code points above U+FFFF are written as a surrogate pair (one high and one low surrogate taken from U+D800-U+DFFF), reaching up to U+10FFFF • UTF-32 • Fixed width 32-bit; limited to U+10FFFF as defined by Unicode (UCS-4 itself extends to U+7FFFFFFF)
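The surrogate mechanism is easier to follow with a worked example. The sketch below computes the UTF-16 surrogate pair for a code point above U+FFFF and compares it with Python's own `utf-16-be` codec. The function name `to_surrogates` is made up for illustration.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return high, low


cp = 0x20000  # first code point of CJK Extension B
high, low = to_surrogates(cp)
print(f"U+{cp:X} -> {high:04X} {low:04X}")

# Cross-check against the built-in codecs, and compare encoded sizes.
char = chr(cp)
print("UTF-16BE:", char.encode("utf-16-be").hex())
print("UTF-8 length: ", len(char.encode("utf-8")), "bytes")
print("UTF-32 length:", len(char.encode("utf-32-be")), "bytes")
```

A program that treats every 16-bit unit as one character (the UCS-2 view) sees two units here, which is why characters beyond U+FFFF are hard for such software.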
Unicode / ISO 10646 • ISO 10646-1:1993 • ISO 10646-1:2000 • ISO 10646-2:2001 • Unicode 3.2 just came out • More world languages are being researched and added, a truly worldwide effort.
香港增補字符集-2001 HKSCS-2001 • A brief history • GCCS (政府通用字庫 Government Common Character Set), 1995 • HKSCS-1999 • Official encoding name: BIG5-HKSCS (IANA Registry) • HKSCS-2001 • Actively promoted by ITSD • ITSD (HKSARG) wishes HKSCS-2001 to be implemented on GNU/Linux too, and actively assists the community by providing guidance and advice • Excellent official website, open standard (starts from http://www.digital21.gov.hk/eng/hkscs/)
香港中文字範例Sample HKSCS Chinese Text • 大家好!你同我一齊玩! • 李、仔、魚涌、深水 • 大廈/有啊! • (仲好似有五個粗口字……) Hehe...
GB 18030-2000 • GB 18030-2000 Standard • Rationale for a new standard: The 70,207+ unified Han ideographs in Unicode 3.1 won't all fit in the 2-byte codespace of the GBK specification • Full title: 《信息技術─信息交換用漢字編碼字符集─基本集的擴充》 (2000-03-17, 2000-11-30) • Further extends GBK by adding a 4-byte codespace • More than enough to cover U+0000 to U+10FFFF • Compatible with all future versions of ISO 10646 • Backward compatible with GB2312 and GBK
GB 18030-2000 • Why is GB18030 significant? • It solves a pressing issue in China. Finally, all people's names, geographic names, and ancient text can be properly processed • It is mandatory: all operating systems sold after 2001-08-31 must support GB18030 • Products must pass GB18030 certification to ensure proper input, editing, screen display, and printing of GB18030 text • Thiz Linux Desktop was awarded A+ Grade in GB18030 Certification Test!
GB 18030-2000 • 1-byte = ISO 646-IRV (US-ASCII) • {0x00-0x7F} • 2-byte =~ GBK • {0x81-0xFE}{0x40-0x7E, 0x80-0xFE} • 4-byte • Mapped linearly with Unicode while skipping all existing mappings • Can be calculated algorithmically • {0x81-0xFE}{0x30-0x39}{0x81-0xFE}{0x30-0x39}
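“Can be calculated algorithmically” can be illustrated for the simplest part of the 4-byte space. The sketch below covers only the supplementary planes (U+10000 to U+10FFFF), which map linearly starting at the byte sequence 90 30 81 30. The 4-byte codes for BMP characters need the published range table and are not handled here. The function name is made up, and Python's `gb18030` codec is used only as a cross-check.

```python
def gb18030_supplementary(code_point: int) -> bytes:
    """4-byte GB 18030 code for a supplementary-plane code point (sketch only)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("this sketch handles U+10000..U+10FFFF only")
    index = code_point - 0x10000
    # Mixed-radix decomposition: bytes 2 and 4 have 10 values (0x30-0x39),
    # byte 3 has 126 values (0x81-0xFE); byte 1 starts at 0x90 for U+10000.
    index, d4 = divmod(index, 10)
    index, d3 = divmod(index, 126)
    d1, d2 = divmod(index, 10)
    return bytes([0x90 + d1, 0x30 + d2, 0x81 + d3, 0x30 + d4])


for cp in (0x10000, 0x20000, 0x10FFFF):
    computed = gb18030_supplementary(cp)
    builtin = chr(cp).encode("gb18030")
    print(f"U+{cp:06X} -> {computed.hex()} (codec agrees: {computed == builtin})")
```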
GB 18030-2000 • Official information is hard to find • The printed version of the GB18030 standard is hard to obtain outside China • Fortunately, many early implementers and charset experts have provided info: • Dirk Meyer (Adobe) translated the summary • Markus Scherer (IBM, Unicode Consortium) provides the gb-18030-2000.xml conversion table • Much effort and interest from others, including ThizLinux Laboratory
UnicodeData.txt, Unihan.txt • UnicodeData.txt • Important information on the character repertoires and control codes in Unicode • Unihan.txt • Valuable information (attributes) of over 70,000 CJK Unified ideographs • Source • Pronunciations in CJKV (+ Cantonese and Mandarin) • Meaning
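To see how such per-character attributes can be consumed, here is a minimal parsing sketch. It assumes the tab-separated `U+XXXX<TAB>field<TAB>value` layout used by the Unihan data. The sample lines are illustrative only and are not copied from any particular release, and `parse_unihan` is a made-up helper name.

```python
from collections import defaultdict

# Illustrative sample in the assumed Unihan layout; real files are far larger.
SAMPLE = (
    "# comment lines start with a hash\n"
    "U+4E00\tkCantonese\tjat1\n"
    "U+4E00\tkDefinition\tone; a, an; alone\n"
    "U+4E00\tkMandarin\tyī\n"
)


def parse_unihan(lines):
    """Collect {code point: {field: value}} from tab-separated Unihan-style lines."""
    records = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        code, field, value = line.split("\t", 2)
        records[int(code[2:], 16)][field] = value  # strip the "U+" prefix
    return records


records = parse_unihan(SAMPLE.splitlines())
for cp, fields in records.items():
    print(f"U+{cp:04X} {chr(cp)}: {fields}")
```

An input method or dictionary tool could build its pronunciation tables from a structure like this.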
實施 HKSCS 和 GB18030 的難處 Difficulties in implementing HKSCS and GB18030 • HKSCS-2001 • CJK Extension B etc. (U+20000 – U+2FFFF), but not all programs support code points beyond U+FFFF yet • Lack of fonts • GB18030 • Huge! 4-byte • Certification • Fonts are available but expensive (TrueType or bitmap) • Both are Unicode solutions, so as Unicode support improves, so will HKSCS and GB18030 support
其他中文編碼標準 Other Chinese encoding standards • CCCII (Chinese Character Codes for Information Exchange) • http://public.ptl.edu.tw/publish/suyan/42/text_07.htm • CNS 11643 • Big-5+, Big-5E • Encoding based on Cangjie (倉頡) • And many more
GNU/Linux 及 *BSD 中文化團隊 Chinese localization teams for GNU/Linux and *BSD • CLE (Chinese GNU/Linux Extension) • A group of pioneering volunteers originally led by Platin (小虫) • Debian Chinese Project (Debian 中文計劃) • FreeBSD Chinese localization team (FreeBSD 中文化小組) • Translation teams in mainland China, Hong Kong, and Taiwan • Many more CJKV and i18n/L10n teams worldwide, with Chinese and non-Chinese members alike!
各大中文 GNU/Linux 發行版本 Major Chinese GNU/Linux Distributions • Major Chinese GNU/Linux distributions • Thiz Linux Desktop 6.0 (即時 Linux 桌面環境 6.0) • Turbolinux 7.0 Chinese edition • Chinese 2000 (中文 2000) • Xteam (沖浪), Red Flag (紅旗), COSIX (中軟), Happy (幸福), Linpus (百資), XLinux (網虎) • Well-known international distributions with Chinese support • Debian GNU/Linux, Red Hat Linux, Linux Mandrake, (SuSE, Slackware), FreeBSD
GNU C Library (GLIBC) • Libc5 • Glibc 2.1 • Glibc 2.2 • Conversion tables • Big5 (CLE), GBK (Justin Yu, Sean Chen) • big5hkscs.c (Roger So, Ulrich Drepper, ThizLinux, James Su) • GB18030 (Wu Jian, Ulrich, ThizLinux, James Su, another version by Yu Shao)
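What these conversion tables do can be shown from user space. The sketch below uses Python's built-in codecs (`big5`, `big5hkscs`, `gbk`, `gb18030`) as stand-ins. It does not call glibc's iconv, and the helper `transcode` is a made-up name.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert encoded bytes from one charset to another via Unicode."""
    return data.decode(source).encode(target)


text = "你好"
big5_bytes = text.encode("big5")
gbk_bytes = text.encode("gbk")
print("Big5:", big5_bytes.hex(), " GBK:", gbk_bytes.hex())

# The same text moves between the Traditional and Simplified Chinese
# encodings through the shared Unicode repertoire.
print("Big5 -> GB18030:", transcode(big5_bytes, "big5", "gb18030").hex())
print("GBK -> Big5-HKSCS:", transcode(gbk_bytes, "gbk", "big5hkscs").hex())
```

Because GB18030 and Big5-HKSCS extend GBK and Big5, the converted bytes for these two common characters are identical to the GBK and Big5 bytes.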
XFree86 / X 視窗系統 X Window System • XFLD, fontset • Xrender / Xft (Keith Packard) • X-TT, “freetype” module • Addition of Big5-HKSCS encodings (Roger So) • Addition of GB18030 encoding (James Su et al.)
GTK+ and GNOME • GNOME 1.x • Charset handling based on Glibc and XFree86 • Good, but not perfect • GNOME 2.0 (in development) • Pango • Xft
Qt 3.0.4 and KDE 3.0.1 • Qt comes with its own “codecs” in order to be a multiplatform toolkit. • Somewhat tedious... the tables already created for Glibc must be re-created for Qt • We cannot directly use Glibc's code because of licensing issues... No big deal, just extra effort. • Good Unicode support; handles everything with Unicode internally. • Currently supports only UCS-2, which poses challenges for HKSCS-2001 (it needs characters beyond U+FFFF)
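中文輸入平台 Chinese Input Method Servers • XCIN • Chinput • miniChinput • magicChinput • 楊春白雪 • MyIM
中文輸入法 Chinese Input Methods • Cangjie (倉頡) • Array30 (行列30) • Dayi (大易) • Wubi (五筆字型) • 智能ABC, Smart Pinyin (智能拼音) • Hybrid methods (混合) • Many others
中文字型 Chinese fonts • Arphic (文鼎) • AR PL Mingti2L Big5 • AR PL SungtiL GB • AR PL KaitiM Big5 • AR PL KaitiM GB • DynaLab (華康) • Founder (方正) • Ten GNU GPL Chinese fonts by 王漢忠 • Unfortunately, the format is not very suitable...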
Web Browsers • Netscape 4.79 • Mozilla 0.9.9 • Dillo, Galeon, etc. • Konqueror
CJK LaTeX and FreeType • CJK LaTeX Written by Werner Lemberg from Germany • Yes, Werner can speak Chinese too! Amazing! • FreeType 1.3.1 and FreeType 2.0.9: • TrueType (and Type1, BDF etc.) font library • Main authors: David Turner, Robert Wilhelm, Werner Lemberg
PostScript and PDF • Ghostscript + CJK (GS-CJK) • Adobe's CMaps (HKscs, GBK2K, etc.) • Acrobat Reader 4.05 for Linux does not come with the CMaps (HKscs and GBK2K) that are already in Acrobat Reader 5.0 • Ghostscript and XPDF are constantly improving
Office Suites • OpenOffice.org family (Thiz Office, Kai Office, Red Office) • Chinese support improving, a joint effort • Excellent i18n/L10n support for all languages • HancomOffice • Will be based on Qt 3 • qbig5hkscscodec.cpp for Qt2 provided by ThizLinux Laboratory; Hancom ported the code for Qt3 • Lightweight: AbiWord and Gnumeric • Quite good too!
如何參與 GNU/Linux 中文化 How to participate in i18n efforts • Improve existing infrastructure • Work on new areas • Help with localization and translation efforts • Join a project that you like, whether it is Chinese i18n/L10n related or not • Help spread the word! :-)
PO 翻譯 PO Translation • GNOME 2.0 • KDE 3.0 • GNU Utilities • Gettext tools • PO / MO formats • Usage and encoding issues • Better to leave a string untranslated than to mistranslate it (寧可不譯,不可誤譯) • e.g. “anti-aliased fonts” mistranslated as 「非化名的字型」 (“non-alias fonts”); better: 平滑字型 or 反鋸齒字型
參考網站 Reference Websites • http://cle.linux.org.tw/ • http://xcin.linux.org.tw/ • http://www.debian.org.hk/intl/zh/ • http://linuxfab.cx/ • http://www.linuxforum.net/ • http://www.unicode.org/ • Chu Bong-Foo's studio (朱邦復先生工作室) http://www.cflabs.com/ • http://www.google.com/
待辦工作 / TODO • Some programs still need to be revised to conform to the i18n/L10n infrastructure • There is always room for improvement in ease of use, completeness, and stability • Wider participation is welcome
未來發展 Future Developments and Opportunities • 手寫板 Handwriting Pad • 語音識別 Voice Recognition • Smarter Cantonese input methods? • IIIMF to replace XIM? • OpenType to replace TrueType? • More interesting Chinese-language research based on GNU/Linux systems?
Comments and Suggestions • All skills are useful, even if you are not in CS, CE or EE! • Mathematics, Physics theory • C, C++, Perl, Python, GTK, Qt • IPA, Jyutping, Japanese, Korean... • e.g. the author of XCIN studied Physics... • 語言學 Linguistics, 語音學 Phonetics • What we can learn during the process • Skills development, learning English, learning other new languages, meeting friends, and much more!
歡迎任何問題! Questions? :-)