
齊來探討 GNU/Linux 中文化

Let's Explore Chinese internationalization and localization on GNU/Linux! 霍東靈 (Anthony Fok), 即時系統科研有限公司 (ThizLinux Laboratory Ltd.). HKLUG Linux Talk, 13 April 2002.



  1. 齊來探討 GNU/Linux 中文化 • Let's Explore Chinese internationalization and localization on GNU/Linux! • 霍東靈 (Anthony Fok), 即時系統科研有限公司 (ThizLinux Laboratory Ltd.) • HKLUG Linux Talk, 13 April 2002

  2. Overview • Introduction to Chinese charsets and encodings • GB 18030-2000 and HKSCS-2001 • Chinese i18n/L10n infrastructure on GNU/Linux • Participating in Chinese i18n/L10n • Todo list and future developments

  3. Introduction to Chinese character sets and encodings

  4. In the beginning, there were only 0s and 1s • A computer sees all data as 0s and 1s • Each “on-off switch” unit is a “bit” (位元、比特) • 8 bits make up 1 “byte” or “octet” (位元組、字節) • 0000 0000 to 1111 1111 (0x00 to 0xFF) make up 256 code points • Initially, each character was stored in 1 byte • ASCII (ISO 646 IRV) • ISO 8859-1 to ISO 8859-16 (Latin1, Latin2, Greek, Hebrew, Thai, Cyrillic, etc.) • 256 code points are NOT enough for Chinese!

  5. A stampede of codes (萬「碼」奔騰): so many Chinese charsets and encodings! • The number of Chinese (Han) characters that have ever existed exceeds 100,000 • Unicode 3.2 / ISO 10646 includes over 70,000 • CCCII includes over 75,000 • Invented in China; adopted by Japan, Korea, and Vietnam: “CJKV” • Sources include: • 漢語大字典 (Hanyu Da Zidian) • 康熙字典 (Kangxi Zidian) • Regional Standards (GB, CNS, HKSCS, JIS, KSC)

  6. 1 byte not enough? Let's use more! • If all bits are available: • 1 byte, 8 bits, 2^8 = 256 (0x00..0xFF) • 2 bytes, 16 bits, 2^16 = 65,536 (0x0000..0xFFFF) • 3 bytes, 24 bits, 2^24 = 16,777,216 (0x000000..0xFFFFFF) • 4 bytes, 32 bits, 2^32 = 4,294,967,296 (0x00000000..0xFFFFFFFF) • Most legacy encodings must ensure ASCII compatibility, so they cannot use all of this space
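The counts on this slide follow directly from powers of two. A minimal sketch in Python (chosen here only for illustration; the helper name `code_space` is hypothetical and not from the talk) reproduces them, along with the 94x94 grid that results when both bytes of a legacy 2-byte encoding are confined to 0xA1-0xFE to stay clear of ASCII:

```python
def code_space(num_bytes: int) -> int:
    """Number of distinct values when every bit of every byte is usable."""
    return 2 ** (8 * num_bytes)


for n in range(1, 5):
    width = 2 * n  # hex digits needed to print n bytes
    top = code_space(n) - 1
    print(f"{n} byte(s), {8 * n} bits: {code_space(n):,} values "
          f"(0x{0:0{width}X}..0x{top:0{width}X})")

# An ASCII-compatible 2-byte scheme that keeps both bytes in 0xA1..0xFE
# can use only a small part of the 65,536 two-byte values.
usable_per_byte = 0xFE - 0xA1 + 1
print(usable_per_byte, usable_per_byte ** 2)
```

The last line shows why "more bytes" alone does not solve the problem for legacy encodings: the ASCII-compatibility constraint shrinks the usable space dramatically.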

  7. GB 2312-80 • GB2312 is a national standard (國標, Guobiao) of mainland China • 《信息技術─信息交換用漢字編碼字符集─基本集》 (Chinese character coded character set for information interchange, basic set), published in 1980 • 2-byte, {0xA1-0xFE}{0xA1-0xFE}, or 94x94, for a total of 8836 possible 2-byte code points • 6,763 Han characters, for a total of 7,445 characters • Sidenote: GB/T 12345 provides a Traditional Chinese charset encoded in the same space as GB 2312-80 • Called zh_CN.GB2312 or zh_CN.EUC-CN on GNU/Linux • Too few characters! (朱鎔基 has to be written as 朱容基)
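The "too few characters" problem can be observed with any GB2312 converter. The sketch below uses Python's built-in `gb2312` and `gbk` codecs as a stand-in (an assumption for illustration; the talk itself is about Glibc and other GNU/Linux components, and the helper name `fits_in` is hypothetical). It checks whether each spelling of the name from the slide can be encoded:

```python
def fits_in(text: str, codec: str) -> bool:
    """Return True if every character of text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for name in ("朱容基", "朱鎔基"):
    for codec in ("gb2312", "gbk"):
        print(name, codec, fits_in(name, codec))

# Every GB2312 double-byte code has both bytes in 0xA1..0xFE.
encoded = "朱容基".encode("gb2312")
print(encoded.hex(), all(0xA1 <= b <= 0xFE for b in encoded))
```

The substitute spelling fits in GB2312, while the proper spelling needs a larger repertoire such as GBK.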

  8. GBK specification • China actively participates in ISO 10646 • GB13000.1 = Unicode 2.1 (ISO 10646-1993) • Too many legacy GB2312 applications • Need a migration plan, an intermediate solution • GBK is the first step in that direction (1995) • Includes the repertoire of the CJK Unified Ideographs in GB13000.1 / Unicode 2.1 • U+4E00 to U+9FA5, over 20,000 Han ideographs • Backward compatible with GB2312 • Implemented in Simplified Chinese Windows 95 (CP936) • {0x81-0xFE}{0x40-0x7E, 0x80-0xFE}
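The byte ranges on this slide translate into a simple validity check. The following is a sketch only, using Python and its built-in `gbk` codec for comparison; the function name `is_gbk_double_byte` is hypothetical, and the check covers the code space, not whether a given code is actually assigned to a character:

```python
def is_gbk_double_byte(lead: int, trail: int) -> bool:
    """True if (lead, trail) lies in {0x81-0xFE}{0x40-0x7E, 0x80-0xFE}."""
    lead_ok = 0x81 <= lead <= 0xFE
    trail_ok = 0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE
    return lead_ok and trail_ok


# Size of the two-byte code space implied by the ranges above.
code_space = sum(
    is_gbk_double_byte(lead, trail)
    for lead in range(256)
    for trail in range(256)
)
print(code_space)

# Backward compatibility: a GB2312 character keeps the same bytes in GBK.
print("中".encode("gb2312") == "中".encode("gbk"))

# A character outside GB2312 but inside U+4E00..U+9FA5 lands in the new area.
extra = "鎔".encode("gbk")
print(extra.hex(), is_gbk_double_byte(extra[0], extra[1]))
```

Note that the trail byte may fall in the ASCII range 0x40-0x7E, which is one reason byte-oriented programs that were safe with GB2312 can misbehave with GBK text.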

  9. Big-5 「五大碼」 • A “round-table” standard made up by the “Big-5” companies in Taiwan • Implemented by all major Chinese OS's • 倚天 (ETen), 零一, 國喬, Traditional Chinese Windows, and so on • Not very well designed; the character selection is not well standardized • Two characters are duplicated • Missing 「」 and other chars used in HK • In Taiwan, attempts to fix/extend Big5 basically failed (CMEX's Big-5+, Big-5E...)

  10. First steps beyond Big-5 • 倚天 ETen added some characters (Hiragana, Katakana, 「裏、銹」, etc.); some call it Big5-ETen. De facto Big5 standard on GNU/Linux • Microsoft Code Page 950 includes 「裏、銹」 etc., but not all of ETen's extensions • User-Defined Areas (UDA), Vendor-Defined Areas (VDA), EUDC (End-User Defined Characters), Private Use Areas (PUA) • Different people use EUDC differently... a messy situation • The demise of CMEX's Big-5+ standard

  11. Unicode / ISO 10646 • Unicode Consortium (Industry) • ISO/IEC 10646 (Academic/Int'l Standard) • The two have joined efforts to produce Unicode / UCS • Universal Multiple-Octet Coded Character Set • ISO: Design, adding characters to repertoire • Unicode Consortium: Technical implementation • Code range: U+0000 to U+10FFFF • 1,114,112 possible code points
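The figure of 1,114,112 code points comes from 17 planes of 65,536 code points each (plane 0 is the Basic Multilingual Plane, U+0000 to U+FFFF). A short Python sketch, offered only as an illustration with hypothetical constant names, confirms the arithmetic and shows that the runtime enforces the same upper bound:

```python
import sys

PLANES = 17                      # plane 0 (BMP) through plane 16
CODE_POINTS_PER_PLANE = 0x10000  # 65,536

total = PLANES * CODE_POINTS_PER_PLANE
print(f"{total:,}")              # size of the range U+0000..U+10FFFF

# Python's own limit on code points matches the top of the range.
print(hex(sys.maxunicode))

try:
    chr(0x110000)                # one past U+10FFFF
    beyond_range_ok = True
except ValueError:
    beyond_range_ok = False
print(beyond_range_ok)
```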

  12. Unicode / ISO 10646 • Think “integers”: UCS-2, UCS-4 • Think “strings” • UTF-7 • UTF-8 • Variable width, 1 to 4 bytes (up to • UTF-16 • 16-bit code units, with surrogate pairs (U+D800-U+DFFF; a high and a low surrogate combine into one character), up to U+10FFFF • UTF-32 • Fixed width 32-bit, up to U+7FFFFFFF
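To make the difference between these encoding forms concrete, the sketch below (Python, for illustration; `to_surrogates` is a hypothetical helper name) encodes one ASCII letter, one BMP ideograph, and one CJK Extension B ideograph, and computes a UTF-16 surrogate pair by hand:

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000          # 20 significant bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


for ch in ("A", "中", "\U00020000"):
    print(
        f"U+{ord(ch):04X}",
        "utf-8:", ch.encode("utf-8").hex(),
        "utf-16:", ch.encode("utf-16-be").hex(),
        "utf-32:", ch.encode("utf-32-be").hex(),
    )

high, low = to_surrogates(0x20000)
print(f"{high:04X} {low:04X}")
```

UTF-8 grows from 1 to 4 bytes across the three characters, UTF-16 needs two 16-bit units only for the character beyond U+FFFF, and UTF-32 always uses 4 bytes.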

  13. Unicode / ISO 10646 • ISO 10646-1:1993 • ISO 10646-1:2000 • ISO 10646-2:2001 • Unicode 3.2 just came out • More world languages are being researched and added, a truly worldwide effort.

  14. HKSCS-2001 (香港增補字符集-2001, Hong Kong Supplementary Character Set) • A brief history • GCCS (政府通用字庫, Government Common Character Set), 1995 • HKSCS-1999 • Official encoding name: BIG5-HKSCS (IANA Registry) • HKSCS-2001 • Actively promoted by ITSD • ITSD (HKSARG) wishes HKSCS-2001 to be implemented on GNU/Linux too, and actively assists the community by providing guidance and advice • Excellent official website, open standard (start from http://www.digital21.gov.hk/eng/hkscs/)

  15. Sample HKSCS Chinese Text (香港中文字範例) • 大家好!你同我一齊玩! • 李、仔、魚涌、深水 • 大廈/有啊! • (There also seem to be five swear-word characters...) Hehe...

  16. GB 18030-2000 • GB 18030-2000 Standard • Rationale for a new standard: The 70207+ unified Han ideographs in Unicode 3.1 won't all fit in the 2-byte codespace of the GBK specification • Full title: 《信息技術─信息交換用漢字編碼字符集─基本集的擴充》 (Information technology: Chinese ideograms coded character set for information interchange, extension for the basic set) (2000-03-17, 2000-11-30) • Further extends GBK to add a 4-byte codespace • More than enough to cover U+0000 to U+10FFFF • Compatible with all future versions of ISO 10646 • Backward compatible with GB2312 and GBK

  17. GB 18030-2000 • Why is GB18030 significant? • It solves a pressing issue in China. Finally, all people's names, geographic names, and ancient text can be properly processed • It is mandatory: all operating systems sold after 2001-08-31 must support GB18030 • Products must pass GB18030 certification to ensure proper input, editing, screen display, and printing of GB18030 text • Thiz Linux Desktop was awarded A+ Grade in GB18030 Certification Test!

  18. GB 18030-2000 • 1-byte = ISO 646-IRV (US-ASCII) • {0x00-0x7F} • 2-byte =~ GBK • {0x81-0xFE}{0x40-0x7E, 0x80-0xFE} • 4-byte • Mapped linearly with Unicode while skipping all existing mappings • Can be calculated algorithmically • {0x81-0xFE}{0x30-0x39}{0x81-0xFE}{0x30-0x39}
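The "can be calculated algorithmically" point is easiest to see for the supplementary planes (U+10000 to U+10FFFF), which map linearly onto four-byte codes starting at 0x90 0x30 0x81 0x30. The sketch below, in Python with a hypothetical function name, covers only that part; four-byte codes for BMP characters additionally depend on the ranges skipped by existing two-byte mappings, which requires a table and is not attempted here. The result is cross-checked against Python's built-in `gb18030` codec:

```python
def gb18030_supplementary(code_point: int) -> bytes:
    """Four-byte GB 18030 code for a code point in U+10000..U+10FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("this sketch handles supplementary planes only")
    index = code_point - 0x10000
    # Treat the four bytes as a mixed-radix number: 126 x 10 x 126 x 10.
    index, b4 = divmod(index, 10)    # 0x30..0x39
    index, b3 = divmod(index, 126)   # 0x81..0xFE
    b1, b2 = divmod(index, 10)       # 0x30..0x39 for b2, lead byte from 0x90
    return bytes([0x90 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


for cp in (0x10000, 0x20000, 0x10FFFF):
    computed = gb18030_supplementary(cp)
    reference = chr(cp).encode("gb18030")
    print(f"U+{cp:X}", computed.hex(), computed == reference)
```

Because each four-byte code is just a positional number in this mixed radix, no lookup table is needed for the roughly one million supplementary code points.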

  19. GB 18030-2000 • Official information hard to find • Hard to obtain the printed version of the GB18030 standard outside China • Fortunately, many early implementers and charset experts have provided info: • Dirk Meyer (Adobe) translated the summary • Markus Scherer (IBM, Unicode Consortium) provides the gb-18030-2000.xml conversion table • Much effort and interest from others, including ThizLinux Laboratory

  20. UnicodeData.txt, Unihan.txt • UnicodeData.txt • Important information on the character repertoires and control codes in Unicode • Unihan.txt • Valuable information (attributes) of over 70,000 CJK Unified ideographs • Source • Pronunciations in CJKV (+ Cantonese and Mandarin) • Meaning
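As a quick way to see the kind of per-character properties UnicodeData.txt carries, Python's standard `unicodedata` module exposes data derived from the Unicode Character Database. This is only an illustrative sketch: the module reflects whatever Unicode version the Python build ships with (not Unicode 3.2), and the Unihan.txt attributes such as pronunciations and meanings are not part of the standard library.

```python
import unicodedata

for ch in ("A", "中", "\u3042"):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),   # general category, e.g. Lu or Lo
        unicodedata.name(ch),       # character name
    )

# Version of the Unicode database bundled with this Python build.
print(unicodedata.unidata_version)
```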

  21. Difficulties in implementing HKSCS and GB18030 • HKSCS-2001 • CJK Extension B etc. (U+20000 – U+2FFFF), but not all programs support code points beyond U+FFFF yet • Lack of fonts • GB18030 • Huge! 4-byte • Certification • Fonts available, but expensive (TrueType or bitmap) • Both are Unicode solutions, so as Unicode support improves, so will support for HKSCS and GB18030

  22. Other Chinese encoding standards • CCCII (Chinese Character Codes for Information Exchange) • http://public.ptl.edu.tw/publish/suyan/42/text_07.htm • CNS 11643 • Big-5+, Big-5E • Encoding based on Cangjie (倉頡) • And many more

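  23. Chinese localization teams for GNU/Linux and *BSD • CLE (Chinese GNU/Linux Extension) • A group of pioneering volunteers originally led by Platin (小虫) • Debian Chinese Project (Debian 中文計劃) • FreeBSD Chinese localization group (FreeBSD 中文化小組) • Translation teams in mainland China, Hong Kong, and Taiwan • Many more CJKV and i18n/L10n teams worldwide, with both Chinese and non-Chinese members!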

  24. Major Chinese GNU/Linux Distributions • Major Chinese GNU/Linux distributions • Thiz Linux Desktop 6.0 (即時 Linux 桌面環境 6.0) • Turbolinux 7.0 Chinese edition • Chinese 2000 (中文 2000) • Xteam (沖浪), Red Flag (紅旗), COSIX (中軟), Happy (幸福), Linpus (百資), XLinux (網虎) • Well-known international distributions with Chinese support • Debian GNU/Linux, Red Hat Linux, Linux Mandrake, (SuSE, Slackware), FreeBSD

  25. GNU C Library (GLIBC) • Libc5 • Glibc 2.1 • Glibc 2.2 • Conversion tables • Big5 (CLE), GBK (Justin Yu, Sean Chen) • big5hkscs.c (Roger So, Ulrich Drepper, ThizLinux, James Su) • GB18030 (Wu Jian, Ulrich, ThizLinux, James Su, another version by Yu Shao)
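Conversion tables like those listed here are what make it possible to move text between Big5, GBK, GB18030, and Unicode. The sketch below illustrates the idea with Python's own built-in codecs rather than Glibc's tables (an assumption made for a self-contained example; `convert` is a hypothetical helper, and the two implementations are not guaranteed to agree on every character):

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one charset to another via Unicode."""
    return data.decode(source).encode(target)


text = "香港"
for codec in ("big5", "big5hkscs", "gbk", "gb18030", "utf-8"):
    encoded = text.encode(codec)
    # Round trip: decoding with the same table must give the text back.
    assert encoded.decode(codec) == text
    print(f"{codec:10s} {encoded.hex()}")

big5_bytes = text.encode("big5")
gb_bytes = convert(big5_bytes, "big5", "gb18030")
print(gb_bytes.decode("gb18030"))
```

Going through Unicode as the pivot means each charset needs only one table (to and from Unicode) instead of one table per pair of charsets.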

  26. XFree86 / X Window System • XLFD, fontset • Xrender / Xft (Keith Packard) • X-TT, “freetype” module • Addition of Big5-HKSCS encodings (Roger So) • Addition of GB18030 encoding (James Su et al.)

  27. GTK+ and GNOME • GNOME 1.x • Charset handling based on Glibc and XFree86 • Good, but not perfect • GNOME 2.0 (in development) • Pango • Xft

  28. Qt 3.0.4 and KDE 3.0.1 • Qt comes with its own “codecs” in order to be a multiplatform toolkit. • Somewhat tedious... the tables already created for Glibc must be re-created for Qt • We cannot directly use Glibc's code because of licensing issues... No big deal, just extra effort. • Good Unicode support; handles everything with Unicode internally. • Currently only supports UCS-2, which poses challenges for HKSCS-2001

  29. Chinese Input Method Servers • XCIN • Chinput • miniChinput • magicChinput • 楊春白雪 • MyIM

  30. Chinese input methods • Cangjie (倉頡) • Array30 (行列30) • Dayi (大易) • Wubi (五筆字型) • Zhineng ABC (智能ABC), Smart Pinyin (智能拼音) • Hybrid (混合) • Many others

  31. Chinese fonts • Arphic (文鼎) • AR PL Mingti2L Big5 • AR PL SungtiL GB • AR PL KaitiM Big5 • AR PL KaitiM GB • DynaLab (華康) • Founder (方正) • Ten GNU GPL Chinese fonts by 王漢忠 • Unfortunately, the format is not very suitable...

  32. Web Browsers • Netscape 4.79 • Mozilla 0.9.9 • Dillo, Galeon, etc. • Konqueror

  33. CJK LaTeX and FreeType • CJK LaTeX Written by Werner Lemberg from Germany • Yes, Werner can speak Chinese too! Amazing! • FreeType 1.3.1 and FreeType 2.0.9: • TrueType (and Type1, BDF etc.) font library • Main authors: David Turner, Robert Wilhelm, Werner Lemberg

  34. PostScript and PDF • Ghostscript + CJK (GS-CJK) • Adobe's CMaps (HKscs, GBK2K, etc.) • Acrobat Reader 4.05 for Linux does not come with the CMaps (HKscs and GBK2K) that are already in Acrobat Reader 5.0 • Ghostscript and Xpdf are constantly improving

  35. Office Suites • OpenOffice.org family (Thiz Office, Kai Office, Red Office) • Chinese support improving, a joint effort • Excellent i18n/L10n support for all languages • HancomOffice • Will be based on Qt 3 • qbig5hkscscodec.cpp for Qt2 provided by ThizLinux Laboratory; Hancom ported the code for Qt3 • Lightweight: AbiWord and Gnumeric • Quite good too!

  36. How to participate in GNU/Linux Chinese i18n efforts • Improve existing infrastructure • Work on new areas • Help with localization and translation efforts • Join a project that you like, whether it is Chinese i18n/L10n related or not • Help spread the word! :-)

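  37. PO translation • GNOME 2.0 • KDE 3.0 • GNU Utilities • Gettext tools • PO / MO formats • Usage, encoding issues • Better left untranslated than mistranslated (寧可不譯,不可誤譯) • e.g. 「非化名的字型」 (literally “non-pseudonym fonts”) as a mistranslation; better: 平滑字型 (smooth fonts), 反鋸齒字型 (anti-aliased fonts)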

  38. Reference websites • http://cle.linux.org.tw/ • http://xcin.linux.org.tw/ • http://www.debian.org.hk/intl/zh/ • http://linuxfab.cx/ • http://www.linuxforum.net/ • http://www.unicode.org/ • Mr. Chu Bong-Foo's studio (朱邦復先生工作室) http://www.cflabs.com/ • http://www.google.com/

  39. To-do list / TODO • Some programs still need to be revised in order to conform to the i18n/L10n infrastructure • Always room for improvement in terms of ease of use, completeness, and stability • More participants are welcome

  40. Future Developments and Opportunities • Handwriting Pad (手寫板) • Voice Recognition (語音識別) • More smart Cantonese input methods? • IIIMF to replace XIM? • OpenType to replace TrueType? • More interesting Chinese language research based on GNU/Linux systems?

  41. Comments and Suggestions • All skills are useful, even if you are not in CS, CE or EE! • Mathematics, Physics theory • C, C++, Perl, Python, GTK, Qt • IPA, Jyutping, Japanese, Korean... • e.g. the author of XCIN studied Physics... • Linguistics (語言學), Phonetics (語音學) • What we can learn during the process • Skills development, learning English, learning other new languages, meeting friends, and many more!

  42. Any questions are welcome! :-)
