
The HKIUG Unicode Project

Fourth Annual HKIUG Meeting, 8-9 Dec 2003, Lingnan University, Hong Kong. The HKIUG Unicode Project, by Philip WONG (CityU Library) and HO Yee Ip (CUHK Library). Overview: Part I covers Background, Problems, Objective & Methodology, Procedures, and Deliverables and Actions; Part II covers Follow-up.

Presentation Transcript


  1. Fourth Annual HKIUG Meeting, 8-9 Dec 2003, Lingnan University, Hong Kong
     The HKIUG Unicode Project
     Philip WONG, CityU Library
     HO Yee Ip, CUHK Library
     The HKIUG Unicode Project, Philip Wong (CityU) & Ho Yee Ip (CUHK), Fourth HKIUG Meeting, 8-9 Dec 2003

  2. Overview
     Part I
     • Background
     • Problems
     • Objective & Methodology
     • Procedures
     • Deliverables and Actions
     Part II
     • Follow-up
     • Are the problems solved
     • Future work

  3. The HKIUG Unicode Project - Part I
     by Philip Wong, City University of Hong Kong Library
     December 8, 2003

  4. Background: character sets
     • Several different character sets support CJK.
     • Big5 is common in Hong Kong and Taiwan; GB is used in Mainland China.
     • CCCII and EACC are mainly used in libraries; EACC is the Library of Congress (LC) standard.
     • Unicode is widely supported by operating systems, applications and the W3C.
     Reference: KT Lam, “Overview of Chinese Character Encoding”, http://www.lib.cuhk.edu.hk/seminar/unicode/kt_lam_files/frame.htm

  5. Background: code points
     • Different character sets assign different code points to the same character (more precisely, the same glyph).

  6. Background: internal codes
     • Innovative supports CJK by storing CJK characters internally in EACC and CCCII.
     • The internal code is not Unicode based.
       100 1  |6880-01|aYu, Guangzhong,|d1928-
       245 10 |6880-02|aYu Guangzhong shi xuan
       880 1  |6100-01/$1|a余光中,|d1928-
       880 10 |6245-02/$1|a余光中詩選
     [edit mode ctrl-w]
       100 1  |6880-01|aYu, Guangzhong,|d1928-
       245 10 |6880-02|aYu Guangzhong shi xuan
       880 1  |6100-01/$1|a{213131}{213272}{213034},|d1928-
       880 10 |6245-02/$1|a{213131}{213272}{213034}{21585c}{215c4f}
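
The edit-mode display above shows each CJK character as a six-hex-digit internal code in braces. A minimal sketch, assuming that brace pattern holds for every CJK character in a subfield, of how such codes could be pulled out for analysis. The function name `extract_internal_codes` is hypothetical and not part of any Innovative software.

```python
from __future__ import annotations

import re

# Six hex digits in braces, e.g. {213131}. The pattern is inferred from the
# edit-mode display on the slide, not from vendor documentation.
BRACED_CODE = re.compile(r"\{([0-9A-Fa-f]{6})\}")


def extract_internal_codes(subfield: str) -> list[str]:
    """Return the internal codes found in a subfield, upper-cased, in order."""
    return [code.upper() for code in BRACED_CODE.findall(subfield)]


# Title subfield of the 880 field shown on the slide.
title = "{213131}{213272}{213034}{21585c}{215c4f}"
codes = extract_internal_codes(title)
print(codes)
```

Upper-casing the codes is a design choice for the sketch: the slides write the same code in both cases (for example {27542b} and 27542B), so a single normalised form makes later lookups simpler.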

  7. Background: mapping table
     • A mapping table is required to convert internal codes to and from client encodings.
     • It was once a good solution, but it has also created many problems.
     • Many issues have been raised and discussed over the years:
       • Seminar on Chinese Information Processing in Libraries, HKUST, Jan 1998
       • Good discussion list: LIB-CHINESE Listserv

  8. Problems: multiple mapping (Problem 1)
     • Multiple internal codes are mapped to one client code.
     • The code searched for or input may not be the one desired.
     • The order of mappings may differ among local sites, which gives inconsistent results in Z39.50 searching.
     • In the III UTF-8 table there are 1150 multiple-mapping cases (2232 characters), covering both EACC and CCCII, some with high usage frequency, e.g. 台 (U+53F0), 漢 (U+6F22).
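
A multiple-mapping case is simply a client code point that more than one internal code maps to, so such cases can be found by grouping the table by code point. The sketch below is illustrative only: the rows are pairs mentioned elsewhere in this presentation, while the row order, the in-memory layout, and the name `find_multiple_mappings` are assumptions, not the format of the III table.

```python
from collections import defaultdict

# (internal code, Unicode code point) pairs mentioned in the slides.
# Row order and layout are assumptions made for this illustration.
ROWS = [
    ("213F7B", 0x624D),  # 才
    ("28736D", 0x624D),  # 才
    ("213538", 0x53F0),  # 台
    ("27542B", 0x53F0),  # 台
    ("283B7D", 0x53F0),  # 台
    ("21387D", 0x591A),  # 多, a single mapping
]


def find_multiple_mappings(rows):
    """Group internal codes by code point and keep groups with more than one code."""
    by_code_point = defaultdict(list)
    for internal, code_point in rows:
        by_code_point[code_point].append(internal)
    return {cp: codes for cp, codes in by_code_point.items() if len(codes) > 1}


multi = find_multiple_mappings(ROWS)
for cp, codes in multi.items():
    print(f"U+{cp:04X}: {codes}")
```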

  9. Problems: overlapping EACC & CCCII (Problem 2)
     • In multiple-mapping cases, there may be overlapping use of EACC and CCCII.
     • Overlapping introduces more multiple mappings.
     • It creates extra workload when exchanging records with international bibliographic services that only accept EACC.

  10. Problems: errors & missing (Problem 3)
      • The III mapping table contains other problems.
      • In the UTF-8 table (Release 2002 Phase 3):
        • Errors: 27615F is mapped to U+53CB 友; it should be U+53D1 发.
        • Missing cases: 212F30 for U+3007 〇 is missing.
        • Wrong types: 213538 (U+53F0 台) is typed as non-EACC; it should be EACC.

  11. Problems: analysis of UTF-8
      Local sites analysed the UTF-8 mapping between April and June 2003.
      Questions:
      • Can local sites select their preferences for multiple mappings?
      • Can non-EACC codes be abandoned, and those with EACC equivalents be converted to EACC in the database?
      • Can the correct type (EACC/CCCII) be re-assigned based on the standard?

  12. Problems: software inconsistency (Problem 4)
      • What triggered the HKIUG Unicode Project was the inconsistent software mapping between Big5 and UTF-8 in multiple-mapping cases:
        • Big5 client: mapped to the first entry
        • UTF-8 client: mapped to the last entry

  13. Problems: software inconsistency (cont)
      • Searching 才 (cai) in WebPAC Big5 (or Telnet Big5): mapped to the first entry
        Internal  Big5
        213f7b    A47E
        28736d    A47E

  14. Problems: software inconsistency (cont)
      • Searching 才 (cai) in WebPAC UTF-8 (or Millennium): mapped to the last entry
        Internal  UTF-8
        213f7b    624D
        28736d    624D
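
The inconsistency described in the last three slides can be reproduced with a toy lookup. This is a minimal sketch, assuming each client resolves a searched character by scanning its table for matching rows; the `resolve` function and its `policy` parameter are hypothetical names used only to contrast "first entry" with "last entry".

```python
# Table rows for 才 as listed on the slides: (internal code, client code).
BIG5_TABLE = [("213F7B", "A47E"), ("28736D", "A47E")]
UTF8_TABLE = [("213F7B", "624D"), ("28736D", "624D")]


def resolve(table, client_code, policy):
    """Return the internal code chosen for client_code under the given policy."""
    matches = [internal for internal, client in table if client == client_code]
    if not matches:
        return None
    if policy == "first":
        return matches[0]
    if policy == "last":
        return matches[-1]
    raise ValueError(f"unknown policy: {policy}")


big5_hit = resolve(BIG5_TABLE, "A47E", "first")  # Big5 client behaviour
utf8_hit = resolve(UTF8_TABLE, "624D", "last")   # UTF-8 client behaviour
print(big5_hit, utf8_hit)  # the same character resolves to different internal codes
```

With these two rows the Big5 client searches for 213F7B while the UTF-8 client searches for 28736D, so the same query can retrieve different records depending on the interface.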

  15. Objective & Methodology
      A seminar was organized by CUHK in July 2003: http://www.lib.cuhk.edu.hk/seminar/unicode/
      A HKIUG Working Group on the Unicode Project was formed. Members: CUHK, CityU, HKU, HKUST
      Objective
      • Solve the software inconsistency between Big5 and UTF-8
      • Decide on one-to-one mapping or many-to-one mapping
      • Decide on pure EACC or EACC and CCCII
      • Clean up errors, wrong types and missing cases
      • Prepare to transfer to a Unicode based database

  16. Objective & Methodology (cont)
      The working group further decided:
      • Not to fix the Big5 table (small character set, supports only traditional Chinese, more multiple mappings, etc.)
      • Propose a new UTF-8 mapping table to Innovative
      • For EACC mapping, follow the LC standard
      • Allow multiple mappings of EACC; for unlinked cases, decide on the preferences
      • For multiple mappings of EACC and CCCII, remove the CCCII
      • Convert CCCII in the database to EACC equivalents
      • Avoid missing characters; include pure CCCII (though a low percentage in the database)

  17. Procedures: created diac.utf8.hkiug
      • From LC EACC: 15739 EACC merged; subtracted 66 substitutes for missing characters (U+3013); 15673 EACC remain
      • From diac.utf8: 7999 CCCII extracted; subtracted 955 with EACC equivalents; 7044 pure CCCII remain
      • Result: diac.utf8.hkiug with 22717 EACC/CCCII entries
      • Remapped 287 PUA
      • Selected preferences in multi-mapping linked and unlinked cases
      • Corrected LC mappings
      • Prepared list for CCCII to EACC data conversion
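
The counts on this slide are consistent with one another, which can be checked with a few lines of arithmetic. The variable names below are descriptive labels chosen for this check, not identifiers from the project.

```python
# Figures as stated on the slide.
lc_eacc_merged = 15739       # EACC entries merged from the LC tables
missing_substitutes = 66     # substitutes for missing characters (U+3013)
cccii_extracted = 7999       # CCCII entries extracted from diac.utf8
cccii_with_eacc = 955        # CCCII entries that have an EACC equivalent

eacc_kept = lc_eacc_merged - missing_substitutes
pure_cccii = cccii_extracted - cccii_with_eacc
table_total = eacc_kept + pure_cccii

print(eacc_kept, pure_cccii, table_total)  # 15673 7044 22717
```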

  18. Procedures: source from LC
      • Merged tables from LC's EACC to UCS/Unicode Mappings: http://www.loc.gov/marc/specifications/specchareacc.html

  19. Procedures: source from diac.utf8
      • Included pure CCCII from the UTF-8 table (Release 2002 Phase 3)

  20. Procedures: re-mapped PUA
      • Re-mapped 297 Private Use Area (PUA) code points to suggested alternates

  21. Procedures: selected preference
      • Selected preference in multiple-mapping EACC

  22. Procedures: selected preference (cont)
      • Selected preference in EACC multiple mapping, linked cases
      Linked cases: HKIUG preference indicated

  23. Procedures: selected preference (cont)
      • Selected preference in EACC multiple mapping, unlinked cases
      Unlinked cases: HKIUG preference indicated

  24. Procedures: updated LC mapping
      • Updated LC mappings
      • Referenced from other sources:
        • Unihan
        • OCLC
        • USMARC Character Set for Chinese, Japanese, Korean (printed)
      • Examples:

  25. Procedures: list for conversion
      • Prepared list for data conversion: CCCII with EACC equivalents (columns: CCCII, EACC)

  26. Deliverables and Actions
      • Deliverables to Innovative
        • diac.utf8.hkiug - HKIUG version of the UTF-8 mapping table
          EACC        15,673
          Pure CCCII   7,044
          Total       22,717
        • hasEACC.txt - CCCII with EACC equivalents - 955
        • Final Report - Hong Kong Innovative Users Group (HKIUG) III-UTF8 Working Group Report
      • Actions for Innovative
        • Endorse and install diac.utf8.hkiug
        • Replace the CCCII codes listed in hasEACC.txt with their EACC equivalents in the database
      Note: local sites can choose whether or not to implement the above actions (e.g. while adopting the new table, CUHK chose to run its own data conversion)
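
The second action, replacing CCCII codes with their EACC equivalents, amounts to a lookup-and-substitute pass over the braced codes in each field. The following is only a sketch under stated assumptions: the layout of hasEACC.txt is not given in the slides, so the mapping is passed in as a plain dictionary, and the code pair used in the example is a made-up placeholder rather than an entry from that file. `convert_cccii_to_eacc` is a hypothetical name.

```python
import re

# Six hex digits in braces, as shown in telnet edit mode.
BRACED_CODE = re.compile(r"\{([0-9A-Fa-f]{6})\}")


def convert_cccii_to_eacc(field, has_eacc):
    """Replace every braced code found in has_eacc with its EACC equivalent.

    Codes are upper-cased on output; codes not in the mapping are kept.
    """
    def swap(match):
        code = match.group(1).upper()
        return "{" + has_eacc.get(code, code) + "}"

    return BRACED_CODE.sub(swap, field)


# Placeholder pair for illustration only, not taken from hasEACC.txt.
placeholder_map = {"FFFF01": "FFFF02"}
converted = convert_cccii_to_eacc("{ffff01}{213131},|d1928-", placeholder_map)
print(converted)  # {FFFF02}{213131},|d1928-
```

A real conversion would also need to decide which fields to touch and how to log changes; neither is covered by the slides, so both are left out here.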

  27. The HKIUG Unicode Project - End of Part I

  28. The HKIUG Unicode Project - Part II
      by Ho Yee Ip, CUHK University Library Systems
      December 8, 2003

  29. Are the problems solved
      • Resolve the Big5 and UTF-8 software inconsistency?
        • Yes (if Big5 interfaces are abandoned)
      • Use the same preferred mappings among local sites?
        • Yes (if all sites adopt the new table)
      • Able to search the desired code in multiple mapping?
        • Yes (if added entries are created)
      • No overlapping of EACC and CCCII in multiple mapping?
        • Yes
      • Clear up all errors and missing cases?
        • No (on-going job)
      • Switch 100% to Millennium?
        • No (unfortunately, 2002 Phase 3 created more problems …)

  30. Are the problems solved: new problem
      • New problems in Release 2002 Phase 3
        • Millennium Edit implicitly converts non-preferred entries to the preferred entry (this may be an old problem from Phase 2)
        • Worse, this “preferred” entry may not be the HKIUG preferred one. It is always the 2nd entry, which is wrong for multiple mappings with more than 2 entries
      • Testing
        • In Millennium Cataloguing, input 台 as braced code {283B7D}
        • Save the record
        • Check in telnet edit mode (Ctrl-W): still {283B7D}
        • Re-save the record in Millennium with no further editing
        • Re-check in telnet: it becomes {27542b}
        Note: Global update or amending attached records will not invoke this conversion
      • Millennium is not yet ready for CJK editing!

  31. Are the problems solved: installed sites
      • Reports from sites that have installed the new UTF-8 mapping table and run the data conversion
        • successful?
        • failed?
        • unexpected outcome?

  32. Follow-up: mapping table
      • Continue to clean up and supplement the mapping table
      • Recommend updates and changes of EACC mapping to LC and III
        • There are 169 different mappings between III and LC. HKIUG followed LC
        • Consider this case:
          • III choice: 2D552E U+82FA 苺
          • LC choice: 2D552E U+8393 莓 (obviously different)
          • Consult: USMARC character set for Chinese, Japanese, Korean. Washington, D.C. : Library of Congress, 1986.
          • The glyph of 2D552E is 苺 (the same as III)
          • Is III right or LC right?
        • Others: 232D42, 396B33, 23355C

  33. Follow-up: mapping table (cont)
      • Other differences between LC and III
        • 232D42
          • III choice: 232D42 U+8842 衂
          • LC choice: 232D42 U+4610 (2 dots) → minor variation
          • US MARC (printed): 232D42 衂 (same as III)
        • 396B33
          • III choice: 396B33 U+524F 剏
          • LC choice: 396B33 U+5259 剙 (2 dots) → minor variation
          • US MARC (printed): 396B33 剏 (same as III)
        • 23355C
          • III choice: 23355C U+8C63 豣
          • LC choice: 23355C U+86C3 蛃 → obviously different
          • US MARC (printed): 23355C 豣 (same as III)
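
Differences such as these can be listed mechanically by comparing the two tables code by code. The sketch below uses only the four disagreeing codes quoted on the last two slides, plus one entry from the closing slide that is assumed, for illustration, to be identical in both tables. It does not decide which table is right, and `diff_tables` is a hypothetical name.

```python
# EACC code -> Unicode code point, as quoted on the slides.
III = {
    "2D552E": 0x82FA,
    "232D42": 0x8842,
    "396B33": 0x524F,
    "23355C": 0x8C63,
    "21387D": 0x591A,  # assumed to agree in both tables (illustration only)
}
LC = {
    "2D552E": 0x8393,
    "232D42": 0x4610,
    "396B33": 0x5259,
    "23355C": 0x86C3,
    "21387D": 0x591A,  # assumed to agree in both tables (illustration only)
}


def diff_tables(a, b):
    """Return sorted (code, value_in_a, value_in_b) for codes mapped differently."""
    shared = a.keys() & b.keys()
    return sorted((code, a[code], b[code]) for code in shared if a[code] != b[code])


differences = diff_tables(III, LC)
for code, iii_cp, lc_cp in differences:
    print(f"{code}: III U+{iii_cp:04X}, LC U+{lc_cp:04X}")
```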

  34. Follow-up: mapping table (cont)
      • Continue to clean up and supplement the mapping table
        • Supplement diac.utf8.hkiug with additional CCCII
          • Source: latest Unihan database file (e.g. ftp://ftp.unicode.org/Public/4.0-Update1/Unihan-4.0.1d3b.zip)
        • Amend diac.utf8.hkiug when LC updates its code standard
          • Source: LC MARC 21 code standard (http://www.loc.gov/marc/specifications/specchareacc.html)

  35. Follow-up: added entries
      • Change of cataloguing practice
        • Provide added entries for unlinked multi-mapping codes
        • Source data may not be the preferred code (by meaning)
        • Transcription should be faithful to the source
        • Added entries enhance retrieval
      e.g. 历 (U+5386), source: 万年历
        历 {274349} <=> 曆 {214349}
        历 {27462A} (preferred) <=> 歷 {21462A}

  36. Follow-up: added entries (cont)
      Source: 万年历
        历 {274349} <=> 曆 {214349}
        历 {27462A} (preferred) <=> 歷 {21462A}
      Action:
      • About 29 of the 49 unlinked cases need attention
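
One way to picture the added-entry practice: for every unlinked multi-mapping code in a transcribed heading, produce the alternative code sequences that a searcher might otherwise miss. This is a rough sketch, not the working group's procedure. The two codes for 历 come from the slide; the characters 万 and 年 stand in for their own internal codes, which the slide does not give; `added_entries` is a hypothetical name.

```python
from itertools import product

# Unlinked alternates for 历 from the slide: both codes display as U+5386.
UNLINKED = {"27462A": ["274349"], "274349": ["27462A"]}


def added_entries(codes, unlinked=UNLINKED):
    """Return every alternative code sequence except the transcribed one."""
    choices = [[code] + unlinked.get(code, []) for code in codes]
    original = list(codes)
    return [list(variant) for variant in product(*choices) if list(variant) != original]


# 万 and 年 are placeholders for internal codes not listed on the slide.
transcribed = ["万", "年", "274349"]
extras = added_entries(transcribed)
print(extras)
```

The transcribed form stays faithful to the source, and each returned sequence is a candidate added entry.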

  37. Follow-up: staff mode
      • Since the Big5 mapping table is not fixed, Telnet Big5 mode can no longer be used; explore other software: AnzioWin, PuTTY
      • In Telnet mode, the INNOPAC UTF-8 port cannot support full-screen editing; only line editing is feasible
      CJK display is corrupted in full-screen editing

  38. Follow-up: staff mode (cont)
      • Some local sites, e.g. CUHK, use AnzioWin. When AnzioWin is set to CCCII mode, its mapping table CCCII.UNI can be used for Unicode mapping.
      • Deficiency: CCCII.UNI is one-to-one, so non-preferred entries cannot be included, e.g.
        # 274349 53D1   # not preferred
        274C7B 53D1
      • Better to use the Innopac UTF-8 port when it is ready for editing

  39. Future
      To migrate to a pure Unicode environment…
      • Abandoning EACC/CCCII will lose the linking of traditional, simplified and variant forms. How to link?
        • 历 U+5386
        • 曆 U+66C6
        • 歷 U+6B77
      • Linking information is available from the Unihan website: http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=5386
      Migration can be considered only if this linking is maintained by the vendor.
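
The Unihan data records such links in its `kSimplifiedVariant` and `kTraditionalVariant` fields, which use a tab-separated layout of code point, field name and value. A minimal sketch of rebuilding the link between 历, 曆 and 歷 from data in that layout follows. The three sample lines are written for illustration from the characters on the slide rather than copied from a Unihan release, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative lines in the Unihan tab-separated layout (not quoted from a release).
SAMPLE = (
    "U+5386\tkTraditionalVariant\tU+66C6 U+6B77\n"
    "U+66C6\tkSimplifiedVariant\tU+5386\n"
    "U+6B77\tkSimplifiedVariant\tU+5386\n"
)

VARIANT_FIELDS = ("kSimplifiedVariant", "kTraditionalVariant")


def parse_variants(text):
    """Build a symmetric code point -> linked code points map."""
    links = defaultdict(set)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code_point, field, value = line.split("\t")
        if field not in VARIANT_FIELDS:
            continue
        source = int(code_point[2:], 16)
        for target in value.split():
            dest = int(target[2:].split("<")[0], 16)  # drop any "<source" suffix
            links[source].add(dest)
            links[dest].add(source)
    return links


def variant_group(links, code_point):
    """Return every code point reachable from code_point through variant links."""
    seen, stack = {code_point}, [code_point]
    while stack:
        for nxt in links.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


links = parse_variants(SAMPLE)
group = variant_group(links, 0x66C6)
print(sorted(f"U+{cp:04X}" for cp in group))
```

Searching for any one of the three forms could then be expanded to the whole group, which is the kind of linking the slide asks the vendor to maintain.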

  40. The HKIUG Unicode Project - The End
      {21387D} {215938}
      U+591A   U+8B1D
      多       謝
      Thank You
