Fourth Annual HKIUG Meeting 8-9 Dec, 2003 Lingnan University, Hong Kong. The HKIUG Unicode Project. Philip WONG, CityU Library HO Yee Ip, CUHK Library. Overview. Part I Background Problems Objective & Methodology Procedures Deliverables and Actions Part II Follow-up
Fourth Annual HKIUG Meeting 8-9 Dec, 2003 Lingnan University, Hong Kong The HKIUG Unicode Project Philip WONG, CityU Library HO Yee Ip, CUHK Library The HKIUG Unicode Project, Philip Wong (CityU) & Ho Yee Ip (CUHK), Fourth HKIUG Meeting, 8-9 Dec 2003
Overview
Part I
• Background
• Problems
• Objective & Methodology
• Procedures
• Deliverables and Actions
Part II
• Follow-up
• Are the problems solved
• Future work
The HKIUG Unicode Project - Part I
by Philip Wong
City University of Hong Kong Library
December 8, 2003
Background: character sets
• Several different character sets support CJK.
• Big5 is common in Hong Kong and Taiwan; GB is used in Mainland China.
• CCCII and EACC are mainly used in libraries. EACC is the LC standard.
• Unicode is widely supported by operating systems, applications and the W3C.
Reference: KT Lam, “Overview of Chinese Character Encoding”, http://www.lib.cuhk.edu.hk/seminar/unicode/kt_lam_files/frame.htm
Background: code points
• Different character sets assign different code points to the same character (more precisely, the same glyph)
Background: internal codes
• Innovative supports CJK by storing CJK characters internally in EACC and CCCII
• The internal code is not Unicode based

100 1  |6880-01|aYu, Guangzhong,|d1928-
245 10 |6880-02|aYu Guangzhong shi xuan
880 1  |6100-01/$1|a余光中,|d1928-
880 10 |6245-02/$1|a余光中詩選

[edit mode ctrl-w]
100 1  |6880-01|aYu, Guangzhong,|d1928-
245 10 |6880-02|aYu Guangzhong shi xuan
880 1  |6100-01/$1|a{213131}{213272}{213034},|d1928-
880 10 |6245-02/$1|a{213131}{213272}{213034}{21585c}{215c4f}
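The braced codes in the edit-mode display can be pulled out of a field and turned back into characters with a small lookup. The following is a minimal sketch, not Innovative's implementation: the helper names `braced_codes` and `decode` are hypothetical, and the five-entry table is only the positional correspondence visible on this slide (five codes against 余光中詩選), not a real mapping table.

```python
import re

# Correspondence read off the slide: the five braced codes line up,
# in order, with the five characters of 余光中詩選. Illustration only.
SLIDE_SAMPLE = {
    "213131": "余",
    "213272": "光",
    "213034": "中",
    "21585C": "詩",
    "215C4F": "選",
}

# A braced internal code is six hex digits (three bytes) inside { }.
BRACED = re.compile(r"\{([0-9A-Fa-f]{6})\}")


def braced_codes(field):
    """Return the internal codes found in a field, upper-cased."""
    return [code.upper() for code in BRACED.findall(field)]


def decode(field, table):
    """Replace each known braced code with its character; keep unknown codes."""
    def swap(match):
        return table.get(match.group(1).upper(), match.group(0))
    return BRACED.sub(swap, field)


field = "880 10 |6245-02/$1|a{213131}{213272}{213034}{21585c}{215c4f}"
print(braced_codes(field))
print(decode(field, SLIDE_SAMPLE))
```

Codes that are absent from the table are left in braced form, which mirrors how an unmapped character would remain visible to the cataloguer rather than being silently dropped.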
Background: mapping table
• A mapping table is required to convert internal codes to and from client encodings
• Once a good solution, but it also created many problems
• Many issues have been raised and discussed over the years
  • Seminar on Chinese Information Processing in Libraries, HKUST, Jan 1998
  • Good discussion list: LIB-CHINESE Listserv
Problems: multiple mapping
Problem 1
• Multiple internal codes are mapped to one client code
• The code searched for or input may not be the one desired
• The order of mappings may differ among local sites, giving inconsistent results in Z39.50 searching
• In the III UTF-8 table there are 1150 multiple mapping cases (2232 characters), including EACC and CCCII, some with high usage frequency, e.g. 台 (U+53F0), 漢 (U+6F22)
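Multiple mappings of this kind can be found by grouping table rows by their Unicode code point. The sketch below assumes rows of (internal code, code point) pairs; the function name `multiple_mappings` is hypothetical, and the sample rows are only the codes quoted elsewhere in these slides, not an extract of the real III table.

```python
from collections import defaultdict

# Sample rows built from codes mentioned in the slides (illustration only).
SAMPLE_ROWS = [
    ("213F7B", 0x624D),  # 才
    ("28736D", 0x624D),  # 才, second internal code
    ("213538", 0x53F0),  # 台
    ("283B7D", 0x53F0),  # 台, second internal code
    ("27615F", 0x53D1),  # 发, only one row in this sample
]


def multiple_mappings(rows):
    """Group internal codes by code point; keep code points with more than one."""
    by_code_point = defaultdict(list)
    for internal, code_point in rows:
        by_code_point[code_point].append(internal)
    return {cp: codes for cp, codes in by_code_point.items() if len(codes) > 1}


groups = multiple_mappings(SAMPLE_ROWS)
for cp, codes in sorted(groups.items()):
    print(f"U+{cp:04X} {chr(cp)}: {', '.join(codes)}")
```

Row order is preserved inside each group, which matters because the clients discussed later pick an entry by its position in the table.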
Problems: overlapping EACC & CCCII
Problem 2
• In multiple mapping cases, EACC and CCCII codes may overlap
• Overlapping introduces more multiple mappings
• It creates extra workload when exchanging records with international bibliographic services, which accept only EACC
Problems: errors & missing
Problem 3
• The III mapping table contains other problems
• In UTF-8 (Release 2002 Phase 3):
  • errors: 27615F is mapped to U+53CB 友; it should be U+53D1 发
  • missing cases: 212F30 for U+3007 〇 is missing
  • wrong types: 213538 (U+53F0; 台) is typed as non-EACC; it should be EACC
Problems: analysis of UTF-8
Local sites analysed the UTF-8 mapping between April and June 2003.
Questions:
• Can local sites select their own preferences for multiple mappings?
• Can non-EACC codes be abandoned, and those with EACC equivalents be converted to EACC in the database?
• Can the correct EACC/CCCII type be re-assigned based on the standard?
Problems: software inconsistency
Problem 4
• What triggered the HKIUG Unicode Project was the inconsistent software mapping between Big5 and UTF-8 in multiple mapping cases:
  • Big5 client – mapped to the first entry
  • UTF-8 client – mapped to the last entry
Problems: software inconsistency (cont)
• Searching 才 (cai) in WebPAC Big5 (or Telnet Big5): mapped to the first entry

Internal  Big5
213f7b    A47E   (first entry, used)
28736d    A47E
Problems: software inconsistency (cont)
• Searching 才 (cai) in WebPAC UTF-8 (or Millennium): mapped to the last entry

Internal  UTF-8
213f7b    624D
28736d    624D   (last entry, used)
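The first-entry versus last-entry behaviour is easy to reproduce when a reverse lookup (client code to internal code) is built from a many-to-one table. This is a sketch of the effect only, under the assumption that each client scans the table top to bottom; the function names are hypothetical and nothing here is taken from the vendor's code.

```python
# The two rows for 才 from the slides, in table order.
TABLE = [
    ("213F7B", 0x624D),
    ("28736D", 0x624D),
]


def reverse_first_wins(table):
    """Keep the first internal code seen for each code point."""
    lookup = {}
    for internal, code_point in table:
        lookup.setdefault(code_point, internal)
    return lookup


def reverse_last_wins(table):
    """Let later rows overwrite earlier ones for the same code point."""
    lookup = {}
    for internal, code_point in table:
        lookup[code_point] = internal
    return lookup


cai = ord("才")
print("first wins:", reverse_first_wins(TABLE)[cai])
print("last wins: ", reverse_last_wins(TABLE)[cai])
```

The two lookups disagree on every code point that has more than one row, so the same search term resolves to different internal codes depending on which client is used.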
Objective & Methodology
A seminar was organized by CUHK in July 2003: http://www.lib.cuhk.edu.hk/seminar/unicode/
An HKIUG Working Group on the Unicode Project was formed. Members: CUHK, CityU, HKU, HKUST
Objective
• Solve the software inconsistency between Big5 and UTF-8
• Decide on one-to-one or many-to-one mapping
• Decide on pure EACC or EACC and CCCII
• Clean up errors, wrong types and missing cases
• Prepare to migrate to a Unicode-based database
Objective & Methodology (cont)
The working group further decided:
• Not to fix the Big5 table (small character set, supports only traditional Chinese, more multiple mappings, etc.)
• To propose a new UTF-8 mapping table to Innovative
• For EACC mapping, to follow the LC standard
• To allow multiple mappings of EACC; for unlinked cases, to decide on the preferences
• For multiple mappings of EACC and CCCII, to remove the CCCII
• To convert CCCII in the database to EACC equivalents
• To avoid missing characters by including pure CCCII (though a low percentage of the database)
Procedures: created diac.utf8.hkiug
[Flow diagram]
LC EACC: 15739 EACC merged
  Subtracted 66 substitutes for missing (U+3013)
  Result: 15673 EACC
diac.utf8: 7999 CCCII extracted
  Subtracted 955 with EACC equivalent
  Result: 7044 pure CCCII
diac.utf8.hkiug: 22717 EACC/CCCII
• Remapped 287 PUA
• Selected preferences in multi-mapping linked and unlinked cases
• Corrected LC mappings
• Prepared list for CCCII to EACC data conversion
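The merge shown in the flow diagram can be sketched as two filters followed by a union. This is an assumption-laden illustration rather than the working group's actual procedure: `build_hkiug_table` is a hypothetical name, the tables are modelled as plain dictionaries, and the keys `EACC-X` and `CCCII-Y` (with its arbitrary code point) are invented placeholders. Only the counts at the end come from the slide.

```python
GETA_MARK = 0x3013  # U+3013, the substitute used for characters missing from Unicode


def build_hkiug_table(lc_eacc, iii_cccii, has_eacc):
    """Drop EACC rows that map to the substitute, keep only pure CCCII, then merge."""
    eacc = {code: cp for code, cp in lc_eacc.items() if cp != GETA_MARK}
    pure_cccii = {code: cp for code, cp in iii_cccii.items() if code not in has_eacc}
    return {**eacc, **pure_cccii}


# Toy inputs. "EACC-X" and "CCCII-Y" are invented placeholders, not real codes.
lc_eacc = {"213538": 0x53F0, "213F7B": 0x624D, "EACC-X": GETA_MARK}
iii_cccii = {"283B7D": 0x53F0, "CCCII-Y": 0x4E00}
has_eacc = {"283B7D"}  # CCCII codes assumed to have an EACC equivalent

table = build_hkiug_table(lc_eacc, iii_cccii, has_eacc)
print(sorted(table))

# The counts reported on the slide follow the same arithmetic.
EACC_COUNT = 15739 - 66            # LC EACC minus substitutes for missing
PURE_CCCII_COUNT = 7999 - 955      # extracted CCCII minus those with EACC equivalents
TOTAL = EACC_COUNT + PURE_CCCII_COUNT
print(EACC_COUNT, PURE_CCCII_COUNT, TOTAL)
```

The arithmetic reproduces the slide's figures: 15673 EACC, 7044 pure CCCII and 22717 entries in total.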
Procedures: source from LC
• Merged tables from LC's EACC to UCS/Unicode Mappings: http://www.loc.gov/marc/specifications/specchareacc.html

Procedures: source from diac.utf8
• Included pure CCCII from the UTF-8 table (Rel 2002 Phase 3)
Procedures: re-mapped PUA
• Re-mapped 297 Private Use Area (PUA) code points to suggested alternates
Procedures: selected preference
• Selected preference in multiple mapping EACC

Procedures: selected preference (cont)
• Selected preference in EACC multiple mapping, linked
Linked cases: HKIUG preference indicated

Procedures: selected preference (cont)
• Selected preference in EACC multiple mapping, unlinked
Unlinked cases: HKIUG preference indicated
Procedures: updated LC mapping
• Updated LC mappings
• Referenced from other sources:
  • Unihan
  • OCLC
  • USMARC Character Set for Chinese, Japanese, Korean (printed)
• Examples:
Procedures: list for conversion
• Prepared list for data conversion
[Table with columns CCCII and EACC: CCCII with EACC equivalents, for data conversion]
Deliverables and Actions
• Deliverables to Innovative
  • diac.utf8.hkiug: HKIUG version of the UTF-8 mapping table
      EACC        15,673
      Pure CCCII   7,044
      Total       22,717
  • hasEACC.txt: CCCII with EACC equivalents (955)
  • Final Report: Hong Kong Innovative Users Group (HKIUG) III-UTF8 Working Group Report
• Actions for Innovative
  • Endorse and install diac.utf8.hkiug
  • Replace CCCII listed in hasEACC.txt with their EACC equivalents in the database
Note: local sites may choose whether or not to implement the above actions (e.g. while adopting the new table, CUHK chose to run its own data conversion)
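The second action, replacing CCCII codes with their EACC equivalents, amounts to a lookup-and-substitute pass over the braced codes in each field. The sketch below is hedged on several points: the layout of hasEACC.txt is not given in the slides, so two whitespace-separated columns (CCCII, then EACC) are assumed; `load_equivalents` and `convert_field` are hypothetical names; and the single sample pair reuses two codes that the slides both associate with 台, paired here purely for illustration.

```python
import re

BRACED = re.compile(r"\{([0-9A-Fa-f]{6})\}")

# Assumed file layout: "CCCII EACC" per line. The pair is illustrative only.
SAMPLE_HAS_EACC = """\
283B7D 213538
"""


def load_equivalents(text):
    """Parse 'CCCII EACC' lines into a dict, ignoring blank lines."""
    pairs = {}
    for line in text.splitlines():
        columns = line.split()
        if len(columns) >= 2:
            pairs[columns[0].upper()] = columns[1].upper()
    return pairs


def convert_field(field, pairs):
    """Swap each listed CCCII code for its EACC equivalent; leave the rest untouched."""
    def swap(match):
        code = match.group(1).upper()
        return "{" + pairs[code] + "}" if code in pairs else match.group(0)
    return BRACED.sub(swap, field)


pairs = load_equivalents(SAMPLE_HAS_EACC)
print(convert_field("|a{283b7d}{213131}", pairs))
```

Codes not listed in the file pass through byte for byte, so pure CCCII characters are preserved by the conversion.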
The HKIUG Unicode Project - End of Part I

The HKIUG Unicode Project - Part II
by Ho Yee Ip
CUHK University Library Systems
December 8, 2003
Are the problems solved?
• Resolve Big5 and UTF-8 software inconsistency?
  • Yes (if Big5 interfaces are abandoned)
• Use the same preferred mappings among local sites?
  • Yes (if all sites adopt the new table)
• Able to search the desired code in multiple mapping?
  • Yes (if added entries are created)
• No overlapping of EACC and CCCII in multiple mapping?
  • Yes
• Clear up all errors and missing cases?
  • No (on-going job)
• Switch 100% to Millennium?
  • No (unfortunately, 2002 Phase 3 created more problems …)
Are the problems solved? New problem
• New problems in Release 2002 Phase 3
  • Millennium Edit implicitly converts non-preferred entries to the preferred entry (may be an old problem in Phase 2)
  • Worse, this “preferred” entry may not be the HKIUG preferred one. It is always mapped to the 2nd entry, which is wrong for multiple mappings with more than 2 entries
• Testing
  • in Millennium Cataloguing, input 台 in braced code {283B7D}
  • save the record
  • check in telnet edit mode (Ctrl-W): still {283B7D}
  • re-save the record in Millennium with no further editing
  • re-check in telnet: becomes {27542b}
Note: Global update or amending attached records will not invoke this conversion
• Millennium is not yet ready for CJK editing!
Are the problems solved? Installed sites
• Report from sites that have installed the new UTF-8 mapping table and run the data conversion
  • successful?
  • failed?
  • unexpected outcome?
Follow-up: mapping table
• Continue to clean up and supplement the mapping table
• Recommend updates and changes of EACC mapping to LC and III
• There are 169 mappings that differ between III and LC. HKIUG followed LC
• Consider this case
  • III choice: 2D552E U+82FA 苺
  • LC choice: 2D552E U+8393 莓 (obviously different)
  • Consult: USMARC character set for Chinese, Japanese, Korean. Washington, D.C. : Library of Congress, 1986.
  • the glyph of 2D552E is 苺 (the same as III)
  • Is III right or LC right?
• Others: 232D42, 396B33, 23355C
Follow-up: mapping table (cont)
• Other differences between LC and III
• 232D42
  • III choice: 232D42 U+8842 衂
  • LC choice: 232D42 U+4610 (2 dots), minor variation
  • US MARC (printed): 232D42 衂 (same as III)
• 396B33
  • III choice: 396B33 U+524F 剏
  • LC choice: 396B33 U+5259 剙 (2 dots), minor variation
  • US MARC (printed): 396B33 剏 (same as III)
• 23355C
  • III choice: 23355C U+8C63 豣
  • LC choice: 23355C U+86C3 蛃, obviously different
  • US MARC (printed): 23355C 豣 (same as III)
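Differences like these can be listed mechanically by comparing the two tables code by code. The following sketch assumes each table is a dictionary from EACC code to code point; `diff_tables` is a hypothetical helper, and the two dictionaries hold only the four cases quoted in these slides, not the full 169.

```python
# The four disagreements quoted in the slides (EACC code -> code point).
III_TABLE = {"2D552E": 0x82FA, "232D42": 0x8842, "396B33": 0x524F, "23355C": 0x8C63}
LC_TABLE = {"2D552E": 0x8393, "232D42": 0x4610, "396B33": 0x5259, "23355C": 0x86C3}


def diff_tables(first, second):
    """Return {code: (first_cp, second_cp)} for codes present in both but mapped differently."""
    shared = first.keys() & second.keys()
    return {
        code: (first[code], second[code])
        for code in sorted(shared)
        if first[code] != second[code]
    }


differences = diff_tables(III_TABLE, LC_TABLE)
for code, (iii_cp, lc_cp) in differences.items():
    print(f"{code}: III U+{iii_cp:04X}  LC U+{lc_cp:04X}")
```

A listing like this only shows where the tables disagree; deciding which side is right still needs a check against the printed USMARC character set, as the slides do.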
Follow-up: mapping table (cont)
• Continue to clean up and supplement the mapping table
• Supplement diac.utf8.hkiug with additional CCCII
  • source: latest Unihan database file (e.g. ftp://ftp.unicode.org/Public/4.0-Update1/Unihan-4.0.1d3b.zip)
• Amend diac.utf8.hkiug when LC updates its code standard
  • source: LC MARC 21 code standard (http://www.loc.gov/marc/specifications/specchareacc.html)
Follow-up: added entries
• Change of cataloguing practice
  • Provide added entries for unlinked multi-mapping codes
  • Source data may not be the preferred code (by meaning)
  • Transcription should be faithful to the source
  • Added entries enhance retrieval
e.g. 历 U+5386
  历 {274349} <=> 曆 {214349}
  历 {27462A} (preferred) <=> 歷 {21462A}
Source: 万年历
Follow-up: added entries (cont)
Source: 万年历
  历 {274349} <=> 曆 {214349}
  历 {27462A} (preferred) <=> 歷 {21462A}
Action:
• About 29 of the 49 unlinked cases need attention
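Candidate added entries for an unlinked multi-mapping code can be generated by substituting each sibling code in turn. This is a sketch of the idea only, not a cataloguing tool: `added_entries` is a hypothetical name, the group table holds just the 历 example from the slide, and the field passed in is a bare fragment because the internal codes for the other characters of 万年历 are not given here.

```python
import re
from itertools import product

BRACED = re.compile(r"\{([0-9A-Fa-f]{6})\}")

# Unlinked multi-mapping groups, preferred code first (the 历 example from the slide).
UNLINKED_GROUPS = [["27462A", "274349"]]
SIBLINGS = {code: group for group in UNLINKED_GROUPS for code in group}


def added_entries(field):
    """Return variants of the field using the other codes of each unlinked group."""
    parts = BRACED.split(field)  # odd positions hold the captured codes
    choices = []
    original = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            code = part.upper()
            original.append("{" + code + "}")
            choices.append(["{" + c + "}" for c in SIBLINGS.get(code, [code])])
        else:
            original.append(part)
            choices.append([part])
    variants = {"".join(combo) for combo in product(*choices)}
    variants.discard("".join(original))
    return sorted(variants)


print(added_entries("|a{27462A}"))
```

The transcribed field stays as entered; only the alternatives are returned, which fits the principle that transcription follows the source while added entries improve retrieval.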
Follow-up: staff mode
• Since the Big5 mapping table is not fixed, Telnet Big5 mode can no longer be used; explore other software: AnzioWin, PuTTY
• In Telnet mode, the INNOPAC UTF-8 port cannot support full-screen editing (the CJK display is corrupted in full-screen editing); only line editing is feasible
Follow-up: staff mode (cont)
• Some local sites, e.g. CUHK, use AnzioWin. When AnzioWin is set to CCCII mode, its mapping table CCCII.UNI can be used for Unicode mapping.
• Deficiency: CCCII.UNI is one-to-one, so non-preferred entries cannot be included, e.g.
  # 274349 53D1 # not preferred
  274C7B 53D1
• Better to use the Innopac UTF-8 port when it is ready for editing
Future
To migrate to a pure Unicode environment...
• Abandoning EACC/CCCII will lose the linking of traditional, simplified and variant forms. How to link?
  • 历 U+5386
  • 曆 U+66C6
  • 歷 U+6B77
• Linking information is available from the Unihan website: http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=5386
Migration can be considered only if this linking is maintained by the vendor.
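One way the linking could be rebuilt from Unihan data is to treat the simplified and traditional variant fields as edges of a graph and collect everything reachable from a character. The sketch below makes several assumptions: it uses the tab-separated `U+XXXX<tab>field<tab>value` layout of the Unihan data files, the three sample lines are written out by hand for the 历 example rather than read from a download, and `parse_links` and `linked_group` are hypothetical names.

```python
from collections import defaultdict

# Hand-written sample in the Unihan tab-separated layout (illustration only).
SAMPLE = (
    "U+5386\tkTraditionalVariant\tU+66C6 U+6B77\n"
    "U+66C6\tkSimplifiedVariant\tU+5386\n"
    "U+6B77\tkSimplifiedVariant\tU+5386\n"
)
VARIANT_FIELDS = {"kTraditionalVariant", "kSimplifiedVariant"}


def parse_links(text):
    """Build an undirected graph of code points joined by variant fields."""
    links = defaultdict(set)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        source, field, value = line.split("\t", 2)
        if field not in VARIANT_FIELDS:
            continue
        a = int(source[2:], 16)
        for target in value.split():
            b = int(target[2:], 16)
            links[a].add(b)
            links[b].add(a)
    return links


def linked_group(code_point, links):
    """Collect every code point reachable from the given one."""
    seen = {code_point}
    stack = [code_point]
    while stack:
        current = stack.pop()
        for neighbour in links.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


links = parse_links(SAMPLE)
group = linked_group(0x5386, links)
print(" ".join(f"{chr(cp)} U+{cp:04X}" for cp in sorted(group)))
```

Starting from any of the three characters yields the same group, which is the property a search index would need in order to retrieve 曆 and 歷 when 历 is entered.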
The HKIUG Unicode Project - The End
{21387D} {215938}
U+591A U+8B1D
多 謝
Thank You