6th Annual Hong Kong Innovative Users Group Meeting
8-9 December 2005, Hong Kong
HKIUG’s Unicode Projects: Untangling the Chaotic Codes
Philip Wong, City University of Hong Kong Library
K.T. Lam, Hong Kong University of Science and Technology Library
Content
• Chaos in 2003
• Collaborative effort at HKIUG
• HKIUG CJK Code Table
• TSVCC linking
• Towards native Unicode catalog
6th HKIUG Meeting, Dec 8-9 2005, Lingnan University. HKIUG's Unicode Project, Philip Wong and KT Lam
Chaos in 2003
• Local libraries were using the Big5 Chinese character encoding system
• INNOPAC was in transition towards Unicode support, with the development of the Millennium software
• Dual Web OPAC interfaces existed: Big5 and UTF-8 (Unicode)
• Some libraries (HKUST and CUHK) began releasing a UTF-8 Web OPAC to their users
Chaos in 2003 [cont.]
• INNOPAC’s EACC to Unicode mapping was problematic:
  • multiple mappings
  • incorrect mappings
  • missing codes
  • duplicated EACC and CCCII
  • mapping to different EACCs in the Big5 and UTF-8 interfaces
Chaos in 2003 [cont.]
• CJK support in the Millennium software was buggy
  • Millennium Editor involuntarily replaced characters with the preferred EACC
• Individual libraries communicated with the vendor
  • not fruitful: fixes arrived piecemeal
• Some libraries conducted their own CJK / Unicode studies and tried to propose to the vendor how to tackle these problems, again without much progress
  • HKUST (April 2003)
  • City University of Hong Kong (July 2003)
Collaboration Effort at HKIUG
• June 2003 – HKIUG Standing Committee agreed that a joint proposal was essential for gaining acceptance from the vendor
• July 2003 – seminar organized by CUHK to solicit ideas and comments
• July 2003 – III-UTF-8 Working Group established, with catalogers and systems librarians from CityU, CUHK, HKUST and HKU as members
Collaboration Effort at HKIUG [cont.]
• Sep 2003 – Working Group completed the study and submitted the proposal to the vendor, together with an HKIUG version of the EACC to Unicode Mapping Table
• Oct 2003 – vendor accepted the proposal
• Dec 2003 – work presented at the 4th Annual HKIUG Meeting
• Jan 2004 – HKUST representative was invited to the vendor’s headquarters to help resolve outstanding CJK issues
Collaboration Effort at HKIUG [cont.]
• Results of the HKIUG effort, by February 2004:
  • Millennium Editor problem fixed
  • HKIUG Code Table for CJK Characters adopted
  • Development of TSVCC Linking began
• 25 February 2005 – HKIUG Unicode Task Force established to maintain the Unicode and TSVCC code tables and to assist the vendor with Unicode migration; members from CUHK, CityU, HKUST and HKU
Millennium Editor Problem
• The EACC<->Unicode Mapping Table failed in round-trip crosswalk.
• Case “li”:
  • EACC-based INNOPAC catalog: 274349 历 (simplified form of 曆)
  • Unicode-based Millennium Editor: U+5386 历
  • Saved back to the catalog: 27462A 历 (simplified form of 歷). Incorrect!
Millennium Editor Problem [cont.]
• Problem: EACC character 274349 in the INNOPAC catalog was incorrectly replaced by 27462A when the record was saved in Millennium Editor
• Fixed by preventing Millennium Editor from converting 274349 (a non-preferred code in a multi-mapping) to U+5386 when the record is retrieved from the catalog for editing
  • This is done by using a one-to-one mapping table
Millennium Editor Problem [cont.]
• Side effect
  • The affected character is displayed in the Editor as a braced code, not as a character
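The round-trip failure and the one-to-one fix can be sketched in a few lines of Python. This is only an illustration: the two EACC codes and U+5386 come from the slides, while the table layout, the function names and the exact braced-code syntax are assumptions, not the vendor's implementation.

```python
# Many-to-one forward table: both EACC codes come from the "li" case.
EACC_TO_UNICODE = {
    "274349": "\u5386",  # 历, simplified form of 曆
    "27462A": "\u5386",  # 历, simplified form of 歷
}

# Reverse table: one preferred EACC code per Unicode character.
UNICODE_TO_EACC = {"\u5386": "27462A"}


def round_trip(eacc: str) -> str:
    """EACC -> Unicode -> EACC using the many-to-one table (lossy)."""
    return UNICODE_TO_EACC[EACC_TO_UNICODE[eacc]]


def to_editor(eacc: str) -> str:
    """One-to-one behaviour: only the preferred code becomes a character.

    A non-preferred code is left as a braced code (assumed syntax: {code}).
    """
    char = EACC_TO_UNICODE[eacc]
    if UNICODE_TO_EACC[char] == eacc:
        return char
    return "{" + eacc + "}"


def from_editor(token: str) -> str:
    """Convert editor text back to EACC; braced codes pass through unchanged."""
    if token.startswith("{") and token.endswith("}"):
        return token[1:-1]
    return UNICODE_TO_EACC[token]


if __name__ == "__main__":
    print(round_trip("274349"))              # the code changes on the way back
    print(from_editor(to_editor("274349")))  # the code survives the round trip
```

The braced code is what keeps the round trip lossless, which is also why the affected character can no longer be shown as a character in the Editor.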
HKIUG CJK Code Table
• First released in September 2003; last revised in August 2005
• Contains:
  • 15672 EACC characters
  • 7043 pure CCCII characters
  • 160 multi-mapping linked cases
  • 49 multi-mapping unlinked cases
HKIUG CJK Code Table [cont.]
• Mapping for EACC characters follows LC as much as possible
• Does not contain CCCII characters that have an EACC equivalent; sites adopting the HKIUG CJK code table must convert these CCCII characters in their catalog to the EACC equivalents
• Contains 7043 “pure CCCII” characters that have no EACC equivalent; they are included to avoid too many missing characters
HKIUG CJK Code Table [cont.]
• Multiple mappings
  • Linked case: “ling”
  • Unlinked case: “li”
• HKIUG decides on the preferences
HKIUG CJK Code Table [cont.]
• Also available in XML format, conforming to LC’s code tables schema
• Implementation
  • November 2003 – Pilot testing at HKUST
  • February 2004 – CUHK
  • July 2004 – PolyU
  • October 2004 – CityU, HKU
  • November 2004 – LU, HKBU
  • March 2005 – HKIED
  • December 2005 – HKAPA (scheduled)
TSVCC Linking
• TSVCC stands for “Traditional, Simplified and Variant Chinese Characters”.
• Example – “guo”
  • 國 (U+570B) – Traditional form of “country”
  • 国 (U+56FD) – Simplified form of “country”
  • 囯 (U+56EF) – Variant form of “country” (used in Japanese)
• Example – “xi”
  • 係 (U+4FC2) – Traditional form of “relationship”
  • 繫 (U+7E6B) – Traditional form of “linking”
  • 系 (U+7CFB) – Traditional form of “system”, simplified form of “relationship”, and simplified form of “linking”
• Why TSVCC?
TSVCC Linking [cont.]
• In EACC, traditional, simplified and variant characters can be linked by internal codes
  • “gan” 乾 (21304C) linked to 干 (27304C)
  • “feng” 峰 (213B78) linked to 峯 (2D3B78) and 峄 (393B78)
• However, some multi-mapping cases remain unlinked
  • “gan” 干 (27304C) not linked to 干 (273C67)
  • “li” 历 (274349) not linked to 历 (27462A)
TSVCC Linking [cont.]
• Consider the following multi-mapping case:
  • Searching 历法 (27462A)(21472A) will not retrieve 曆法 (2D4349)(21472A)
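In every linked example on these slides the codes share their last four hexadecimal digits and differ only in the leading two. A minimal Python sketch of that pattern follows. The grouping rule is inferred from the examples shown here rather than quoted from the EACC specification, and `link_key` and `search_key` are hypothetical names.

```python
def link_key(eacc: str) -> str:
    """Return the part of an EACC code shared by linked forms.

    Assumption: linked forms differ only in the first two hex digits.
    """
    return eacc[2:]


def search_key(codes):
    """Build a normalised key for a sequence of EACC codes."""
    return "".join(link_key(code) for code in codes)


if __name__ == "__main__":
    # Linked: 乾 (21304C) and 干 (27304C) share a key.
    print(link_key("21304C") == link_key("27304C"))
    # Unlinked: 干 (27304C) and 干 (273C67) do not.
    print(link_key("27304C") == link_key("273C67"))
    # The search from the slide, against the record from the slide.
    query = ["27462A", "21472A"]
    record = ["2D4349", "21472A"]
    print(search_key(query) == search_key(record))
```

Under this rule the two forms of 历 normalise to different keys, which is why the search for 历法 misses 曆法.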
TSVCC Linking [cont.]
• Native Unicode catalog – all internal linkings will be gone
  • 乾 (U+4E7E), 干 (U+5E72)
  • 峰 (U+5CF0), 峯 (U+5CEF), 峄 (U+5CC4)
  • 历 (U+5386), 曆 (U+66C6), 歷 (U+6B77)
• How to maintain the linkings?
TSVCC Linking [cont.]
• In October 2004, HKIUG constructed the TSVCC Linking Tables and proposed them to the vendor
• Table M – linking relationship is not purely from EACC
  214349 曆| 274349 历| 2D4349 暦| 21462A 歷| 27462A 历| 4B462A 歴| #U+5386 multi-mapped 27462A,274349
• Table V – linking relationship is purely from EACC
  21306C 仇| 2D306C 讎| 33306C 讐| 4B306C 雠
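One way to read such rows is as equivalence groups: every character in a row links to every other character in it. The Python sketch below parses the two sample rows and uses them for a naive whole-string match. The row syntax (fields separated by `|`, a trailing `#` comment) is taken from the slide as transcribed, and the parsing and matching functions are illustrative assumptions, not the vendor's indexing code.

```python
TABLE_M_LINE = ("214349 曆| 274349 历| 2D4349 暦| 21462A 歷| 27462A 历| "
                "4B462A 歴| #U+5386 multi-mapped 27462A,274349")
TABLE_V_LINE = "21306C 仇| 2D306C 讎| 33306C 讐| 4B306C 雠"


def parse_line(line):
    """Split one table row into (eacc_code, character) pairs."""
    data = line.split("#", 1)[0]  # drop the trailing comment, if any
    entries = []
    for field in data.split("|"):
        field = field.strip()
        if not field:
            continue
        code, char = field.split()
        entries.append((code, char))
    return entries


def build_links(lines):
    """Map each character to the set of characters in the same row."""
    links = {}
    for line in lines:
        chars = {char for _, char in parse_line(line)}
        for char in chars:
            links.setdefault(char, set()).update(chars)
    return links


LINKS = build_links([TABLE_M_LINE, TABLE_V_LINE])


def matches(query, heading):
    """True if the heading equals the query up to linked characters."""
    if len(query) != len(heading):
        return False
    return all(h in LINKS.get(q, {q}) for q, h in zip(query, heading))


if __name__ == "__main__":
    print(matches("历法", "曆法"))
    print(matches("历法", "歷法"))
```

Because Table M puts both readings of 历 in one row, a search for 历法 now reaches 曆法 as well as 歷法, which is the recall gain (and the precision cost) discussed later in the deck.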
TSVCC Linking [cont.]
• Implementation
  • October 2004 – created the TSVCC Tables; installed on HKUST’s testing database
  • November 2004 – endorsed by HKIUG, first release
  • November 2004 – TSVCC linking capability was enabled at CityU and HKU (using the vendor’s original tables, i.e. not HKIUG’s version)
  • Lingnan uninstalled it after a short trial period because of the high recall rate
  • August 2005 – HKIUG second release
  • November 2005 – CityU installed the second release
TSVCC Linking [cont.]
• HKALL has also enabled the TSVCC Linking feature, but using hybrid EACC/Unicode tables (using normalized EACC values to maintain default ordering for CJK)
• Drawback: Unicode is a much bigger set than EACC, and the legacy EACC mappings still need to be maintained
• The vendor should invest programming effort to support a Unicode version of the TSVCC tables.
TSVCC Linking [cont.]
• Results of implementation
  • Improvement in searching
  • Trade-off: higher recall, lower precision
TSVCC Linking [cont.]
• Results: improvement in searching
  • Example: search 历法 “Li fa” (screenshot not reproduced)
TSVCC Linking [cont.]
• Results: higher recall, lower precision
  • Example: search 甦齋 “Suzhai” with TSVCC on and TSVCC off, hits marked as relevant or irrelevant (screenshots not reproduced)
TSVCC Linking [cont.]
• Problems found during testing and implementation
  • These are not problems with TSVCC itself, but software problems that require enhancements from the vendor
TSVCC Linking [cont.]
• Problem 1
  • Incorrect “duplicate headings error” in authority heading verification
    Duplicate authority RECORDS 02-11-04
    33 > FIELD: 100 1 |a何迺欣
    INDEXED AS AUTHOR: 何乃欣
    MESSAGE: --------------- DUPLICATE AUTHORITY ----------
    FROM: a1525012x
  • 何乃欣 and 何迺欣 are actually two different authors
  • 乃 {21303A} and 迺 {33303A} are linked in EACC, but this problem does not occur in non-TSVCC indexing
TSVCC Linking [cont.]
• Problem 2
  • Interfiling of indexed characters becomes worse with TSVCC as recall gets higher. Ideally, indexing and sorting should be separated.
  • Example: 曆 (U+66C6), 歷 (U+6B77), 历 (U+5386)
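Separating indexing from sorting could look like the following Python sketch: retrieval compares folded keys, while the display order is computed from the original, unfolded headings. The folding table and function names are hypothetical, and sorting by code point is only a stand-in for whatever CJK collation a real catalog would use.

```python
# Hypothetical folding table: 历 and 歷 are folded onto 曆 for retrieval only.
CANONICAL = {"\u5386": "\u66c6", "\u6b77": "\u66c6"}


def index_key(heading: str) -> str:
    """Key used for matching: linked characters fold to one representative."""
    return "".join(CANONICAL.get(char, char) for char in heading)


def search(headings, query):
    """Match on the folded key, then sort hits by their original text."""
    key = index_key(query)
    hits = [heading for heading in headings if index_key(heading) == key]
    return sorted(hits)  # code-point order of the unfolded headings


if __name__ == "__main__":
    catalog = ["歷法", "曆法", "历法", "書法"]
    for heading in search(catalog, "历法"):
        print(heading)
```

With the sort key kept apart from the index key, the three linked forms are still retrieved together but are filed under their own characters rather than interleaved under one folded value.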
Towards Native Unicode Catalog
• How far are we?
  • LC has issued MARC-8 to Unicode mapping tables
  • OCLC Connexion client 1.5 begins to support MARC record import and export in UTF-8 encoding
  • Intensive discussion of Unicode implementation in MARC on the UNICODE-MARC Discussion List (UNICODE-MARC@loc.gov)
  • Most ILS vendors claim to support Unicode
Towards Native Unicode Catalog [cont.]
• INNOPAC is almost there, but not fully ready yet.
  • There is an option for sites to convert their catalogs to Unicode (e.g. HKALL did so in Oct 2004)
  • The HKALL catalog shows that the Unicode implementation is only partially complete: there are still EACC dependencies in the data store and indexes
  • INNOPAC/Millennium does not yet support exporting and importing records in UTF-8
  • CJK searching and sorting require more work
Towards Native Unicode Catalog [cont.]
• Bibliographic data interchange involves multiple partners.
• Round-trip crosswalk failure (diagram: records pass between Library Catalog 1, OCLC and Library Catalog 2, crossing between EACC and Unicode)
  • Step 1: 275175 系 (simplified form of 繫)
  • Step 2: U+7CFB 系
  • Step 3: 21506E 系 or 273169 系 or 275175 系 (traditional 系, or the simplified form of 係 or 繫)?
Towards Native Unicode Catalog [cont.]
• The failure of round-trip crosswalk between systems will continue to be a problem until all systems are capable of importing and exporting data in Unicode and no one is interchanging MARC records in a non-Unicode encoding
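Characters at risk in such an interchange can be found by inverting the mapping table and listing every Unicode character reached from more than one EACC code. The Python sketch below does this for a tiny sample. The codes are those quoted in the slides, while the table as a Python dict and the function name are assumptions for illustration.

```python
from collections import defaultdict

# Sample rows only; a real table would hold thousands of entries.
EACC_TO_UNICODE = {
    "21506E": "\u7cfb",  # 系, traditional 系
    "273169": "\u7cfb",  # 系, simplified form of 係
    "275175": "\u7cfb",  # 系, simplified form of 繫
    "21304C": "\u4e7e",  # 乾, reached from a single code in this sample
}


def ambiguous_targets(table):
    """Return Unicode characters reached from more than one EACC code."""
    sources = defaultdict(list)
    for eacc, char in table.items():
        sources[char].append(eacc)
    return {char: sorted(codes)
            for char, codes in sources.items() if len(codes) > 1}


if __name__ == "__main__":
    for char, codes in ambiguous_targets(EACC_TO_UNICODE).items():
        print(char, codes)
```

Any character reported here cannot be converted back to EACC without extra information, which is the situation in Step 3 above.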
Thank You!
Contact Information
Philip Wong lbphilip@cityu.edu.hk
K.T. Lam lblkt@ust.hk