Learn about Unicode migration, MARC-8 environments, and challenges faced in moving from EACC to Unicode in library systems at the 7th Annual HKIUG Meeting. Explore resources and observations for migrating INNOPAC, OCLC, and more.
7th Annual Hong Kong Innovative Users Group Meeting, 11 and 12 December 2006, HKUST Library • HKIUG Unicode Task Force and the EACC to Unicode Migration • Ki Tat LAM, Head of Library Systems, The Hong Kong University of Science and Technology Library • lblkt@ust.hk
Contents • HKIUG Unicode Task Force • CJK/Unicode Resources and the Unicode Version of TSVCC Table • Migrating INNOPAC’s storage environment from EACC to Unicode • MARC-8 and Unicode Environments • Outstanding Issues
曆法 / 历法 [System for determining the beginning, length and divisions of a year]
曆法 was incorrectly displayed as 歷法. Is it a data entry error, a display problem, or something else?
Observation #1: • Although OCLC WorldCat’s storage environment has been migrated to Unicode and its Connexion client is Unicode-based, the work is not yet finished. There are still problems that require attention • What about INNOPAC and its Unicode Storage Environment? How ready is it for existing EACC-based sites to migrate to?
Round-trip Crosswalk Failure (EACC Library ↔ OCLC WorldCat, Unicode) • 1. Library contributes 历 in EACC {274349}, which is the simplified form of 曆 • 2. Import to OCLC: Connexion finds {274349} in the mapping table and stores 历 in Unicode as U+5386 • 3. Export from OCLC: Connexion finds both {274349} and {27462A} in the mapping table for U+5386 and decides to output 历 in EACC {27462A} • 4. Library receives 历 in EACC {27462A}, which is the simplified form of 歷
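The round trip above can be sketched in a few lines of Python. This is only an illustration: the two EACC codes and U+5386 come from the slide, while the dictionary layout and the function names are assumptions and do not reflect how Connexion is actually implemented.

```python
# Forward table: both EACC codes on the slide map to the same Unicode character.
EACC_TO_UNICODE = {
    "274349": "\u5386",  # 历 as the simplified form of 曆
    "27462A": "\u5386",  # 历 as the simplified form of 歷
}

# Reverse table: only one EACC code can be chosen per Unicode character.
# The choice of 27462A mirrors the outcome described on the slide.
UNICODE_TO_EACC = {
    "\u5386": "27462A",
}


def import_record(eacc_code: str) -> str:
    """EACC -> Unicode, as on import into a Unicode store."""
    return EACC_TO_UNICODE[eacc_code]


def export_record(char: str) -> str:
    """Unicode -> EACC, as on export back to an EACC library."""
    return UNICODE_TO_EACC[char]


def round_trip(eacc_code: str) -> str:
    """Contribute a character in EACC and receive it back in EACC."""
    return export_record(import_record(eacc_code))


if __name__ == "__main__":
    sent = "274349"
    received = round_trip(sent)
    print(sent, "->", received, "(lossless)" if sent == received else "(changed)")
```

Because the forward table is many-to-one, no reverse table can restore both EACC codes, which is why the distinction between 曆 and 歷 is lost once the record has passed through Unicode.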
Observation #2: • The failure of round-trip crosswalks between systems will continue to be a problem until everyone interchanges MARC records purely in Unicode. This will only be achieved when the majority of systems store and use data natively in Unicode • There is an immediate need for INNOPAC sites to migrate to the Unicode storage environment!
HKIUG Unicode Task Force • In 2003-2004, an ad hoc group of systems librarians and catalogers from member libraries worked closely with Innovative Interfaces, Inc. (III) on issues related to CJK and the EACC to Unicode mappings. • Developed HKIUG Version of the EACC to Unicode mapping table • Resolved EACC to Unicode multi-mapping problem • Began drafting TSVCC (Traditional, Simplified, Variant Chinese Characters) table
HKIUG Unicode Task Force [2] • In February 2005, the HKIUG Unicode Task Force was officially established to: • maintain the CJK/Unicode resources produced in 2003-2004; • develop new resources, such as the Unicode Version of the TSVCC table; • facilitate the searching, display and retrieval of CJK records in library catalogs; and • assist member libraries in migrating from EACC-based character encoding to Unicode
HKIUG Unicode Task Force [3] • Members of the Task Force: • CHAN Wai Ming (Secretary), University of Hong Kong • HO Yee Ip, Chinese University of Hong Kong • LAM Ki Tat (Chair), The Hong Kong University of Science and Technology • Joanna PONG, City University of Hong Kong • SUN Zehua, The Hong Kong University of Science and Technology • Mr. Philip WONG, City University of Hong Kong • Recruiting new members: we welcome colleagues to join forces with us
HKIUG Unicode Task Force [4] • Achievements in 2006: • July 2006 - finished and released the Unicode Version of the TSVCC Table • August 2006 - released the CJK/Unicode Resources developed over the past three years to the Internet for open access [http://hkiug.ln.edu.hk/unicode/] • November 2006 – visited Hong Kong Shue Yan College (HKSYC) Library to study its Unicode Storage Environment; and reported outstanding issues to III.
TSVCC Table - Unicode Version • When searching for 历法 “Li fa”, you would want to retrieve records that contain either: • 历法 • 曆法 where 曆 and 历 have a Traditional – Simplified relationship • Similarly, when searching for 屏, you would want to retrieve records containing its Variant 屛 • This requires linking the T, S and V forms during searching
TSVCC Table - Unicode Version [2] • Results of implementing TSVCC Linking: • Improvement in searching – higher recall • Trade-off – lower precision • If search results are sorted/displayed in TSVCC normalized form, misleading and inaccurate display may occur - such as the OCLC Connexion browse list display problem mentioned previously
TSVCC Table - Unicode Version [3] • HKIUG Unicode Task Force constructed two versions of TSVCC tables • EACC Version [1.0 released August 2005] • Unicode Version [1.0 released July 2006] for INNOPAC systems that store characters in EACC and in Unicode respectively
TSVCC Table - Unicode Version [4] • TSVCC link cases collected in the Unicode Version are: • derived from the EACC Version, e.g. EACC link, U+XXXX multi-mapped; • harvested from the Unicode Consortium’s Unihan Database, e.g. kSimplifiedVariant, kZVariant; • proposed by the Unicode Task Force members, e.g. hkiugSimplifiedVariant, hkiugZVariant
TSVCC Table - Unicode Version [5] • Examples of Link Cases in Unicode Version:
U+66C6 曆 | U+5386 历 | U+66A6 暦 | U+6B77 歷 | U+6B74 歴 | U+F98B 曆 | U+F98C 歷 | #EACC link ([21/27/2D]4349),([21/27/4B]462A) AND U+5386 multi-mapped 27462A,274349 AND kZVariant of U+F98B is U+66C6 AND kZVariant of U+F98C is U+6B77
U+5C5B 屛 | U+5C4F 屏 | U+6452 摒 | #EACC link ([27/21]415A) AND hkiugZVariant of U+5C4F is U+5C5B
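As a rough sketch of how link cases like these could drive TSVCC linking at search time, the Python below treats each link case as a group of interchangeable characters. The code points are the ones listed above; the in-memory format, the function names and the matching strategy are assumptions for illustration and are not taken from the HKIUG table files or from INNOPAC.

```python
# Each entry lists the code points of one link case (comments and glyphs dropped).
LINK_CASES = [
    "U+66C6 | U+5386 | U+66A6 | U+6B77 | U+6B74 | U+F98B | U+F98C",
    "U+5C5B | U+5C4F | U+6452",
]


def build_index(lines):
    """Map every character to the frozenset of characters it is linked with."""
    index = {}
    for line in lines:
        tokens = [tok.strip() for tok in line.split("|") if tok.strip()]
        group = frozenset(chr(int(tok[2:], 16)) for tok in tokens)
        for ch in group:
            index[ch] = group
    return index


def expand_query(query, index):
    """Return, per query position, the set of characters accepted there."""
    return [index.get(ch, frozenset(ch)) for ch in query]


def matches(query, text, index):
    """True if some substring of text equals query up to TSVCC linking."""
    slots = expand_query(query, index)
    n = len(slots)
    return any(
        all(text[i + j] in slots[j] for j in range(n))
        for i in range(len(text) - n + 1)
    )


if __name__ == "__main__":
    idx = build_index(LINK_CASES)
    for title in ("\u66c6\u6cd5", "\u6b77\u6cd5", "\u4e2d\u570b"):
        print(title, matches("\u5386\u6cd5", title, idx))
```

In this sketch a search for 历法 matches 曆法 (higher recall) but also 歷法, because 曆 and 歷 sit in the same link case; that is the precision trade-off noted earlier.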
TSVCC Table - Unicode Version [6] • Support linking of CJK Compatibility Ideographs • e.g. [U+F92F 勞] in the previous screen dump, a variant from KS C5601-1987 • Support linking of forms used differently in Mainland China and in Hong Kong [example figure not reproduced]
TSVCC Table - Unicode Version [7] • We welcome contributions from CJK experts and colleagues of member libraries to enhance the TSVCC tables • e.g. projects to establish TSVCC links from Hangul Syllables, Hiragana and Katakana to CJK ideographs
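For comparison only, CJK Compatibility Ideographs also carry canonical decompositions in the Unicode Character Database, so standard normalization folds them to their unified forms. The sketch below uses Python's `unicodedata` module; it shows plain Unicode normalization, which is a different mechanism from the TSVCC table and is not claimed to be what INNOPAC does. The function name is hypothetical.

```python
import unicodedata


def fold_compatibility_ideographs(text: str) -> str:
    """Apply NFC, which replaces CJK Compatibility Ideographs that have a
    canonical decomposition with the corresponding unified ideograph."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    for cp in (0xF92F, 0xF98B, 0xF98C):
        folded = fold_compatibility_ideographs(chr(cp))
        print(f"U+{cp:04X} -> U+{ord(folded):04X}")
```

Normalization only covers the compatibility block; Traditional, Simplified and other Variant relationships still need a linking table such as TSVCC.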
MARC-8 and Unicode Environments • In 2000, the Library of Congress issued: Specifications to distinguish the encoding of MARC 21 records in the original (MARC-8) environment and in the new UCS/Unicode environment [http://www.loc.gov/marc/specifications/speccharintro.html] • In MARC-8, each character is encoded either in one 8-bit byte (e.g. ASCII) or in three 8-bit bytes (e.g. EACC)
Figure: A MARC 21 bibliographic record in ISO2709 format viewed in Notepad, showing CJK characters encoded in EACC in MARC-8 environment. Highlighted bytes: 21 62 62 (黃), 21 39 25 (大), 21 30 21 (一)
MARC-8 and Unicode Environments [2] • UCS/Unicode Environment [http://www.loc.gov/marc/specifications/speccharucs.html] • Use UTF-8 as character encoding • Leader position 9 contains value “a” • Field 066 (Character Sets Present) is not needed • The script identification information in subfield 6 (Linkage) can be dropped • Lengths are specified by number of 8-bit bytes, rather than number of characters.
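Two of these points lend themselves to a short Python sketch: reading Leader position 9 to tell the environments apart, and counting lengths in bytes rather than characters. The function names and the sample leader are made up for illustration; only the meaning of position 9 (blank for MARC-8, “a” for UCS/Unicode) comes from the MARC 21 specification.

```python
def marc_encoding_environment(leader: str) -> str:
    """Classify a 24-character MARC 21 leader by its position 9 value."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is exactly 24 characters long")
    flag = leader[9]
    if flag == "a":
        return "UCS/Unicode"
    if flag == " ":
        return "MARC-8"
    raise ValueError(f"unexpected value in leader position 9: {flag!r}")


def utf8_length(value: str) -> int:
    """Length as counted in the UCS/Unicode environment: UTF-8 bytes."""
    return len(value.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical leader, assembled field by field so position 9 is easy to see.
    leader = "00714" + "cam" + " " + "a" + "22" + "00205" + " a " + "4500"
    print(marc_encoding_environment(leader))
    title = "\u9ec3\u5927\u4e00"  # 黃大一
    print(len(title), "characters,", utf8_length(title), "bytes in UTF-8")
```

The three CJK characters occupy nine bytes in UTF-8, which is the figure that record and field lengths must reflect in the UCS/Unicode environment.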
MARC-8 and Unicode Environments [3] • Unicode combining rule for diacritics: combining marks follow, rather than precede, the character they modify (in MARC-8 they precede it)
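The reordering this rule implies during conversion can be sketched as follows. The sketch assumes the characters have already been mapped to Unicode code points but are still in MARC-8 order (marks before the base character); the function name is hypothetical and real converters handle more cases than this.

```python
import unicodedata


def marc8_order_to_unicode_order(text: str) -> str:
    """Move combining marks that precede a base character to follow it."""
    out = []
    pending = []  # marks seen since the last base character
    for ch in text:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    out.extend(pending)  # trailing marks with no base character are kept as-is
    return "".join(out)


if __name__ == "__main__":
    marc8_style = "\u0301e"  # combining acute accent, then "e"
    fixed = marc8_order_to_unicode_order(marc8_style)
    print([f"U+{ord(c):04X}" for c in fixed])
    print(unicodedata.normalize("NFC", fixed))
```

After reordering, the sequence is valid Unicode and can optionally be composed with NFC, as the last line of the usage example does.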
A MARC 21 bibliographic record in ISO2709 format viewed in Notepad, showing CJK characters encoded in UTF-8 in UCS/Unicode environment
Migrating from EACC to Unicode • The following INNOPAC systems are in Unicode Storage Environment: • HKSYC (Hong Kong Shue Yan College) • HKALL (the INN-Reach system for the eight universities in Hong Kong) • HKUST Tool Testing Database
Migrating from EACC to Unicode [2] • HKSYC Visit • A group of systems librarians and catalogers from member libraries visited HKSYC Library in November 2006 to learn how its INNOPAC system works in Unicode Storage Environment • A number of outstanding issues were identified and/or confirmed • If you have migrated to Unicode storage or plan to migrate now, you might also face the same problems
Migrating from EACC to Unicode [3] • Outstanding Issues • TSVCC Linking is not turned on; and even if it were turned on, it would not be using the latest HKIUG version • When certain CJK characters, such as U+8AAC 説 and U+7CB5 粵, are entered via Millennium Editor and the record is saved, these characters are stripped away and not saved: a destructive bug awaiting a fix
Migrating from EACC to Unicode [4] • Export from INNOPAC: only export in MARC-8 Environment was provided. There should be an option for users to export in Unicode Environment • III replied that this option is available • Import (Load) into INNOPAC: only import in MARC-8 Environment was provided. There should be an option for users to load MARC records in Unicode Environment (i.e. in UTF-8). • III replied that this option is available
Migrating from EACC to Unicode [5] • Sorting at HKSYC still seems to be EACC-based • The sorting key seems to be constructed from: [No. of strokes][EACC code value] • For example, as observed from WebPAC’s URL, the sorting key for 中國 is: “04{213034}11{21376f}”. It should instead be built from the Unicode code value, i.e. “04{u4e2d}11{u570b}”
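A minimal Python sketch of the two key formats follows. The stroke counts, EACC values and key strings for 中 and 國 are the ones shown above; the lookup tables and function names are assumptions, since the slide only reports what was observed in the WebPAC URL and not how INNOPAC builds the key internally.

```python
# Tiny lookup tables holding only the two characters from the slide's example.
STROKE_COUNT = {"\u4e2d": 4, "\u570b": 11}            # 中, 國
EACC_CODE = {"\u4e2d": "213034", "\u570b": "21376f"}  # as observed in the URL


def eacc_sort_key(text: str) -> str:
    """Key in the observed form: [No. of strokes]{EACC code value}."""
    return "".join(f"{STROKE_COUNT[ch]:02d}{{{EACC_CODE[ch]}}}" for ch in text)


def unicode_sort_key(text: str) -> str:
    """Key in the proposed form: [No. of strokes]{u + Unicode code value}."""
    return "".join(f"{STROKE_COUNT[ch]:02d}{{u{ord(ch):04x}}}" for ch in text)


if __name__ == "__main__":
    word = "\u4e2d\u570b"  # 中國
    print(eacc_sort_key(word))
    print(unicode_sort_key(word))
```

Only the second component of each pair changes, so the stroke-count ordering is preserved while the dependence on EACC values is removed.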
Migrating from EACC to Unicode [6] • The illogical sort order found in HKUST’s Tool Testing Database also needs to be fixed: 1: ASCII space/punctuation (e.g. :) 2: ASCII numerals (e.g. 1) 3: CJK characters with pinyin (e.g. 中) 4: ASCII letters (e.g. a) 5: CJK characters without pinyin (e.g. を)
Migrating from EACC to Unicode [7] • Pure Unicode Storage Environment • Once a system has migrated to the Unicode Storage Environment, there should be no need to map back and forth between EACC and Unicode, except in some necessary conversion routines • In order to maintain a natively Unicode environment, EACC dependencies should be identified and eliminated
Conclusion • How far are we from native Unicode? • Both LC and OCLC have done enormous work in enabling and promoting the use of Unicode in MARC records • ILS vendors including III are working very hard to implement and enhance Unicode support • Libraries and CJK experts are providing advice and suggesting solutions
Conclusion [2] • Migrating INNOPAC to Unicode • We have reviewed various outstanding issues as found in INNOPAC’s Unicode Storage Environment • We hope these issues will be resolved quickly so that HKIUG member libraries can start to migrate their systems to Unicode • HKIUG Unicode Task Force will continue to work closely with III to enable a smooth migration
Additional Readings • K.T. Lam. EACC to Unicode migration. OCLC-CJK Users Group 2006 Annual Meeting. [http://hdl.handle.net/1783.1/2500] • Wong, Philip and K.T. Lam. HKIUG’s Unicode projects: untangling the chaotic codes. HKIUG Annual Meeting 2005. [http://hdl.handle.net/1783.1/2429]