CJK Character Validation: Impact from EACC to Unicode Migration
2006 CEAL Conference, Committee on Technical Processing
Ai-lin Yang, East Asian Library, UC Berkeley
April 5, 2006
EACC/MARC21 and Unicode
• East Asian Character Code (EACC) is MARC-8 CJK in MARC21
• Migration to Unicode
  • Library of Congress database
  • RLG's Union catalog database
  • OCLC's WorldCat database
• CJK bibliographic records are restricted to "EACC characters"
Microsoft IME Variants
• Non-MARC21 characters
  • Duplicate CJK characters (e.g. 路, encoded at both U+F937 and U+8DEF)
  • Close variants (e.g. 步, U+6B65, and 歩, U+6B69)
  • Typically one of these variants is a MARC21 character
• CJK character validation errors in OCLC
  • OCLC XWC (Extended WorldCat), held in an Oracle database, is built on Unicode
  • OCLC online cataloging follows MARC21 standards
  • CJK scripts are input using Microsoft Global Input Method Editors (IMEs)
  • Non-MARC21 characters cause CJK character validation errors
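The difference between the two kinds of variant can be seen with Python's standard `unicodedata` module. This is only a minimal sketch: the helper name `describe` is made up for illustration, and Unicode normalization is not the same thing as MARC21 validation. It shows that the duplicate at U+F937 is folded into U+8DEF by normalization, while the close variant U+6B69 is left alone and so still needs a manual replacement.

```python
import unicodedata

def describe(ch):
    """Return (original code point, code point after NFC normalization)."""
    normalized = unicodedata.normalize("NFC", ch)
    return f"U+{ord(ch):04X}", f"U+{ord(normalized):04X}"

# Duplicate (compatibility) ideograph: normalization maps it to U+8DEF.
print(describe("\uF937"))   # ('U+F937', 'U+8DEF')

# Close variant: a distinct character, unchanged by normalization.
print(describe("\u6B69"))   # ('U+6B69', 'U+6B69')
```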
OCLC Connexion / IME Online Cataloging Examples
• Title: 汉宫秋月 (simplified 宫)
  • 245 (non-Latin) occurrence 1, $a occurrence 1, position 2 - invalid character - data must be valid non-Latin characters
  • Valid when changed to: 汉宮秋月 (traditional 宮)
• Title: 瑶族长鼓舞曲 (simplified 瑶)
  • 245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
  • Valid when changed to: 瑤族长鼓舞曲 (traditional 瑤)
• Title: 説故事的人 (traditional 説)
  • 245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
  • Valid when changed to: 說故事的人 (traditional 說)
OCLC Connexion / IME Online Cataloging Examples
• Title: 户外环境敎育 (simplified 户)
  • 245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
• 戸外环境敎育 (traditional 戸)
  • 245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
  • Valid when changed to: 戶外环境敎育 (traditional 戶)
• Title: 吴大澂手批本地子箴言
  • 澂 can be found only in the traditional list; this character does not exist in the simplified list
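The error messages in these examples report the 1-based position of the first character that falls outside the permitted repertoire. A minimal sketch of that kind of check follows. It is not OCLC's implementation: `ALLOWED` is a tiny hypothetical stand-in for the MARC21 (EACC) repertoire, `validate_subfield` is an invented name, and the message wording simply imitates the examples above.

```python
# Hypothetical stand-in for the MARC21 (EACC) repertoire; the real set is far larger.
# U+6C49 汉, U+5BAE 宮 (traditional), U+79CB 秋, U+6708 月
ALLOWED = set("\u6c49\u5bae\u79cb\u6708")

def validate_subfield(text, tag="245", subfield="a", allowed=ALLOWED):
    """Return one message per character that is not in the allowed set."""
    errors = []
    for position, ch in enumerate(text, start=1):  # positions are 1-based
        if ch not in allowed:
            errors.append(
                f"{tag} (non-Latin) occurrence 1, ${subfield} occurrence 1, "
                f"position {position} - invalid character - "
                "data must be valid non-Latin characters"
            )
    return errors

# Title typed with simplified 宫 (U+5BAB) in position 2.
for message in validate_subfield("\u6c49\u5bab\u79cb\u6708"):
    print(message)

# The same title with traditional 宮 (U+5BAE) passes.
print(validate_subfield("\u6c49\u5bae\u79cb\u6708"))  # []
```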
Solutions
• Unihan Database
• CJK Compatibility Database
• OCLC CJK E-dictionary
Unihan Database
http://www.unicode.org/charts/unihan.html
• Unihan database index
• Unihan grid index
• Unihan radical-stroke index
• Unihan database information
  • (I) Several different glyphs for the character
  • (N) Different representations of the character's scalar value
  • (N) Mappings to the IRG sources for the character
  • (I) Mappings to major industrial and national standards and other character collections
  • (N) Positions in the four dictionaries used by the IRG
  • (I) Positions in other commonly-used dictionaries
  • (I) Radical-stroke counts as derived from different sources
  • (I) Phonetic data derived from various sources
  • (I) Other dictionary data
  • (I) Variants (with links to the variant forms)
  • Compounds containing the character
  • (I) Other information contained in the Unihan database
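The variant data listed above is also distributed as plain-text files in which each line holds a code point, a field name, and a value separated by tabs. The sketch below parses a few such lines to find the traditional form of a simplified character. The two sample lines are illustrative entries written in that layout, not a verbatim excerpt of the Unihan files, and `parse_variants` is a hypothetical helper name.

```python
# Illustrative lines in the tab-separated Unihan layout (not a verbatim excerpt).
SAMPLE = """\
# code point<TAB>field<TAB>value
U+5BAB\tkTraditionalVariant\tU+5BAE
U+6237\tkTraditionalVariant\tU+6236
"""

def to_char(code):
    """Convert 'U+5BAE' (optionally followed by '<source' tags) to a character."""
    return chr(int(code.split("<")[0][2:], 16))

def parse_variants(text, field="kTraditionalVariant"):
    """Map each character to the list of variants recorded under `field`."""
    variants = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, name, value = line.split("\t", 2)
        if name == field:
            # A value may list several code points separated by spaces.
            variants[to_char(code)] = [to_char(v) for v in value.split()]
    return variants

table = parse_variants(SAMPLE)
print(table["\u5bab"])  # traditional form(s) recorded for simplified U+5BAB
```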
CJK Compatibility Database
http://www.loc.gov/ils/cjk_search/cjk_cpso.html
• Replace a non-MARC21 character with its MARC21 equivalent
• Steps for using the CJK compatibility database
  • Copy the invalid character from your bibliographic record
  • Open the CJK Compatibility Page
  • Paste the invalid character in the white box and use the index "Invalid character"
  • Click "Submit"
  • Copy and paste the valid alternative into your bibliographic record
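The copy, look up, and paste steps above amount to a character-for-character substitution. A minimal sketch of that substitution follows, assuming a small hand-built table: `REPLACEMENTS` contains only the pairs shown in the cataloging examples of this presentation, it is not data exported from the CJK Compatibility Database, and `to_marc21` is an invented name.

```python
# Invalid character -> valid alternative, taken from the examples in these slides.
REPLACEMENTS = {
    "\u5bab": "\u5bae",  # 宫 -> 宮
    "\u7476": "\u7464",  # 瑶 -> 瑤
    "\u8aac": "\u8aaa",  # 説 -> 說
    "\u6237": "\u6236",  # 户 -> 戶
    "\u6238": "\u6236",  # 戸 -> 戶
}

def to_marc21(text, table=REPLACEMENTS):
    """Replace every character found in `table`; leave all others unchanged."""
    return text.translate(str.maketrans(table))

fixed = to_marc21("\u6c49\u5bab\u79cb\u6708")  # 汉宫秋月 typed with simplified 宫
print(fixed)
print(fixed == "\u6c49\u5bae\u79cb\u6708")     # True
```

A table like this only covers characters someone has already looked up; anything not listed still has to be checked by hand.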
CJK Character Validation Thank you!