
CJK Character Validation – Impact from EACC to Unicode Migration

2006 CEAL Conference, Committee on Technical Processing. Ai-lin Yang, East Asian Library, UC Berkeley. April 5, 2006.


Presentation Transcript


  1. CJK Character Validation – Impact from EACC to Unicode Migration
     • 2006 CEAL Conference, Committee on Technical Processing
     • Ai-lin Yang, East Asian Library, UC Berkeley
     • April 5, 2006

  2. EACC/MARC21 and Unicode
     • East Asian Character Code (EACC) is MARC-8 CJK in MARC21
     • Migration to Unicode
       • Library of Congress database
       • RLG's Union catalog database
       • OCLC's WorldCat database
     • CJK bibliographic records are restricted to "EACC characters"

  3. Microsoft IME Variants
     • Non-MARC21 characters
       • Duplicate CJK characters (e.g. 路, U+F937, and 路, U+8DEF)
       • Close variants (e.g. 步, U+6B65, and 歩, U+6B69)
       • Typically one of these variants is a MARC21 character
     • CJK character validation errors in OCLC
       • OCLC XWC (Extended WorldCat) in Oracle database is built on Unicode
       • OCLC online cataloging follows MARC21 standards
       • CJK scripts are input by using Microsoft Global Input Method Editors (IMEs)
       • Non-MARC21 characters cause CJK character validation errors
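The difference between a duplicate and a close variant can be checked with Unicode normalization. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `canonical_form` is an assumption for illustration and is not part of the presentation or of any OCLC tool. Normalization folds a compatibility ideograph into its unified ideograph, but it leaves close variants untouched because they are separate characters.

```python
import unicodedata


def canonical_form(ch: str) -> str:
    """Return the NFC form of a string (hypothetical helper for illustration)."""
    return unicodedata.normalize("NFC", ch)


duplicate = "\uF937"  # CJK COMPATIBILITY IDEOGRAPH-F937, displays like 路
unified = "\u8DEF"    # CJK UNIFIED IDEOGRAPH-8DEF, 路

# The duplicate folds into the unified ideograph under normalization.
print(canonical_form(duplicate) == unified)

# Close variants are distinct characters, so U+6B69 (歩) is not turned into U+6B65 (步).
print(canonical_form("\u6B69") == "\u6B65")
```

Normalization therefore addresses only the duplicate case; close variants still need a lookup such as the databases listed under Solutions.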

  4. OCLC Connexion / IME Online Cataloging Examples
     • Title: 汉宫秋月 (simplified 宫)
       • 245 (non-Latin) occurrence 1, $a occurrence 1, position 2 - invalid character - data must be valid non-Latin characters
       • Valid when changed to: 汉宮秋月 (traditional 宮)
     • Title: 瑶族长鼓舞曲 (simplified 瑶)
       • 245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
       • Valid when changed to: 瑤族长鼓舞曲 (traditional 瑤)
     • Title: 説故事的人 (traditional 説)
       • 245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
       • Valid when changed to: 說故事的人 (traditional 說)

  5. OCLC Connexion / IME Online Cataloging Examples
     • Title: 户外环境敎育 (simplified 户)
       • 245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
       • 戸外环境敎育 (traditional 戸)
       • 245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
       • Valid when changed to: 戶外环境敎育 (traditional 戶)
     • Title: 吴大澂手批本地子箴言
       • 澂 can be found only in the traditional list; this character does not exist in the simplified list
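The error messages in these examples report a 1-based character position within the subfield. A minimal sketch of that kind of check follows. The table `KNOWN_REPLACEMENTS` and the function `find_invalid` are hypothetical names, and the table holds only the five character pairs shown in the examples; it is not the EACC repertoire and does not reproduce OCLC's actual validation logic.

```python
# Hypothetical lookup table built only from the slide examples:
# non-MARC21 character -> MARC21 equivalent.
KNOWN_REPLACEMENTS = {
    "\u5BAB": "\u5BAE",  # 宫 -> 宮
    "\u7476": "\u7464",  # 瑶 -> 瑤
    "\u8AAC": "\u8AAA",  # 説 -> 說
    "\u6237": "\u6236",  # 户 -> 戶
    "\u6238": "\u6236",  # 戸 -> 戶
}


def find_invalid(text, table=KNOWN_REPLACEMENTS):
    """Return (1-based position, character, suggested replacement) for each known invalid character."""
    return [
        (pos, ch, table[ch])
        for pos, ch in enumerate(text, start=1)
        if ch in table
    ]


title = "\u6C49\u5BAB\u79CB\u6708"  # 汉宫秋月
for pos, bad, good in find_invalid(title):
    print(f"position {pos}: U+{ord(bad):04X} should be U+{ord(good):04X}")
```

For the first title above, the sketch flags position 2, which matches the position in the reported message.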

  6. Solutions
     • Unihan Database
     • CJK Compatibility Database
     • OCLC CJK E-dictionary

  7. Unihan Database: http://www.unicode.org/charts/unihan.html
     • Unihan database index
     • Unihan grid index
     • Unihan radical-stroke index
     • Unihan database information
       • (I) Several different glyphs for the character
       • (N) Different representations of the character's scalar value
       • (N) Mappings to the IRG sources for the character
       • (I) Mappings to major industrial and national standards and other character collections
       • (N) Positions in the four dictionaries used by the IRG
       • (I) Positions in other commonly-used dictionaries
       • (I) Radical-stroke counts as derived from different sources
       • (I) Phonetic data derived from various sources
       • (I) Other dictionary data
       • (I) Variants (with links to the variant forms)
       • Compounds containing the character
       • (I) Other information contained in the Unihan database
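The variant data listed above is also distributed as tab-separated text, one `code point, field, value` triple per line. The sketch below parses lines in that layout. The function `parse_variants` is a hypothetical helper, and the two sample lines are written here only to illustrate the layout; they should be checked against the actual Unihan data file before being relied on.

```python
from collections import defaultdict

# Illustrative sample in the tab-separated Unihan layout (an assumption for this
# sketch, not copied from the data file).
SAMPLE = (
    "# code point<TAB>field<TAB>value\n"
    "U+6237\tkTraditionalVariant\tU+6236\n"
    "U+6236\tkSimplifiedVariant\tU+6237\n"
)


def parse_variants(text):
    """Map each code point to {field: [variant code points]}."""
    variants = defaultdict(dict)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, values = line.split("\t")
        # A value may carry a source tag after "<"; keep only the code point.
        variants[code][field] = [v.split("<")[0] for v in values.split()]
    return variants


table = parse_variants(SAMPLE)
print(table["U+6237"]["kTraditionalVariant"])
```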

  8. Unihan Database Search (U+6237)

  9. Unihan Database Search (U+6236)

  10. CJK Compatibility Database: http://www.loc.gov/ils/cjk_search/cjk_cpso.html
     • Replace a non-MARC21 character with its MARC21 equivalent
     • Steps for using the CJK compatibility database
       • Copy the invalid character from your bibliographic record
       • Open the CJK Compatibility Page
       • Paste the invalid character in the white box and use the index "Invalid character"
       • Click "Submit"
       • Copy and paste the valid alternative into your bibliographic record
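Once a valid alternative is known, the copy-and-paste step amounts to a character substitution. The sketch below shows this with Python's built-in `str.translate`. The names `REPLACEMENTS` and `to_valid` are hypothetical, and the table holds only the two 戶 variants from the earlier cataloging example, not the contents of the CJK Compatibility Database.

```python
# Hypothetical substitution table: both variants from the slide example map to 戶 (U+6236).
REPLACEMENTS = str.maketrans({
    "\u6237": "\u6236",  # 户 -> 戶
    "\u6238": "\u6236",  # 戸 -> 戶
})


def to_valid(text: str) -> str:
    """Replace each known non-MARC21 character with its MARC21 equivalent."""
    return text.translate(REPLACEMENTS)


title = "\u6237\u5916\u73AF\u5883\u654E\u80B2"  # 户外环境敎育
print(to_valid(title))
```

Characters that are not in the table pass through unchanged, so the result still needs to be validated in the cataloging client.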

  11. CJK Compatibility Database Search

  12. OCLC CJK E-Dictionary

  13. OCLC CJK E-Dictionary Search

  14. OCLC CJK E-Dictionary Search

  15. CJK Character Validation
     • Thank you!
