Character Codes Related Problems - UNICODE OPAC and Millennium at WASEDA Univ. Library - Tsutomu SUZUKI (tsutomu@waseda.jp) Waseda University Library 4th Hong Kong INNOPAC Users Group Meeting December 2003
WASEDA University Overview
• Founded in 1882
• Now has:
-- 10 undergraduate schools
-- 14 graduate schools
-- 5 large campus libraries & 27 small libraries
-- 2 university museums
-- 44,576 undergraduate and 6,147 graduate students (as of end of April 2002)
Library Overview (as of March 31, 2002)
• 4,705,597 books (2,980,352 CJK books + 1,725,245 Western books)
• 49,615 journal titles (19,509 currently subscribed)
• 879,336 items checked out per year
• ILL transactions: 13,951 requests to other libraries; 18,491 requests received from other libraries
• Total number of Central Library visits: 1,197,731 (2002.4 – 2003.3)
Current Status of Our INNOPAC
Recent record numbers (Oct. 29, 2003) from M-I-F-S
• 1,752,690 bibliographic records
• 3,434,122 item records
• 52,133 check-in records
Public Catalog Searches from “ANALYZE patron searches”
• 5,149,322 searches (2002.4 – 2003.3)
Unicode Port on WEBPAC
• On November 17th, the Unicode OPAC was released to the public (some character code troubles still remain).
• Downloading Chinese & Korean bib data from OCLC.
• Record maintenance: AnzioWin
• Number of C & K bib records (as of 11th Nov.): 15,971 bibs of Chinese materials; 157 bibs of Korean materials
Case1: Mapping Error The screen below shows my patron record on Millennium Circulation. One Katakana character, “Zu”, is not displayed properly.
Case1: Mapping Error If I search for “suzuki” on the Unicode OPAC, the “zu” is ignored and records matching “suki” are hit.
Case1: Mapping Error
SJIS: 253A | EACC: 69253A | UNICODE: 30BA
This EACC character is NOT mapped to any UNICODE character. It should be mapped to 30BA in UNICODE.
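The failure mode in Case 1 can be sketched as a table lookup with a missing entry. This is only an illustration: the table and the `display` and `search_key` helpers are hypothetical and say nothing about how Innopac is actually implemented. Only the code points 69253A and 30BA, and the `{69253a}` placeholder seen on Millennium, come from the slides.

```python
# Sketch only: table and helper names are hypothetical, not Innopac internals.
# The code points 0x69253A and U+30BA are the ones quoted on the slide.

def display(eacc_codes, table):
    """Render EACC codes; an unmapped code is shown as {hex}."""
    return "".join(table.get(c, "{%06x}" % c) for c in eacc_codes)

def search_key(eacc_codes, table):
    """Build a search key; an unmapped code is silently dropped."""
    return "".join(table[c] for c in eacc_codes if c in table)

broken_table = {}                    # no entry for Katakana "zu"
fixed_table = {0x69253A: "\u30BA"}   # proposed mapping: EACC 69253A -> U+30BA

name = [0x69253A]
print(display(name, broken_table))           # placeholder instead of the character
print(repr(search_key(name, broken_table)))  # the character vanishes from the key
print(display(name, fixed_table))            # displayed once the entry exists
```

With the entry missing, display falls back to a placeholder and the search key loses the character, which matches the "suzuki" becoming "suki" symptom. Adding the single table entry removes both symptoms in this sketch.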
Case2: Shift-JIS to EACC Issue When I search for this hanji on the Shift_JIS OPAC, Innopac returns only 9 records.
Case2: Shift-JIS to EACC Issue
SJIS: 97E9 | EACC: 214930
According to OCLC CJK 3.11, the EACC character “215D58” is not assigned any glyph, but the mapping from Shift_JIS to EACC works fine.
Case2: Shift-JIS to EACC Issue On the other hand, when I searched for this hanji on the Unicode OPAC, Innopac returned more than 2,000 records!
Case2: Shift-JIS to EACC Issue
SJIS: 97E9 | EACC: 214930
UNICODE: 6FDB | EACC: 455564
No relationship between EACC 214930 and EACC 455564.
These Shift_JIS and Unicode characters have the same glyph, but Innopac stores them at two different EACC code positions. Therefore we can NOT search for both characters at once.
Case2: Shift-JIS to EACC Issue
SJIS: 97E9 | EACC: 214930
UNICODE: 6FDB | EACC: 455564
One of the solutions: change the mapping of this Shift_JIS character from 214930 to 455564.
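The proposed remapping can be sketched as follows. The code values are copied from the slide as given; the dictionary names and the `shares_index_key` helper are hypothetical and only illustrate the idea that both input paths must resolve to the same EACC code before one search can find both.

```python
# Sketch only: names are hypothetical; code values are taken from the slide.

def shares_index_key(sjis_code, unicode_code, sjis_table, unicode_table):
    """True when the Shift_JIS path and the Unicode path reach the same EACC code."""
    return sjis_table[sjis_code] == unicode_table[unicode_code]

current_sjis_to_eacc = {0x97E9: 0x214930}
unicode_to_eacc = {0x6FDB: 0x455564}

# Proposed change: point the Shift_JIS character at 455564 instead of 214930.
remapped_sjis_to_eacc = {**current_sjis_to_eacc, 0x97E9: 0x455564}

before = shares_index_key(0x97E9, 0x6FDB, current_sjis_to_eacc, unicode_to_eacc)
after = shares_index_key(0x97E9, 0x6FDB, remapped_sjis_to_eacc, unicode_to_eacc)
print(before, after)
```

Records already stored under 214930 would presumably need converting as well; the slides do not say how that would be done.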
Case3: EACC Layers Related Issue Shift_JIS Telnet Screen Sample (my record). The data is displayed correctly.
Case3: EACC Layers Related Issue
SJIS: 97E9 | EACC: 215D58
In the Shift_JIS environment, there is no trouble searching for or displaying this character.
Case3: EACC Layers Related Issue We can see the same data properly on Millennium. The {69253a} is a separate problem, already mentioned in Case 1.
Case3: EACC Layers Related Issue Reviewing the same data AFTER editing an element (NOTE) on Millennium. EACC character codes are displayed directly in one of the name fields and in the address.
Case3: EACC Layers Related Issue We can see the data correctly on Millennium even after editing.
Case3: EACC Layers Related Issue
SJIS: 97E9 | EACC: 215D58
UNICODE: 9234 | EACC: 4B5D58
Relationship: same code position on other layers
Case3: EACC Layers Related Issue
SJIS: 97E9 | EACC: 215D58
UNICODE: 9234 | {4B5D58} | EACC: 4B5D58 (no character assigned)
Relationship: same code position on other layers
If records including this character are saved on Millennium, this hanji is NOT stored as the original EACC code (215D58).
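The "same code position on other layers" relationship between 215D58 and 4B5D58 can be checked with a few lines of arithmetic. This sketch assumes that the leading byte of the 3-byte EACC code selects the layer and the remaining two bytes give the position; that reading is inferred from the wording of the slide, and the function names are hypothetical.

```python
# Sketch only: assumes the first byte selects the layer and the last two bytes
# give the position within it. Function names are hypothetical.

def split_eacc(code):
    """Split a 3-byte EACC code into (leading byte, 2-byte position)."""
    return code >> 16, code & 0xFFFF

def same_position_other_layer(a, b):
    """True when two codes differ only in the leading byte."""
    lead_a, pos_a = split_eacc(a)
    lead_b, pos_b = split_eacc(b)
    return pos_a == pos_b and lead_a != lead_b

original = 0x215D58   # code stored in the Shift_JIS environment
saved = 0x4B5D58      # code seen after saving on Millennium

print([hex(part) for part in split_eacc(original)])
print([hex(part) for part in split_eacc(saved)])
print(same_position_other_layer(original, saved))
```

A check like this could be used to scan records for codes whose position exists on the first layer but whose leading byte points at a layer with no character assigned.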
Case4: Duplication codes in EACC A search for “matsu” on the Shift_JIS OPAC returns more than 1,000 records.
Case4: Duplication codes in EACC The same search on the Unicode OPAC returns ONLY one record. (The screen below shows the direct-hit result.)
Case4: Duplication codes in EACC
SJIS: 8FBC | EACC: 21442D
UNICODE: 677E | EACC: 276163
We can DISPLAY both 21442D and 276163 in the Unicode OPAC, but only 276163 is searchable. Because of this EACC code duplication, the search results are NOT the same between the Shift_JIS OPAC and the Unicode OPAC.
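One way to picture the duplication problem is an index keyed on the raw EACC code versus an index keyed on the character both codes display as. The sketch below is an assumption-laden illustration: the two records, the `build_index` helper and the idea of folding at index time are hypothetical; only the codes 21442D, 276163 and 677E come from the slide.

```python
# Sketch only: records and helper are hypothetical; codes are from the slide.

EACC_TO_UNICODE = {0x21442D: "\u677E", 0x276163: "\u677E"}  # both display as "matsu"

def build_index(records, key_of):
    """Map each index key to the set of record ids that contain it."""
    index = {}
    for rec_id, codes in records.items():
        for code in codes:
            index.setdefault(key_of(code), set()).add(rec_id)
    return index

records = {"rec-A": [0x21442D], "rec-B": [0x276163]}  # hypothetical records

raw_index = build_index(records, key_of=lambda code: code)
folded_index = build_index(records, key_of=EACC_TO_UNICODE.get)

print(sorted(raw_index[0x276163]))       # only the record stored as 276163
print(sorted(folded_index["\u677E"]))    # both records once duplicates are folded
```

Keying the index on the raw code reproduces the one-record result; keying it on a single representative per character finds both records in this sketch.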
Case5: Not Unified characters in UNICODE Do you think these two characters are the same or not? UNICODE: 5618 UNICODE: 5653
Case5: Not Unified characters in UNICODE The result of searching “uso” on Shift_JIS OPAC.
Case5: Not Unified characters in UNICODE The same search on the Unicode OPAC. The result does not seem correct.
Case5: Not Unified Characters in UNICODE If the other “uso” is entered by picking it from the code table, the result is the same as on the Shift_JIS OPAC.
Case5: Not Unified Characters in UNICODE
SJIS: 8952 | EACC: 21373B | UNICODE: 5653
UNICODE: 5618: NOT HIT!
Case5: Not Unified Characters in UNICODE
SJIS: 8952 | EACC: 21373B | UNICODE: 5653
UNICODE: 5618
In searching, this 5618 should be normalized to 5653.
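The normalization asked for in Case 5 is not something the standard Unicode normalization forms provide, because 5618 and 5653 are separate ideographs with no decomposition. A minimal sketch of a custom search-time fold, assuming a hand-maintained variant table (the `VARIANT_FOLD` table and `fold_variants` function are hypothetical names), could look like this:

```python
# Sketch only: the variant table and function name are hypothetical.
import unicodedata

# Fold U+5618 onto U+5653, as the slide proposes for searching.
VARIANT_FOLD = {0x5618: 0x5653}

def fold_variants(text):
    """Replace known variant characters with their chosen representative."""
    return text.translate(VARIANT_FOLD)

# Standard compatibility normalization leaves U+5618 untouched.
nfkc_unchanged = unicodedata.normalize("NFKC", "\u5618") == "\u5618"

# After folding, both forms of "uso" produce the same search key.
keys_match = fold_variants("\u5618") == fold_variants("\u5653")

print(nfkc_unchanged, keys_match)
```

The same fold has to be applied both to the indexed data and to the user's query, otherwise the two sides still disagree.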
Normalization issue This search is for “Harry Potter” written in Katakana. Some special characters are ignored when searching on the Unicode OPAC. In this sample, “Cho-on”, the Japanese prolonged sound symbol, does not work.
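Whether Innopac drops the Cho-on in this way is not known; the following is only a sketch of how a search-key filter can lose it by accident. In the Unicode character database the Cho-on (U+30FC) has general category Lm (modifier letter), while ordinary Katakana are Lo, so a filter that keeps only Lo characters discards it. The function names are hypothetical.

```python
# Sketch only: illustrates one plausible cause, not Innopac's actual behavior.
import unicodedata

# "Harry Potter" in Katakana, including two cho-on marks and a middle dot.
HARRY_POTTER = "\u30CF\u30EA\u30FC\u30FB\u30DD\u30C3\u30BF\u30FC"
CHO_ON = "\u30FC"

def naive_key(text):
    """Keep only category Lo characters; this loses the cho-on."""
    return "".join(ch for ch in text if unicodedata.category(ch) == "Lo")

def better_key(text):
    """Keep every letter category (L*), so the cho-on (Lm) survives."""
    return "".join(ch for ch in text if unicodedata.category(ch).startswith("L"))

print(unicodedata.category(CHO_ON))
print(CHO_ON in naive_key(HARRY_POTTER))
print(CHO_ON in better_key(HARRY_POTTER))
```

Both filters drop the middle dot, which is punctuation; only the second keeps the prolonged sound mark, so a query with and without the mark stays distinguishable.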
Example of NOT unified characters (Case5) Unicode: 6236, 6237, 6238
Related Documents & Information
• The Library of Congress Homepage: MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media -- CHARACTER SETS: Part 3 -- Code Table 9: EAST ASIAN (June 16, 2003) http://www.loc.gov/marc/specifications/specchareacc.html
• The Unicode Standard, Version 3.0. The Unicode Consortium. ISBN 0201616335 (Version 4.0 has now been released)
• OCLC CJK and its contents in HELP http://www.oclc.org/cjk/
Unicode OPAC in Japan
• University of Tokyo: Multilingual OPAC, the University of Tokyo http://mulopac.dl.itc.u-tokyo.ac.jp/
• National Diet Library: NDL Asian Language Materials OPAC http://asiaopac.ndl.go.jp/index_e.html
The Best Solution Unicode + normalization scheme Thank you!!