Outreaching with Print Resources in the Digital Age
Jidong Yang, University of Michigan
Problems with CJK encodings
• The slow expansion of Chinese encodings: from GB 2312 (about 6,700 characters) and Big5 (13,000 characters) to GBK (22,000 characters), GB 18030 (27,000 characters), GB 18030-2005 (more than 70,000 characters), and Unicode Version 5 (similar to GB 18030-2005).
• Not all computers support all of these characters. Many existing databases are built on earlier encodings.
• Mainstream Japanese encodings, JIS and EUC, each have fewer than 7,000 kanji.
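To make the coverage gaps concrete, here is a minimal Python sketch that checks which encodings can represent a given string. It assumes only the codecs bundled with Python's standard library; the helper names `encodable` and `coverage` are illustrative and do not come from the talk.

```python
# Codec names as registered in Python's standard library.
CODECS = ["gb2312", "big5", "gbk", "gb18030", "euc_jp", "utf-8"]


def encodable(text: str, codec: str) -> bool:
    """Return True if every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def coverage(text: str) -> dict:
    """Map each codec name to whether it can represent `text`."""
    return {codec: encodable(text, codec) for codec in CODECS}


if __name__ == "__main__":
    # 傳 is the traditional form used in the title Gaoseng zhuan 高僧傳.
    for codec, ok in coverage("傳").items():
        print(f"{codec:8} {'yes' if ok else 'NO'}")
```

For the traditional character 傳, this reports that GB 2312 cannot represent it, while GBK and GB 18030 can. A database built on the earlier encoding therefore has to substitute or drop such characters.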
The issue of OCR accuracy
• Even with contemporary Chinese and Japanese publications in good condition, the best OCR software can hardly achieve an accuracy rate better than 95%.
• With pre-modern CJK texts, OCR accuracy drops to 30-40% or even lower.
• Many database companies keep their OCR accuracy rates secret.
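The slides do not say how the accuracy rate is measured. One common convention, assumed here, is character-level accuracy: one minus the edit distance between the OCR output and a hand-checked reference, divided by the length of the reference. The function names below are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Character accuracy of `ocr_output` against a trusted `reference`."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # One wrong character out of four: 什 misread as 仕.
    print(char_accuracy("鳩摩羅什", "鳩摩羅仕"))
```

Under this definition, a 95% rate still leaves about 25 wrong characters on a 500-character page (0.05 × 500 = 25), and each of those characters becomes unfindable by exact-match search.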
The early stage of digital scholarship
• New research methods and tools suited to digital resources are still rare and need to be invented.
• A great number of research tools in print format still retain their value, at least for now.
Databases vs. print indexes
• How to find information about Kumārajīva in the Gaoseng zhuan 高僧傳?
• Search by Jiumoluoshi 鳩摩羅什?
• Not enough! Try: Jiumoluoqipo 鳩摩羅耆婆, Shi 什, Shigong 什公, Shishi 什師, Tongshou 童壽, and Luoshi 羅什. All of these can be found in Ryō kōsō den sakuin 梁高僧傳索引, compiled by Makita Tairyō 牧田諦亮.
• Databases are not necessarily better than print indexes.
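The Kumārajīva example can be sketched as a multi-variant search. The variant list is the one given above; the function `find_variants` and the sample string are illustrative assumptions. The sample is an invented placeholder, not a quotation from the Gaoseng zhuan.

```python
# Name variants for Kumārajīva, as listed on the slide.
VARIANTS = ["鳩摩羅什", "鳩摩羅耆婆", "什", "什公", "什師", "童壽", "羅什"]


def find_variants(text: str, variants: list) -> dict:
    """Return {variant: [start offsets]} for each variant found in `text`."""
    hits = {}
    for variant in variants:
        positions = []
        start = text.find(variant)
        while start != -1:
            positions.append(start)
            start = text.find(variant, start + 1)
        if positions:
            hits[variant] = positions
    return hits


if __name__ == "__main__":
    # Invented placeholder text, NOT a quotation from the Gaoseng zhuan.
    sample = "什公至長安。童壽譯經。"
    print(find_variants(sample, ["鳩摩羅什"]))  # full name only
    print(find_variants(sample, VARIANTS))      # all listed variants
```

On this sample, searching for the full name alone finds nothing, while the variant list finds the passages. The single character 什 will also match wherever it appears in unrelated words, so a researcher still has to sift the hits by hand. A compiled index such as Makita's has already done that sifting.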
Conclusion
• The computer still cannot match the book in its capacity to present the full range of East Asian languages and cultures.
• Print resources are still necessary for most serious research on East Asia.
• It's our job to make the value of our print collections known to our patrons.