Learn about the digitization efforts of National Library "Ivan Vazov" in Plovdiv and how OCR technology is improving access to online sources. Discover the library's traditions in library automation and the digital display of holdings, as well as the advancements made in the field. Explore the Digital Library of Plovdiv and the specificities of OCR and PDF file processing. Find out how OCR accuracy is being improved and the challenges faced in digitizing texts before the Orthographic Reform of 1945. Get insights into the CLaDa-BG project and the future of language and cultural heritage access.
Mobile Digitizing Conference. Improving Access to Online Sources with the Aid of OCR Technology: The Experience of National Library "Ivan Vazov" - Plovdiv. Veliko Tarnovo, 2019
Traditions in the field of library automation and the digital display of holdings
• 1979 – First computer system IZOT-0310 and information search devices UPDML 9002-02 and IPU IZOT-0320.
• 1980 – Subscription to the AGRIS (FAO) database, stored on magnetic tapes.
• 1994 – Local library information network established.
• 2008 – Founding of the Digitization Centre.
Digital collections, 2010–2016
*NALIS Repositorium: 95 Slavonic manuscripts, XII–XVIII century AD;
*Europeana Photography (European ancient photographic vintage repositories of digitized pictures of historical quality): 15 000 images provided by Bulgarian archives, libraries, museums and private collections;
*EMBARK (Enhance Manuscriptorium through Balkan Recovered Knowledge): 4 Slavonic manuscripts, XVI–XVII century AD.
Project “Digital Cultural and Historical Heritage Project of Plovdiv Municipality”, 2016–2017
*Financed by the BG08 Program “Cultural heritage and contemporary arts”, a financial mechanism of the European Economic Area
*Total value: 586,779 EUR
*Beneficiary: Plovdiv Municipality
*Goal: digitization of 50 000 items from the collections of the library and:
• Regional History Museum;
• Regional Ethnographic Museum;
• Regional Archaeological Museum;
• Ancient Plovdiv Municipal Institute;
• City Art Gallery;
• Roma community.
DIGITAL LIBRARY (DIGITAL.PLOVDIV.BG) 2017 – PRESENT
Yoan LEVIEV, Anna GREBENAROVA, Design for ceramic piece, 1966; Dimitar KIROV, Smalt mosaic, design proposition
2019 – Adoption of PDF file format
The main activities within the scope of the new functionality are:
• Development of the software platform to upload and display PDF files, with the ability to search the indexed file contents (if OCR has been performed).
• Purchase of the ABBYY FineReader 14 software product for optical character recognition (OCR) and processing of PDF files.
• One-time migration service for all existing collections in the Digital Library, replacing the existing images in the platform with corresponding PDFs.
Developments of the Digital Library necessitated by the introduction of content search
• Changes in the user interface:
• New search fields and search instructions.
• Significant changes in the collection “Periodical publications”, to display multiple search results from many issues.
• New PDF gallery tool with functions:
• Search by keyword, results navigation.
• Page navigation, thumbnails, zoom.
• Full-screen view.
• Page rotation, hand tool.
Specifics of OCR and PDF file processing
• OCR is performed on the master files (high-quality scans of the original paper source), so that the degree of recognition approaches its maximum.
• A single PDF file for each unit of cultural heritage, with lowered image quality and a size suitable for online display (image with hidden text, no MRC).
• In our tests, ABBYY FineReader 14 was the best at recognizing Cyrillic text.
OCR accuracy is lower for texts from before the Orthographic Reform of 1945 because of:
• Old language, obsolete words and letter symbols.
• Physical degradation of the original source.
• Restoration practices and in-library binding in proximity to the text.
Improving OCR accuracy
• Manual cleaning of OCR errors is not feasible (unless users are involved, as at the National Library of Australia).
• Using in-program tools:
• Training the program on patterns: useful for non-standard fonts and especially important for Cyrillic texts, where training to recognize specific letter symbols is essential.
• Creating a new language set which includes the old letter symbols such as Ѣ, Ѫ, Ѭ, Ѧ, Ѩ, etc.
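As an illustration of how the old letter symbols listed above can be handled in software, the following minimal Python sketch counts them in a piece of recognized text and flags whether the text appears to use pre-reform orthography. The function names and the idea of routing such texts to the custom language set are assumptions made for this example; they are not part of ABBYY FineReader or of the library's workflow.

```python
from collections import Counter

# The five old letter symbols named on the slide (the slide's "etc." means
# the real list is longer; this set is deliberately incomplete).
OLD_LETTERS_UPPER = "ѢѪѬѦѨ"
OLD_LETTERS = frozenset(OLD_LETTERS_UPPER + OLD_LETTERS_UPPER.lower())


def old_letter_counts(text: str) -> Counter:
    """Count occurrences of each old letter symbol in the text."""
    return Counter(ch for ch in text if ch in OLD_LETTERS)


def needs_old_language_set(text: str) -> bool:
    """Hypothetical routing rule: True if any old letter symbol occurs."""
    return any(ch in OLD_LETTERS for ch in text)


if __name__ == "__main__":
    sample = "мѫжъ и бѣлъ"
    print(old_letter_counts(sample))
    print(needs_old_language_set(sample))
```

A check of this kind only works once the OCR engine can already output the old symbols, which is why the custom language set comes first.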
Towards a standard for how OCR of texts from before the Orthographic Reform of 1945 is performed
• It is tempting to replace the old letter symbols with their modern equivalents in order to aid search. However, this leads to loss of information.
• Spelling conversion is not fixed: the letter “Ѣ” may be represented by modern “Е” or “Я”, and “Ѫ” may be represented by modern “Ъ” or “А”.
• CLaDa-BG project:
• A national technological infrastructure of language, cultural and historic heritage.
• Will involve the development of tools to aid access to language resources.
• National Library “Ivan Vazov” is participating in the project in the areas of OCR and language tools, as well as in the automation of data submission to online libraries such as Europeana.
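One way to aid search without losing information is to keep the recognized text in its original orthography and generate modern spelling candidates only as additional search keys. The Python sketch below illustrates this under stated assumptions: the mapping contains only the two ambiguous letters given as examples above, and the names `VARIANTS`, `modern_variants` and `index_entry` are hypothetical, not part of the Digital Library platform or of CLaDa-BG.

```python
from itertools import product

# Assumed mapping, limited to the two examples on the slide.
# A real conversion table would cover more letters and context rules.
VARIANTS = {
    "Ѣ": ("Е", "Я"),
    "ѣ": ("е", "я"),
    "Ѫ": ("Ъ", "А"),
    "ѫ": ("ъ", "а"),
}


def modern_variants(word: str) -> list[str]:
    """Return every modern spelling candidate for a pre-reform word.

    Letters without an entry in VARIANTS are kept unchanged.
    """
    choices = [VARIANTS.get(ch, (ch,)) for ch in word]
    return ["".join(combo) for combo in product(*choices)]


def index_entry(word: str) -> dict:
    """Keep the original spelling and attach the candidates as search keys."""
    return {"original": word, "search_keys": modern_variants(word)}


if __name__ == "__main__":
    print(modern_variants("бѣлъ"))  # ['белъ', 'бялъ']
    print(index_entry("мѫжъ"))
```

The number of candidates doubles with each ambiguous letter in a word, so a production tool would likely need dictionary or context information to prune them.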
Conclusion
It is important to share digitization experience with partners working in the same field, with the aim of forming a comprehensive strategy and collaborative solutions. Involvement in the CLaDa-BG project and the development of tools to automate the submission of data to Europeana are also a top priority. A stronger presence in Europeana is important, but the large amount of manual work currently limits our ability to contribute in a meaningful and visible way.
Ivan Kratchanov, Head of Digitization Centre, National Library “Ivan Vazov” – Plovdiv, e-mail: digitization@libplovdiv.com
Thank you for your attention!