DML-CZ: Scanning and adjusting the images

DML-CZ: Scanning and adjusting the images • Martin Lhoták, Academy of Sciences Library • Launching the DML-CZ, 11.5.2008, Prague

DML-CZ Workflow • Preparation • Scanning and adjusting the images • OCR • Metadata harvesting (MR, ZBL) • Integration • Digital Library

Content • Digitization Centre of the AS Library • Scanning • Adjusting the images • Basic metadata • OCR • Backup and movement of the data • Production till now

Digitization Centre of the AS Library • In operation since 1.1.2004 • Built with support from the EU Solidarity Fund after the floods in Czechia in 2002 • Main aim: to build a digital library of scientific publications published by the Academy of Sciences of the Czech Republic (Digital Library of ASCR) • Partner of the DML-CZ project since 2005

The Academy of Sciences of the Czech Republic • > 50 scientific institutes • 7500 employees (4000 in R&D) • > 11 000 articles, reports, etc. a year • publishes > 90 journals (circa 3000 articles) • > 100 years of history

Digitization Centre of the AS Library • 2 x A2 bw scanners Zeutschel OS 7000 • 1 x A1 color scanner Digibook 10000 • 1 x A4 fast production scanner Panasonic • Staff – 8 to 10 people • Monthly production 40.000 - 50.000 pages • Overall production > 2.000.000 pages

DML-CZ: Scanning • 2 x A2 bw scanners Zeutschel OS 7000 • 600 DPI • 4 bit greyscale • 1 page = 1 file • usually A5 • TIFF with lossless LZW compression circa 10 MB
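
As a rough plausibility check on the scanning parameters above, the following sketch computes the pixel dimensions and the uncompressed size of one page at 600 DPI. The A5 dimensions (148 x 210 mm) come from ISO 216. Treating the scanned area as exactly one A5 page and adding an 8-bit storage variant are illustrative assumptions, not statements from the slides.

```python
MM_PER_INCH = 25.4

def page_pixels(width_mm, height_mm, dpi):
    """Return (width_px, height_px) for a page scanned at the given DPI."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))

def raw_size_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed image size in bytes, ignoring TIFF headers and row padding."""
    return width_px * height_px * bits_per_pixel // 8

# ISO 216 A5 page; assumption: the scanned area is exactly one A5 page.
w, h = page_pixels(148, 210, 600)
size_4bit = raw_size_bytes(w, h, 4)   # 4-bit greyscale, as listed on the slide
size_8bit = raw_size_bytes(w, h, 8)   # assumption: samples stored as whole bytes
print(w, h, size_4bit, size_8bit)
```

Under these assumptions a page is about 17 megapixels, which is roughly 8.7 MB raw at 4 bits per pixel or about 17 MB at 8 bits per pixel. The actual file size depends on the scanned area, the stored bit depth, and how well LZW compresses a given page.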

Image Adjusting • Software Book Restorer from i2S • Designed to process scanned books • Geometrical correction • Crop • Blur • Binarization • Despeckle
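
To illustrate what two of the listed operations do, here is a minimal sketch of binarization followed by despeckling on a tiny greyscale grid. It is not Book Restorer's algorithm, which the slides do not describe; the fixed threshold and the "no ink among the 8 neighbours" rule are simple stand-ins chosen for illustration.

```python
def binarize(gray, threshold):
    """Map greyscale values to 1 (ink) below the threshold, else 0 (paper)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def despeckle(binary):
    """Clear ink pixels that have no ink among their 8 neighbours."""
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1:
                continue
            neighbours = sum(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                out[y][x] = 0  # isolated speck: treat as noise
    return out

# 4-bit greyscale sample: 0 = black, 15 = white.
page = [
    [15, 15, 15, 15, 15],
    [15,  2, 15, 15, 15],   # isolated dark pixel (a speck)
    [15, 15, 15,  1,  3],   # 2 x 2 dark block (part of a glyph)
    [15, 15, 15,  2,  1],
]
clean = despeckle(binarize(page, threshold=8))
for row in clean:
    print(row)
```

The isolated dark pixel is removed while the connected 2 x 2 block survives. Production tools typically use adaptive thresholds and size-based noise filters rather than these fixed rules.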

Basic Metadata • XML (DTD of the Czech National Library) • Basic bibliographic data of the title • Physical size of the journal • Numbers of pages • Software Sirius (CZ)

OCR • FineReader 8.1 • 2 runs: 1. to recognize the language of each paragraph; 2. to do OCR with the right language • OCR workflow developed by the team of Dr. P. Sojka • Output – double-layer PDF: 1. layer with the scanned picture; 2. layer with the „OCRed“ text
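
The two-run idea can be sketched as follows. This is not the DML-CZ workflow or the FineReader API: the `ocr(image, lang)` callable, the draft run with a default language, and the stopword-based language guess are all hypothetical stand-ins, since the slides do not say how the language is recognized in the first run.

```python
# Tiny illustrative stopword lists (assumption); a real system would use a
# proper language identifier.
STOPWORDS = {
    "eng": {"the", "and", "of", "is", "in"},
    "ces": {"je", "se", "na", "že", "pro"},
    "deu": {"der", "die", "und", "ist", "das"},
}

def guess_language(text, default="eng"):
    """Pick the language whose stopwords occur most often in the text."""
    words = text.lower().split()
    hits = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else default

def two_pass_ocr(paragraph_images, ocr):
    """Run OCR twice per paragraph; ocr(image, lang) -> text is caller-supplied."""
    results = []
    for image in paragraph_images:
        draft = ocr(image, "eng")          # run 1: rough text, used only to guess the language
        lang = guess_language(draft)
        results.append((lang, ocr(image, lang)))  # run 2: OCR with the guessed language
    return results

def fake_ocr(image, lang):
    # Stand-in for a real OCR engine: here the "image" is already a string.
    return image

paragraphs = [
    "the proof of the theorem is in the appendix",
    "důkaz věty je uveden na konci",
]
result = two_pass_ocr(paragraphs, fake_ocr)
print([lang for lang, _ in result])
```

Keeping the language decision per paragraph matters for journals that mix languages within one article, for example an abstract in a different language from the body text.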

Backup and movement of the data • Main steps and outputs: 1. scanning – TIFF; 2. image adjustment and basic metadata – TIFF, XML; 3. OCR – PDF • After each step above: one copy to the server in Brno, two copies on LTO tapes
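
The "several copies after each step" rule can be sketched as a copy routine that verifies every copy by checksum. The slides do not say whether checksums were used, and the destinations here are plain directories standing in for the Brno server and the LTO tapes; both are assumptions for illustration.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def replicate(source, destinations):
    """Copy source into each destination directory and verify each copy."""
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for dest_dir in destinations:
        dest_dir = Path(dest_dir)
        dest_dir.mkdir(parents=True, exist_ok=True)
        target = dest_dir / source.name
        shutil.copy2(source, target)       # copies content and file metadata
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies
```

A call such as `replicate("page_0001.tif", ["server_copy", "tape_a", "tape_b"])` (hypothetical file and directory names) would produce the one-plus-two copies listed above and fail loudly if any copy differs from the original.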

Production for DML-CZ till now • Scanning: 97 268 pages • Image adjustment: 123 961 pages • Basic metadata: 96 009 pages • OCR: 126 278 pages • Disproportion between the counts: some data was obtained from GDZ Goettingen

Alternative output of the Academy of Sciences mathematics content • http://kramerius.lib.cas.cz

Thank you! Questions? Martin Lhoták lhotak@knav.cz www.knav.cz
