OCR at INIS INIS Training Seminar 7-11 October 2013, Vienna, Austria Branko Krznarić INIS Unit (based on the presentation by Yves Reynaud)
Outline • What is OCR? • OCR Objectives • Principles • Techniques • Software
What is OCR? (source: pcmag.com)
Optical Character Recognition (OCR) • OCR is the “conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text.” [1] • Make digitized images of printed documents searchable. • Font encoding issues.
OCR Objectives We can “find the needle in the haystack” • OCR enables basic search within an unstructured document. • OCR adds extra value to your image. • OCR brings your digitized collection to life.
OCR Techniques • Pre-processing • De-skew • Despeckle • Binarization (optional) • Line removal • Layout analysis (zoning) • Post-processing (dictionary)
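Two of the pre-processing steps listed above, binarization and despeckling, can be illustrated with a minimal Python sketch. It assumes a grayscale image stored as rows of 0-255 integers, a fixed threshold of 128, and a "no inked neighbour" rule for speckles. These are simple stand-ins chosen for illustration, not the methods used by any particular OCR package, and the sample patch is hypothetical.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 1 = ink, 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def despeckle(binary):
    """Clear ink pixels that have no ink among their 8 neighbours."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if not binary[y][x]:
                continue
            has_ink_neighbour = any(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_ink_neighbour:
                cleaned[y][x] = 0
    return cleaned


# Hypothetical 4x5 grayscale patch: a vertical stroke plus one stray dark pixel.
patch = [
    [250, 30, 240, 245, 250],
    [248, 25, 242, 250, 40],
    [251, 28, 239, 247, 249],
    [249, 31, 244, 246, 252],
]
clean = despeckle(binarize(patch))
```

After both steps only the vertical stroke in the second column remains; the isolated dark pixel at the right edge is treated as noise and cleared.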
Scanned vs. Vector Image
“Do not look at the trees (letters), try to see the forest (sentences)” F0R 488UR1N6 7H3 L0N63V17Y 0F 1NF0RM4710N, P3RH4P8 7H3 M087 1MP0R74N7 R0L3 1N 7H3 0P3R4710N 0F 4 D16174L 4RCH1V3 18 M4N461N6 7H3 1D3N717Y, 1N736R17Y 4ND QU4L17Y 0F 7H3 4RCH1V38 1783LF 48 4 7RU873D 80URC3 0F 7H3 CUL7UR4L R3C0RD.
Verdana Font FOR ASSURING THE LONGEVITY OF INFORMATION, PERHAPS THE MOST IMPORTANT ROLE IN THE OPERATION OF A DIGITAL ARCHIVE IS MANAGING THE IDENTITY, INTEGRITY AND QUALITY OF THE ARCHIVES ITSELF AS A TRUSTED SOURCE OF THE CULTURAL RECORD.
Brush Script MT (Windows Font) FOR ASSURING THE LONGEVITY OF INFORMATION, PERHAPS THE MOST IMPORTANT ROLE IN THE OPERATION OF A DIGITAL ARCHIVE IS MANAGING THE IDENTITY, INTEGRITY AND QUALITY OF THE ARCHIVES ITSELF AS A TRUSTED SOURCE OF THE CULTURAL RECORD.
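Comparing the digit-for-letter sentence with its plain version shows that only seven substitutions are involved (0 for O, 1 for I, 3 for E, 4 for A, 6 for G, 7 for T, 8 for S). A short Python sketch, with a function name invented here for illustration, shows how a machine could undo them with a fixed table, and why that is weaker than human reading.

```python
# Digit-to-letter look-alikes read off by comparing the two versions of the sentence.
LOOKALIKES = str.maketrans("0134678", "OIEAGTS")


def undo_lookalikes(text):
    """Blindly replace each look-alike digit with its letter."""
    return text.translate(LOOKALIKES)


garbled = "F0R 488UR1N6 7H3 L0N63V17Y 0F 1NF0RM4710N"
print(undo_lookalikes(garbled))  # prints: FOR ASSURING THE LONGEVITY OF INFORMATION
```

The table works here only because the sentence contains no genuine digits. Applied to real text it would also rewrite page numbers and dates, which is where context, the "forest", is needed.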
PCs ≠ Humans • OCR compares patterns and selects the closest match. It can be forced to a specific context, but requires customization. • People adapt to circumstances and can circumvent misspellings if context is clear.
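The "compare patterns and select the closest match" idea can be sketched in Python as nearest-template matching. The 3x3 glyph templates, the Hamming-distance measure, and the `allowed` parameter are all assumptions made for this illustration; real OCR engines use far richer features and models.

```python
# Hypothetical 3x3 glyph templates (1 = ink, 0 = background).
TEMPLATES = {
    "I": ((0, 1, 0), (0, 1, 0), (0, 1, 0)),
    "L": ((1, 0, 0), (1, 0, 0), (1, 1, 1)),
    "T": ((1, 1, 1), (0, 1, 0), (0, 1, 0)),
}


def hamming(a, b):
    """Count the pixels in which two equally sized glyphs differ."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def closest_match(glyph, allowed=None):
    """Return the template character nearest to glyph, optionally limited to `allowed`."""
    candidates = {c: t for c, t in TEMPLATES.items() if allowed is None or c in allowed}
    return min(candidates, key=lambda c: hamming(glyph, candidates[c]))


# A "T" whose bottom pixel was lost in scanning.
noisy_t = ((1, 1, 1), (0, 1, 0), (0, 0, 0))
print(closest_match(noisy_t))                      # prints: T
print(closest_match(noisy_t, allowed={"I", "L"}))  # prints: I
```

The second call mimics forcing the engine into a specific context: with "T" excluded from the allowed set, the same pixels are read as the next-closest pattern.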
True or false Usually, printed text is adequately sampled if each line is at least two pixels in thickness:
Zoom in
Zoom in
Results from OCR It is in this context that I… … and an additional protocol on the basis…
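Whatever the answer to the true-or-false question, the link between scan resolution and stroke thickness is plain arithmetic: pixels = thickness in inches × dpi. The Python sketch below assumes a hypothetical stroke thickness of 0.2 mm, a value chosen only for illustration and not taken from the slides.

```python
MM_PER_INCH = 25.4


def stroke_pixels(stroke_mm, dpi):
    """Thickness of a printed stroke in pixels at a given scan resolution."""
    return stroke_mm / MM_PER_INCH * dpi


def min_dpi(stroke_mm, pixels_needed=2):
    """Lowest resolution at which the stroke spans `pixels_needed` pixels."""
    return pixels_needed * MM_PER_INCH / stroke_mm


# Hypothetical 0.2 mm stroke at three common scan resolutions.
for dpi in (150, 200, 300):
    print(dpi, "dpi:", round(stroke_pixels(0.2, dpi), 2), "px")
```

Under that assumption the stroke reaches two pixels at roughly 254 dpi, so it would fall short of the two-pixel rule at 150 or 200 dpi and meet it at 300 dpi.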
Chinese Raster Image (scanned)
Chinese Vector Image (OCR) 滤器
Arabic Raster Image (scanned)
Arabic Vector Image (OCR) هذ ا وشملت
Japanese Raster Image (scanned)
Japanese Vector Image (OCR)
Font Encoding
Font Encoding (cont.)
OCR Software • ABBYY FineReader (multilingual OCR) • Adobe Acrobat • InftyReader
ABBYY FineReader (interface)
InftyReader - an OCR System for Math Documents (12) where a . The indices now range from 1 to 5. The bosonic fields obey the commutation rules (13)
Reference [1] “Optical character recognition”, Wikipedia, http://en.wikipedia.org/wiki/Optical_character_recognition. Retrieved 2013-09-23.
Thank you!