SSIP 2002, Budapest. Omnipage Pro. Internal Structure of the Character Recognition Engine used inside. Dr. István Marosi Recosoft Ltd., Hungary. Some „Marketing talk”. Main tasks of an OCR system: Image acquisition Layout recognition Text recognition User assisted correction
SSIP 2002, Budapest Omnipage Pro Internal Structure of the Character Recognition Engine used inside Dr. István Marosi Recosoft Ltd., Hungary
Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Layout recognition • Text recognition • User assisted correction • Result exportation Recosoft Ltd
Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Get image • B/W Scanning • Gray Scanning • Color Scanning • Load from image file • Preprocess image • Layout recognition • Text recognition • User assisted correction • Result exportation
Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Get image • Preprocess image • Color separation • Thresholding • Despeckling • Rotation • Deskewing • Layout recognition • Text recognition • User assisted correction • Result exportation
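Two of the preprocessing steps listed above, thresholding and despeckling, can be illustrated with a minimal sketch. It assumes a grayscale image stored as rows of 0-255 integers, a fixed global cutoff, and a despeckle rule that clears any black pixel with no black 8-neighbour. The function names and the cutoff value are hypothetical and are not taken from OmniPage.

```python
def threshold(gray, cutoff=128):
    """Binarize a grayscale image: 1 = black (ink), 0 = white (paper)."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


def despeckle(bw):
    """Clear isolated black pixels, i.e. those with no black 8-neighbour."""
    h, w = len(bw), len(bw[0])
    out = [row[:] for row in bw]
    for y in range(h):
        for x in range(w):
            if not bw[y][x]:
                continue
            # Count black pixels in the 3x3 window, excluding the centre.
            neighbours = sum(
                bw[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                out[y][x] = 0
    return out


# Toy 4x5 image: a small dark blob plus one isolated dark pixel (bottom right).
gray = [
    [255, 255, 255, 255, 255],
    [255,  10,  20, 255, 255],
    [255,  15, 255, 255, 255],
    [255, 255, 255, 255,  30],
]
bw = despeckle(threshold(gray))
for row in bw:
    print("".join("#" if px else "." for px in row))
```

In this toy image the three-pixel blob survives while the lone pixel in the bottom-right corner is cleared as a speckle.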
The Preprocessed Image: Joined chars
The Preprocessed Image: Broken chars
Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Layout recognition • Text zones • Columns of flowed text • Tables • Inverse text • Graphic zones • Text recognition • User assisted correction • Result exportation
Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Layout recognition • Text zones • Graphic zones • Line Art • Photo • Text recognition • User assisted correction • Result exportation
Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Layout recognition • Text recognition • ... Let’s do it when the marketing stuff is over... • User assisted correction • Result exportation
Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Layout recognition • Text recognition • User assisted correction • By the user’s random editing... • Pop-up verifier • Manual Training • By proofreading of doubtful words • Result exportation
Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Layout recognition • Text recognition • User assisted correction • By the user’s random editing... • By proofreading of doubtful words • Correct: User dictionary • Changed: IntelliTrain • Remember trained characters • Apply them on following pages • Result exportation
IntelliTrain • Recognized word: sorneUüng
IntelliTrain • Recognized word: sorneUüng • Fixed word: something
IntelliTrain • Recognized word: sorneUüng • Fixed word: something • Substitutions found: m ↔ rn, thi ↔ Uü
IntelliTrain • Recognized word: sorneUüng • Fixed word: something • Substitutions found: m ↔ rn, thi ↔ Uü • Perform automatically: • Learn image pattern and substitution info • Find similar substituted (‘blue’) text on the current page • Match against pattern of substitution and correct • Find such errors on following pages, too
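The substitution-finding step can be sketched at the text level by aligning the recognized word with the user's fix. This is only an illustration: the slide says IntelliTrain also learns the image pattern, which is not modelled here. The helper names `find_substitutions` and `apply_substitutions`, and the use of Python's `difflib` for the alignment, are assumptions made for this example.

```python
from difflib import SequenceMatcher


def find_substitutions(recognized, fixed):
    """Return (misread, correct) pairs for the spans where the two words differ."""
    matcher = SequenceMatcher(None, recognized, fixed, autojunk=False)
    return [
        (recognized[i1:i2], fixed[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_substitutions(word, substitutions, dictionary):
    """Try each learned substitution; accept a candidate only if it is a known word."""
    for misread, correct in substitutions:
        if misread and misread in word:
            candidate = word.replace(misread, correct)
            if candidate in dictionary:
                return candidate
    return word


subs = find_substitutions("sorneUüng", "something")
print(subs)
# A later word showing the same "rn" for "m" confusion (hypothetical example).
print(apply_substitutions("tirne", subs, {"time", "something"}))
```

Checking candidates against a dictionary keeps a learned pair such as "rn" for "m" from rewriting words that legitimately contain "rn".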
Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Layout recognition • Text recognition • User assisted correction • Result exportation • Combine pages into a Document • Header / Footer recognition • Page numbers • Hyperlinks (e.g. „See Table 20”) • Save results
Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Layout recognition • Text recognition • User assisted correction • Result exportation • Combine pages into a Document • Save results • doc file • e-mail • Speech synthesizer
OP11 Internals • Text recognition in ScanSoft’s OP11 • OCR Engines available: • Caere’s engine (codename: Salt & Pepper) • Recognita’s engine (codename: Paprika)
OP11 Internals • Text recognition in ScanSoft’s OP11 • OCR Engines available: • Caere’s engine (Salt & Pepper) • Uses a Matrix Matching based algorithm • feature set: 40 cells of an 8x5 grid • good overall description of a shape • weaker at detailed structure • Recognita’s engine (Paprika) • Uses a Contour Tracing based algorithm • feature set: convex and concave arcs on the contour • good detailed description of a shape • weaker at overall structure
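The matrix-matching idea (40 cells of an 8x5 grid) can be sketched as follows. This is not Caere's implementation. The sketch assumes 8 rows by 5 columns, uses the fraction of black pixels per cell as the feature, and picks the template with the smallest squared distance. All names and the toy glyphs are hypothetical.

```python
def grid_features(bitmap, rows=8, cols=5):
    """Fraction of black pixels in each cell of a rows x cols grid (40 values by default)."""
    h, w = len(bitmap), len(bitmap[0])
    feats = []
    for r in range(rows):
        y0, y1 = r * h // rows, (r + 1) * h // rows
        for c in range(cols):
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cell = [bitmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feats.append(sum(cell) / len(cell) if cell else 0.0)
    return feats


def classify(bitmap, templates):
    """Return the label of the template closest to the bitmap's grid features."""
    feats = grid_features(bitmap)

    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(feats, templates[label]))

    return min(templates, key=distance)


def parse(rows):
    """Turn '#'/'.' strings into a 0/1 bitmap."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


glyph_i = parse(["..#.."] * 8)
glyph_l = parse(["#...."] * 7 + ["#####"])
templates = {"I": grid_features(glyph_i), "L": grid_features(glyph_l)}

# A slightly damaged "L": one stray pixel and a shortened foot.
noisy_l = parse(["#...."] * 6 + ["#..#.", "####."])
print(classify(noisy_l, templates))
```

Because each cell only summarizes how much ink it contains, the damaged glyph still lands nearest the "L" template. That matches the slide's point that this feature set describes overall shape well but is weaker at detailed structure.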
OP11 Internals • Text recognition in ScanSoft’s OP11 • OCR Engines available: • Caere’s engine (Salt & Pepper) • Recognita’s engine (Paprika) • Segmentation algorithms:
Segmentation What are those pixel groups belonging to a single letter?
OP11 Internals • Text recognition in ScanSoft’s OP11 • OCR Engines available: • Caere’s engine (Salt & Pepper) • Recognita’s engine (Paprika) • Segmentation algorithms: • Developed by independent groups • Have different strengths and weaknesses
OP11 Internals • Text recognition in ScanSoft’s OP11 • OCR Engines available: • Caere’s engine (Salt & Pepper) • Recognita’s engine (Paprika) • Segmentation algorithms • Conclusion: • They are complementary • Let’s create a voting system
OP11 Internals • Voting strategies • External „Black box” voting [Diagram labels: Image; Paprika; Salt & Pepper; Txt 2; Txt 1; Vote?; Final Txt]
OP11 Internals • Voting strategies • External „Black box” voting [Diagram labels: Image; Paprika; Salt & Pepper; Txt 2; Txt 1; Dict; Vote; Final Txt]
OP11 Internals • Voting strategies • External „Black box” voting: ~15% gain [Diagram labels: Image; Paprika; Salt & Pepper; Txt 2; Txt 1; Dict; Vote; Final Txt]
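External "black box" voting with a dictionary can be illustrated with a word-level sketch. The slides do not describe the actual voting rule, so the rule below is an assumption: keep words the engines agree on, otherwise prefer the reading found in the dictionary, and fall back to the first engine on a tie. The sketch also assumes the two outputs are already aligned word for word.

```python
def vote(words_a, words_b, dictionary):
    """Word-by-word vote between two aligned engine outputs."""
    final = []
    for a, b in zip(words_a, words_b):
        a_known = a.lower() in dictionary
        b_known = b.lower() in dictionary
        if a == b:
            final.append(a)      # engines agree
        elif b_known and not a_known:
            final.append(b)      # only engine B produced a dictionary word
        else:
            final.append(a)      # A is known, or tie: fall back to engine A
    return final


dictionary = {"the", "quick", "brown", "fox"}
txt1 = "the quick brovvn fox".split()   # hypothetical output of engine A
txt2 = "tbe quick brown fox".split()    # hypothetical output of engine B
print(" ".join(vote(txt1, txt2, dictionary)))
```

Each engine makes a different mistake here, so the dictionary lets the vote recover a correct line that neither engine produced on its own.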
OP11 Internals • Voting strategies • External „Black box” voting • Internal „Shape” voting [Diagram labels: Image; Salt & Pepper; Paprika; Txt 1; Txt 2; Dict; Bronze; Final Txt]
OP11 Internals • Paprika • Recognize original segmentation (inputs: Image, K.B.) • Original segmentation: every independent connected component is a character • Good segmentation: recognize • Bad segmentation: reject
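The "original segmentation" rule, where every independent connected component is treated as a character, can be sketched with a breadth-first flood fill. The 8-connectivity, the bounding-box output and the function name are assumptions made for this example. Paprika's actual contour-tracing code is not shown in the slides.

```python
from collections import deque


def connected_components(bw):
    """Return sorted bounding boxes (left, top, right, bottom) of 8-connected black regions."""
    h, w = len(bw), len(bw[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not bw[y][x] or seen[y][x]:
                continue
            # Flood-fill one component, growing its bounding box as we go.
            queue = deque([(y, x)])
            seen[y][x] = True
            left, top, right, bottom = x, y, x, y
            while queue:
                cy, cx = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if bw[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)


rows = [
    "##..#..##",
    "#...#...#",
    "##..#..##",
]
image = [[1 if ch == "#" else 0 for ch in row] for row in rows]
print(connected_components(image))
```

Under this rule two touching characters come out as a single box and a broken character as several boxes. These are the joined and broken cases that the first pass rejects and leaves for later passes.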
OP11 Internals • Paprika • Recognize original segmentation (inputs: Image, K.B.) • Train adaptive classifier from original shapes (Txt 1, Adaptive K.B.)
OP11 Internals • Paprika • Recognize original segmentation (inputs: Image, K.B.) • Train adaptive classifier from original shapes (Txt 1, Adaptive K.B.) • Recognize broken and joined shapes • Try several segmentations • Loop if unrecognizable
OP11 Internals • Paprika • Recognize original segmentation (inputs: Image, K.B.) • Train adaptive classifier from original shapes (Txt 1, Adaptive K.B.) • Recognize broken and joined shapes • Train adaptive classifier from ‘ugly’ shapes
OP11 Internals • Paprika • Recognize original segmentation (inputs: Image, K.B.) • Train adaptive classifier from original shapes (Txt 1, Adaptive K.B.) • Recognize broken and joined shapes • Train adaptive classifier from ‘ugly’ shapes • Recognize more broken and joined shapes • Try several segmentations • Loop if unrecognizable (Txt 2)
OP11 Internals • Voting strategies: ~45% gain [Diagram labels: Image; Salt & Pepper; Paprika; Txt 1; Txt 2; Dict; Bronze; Final Txt]
OP12 • Voting strategies: +20% gain [Diagram labels: Image; Fireworx; Salt & Pepper; Paprika; Txt 1A; Txt 1B; Txt 2; Dict; Bronze; Final Txt]