1 / 42

Omnipage Pro

SSIP 2002, Budapest. Omnipage Pro. Internal Structure of the Character Recognition Engine used inside. Dr. István Marosi Recosoft Ltd., Hungary. Some „Marketing talk”. Main tasks of an OCR system: Image acquisition Layout recognition Text recognition User assisted correction

dorothydunn
Download Presentation

Omnipage Pro

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. SSIP 2002, Budapest Omnipage Pro Internal Structure of the Character Recognition Engine used inside Dr. István Marosi Recosoft Ltd., Hungary

  2. Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Layout recognition • Text recognition • User assisted correction • Result exportation Recosoft Ltd

  3. Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Get image • B/W Scanning • Gray Scanning • Color Scanning • Load from image file • Preprocess image • Layout recognition • Text recognition • User assisted correction • Result exportation Recosoft Ltd

  4. Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Get image • Preprocess image • Color separation • Thresholding • Despeckling • Rotation • Deskewing • Layout recognition • Text recognition • User assisted correction • Result exportation Recosoft Ltd

  5. The Preprocessed Image Joined chars Recosoft Ltd

  6. The Preprocessed Image Joined chars Recosoft Ltd

  7. The Preprocessed Image Broken chars Recosoft Ltd

  8. The Preprocessed Image Broken chars Recosoft Ltd

  9. Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Layout recognition • Text zones • Columns of flowed text • Tables • Inverse text • Graphic zones • Text recognition • User assisted correction • Result exportation Recosoft Ltd

  10. Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Layout recognition • Text zones • Graphic zones • Line Art • Photo • Text recognition • User assisted correction • Result exportation Recosoft Ltd

  11. Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Layout recognition • Text recognition • ... Let’s do it when the marketing staff is over... • User assisted correction • Result exportation Recosoft Ltd

  12. Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Layout recognition • Text recognition • User assisted correction • By the user’s random editing... • Pop-up verifier • Manual Training • By proofreading of doubtful words • Result exportation Recosoft Ltd

  13. Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Layout recognition • Text recognition • User assisted correction • By the user’s random editing... • By proofreading of doubtful words • Correct: User dictionary • Changed: IntelliTrain • Remember trained characters • Apply them on following pages • Result exportation Recosoft Ltd

  14. IntelliTrain Recognized word: sorneUüng Recosoft Ltd

  15. IntelliTrain Recognized word: sorneUüng Fixed word: something Recosoft Ltd

  16. IntelliTrain Recognized word: sorneUüng Fixed word: something Recosoft Ltd

  17. IntelliTrain Recognized word: sorneUüng Fixed word: something Substitutions found: m rn thiUü Recosoft Ltd

  18. IntelliTrain Recognized word: sorneUüng Fixed word: something Substitutions found: m rn thiUü Perform automatically: • Learn image pattern and substitution info • Find similar substituted (‘blue’) text on actual page • Match against pattern of substitution and correct • Find such errors on following pages, too Recosoft Ltd

  19. Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Layout recognition • Text recognition • User assisted correction • Result exportation • Combine pages into a Document • Header / Footer recognition • Page numbers • Hyperlinks (e.g. „See Table 20”) • Save results Recosoft Ltd

  20. Some „Marketing talk” • Main tasks of an OCR system: • Image acquisition • Layout recognition • Text recognition • User assisted correction • Result exportation • Combine pages into a Document • Save results • doc file • e-mail • Speech synthesizer Recosoft Ltd

  21. OP11 Internals • Text recognition in ScanSoft’s OP11 • OCR Engines available: • Caere’s engine (codename: Salt & Pepper) • Recognita’s engine (codename: Paprika) Recosoft Ltd

  22. OP11 Internals • Text recognition in ScanSoft’s OP11 • OCR Engines available: • Caere’s engine (Salt & Pepper) • Uses a Matrix Matching based algorithm • feature set: 40 cells of an 8x5 grid • good overall description of a shape • weaker at detailed structure • Recognita’s engine (Paprika) • Uses a Contour Tracing based algorithm • feture set: convex and concave arcs on the contour • good detailed description of a shape • weaker at overall structure Recosoft Ltd

  23. OP11 Internals • Text recognition in ScanSoft’s OP11 • OCR Engines available: • Caere’s engine (Salt & Pepper) • Recognita’s engine (Paprika) • Segmentation algorithms: Recosoft Ltd

  24. Segmentation What are those pixel groups belonging to a single letter?

  25. Segmentation What are those pixel groups belonging to a single letter?

  26. Segmentation What are those pixel groups belonging to a single letter?

  27. Segmentation What are those pixel groups belonging to a single letter?

  28. Segmentation What are those pixel groups belonging to a single letter?

  29. Segmentation What are those pixel groups belonging to a single letter?

  30. OP11 Internals • Text recognition in ScanSoft’s OP11 • OCR Engines available: • Caere’s engine (Salt & Pepper) • Recognita’s engine (Paprika) • Segmentation algorithms: • Developed by independent groups • Have different strengths and weaknesses Recosoft Ltd

  31. OP11 Internals • Text recognition in ScanSoft’s OP11 • OCR Engines available: • Caere’s engine (Salt & Pepper) • Recognita’s engine (Paprika) • Segmentation algorithms • Conclusion: • They are complementary • Let’s create a voting system Recosoft Ltd

  32. OP11 Internals Image • Voting strategies • External „Black box”voting Paprika Salt &Pepper Txt 2 Txt 1 Vote? Final Txt Recosoft Ltd

  33. OP11 Internals Image • Voting strategies • External „Black box”voting Paprika Salt &Pepper Txt 2 Txt 1 Dict Vote Final Txt Recosoft Ltd

  34. OP11 Internals Image • Voting strategies • External „Black box”voting~15% gain Paprika Salt &Pepper Txt 2 Txt 1 Dict Vote Final Txt Recosoft Ltd

  35. OP11 Internals Image • Voting strategies • External „Black box”voting • Internal „Shape”voting Salt &Pepper Paprika Txt 1 Txt 2 Dict Bronze Final Txt Recosoft Ltd

  36. Image OP11 Internals Recognize originalsegmentation K.B. • Paprika Original segmentation: Every independent connected component is a character Good segmentation: recognize Bad segmentation: reject Recosoft Ltd

  37. Image OP11 Internals Recognize originalsegmentation K.B. • Paprika Train adaptive classifierfrom original shapes Txt 1 AdaptiveK.B. Recosoft Ltd

  38. Image OP11 Internals Recognize originalsegmentation K.B. • Paprika • Try several segmentations • Loop if unrecognizable Train adaptive classifierfrom original shapes Txt 1 AdaptiveK.B. Recognize broken andjoined shapes Recosoft Ltd

  39. Image OP11 Internals Recognize originalsegmentation K.B. • Paprika Train adaptive classifierfrom original shapes Txt 1 AdaptiveK.B. Recognize broken andjoined shapes Train adaptive classifierfrom ‘ugly’ shapes Recosoft Ltd

  40. Image OP11 Internals Recognize originalsegmentation K.B. • Paprika Train adaptive classifierfrom original shapes Txt 1 AdaptiveK.B. Recognize broken andjoined shapes Train adaptive classifierfrom ‘ugly’ shapes Recognize more brokenand joined shapes • Try several segmentations • Loop if unrecognizable Txt 2 Recosoft Ltd

  41. OP11 Internals Image • Voting strategies ~45% gain Salt &Pepper Paprika Txt 1 Txt 2 Dict Bronze Final Txt Recosoft Ltd

  42. OP12 Image Fire- worx Salt &Pepper • Voting strategies +20% gain Paprika Txt 1A Txt 1B Txt 2 Dict Bronze Final Txt Recosoft Ltd

More Related