Statistical Methods for Text Error Correction

Statistical Methods for Text Error Correction By Walid Magdy Supervision of Prof. Dr. MohsenRashwan Prof. Dr. Kareem Darwish

Outline • Motivation • Prior Work • Contribution • Text Fusion • Omni-Font Error Correction • Integrated system • Conclusion • Possible Future Directions

Motivation • Massive efforts is moving toward digitization • Digitization is for: Availability & Information Retrieval • OCR is the main enabling technology • OCR systems is far from perfect • Poor quality OCR = Low readability & Low IR effectiveness • Arabic OCR accuracy is much lower than other languages • The need of higher quality text for Arabic documents became a must for improving readability & IR effectiveness.

Prior Work • Previous work on OCR focused on: • Building better OCR systems • Improving text quality through text correction • Improving IR effectiveness regardless of improving text quality • Examples: • Sakhr & RDI OCR systems • OCR correction based on character error model and Language model • Index term selection for indexing degraded text • Query garbling based on character error model

Contribution • Fusion of multiple OCR output text • Omni-Font OCR error correction • Information retrieval improvement for degraded text • System design that can reduce errors in degraded text by more than 80%

Text Fusion Outlines • Definition • Approach • Implementation • Experimental Setup • Results

S1 = S0 + ε1 S2 = S0 + ε2 Sn = S0 + εn Text Fusion Previous approaches depends on the presence of only one source of degraded text. Our approach assumes the presence of more than one version of the degraded text. Definition Degraded version of text Clean version of text Noisy edit operations OCR Simage Sx = S0 + εx Correction Sx = S0 + εx S0’ = S0 + ε0’ ε0’ < εx Fusion ε0’ < min(ε1 … εn) S0’ = S0 + ε0’

Text Fusion Approach Image OCR System1 OCR System2 ولاثديإلافيألمستدلالهبنور4ولاحيا4إلا في رضا4 ولا ثدي إلا في ألمستدلاله بنور4 ولا حيا4 إلا في رضا4 ولم هدء! إلم فيالاستدلالبنورهولمحياةملأفيرضا5 ولم هدء! إلم في الاستدلال بنوره ولم حياة ملأ في رضا5 Text Fusion ولا ثدي إلا في الاستدلال بنوره ولا حياة إلا في رضا5

Text Fusion Implementation OCR Text 1 Fused Text Word Alignment Best Fitting Word Selection OCR Text 2 Language Model OCR1: W1 W2 W3 W4 ….. Wn-2Wn-1 Wn OCR2: W1 W2 W3 W4 ….. Wm-2 Wm-1 Wm

Text Fusion Fusion was tested in two ways:1. Effect on Error Reduction2. Effect on Information Retrieval ZAD Alma’adbook was used for test, which contains- OCR output using Sakhr- Clean version of the book- Queries with relevance judgments A tri-gram LM from Ibn-Taymia books Two OCR systems were available:- Sakhr Automatic Reader- RDI OCR system Experimental Setup

Text Fusion Ten clean pages were selected atrandom & printed in three different fonts (Kufi, Mudir, and Simplified Arabic) Set contains 4,200 words with 0.9% of words are OOV Each version of text is scanned with two different resolutions (200x200 dpi & 300x300dpi) Each scanned version is OCRed using both OCR systems (RDI & Sakhr) Different versions were fused, & CER’s & WER’s were checked for all version (Original & Fused versions) Effect on Error Reduction (1/2)

Text Fusion Effect on Error Reduction (2/2) WER for different versions of ZAD test set Original Version Fusion Version Kufi Mudir Simplified

Text Fusion Relevance judgments on ZAD was built on the whole book (2,730 documents) To produce multiple versions of degraded text of ZAD: Sample of the original OCR version were manually corrected Degraded and Clean text were used to create a character error model based on 1:1 character mapping Generated model is then used to garble clean version with different CER’s Different versions were fused, them MAP was used as the figure of merit for IR results Results showed some improvement in IR effectiveness Effect on Retrieval Effectiveness

Omni-Font Correction Outlines • Idea • Implementation • Experimental Setup • Results

Omni-Font Correction Idea • Character Error Model • بب0.8 • بي0.05 • بد0.05 • بق0.03 • بذ 0.02 • Domain • Politics • Religious • Science • Sports • Candidates (Dictionary) • الاستقلال • الاستبدال • الاستدلال • الاستبلاد • الاستغلال • الاستبسال • الاستذلال Religious الاستبلال • Context • على وجوده • لله تعـالـى

Omni-Font Correction Implementation OCR Text Corrected Text Generate Candidates Best Fitting Word Selection Edit Distance Language Model Dictionary

Omni-Font Correction Correction was tested in two ways:1. Effect on Error Reduction2. Effect on Information Retrieval ZAD was used for the experiments, and another collection (AFP (TREC)) was used to test the system in a different domain (news domain) Two LMs were used to test the AFP collection:1. LM built from same time period of AFP2. LM built from different time period of AFP All results were compared to previous work in correction using the character error model Experimental Setup

Omni-Font Correction Effect of Error Reduction ZAD TREC

Omni-Font Correction Effect of Retrieval Results in MAP for searching different versions of the ZAD collection

Integrated System Outlines • Possible Implementations • Detailed Implementation • Experimental Setup • Results

Integrated System Possible Implementations Degraded Text Degraded Text Less degraded Text Less degraded Text Correction Fusion Fusion Correction

Integrated System Implementation Degraded Text Version1 Degraded Text Version2 Fused Text Version Much lower errors version Text Correction Fusion Degraded Text Versionn

Integrated System Correction was applied on all fused versions of text shown in fusion section (versions of ZAD) Correction was applied in two different manners:1. Whole text correction2. OOV text correction For fused versions, 55% of WER are OOV Experimentation Setup

Integrated System Results

Conclusion (1/2) • Text fusion proved to be mostly effective • Fusion of two OCRed text of the same image with different resolution improves text quality • Omni-font correction proved its effectiveness on error reduction • Using trained character error model have better effect on error reduction than uniform one. However, both has indistinguishable IR effectiveness.

Conclusion (2/2) • The key for better correction is using a well trained language model • An integrated system that comprise the effectiveness of text fusion and error correction proved its ability on achieving a significant reduction in errors: • Average Error Reduction = 56% • Max. Error Reduction = 86% • Max. Theoretical Error Reduction: WER = OOV

Possible Future Directions • Applying Text Fusion on character level instead of word level • Applying different implementations for the integrated system • Using Factored language model • Applying all experiments for different types of degraded text, such as ASR • Testing the usage of a huge amount of data for creating a general LM instead of a domain specific LM

بزاكم الته خيرا جزاكم الله خيرا جزاكم الله خبرا

Statistical Methods for Text Error Correction