Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Walid Magdy & Kareem Darwish, IBM Technology Development Center, PO Box 166 El-Ahram, Giza, Egypt, {wmagdy,darwishk}@eg.ibm.com

Outline: • Motivation • Background • Approach • Experimental Setup • Results • Conclusion • Future Work

Motivation: • Timeline labels (1400 to 2000): First printing press; E-text becomes commonplace; 1998: Arabic e-text comes online; Read to search; Automated full text search • Problem: 500+ years of legacy documents • Goal: To search printed documents efficiently and effectively • Does OCR solve the problem?

Arabic Language Challenges • Orthography • Character shape depends on position • 15 of the 28 letters contain dots • Optional diacritics may be present • Printed text may include ligatures and kashida • Morphology • Prefix, infix, and suffix • 6x10^10 possible surface forms • Other factors • Eighth most widely spoken language in the world • Web growth started only recently • Example: وسـيــكـتبونـهـا = wasaya+ktub+uunahaa = and will + write + they it = and they will write it

Arabic Pre-processing & Retrieval • Pre-processing: • Remove diacritics • Normalize the different forms of alef and ya to account for • Common spelling errors • Grammatical, morphological, and orthographic properties • Forms: أ ، آ ، إ ، ا ، ؤ ، ئ ، ءا and ى ، يي • Text Retrieval: Best Index Terms • Regular text: light stemming and character 3-grams and 4-grams are best • OCR text: character 3-grams and 4-grams are best
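The pre-processing and n-gram indexing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, stripping kashida alongside the diacritics is an added assumption, and mapping alef maqsura to ya (rather than the reverse) is one common choice that the slide does not specify. The hamza-carrier forms listed on the slide are left untouched here.

```python
import re

# Arabic diacritics (U+064B..U+0652) plus kashida/tatweel (U+0640).
# Removing kashida is an assumption; the slide only says "remove diacritics".
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Alef with madda, hamza above, hamza below -> bare alef (U+0627).
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize_arabic(text: str) -> str:
    """Strip diacritics and fold alef/ya variants (hypothetical helper)."""
    text = _DIACRITICS.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)
    # Alef maqsura (U+0649) -> ya (U+064A); the direction is an assumption.
    return text.replace("\u0649", "\u064A")


def char_ngrams(word: str, n: int) -> list[str]:
    """Overlapping character n-grams; short words are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```

Because a single OCR error damages only the n-grams that overlap it, the remaining n-grams of a word can still match the query, which is one plausible reading of why 3-grams and 4-grams work well on OCR text.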

Main Idea: Image -> OCR -> Degraded Text -> Correction -> Corrected Text • Image text: Word-Based Correction for Retrieval of Arabic OCR Degraded Documents • Degraded text: VVorcl-Easod Comectlon l0r Belrieval of Arahie OCR Dcgraclod Doeurnerits • Corrected text: Word-Based Correction for Retrieval of Arabic OCR Degraded Documents • We want to examine the effect of correction on retrieval

Approach (diagram components): OCR system; OCR Degraded Text; OCR Correction; OCR Corrected Text; Indexing; Ranked List of Documents

Experimental Setup: • Test collections • Error Correction • Building Error Model • Training & Decoding • Experiments

Document Collections:

The ZAD Collection: Sample Document: Sample Query: حكم التيمم ومتى شرع

The TREC 2002 CLIR Collection: Sample Document: <DOC> <DOCNO>19940513_AFP_ARB0001</DOCNO> <HEADER> ارا0800 4 ع 7710 قبرص /افب-تصج86 الشرق الاوسط/سلام/حكم ذاتي </HEADER> <BODY> <HEADLINE> &HT; العلم الفلسطيني لم يُرفع فوق كنيس اريحا </HEADLINE> <TEXT> <P> اريحا (الضفة الغربية) 31-5 (اف ب)- يقوم احد عناصر الشرطة الفلسطينية بحراسة مدخل الكنيس اليهودي في وسط اريحا احد آخر مواقع المدينة التي تم تسليمها الى الشرطة الفلسطينية الا انه لم يتم رفع العلم الفلسطيني فوق الكنيس </P> <P> وقال ضابط فلسطيني لفلسطينية كانت تحاول رفع العلم الفلسطيني فوق الكنيس "هذا مكان مقدس" </P> <P> وقبيل ذلك اقترب ثلاثة مستوطنين يهود من مدخل الكنيس الذي كان الجنود الاسرائيليون ما زالوا يوءمنون حراسته وعندما منعهم الجنود من الدخول قاموا بتمزيق ثيابهم </P> </TEXT> Sample Query: سجناء حرب ايرانيين وعراقيين

OCR-Correction Model: • Training: OCR Degraded Text + Manual Corrected OCR Text -> Aligning Characters Mapping -> Build Error Model • Decoding: OCR Degraded Text -> Generate Corrections -> Pick up most likely correction using Bayes' Rule -> OCR Corrected Text

Aligning Characters Mapping (S = substitution, I = insertion, D = deletion, √ = match):
• 1:1 Mapping. Ex: walid -> vvaicl
  w -> v (S), Null -> v (I), a -> a (√), l -> Null (D), i -> i (√), d -> c (S), Null -> l (I)
• m:n Mapping. Ex: walid -> vvaicl
  w -> vv (S), a -> a (√), l -> Null (D), i -> i (√), d -> cl (S)
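A 1:1 alignment of this kind can be obtained with a standard edit-distance table and a backtrace. The sketch below is an illustration under that assumption; the slides do not state which alignment algorithm was used, and `align` is a hypothetical name. When several alignments have the same cost, the one returned may differ from the one drawn on the slide.

```python
def align(src: str, tgt: str) -> list[tuple[str, str, str]]:
    """Return (src_char, tgt_char, op) triples; op is 'match', 'S', 'I' or 'D'."""
    m, n = len(src), len(tgt)
    # dist[i][j] = edit distance between src[:i] and tgt[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
    for j in range(1, n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = dist[i - 1][j - 1] + (src[i - 1] != tgt[j - 1])
            dist[i][j] = min(diag, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Walk back from the bottom-right corner to recover one optimal alignment.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]):
            op = "match" if src[i - 1] == tgt[j - 1] else "S"
            ops.append((src[i - 1], tgt[j - 1], op))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append((src[i - 1], "", "D"))  # source char has no OCR counterpart
            i -= 1
        else:
            ops.append(("", tgt[j - 1], "I"))  # OCR char has no source counterpart
            j -= 1
    ops.reverse()
    return ops
```

An m:n mapping such as w -> vv could then be derived by merging each insertion or deletion into a neighboring substitution, though that merging step is not shown here.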

Building Error Model: where CkCl and DxDy are each a segment of one or more characters
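The formula for this slide did not survive in the transcript. One common way to turn aligned segments into an error model is a relative-frequency estimate, sketched below. That choice is an assumption rather than the authors' stated formula, and `build_error_model` is a hypothetical name.

```python
from collections import Counter, defaultdict


def build_error_model(aligned_pairs):
    """Estimate P(ocr_segment | correct_segment) by relative frequency.

    aligned_pairs: iterable of (correct_segment, ocr_segment) tuples taken
    from aligned training text. An empty string stands for Null.
    """
    aligned_pairs = list(aligned_pairs)
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(correct for correct, _ in aligned_pairs)
    model = defaultdict(dict)
    for (correct, ocr), count in pair_counts.items():
        model[correct][ocr] = count / source_counts[correct]
    return dict(model)
```

For example, if "d" is aligned to "d" eight times and to "cl" twice in the training data, the sketch assigns the mapping d -> cl a probability of 0.2.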

Decoding: Bayes' Rule: $\hat{Word}_{correct} = \arg\max_{Word_{correct}} P(Word_{OCR} \mid Word_{correct}) \, P(Word_{correct})$ • Character Level model: $P(Word_{OCR} \mid Word_{correct})$ • Word Level model: $P(Word_{correct})$ = LM probability (a simple unigram probability was used)

Example: OCR word "dairn"
• Character Level Model: • Segmentation • Mapping • Generate Candidates
• Segmentations of dairn: d a i r n | da i r n | d ai r n | dai r n | d a ir n | da ir n | d air n | dair n | d a i rn | da i rn | d ai rn | dai rn | d a irn | da irn | d airn | dairn
• Mappings for the segmentation d a i rn (ε = empty):
  d: d 0.8, h 0.1, cl 0.08, ε 0.02
  a: a 0.9, o 0.05, r 0.02, oi 0.015, ε 0.005, n 0.005, e 0.005
  i: i 0.84, l 0.12, ε 0.02, t 0.015, ll 0.005, (label not preserved) 0.005
  rn: rn 0.7, m 0.15, im 0.02, ln 0.015, ε 0.005
  (segment label not preserved): l 0.09, i 0.05, li 0.02, s 0.015, f 0.005, t 0.005, a 0.005
• Generated candidates: dairn 0.425, daim 0.091, claim 0.0091, aim 0.00227, horn 0.00007
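The segmentation and candidate-generation steps can be sketched as below. Two assumptions are made: a candidate's character-level score is the product of its segment mapping probabilities, and the unlabeled entries in the d and i lists are the empty string. Both are consistent with the slide's numbers (for instance 0.08 x 0.9 x 0.84 x 0.15 is about 0.0091 for claim) but are not stated on the slide. Function names are hypothetical, and `MAPPINGS` is only a subset of the slide's table.

```python
from itertools import product
from math import prod


def segmentations(word: str) -> list[list[str]]:
    """All ways to split word into contiguous, non-empty segments."""
    if not word:
        return [[]]
    result = []
    for i in range(1, len(word) + 1):
        for rest in segmentations(word[i:]):
            result.append([word[:i]] + rest)
    return result


# Subset of the slide's mapping table; "" stands for the empty string.
MAPPINGS = {
    "d": {"d": 0.8, "h": 0.1, "cl": 0.08, "": 0.02},
    "a": {"a": 0.9, "o": 0.05},
    "i": {"i": 0.84, "": 0.02},
    "rn": {"rn": 0.7, "m": 0.15},
}


def generate_candidates(segments, mappings):
    """Score every combination of segment replacements by the product of its probabilities."""
    scores = {}
    options = [mappings[segment].items() for segment in segments]
    for combo in product(*options):
        word = "".join(replacement for replacement, _ in combo)
        score = prod(p for _, p in combo)
        # Different combinations can spell the same word; sum their scores.
        scores[word] = scores.get(word, 0.0) + score
    return scores


candidates = generate_candidates(["d", "a", "i", "rn"], MAPPINGS)
```

A word of length n has 2^(n-1) segmentations, so a practical system would need to limit segment length or prune low-probability mappings. The slide does not say how this was handled.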

Example (cont): Word Level Model: Find the frequency of occurrence of each generated word in the dictionary
• P ( dairn | dairn ) = 0.425, Freq ( dairn ) = 0
• P ( daim | dairn ) = 0.091, Freq ( daim ) = 0
• P ( claim | dairn ) = 0.0091, Freq ( claim ) = 1500
• P ( aim | dairn ) = 0.00227, Freq ( aim ) = 4000
• P ( horn | dairn ) = 0.00007, Freq ( horn ) = 150
• Result: dairn -> claim
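The final selection can be reproduced from the numbers on this slide. The sketch assumes the score is the character-level probability multiplied by the raw dictionary frequency. Using raw counts instead of normalized unigram probabilities does not change the argmax, because the normalizer is the same for every candidate. `pick_correction` is a hypothetical name.

```python
def pick_correction(char_level: dict[str, float], freq: dict[str, int]) -> str:
    """Return the candidate maximizing character-level probability times word frequency."""
    return max(char_level, key=lambda word: char_level[word] * freq.get(word, 0))


# Values copied from the slide.
char_level = {"dairn": 0.425, "daim": 0.091, "claim": 0.0091, "aim": 0.00227, "horn": 0.00007}
freq = {"dairn": 0, "daim": 0, "claim": 1500, "aim": 4000, "horn": 150}

best = pick_correction(char_level, freq)  # "claim": 0.0091 * 1500 = 13.65 beats aim's 9.08
```

Candidates absent from the dictionary score zero, which is why dairn and daim lose despite having the highest character-level probabilities.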

IR Experiments • The degraded collections were corrected, and the best one, two, three, and five corrections for each word were selected for indexing • The collections were indexed and searched using words, character 3-grams, character 4-grams, and lightly stemmed words • Retrieval performance was tested for every combination of index type and number of corrections • The measure of merit is Mean Average Precision • Significance testing was done using a t-test with a p-value threshold of 0.05
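For reference, Mean Average Precision can be computed as in the sketch below, which uses the standard definition. The function names and document identifiers are illustrative and not taken from the experiments.

```python
def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of precision@k over the ranks k at which a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average the per-query average precision over all queries."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)
```

For a ranking d1, d2, d3, d4 in which d1 and d3 are relevant, the average precision is (1/1 + 2/3) / 2, roughly 0.83.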

Correction Results (figure panels): ZAD Collection; TREC Collection

IR Results: “ZAD Collection” (figure labels: Clean, Bad)

IR Results: “TREC Collection” (figure labels: Clean, Bad)

Conclusion & future work: • Conclusion: • Although the word error rate (WER) was halved, IR effectiveness did not improve by a statistically significant margin • Using more than one correction does not help • Indexing using n-grams (shorter index terms) is better than “moderate” error correction • Future work: • Effect of using an n-gram word LM on error correction • “Magdy, W. and K. Darwish. Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology. In EMNLP 2006” • Effect of “good” error correction on improving retrieval effectiveness

Lnanh gon -> Correction -> Thank you
