Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents
Walid Magdy & Kareem Darwish
IBM Technology Development Center, PO Box 166 El-Ahram, Giza, Egypt
{wmagdy,darwishk}@eg.ibm.com
Outline: • Motivation • Background • Approach • Experimental Setup • Results • Conclusion • Future Work
Motivation:
Timeline (1400 to 2000): first printing press; e-text becomes commonplace; 1998: Arabic e-text comes online. Search moves from "read to search" to automated full-text search.
Problem: 500+ years of legacy documents
Goal: To search printed documents efficiently and effectively
Does OCR solve the problem?
Arabic Language Challenges
• Orthography
  • Character shape depends on position
  • 15 of the 28 letters contain dots
  • Optional diacritics may be present
  • Printed text may include ligatures and kashida
• Morphology
  • Prefix, infix, and suffix
  • 6×10^10 possible surface forms
• Other factors
  • Eighth most widely spoken language in the world
  • Web growth started only recently
Example: وسـيــكـتبونـهـا = wasaya+ktub+uunahaa = and will + write + they it = "and they will write it"
Arabic Pre-processing & Retrieval
• Pre-processing:
  • Remove diacritics
  • Normalize the different forms of alef and ya to account for
    • Common spelling errors
    • Grammatical, morphological, and orthographic properties
    • أ ، آ ، إ ، ا , ؤ , ئ , ءا ,and ى ، يي
• Text Retrieval: Best Index Terms
  • Regular text: light stemming and character 3- and 4-grams are best
  • OCR text: character 3- and 4-grams are best
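The pre-processing and n-gram indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `preprocess` and `char_ngrams` are hypothetical, and the exact normalization table (alef variants to bare alef, alef maqsura to ya) is an assumption based on the letter forms listed on the slide.

```python
import re

# Arabic diacritics: tanween, short vowels, shadda, sukun (U+064B..U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")

# Assumed normalization table (not taken from the slides verbatim).
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0649": "\u064A",  # alef maqsura          -> ya
})


def preprocess(text):
    """Strip diacritics, then normalize alef and ya variants."""
    return DIACRITICS.sub("", text).translate(NORMALIZE)


def char_ngrams(word, n):
    """Return the overlapping character n-grams of a word.

    Words no longer than n are returned whole (an assumption of this sketch).
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```

Short index terms such as these limit the damage of a single OCR error: one wrong character spoils at most n of a word's n-grams rather than the whole word.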
Main Idea:
Image → OCR → Degraded Text → Correction → Corrected Text
Image: Word-Based Correction for Retrieval of Arabic OCR Degraded Documents
Degraded Text: VVorcl-Easod Comectlon l0r Belrieval of Arahie OCR Dcgraclod Doeurnerits
Corrected Text: Word-Based Correction for Retrieval of Arabic OCR Degraded Documents
We want to examine the effect of correction on retrieval
Approach: OCR system → OCR Degraded Text → OCR Correction → OCR Corrected Text → Indexing → Ranked List of Documents
Experimental Setup: • Test collections • Error Correction • Building Error Model • Training & Decoding • Experiments
The ZAD Collection:
Sample Document:
Sample Query: حكم التيمم ومتى شرع ("the ruling on tayammum and when it was prescribed")
The TREC 2002 CLIR Collection:
Sample Document:
<DOC>
<DOCNO>19940513_AFP_ARB0001</DOCNO>
<HEADER> ارا0800 4 ع 7710 قبرص /افب-تصج86 الشرق الاوسط/سلام/حكم ذاتي </HEADER>
<BODY>
<HEADLINE> &HT; العلم الفلسطيني لم يُرفع فوق كنيس اريحا </HEADLINE>
<TEXT>
<P> اريحا (الضفة الغربية) 31-5 (اف ب)- يقوم احد عناصر الشرطة الفلسطينية بحراسة مدخل الكنيس اليهودي في وسط اريحا احد آخر مواقع المدينة التي تم تسليمها الى الشرطة الفلسطينية الا انه لم يتم رفع العلم الفلسطيني فوق الكنيس </P>
<P> وقال ضابط فلسطيني لفلسطينية كانت تحاول رفع العلم الفلسطيني فوق الكنيس "هذا مكان مقدس" </P>
<P> وقبيل ذلك اقترب ثلاثة مستوطنين يهود من مدخل الكنيس الذي كان الجنود الاسرائيليون ما زالوا يوءمنون حراسته وعندما منعهم الجنود من الدخول قاموا بتمزيق ثيابهم </P>
</TEXT>
Sample Query: سجناء حرب ايرانيين وعراقيين ("Iranian and Iraqi prisoners of war")
OCR-Correction Model:
• Training: OCR Degraded Text + Manually Corrected OCR Text → Aligning Characters Mapping → Build Error Model
• Decoding: OCR Degraded Text → Generate Corrections → Pick the most likely correction using Bayes' Rule → OCR Corrected Text
Aligning Characters Mapping:
1:1 Mapping. Ex: walid → vvaicl
  w → v (S)
  Null → v (I)
  a → a (√)
  l → Null (D)
  i → i (√)
  d → c (S)
  Null → l (I)
m:n Mapping. Ex: walid → vvaicl
  w → vv (S)
  a → a (√)
  l → Null (D)
  i → i (√)
  d → cl (S)
S = substitution, I = insertion, D = deletion, √ = match
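A 1:1 alignment of this kind can be produced with a standard edit-distance (Levenshtein) backtrace. The sketch below is one common way to do it and is not claimed to be the authors' aligner; the function name `align` is hypothetical. When several minimum-cost alignments exist, it returns one of them, which need not be the exact one drawn on the slide.

```python
def align(correct, ocr):
    """Return one minimum-edit 1:1 alignment as (correct_char, ocr_char, op) triples.

    op is "match", "S" (substitution), "D" (deletion), or "I" (insertion).
    """
    n, m = len(correct), len(ocr)
    # dist[i][j] = edit distance between correct[:i] and ocr[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (correct[i - 1] != ocr[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Walk back from the bottom-right cell, recording the operation taken.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dist[i][j] == dist[i - 1][j - 1] + (correct[i - 1] != ocr[j - 1])):
            op = "match" if correct[i - 1] == ocr[j - 1] else "S"
            pairs.append((correct[i - 1], ocr[j - 1], op))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((correct[i - 1], None, "D"))
            i -= 1
        else:
            pairs.append((None, ocr[j - 1], "I"))
            j -= 1
    pairs.reverse()
    return pairs
```

An m:n mapping such as w → vv can then be obtained by merging adjacent edit operations into one segment pair; that merging step is left out of this sketch.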
Building Error Model: probabilities are estimated for mappings between segments C_k…C_l and D_x…D_y, where each segment is one or more characters
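The estimation formula itself is not given in the text above, so the following is only a plausible sketch: it assumes a plain relative-frequency (maximum-likelihood) estimate over aligned segment pairs. The function name `build_error_model`, the pair direction (observed segment first, target segment second), and the toy counts are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_error_model(aligned_pairs):
    """Estimate P(target | source) for each segment pair by relative frequency.

    aligned_pairs is a list of (source_segment, target_segment) tuples
    collected from aligned training text (assumed input format).
    """
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(source for source, _ in aligned_pairs)
    model = defaultdict(dict)
    for (source, target), count in pair_counts.items():
        model[source][target] = count / source_counts[source]
    return model


# Toy training data, invented for this example only.
toy_pairs = [("rn", "rn"), ("rn", "rn"), ("rn", "rn"), ("rn", "m")]
toy_model = build_error_model(toy_pairs)
```

With the toy data, `toy_model["rn"]` assigns 0.75 to `rn` and 0.25 to `m`. A real model would also need smoothing for unseen mappings, which this sketch omits.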
Decoding:
Bayes' Rule:
$$\hat{W}_{correct} = \arg\max_{W_{correct}} P(W_{OCR} \mid W_{correct}) \, P(W_{correct})$$
• Character-level model: $P(W_{OCR} \mid W_{correct})$
• Word-level model: $P(W_{correct})$ = LM probability (a simple unigram probability was used)
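Example:
• Character Level Model:
  • Segmentation
  • Mapping
  • Generate Candidates
Ex: dairn
Segmentations: d a i r n, da i r n, d ai r n, dai r n, d a ir n, da ir n, d air n, dair n, d a i rn, da i rn, d ai rn, dai rn, d a irn, da irn, d airn, dairn
Mapping for the segmentation d a i rn, with ε at each insertion position ("(blank)" marks an entry with no visible label):
  ε → l 0.09, i 0.05, li 0.02, s 0.015, f 0.005, t 0.005, a 0.005
  d → d 0.8, h 0.1, cl 0.08, (blank) 0.02
  a → a 0.9, o 0.05, r 0.02, oi 0.015, (blank) 0.005, n 0.005, e 0.005
  i → i 0.84, l 0.12, (blank) 0.02, t 0.015, ll 0.005, (blank) 0.005
  rn → rn 0.7, m 0.15, im 0.02, ln 0.015, (blank) 0.005
Generated candidates: dairn 0.425, daim 0.091, claim 0.0091, aim 0.00227, horn 0.00007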
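The segmentation and candidate-generation steps can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: `segmentations`, `candidates`, and `MAPPING` are hypothetical names, the table holds only a few of the entries from the example, unknown segments are assumed to map to themselves with probability 1, and the unlabeled 0.02 entry for d is assumed to be the empty string (which is consistent with the 0.00227 score for aim).

```python
from itertools import product


def segmentations(word):
    """Yield every split of word into contiguous, non-empty segments."""
    if not word:
        yield []
        return
    for cut in range(1, len(word) + 1):
        for rest in segmentations(word[cut:]):
            yield [word[:cut]] + rest


def candidates(segments, mapping):
    """Combine per-segment replacements; a candidate's score is the product
    of the probabilities of the replacements used to build it."""
    options = [list(mapping.get(seg, {seg: 1.0}).items()) for seg in segments]
    scored = {}
    for combo in product(*options):
        word = "".join(replacement for replacement, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        # Keep the best score if two paths produce the same word.
        scored[word] = max(prob, scored.get(word, 0.0))
    return scored


# Subset of the example's mapping table; "" for d is an assumption.
MAPPING = {
    "d": {"d": 0.8, "h": 0.1, "cl": 0.08, "": 0.02},
    "a": {"a": 0.9, "o": 0.05},
    "i": {"i": 0.84, "l": 0.12},
    "rn": {"rn": 0.7, "m": 0.15},
}

scores = candidates(["d", "a", "i", "rn"], MAPPING)
```

A five-letter word has 2^4 = 16 segmentations, matching the list in the example, and `scores["claim"]` works out to 0.08 × 0.9 × 0.84 × 0.15, about 0.0091.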
Example (cont):
Word Level Model: find the frequency of occurrence of each generated word in the dictionary
  P ( dairn | dairn ) = 0.425     Freq ( dairn ) = 0
  P ( daim | dairn ) = 0.091      Freq ( daim ) = 0
  P ( claim | dairn ) = 0.0091    Freq ( claim ) = 1500
  P ( aim | dairn ) = 0.00227     Freq ( aim ) = 4000
  P ( horn | dairn ) = 0.00007    Freq ( horn ) = 150
Result: dairn → claim, because claim has the highest product of probability and frequency (0.0091 × 1500 ≈ 13.7, versus 0.00227 × 4000 ≈ 9.1 for aim)
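The final selection step can be reproduced with the numbers from the example. This is a minimal sketch with hypothetical names (`CHANNEL`, `FREQ`, `rank_corrections`); normalizing the frequencies over only these five words is an assumption that does not change the ranking, since a real unigram model would normalize over the whole dictionary.

```python
# Character-level scores and dictionary frequencies from the example.
CHANNEL = {"dairn": 0.425, "daim": 0.091, "claim": 0.0091,
           "aim": 0.00227, "horn": 0.00007}
FREQ = {"dairn": 0, "daim": 0, "claim": 1500, "aim": 4000, "horn": 150}


def rank_corrections(channel, freq):
    """Rank candidates by character-level score times unigram probability."""
    total = sum(freq.values())
    scores = {word: p * freq.get(word, 0) / total for word, p in channel.items()}
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_corrections(CHANNEL, FREQ)
best = ranking[0]
```

Candidates that do not occur in the dictionary get a score of zero here, so the highest character-level candidate, dairn, is ruled out and claim is selected.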
IR Experiments
• The degraded collections were corrected, and the best one, two, three, and five corrections for each word were picked for indexing
• The collections were indexed and searched using words, character 3-grams, character 4-grams, and lightly stemmed words
• Retrieval performance was tested for all combinations of index type and number of corrections
• The measure of merit is Mean Average Precision
• Significance testing was done using a t-test with a p-value threshold of 0.05
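For reference, Mean Average Precision can be computed as below. This is the standard textbook definition written as a minimal sketch; the function names and the toy document IDs in the usage note are hypothetical, and this is not the evaluation tool used in the experiments.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant
    document is retrieved, divided by the number of relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs,
    one pair per query."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)
```

For example, a ranking `["d1", "d2", "d3"]` with relevant set `{"d1", "d3"}` has average precision (1/1 + 2/3) / 2.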
Correction Results: (figure; panels: ZAD Collection, TREC Collection)
IR Results: “ZAD Collection” (figure; labels: Clean, Bad)
IR Results: “TREC Collection” (figure; labels: Clean, Bad)
Conclusion & future work:
• Although correction halved the word error rate (WER), the improvement in IR effectiveness was not statistically significant
• Using more than one correction does not help
• Indexing using n-grams (shorter index terms) is better than “moderate” error correction
• Effect of using an n-gram word LM on error correction
  • Magdy, W. and K. Darwish. “Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology.” In EMNLP 2006
• Effect of “good” error correction on improving retrieval effectiveness
Lnanh gon → Correction → Thank you