
Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Walid Magdy & Kareem Darwish, IBM Technology Development Center, PO Box 166, El-Ahram, Giza, Egypt. {wmagdy,darwishk}@eg.ibm.com


Presentation Transcript


  1. Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents. Walid Magdy & Kareem Darwish, IBM Technology Development Center, PO Box 166, El-Ahram, Giza, Egypt. {wmagdy,darwishk}@eg.ibm.com

  2. Outline: • Motivation • Background • Approach • Experimental Setup • Results • Conclusion • Future Work

  3. Motivation:
     • Timeline (1400 to 2000), with labels: first printing press; e-text becomes commonplace; 1998: Arabic e-text comes online; "read to search" versus automated full-text search
     • Problem: 500+ years of legacy documents
     • Goal: to search printed documents efficiently and effectively
     • Does OCR solve the problem?

  4. Arabic Language Challenges
     • Orthography
       • Character shape depends on position
       • 15 of the 28 letters contain dots
       • Optional diacritics may be present
       • Printed text may include ligatures and kashida
     • Morphology
       • Prefix, infix, and suffix
       • 6 × 10^10 possible surface forms
       • Example: وسـيــكـتبونـهـا = wasaya+ktub+uunahaa (and will + write + they it) = "and they will write it"
     • Other factors
       • Eighth most widely spoken language in the world
       • Web growth started only recently

  5. Arabic Pre-processing & Retrieval
     • Pre-processing:
       • Remove diacritics
       • Normalize different forms of alef & ya to accommodate
         • Common spelling errors
         • Grammatical, morphological, and orthographic properties
         • أ ، آ ، إ ، ا , ؤ , ئ , ءا , and ى ، يي
     • Text Retrieval: Best Index Terms
       • Regular text: light stemming and character 3- & 4-grams are best
       • OCR text: character 3- & 4-grams are best
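The pre-processing and indexing steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `normalize` and `char_ngrams` are hypothetical, the diacritic range U+064B to U+0652 is an assumed definition of "diacritics", and only the alef and ya normalization is covered (the hamza forms listed on the slide are left untouched).

```python
import re

# Assumption: "diacritics" means the short-vowel marks U+064B..U+0652.
DIACRITICS = re.compile("[\u064B-\u0652]")

# Assumption: map alef variants to bare alef, and alef maqsura to ya.
NORMALIZATION = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0622": "\u0627",  # alef with madda       -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0649": "\u064A",  # alef maqsura          -> ya
})


def normalize(text):
    """Strip diacritics, then fold alef and ya variants."""
    return DIACRITICS.sub("", text).translate(NORMALIZATION)


def char_ngrams(word, n):
    """Overlapping character n-grams; words shorter than n are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


if __name__ == "__main__":
    token = normalize("\u0623\u064E\u062D\u0652\u0645\u064E\u062F")
    print(token, char_ngrams(token, 3))
```

Character n-grams are robust index terms for degraded text because a single misrecognized character damages only the few n-grams that overlap it, while the rest of the word still matches.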

  6. Main Idea: Image → OCR → Degraded Text → Correction → Corrected Text
     • Image: Word-Based Correction for Retrieval of Arabic OCR Degraded Documents
     • Degraded text: VVorcl-Easod Comectlon l0r Belrieval of Arahie OCR Dcgraclod Doeurnerits
     • Corrected text: Word-Based Correction for Retrieval of Arabic OCR Degraded Documents
     • We want to examine the effect of correction on retrieval

  7. Approach: OCR system → OCR Degraded Text → OCR Correction → OCR Corrected Text → Indexing → Ranked List of Documents

  8. Experimental Setup: • Test collections • Error Correction • Building Error Model • Training & Decoding • Experiments

  9. Document Collections:

  10. The ZAD Collection: Sample Document: Sample Query: حكم التيمم ومتى شرع

  11. The TREC 2002 CLIR Collection: Sample Document: <DOC> <DOCNO>19940513_AFP_ARB0001</DOCNO> <HEADER> ارا0800 4 ع 7710 قبرص /افب-تصج86 الشرق الاوسط/سلام/حكم ذاتي </HEADER> <BODY> <HEADLINE> &HT; العلم الفلسطيني لم يُرفع فوق كنيس اريحا </HEADLINE> <TEXT> <P> اريحا (الضفة الغربية) 31-5 (اف ب)- يقوم احد عناصر الشرطة الفلسطينية بحراسة مدخل الكنيس اليهودي في وسط اريحا احد آخر مواقع المدينة التي تم تسليمها الى الشرطة الفلسطينية الا انه لم يتم رفع العلم الفلسطيني فوق الكنيس </P> <P> وقال ضابط فلسطيني لفلسطينية كانت تحاول رفع العلم الفلسطيني فوق الكنيس "هذا مكان مقدس" </P> <P> وقبيل ذلك اقترب ثلاثة مستوطنين يهود من مدخل الكنيس الذي كان الجنود الاسرائيليون ما زالوا يوءمنون حراسته وعندما منعهم الجنود من الدخول قاموا بتمزيق ثيابهم </P> </TEXT> Sample Query: سجناء حرب ايرانيين وعراقيين

  12. OCR-Correction Model:
     • Training: OCR Degraded Text + Manually Corrected OCR Text → Aligning Characters Mapping → Build Error Model
     • Decoding: OCR Degraded Text → Generate Corrections → Pick the most likely correction using Bayes' rule → OCR Corrected Text

  13. Aligning Characters Mapping (S = substitution, I = insertion, D = deletion, √ = match):
     • 1:1 mapping, e.g. walid → vvaicl:
       w → v (S), Null → v (I), a → a (√), l → Null (D), i → i (√), d → c (S), Null → l (I)
     • m:n mapping, e.g. walid → vvaicl:
       w → vv (S), a → a (√), l → Null (D), i → i (√), d → cl (S)
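A 1:1 character alignment of this kind can be obtained with a standard edit-distance (Levenshtein) table and a backtrace. The sketch below is an assumption about how such an alignment might be computed, not the authors' aligner; the function name `align` is hypothetical, and an empty string stands in for Null. Several alignments can share the same minimum cost, so the pairs returned need not match the slide's alignment exactly.

```python
def align(clean, ocr):
    """Return (edit_distance, pairs) where pairs are (clean_char, ocr_char).

    An empty string in a pair stands for Null (insertion or deletion).
    """
    n, m = len(clean), len(ocr)
    # dist[i][j] = edit distance between clean[:i] and ocr[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = dist[i - 1][j - 1] + (clean[i - 1] != ocr[j - 1])
            dist[i][j] = min(substitute, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Walk back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dist[i][j] == dist[i - 1][j - 1] + (clean[i - 1] != ocr[j - 1])):
            pairs.append((clean[i - 1], ocr[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((clean[i - 1], ""))          # deletion
            i -= 1
        else:
            pairs.append(("", ocr[j - 1]))            # insertion
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


if __name__ == "__main__":
    cost, pairs = align("walid", "vvaicl")
    print(cost, pairs)
```

An m:n mapping such as w → vv could then be derived by merging an insertion or deletion with a neighboring substitution; that merging step is not shown here.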

  14. Building Error Model: (equation not preserved in the transcript), where CkCl and DxDy each denote a segment of one or more characters
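The slide's equation is missing from the transcript, so the following is only a plausible sketch: it assumes the error model is a relative-frequency (maximum likelihood) estimate of how often a clean segment is rendered as a given OCR segment in the aligned training data. The function name `build_error_model` and the toy pairs are hypothetical.

```python
from collections import Counter, defaultdict


def build_error_model(aligned_pairs):
    """Estimate P(ocr_segment | clean_segment) by relative frequency.

    aligned_pairs: iterable of (clean_segment, ocr_segment) tuples, for
    example the output of a character aligner run over training text.
    """
    aligned_pairs = list(aligned_pairs)
    pair_counts = Counter(aligned_pairs)
    clean_counts = Counter(clean for clean, _ in aligned_pairs)
    model = defaultdict(dict)
    for (clean, ocr), count in pair_counts.items():
        model[clean][ocr] = count / clean_counts[clean]
    return dict(model)


if __name__ == "__main__":
    # Toy training pairs, invented for illustration only.
    pairs = [("d", "d"), ("d", "d"), ("d", "d"), ("d", "cl"), ("m", "rn")]
    print(build_error_model(pairs))
```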

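  15. Decoding:
     • Bayes' rule: $\widehat{\mathrm{Word}}_{\mathrm{correct}} = \arg\max_{\mathrm{Word}_{\mathrm{correct}}} P(\mathrm{Word}_{\mathrm{OCR}} \mid \mathrm{Word}_{\mathrm{correct}}) \, P(\mathrm{Word}_{\mathrm{correct}})$
     • Character-level model: $P(\mathrm{Word}_{\mathrm{OCR}} \mid \mathrm{Word}_{\mathrm{correct}})$, computed from the character error model (the formula itself is not preserved in the transcript)
     • Word-level model: $P(\mathrm{Word}_{\mathrm{correct}})$ = LM probability (a simple unigram probability was used)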
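For reference, the decoding objective follows from Bayes' rule in the usual noisy-channel way. The derivation below is standard and is not taken from the slides; the hat notation for the selected correction is introduced here for illustration.

```latex
\begin{align}
\widehat{\mathrm{Word}}_{\mathrm{correct}}
  &= \arg\max_{\mathrm{Word}_{\mathrm{correct}}}
     P(\mathrm{Word}_{\mathrm{correct}} \mid \mathrm{Word}_{\mathrm{OCR}}) \\
  &= \arg\max_{\mathrm{Word}_{\mathrm{correct}}}
     \frac{P(\mathrm{Word}_{\mathrm{OCR}} \mid \mathrm{Word}_{\mathrm{correct}})\,
           P(\mathrm{Word}_{\mathrm{correct}})}{P(\mathrm{Word}_{\mathrm{OCR}})} \\
  &= \arg\max_{\mathrm{Word}_{\mathrm{correct}}}
     P(\mathrm{Word}_{\mathrm{OCR}} \mid \mathrm{Word}_{\mathrm{correct}})\,
     P(\mathrm{Word}_{\mathrm{correct}})
\end{align}
```

The denominator can be dropped because the observed OCR word is fixed while candidates are compared, so it does not affect which candidate attains the maximum.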

  16. Example:
     • Character Level Model: • Segmentation • Mapping • Generate Candidates
     • Ex: dairn
     • Segmentations: d a i r n, da i r n, d ai r n, dai r n, d a ir n, da ir n, d air n, dair n, d a i rn, da i rn, d ai rn, dai rn, d a irn, da irn, d airn, dairn
     • Mappings for the segmentation ε d ε a ε i ε rn ε ([?] marks a symbol lost in the transcript):
       • ε: l 0.09, i 0.05, li 0.02, s 0.015, f 0.005, t 0.005, a 0.005
       • d: d 0.8, h 0.1, cl 0.08, ε 0.02
       • a: a 0.9, o 0.05, r 0.02, oi 0.015, [?] 0.005, n 0.005, e 0.005
       • i: i 0.84, l 0.12, ε 0.02, t 0.015, ll 0.005, [?] 0.005
       • rn: rn 0.7, m 0.15, im 0.02, ln 0.015, [?] 0.005
     • Generated candidates: dairn 0.425, daim 0.091, claim 0.0091, aim 0.00227, horn 0.00007
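The segmentation and candidate-generation steps of this example can be sketched as below. This is an illustrative reading of the slide, not the authors' decoder: the names `segmentations`, `candidates`, and `MAPPINGS` are hypothetical, insertions are ignored, entries whose symbol is missing from the transcript are omitted, and keeping only the best-scoring derivation of each candidate is an assumption. Scoring a candidate as the product of its segment mapping probabilities is likewise an assumption, chosen because it approximately reproduces the candidate scores listed on the slide.

```python
from itertools import product


def segmentations(word):
    """All ways to split a word into contiguous segments: 2**(len(word)-1)."""
    results = []
    for mask in range(2 ** max(len(word) - 1, 0)):
        segments, start = [], 0
        for pos in range(1, len(word)):
            if (mask >> (pos - 1)) & 1:  # bit set: cut the word before pos
                segments.append(word[start:pos])
                start = pos
        segments.append(word[start:])
        results.append(segments)
    return results


# Mapping probabilities transcribed from the slide; "" stands for epsilon.
MAPPINGS = {
    "d": {"d": 0.8, "h": 0.1, "cl": 0.08, "": 0.02},
    "a": {"a": 0.9, "o": 0.05, "r": 0.02, "oi": 0.015, "n": 0.005, "e": 0.005},
    "i": {"i": 0.84, "l": 0.12, "": 0.02, "t": 0.015, "ll": 0.005},
    "rn": {"rn": 0.7, "m": 0.15, "im": 0.02, "ln": 0.015},
}


def candidates(segments, mappings):
    """Score every combination of segment mappings by the product of probabilities."""
    best = {}
    options = [mappings[segment].items() for segment in segments]
    for combo in product(*options):
        word = "".join(output for output, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        if prob > best.get(word, 0.0):  # keep the best derivation only
            best[word] = prob
    return best


if __name__ == "__main__":
    print(len(segmentations("dairn")))
    scored = candidates(["d", "a", "i", "rn"], MAPPINGS)
    for word in ("dairn", "daim", "claim", "aim", "horn"):
        print(word, scored[word])
```

A full decoder would repeat the candidate generation for every segmentation rather than only for d a i rn, and would also consider insertions at the ε positions.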

  17. Example (cont): Word Level Model: find the frequency of occurrence of each generated word in the dictionary
     • P ( dairn | dairn ) = 0.425, Freq ( dairn ) = 0
     • P ( daim | dairn ) = 0.091, Freq ( daim ) = 0
     • P ( claim | dairn ) = 0.0091, Freq ( claim ) = 1500
     • P ( aim | dairn ) = 0.00227, Freq ( aim ) = 4000
     • P ( horn | dairn ) = 0.00007, Freq ( horn ) = 150
     • Result: dairn → claim
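Combining the two models for this example amounts to multiplying each candidate's character-level score by its dictionary frequency and ranking the results. The sketch below uses the numbers from the slide; the function name `rank_corrections` is hypothetical, and using raw frequencies instead of normalized unigram probabilities is a simplification that leaves the ranking unchanged.

```python
# Character-level scores and dictionary frequencies, as given on the slide.
CHANNEL = {"dairn": 0.425, "daim": 0.091, "claim": 0.0091,
           "aim": 0.00227, "horn": 0.00007}
FREQ = {"dairn": 0, "daim": 0, "claim": 1500, "aim": 4000, "horn": 150}


def rank_corrections(channel, freq):
    """Rank candidates by channel score times (unnormalized) unigram frequency."""
    scores = {word: p * freq.get(word, 0) for word, p in channel.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    ranking = rank_corrections(CHANNEL, FREQ)
    print(ranking[0])   # the single best correction
    print(ranking[:3])  # the best three corrections
```

Here "claim" scores 0.0091 × 1500 = 13.65, ahead of "aim" at 0.00227 × 4000 = 9.08, which is why the example selects claim even though aim is the more frequent word. Taking a prefix of the ranking gives the best one, two, three, or five corrections per word.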

  18. IR Experiments
     • The degraded collections were corrected, and the best one, two, three, and five corrections were picked for each word to be indexed
     • The collections were indexed and searched using words, character 3-grams, character 4-grams, and lightly stemmed words
     • Retrieval performance was tested for all combinations of index type and number of corrections
     • The measure of merit is Mean Average Precision
     • Significance testing was done using a t-test with a p-value threshold of 0.05
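Mean Average Precision, the measure of merit named here, can be computed as in the following sketch, which uses the standard definition. The function names and the toy rankings are illustrative and are not taken from the experiments.

```python
def average_precision(ranked_docs, relevant):
    """Average of precision values at the rank of each relevant document found."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_docs, relevant_set) pairs, one per query."""
    return sum(average_precision(docs, rel) for docs, rel in runs) / len(runs)


if __name__ == "__main__":
    runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # toy query 1
        (["d2", "d1"], {"d1"}),                    # toy query 2
    ]
    print(mean_average_precision(runs))
```

Relevant documents that never appear in the ranked list still count in the denominator, so a run is penalized for missing them.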

  19. Correction Results (panels: ZAD Collection, TREC Collection)

  20. IR Results: “ZAD Collection” (chart labels: Clean, Bad)

  21. IR Results: “TREC Collection” (chart labels: Clean, Bad)

  22. Conclusion & future work:
     • Although WER was halved, IR effectiveness did not improve by a statistically significant margin
     • Using more than one correction does not help
     • Indexing using n-grams (shorter index terms) is better than “moderate” error correction
     • Effect of using an n-gram word LM on error correction
       • Magdy, W. and K. Darwish. Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology. In EMNLP 2006.
     • Effect of “good” error correction on retrieval effectiveness

  23. Lnanh gon → Correction → Thank you
