
Fusion of Multiple Corrupted Transmissions and its effect on Information Retrieval

This research explores the fusion of multiple corrupted transmissions to improve the effectiveness of information retrieval. It focuses on Arabic documents that are only available in print form and aims to transform them into electronic form for easier searching. The study compares different approaches and experimental setups to determine the impact of fusion on text quality and retrieval results. The findings show the promising potential of text fusion for improving information retrieval, and future work involves testing the technique on real degraded data from different sources.


Presentation Transcript


  1. Fusion of Multiple Corrupted Transmissions and its effect on Information Retrieval • Walid Magdy • Kareem Darwish • Mohsen Rashwan

  2. Outline • Motivation • Prior work • Fusion Definition • Approach • Experimental Setup • Results • Conclusion & Future work

  3. Motivation • Many Arabic documents are available only in print form. • The need to convert these documents into electronic form has grown since the end of the last century, because searching electronic text is much easier. • Arabic OCR accuracy is still much lower than the state of the art for other languages, such as English. • Degraded text produced by OCR systems reduces the effectiveness of information retrieval (IR). • Higher-quality text for Arabic documents is therefore needed to improve IR effectiveness.

  4. Prior Art: • Previous work on OCRed text focused on two main aspects: • Improving IR effectiveness without improving text quality. • Improving text quality, which in turn improves IR effectiveness. • Examples: • Query garbling based on a character error model. • OCR correction based on a character error model and a language model.

  5. Fusion Definition: • S0 is the clean version of the text; Si = S0 + εi is a degraded version of the text, where εi denotes noisy edit operations. • Previous approaches depend on the presence of only one source of degraded text: Simage → OCR → Sx = S0 + εx → Correction → S0’ = S0 + ε0’, with ε0’ < εx. • Our approach assumes the presence of more than one version of the degraded text: S1 = S0 + ε1, S2 = S0 + ε2, …, Sn = S0 + εn → Fusion → S0’ = S0 + ε0’, with ε0’ < min(ε1 … εn).

  6. Approach: • Image → OCR System1 and OCR System2 → degraded versions of the same text → Language Model → fused text. • Degraded versions: ولا ثدي إلا في ألمستدلالهبنور4 ولا حيا4 إلا في رضا4 / ولا ثدي إلا في ألمستدلاله بنور4 ولا حيا4 إلا في رضا4 • ولم هدء! إلم في الاستدلال بنوره ولم حياة ملأ في رضا5 • Language Model output: ولا ثدي إلافي الاستدلال بنوره ولا حياة إلا في رضا5
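The selection step in the approach slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the versions are already word-aligned, and it uses a toy bigram model with add-one smoothing and a Viterbi search. The slides do not state the language model order or the search strategy, and the names `train_bigram`, `log_prob`, and `fuse` are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams from tokenized training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed bigram log-probability log P(word | prev)."""
    vocab = len(unigrams) + 1  # +1 reserves probability mass for unseen words
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))

def fuse(versions, unigrams, bigrams):
    """Pick one candidate word per position so the sequence scores best under the LM."""
    n = len(versions[0])
    assert all(len(v) == n for v in versions), "versions must be word-aligned"
    # best[w] = (score, path) of the best sequence so far that ends in word w
    best = {"<s>": (0.0, [])}
    for i in range(n):
        candidates = {v[i] for v in versions}
        new_best = {}
        for w in candidates:
            score, path = max(
                ((s + log_prob(p, w, unigrams, bigrams), pth)
                 for p, (s, pth) in best.items()),
                key=lambda item: item[0],
            )
            new_best[w] = (score, path + [w])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Toy English data (assumed for illustration only).
unigrams, bigrams = train_bigram([["thank", "you", "very", "much"], ["thank", "you"]])
version_1 = ["thenk", "you", "very", "much"]
version_2 = ["thank", "yov", "very", "much"]
fused = fuse([version_1, version_2], unigrams, bigrams)
print(fused)  # ['thank', 'you', 'very', 'much']
```

Each version carries a different error, so the search can recover the clean sequence by taking one word from each source. If both versions were wrong at the same position, no selection could fix it.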

  7. Experimental Setup: • Only one OCR system was available: “Sakhr Automatic Reader v4”. • To obtain multiple sources for a given data set: • A few pages were selected at random from a book and OCRed, and the output text was manually corrected. • The degraded and clean text were used to build a character error model based on 1:1 character mapping. • The resulting model was then used to garble clean text at different character error rates (CERs). • The OCRed book used for testing was ZadAlma’ad, with the following specs: • Eight pages scanned at 300x300 dpi, containing 4,236 words, with a CER of 13.9% and a word error rate (WER) of 36.8%. • A clean version of the book was available in electronic form, consisting of 2,730 separate documents with an associated set of 25 topics and relevance judgments. • The language model (LM) was built from a web-mined collection of religious text by IbnTaymiya, the teacher of the author of ZadAlma’ad. • Mean average precision (MAP) was used as the figure of merit for IR results.
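The error rates and the 1:1 character error model mentioned in the setup can be sketched as below. This is a hedged illustration: it assumes CER and WER are edit distance divided by reference length, and that the error model is estimated only from equal-length clean/degraded word pairs. The slides do not give these details, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def cer(clean, degraded):
    """Character error rate: character edit distance over clean length."""
    return edit_distance(clean, degraded) / len(clean)

def wer(clean, degraded):
    """Word error rate: word edit distance over the number of clean words."""
    ref = clean.split()
    return edit_distance(ref, degraded.split()) / len(ref)

def build_error_model(word_pairs):
    """Estimate P(output char | clean char) from aligned (clean, degraded) words."""
    counts = defaultdict(Counter)
    for clean, degraded in word_pairs:
        if len(clean) != len(degraded):
            continue  # 1:1 mapping only: skip pairs with insertions or deletions
        for c, d in zip(clean, degraded):
            counts[c][d] += 1
    return {
        c: {d: n / sum(outs.values()) for d, n in outs.items()}
        for c, outs in counts.items()
    }

# Toy English data (assumed for illustration only).
print(cer("the cat sat", "the cot sat"))  # 1 substitution over 11 characters
print(wer("the cat sat", "the cot sat"))  # 1 wrong word over 3 words
model = build_error_model([("cat", "cot"), ("cab", "cab")])
print(model["a"])  # {'o': 0.5, 'a': 0.5}
```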

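  8. Experimental Setup: Generating Synthetic Garbled Data • Character error model for the clean character “ق”: ق 0.8, ف 0.1, ت 0.05, ن 0.05. • The garbler lays these probabilities out as cumulative intervals with boundaries 0.0, 0.8, 0.9, 0.95, 1 (for ق, ف, ت, ن in that order). • A random number is generated for each character, and the interval it falls in selects the output character. • Example: for the clean word “قنبلة”, the random number 0.921 falls in the interval for ت, so the garbled word becomes تـنـبـلـة.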
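The sampling step on this slide can be sketched as follows. It uses the probabilities shown for “ق” and leaves any character without a model entry unchanged, which is an assumption made for this example. `sample_char` and `garble` are hypothetical names.

```python
import random
from bisect import bisect_right
from itertools import accumulate

# Error model from the slide: clean character -> (output character, probability).
ERROR_MODEL = {"ق": [("ق", 0.8), ("ف", 0.1), ("ت", 0.05), ("ن", 0.05)]}

def sample_char(char, model, u):
    """Map a clean character to an output character using a random number u in [0, 1)."""
    outcomes = model.get(char)
    if outcomes is None:
        return char  # assumption: characters without a model entry are kept as-is
    chars = [c for c, _ in outcomes]
    bounds = list(accumulate(p for _, p in outcomes))  # about 0.8, 0.9, 0.95, 1.0
    idx = bisect_right(bounds, u)
    return chars[min(idx, len(chars) - 1)]  # clamp guards against float rounding

def garble(word, model, rng):
    """Garble a word character by character with independent random draws."""
    return "".join(sample_char(c, model, rng.random()) for c in word)

# The slide's example: 0.921 lies between 0.9 and 0.95, which selects "ت".
print(sample_char("ق", ERROR_MODEL, 0.921))
print(garble("قنبلة", ERROR_MODEL, random.Random(0)))
```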

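  9. Experimental Setup: Generating Synthetic Garbled Data • The error model is scaled by a factor k = CERnew / CERorg to generate versions with different error rates. • Cumulative interval boundaries for ق, ف, ت, ن: • Original model: 0.0, 0.8, 0.9, 0.95, 1. • k = 2: 0.0, 0.6, 0.8, 0.9, 1. • k = 0.5: 0.0, 0.9, 0.95, 0.975, 1.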
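One reading that reproduces the numbers on this slide is to multiply every error probability by k and give the remaining mass to the correct character. The sketch below implements that reading. It is an inference from the slide's values rather than a documented procedure, and `scale_error_model` is a hypothetical name.

```python
def scale_error_model(outcomes, correct_char, k):
    """Multiply each error probability by k; the correct character gets the remainder."""
    errors = {c: p * k for c, p in outcomes.items() if c != correct_char}
    total_error = sum(errors.values())
    if total_error > 1.0:
        raise ValueError("k is too large: total error probability exceeds 1")
    scaled = {correct_char: 1.0 - total_error}
    scaled.update(errors)
    return scaled

base = {"ق": 0.8, "ف": 0.1, "ت": 0.05, "ن": 0.05}
doubled = scale_error_model(base, "ق", 2)    # correct-character mass drops to about 0.6
halved = scale_error_model(base, "ق", 0.5)   # correct-character mass rises to about 0.9
print(doubled)
print(halved)
```

With k = 2 the cumulative boundaries become 0.6, 0.8, 0.9, 1, and with k = 0.5 they become 0.9, 0.95, 0.975, 1, matching the values on the slide.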

  10. Experimental Setup: Generated Versions • [Figure: error rates for the generated versions, Model-1 to Model-5] • [Figure: retrieval results for the generated versions]

  11. Results: Fusion Results • [Figure: WER after fusion of both versions, and common errors between versions] • WER of the output text from the fusion process between pairs of versions.
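The "common errors" quantity on this slide can be illustrated with a short sketch. It assumes word-aligned versions of equal length and counts positions where both versions are wrong. When fusion only selects among candidate words, those positions cannot be corrected, so the common error rate is a floor on the WER after fusing that pair. `error_overlap` is a hypothetical name.

```python
def error_overlap(clean, version_a, version_b):
    """Return (WER of A, WER of B, rate of positions where both are wrong)."""
    n = len(clean)
    assert len(version_a) == n and len(version_b) == n, "inputs must be word-aligned"
    err_a = [a != c for a, c in zip(version_a, clean)]
    err_b = [b != c for b, c in zip(version_b, clean)]
    common = sum(ea and eb for ea, eb in zip(err_a, err_b))
    return sum(err_a) / n, sum(err_b) / n, common / n

# Toy data (assumed for illustration only).
clean = "a b c d".split()
version_a = "a x c y".split()
version_b = "z b c y".split()
print(error_overlap(clean, version_a, version_b))  # (0.5, 0.5, 0.25)
```

In this toy case each version has a WER of 50%, but only the last word is wrong in both, so selection-based fusion could at best reach a WER of 25%.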

  12. Results: Retrieval Results • [Figure: MAP of searching the different fused models. Hatched bars indicate retrieval results that are statistically significantly better than those of the original degraded versions.]
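For readers unfamiliar with the figure of merit, mean average precision can be computed as in the sketch below. It uses the standard definition, with average precision normalized by the number of relevant documents per topic. The document IDs and topic names are made up for the example, and the slides do not say which evaluation tool was used.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision values at each rank where a relevant document appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs, qrels):
    """Mean of average precision over all topics that have relevance judgments."""
    return sum(average_precision(runs[t], qrels[t]) for t in qrels) / len(qrels)

# Hypothetical rankings and judgments for two topics.
runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d5", "d6"]}
qrels = {"t1": ["d1", "d3"], "t2": ["d6"]}
print(mean_average_precision(runs, qrels))  # (5/6 + 1/2) / 2, about 0.667
```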

  13. Conclusion & Future Work: • Text fusion proved to be an effective method for selecting the proper word among candidate words coming from different sources. • The effectiveness of text fusion in reducing WER depends on the percentage of error overlap among the different versions. • The improvement in information retrieval resulting from text fusion was found to be promising, especially for the few output versions that are statistically indistinguishable from the clean version. • As future work, the fusion technique needs to be tested on real degraded data coming from different sources, which will introduce a new challenge: word alignment among the different sources.

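  14. بزاكم الته خيرا • جزاكم الله خيرا • جزاكم الله خبرا (closing slide: “May God reward you with good”, shown as one correct and two corrupted versions)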
