
Overview of RISOT: Retrieval of Indic Script OCR’d Text

RISOT evaluates the effectiveness of IR for retrieving machine-printed text in Indic scripts and supports experimentation and collaboration between IR and OCR researchers. The RISOT 2011 dataset consists of Bengali newspaper articles; RISOT 2012 adds a Devanagari (Hindi) dataset.


Presentation Transcript


  1. Overview of RISOT: Retrieval of Indic Script OCR’d Text • Utpal Garain, Indian Statistical Institute, Kolkata • Tamaltaru Pal, Indian Statistical Institute, Kolkata • Jiaul Paik, Indian Statistical Institute, Kolkata • Kripa Ghosh, Indian Statistical Institute, Kolkata • David Doermann, University of Maryland, College Park, USA • Douglas W. Oard, University of Maryland, College Park, USA

  2. Task • Evaluate retrieval of automatically recognized text from machine-printed documents • Goals • Support experimentation with retrieval from printed documents • Evaluate IR effectiveness for retrieval based on Indic script OCR • Provide a venue where IR and OCR researchers can work together

  3. RISOT 2011 • Bengali newspaper articles • About half of the FIRE 2008/2010 collection • 62,875 documents, each provided as • Text • Rendered image • OCR’d text • 66 topics

  4. RISOT 2011 • Two teams participated • Techniques • OCR error modeling • Query-time stemming • Best absolute OCR results came from stemming + error modeling • 83% of the TEXT MAP for TD queries • Best same-team relative MAP: 90% of TEXT • 88% for P@10
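The percentages on this slide compare an OCR run against a run on the clean TEXT collection using MAP and P@10. A minimal sketch of how such a relative score can be computed is given below. The document ids, rankings, and relevance judgments are invented toy data, and the function names are illustrative only; they are not taken from the RISOT evaluation scripts.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against the relevant doc ids."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant)


def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def mean_average_precision(run: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """MAP over all topics in qrels; a topic missing from the run scores 0."""
    return sum(average_precision(run.get(topic, []), rel)
               for topic, rel in qrels.items()) / len(qrels)


# Toy data (hypothetical): two topics, one run on clean text, one on OCR'd text.
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
text_run = {"q1": ["d1", "d3", "d5"], "q2": ["d2", "d4"]}
ocr_run = {"q1": ["d1", "d5", "d3"], "q2": ["d4", "d2"]}

text_map = mean_average_precision(text_run, qrels)
ocr_map = mean_average_precision(ocr_run, qrels)
relative_map = ocr_map / text_map  # the "x% of TEXT" figure for this toy data
print(f"OCR MAP is {100 * relative_map:.1f}% of TEXT MAP")
```

The same ratio can be taken for P@10 by averaging `precision_at_k` over topics for each run.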

  5. Further experiments on RISOT 2011 Data • N-gram statistics were used • Stemming beats words or n-grams • Statistically significant improvement over words for T and TD queries, on clean and OCR’d text, with and without the error model
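The slide does not say whether the n-grams were taken over characters or words. Assuming character n-grams, a common choice for noisy text, the sketch below shows why they can help with OCR errors: a corrupted word still shares most of its n-grams with the correct form. The example words are the ones used on the "RISOT Runs" slide; the function names and the choice of n = 4 are illustrative assumptions.

```python
from typing import List


def char_ngrams(word: str, n: int = 4) -> List[str]:
    """Overlapping character n-grams; words no longer than n are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def dice_overlap(a: str, b: str, n: int = 4) -> float:
    """Dice coefficient between the n-gram sets of two words."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


clean = char_ngrams("tobacco")    # n-grams of the correct word
noisy = char_ngrams("1obacco")    # n-grams of an OCR-corrupted form
shared = set(clean) & set(noisy)  # index terms both forms still have in common
print(sorted(shared), dice_overlap("tobacco", "1obacco"))
```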

  6. CLIR • English queries, Bengali collection (OCR’d) • Dictionary-based translation • Transliteration of out-of-vocabulary (OOV) terms • Additional resources • Stemming • OCR error modeling
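The translation step listed on this slide can be sketched as a dictionary lookup with a transliteration fallback for terms the dictionary does not cover. Everything concrete here is an assumption made for illustration: the two-entry dictionary, the placeholder transliterator, and the function name `translate_query` are hypothetical and do not reproduce the resources used in the experiments.

```python
from typing import Callable, Dict, List


def translate_query(query: str,
                    dictionary: Dict[str, List[str]],
                    transliterate: Callable[[str], str]) -> List[str]:
    """Translate each query term; fall back to transliteration for OOV terms."""
    translated: List[str] = []
    for term in query.lower().split():
        if term in dictionary:
            # Keep every dictionary translation of the term.
            translated.extend(dictionary[term])
        else:
            # Out-of-vocabulary term (often a name): transliterate instead.
            translated.append(transliterate(term))
    return translated


# Hypothetical toy English-to-Bengali dictionary.
toy_dictionary = {"tobacco": ["তামাক"], "price": ["দাম", "মূল্য"]}


def placeholder_transliterate(term: str) -> str:
    """Stand-in for a real transliteration component; it only tags the term."""
    return f"<translit:{term}>"


bengali_query = translate_query("Tobacco price Kolkata",
                                toy_dictionary,
                                placeholder_transliterate)
print(bengali_query)
```

In a full pipeline the translated terms would then be stemmed and expanded with OCR error variants before retrieval, as the remaining bullets on the slide suggest.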

  7. CLIR Results

  8. Additions in 2012 • Devanagari (Hindi) Dataset • 94,432 articles from two newspapers • Subset of FIRE data • Text • Rendered image • OCR’d text • 28 topics • Tasks • OCR Post-processing • Retrieval from Bengali OCR’d text • Retrieval from Devanagari (Hindi) OCR’d Text

  9. RISOT Runs • One team participated • ISI team • Kripabandhu Ghosh and Anirban Chakraborty • Method • Did not use the previous OCR error modeling technique • Assumed that clean text is not available • Co-occurrence-based synonym searching • e.g., tobacc, 1obacco, etc. are treated as synonyms of tobacco
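The slide gives no details of the co-occurrence method, so the following is only one plausible reading, not the ISI team's implementation: a vocabulary term is treated as an OCR variant of a query term when the two strings are similar and they occur alongside similar words in the OCR'd collection. The toy documents, both thresholds, and the function names are assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from typing import Dict, List, Set


def cooccurrence_contexts(docs: List[str]) -> Dict[str, Set[str]]:
    """Map each term to the set of other terms sharing a document with it."""
    contexts: Dict[str, Set[str]] = defaultdict(set)
    for doc in docs:
        terms = set(doc.split())
        for term in terms:
            contexts[term] |= terms - {term}
    return contexts


def ocr_variants(term: str,
                 contexts: Dict[str, Set[str]],
                 min_string_sim: float = 0.8,
                 min_context_overlap: float = 0.3) -> List[str]:
    """Terms that look like `term` and co-occur with similar words."""
    target = contexts.get(term, set())
    variants = []
    for candidate, ctx in contexts.items():
        if candidate == term:
            continue
        string_sim = SequenceMatcher(None, term, candidate).ratio()
        union = target | ctx
        overlap = len(target & ctx) / len(union) if union else 0.0  # Jaccard
        if string_sim >= min_string_sim and overlap >= min_context_overlap:
            variants.append(candidate)
    return sorted(variants)


# Hypothetical OCR'd documents; no clean text is consulted.
toy_docs = [
    "tobacco price rises in market",
    "1obacco price falls in market",
    "tobacc price rises again",
    "cricket match in kolkata",
]
variants = ocr_variants("tobacco", cooccurrence_contexts(toy_docs))
print(variants)  # candidate terms to add to the query alongside "tobacco"
```

Requiring both signals is a design choice in this sketch: string similarity alone would also match unrelated look-alike words, while co-occurrence alone would pull in ordinary topical neighbours such as "price".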

  10. RISOT Results • OCR error modeling gave better improvement

  11. RISOT Future • Next RISOT will introduce image degradation • Module of OCRopus • LAMP, UMD tool • How to attract more teams • Involvement of OCR consortium • Better OCR • Better error modeling • Summer code projects • Once in two years
