120 likes | 134 Views
Evaluate the effectiveness of IR for retrieving machine-printed text in Indic scripts. Support experimentation and collaboration between IR and OCR researchers. RISOT 2011 dataset includes Bengali newspaper articles. RISOT 2012 includes Devanagari (Hindi) dataset.
E N D
Overview of RISOT:Retrieval of Indic Script OCR’d Text Utpal Garain Indian Statistical Institute, Kolkata Tamaltaru Pal Indian Statistical Institute, Kolkata Jiaul Paik Indian Statistical Institute, Kolkata Kripa Ghosh Indian Statistical Institute, Kolkata David Doermann University of Maryland, College Park, USA Douglas W. Oard University of Maryland, College Park, USA
Task • Evaluate retrieval of automatically recognized text from machine printed text • Goals • Support experimentation of retrieval from printed documents • Evaluate IR effectiveness for retrieval based on Indic script OCR • Provide venue where IR and OCR researchers can work together
RISOT 2011 • Bengali newspaper articles • About half the FIRE 2008/2010 collection • 62,875 documents • Text • Rendered image • OCR’d text • 66 topics
RISOT 2011 • Two teams participated • Techniques • OCR error modeling • Query time stemming • Best absolute OCR results resulted from stemming + error modeling • 83% the TEXT MAP for TD queries • Best same-team relative MAP 90% of TEXT • 88% for P@10
Further experiments on RISOT 2011 Data • N-gram statistics were used • Stemming beats words or n-grams • Statistically significant improvement over words for T and TD; Clean and OCR; w/ and w/o error model
CLIR • English query Bengali collection (OCR’d) • Dictionary based translation • Transliteration of OOVs • Additional resources • Stemming • OCR error modeling
Addition in 2012 • Devanagari (Hindi) Dataset • 94,432 articles from two newspaper • Subset of FIRE data • Text • Rendered image • OCR’d • 28 topics • Tasks • OCR Post-processing • Retrieval from Bengali OCR’d text • Retrieval from Devanagari (Hindi) OCR’d Text
RISOT Runs • One team participated • ISI team • KripabandhuGhosh and AnirbanChakraborty • Method • Did not use previous OCR error modeling technique • Assumed that clean text is not available • Co-occurrence based synonym searching • tobacc, 1obacco, etc. are synonyms of tobacco
RISOT Results • OCR error modeling gave better improvement
RISOT Future • Next RISOT will introduce image degradation • Module of OCRopus • LAMP, UMD tool • How to attract more teams • Involvement of OCR consortium • Better OCR • Better error modeling • Summer code projects • Once in two years