Robust Recognition of Documents by Fusing Results of Word Clusters. Venkat Rasagna¹, Anand Kumar¹, C. V. Jawahar¹, R. Manmatha². ¹Center for Visual Information Technology, IIIT-Hyderabad; ²Center for Intelligent Information Retrieval, UMass Amherst.
Introduction • Recognition of books and collections. • Recognition of words is crucial to information retrieval. • Dictionaries and post-processors are not feasible for many languages.
Motivation • Most (Indian language) OCRs recognize glyphs (components) and generate text from the class labels. • Word accuracies are far lower than component accuracies. • Word accuracy decreases as the number of components in the word increases. • Using a language model for post-processing is challenging: • High entropy, large vocabulary (e.g. Telugu). • Language processing modules are still emerging. • Example: component accuracy = 9 / 12 = 75%, word accuracy = 25%. • Is it possible to make use of multiple occurrences of the same word to improve OCR performance? [Figure: recognize-and-parse example; plot of word accuracy and component accuracy (0 to 100) against average word length, measured in number of components.]
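The gap between the two accuracies in the example can be reproduced with a few lines of arithmetic. The sketch below assumes a hypothetical layout of four words with three components each, chosen only so that the totals match the slide (9 of 12 components correct); the slide does not give the actual per-word breakdown.

```python
# Hypothetical per-component OCR outcomes (True = recognised correctly).
# The layout is an assumption chosen to match the slide's totals.
words = [
    [True, True, True],    # every component correct, so the word is correct
    [True, True, False],   # one wrong component makes the whole word wrong
    [True, False, True],
    [False, True, True],
]


def component_accuracy(words):
    """Fraction of components recognised correctly across all words."""
    total = sum(len(w) for w in words)
    correct = sum(sum(w) for w in words)
    return correct / total


def word_accuracy(words):
    """Fraction of words in which every component is correct."""
    return sum(all(w) for w in words) / len(words)


print(component_accuracy(words))  # 0.75
print(word_accuracy(words))       # 0.25
```

A single wrong component is enough to invalidate a word, which is why longer words suffer more.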
Overview • Multiple occurrences of a word. • Words are degraded independently. • OCR output is different for the word at different instances. [Figure: each occurrence of a word is passed through OCR, giving differing OCR outputs; the goal is to go from the cluster to the text.]
Related Work • Character recognition in Indian languages is still an unsolved problem. • Telugu is one of the most complex scripts. • Recognition of a book has received some attention recently. • Word images are efficiently matched for retrieval. • Use of word image clusters to improve OCR accuracy. [Figure labels: Malayalam, Bangla, Hindi, Tamil.] References: A. Negi et al., ICDAR, 2001; C. V. Jawahar et al., ICDAR, 2003; K. S. Sesh Kumar et al., ICDAR, 2007; H. Tao, J. Hull, Document Analysis and Information Retrieval, 1995; U. Pal, B. Chaudhuri, Pattern Recognition, 2004; T. M. Rath et al., IJDAR, 2007; T. M. Rath et al., CVPR, 2003; Anand Kumar et al., ACCV, 2007; P. Xiu and H. S. Baird, DRR XV, 2008; N. V. Neeba, C. V. Jawahar, ICPR, 2008.
Conventional Recognition Process vs. Proposed Recognition Process [Figure: two flowcharts. Box labels: Scanned Images; Segmentation and Word Detection; Preprocessing; Word Grouping (Clustering); Feature Extraction; Classification; Word Groups; Word-Level Feature Extraction; Recognizer; Grouping; Text (UNICODE); Combining OCR Results.]
Locality Sensitive Hashing (LSH) • LSH goal: "r-near neighbour": for any query q, return a point p ∈ P such that ||p − q|| ≤ r (if it exists). • LSH has been used for: • Data mining (Taher H. Haveliwala, Aristides Gionis, Piotr Indyk, WebDB, 2000) • Information retrieval (A. Andoni, M. Datar, N. Immorlica, V. Mirrokni, Piotr Indyk, 2006) • Document image search (Anand Kumar, C. V. Jawahar, R. Manmatha, ACCV, 2007)
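To make the r-near-neighbour goal concrete, here is a minimal sketch of one common LSH construction for Euclidean distance (random projections quantised into cells of width w). The slides do not say which hash family was used, so the functions `make_hash` and `r_near_neighbour`, the parameters `dim`, `k`, `w`, and the toy points are all illustrative assumptions.

```python
import math
import random


def make_hash(dim, k, w, rng):
    """Build one composite hash g(v) = (h_1(v), ..., h_k(v)).

    Each h(v) = floor((a . v + b) / w), with a drawn from a Gaussian and
    b uniform in [0, w). All parameters are illustrative assumptions.
    """
    params = [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
        for _ in range(k)
    ]

    def g(v):
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)
            for a, b in params
        )

    return g


def r_near_neighbour(query, points, tables, hashes, r):
    """Return a stored point p with ||p - query|| <= r, or None."""
    for table, g in zip(tables, hashes):
        for idx in table.get(g(query), []):
            # Only points sharing a bucket with the query are checked.
            if math.dist(points[idx], query) <= r:
                return points[idx]
    return None


rng = random.Random(0)
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]   # toy feature vectors
hashes = [make_hash(dim=2, k=2, w=4.0, rng=rng) for _ in range(8)]

# One hash table per composite hash function.
tables = []
for g in hashes:
    table = {}
    for idx, p in enumerate(points):
        table.setdefault(g(p), []).append(idx)
    tables.append(table)

print(r_near_neighbour((0.0, 0.0), points, tables, hashes, r=0.5))
```

The lookup is approximate: a true neighbour can be missed if it never shares a bucket with the query, which is why several tables are used.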
Character Majority Voting • Algorithm [TODO] [Figure labels: Word Cluster; OCR output; Components; Final Output.]
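The slide leaves the algorithm unspecified, so the following is only a plausible sketch of position-wise majority voting, not the authors' procedure. It assumes each OCR output is a sequence of component labels (Latin letters stand in for them here) and that outputs whose length differs from the most common length are simply left out of the vote.

```python
from collections import Counter


def character_majority_vote(ocr_outputs):
    """Position-wise majority vote over the OCR outputs of one word cluster.

    Assumption: outputs whose length differs from the most common length
    cannot be compared position by position and are ignored.
    """
    lengths = Counter(len(o) for o in ocr_outputs)
    target_len = lengths.most_common(1)[0][0]
    voters = [o for o in ocr_outputs if len(o) == target_len]
    # zip(*voters) yields, for each position, the labels proposed there.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*voters))


cluster = ["recognition", "recoqnition", "rec0gnition", "recognit1on"]
print(character_majority_vote(cluster))  # recognition
```

Because the errors fall at different positions in different instances, each position is outvoted by the instances that got it right.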
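Dynamic Programming • Voting for word 1 after aligning. • Alignment by dynamic programming [1,2]. [Figure: DTW output for word 1 and CMV output for word 1, shown with the alignment; the output strings did not survive extraction.]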
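A rough sketch of voting after alignment is given below. It uses a plain edit-distance alignment over OCR label strings as a stand-in for the DTW alignment named on the slide, and it keeps the length of the reference word fixed, so it cannot repair cuts or merges. The function names `align` and `vote_after_alignment` are hypothetical.

```python
from collections import Counter


def align(a, b):
    """Align sequences a and b by edit-distance dynamic programming.

    Returns a list of (x, y) pairs; None marks a gap on that side.
    """
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match or substitute
                cost[i - 1][j] + 1,                           # gap in b
                cost[i][j - 1] + 1,                           # gap in a
            )
    # Trace back from the bottom-right corner to recover the alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def vote_after_alignment(reference, others):
    """Vote on each component of reference using the components aligned to it."""
    votes = [Counter({c: 1}) for c in reference]  # the reference votes for itself
    for other in others:
        pos = 0
        for x, y in align(reference, other):
            if x is None:
                continue  # component present only in `other`: ignored in this sketch
            if y is not None:
                votes[pos][y] += 1
            pos += 1
    return "".join(v.most_common(1)[0][0] for v in votes)


print(vote_after_alignment("recoqnition", ["recognition", "recogniton", "recognition"]))
# recognition
```

Unlike position-wise voting, this lets a cluster member with a missing component ("recogniton") still contribute votes at the positions it does align with.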
Results • The word generation process makes correct annotations available for evaluating performance. • 5000 clusters • 20 variations • Degraded dataset
Results • Word accuracy vs. number of words • Adding more words makes the data set more ambiguous. • Algorithm performance increases with the number of words, then saturates. • Word accuracy vs. word length • Word accuracy decreases as the word length increases. • Using the cluster information helps in achieving good word accuracies.
Results • For a small increase in component accuracy, there is a large improvement in word accuracy. • The improvement is high for long words. • Relative improvement of 12% for words which occur at least twice.
Analysis • Cuts and merges • CMV vs. DTW • Wrong word in the cluster • Cases that can't be handled
Conclusion & Future Work • A new framework has been proposed for OCRing books. • A word recognition technique that uses document constraints is shown. • An efficient clustering algorithm is used to speed up the process. • Word-level accuracy is improved from 70.37% to 79.12%. • This technique can also be used for other languages. • Future work: extend the approach to handle unique words by creating clusters over parts of words.
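As a quick check on the reported word-level accuracies, the absolute and relative gains can be computed directly. The relative figure comes out near the 12% quoted in the Results, although the slides do not state that the two numbers refer to the same measurement.

```python
baseline = 70.37   # word-level accuracy before fusion (%), from the slide
fused = 79.12      # word-level accuracy after fusion (%), from the slide

absolute_gain = fused - baseline                 # in percentage points
relative_gain = absolute_gain / baseline * 100   # relative to the baseline (%)

print(round(absolute_gain, 2))  # 8.75
print(round(relative_gain, 2))  # 12.43
```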
LSH Algorithm
Algorithm: Word Image Clustering
Require: Word images Wj and features Fj, j = 1,...,n
Ensure: Word image clusters O
for each i = 1,...,l do
    for each j = 1,...,n do
        Compute hash bucket I = gi(Fj)
        Store word image Wj on bucket I of hash table Ti
    end for
end for
k = 1
for each i = 1,...,n and Wi unmarked do
    Query hash table for word Wi to get cluster Ok
    Mark word Wi with k
    k = k + 1
end for
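The clustering loop above might look as follows in Python. This is a sketch under several assumptions that the slide does not spell out: candidates are the union of the query's buckets over all tables, they are kept only if they lie within a distance `radius` of the query, and every retrieved member (not only the query word) is marked. The deterministic `shifted_grid_hash` is a simple stand-in for the hash functions gi, and the feature vectors are toy values.

```python
import math


def cluster_word_images(features, hash_functions, radius):
    """Group word images by hashing their feature vectors.

    Returns a list of clusters, each a sorted list of word indices.
    """
    # Build one hash table T_i per hash function g_i.
    tables = []
    for g in hash_functions:
        table = {}
        for j, f in enumerate(features):
            table.setdefault(g(f), []).append(j)
        tables.append(table)

    label = [None] * len(features)
    clusters = []
    for i, f in enumerate(features):
        if label[i] is not None:
            continue  # already assigned to an earlier cluster
        # Query every table with word i and pool the bucket contents.
        candidates = set()
        for g, table in zip(hash_functions, tables):
            candidates.update(table[g(f)])
        members = sorted(
            j for j in candidates
            if label[j] is None and math.dist(features[j], f) <= radius
        )
        for j in members:
            label[j] = len(clusters)  # mark with the cluster number
        clusters.append(members)
    return clusters


def shifted_grid_hash(cell, offset):
    """Hypothetical stand-in for g_i: quantise onto a shifted grid."""
    def g(f):
        return tuple(math.floor((x + offset) / cell) for x in f)
    return g


features = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.1, 4.9), (0.1, 0.2)]
hashes = [shifted_grid_hash(cell=1.0, offset=o) for o in (0.0, 0.5)]
print(cluster_word_images(features, hashes, radius=0.5))
# [[0, 1, 4], [2, 3]]
```

In the toy data, words 2 and 3 fall into different cells of the first grid and are only brought together by the second, shifted grid, which illustrates why several hash tables are used.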
Word Error Correction
Algorithm: Word Error Correction
Require: Cluster C of words Wi, i = 1,...,n
Ensure: Clusters O of correct words
for each i = 1,...,n do
    for each j = 1,...,n do
        if j != i then
            Align word Wi and Wj
            Record errors Ek, k = 1,...,m in Wi
            Record possible corrections Gk for Ek
        end if
    end for
    Correct Ek if probability pk of correction Gk is maximum
    O ← O ∪ Wi
end for
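One way to read the correction loop is sketched below. Alignment is done with the standard-library `difflib.SequenceMatcher` as a convenient stand-in (the slides use dynamic programming), the "probability" of a correction is approximated by how often a label is proposed at that position, and unequal-length differences (cuts and merges) are skipped. The helper names `correct_word` and `correct_cluster` are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def correct_word(word, others):
    """Correct `word` using the other OCR outputs in its cluster."""
    votes = [Counter({c: 1}) for c in word]  # each label starts with its own vote
    for other in others:
        matcher = SequenceMatcher(None, word, other, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Use only spans that line up one-to-one; insertions and
            # deletions are not handled in this sketch.
            if tag in ("equal", "replace") and (i2 - i1) == (j2 - j1):
                for i, j in zip(range(i1, i2), range(j1, j2)):
                    votes[i][other[j]] += 1
    corrected = []
    for c, v in zip(word, votes):
        best, count = v.most_common(1)[0]
        # Replace a label only when an alternative is strictly more frequent.
        corrected.append(best if count > v[c] else c)
    return "".join(corrected)


def correct_cluster(cluster):
    """Apply the correction to every word of the cluster in turn."""
    return [
        correct_word(w, cluster[:i] + cluster[i + 1:])
        for i, w in enumerate(cluster)
    ]


cluster = ["recoqnition", "recognition", "rec0gnition", "recognition"]
print(correct_cluster(cluster))
```

Keeping the original label on ties is a conservative choice: with only two differing instances there is no evidence about which one is wrong.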
Dataset • 5000 clusters, each with 20 images of the same word at different font sizes and resolutions. • Words were generated using ImageMagick. • Words were degraded with the Kanungo degradation model to approximate real data. • The SF1, SF2, SF3 and SF4 datasets were degraded with 0, 10, 20 and 30% noise, respectively.
[Figure: system pipeline. Box labels: Pre-processing; Segmentation and word detection; Word image; Feature Extraction; Hashing; Hashed Words; Cluster of words; OCR; OCR output; Fusion (Method 1 / Method 2); Text.]