Robust Recognition of Documents by Fusing Results of Word Clusters

Venkat Rasagna1, Anand Kumar1, C. V. Jawahar1, R. Manmatha2
1Center for Visual Information Technology, IIIT-Hyderabad
2Center for Intelligent Information Retrieval, UMass Amherst

Introduction
• Recognition of books and collections.
• Recognition of words is crucial to information retrieval.
• Use of dictionaries and post-processors is not feasible in many languages.

Motivation
• Most (Indian language) OCRs recognize glyphs (components) and generate text from the class labels.
• Word accuracies are far lower than component accuracies.
• Word accuracy is inversely related to the number of components in the word.
• Use of a language model for post-processing is challenging.
  • High entropy, large vocabulary (e.g., Telugu).
  • Language processing modules are still emerging.
• Is it possible to make use of multiple occurrences of the same word to improve OCR performance?
[Figure: recognize-and-parse example with component accuracy = 9/12 = 75% and word accuracy = 25%; plot of component accuracy and word accuracy (0 to 100) against average word length (number of components).]
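The gap between the two accuracies on the Motivation slide can be reproduced with a small calculation. The sketch below is illustrative only: the slide's actual example is an image that did not survive in the transcript, so the words here are hypothetical Latin-letter stand-ins, chosen so that 9 of 12 components and 1 of 4 words are correct.

```python
def component_and_word_accuracy(reference, recognized):
    """Return (component accuracy, word accuracy).

    Each word is a list of component labels; the two lists are
    assumed to be aligned word by word.
    """
    total = correct = words_ok = 0
    for ref, rec in zip(reference, recognized):
        matches = sum(r == o for r, o in zip(ref, rec))
        total += len(ref)
        correct += matches
        # A word counts as correct only if every component is correct.
        words_ok += (len(ref) == len(rec) and matches == len(ref))
    return correct / total, words_ok / len(reference)


# Hypothetical data: 4 words of 3 components each,
# with a single wrong component in three of the words.
reference = [list("abc"), list("def"), list("ghi"), list("jkl")]
recognized = [list("abc"), list("dxf"), list("ghx"), list("xkl")]

comp_acc, word_acc = component_and_word_accuracy(reference, recognized)
print(comp_acc, word_acc)  # 0.75 0.25
```

A single wrong component spoils the whole word, which is why word accuracy drops faster than component accuracy as words get longer.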

Overview
• Multiple occurrences of a word.
• Words are degraded independently.
• OCR output is different for the word at different instances.
[Figure: cluster of multiple occurrences of a word, the OCR output for each occurrence, and the goal text.]

Related Work
• Character recognition in Indian languages is still an unsolved problem.
• Telugu is one of the most complex scripts.
• Recognition of a book has received some attention recently.
• Word images are efficiently matched for retrieval.
• Use of word image clusters to improve OCR accuracy.
[Figure labels: Malayalam, Bangla, Hindi, Tamil]
References cited on the slide:
1A. Negi et al., ICDAR, 2001; 2C. V. Jawahar et al., ICDAR, 2003; 3K. S. Sesh Kumar et al., ICDAR, 2007
H. Tao, J. Hull, Document Analysis and Information Retrieval, 1995
U. Pal, B. Chaudhuri, Pattern Recognition, 2004
1T. M. Rath et al., IJDAR, 2007; 2T. M. Rath et al., CVPR, 2003; 3Anand Kumar et al., ACCV, 2007
1P. Xiu and H. S. Baird, DRR XV, 2008; 2N. V. Neeba, C. V. Jawahar, ICPR, 2008

Conventional Recognition Process / Proposed Recognition Process
[Figure: two flow diagrams. Block labels: Scanned Images; Segmentation and Word detection; Preprocessing; Word Grouping (Clustering); Feature Extraction; Classification; Word groups; Word level Feature Extraction; Recognizer; Grouping; Text (UNICODE); Combining OCR Results.]

Locality Sensitive Hashing (LSH)
• LSH goal: “r-Near Neighbour”
  • For any query q, return a point p ∈ P such that ||p − q|| ≤ r (if it exists).
• LSH has been used for
  • Data mining: Taher H. Haveliwala, Aristides Gionis, Piotr Indyk, WebDB, 2000
  • Information retrieval: A. Andoni, M. Datar, N. Immorlica, V. Mirrokni, Piotr Indyk, 2006
  • Document image search: Anand Kumar, C. V. Jawahar, R. Manmatha, ACCV, 2007

LSH clustering on word images[TODO]

Character Majority Voting
• Algorithm [TODO]
[Figure: word cluster, OCR output, components, final output.]
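The slide leaves the algorithm as a TODO, so the following is only a guess at what character majority voting could look like: a position-wise vote over the OCR outputs of one word cluster. The function name, the use of Latin letters as stand-ins for script components, and the rule that only outputs of the most common length take part are all assumptions made for this sketch.

```python
from collections import Counter


def character_majority_vote(ocr_outputs):
    """Position-wise majority vote over the OCR outputs of one cluster.

    Each output is a sequence of component labels. Assumption: only
    outputs with the most common length vote, so positions line up.
    """
    if not ocr_outputs:
        return []
    common_len = Counter(len(o) for o in ocr_outputs).most_common(1)[0][0]
    voters = [o for o in ocr_outputs if len(o) == common_len]
    # zip(*voters) yields one tuple of candidate labels per position.
    return [Counter(column).most_common(1)[0][0] for column in zip(*voters)]


# Hypothetical cluster: five OCR outputs for the same word.
cluster = ["robust", "rohust", "robusl", "robust", "robu"]
print("".join(character_majority_vote(cluster)))  # robust
```

Outputs with a different number of components (such as the truncated "robu") are simply ignored here, which is the kind of case an alignment step is meant to handle.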

Dynamic Programming
• Alignment by dynamic programming [1,2]
• Voting for word 1 after aligning
[Figure: DTW output for word 1 compared with CMV output for word 1.]

Results
• The word generation process makes correct annotations available for evaluating the performance.
• 5000 clusters
• 20 variations
• Degraded dataset

Results
• Word accuracy vs. number of words
  • Adding more words makes the dataset more ambiguous.
  • Algorithm performance increases with the number of words and then saturates.
• Word accuracy vs. word length
  • Word accuracy decreases as the word length increases.
  • Use of the cluster information helps in gaining good word accuracies.

Analysis

Results
• For a small increase in component accuracy, there is a large improvement in the word accuracy.
• The improvement is high for long words.
• Relative improvement of 12% for words which occur at least twice.

Analysis
• Cuts and merges
• CMV vs. DTW
• Wrong word in the cluster
• Cases that cannot be handled

Conclusion & Future work
• A new framework has been proposed for OCRing a book.
• A word recognition technique which uses the document constraints is shown.
• An efficient clustering algorithm is used to speed up the process.
• Word level accuracy is improved from 70.37% to 79.12%.
• This technique can also be used for other languages.
• Future work: extend the technique to handle unique words by creating clusters over parts of words.

END

Additional slides

LSH Algorithm
Algorithm: Word Image Clustering
Require: Word images Wj and features Fj, j = 1,...,n
Ensure: Word image clusters O
for each i = 1,...,l do
    for each j = 1,...,n do
        Compute hash bucket I = gi(Fj)
        Store word image Wj in bucket I of hash table Ti
    end for
end for
k = 1
for each i = 1,...,n and Wi unmarked do
    Query hash table for word Wi to get cluster Ok
    Mark word Wi with k
    k = k + 1
end for
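The clustering pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the slide does not say which hash family gi is used, so the sketch assumes the common quantised random projection scheme, and the parameter names (l, k, w, seed) and default values are hypothetical. It also assumes that every unmarked word retrieved by a query is marked with the same cluster id.

```python
import random


def make_hash(dim, k, w, rng):
    """Build one hash function g: k quantised random projections (assumed family)."""
    projections = [
        ([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, w))
        for _ in range(k)
    ]

    def g(feature):
        return tuple(
            int((sum(a_i * f_i for a_i, f_i in zip(a, feature)) + b) // w)
            for a, b in projections
        )

    return g


def lsh_cluster(features, l=4, k=3, w=4.0, seed=0):
    """Cluster feature vectors with l hash tables; returns one label per vector."""
    rng = random.Random(seed)
    hashes = [make_hash(len(features[0]), k, w, rng) for _ in range(l)]
    tables = [{} for _ in range(l)]
    # Store every word index in its bucket of every hash table.
    for g, table in zip(hashes, tables):
        for j, feature in enumerate(features):
            table.setdefault(g(feature), []).append(j)
    labels = [None] * len(features)
    cluster_id = 0
    for i, feature in enumerate(features):
        if labels[i] is not None:
            continue  # already marked by an earlier query
        members = {i}
        for g, table in zip(hashes, tables):
            members.update(table[g(feature)])
        for m in members:
            if labels[m] is None:
                labels[m] = cluster_id
        cluster_id += 1
    return labels


# Hypothetical 2-D features: two pairs of identical word images.
features = [[0.0, 0.0], [0.0, 0.0], [100.0, 100.0], [100.0, 100.0]]
labels = lsh_cluster(features)
print(labels)
```

Identical feature vectors always fall into the same buckets and so receive the same label; whether nearby but non-identical vectors collide depends on k, l and w.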

Word Error Correction
Algorithm: Word Error Correction
Require: Cluster C of words Wi, i = 1,...,n
Ensure: Clusters O of correct words
for each i = 1,...,n do
    for each j = 1,...,n do
        if j != i then
            Align word Wi and Wj
            Record errors Ek, k = 1,...,m in Wi
            Record possible corrections Gk for Ek
        end if
    end for
    Correct Ek if probability pk of correction Gk is maximum
    O <- O U Wi
end for
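One possible reading of the correction loop above is sketched below. Several things are assumptions here: the alignment uses plain edit distance as a stand-in for the DTW alignment named in the slides, the "probability" of a correction is approximated by a vote count, and the function names are hypothetical. The sketch can only substitute components; it does not insert or delete them.

```python
from collections import Counter

GAP = None  # marks a component with no counterpart in the other word


def align(a, b):
    """Edit-distance alignment of sequences a and b as a list of (x, y) pairs."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((a[i - 1], GAP))
            i -= 1
        else:
            pairs.append((GAP, b[j - 1]))
            j -= 1
    return pairs[::-1]


def correct_word(index, cluster):
    """Correct cluster[index] by voting over its alignments with the other words."""
    word = cluster[index]
    votes = [Counter({c: 1}) for c in word]  # the word votes for itself
    for j, other in enumerate(cluster):
        if j == index:
            continue
        pos = 0
        for x, y in align(word, other):
            if x is GAP:
                continue  # extra component in the other word: ignored here
            if y is not GAP:
                votes[pos][y] += 1
            pos += 1
    return [v.most_common(1)[0][0] for v in votes]


# Hypothetical cluster with one substitution error and one missing component.
cluster = ["recognition", "recogniton", "recoqnition", "recognition"]
print("".join(correct_word(2, cluster)))  # recognition
```

Because each word is aligned before voting, the shorter output "recogniton" can still contribute votes to the positions it shares with the word being corrected.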

Dataset
• 5000 clusters, each with 20 images of the same word at different font sizes and resolutions.
• Words were generated using ImageMagick.
• Words were degraded with the Kanungo degradation model to approximate real data.
• The SF1, SF2, SF3 and SF4 datasets were degraded with 0%, 10%, 20% and 30% noise, respectively.

[Figure: system pipeline. Block labels: Pre-processing; Segmentation and word detection; Word image; Feature Extraction; Hashing; Hashed Words; Cluster of words; OCR; OCR output; Fusion (Method 1 / Method 2); Text.]
