
Robust Recognition of Documents by Fusing Results of Word Clusters

Venkat Rasagna¹, Anand Kumar¹, C. V. Jawahar¹, R. Manmatha². ¹Center for Visual Information Technology, IIIT Hyderabad. ²Center for Intelligent Information Retrieval, UMass Amherst.



Presentation Transcript


  1. Robust Recognition of Documents by Fusing Results of Word Clusters. Venkat Rasagna¹, Anand Kumar¹, C. V. Jawahar¹, R. Manmatha². ¹Center for Visual Information Technology, IIIT Hyderabad. ²Center for Intelligent Information Retrieval, UMass Amherst.

  2. Introduction • Recognition of books and collections. • Recognition of words is crucial to information retrieval. • The use of dictionaries and post-processors is not feasible in many languages.

  3. Motivation • Most (Indian language) OCRs recognize glyphs (components) and generate text from the class labels. • Word accuracies are far lower than component accuracies. • Word accuracy is inversely related to the number of components in the word. • Using a language model for post-processing is challenging: high entropy and large vocabulary (e.g., Telugu), and language processing modules are still emerging. • Is it possible to exploit multiple occurrences of the same word to improve OCR performance? [Figure: word accuracy vs. component accuracy as a function of average word length (number of components); worked example with component accuracy 9/12 = 75% but word accuracy only 25%.]

  4. Overview • Multiple occurrences of the same word are degraded independently, so the OCR output differs from instance to instance. • Goal: cluster the word images and fuse the per-instance OCR outputs into a single text result. [Diagram: several occurrences of a word, each passed through OCR, then clustered and fused into text.]

  5. Related Work • Character recognition in Indian languages (Malayalam, Bangla, Hindi, Tamil) is still an unsolved problem. • Telugu is one of the most complex scripts. • Recognition of a book has received some attention recently. • Word images can be matched efficiently for retrieval. • Word image clusters have been used to improve OCR accuracy. References: A. Negi et al., ICDAR 2001; C. V. Jawahar et al., ICDAR 2003; K. S. Sesh Kumar et al., ICDAR 2007; H. Tao and J. Hull, Document Analysis and Information Retrieval, 1995; U. Pal and B. Chaudhuri, Pattern Recognition, 2004; T. M. Rath et al., IJDAR 2007; T. M. Rath et al., CVPR 2003; Anand Kumar et al., ACCV 2007; P. Xiu and H. S. Baird, DRR XV, 2008; N. V. Neeba and C. V. Jawahar, ICPR 2008.

  6. Conventional Recognition Process: Scanned images → Preprocessing → Segmentation and word detection → Feature extraction → Classification → Text (Unicode). Proposed Recognition Process: Scanned images → Preprocessing → Segmentation and word detection → Word grouping (clustering) → Word-level feature extraction → Recognizer → Combining OCR results → Text (Unicode).

  7. Locality Sensitive Hashing (LSH) • LSH goal: "r-near neighbour": for any query q, return a point p ∈ P such that ||p − q|| ≤ r (if it exists). • LSH has been used for: data mining (Taher H. Haveliwala, Aristides Gionis, Piotr Indyk, WebDB, 2000); information retrieval (A. Andoni, M. Datar, N. Immorlica, V. Mirrokni, P. Indyk, 2006); document image search (Anand Kumar, C. V. Jawahar, R. Manmatha, ACCV, 2007).
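The r-near-neighbour query above can be illustrated with a minimal Python sketch. The random-projection hash family and all function names here are illustrative assumptions, not the paper's actual implementation:

```python
import random

def make_hash(dim, w, seed):
    """One illustrative LSH function: project onto a random Gaussian
    direction, then quantize the projection into bins of width w."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(dim)]
    b = rng.uniform(0, w)
    return lambda p: int((sum(ai * pi for ai, pi in zip(a, p)) + b) // w)

def build_table(points, hashes):
    """Bucket every point by its tuple of hash values."""
    table = {}
    for idx, p in enumerate(points):
        key = tuple(h(p) for h in hashes)
        table.setdefault(key, []).append(idx)
    return table

def query(q, points, table, hashes, r):
    """Return indices of points within distance r of q, checking only
    the candidates that fall into the query's own bucket."""
    key = tuple(h(q) for h in hashes)
    candidates = table.get(key, [])
    dist = lambda p: sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5
    return [i for i in candidates if dist(points[i]) <= r]
```

Because only one bucket is inspected, the query is sublinear in practice; the final distance check guarantees that everything returned really satisfies ||p − q|| ≤ r.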

  8. LSH clustering on word images [TODO]

  9. Character Majority Voting • Algorithm [TODO] [Diagram: word cluster → OCR outputs → components → final output.]
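The voting idea can be sketched in a few lines of Python. This is a simplified illustration that assumes the OCR readings are already aligned to equal length (the paper handles differing lengths via the dynamic-programming alignment on the next slide); the function name is ours:

```python
from collections import Counter

def character_majority_vote(ocr_outputs):
    """Fuse several OCR readings of the same word by taking the most
    frequent character at each position. Assumes equal-length, aligned
    readings for simplicity."""
    fused = []
    for pos in range(len(ocr_outputs[0])):
        votes = Counter(word[pos] for word in ocr_outputs)
        fused.append(votes.most_common(1)[0][0])  # majority character
    return "".join(fused)
```

For example, fusing the readings "rabust", "robust", and "rodust" recovers "robust", since each error occurs in only one of the three instances.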

  10. Dynamic Programming • Voting for word 1 after aligning the cluster's readings with dynamic programming [1, 2]. [Slide shows the alignment, the DTW output for word 1, and the CMV output for word 1.]
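A minimal sketch of the alignment step, using classic edit-distance dynamic programming on character strings (the paper aligns component sequences; this string version and the gap symbol are simplifying assumptions):

```python
def align(a, b, gap="-"):
    """Align two OCR readings with edit-distance dynamic programming,
    returning both strings padded with a gap symbol where characters
    were inserted or deleted."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append(gap); i -= 1
        else:
            out_a.append(gap); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))
```

Once the readings are padded to a common length this way, the per-position majority vote of the previous slide applies directly, with gaps voting for character deletion.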

  11. Results • The word generation process makes correct annotations available for evaluating performance. • Dataset: 5000 clusters, 20 variations per word, with degraded images.

  12. Results • Word accuracy vs. number of words: adding more words makes the dataset more ambiguous, yet the algorithm's performance increases with the number of words before saturating. • Word accuracy vs. word length: word accuracy decreases as word length increases; using the cluster information yields good word accuracies.

  13. Analysis

  14. Results • For a small increase in component accuracy, there is a large improvement in word accuracy. • The improvement is high for long words. • Relative improvement of 12% for words that occur at least twice.

  15. Analysis • Cuts and merges. • CMV vs. DTW. • Wrong word in the cluster. • Cases that can't be handled.

  16. Conclusion & Future Work • A new framework has been proposed for OCRing a book. • A word recognition technique that uses document-level constraints is shown. • An efficient clustering algorithm is used to speed up the process. • Word-level accuracy improves from 70.37% to 79.12%. • The technique can also be applied to other languages. • Future work: extending the approach to handle unique words by creating clusters over parts of words.

  17. END

  18. Additional slides

  19. LSH Algorithm
Algorithm: Word Image Clustering
Require: word images Wj and features Fj, j = 1, ..., n
Ensure: word image clusters O
for each i = 1, ..., l do
  for each j = 1, ..., n do
    Compute hash bucket I = gi(Fj)
    Store word image Wj in bucket I of hash table Ti
  end for
end for
k = 1
for each i = 1, ..., n with Wi unmarked do
  Query the hash tables for word Wi to get cluster Ok
  Mark word Wi with k
  k = k + 1
end for
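The clustering loop above can be sketched as follows. This is a toy Python rendering under stated assumptions: features are plain numeric vectors, and the hash functions g_i are illustrative random projections; the paper's actual g_i and feature representation are not specified here:

```python
import random

def cluster_words(features, l=5, w=4.0, seed=0):
    """Sketch of the slide's word-image clustering: hash every feature
    vector into l tables, then form one cluster per unmarked word from
    the union of its buckets across tables."""
    rng = random.Random(seed)
    dim = len(features[0])
    tables = []
    for _ in range(l):
        a = [rng.gauss(0, 1) for _ in range(dim)]
        b = rng.uniform(0, w)
        # Bind a and b as defaults so each table keeps its own hash.
        g = lambda f, a=a, b=b: int((sum(x * y for x, y in zip(a, f)) + b) // w)
        table = {}
        for j, f in enumerate(features):
            table.setdefault(g(f), []).append(j)
        tables.append((g, table))
    clusters, marked = [], set()
    for i, f in enumerate(features):
        if i in marked:
            continue
        members = set()
        for g, table in tables:
            members.update(table.get(g(f), []))
        members -= marked          # each word joins exactly one cluster
        marked.update(members)
        clusters.append(sorted(members))
    return clusters
```

Identical word images hash to the same bucket in every table, so they always end up in the same cluster; dissimilar images rarely collide in all tables, keeping clusters clean.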

  20. Word Error Correction
Algorithm: Word Error Correction
Require: cluster C of words Wi, i = 1, ..., n
Ensure: clusters O of correct words
for each i = 1, ..., n do
  for each j = 1, ..., n do
    if j != i then
      Align words Wi and Wj
      Record errors Ek, k = 1, ..., m in Wi
      Record possible corrections Gk for Ek
    end if
  end for
  Correct Ek if the probability pk of correction Gk is maximum
  O ← O ∪ Wi
end for

  21. Dataset • 5000 clusters, each with 20 images of the same word at different font sizes and resolutions. • Words were generated using ImageMagick. • Words were degraded with the Kanungo degradation model to approximate real data. • Datasets SF1, SF2, SF3, and SF4 were degraded with 0%, 10%, 20%, and 30% noise, respectively.

  22. Pipeline: Pre-processing → Segmentation and word detection → Word image feature extraction → Hashing → Hashed words → Cluster of words → OCR → OCR output → Fusion (Method 1 / Method 2) → Text.
