
Content Level Access to Digital Library of India Pages

Content Level Access to Digital Library of India Pages. Praveen Krishnan, Ravi Shekhar, C.V. Jawahar. CVIT, IIIT Hyderabad. Digital Library of India (DLI). http://www.dli.iiit.ac.in/. Vision: To enhance access to information and knowledge to masses.


Presentation Transcript


  1. Content Level Access to Digital Library of India Pages Praveen Krishnan, Ravi Shekhar, C.V. Jawahar CVIT, IIIT Hyderabad

  2. Digital Library of India (DLI) http://www.dli.iiit.ac.in/ Vision: To enhance access to information and knowledge to masses. • Partner to Million Book Universal Digital Library Programme. • Information for people • Dataset for researchers. Vamshi Ambati, N. Balakrishnan, Raj Reddy, Lakshmi Pratha, C. V. Jawahar: The Digital Library of India Project: Process, Policies and Architecture, ICDL, 2007.

  3. Digital Library of India (DLI) Vision: To enhance access to information and knowledge to masses. Content Statistics • #Books: 4 Lakhs • #Pages: 134 Million • #Words: 26 Billion. Languages • 41 different languages • Includes Hindi, Telugu, Marathi, ... and English, French, Greek, ... Source: http://www.new1.dli.ernet.in/

  4. Digital Library of India (DLI) Meta data search • Supports Meta data based search. • No Content Level Access Indian freedom struggle and independence Search

  5. Digital Library of India (DLI) • Need Content Level Access • Content + Meta Data Indian freedom struggle and independence Search

  6. Digital Library of India (DLI) Reliable Text Representation ? • Need Content Level Access • Content + Meta Data Indian freedom struggle and independence Search

  7. Goal Digital Library of India Search • Build a search engine with support for Indian languages. • Word Spotting

  8. Goal Indian Language Document Search Engine Text Query Support खोज Page 1

  9. Goal Indian Language Document Search Engine शिवाजी और मराठा साम्राज्य खोज Multi Keyword Support Page 1

  10. Goal Indian Language Document Search Engine शिवाजी और मराठा साम्राज्य खोज Ranks based on # Occurrences Page 1

  11. Goal Indian Language Document Search Engine शिवाजी और मराठा साम्राज्य खोज Semantically Related Words Page 1

  12. Goal Indian Language Document Search Engine शिवाजी और मराठा साम्राज्य खोज Seamless scaling to billions of word images. Sub second retrieval Page 1

  13. Text from OCR Hindi Page Telugu Page - Hindi: Title - Praachin Bhaartiy Vichaar Aur Vibhutiyaan, Published in 1624 - Telugu: Title - Andhra Vagmayaramba Dasha, Published in 1960

  14. Text from OCR Hindi Page Telugu Page Cuts Cuts

  15. Text from OCR Hindi Page Telugu Page Cuts Merges

  16. Text from OCR Hindi Page Telugu Page Cuts Variations in Script, Font and Typesetting.

  17. Text from OCR [Chart: Char %, Hindi and Telugu] [1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, "Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts," in ICDAR MOCR Workshop, 2011.

  18. Text from OCR [Chart: Word %, Hindi and Telugu] [1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, "Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts," in ICDAR MOCR Workshop, 2011.

  19. Text from OCR Search % Hindi Telugu

  20. BoVW for Image Retrieval Text Retrieval Image Recognition Query Image Ranked Retrieved Results Josef Sivic, Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003

  21. BoVW for Image Retrieval • Fixed Length Representation • Invariant to popular deformation Query Image Ranked Retrieved Results Josef Sivic, Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003

  22. BoVW for Document Image Retrieval R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.

  23. BoVW for Document Image Retrieval Histogram of Visual Words R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
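
The histogram of visual words on this slide can be illustrated with a minimal sketch. It assumes local descriptors have already been extracted from a word image and that a codebook of visual words is available; the 2-D toy descriptors, the codebook values and the function name `bovw_histogram` are illustrative assumptions, not taken from the presentation.

```python
from math import dist

def bovw_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest visual word and return
    the L1-normalised histogram of visual-word counts."""
    counts = [0] * len(codebook)
    for d in descriptors:
        # Index of the closest codebook entry (Euclidean distance).
        nearest = min(range(len(codebook)), key=lambda i: dist(d, codebook[i]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)

# Toy data (hypothetical): three visual words, four descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
hist = bovw_histogram(descriptors, codebook)
```

Every word image maps to a vector of the same length as the codebook, regardless of how many descriptors it produces, which is what makes the representation fixed-length and easy to index.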

  24. BoVW for Document Image Retrieval Cuts R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.

  25. BoVW for Document Image Retrieval Cuts Histogram of Visual Words R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.

  26. BoVW for Document Image Retrieval Merges R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.

  27. BoVW for Document Image Retrieval Merges Histogram of Visual Words R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.

  28. BoVW for Document Image Retrieval • Robust against degradation • Lost Geometry • Use Spatial Verification • SIFT based. • Longest Subsequence alignment. [Figure: visual words (V1, V2, V4, V6, V8, V9) plotted along x and y axes for Clean, Cuts and Merge word images] R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012. I. Z. Yalniz and R. Manmatha. An Efficient Framework for Searching Text in Noisy Document Images. In DAS, 2012.
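
One way to picture spatial verification by longest subsequence alignment is sketched below. It assumes each keypoint is reduced to an (x position, visual word id) pair, that words are read left to right, and that the score is the longest common subsequence length normalised by the query length. The scoring rule, the toy keypoints and all function names are assumptions made for illustration; the slide does not give these details.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def word_sequence(keypoints):
    """keypoints: list of (x, visual_word_id); return ids ordered left to right."""
    return [word for _, word in sorted(keypoints)]

def spatial_score(query_kps, cand_kps):
    """Fraction of the query's visual-word order preserved in the candidate."""
    q, c = word_sequence(query_kps), word_sequence(cand_kps)
    return lcs_length(q, c) / max(len(q), 1)

# Toy keypoints (hypothetical positions and word ids).
query = [(0.5, "V2"), (1.0, "V9"), (1.5, "V1"), (2.0, "V8"), (2.5, "V6"), (3.0, "V4")]
cut = [(0.5, "V2"), (1.0, "V9"), (2.0, "V8"), (2.5, "V6"), (3.0, "V4")]  # one word lost
scrambled = [(x, w) for (x, _), w in zip(query, reversed(word_sequence(query)))]

cut_score = spatial_score(query, cut)
scrambled_score = spatial_score(query, scrambled)
```

The scrambled candidate has exactly the same histogram as the query but a much lower spatial score than the degraded candidate, which is the geometry a plain histogram loses.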

  29. Query Expansion Querying Database Query Image Query Image Histogram Rank 2 Rank 3 Rank 4 Rank 5 Rank 6 Rank 1 Refined Histogram

  30. Query Expansion Querying Database Query Image Query Histogram Rank 2 Rank 3 Rank 4 Rank 5 Rank 6 Rank 1 Better Results
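
The refinement step can be sketched as average query expansion: the query histogram is averaged with the histograms of its top-ranked results and the database is queried again. This is an assumed formulation (the slides do not state how the refined histogram is computed), and the cosine similarity, `top_k` value and toy database are illustrative.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two histograms."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, database):
    """Database keys sorted by decreasing similarity to the query."""
    return sorted(database, key=lambda k: cosine(query, database[k]), reverse=True)

def expand_query(query, database, top_k=2):
    """Average the query with the histograms of its top_k results."""
    top = rank(query, database)[:top_k]
    vectors = [query] + [database[k] for k in top]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Toy database (hypothetical): word-image id -> BoVW histogram.
database = {
    "a": [1, 1, 0, 0],
    "b": [1, 1, 1, 0],
    "c": [0, 0, 1, 1],
    "d": [0, 1, 1, 0],
}
query = [1, 0, 0, 0]               # degraded query that sees one visual word
refined = expand_query(query, database)
first_pass = rank(query, database)
second_pass = rank(refined, database)
```

In this toy case the original query has zero similarity to both "c" and "d"; after expansion, "d" scores clearly higher than "c" because it shares visual words with the top results.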

  31. Text Query Support • Originally formulated in a “query by example” setting. Input Query Image Histogram

  32. Text Query Support • Originally formulated in a “query by example” setting. • Need Text Queries Input Text Query Text Query Histogram

  33. Observations • Are the results of OCR and BoVW complementary? OCR BoVW BoVW OCR

  34. Observations • mAP vs. Word Length [Plot: mAP against No. of Characters]

  35. Observations • “OCR system has a high precision while BoVW approach has a high recall.” • Example: #GT = 5 OCR Out List; Precision = 1 ; Recall = 0.4 BoVW Out List; Precision = 0.8 ; Recall = 1

  36. Fusion • Fusion Techniques:- • Naïve Fusion mAP Chart OCR

  37. Fusion • Fusion Techniques:- • Naïve Fusion mAP Chart BoVW

  38. Fusion • Fusion Techniques:- • Naïve Fusion Concatenating OCR Results with BoVW mAP Chart OCR BoVW
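
Naive fusion by concatenation can be sketched as follows. Placing the OCR results first is an assumption, motivated by the earlier observation that OCR has high precision; the word-image ids are hypothetical.

```python
def naive_fusion(ocr_results, bovw_results):
    """Concatenate OCR results with BoVW results, skipping duplicates."""
    seen = set(ocr_results)
    fused = list(ocr_results)
    for item in bovw_results:
        if item not in seen:
            seen.add(item)
            fused.append(item)
    return fused

# Hypothetical ranked lists of word-image ids.
ocr_list = ["w3", "w7"]
bovw_list = ["w7", "w1", "w3", "w9", "w4"]
fused_list = naive_fusion(ocr_list, bovw_list)
```

The fused list keeps the OCR hits at the top and lets BoVW supply the remaining results in their original order.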

  39. Fusion • Fusion Techniques:- • Edit Distance Based Fusion mAP Chart OCR BoVW

  40. Fusion • Fusion Techniques:- • Edit Distance Based Fusion mAP Chart • Reordering BoVW • BoVW score • Modified Edit distance cost BoVW

  41. Fusion • Fusion Techniques:- • Edit Distance Based Fusion mAP Chart • Reordering BoVW • BoVW score • Modified Edit distance cost BoVW

  42. Fusion • Fusion Techniques:- • Edit Distance Based Fusion mAP Chart OCR BoVW
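
A rough sketch of edit distance based reordering of the BoVW list is given below. It combines each candidate's BoVW score with a text similarity derived from the edit distance between the query and the candidate's OCR output. The slides mention a modified edit distance cost without defining it, so the standard Levenshtein distance is used here as a stand-in; the weight `alpha`, the linear combination and the sample candidates are all assumptions.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance (unit cost insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def rerank(query_text, candidates, alpha=0.5):
    """candidates: list of (word_id, ocr_text, bovw_score in [0, 1]).
    Returns word ids sorted by the combined score, best first."""
    def fused(candidate):
        _, ocr_text, bovw_score = candidate
        norm = max(len(query_text), len(ocr_text), 1)
        text_sim = 1.0 - edit_distance(query_text, ocr_text) / norm
        return alpha * bovw_score + (1 - alpha) * text_sim
    return [c[0] for c in sorted(candidates, key=fused, reverse=True)]

# Hypothetical BoVW results with their (noisy) OCR outputs.
candidates = [
    ("w1", "indla", 0.6),
    ("w2", "media", 0.7),
    ("w3", "india", 0.5),
]
reordered = rerank("india", candidates)
```

Here "w2" has the best BoVW score but its OCR text is far from the query, so it drops below the two candidates whose OCR output matches or nearly matches.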

  43. Fusion • Fusion Techniques:- • Hybrid Fusion mAP Chart OCR BoVW

  44. Fusion • Fusion Techniques:- • Hybrid Fusion mAP Chart • Re-querying BoVW using • OCR retrieved results. • Using rank aggregation • techniques BoVW

  45. Fusion • Fusion Techniques:- • Hybrid Fusion mAP Chart • Re-querying BoVW using • OCR retrieved results. • Using rank aggregation • techniques BoVW

  46. Fusion • Fusion Techniques:- • Hybrid Fusion mAP Chart OCR BoVW
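
The slides do not say which rank aggregation technique the hybrid fusion uses. As one common example, a Borda count over the OCR list and the BoVW list could look like the sketch below; the choice of Borda count, the tie-breaking rule and the sample lists are assumptions.

```python
def borda_aggregate(rankings):
    """Combine several ranked lists with a Borda count.
    An item at position p (0-based) in a list earns n - p points,
    where n is the number of distinct items; absent items earn 0."""
    items = {item for ranking in rankings for item in ranking}
    n = len(items)
    scores = dict.fromkeys(items, 0)
    for ranking in rankings:
        for pos, item in enumerate(ranking):
            scores[item] += n - pos
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda k: (-scores[k], k))

# Hypothetical ranked lists of word-image ids.
ocr_ranking = ["w3", "w7"]
bovw_ranking = ["w7", "w1", "w3", "w9"]
aggregated = borda_aggregate([ocr_ranking, bovw_ranking])
```

Items that both systems retrieve accumulate points from both lists and rise to the top, while items found by only one system are still kept.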

  47. Experimental Results

  48. Experimental Details • OCR [1] • Feature Detector: Harris interest point detection [2] • Feature Descriptor: SIFT [2] • Indexing: Lucene [3] [1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, "Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts," in ICDAR MOCR Workshop, 2011. [2] http://www.vlfeat.org [3] http://lucene.apache.org/

  49. Test Bed Sample Word Images DLI Corpus • In addition, we used HP1 & TP1 fully annotated dataset

  50. Evaluation Measures • Precision • Recall • mAP (Mean Average Precision): mean of the area under the precision-recall curve for all the queries. • Precision @ 10: shows how accurate the top 10 retrieved results are. TP = True Positive, FP = False Positive, FN = False Negative. [Figure: Precision-Recall Curve]
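
The evaluation measures can be written out as a short sketch. Precision is TP / (TP + FP) and recall is TP / (TP + FN); average precision is computed here as the mean of the precision values at the ranks where relevant items appear, a common discrete approximation of the area under the precision-recall curve. The result lists and ids are hypothetical, chosen so that the first list has 5 ground-truth items, precision 1 and recall 0.4.

```python
def precision_recall(retrieved, relevant):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN).
    Assumes `retrieved` contains no duplicates."""
    tp = sum(1 for item in retrieved if item in relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(retrieved, relevant):
    """Sum of precision at each relevant hit, divided by #relevant."""
    hits, total = 0, 0.0
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    return sum(1 for item in retrieved[:k] if item in relevant) / k

def mean_average_precision(runs):
    """runs: list of (retrieved, relevant) pairs, one per query."""
    return sum(average_precision(r, g) for r, g in runs) / len(runs)

# Hypothetical query with 5 ground-truth word images.
relevant = {"w1", "w2", "w3", "w4", "w5"}
short_list = ["w1", "w2"]                          # few results, all correct
long_list = ["w1", "x1", "w2", "w3", "w4", "w5"]   # all found, one false hit

short_pr = precision_recall(short_list, relevant)
long_pr = precision_recall(long_list, relevant)
m_ap = mean_average_precision([(short_list, relevant), (long_list, relevant)])
```

The short list illustrates a high-precision, low-recall output and the long list a high-recall output with one false positive; mAP averages the per-query average precision over all queries.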
