Content Level Access to Digital Library of India Pages Praveen Krishnan, Ravi Shekhar, C.V. Jawahar CVIT, IIIT Hyderabad
Digital Library of India (DLI) http://www.dli.iiit.ac.in/ Vision: To enhance access to information and knowledge for the masses. • Partner to the Million Book Universal Digital Library Programme. • Information for people • Dataset for researchers Vamshi Ambati, N. Balakrishnan, Raj Reddy, Lakshmi Pratha, C. V. Jawahar: The Digital Library of India Project: Process, Policies and Architecture. ICDL, 2007.
Digital Library of India (DLI) Vision: To enhance access to information and knowledge for the masses. Content Statistics • #Books: 4 lakh • #Pages: 134 million • #Words: 26 billion • Languages: 41, including Hindi, Telugu, Marathi, ... and English, French, Greek, ... Source: http://www.new1.dli.ernet.in/
Digital Library of India (DLI) Metadata search • Supports metadata-based search. • No content-level access. [Search box: "Indian freedom struggle and independence"]
Digital Library of India (DLI) • Need content-level access • Content + metadata [Search box: "Indian freedom struggle and independence"]
Digital Library of India (DLI) • Need content-level access • Content + metadata • Reliable text representation? [Search box: "Indian freedom struggle and independence"]
Goal Digital Library of India Search • Build a search engine with support for Indian languages. • Word Spotting
Goal Indian Language Document Search Engine • Text query support [Search box with a खोज ("Search") button]
Goal Indian Language Document Search Engine • Multi-keyword support [Example query: शिवाजी और मराठा साम्राज्य ("Shivaji and the Maratha Empire")]
Goal Indian Language Document Search Engine • Ranks based on number of occurrences [Example query: शिवाजी और मराठा साम्राज्य ("Shivaji and the Maratha Empire")]
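The occurrence-based ranking on this slide can be illustrated with a minimal sketch. It assumes each page is already available as a list of recognized words; the function name `rank_pages` and the toy page contents are hypothetical, and this is not the system's actual ranking code.

```python
from collections import Counter

def rank_pages(query, pages):
    """Rank page ids by the total number of query-keyword occurrences."""
    keywords = query.split()
    scores = {}
    for page_id, words in pages.items():
        counts = Counter(words)
        score = sum(counts[k] for k in keywords)  # missing keywords count as 0
        if score > 0:
            scores[page_id] = score
    # Highest count first; ties broken by page id for a stable order.
    return sorted(scores, key=lambda p: (-scores[p], p))

# Toy pages (hypothetical content).
pages = {
    "page_1": ["शिवाजी", "मराठा", "शिवाजी", "साम्राज्य"],
    "page_2": ["मराठा", "इतिहास"],
    "page_3": ["इतिहास"],
}
ranking = rank_pages("शिवाजी और मराठा साम्राज्य", pages)
```

Pages with no keyword hits are dropped, so a multi-keyword query returns only pages that contain at least one of its words.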
Goal Indian Language Document Search Engine • Semantically related words [Example query: शिवाजी और मराठा साम्राज्य ("Shivaji and the Maratha Empire")]
Goal Indian Language Document Search Engine • Seamless scaling to billions of word images • Sub-second retrieval [Example query: शिवाजी और मराठा साम्राज्य ("Shivaji and the Maratha Empire")]
Text from OCR [Figure: sample Hindi page and sample Telugu page] • Hindi: Title - Praachin Bhaartiy Vichaar Aur Vibhutiyaan, published in 1624 • Telugu: Title - Andhra Vagmayaramba Dasha, published in 1960
Text from OCR [Figure: Hindi page and Telugu page, both annotated with cuts]
Text from OCR [Figure: Hindi page and Telugu page, annotated with cuts and merges]
Text from OCR [Figure: Hindi page and Telugu page, annotated with cuts] • Variations in script, font and typesetting.
Text from OCR [Chart: Char % for Hindi and Telugu] [1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011.
Text from OCR [Chart: Word % for Hindi and Telugu] [1]
Text from OCR [Chart: Search % for Hindi and Telugu]
BoVW (Bag of Visual Words) for Image Retrieval [Figure: text retrieval and image recognition; a query image and its ranked retrieved results] Josef Sivic, Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003
BoVW for Image Retrieval • Fixed-length representation • Invariant to common deformations [Figure: a query image and its ranked retrieved results]
BoVW for Document Image Retrieval R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
BoVW for Document Image Retrieval • Histogram of visual words
BoVW for Document Image Retrieval • Cuts
BoVW for Document Image Retrieval • Cuts • Histogram of visual words
BoVW for Document Image Retrieval • Merges
BoVW for Document Image Retrieval • Merges • Histogram of visual words
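A histogram of visual words can be sketched as follows. This is a minimal illustration, assuming a codebook of cluster centres has already been learned and that each local descriptor is assigned to its nearest centre; the 2-D toy descriptors stand in for real SIFT vectors, and the function names are hypothetical.

```python
import math

def nearest_word(descriptor, codebook):
    """Index of the closest codebook centre (Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(descriptor, codebook[i]))

def bovw_histogram(descriptors, codebook):
    """L1-normalised histogram of visual-word assignments."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_word(d, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Toy data: three visual words, four local descriptors from one word image.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (1.0, 0.2), (0.0, 0.8)]
hist = bovw_histogram(descriptors, codebook)
```

Because the histogram has one bin per visual word, every word image gets a vector of the same length regardless of how many interest points it contains.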
BoVW for Document Image Retrieval • Robust against degradation • Lost geometry • Use spatial verification • SIFT based. • Longest subsequence alignment. [Figure: visual-word sequences (V1, V2, V4, V6, V8, V9) plotted along x for cut, merged and clean word images] R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012. I. Z. Yalniz and R. Manmatha. An Efficient Framework for Searching Text in Noisy Document Images. In DAS, 2012.
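Spatial verification by subsequence alignment can be sketched as below. The sketch assumes the slide's "longest subsequence alignment" means the longest common subsequence (LCS) of visual words ordered left to right by x-position; the normalisation in `spatial_score` and the example sequences are illustrative assumptions, not taken from the slides.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def spatial_score(query_words, candidate_words):
    """LCS length normalised by the longer sequence (hypothetical score)."""
    if not query_words or not candidate_words:
        return 0.0
    longest = max(len(query_words), len(candidate_words))
    return lcs_length(query_words, candidate_words) / longest

# Visual words read left to right across each word image (illustrative).
clean = ["V2", "V9", "V1", "V8", "V6", "V4"]
cut = ["V2", "V9", "V8", "V6", "V4"]              # one visual word lost to a cut
shuffled = ["V4", "V6", "V8", "V1", "V9", "V2"]   # same words, wrong order
```

A candidate that shares the query's visual words in the same order scores high even when a cut drops one of them, while a candidate with the same words in a different order scores low, which is the geometry a plain histogram loses.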
Query Expansion [Figure: the query image histogram is used to query the database; the ranked results (Rank 1 to Rank 6) produce a refined histogram]
Query Expansion [Figure: the query histogram is used to query the database; ranked results (Rank 1 to Rank 6); better results]
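One common way to realise query expansion is to average the query histogram with the histograms of the top-ranked results and query again. The sketch below assumes that averaging scheme and cosine similarity; the slides do not state the exact refinement rule, and all names and toy histograms are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two histograms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(query_hist, database, top_k):
    """Ids of the top_k database histograms most similar to the query."""
    ranked = sorted(database,
                    key=lambda k: cosine(query_hist, database[k]),
                    reverse=True)
    return ranked[:top_k]

def expand_query(query_hist, database, top_k=2):
    """Refined histogram: mean of the query and its top_k results."""
    top = search(query_hist, database, top_k)
    vectors = [query_hist] + [database[k] for k in top]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Toy database of word-image histograms.
database = {
    "w1": [0.5, 0.5, 0.0, 0.0],
    "w2": [0.4, 0.4, 0.2, 0.0],
    "w3": [0.0, 0.0, 0.5, 0.5],
}
query = [0.6, 0.4, 0.0, 0.0]
first = search(query, database, 2)
refined = expand_query(query, database, top_k=2)
second = search(refined, database, 3)
```

The refined histogram picks up visual words that appear in the top results but were missing from the original query, which is how a second pass can return better results.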
Text Query Support • BoVW retrieval was originally formulated in a “query by example” setting. [Figure: input query image and its histogram]
Text Query Support • BoVW retrieval was originally formulated in a “query by example” setting. • Need text queries. [Figure: input text query and its text query histogram]
Observations • Are the results of OCR and BoVW complementary? [Figure: OCR results compared with BoVW results]
Observations • mAP vs. word length [Chart: mAP against number of characters]
Observations • “OCR system has a high precision while BoVW approach has a high recall.” • Example (#GT = 5): OCR output list: Precision = 1, Recall = 0.4; BoVW output list: Precision = 0.8, Recall = 1
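The precision and recall figures in the example can be reproduced approximately with a small sketch. The output lists themselves are not recoverable from the slide, so the lists below are assumptions: two correct OCR results, and six BoVW results containing all five ground-truth items plus one false positive, which gives a precision of about 0.83 rather than exactly 0.8.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against the ground truth."""
    relevant = set(relevant)
    tp = sum(1 for item in retrieved if item in relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

ground_truth = ["g1", "g2", "g3", "g4", "g5"]        # #GT = 5
ocr_list = ["g1", "g2"]                              # assumed OCR output
bovw_list = ["g1", "g2", "x1", "g3", "g4", "g5"]     # assumed BoVW output

ocr_precision, ocr_recall = precision_recall(ocr_list, ground_truth)
bovw_precision, bovw_recall = precision_recall(bovw_list, ground_truth)
```

The OCR list is short but clean, while the BoVW list finds everything at the cost of a false positive, which is the complementarity the fusion schemes try to exploit.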
Fusion • Fusion techniques: • Naïve fusion [Chart: mAP for OCR]
Fusion • Fusion techniques: • Naïve fusion [Chart: mAP for BoVW]
Fusion • Fusion techniques: • Naïve fusion: concatenating OCR results with BoVW results [Chart: mAP for OCR and BoVW]
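Naïve fusion by concatenation can be sketched in a few lines. Placing the OCR results first and skipping BoVW results that OCR already returned are assumptions made here; the slide only says the two lists are concatenated.

```python
def naive_fusion(ocr_results, bovw_results):
    """OCR results first, then BoVW results not already listed."""
    seen = set()
    fused = []
    for item in list(ocr_results) + list(bovw_results):
        if item not in seen:
            seen.add(item)
            fused.append(item)
    return fused

fused = naive_fusion(["a", "b"], ["b", "c", "a", "d"])
```

Putting the high-precision OCR list at the top keeps the first ranks clean, and the BoVW tail adds the matches OCR missed.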
Fusion • Fusion techniques: • Edit-distance-based fusion [Chart: mAP for OCR and BoVW]
Fusion • Fusion techniques: • Edit-distance-based fusion • Reordering BoVW • BoVW score • Modified edit distance cost [Chart: mAP for BoVW]
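The reordering step can be sketched as follows, under stated assumptions: each BoVW result carries a similarity score and the OCR text of its word image, and the fused score subtracts a weighted, length-normalised edit distance between the query text and that OCR text. The slides do not define the "modified" edit distance cost, so plain Levenshtein distance is used as a stand-in, and the `weight` parameter is hypothetical.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance (stand-in for the modified cost)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def rerank(query_text, candidates, weight=0.5):
    """Reorder (word_id, bovw_score, ocr_text) tuples by a fused score."""
    def fused_score(candidate):
        _, score, text = candidate
        longest = max(len(query_text), len(text), 1)
        return score - weight * edit_distance(query_text, text) / longest
    ordered = sorted(candidates, key=fused_score, reverse=True)
    return [c[0] for c in ordered]

candidates = [
    ("w1", 0.90, "indigo"),   # visually similar, different word
    ("w2", 0.85, "indla"),    # right word, one OCR error
    ("w3", 0.40, "india"),    # right word, weak visual match
]
order = rerank("india", candidates)
```

In this toy example the near-match "indla" moves above the visually similar but textually different "indigo", while the weak visual match stays last.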
Fusion • Fusion techniques: • Hybrid fusion [Chart: mAP for OCR and BoVW]
Fusion • Fusion techniques: • Hybrid fusion • Re-querying BoVW using OCR retrieved results. • Using rank aggregation techniques. [Chart: mAP for BoVW]
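Rank aggregation over several re-query results can be sketched with a Borda count. The slides do not name the aggregation technique used, so Borda count is an assumption chosen for illustration; each input list stands for the BoVW ranking obtained by re-querying with one OCR-retrieved word image.

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Merge several ranked lists into one using Borda-count points."""
    if not rankings:
        return []
    n = max(len(r) for r in rankings)
    scores = defaultdict(float)
    for ranking in rankings:
        for position, item in enumerate(ranking):
            scores[item] += n - position  # top rank earns the most points
    # Highest total first; ties broken by item id for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))

# One ranked list per re-query (illustrative ids).
rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
merged = borda_aggregate(rankings)
```

Items that rank well across many re-queries rise to the top, while an item that appears in only one list, such as "d" here, stays near the bottom.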
Experimental Details • OCR: [1] • Feature detector: Harris interest point detection [2] • Feature descriptor: SIFT [2] • Indexing: Lucene [3] [1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011. [2] http://www.vlfeat.org [3] http://lucene.apache.org/
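The idea behind indexing visual words with a text search engine can be illustrated with a toy inverted index. This is plain Python written for illustration, not the Lucene API; it assumes each word image has already been converted to a list of visual-word ids, and the function names are hypothetical.

```python
from collections import defaultdict

def build_index(word_images):
    """Map each visual-word id to the set of word images containing it."""
    index = defaultdict(set)
    for image_id, visual_words in word_images.items():
        for vw in visual_words:
            index[vw].add(image_id)
    return index

def candidates(index, query_words):
    """Word images sharing visual words with the query, most shared first."""
    votes = defaultdict(int)
    for vw in set(query_words):
        for image_id in index.get(vw, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))

word_images = {
    "img1": [3, 7, 7, 9],
    "img2": [7, 12],
    "img3": [1, 2],
}
index = build_index(word_images)
hits = candidates(index, [7, 9])
```

Only the posting lists of the query's visual words are touched, so lookup cost depends on the query rather than on the size of the whole collection.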
Test Bed [Figure: sample word images from the DLI corpus] • In addition, we used the fully annotated HP1 and TP1 datasets.
Evaluation Measures • Precision = TP / (TP + FP) • Recall = TP / (TP + FN) • mAP (Mean Average Precision): mean, over all queries, of the area under the precision-recall curve. • Precision @ 10: how accurate the top 10 retrieved results are. TP = True Positive, FP = False Positive, FN = False Negative [Figure: precision-recall curve]
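These measures can be computed as in the sketch below. It uses the common non-interpolated form of average precision as an approximation of the area under the precision-recall curve; the slides do not say which variant was used, and the toy rankings are illustrative.

```python
def average_precision(ranked, relevant):
    """Non-interpolated AP: mean of precision at each relevant hit."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean AP over (ranked_list, relevant_items) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, g) for r, g in runs) / len(runs)

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    relevant = set(relevant)
    return sum(1 for item in ranked[:k] if item in relevant) / k

runs = [
    (["a", "x", "b"], ["a", "b"]),  # query 1: hits at ranks 1 and 3
    (["x", "c"], ["c"]),            # query 2: hit at rank 2
]
ap_first = average_precision(*runs[0])
map_score = mean_average_precision(runs)
p_at_2 = precision_at_k(runs[0][0], runs[0][1], k=2)
```

Average precision rewards rankings that place relevant items early, which is why it separates OCR, BoVW and fused result lists even when they retrieve the same items.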