Marathi Portal Search Engine: Enhancing Information Retrieval

The Marathi Portal with a Search Engine • Center for Indian Language Technology Solutions, IIT Bombay

A Search Engine • To promote the use of information available on the web in the Marathi language • Locate the pages that the user needs • Present the pages to the user in order of importance

Types of Searches • Based on user queries • Category based search • Browse through pre-classified categories • Search selected literature which will be hosted on the Marathi Portal

Search Engine: Performance Criteria • Coverage • Cover as many pages as possible. A study has revealed that a large part of the web remains un-indexed • Response time • The user should be presented with the results as quickly as possible • Relevance • The information presented should be relevant and ordered by importance

Main Components of a Search Engine • Crawling unit • Indexing unit • Searching unit • Ranking unit

A Prototype • A prototype has been developed to gauge the complexity and architectural issues involved in developing the complete Marathi Portal

About the Prototype • A search engine prototype has been built with manually selected sites in different categories • It indexes about 1800 pages consisting of over 10,14,000 (1,014,000) words • The engine is developed on the Windows platform using MS Access • Monolingual ISFOC pages are covered

Ranking Criteria used in the prototype • Number of words in the query string that appear in the document • In an OR search, documents containing the largest number of query words are ranked higher • Proximity between words • Number of query words that occur within a distance of 5 words of each other • Context of the word • Is it in the title or the body? • Frequency of the desired word in the document • Number of occurrences of the word
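The four criteria above can be sketched in Python as a feature tuple per document. This is only an illustration, not the prototype's code: the slide does not say how the criteria are combined, so the sketch assumes they are compared in the order listed, and it assumes "proximity" means pairs of distinct query words that occur within 5 positions of each other in the body. The names `rank_features` and `rank` and the document layout (`title` and `body` token lists) are hypothetical.

```python
from itertools import combinations


def rank_features(query, title, body, window=5):
    """Return (matched, close_pairs, in_title, frequency) for one document.

    query, title and body are lists of tokens. The tuple order mirrors the
    order of the criteria on the slide (an assumption of this sketch).
    """
    terms = set(query)
    # Positions of every query word inside the document body.
    positions = {t: [i for i, w in enumerate(body) if w == t] for t in terms}
    # Criterion 1: how many query words appear anywhere in the document.
    matched = sum(1 for t in terms if positions[t] or t in title)
    # Criterion 2: pairs of distinct query words within `window` positions.
    close_pairs = sum(
        1
        for a, b in combinations(sorted(terms), 2)
        if any(abs(i - j) <= window for i in positions[a] for j in positions[b])
    )
    # Criterion 3: context, counted here as query words found in the title.
    in_title = sum(1 for t in terms if t in title)
    # Criterion 4: total occurrences of query words in the body.
    frequency = sum(len(p) for p in positions.values())
    return (matched, close_pairs, in_title, frequency)


def rank(query, docs):
    """Sort documents from most to least relevant by their feature tuples."""
    return sorted(
        docs,
        key=lambda d: rank_features(query, d["title"], d["body"]),
        reverse=True,
    )
```

Comparing tuples lexicographically keeps the sketch free of tuning weights; a weighted sum of the same four features would be an equally plausible reading of the slide.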

A Fast Engine is under Development • A fast Linux-based prototype for the same number of pages is being developed • It takes 2 minutes to build the dictionary, 2 hours to build the index and less than a second to search

What if the Machine that hosts the engine fails? • The index must be in main memory while a search is being performed • You cannot afford to lose the index, since it would take days (even months for large engines) to build it again for a large number of pages • Dumping the index of the Linux prototype through traversal takes around 35 minutes • But loading it back into main memory takes only 2 minutes!
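The dump-and-reload idea can be sketched as follows. The slides do not describe the on-disk format of the Linux prototype, so this sketch assumes a plain Python dictionary as the index and uses the standard `pickle` module. The function names `dump_index` and `load_index` are hypothetical. The snapshot is written to a temporary file first and then renamed, so a crash during the dump does not destroy the previous snapshot.

```python
import os
import pickle
import tempfile


def dump_index(index, path):
    """Write a snapshot of the in-memory index to `path` atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)
        # Replace the old snapshot only after the new one is complete.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_index(path):
    """Load a snapshot written by dump_index back into main memory."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

After a machine failure, a call such as `index = load_index("index.pkl")` restores the last snapshot without re-crawling or re-indexing; only pages indexed after that snapshot would need to be processed again.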

Requirements from the Infrastructure for the actual Portal • High RAM, in GBs • High Computing Power: Parallel Processing through a network of workstations • Parallel IO • As the number of users increases, more and more parallelism will have to be employed to guarantee the same performance criteria to each user

Representations and Fonts • Currently only ISFOC is supported • There are sites in Marathi with different types of encodings which need to be integrated • Converters • Input/Display technology for Linux

Crawling • Crawling and meta-crawling techniques • Some interesting facts: • E.g. it was found that the word ‘Aahe’ is one of the most widely occurring words • The words ‘Aahe’ and ‘Aani’ together cover most of the documents • There are specific words that occur most widely and most frequently in different categories
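Observations like these can be derived from simple corpus statistics. The sketch below assumes each crawled page has already been tokenized into a list of words. The function names and the transliterated sample tokens are illustrative only and do not come from the prototype. `document_frequency` counts in how many documents each word occurs, and `coverage` gives the fraction of documents that contain at least one word from a given set.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for every word, the number of documents it occurs in."""
    df = Counter()
    for tokens in docs:
        # Use a set so a word is counted once per document.
        df.update(set(tokens))
    return df


def coverage(docs, words):
    """Fraction of documents containing at least one of `words`."""
    if not docs:
        return 0.0
    words = set(words)
    hits = sum(1 for tokens in docs if words & set(tokens))
    return hits / len(docs)


# Illustrative data only: four tiny "documents" of transliterated tokens.
docs = [["aahe", "ghar"], ["aani", "pustak"], ["shala"], ["aahe", "aani"]]
print(document_frequency(docs).most_common(2))
print(coverage(docs, ["aahe", "aani"]))
```

Running the same functions per category would show which words are most widespread within each category.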

Indexing and Searching • Incremental • Dynamic • Fast Search • In Memory
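A minimal sketch of an incremental, in-memory index is an inverted index that maps each word to the set of documents containing it. This is an assumed data structure for illustration, since the slides do not specify the one used in the portal, and the class and method names are hypothetical. New pages can be added at any time without rebuilding, and both AND and OR searches reduce to set operations.

```python
from collections import defaultdict


class InvertedIndex:
    """In-memory inverted index: word -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, tokens):
        """Index one document incrementally; no rebuild is needed."""
        for token in tokens:
            self.postings[token].add(doc_id)

    def search(self, terms, mode="and"):
        """Return ids of documents matching all ("and") or any ("or") terms."""
        sets = [self.postings.get(t, set()) for t in terms]
        if not sets:
            return set()
        if mode == "and":
            return set.intersection(*sets)
        return set.union(*sets)


index = InvertedIndex()
index.add(1, ["marathi", "portal"])
index.add(2, ["marathi", "search"])
print(index.search(["marathi", "portal"]))        # AND search
print(index.search(["portal", "search"], "or"))   # OR search
```

Each lookup touches only the posting sets of the query words, which is what keeps search time small once the index is in memory.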

Relevancy • What the user really wants • Heuristics for ranking results • Query modification

Selected Texts • Saint Tukarama’s Abhangs will be made searchable and will be hosted on this website • Search on other selected texts will also be hosted on this website
