
Sandhan

Sandhan. Indian language search engine. Sandhan – Consortium Project. IIT Bombay (co-ordinator), CDAC Noida (co-coordinator), CDAC Pune, IIT Kharagpur, Jadavpur University, ISI Kolkata, IIIT Hyderabad, AU KBC, AU CEG, Gauhati University, DAIICT Gujarat, IIIT Bhubaneswar, TDIL.

lovey

Presentation Transcript


  1. Sandhan Indian language search engine

  2. Sandhan – Consortium Project • IIT Bombay (co-ordinator) • CDAC Noida (co-coordinator) • CDAC Pune • IIT Kharagpur • Jadavpur University • ISI Kolkata • IIIT Hyderabad • AU KBC • AU CEG • Gauhati University • DAIICT Gujarat • IIIT Bhubaneswar • TDIL

  3. Introduction • Cross-Lingual Information Retrieval (CLIR) engine for Indian languages • Input: Query in one of nine Indian languages (Hindi, Marathi, Tamil, Telugu, Bengali, Punjabi, Assamese, Gujarati, Oriya) • Output: In Hindi, English and the query language • Currently in the second phase of the project • Three new languages were added in the second phase: Assamese, Gujarati, Oriya • Built on top of the Nutch framework
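The input/output contract on this slide can be sketched in a few lines. This is only an illustration: the function name `output_languages`, the constant `SUPPORTED_QUERY_LANGUAGES`, and the ordering of the result are assumptions made here, not part of Sandhan's code.

```python
from typing import List

# Query languages as listed on the slide (phase one plus the three added in phase two).
SUPPORTED_QUERY_LANGUAGES = (
    "Hindi", "Marathi", "Tamil", "Telugu", "Bengali",
    "Punjabi", "Assamese", "Gujarati", "Oriya",
)


def output_languages(query_language: str) -> List[str]:
    """Return the languages in which results are shown for a query.

    Per the slide, output is in Hindi, English and the query language.
    Putting the query language first is an assumption of this sketch.
    """
    if query_language not in SUPPORTED_QUERY_LANGUAGES:
        raise ValueError(f"Unsupported query language: {query_language}")
    ordered = [query_language, "Hindi", "English"]
    # dict.fromkeys keeps insertion order and drops the duplicate
    # that arises when the query itself is in Hindi.
    return list(dict.fromkeys(ordered))


print(output_languages("Tamil"))
print(output_languages("Hindi"))
```

A Hindi query therefore yields two output languages rather than three, since the query language and Hindi coincide.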

  4. Software Used • Nutch v0.9 – Framework • Hadoop – Distributed Crawling • Lucene – Indexing • Moses/GIZA++ – Training models • Tomcat – Deployment

  5. [System architecture diagram. Component labels as extracted: Web, Font Transcoder, Information Extraction, Analyzer, Fetcher, Language Identifier, Snippet Translation, NE Lookup, CMLifier, Domain Identifier, Summary Generation, MWE Lookup, Indexer, Snippet Generation, Translation/Transliteration, Query Formulation, Index, UNL Index. NE Lookup, MWE Lookup and Analyzer each appear twice in the diagram.]

  6. Resources Developed • Language-specific analyzers • Stop word list • Bilingual dictionary (X-English, X-Hindi) • NE list • MWE list • Transliteration models
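The slides do not say how these resources are combined, so the following is only a minimal sketch of one common arrangement in dictionary-based cross-lingual query formulation: group multiword expressions, drop stop words, translate dictionary terms, and pass named entities and unknown terms to transliteration. All data is toy romanized Hindi invented for illustration, every function name is hypothetical, and `transliterate` is a placeholder that returns its input unchanged rather than a trained model.

```python
from typing import Dict, List, Set, Tuple

# Toy stand-ins for the resources on the slide (not Sandhan's actual data).
STOP_WORDS: Set[str] = {"ka", "ki", "ke", "mein"}
MWE_LIST: Set[Tuple[str, ...]] = {("taj", "mahal")}
NE_LIST: Set[str] = {"agra"}
HINDI_ENGLISH_DICT: Dict[str, str] = {"itihas": "history", "mausam": "weather"}


def group_mwes(tokens: List[str], mwes: Set[Tuple[str, ...]]) -> List[str]:
    """Greedily merge the longest known multiword expression at each position."""
    max_len = max((len(m) for m in mwes), default=1)
    units: List[str] = []
    i = 0
    while i < len(tokens):
        # Try candidate lengths from longest down to 2 tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in mwes:
                units.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword expression starts here: keep the single token.
            units.append(tokens[i])
            i += 1
    return units


def transliterate(term: str) -> str:
    """Placeholder: a real system would apply a transliteration model here."""
    return term


def formulate_query(query: str) -> List[str]:
    """Turn a source-language query into target-language search terms."""
    units = group_mwes(query.lower().split(), MWE_LIST)
    terms: List[str] = []
    for unit in units:
        if unit in STOP_WORDS:
            continue  # stop words carry no retrieval value
        if " " in unit or unit in NE_LIST:
            terms.append(transliterate(unit))  # MWEs and names are not translated
        elif unit in HINDI_ENGLISH_DICT:
            terms.append(HINDI_ENGLISH_DICT[unit])
        else:
            terms.append(transliterate(unit))  # out-of-vocabulary fallback
    return terms


print(formulate_query("Agra mein Taj Mahal ka itihas"))
```

Grouping multiword expressions before stop word removal is a design choice of this sketch: it keeps an expression intact even if one of its parts would otherwise be filtered out.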

  7. IIT Bombay Participation • Marathi Vertical • Code Integration and Maintenance • MWE Identification • Development of Tracker • Error Analysis • Relevance Judgement

  8. Action Plan • Public release of monolingual search for 5 languages on April 14th, 2012 • Bengali, Hindi, Marathi, Tamil, Telugu • Public release of monolingual search for the remaining 4 languages and cross-lingual search for 5 languages on August 15th, 2012 • Assamese, Gujarati, Oriya, Punjabi (monolingual) • Bengali, Hindi, Marathi, Tamil, Telugu (cross-lingual)

  9. Horizontal Tasks Distribution

  10. Distribution of Vertical Tasks

  11. Key Achievements • Organized the Forum for Information Retrieval Evaluation (FIRE) in 2008, 2010 and 2011, a workshop for CLIR evaluation for Indian languages • Demonstrated a basic integrated version of the system at IJCNLP 2008 and ELITEX 2008. Media coverage by the newspapers 'The Indian Express' and 'Hindustan Times' (http://www.cfilt.iitb.ac.in/pb_1.JPG) (http://www.cfilt.iitb.ac.in/04_04_2009_010_007.jpg) • Development of a strong and connected research community around CLIR in Indian languages • Publications in top IR and NLP forums
