
CALTS, UNIV. OF HYDERABAD. SAP, LANGUAGE TECHNOLOGY



Presentation Transcript


  1. CALTS, UNIV. OF HYDERABAD. SAP, LANGUAGE TECHNOLOGY
     CALTS has been in NLP for over a decade. It has participated in the following major projects:
     1. NLP-TTP, DOE Govt. of India.
     2. IPDA, DOE Govt. of India.
     3. TRCT, TDIL, MCIT.
     4. English-Telugu, T2TMT UPE, UGC, UOH.

  2. 1. Morphological Analyzer cum Spell Checker for Telugu
     • A robust morphological analyzer cum spell checker for Telugu.
     • Achieves a 97% recognition rate.
     • Tested on corpora of 5 million words.
     • Available for users of Windows and Linux.

  3. 2. MEET: A Multilingual Encyclopedic Electronic Thesaurus for translators, a Web-based linguistic application
     • MEET enables quick access to synonyms.
     • Provides equivalents in other Indian languages and English.
     • Also provides grammatical and semantic information.
     • A useful application for translators.
     • Provides access to information in Indian languages on the web.
     • Currently includes only Marathi, Hindi, Bangla, Konkani and English.
     • The second phase proposes to include Telugu, Kannada and Oriya.
     • WordNets for individual languages may be linked to the system.

  4. 3. Telugu Hyper Grammar
     • The Telugu Hyper Grammar is designed as a dynamically accessed, non-linearly organized grammar of Telugu.
     • A user can access the information in any module from any other module.
     • Provides access to a Morphological Analyzer, a Generator and a Chunker.
     • Can access various bilingual and bidirectional digital lexica of Telugu and other languages such as Hindi, Kannada, Tamil, Marathi, Oriya, Malayalam and English.

  5. 4. English-Telugu Parallel Corpora
     • Parallel corpora are sets of thematically corresponding digital texts of selected works.
     • The use of parallel corpora has revolutionized recent work in Machine Translation.
     • Parallel corpora make it possible to discover similarities and differences between a pair of languages.
     • A program for aligning parallel texts in English and Telugu has been developed and is being tested.
     • Selected parallel texts in Telugu, Kannada, Tamil, Marathi and Malayalam have been digitized.

  6. 5. English-Telugu T2T Machine Translation System
     • The English-Telugu Machine Translation System is being built at CALTS in collaboration with IIIT, Hyderabad; Telugu University, Hyderabad; and Osmania University, Hyderabad.
     • Uses a 42K English-Telugu MAT lexicon.
     • A wordform synthesizer for Telugu has been developed and incorporated.
     • Incorporates an evolutionary semantic lexicon.
     • Handles English sentences of varying complexity.

  7. 6. MAT Lexica
     • Bilingual and multidirectional.
     • Machine Readable Dictionaries of 10K for Telugu-Hindi, Telugu-Kannada, Telugu-Tamil, Telugu-Marathi, Telugu-Oriya, Telugu-Bangla and Telugu-Malayalam are being developed in collaboration with the Telugu Academy.
     • Entries were selected on the basis of their frequency of occurrence in the Telugu corpus.
     • The Telugu-Hindi, Telugu-Kannada and Telugu-Tamil dictionaries are already complete.
     • A major part of these dictionaries was developed by realigning the lexical resources existing at CALTS.

  8. 7. Collocations in Indian Languages
     • Collocations, or specialized word sequences, play a crucial role in a language. They are extremely difficult to identify and to translate effectively, and they present one of the most challenging tasks in Natural Language Processing.
     • In the first phase, Telugu data was collected and analyzed.
     • A long list of collocations was collected and used to check whether the existing criteria are valid.
     • These collocations are compared against other specialized word sequences in the language to understand their functional and distributional properties.

  9. 8. Machine Readable Dictionary of Idioms (Telugu-English)
     • Idioms are among the most important and ubiquitous, yet least understood, categories of language.
     • Machine-readable idioms in English, their equivalents in Telugu, and the mechanics of their recognition and transfer rules are being developed.
     • The machine-readable text will be implemented in XML so that access and retrieval become easier and faster.
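The XML storage mentioned above could look roughly like the following minimal sketch. The element names (`idioms`, `idiom`, `source`, `target`), the placeholder Telugu strings, and the helper functions `build_index` and `lookup` are all assumptions made for illustration; the slides do not describe the actual CALTS schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema and placeholder targets; not the CALTS format.
SAMPLE_XML = """
<idioms>
  <idiom id="1">
    <source lang="en">kick the bucket</source>
    <target lang="te">TELUGU-EQUIVALENT-1</target>
  </idiom>
  <idiom id="2">
    <source lang="en">spill the beans</source>
    <target lang="te">TELUGU-EQUIVALENT-2</target>
  </idiom>
</idioms>
"""


def build_index(xml_text):
    """Parse the XML once and index idioms by lower-cased source phrase."""
    root = ET.fromstring(xml_text.strip())
    index = {}
    for entry in root.findall("idiom"):
        source = entry.findtext("source")
        target = entry.findtext("target")
        if source and target:
            index[source.strip().lower()] = target.strip()
    return index


def lookup(index, phrase):
    """Return the stored equivalent for an idiom, or None if absent."""
    return index.get(phrase.strip().lower())


if __name__ == "__main__":
    idiom_index = build_index(SAMPLE_XML)
    print(lookup(idiom_index, "Kick the Bucket"))
```

Indexing the parsed entries in a dictionary is one way to get fast retrieval; recognition of idioms inside running text and the transfer rules are outside the scope of this sketch.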

  10. 9. Electronic Adult Literacy Primer for Telugu
     • Developed as part of CALTS's participation in Arohan (a literacy campaign adopted by the university).
     • Aimed at teaching the script, or written form of the language, rather than the language itself.
     • Based on the frequency of characters in written texts.
     • Learning a small number of the most frequent characters first ensures greater coverage in character recognition.
     • Special features include characters with animation and speech.
     • Special attention is given to the presentation of allographs.
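The frequency-based ordering described above can be illustrated with a short sketch. It assumes a plain-text sample and counts individual code points; a real primer for Telugu would more likely count whole written syllables, and the function name `teaching_order` is hypothetical.

```python
from collections import Counter


def teaching_order(text):
    """Return (character, cumulative coverage) pairs, most frequent first.

    Coverage is the fraction of non-space characters in `text` that a
    learner could recognize after learning the characters up to that point.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    order = []
    covered = 0
    for ch, n in counts.most_common():
        covered += n
        order.append((ch, covered / total))
    return order


if __name__ == "__main__":
    # Toy sample in Roman letters, used only to show the shape of the output.
    for ch, coverage in teaching_order("aaaa bb c"):
        print(ch, round(coverage, 2))
```

In the toy sample, learning just the single most frequent character already covers more than half of the text, which is the effect the primer relies on.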

  11. 10. A generic system for morphological generation for Indian languages
     • Morphological generators for various Indian languages, particularly Telugu, Kannada, Tamil, Malayalam, Bangla and Oriya, are in different stages of development.
     • Provides a generic framework for wordform synthesis for Indian languages.
     • Includes a testing module to measure the efficiency and coverage of the system.

  12. 11. Telugu-Tamil Machine Translation System
     • A Telugu-Tamil MT system is being developed using the resources available at CALTS.
     • Uses the Telugu Morphological Analyzer.
     • Uses the Tamil generator developed at CALTS.
     • Uses the Telugu-Tamil dictionary developed as part of MAT Lexica.
     • Uses a verb sense disambiguator based on verb argument structure.

  13. 12. Word Sense Disambiguation using Argument Structure
     • A system based on the argument structure of Telugu verbs.
     • Uses a feature-based semantic lexicon.
     • Efficiently disambiguates polysemous verbs in context.
     • Is incorporated in the Telugu-Tamil MT system.
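As a rough illustration of the idea, a verb sense can be selected by checking the semantic features of its arguments against the frame each sense expects. Everything below is an assumption made for the sketch: the English stand-in words, the feature names, the frames, and the function `disambiguate`. The slides do not describe the actual lexicon or algorithm used for Telugu.

```python
# Hypothetical feature-based semantic lexicon (English stand-ins).
NOUN_FEATURES = {
    "boy": {"animate", "human"},
    "engine": {"inanimate", "machine"},
}

# Each sense lists the roles it takes and the features each role requires.
VERB_SENSES = {
    "run": [
        ("move_fast", {"subject": {"animate"}}),
        ("operate", {"subject": {"human"}, "object": {"machine"}}),
    ],
}


def disambiguate(verb, arguments):
    """Return the first sense whose argument frame fits, else None.

    `arguments` maps role names to nouns, e.g. {"subject": "boy"}.
    A frame fits when it has exactly the same roles as the sentence
    and every role's noun carries all the required features.
    """
    for sense, frame in VERB_SENSES.get(verb, []):
        if frame.keys() != arguments.keys():
            continue
        if all(
            required <= NOUN_FEATURES.get(arguments[role], set())
            for role, required in frame.items()
        ):
            return sense
    return None


if __name__ == "__main__":
    print(disambiguate("run", {"subject": "boy"}))
    print(disambiguate("run", {"subject": "boy", "object": "engine"}))
```

The first call selects the intransitive motion sense and the second the transitive "operate" sense, because only that frame has an object slot requiring a machine.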

  14. 13. A case-sensitive Roman transliteration for Indian languages as an overall pattern
     • A Roman transliteration scheme for the unwritten languages of India has been developed.
     • A common transliteration scheme for scripts that are Brahmi derivatives and those that are not has been developed.
     • Suprasegmentals are mapped onto Roman characters.
     • No non-unique character mappings.
     • Allows complete conversion between various languages.
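The two properties listed above, case sensitivity and unique mappings, are what make a scheme reversible. The sketch below shows this with a tiny hypothetical table of six Telugu consonant letters, using lower case for dentals and upper case for retroflexes. It is not the CALTS scheme, and it ignores vowels and vowel signs for brevity.

```python
# Hypothetical mini-table, not the CALTS scheme.
TO_ROMAN = {
    "\u0C24": "t",  # TELUGU LETTER TA (dental)
    "\u0C1F": "T",  # TELUGU LETTER TTA (retroflex)
    "\u0C26": "d",  # TELUGU LETTER DA (dental)
    "\u0C21": "D",  # TELUGU LETTER DDA (retroflex)
    "\u0C28": "n",  # TELUGU LETTER NA (dental)
    "\u0C23": "N",  # TELUGU LETTER NNA (retroflex)
}


def invert(table):
    """Build the reverse table, rejecting any non-unique mapping."""
    inverse = {}
    for source, roman in table.items():
        if roman in inverse:
            raise ValueError(f"non-unique mapping for {roman!r}")
        inverse[roman] = source
    return inverse


FROM_ROMAN = invert(TO_ROMAN)


def to_roman(text):
    """Transliterate mapped characters; pass everything else through."""
    return "".join(TO_ROMAN.get(ch, ch) for ch in text)


def from_roman(text):
    """Convert Roman symbols back to the original script."""
    return "".join(FROM_ROMAN.get(ch, ch) for ch in text)


if __name__ == "__main__":
    sample = "\u0C24\u0C1F\u0C28\u0C23"
    roman = to_roman(sample)
    print(roman, from_roman(roman) == sample)
```

Because `invert` refuses duplicate Roman symbols, every transliterated string built from mapped characters converts back to exactly the original, which is the property that allows conversion between scripts through the Roman form.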
