CALTS, UNIV. OF HYDERABAD. SAP, LANGUAGE TECHNOLOGY. CALTS has been in NLP for over a decade. It has participated in the following major projects: 1. NLP-TTP, DOE Govt. of India. 2. IPDA, DOE Govt. of India. 3. TRCT, TDIL, MCIT 4. English-Telugu, T2TMT UPE, UGC, UOH.
1. Morphological Analyzer cum Spell Checker for Telugu
• A robust morphological analyzer cum spell checker for Telugu.
• Achieves a 97% recognition rate.
• Tested on corpora of 5 million words.
• Available for users of Windows and Linux.
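As a rough illustration of how a morphological analyzer can double as a spell checker, the sketch below accepts a word only if it splits into a known root plus a known suffix, and reports the share of tokens recognized. The toy English roots and suffixes and the names `analyze` and `recognition_rate` are assumptions made for illustration; they are not the CALTS lexicon or implementation.

```python
from typing import List, Optional, Tuple

# Hypothetical toy lexicon; a real analyzer would load Telugu roots and suffix paradigms.
ROOTS = {"walk", "talk", "book"}
SUFFIXES = {"", "s", "ed", "ing"}


def analyze(word: str) -> Optional[Tuple[str, str]]:
    """Return (root, suffix) if the word splits into a known root and suffix, else None."""
    # Try the longest candidate root first, then progressively shorter ones.
    for cut in range(len(word), 0, -1):
        root, suffix = word[:cut], word[cut:]
        if root in ROOTS and suffix in SUFFIXES:
            return root, suffix
    return None


def recognition_rate(tokens: List[str]) -> float:
    """Fraction of tokens the analyzer recognizes; the rest are spelling suspects."""
    if not tokens:
        return 0.0
    recognized = sum(1 for token in tokens if analyze(token) is not None)
    return recognized / len(tokens)
```

A word for which `analyze` returns `None` would be flagged as a possible misspelling; a recognition rate like the one quoted above would come from running such a check over a large corpus.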
2. A Multilingual Encyclopedic Electronic Thesaurus for translators (MEET), a Web-based linguistic application
• MEET enables quick access to various synonyms.
• Provides equivalents in other Indian languages and English.
• Also provides grammatical and semantic information.
• A useful application for translators.
• Provides access to information in Indian languages on the web.
• Currently includes only Marathi, Hindi, Bangla, Konkani and English.
• The second phase proposes to include Telugu, Kannada and Oriya.
• WordNets for individual languages may be linked to the system.
3. Telugu Hyper Grammar
• The Telugu Hyper Grammar is designed as a dynamically accessed and non-linearly organized grammar of Telugu.
• A user can access information in a particular module from any other module.
• Provides access to a morphological analyzer, a generator and a chunker.
• Can access various bilingual and bidirectional digital lexica of Telugu and other languages such as Hindi, Kannada, Tamil, Marathi, Oriya, Malayalam and English.
4. English-Telugu Parallel Corpora
• Parallel corpora are sets of thematically corresponding digital texts of selected works.
• The use of parallel corpora has revolutionized recent work in machine translation.
• Parallel corpora make it possible to discover similarities and differences between a pair of languages.
• A program for aligning parallel texts in English and Telugu has been developed and is being tested.
• Selected parallel texts in Telugu, Kannada, Tamil, Marathi and Malayalam have been digitized.
5. English-Telugu T2T Machine Translation System
• An English-Telugu machine translation system is being built at CALTS in collaboration with IIIT, Hyderabad; Telugu University, Hyderabad; and Osmania University, Hyderabad.
• Uses an English-Telugu MAT lexicon of 42K entries.
• A wordform synthesizer for Telugu has been developed and incorporated.
• It incorporates an evolutionary semantic lexicon.
• It handles English sentences of varying complexity.
6. MAT Lexica
• Bilingual and multidirectional.
• Machine-readable dictionaries of 10K entries for Telugu-Hindi, Telugu-Kannada, Telugu-Tamil, Telugu-Marathi, Telugu-Oriya, Telugu-Bangla and Telugu-Malayalam are being developed in collaboration with the Telugu Academy.
• Entries were selected by their frequency of occurrence in the Telugu corpus.
• The Telugu-Hindi, Telugu-Kannada and Telugu-Tamil dictionaries are already complete.
• A major part of these dictionaries was developed by realigning the lexical resources existing at CALTS.
7. Collocations in Indian Languages
• Collocations, or specialized word sequences, play a crucial role in a language. They are extremely difficult to identify and to translate effectively, and they present one of the most challenging tasks in Natural Language Processing.
• In the first phase, Telugu data was collected and analyzed.
• A long list of collocations was collected and used to check whether the existing criteria are valid.
• These collocations are compared against other specialized word sequences in the language to understand their functional and distributional properties.
8. Machine Readable Dictionary of Idioms (Telugu-English)
• Idioms are extremely important and ubiquitous, yet they are among the less understood categories of language.
• Machine-readable idioms in English, their equivalents in Telugu, and the rules for their recognition and transfer are being developed.
• The machine-readable text will be implemented in XML so that access and retrieval become easier and faster.
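To make the XML idea concrete, here is a minimal sketch of how one idiom entry might be stored and looked up. The element names (`idioms`, `idiom`, `form`, `meaning`, `equivalent`) are a hypothetical schema invented for this example, not the CALTS format, and the Telugu equivalent is left as a placeholder string rather than guessed.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def make_entry(idiom: str, meaning: str, telugu: str) -> ET.Element:
    """Build one <idiom> entry under the hypothetical schema described above."""
    entry = ET.Element("idiom", attrib={"lang": "en"})
    ET.SubElement(entry, "form").text = idiom
    ET.SubElement(entry, "meaning").text = meaning
    ET.SubElement(entry, "equivalent", attrib={"lang": "te"}).text = telugu
    return entry


def lookup(root: ET.Element, idiom: str) -> Optional[str]:
    """Return the stored Telugu equivalent for an English idiom, or None if absent."""
    for entry in root.findall("idiom"):
        if entry.findtext("form") == idiom:
            return entry.findtext("equivalent")
    return None


# Placeholder data: the Telugu text is deliberately not filled in.
dictionary = ET.Element("idioms")
dictionary.append(
    make_entry("kick the bucket", "to die", "TELUGU_EQUIVALENT_PLACEHOLDER")
)

# Serialize to a string, as one would before writing the dictionary to disk.
xml_text = ET.tostring(dictionary, encoding="unicode")
```

Keeping the surface form, the meaning and the equivalent in separate elements lets a retrieval tool search on any of them without parsing free text.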
9. Electronic Adult Literacy Primer for Telugu
• Developed as part of CALTS's participation in Arohan (a literacy campaign adopted by the university).
• Aimed at teaching the script, that is, the written form of the language, rather than the language itself.
• Based on the frequency of characters in written texts.
• Learning a small number of the most frequent characters first gives the learner the greatest coverage of written text.
• Special features include characters with animation and speech.
• Special attention is given to the presentation of allographs.
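The frequency argument can be sketched in a few lines: rank characters by how often they occur and track the cumulative share of text covered. This is only an illustration under simplifying assumptions; it counts individual Unicode code points, whereas a Telugu primer would more plausibly count written syllables, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def coverage_order(text: str) -> List[Tuple[str, float]]:
    """List characters from most to least frequent with cumulative text coverage."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    ordered = []
    running = 0
    for ch, n in counts.most_common():
        running += n
        ordered.append((ch, running / total))
    return ordered


def characters_for(text: str, target: float) -> List[str]:
    """Smallest frequency-ranked set of characters whose coverage reaches `target`."""
    chosen = []
    for ch, covered in coverage_order(text):
        chosen.append(ch)
        if covered >= target:
            break
    return chosen
```

For example, `characters_for(corpus_text, 0.8)` would give the characters to teach first in order to recognize 80% of the character occurrences in `corpus_text`.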
10. A generic system for morphological generation for Indian languages
• Morphological generators for various Indian languages, in particular Telugu, Kannada, Tamil, Malayalam, Bangla and Oriya, are at different stages of development.
• Provides a generic framework for wordform synthesis for Indian languages.
• Includes a testing module to measure the efficiency and coverage of the system.
11. Telugu-Tamil Machine Translation System
• A Telugu-Tamil MT system is being developed using the resources available at CALTS.
• Uses the Telugu morphological analyzer.
• Uses the Tamil generator developed at CALTS.
• Uses the Telugu-Tamil dictionary developed as part of MAT Lexica.
• Uses a verb sense disambiguator based on verb argument structure.
12. Word Sense Disambiguation using Argument Structure
• A system based on the argument structure of Telugu verbs.
• Uses a feature-based semantic lexicon.
• Efficiently disambiguates polysemous verbs in context.
• Is incorporated in the Telugu-Tamil MT system.
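One way to picture disambiguation by argument structure: each verb sense states which semantic features its arguments must carry, and the sense whose constraints fit the actual arguments is chosen. The sketch below uses invented English data and hypothetical names (`SENSES`, `NOUN_FEATURES`, `disambiguate`); it illustrates the general idea only and is not the CALTS system or its lexicon.

```python
from typing import Dict, List, Optional, Set

# Hypothetical feature-based lexicon: features each noun carries.
NOUN_FEATURES: Dict[str, Set[str]] = {
    "boy": {"animate", "human"},
    "engine": {"machine"},
}

# Hypothetical verb senses: features required of subject and object.
# An object requirement of None means the sense takes no object.
SENSES: Dict[str, List[dict]] = {
    "run": [
        {"sense": "move_fast", "subject": {"animate"}, "object": None},
        {"sense": "operate", "subject": {"animate"}, "object": {"machine"}},
    ]
}


def _fits(required: Optional[Set[str]], noun: Optional[str]) -> bool:
    """True if the noun (or its absence) satisfies the required feature set."""
    if required is None:
        return noun is None
    if noun is None:
        return False
    return required <= NOUN_FEATURES.get(noun, set())


def disambiguate(verb: str, subject: str, obj: Optional[str] = None) -> Optional[str]:
    """Return the first sense whose argument constraints match, else None."""
    for frame in SENSES.get(verb, []):
        if _fits(frame["subject"], subject) and _fits(frame["object"], obj):
            return frame["sense"]
    return None
```

With this toy data, `disambiguate("run", "boy")` selects the intransitive motion sense, while adding the object `"engine"` selects the "operate" sense.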
13. A case-sensitive Roman transliteration for Indian languages as an overall pattern
• A Roman transliteration scheme for the unwritten languages of India has been developed.
• A common transliteration scheme for Brahmi-derived and non-Brahmi-derived scripts has been developed.
• Suprasegmentals are mapped onto Roman characters.
• No non-unique character mappings: each character maps to exactly one Roman symbol.
• Allows complete conversion between various languages.
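The value of a case-sensitive, one-to-one mapping is that conversion is lossless in both directions. The sketch below shows this for six Telugu independent vowels, using upper case for the long vowels. The table is an illustrative assumption, not the CALTS scheme, and the function names are hypothetical.

```python
# Illustrative mapping only: short vowels in lower case, long vowels in upper case.
TO_ROMAN = {
    "అ": "a", "ఆ": "A",
    "ఇ": "i", "ఈ": "I",
    "ఉ": "u", "ఊ": "U",
}

# Invert the table; equal sizes confirm that no two characters share a Roman symbol.
FROM_ROMAN = {roman: telugu for telugu, roman in TO_ROMAN.items()}
assert len(FROM_ROMAN) == len(TO_ROMAN), "mapping must be one-to-one"


def to_roman(text: str) -> str:
    """Transliterate mapped characters to Roman; leave everything else unchanged."""
    return "".join(TO_ROMAN.get(ch, ch) for ch in text)


def from_roman(text: str) -> str:
    """Convert Roman symbols back to the original script characters."""
    return "".join(FROM_ROMAN.get(ch, ch) for ch in text)
```

Because case carries the length distinction, `to_roman` never merges two source characters, so `from_roman(to_roman(s))` returns `s` for any string made of mapped characters. A full scheme would also need multi-character symbols and suprasegmental marks, which this single-character sketch does not attempt.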