A Simple English-to-Punjabi Translation System By : Shailendra Singh
Introduction • The Internet has influenced multilingualism and the language industry • Internet users need information in a language they understand fully • Machine Translation (MT), a computer system that translates from a source language to a target language, is needed • Current English-to-Punjabi computer translation systems are mostly based on word-for-word translation only • The focus here is on a simple English-to-Punjabi machine translation system
Introduction • English follows Subject-Verb-Object word order • Punjabi follows Subject-Object-Verb word order
Literature Review • Numerous MT strategies have been applied over time • They range from the direct approach to more recent ones such as example-based machine translation
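The word-order difference can be illustrated with a minimal sketch. It assumes the clause has already been split into subject, verb and object, which a real system would need a parser for; the function name `svo_to_sov` is hypothetical.

```python
# Toy illustration only: the clause is assumed to be pre-split into
# subject, verb and object. Real sentences need parsing first.
def svo_to_sov(subject, verb, obj):
    """Reorder an English-style (S, V, O) triple into Punjabi-style (S, O, V)."""
    return (subject, obj, verb)


print(svo_to_sov("I", "eat", "vegetables"))
```

The reordering itself is trivial; the hard part in practice is identifying the three constituents reliably.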
Direct Approach • The most primitive approach • Translation is word-for-word or phrase-for-phrase • Needs a very large bilingual dictionary • Involves very little language analysis because it relies mostly on the dictionary • As a result, translations are often inaccurate and contain many errors • Example system: SYSTRAN
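A rough sketch of word-for-word lookup shows why the direct approach struggles with English-to-Punjabi. The three-entry dictionary is hypothetical, and the romanized Punjabi forms are an assumed reading of the example sentence in the Methodology slides.

```python
# Hypothetical toy dictionary; the romanized Punjabi forms are an assumed
# reading of the example sentence used later in these slides.
TOY_DICTIONARY = {"i": "Meh", "eat": "khaan", "vegetables": "savaji"}


def translate_word_for_word(sentence):
    """Replace each word by its dictionary entry, keeping the source order."""
    words = sentence.lower().split()
    # Words missing from the dictionary are passed through unchanged.
    return " ".join(TOY_DICTIONARY.get(word, word) for word in words)


print(translate_word_for_word("I eat vegetables"))
```

The output keeps the English Subject-Verb-Object order, so the verb lands in the middle instead of at the end where Punjabi expects it.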
Rule Based • Uses many different types of rules, for example syntactic rules, lexical rules, lexical transfer rules, rules for syntactic generation and rules for morphology • Starts by building a morphological tree, which is transformed into a syntactic tree and finally into a semantic tree (Hutchins 1994) • The crucial step is the transformation from source language to target language • All of these rules refer to the particular grammars of the languages involved in the translation • Example systems: Ariane, SUSY
Rule Based • Advantages: deep analysis during the translation process • Disadvantages: 1) Requires a great deal of linguistic knowledge. 2) It is impossible to write rules that cover an entire language. 3) Transformation rules are always specified for a single language pair, so the system is difficult to survey as a whole. 4) Inconsistencies creep in as the number of rules grows, and the rules are costly to maintain
Knowledge Based • Mainly describes a rule-based system displaying extensive semantic and pragmatic knowledge of a domain, including an ability to reason, to some limited extent, about concepts in the domain (Arnold et al. 1994, page 190) • Most features are the same as in the rule-based approach • Distinctive in focusing on a particular domain, which minimizes ambiguity • Example system: KANT, which translates electronic manuals
Knowledge Based • Advantages: avoids ambiguity and so gives good translation quality • Disadvantages: 1) Focuses on a single domain, which limits its capability 2) Enough knowledge about the chosen domain must be available to support the translation
Statistical Based • Sparked by an experiment in 1989 by a team from IBM • The results of the experiment seemed attractive and acceptable • The new methodology was based entirely on statistical methods • Marked a new approach to MT, called ‘corpus-based’ • A vast language corpus is the main component • Words and phrases are aligned using the corpus, and translation probabilities are then calculated (Hutchins 1994) • Example system: Candide
Statistical Based • Disadvantages: 1) Requires training on large, good-quality bilingual corpora 2) Struggles with complex translations because the process becomes too complex to handle
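The "align, then calculate probabilities" step of the statistical approach can be sketched as relative-frequency estimation. This is a simplification that assumes the word alignment has already been done; the aligned pairs below are invented toy data, `<other>` is a placeholder rather than a real Punjabi word, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

# Invented toy data: word pairs that are assumed to be aligned already.
# "<other>" is a placeholder for some competing translation.
ALIGNED_PAIRS = [
    ("eat", "khaan"),
    ("eat", "khaan"),
    ("eat", "khaan"),
    ("eat", "<other>"),
    ("vegetables", "savaji"),
]


def translation_probabilities(pairs):
    """Estimate P(target | source) by relative frequency over aligned pairs."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        counts[source][target] += 1
    probabilities = {}
    for source, target_counts in counts.items():
        total = sum(target_counts.values())
        probabilities[source] = {
            target: n / total for target, n in target_counts.items()
        }
    return probabilities


print(translation_probabilities(ALIGNED_PAIRS)["eat"])
```

Real systems also have to learn the alignments themselves from sentence pairs, which is where the need for large, good-quality corpora comes from.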
Example Based • Translation is done by analogy • Relies on past translation examples • Past translation examples are regarded as syntactically, grammatically and semantically accurate • Translation examples are kept in a store, also known as a corpus; hence this approach is also ‘corpus-based’ • The sentence to be translated is matched against the examples in the database • The closest match is selected and adapted to fit the input sentence
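The matching step can be sketched as a similarity search over stored examples. The slides do not say which similarity measure is used, so token-level matching with Python's standard `difflib.SequenceMatcher` is an assumption here, and the example store (source sides only) is invented for illustration.

```python
from difflib import SequenceMatcher

# Invented example store; only the source-language sides are shown.
EXAMPLES = [
    "I am going to eat vegetables",
    "I am going to school",
    "She is reading a book",
]


def closest_example(sentence, examples):
    """Return the stored example whose token sequence best matches the input."""
    tokens = sentence.lower().split()

    def score(example):
        # ratio() is 2 * matches / total tokens, between 0.0 and 1.0.
        return SequenceMatcher(None, tokens, example.lower().split()).ratio()

    return max(examples, key=score)


print(closest_example("I am going to eat rice", EXAMPLES))
```

Once the closest example is found, the parts that differ from the input (here the object of "eat") are the ones that need to be replaced in its stored translation.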
Methodology • In this paper we take the EBMT approach • Justification for choosing EBMT: • It avoids the problems of tractability, scalability and performance found in older MT strategies, which require extensive knowledge • Very little formal work has been done on the Punjabi language itself, which is a barrier to a rule-based approach • EBMT needs correct and accurate example translations that fit the situation
Methodology • The first step in EBMT is preparing the corpus • The design of the corpus is as follows:
Methodology • The sentence to be translated is “I am going to eat vegetables” • First, the sentence is tokenized • Next come morphological analysis and part-of-speech tagging • Then the key is matched against the ‘corpus’ • In this case the key is “eat”
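These steps can be sketched as a small pipeline. The part-of-speech lexicon, the tag names and the set of corpus keys are all hypothetical stand-ins, and the rule "the key is the first verb that has a corpus entry" is an assumption, since the slides do not specify how the key is chosen.

```python
# Hypothetical mini-lexicon of part-of-speech tags for the example sentence.
POS_LEXICON = {
    "i": "PRON",
    "am": "AUX",
    "going": "VERB",
    "to": "PART",
    "eat": "VERB",
    "vegetables": "NOUN",
}

# Hypothetical set of keys under which corpus examples are stored.
CORPUS_KEYS = {"eat"}


def tokenize(sentence):
    """Split a sentence into lowercase tokens on whitespace."""
    return sentence.lower().split()


def tag(tokens):
    """Attach a part-of-speech tag to each token; unknown words get 'UNK'."""
    return [(token, POS_LEXICON.get(token, "UNK")) for token in tokens]


def find_key(tagged_tokens):
    """Return the first verb that has an entry in the corpus, else None."""
    for token, pos in tagged_tokens:
        if pos == "VERB" and token in CORPUS_KEYS:
            return token
    return None


tagged = tag(tokenize("I am going to eat vegetables"))
print(find_key(tagged))
```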
Methodology • Based on the template, the input sentence has the form NP.P eat N and the output has the form NP.N khaan P • The template slots are then filled by lexicon lookup
Methodology • “I am going to eat vegetables” • “Meh savaji khaan lageaa hai”
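Template filling for this example can be sketched as follows. The sketch reads the template as source slots NP, P, key, N mapped to target order NP, N, key, P, which is an interpretation of the slides rather than their stated design. The lexicon entries, including the phrase-level mapping of "am going to" to "lageaa hai", and the single-word NP are assumptions made only so the example runs.

```python
# Assumed reading of the template: source "NP P eat N" -> target "NP N khaan P".
TEMPLATE = {
    "key": "eat",
    "target_key": "khaan",
    "target_order": ["NP", "N", "KEY", "P"],
}

# Hypothetical lexicon; the phrase-level entry is an assumption.
LEXICON = {"i": "Meh", "am going to": "lageaa hai", "vegetables": "savaji"}


def translate(sentence, template=TEMPLATE, lexicon=LEXICON):
    """Fill the target template by lexicon lookup; return None if no key match."""
    tokens = sentence.lower().split()
    if template["key"] not in tokens:
        return None
    k = tokens.index(template["key"])
    if k < 1:
        return None  # no room for an NP before the key
    slots = {
        "NP": " ".join(tokens[:1]),       # assumes a one-word noun phrase
        "P": " ".join(tokens[1:k]),       # everything between NP and the key
        "N": " ".join(tokens[k + 1:]),    # everything after the key
    }
    # Slot text missing from the lexicon is passed through unchanged.
    filled = {name: lexicon.get(text, text) for name, text in slots.items()}
    filled["KEY"] = template["target_key"]
    return " ".join(filled[name] for name in template["target_order"])


print(translate("I am going to eat vegetables"))
```

Because the word order is stored in the template, the Subject-Object-Verb reordering comes from the example itself and no grammar rules are needed.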
Analysis • Little linguistic knowledge is needed • Faster, because little linguistic processing is involved
Conclusion • EBMT is suitable when little formal linguistic knowledge of the languages involved is available • EBMT works well when deep linguistic analysis is not needed
References • Hutchins, J. (1994), Research methods and system designs in machine translation: a ten-year review, 1984-1994, International conference ‘Machine translation: ten years on’. • Arnold, D., Balkan, L., Humphreys, R. L., Meijer, S. & Sadler, L. (1994), Machine translation: an introductory guide, Blackwells/NCC, London.