New Directions in Machine Translation: Introduction
陳惠群, Academia Sinica, Institute of Linguistics / Institute of Information Science
Why Does MT Matter?
• Economics
  • Costs / Quality / Turnaround
  • Many MT developers, customers, and sponsors have invested heavily for years.
• Politics
  • Multilingual Countries / Minority Languages
• Intelligence Gathering
  • Governments / Companies / Individuals
• Research
  • AI / CS / Linguistics / Psychology / and so on
Recent Trends
• PC-based MT Systems
• Online MT Services, MT on Demand
  • Email, Web pages, Uploads
• Sub-language MT Systems
• Dialog-based (Speech-to-Speech) MT Systems
• Computer-Assisted Translation
Classifying MT Systems
• Operations
  • Fully automatic MT
  • Semi-automatic MT
  • Computer-Assisted Translation (CAT tools)
• Input
  • Unrestricted texts
  • Restricted texts (e.g. technical manuals) / texts written with MT in mind
  • Sub-languages / Controlled languages
• Quality
  • High / Low / Acceptable / Applicable / Readable
  • How to evaluate an MT system?
• Strategies (see next slide)
MT Strategies
• Fundamentals
  • Direct Translation MT
  • Transfer-based MT
  • Interlingua MT
• Linguists vs. Empiricists
• New Strategies
  • Knowledge-based MT
  • Example-based MT
  • Statistics-based MT
  • Hybrid MT
• "Japanese manufacturers know well that a single linguistic theory cannot lead to a good MT system. They realize that a huge amount of language phenomena must be processed in an ad-hoc manner." (M. Nagao)
Direct MT
[Diagram: Source Text → Direct MT → Target Text]
Direct MT components:
• Simple syntactic analysis (disambiguation)
• Bilingual lexicon (word-by-word translation)
• Re-ordering rules
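The three components of direct MT listed on this slide (simple analysis, bilingual lexicon lookup, re-ordering rules) can be sketched roughly as follows. This is a minimal toy, not a description of any real system: the `LEXICON` entries, the part-of-speech tags, and the single adjective-noun swap rule are all illustrative assumptions.

```python
# Toy English-to-French lexicon: word -> (part-of-speech tag, translation).
# All entries are illustrative assumptions, not taken from a real system.
LEXICON = {
    "the": ("DET", "la"),
    "red": ("ADJ", "rouge"),
    "car": ("NOUN", "voiture"),
    "arrives": ("VERB", "arrive"),
}


def direct_translate(sentence):
    """Translate word by word, then apply one local re-ordering rule."""
    tokens = sentence.lower().split()

    # Steps 1 and 2: shallow analysis (tag lookup) and word-by-word
    # translation. Unknown words are passed through unchanged.
    tagged = [LEXICON.get(tok, ("UNK", tok)) for tok in tokens]

    # Step 3: re-ordering rule "ADJ NOUN -> NOUN ADJ".
    i = 0
    while i < len(tagged) - 1:
        if tagged[i][0] == "ADJ" and tagged[i + 1][0] == "NOUN":
            tagged[i], tagged[i + 1] = tagged[i + 1], tagged[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1

    return " ".join(word for _, word in tagged)


print(direct_translate("The red car arrives"))  # la voiture rouge arrive
```

Because nothing beyond the word level is analysed, agreement, ambiguity, and long-range re-ordering are out of reach for a sketch like this, which is the usual motivation for the transfer and interlingua designs.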
Transfer-based MT
[Diagram: Source Text (ST) → ST analysis → ST Structure → structure transfer → TT Structure → TT generation → Target Text (TT)]
• ST analysis uses the SL grammar & lexicon
• Structure transfer uses the SL-TL lexicon & transfer rules
• TT generation uses the TL grammar & lexicon
SL - source language; TL - target language
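The three stages in the transfer diagram (ST analysis, structure transfer, TT generation) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the fixed three-word subject-verb-object input, the `TRANSFER_LEXICON` entries, and the romanized Japanese-style output with the particles "wa" and "o".

```python
# Hypothetical SL-TL lexicon for the transfer step (illustrative entries).
TRANSFER_LEXICON = {"john": "jon", "eats": "tabemasu", "apples": "ringo"}


def analyze(sentence):
    """ST analysis: parse a three-word SVO sentence into a role structure."""
    subj, verb, obj = sentence.lower().split()
    return {"subj": subj, "verb": verb, "obj": obj}


def transfer(st_structure):
    """Structure transfer: replace source lexical items with target items."""
    return {role: TRANSFER_LEXICON[word] for role, word in st_structure.items()}


def generate(tt_structure):
    """TT generation: linearize as subject-object-verb and add particles."""
    return "{subj} wa {obj} o {verb}".format(**tt_structure)


def translate(sentence):
    return generate(transfer(analyze(sentence)))


print(translate("John eats apples"))  # jon wa ringo o tabemasu
```

The point of the split is that word order and particles are decided in generation from the target-language side, while the transfer step only maps one structure onto another.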
Interlingua-based MT
[Diagram: Source Text (ST) → ST analysis → Interlingua representation → TT generation → Target Text (TT)]
• ST analysis uses the SL grammar & lexicon
• TT generation uses the TL grammar & lexicon (+ SL-TL lexicon)
Knowledge-based MT
• All world knowledge? A long-term research effort
• Practical systems: e.g. CMU’s KANT
  • narrow domain
  • domain model: defines all semantic classes and instances needed to represent all concepts in the domain
  • each concept definition includes:
    • concept head (name of the concept)
    • slots: allowable semantic roles
    • fillers: allowable concept classes that the roles can contain
  • disambiguation by filler restriction
  • knowledge acquisition
    • automatic or semi-automatic
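Disambiguation by filler restriction can be sketched with a small frame-style domain model. This is not KANT's actual representation; the `Concept` class, the `ISA` hierarchy, and the two senses of "run" are hypothetical names chosen only to show how slot fillers can rule out a reading.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Hypothetical is-a hierarchy for a narrow technical domain.
ISA = {"DIESEL-ENGINE": "ENGINE", "ENGINE": "MACHINE", "INSTALLER": "SOFTWARE"}


@dataclass
class Concept:
    head: str                                                  # concept name
    slots: Dict[str, Set[str]] = field(default_factory=dict)   # role -> allowed filler classes


# Two candidate concepts for the ambiguous verb "run" (illustrative).
SENSES = {
    "run": [
        Concept("OPERATE-MACHINE", {"theme": {"MACHINE"}}),
        Concept("EXECUTE-PROGRAM", {"theme": {"SOFTWARE"}}),
    ]
}


def is_a(concept, ancestor):
    """Walk up the hierarchy and report whether concept falls under ancestor."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ISA.get(concept)
    return False


def disambiguate(word, role_fillers):
    """Keep only the senses whose slot restrictions accept every given filler."""
    return [
        sense.head
        for sense in SENSES[word]
        if all(
            any(is_a(filler, allowed) for allowed in sense.slots.get(role, ()))
            for role, filler in role_fillers.items()
        )
    ]


print(disambiguate("run", {"theme": "DIESEL-ENGINE"}))  # ['OPERATE-MACHINE']
```

The narrow domain is what keeps such a model tractable: every concept and every allowed filler has to be listed, by hand or by semi-automatic acquisition.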
Example-based MT
• A companion module to improve MT quality
• Typically includes the following (Nirenburg 1995):
  • sentence-aligned corpus
  • intra-language matching
    • find the chunks in the source-language part of the corpus that are the best candidates for matching an input chunk
  • inter-language matching
    • find the target-language chunk corresponding to the chunk from the source-language part of the corpus
  • chunk combination
S. Nirenburg (1995). The PANGLOSS Mark III Machine Translation System. Technical Report CMU-CMT-95-145. (available online at http://www.lti.cs.cmu.edu/Research/CMT-home.html)
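The matching and combination steps listed above can be sketched as follows. This is a simplified illustration, not the PANGLOSS implementation: the `CORPUS` pairs are invented, the similarity measure (token overlap via `difflib.SequenceMatcher`) is one assumed choice among many, and inter-language matching is reduced to reading off the aligned target side, whereas a real system must locate the corresponding target chunk inside a longer aligned sentence.

```python
from difflib import SequenceMatcher

# Hypothetical aligned corpus of (source chunk, target chunk) pairs.
CORPUS = [
    ("good morning", "bonjour"),
    ("thank you very much", "merci beaucoup"),
    ("see you tomorrow", "à demain"),
]


def similarity(a, b):
    """Token-level similarity between two chunks, in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def best_match(chunk):
    """Intra-language matching: the corpus pair whose source side is closest."""
    return max(CORPUS, key=lambda pair: similarity(chunk, pair[0]))


def translate(chunks):
    """Inter-language matching plus naive chunk combination (concatenation)."""
    return " ".join(best_match(chunk)[1] for chunk in chunks)


print(translate(["good morning", "thank you so much"]))  # bonjour merci beaucoup
```

Here "thank you so much" is not in the corpus, but it shares three of four tokens with "thank you very much", so that example is reused.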
Statistics-based MT (1)
• Maximize Pr(S|T) = Pr(S) Pr(T|S) / Pr(T)
  • Pr(S): source language model
  • Pr(T|S): translation model
    • lexical translation, distortion, and fertility
• Some comments (Machine Translation 7(4)):
  • "I joined the attack … without realizing that precisely what the research was doing was to question some of the fundamental assumptions underlying MT research since 1966 … With hindsight, I can see that what this research was doing was saying that in the 20 years since ALPAC, the second generation architecture had led to only slightly better results than the architecture it replaced …" (Harold Somers)
  • "My initial reaction was the same as Somers. … The integration of a CANDIDE-type engine into a traditional MT architecture should probably [be] at the deepest level the architecture allows" (John White)
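The maximization on this slide can be made concrete with a toy decoder. Since Pr(T) is the same for every candidate S, it can be dropped, leaving the product Pr(S) Pr(T|S) to maximize. The probability tables below are invented numbers for illustration only, and the exhaustive search over a fixed candidate list stands in for a real decoder.

```python
# Pr(S): toy language model over candidate sentences (invented values).
LANGUAGE_MODEL = {
    "the house": 0.60,
    "the home": 0.35,
    "house the": 0.05,
}

# Pr(T|S): toy translation model, keyed by (observed T, candidate S).
TRANSLATION_MODEL = {
    ("la maison", "the house"): 0.4,
    ("la maison", "the home"): 0.3,
    ("la maison", "house the"): 0.4,
}


def decode(observed):
    """Return the S that maximizes Pr(S) * Pr(T|S) for the observed T."""
    candidates = [s for (t, s) in TRANSLATION_MODEL if t == observed]
    return max(
        candidates,
        key=lambda s: LANGUAGE_MODEL.get(s, 0.0) * TRANSLATION_MODEL[(observed, s)],
    )


print(decode("la maison"))  # the house
```

The translation model alone cannot separate "the house" from "house the" (both 0.4); the language model supplies the word-order preference, which is the division of labour the formula expresses.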
Statistics-based MT (2)
• Machine Translation 7(4):
  • "...not only does it need no linguistics or linguists, but no foreign speakers either. ... about 43% of sentences correctly translated. That compares badly with SYSTRAN which is usually assigned figures of around 65% … even if it did equal SYSTRAN’s level of performance, it is not clear what inferences we should draw.… we must always remember that they need millions of words of parallel texts even to start … The problems noted then were of long-distance dependencies: … French and English … were a lucky choice … we have good historical reasons for believing that a purely statistical method cannot do high-quality MT" (Yorick Wilks)
• Word alignment
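The slide names word alignment without saying how it is obtained. One common way to estimate lexical translation probabilities and read alignments off them is expectation-maximization in the style of IBM Model 1; the sketch below assumes that choice, omits the NULL word, and uses an invented three-sentence corpus.

```python
from collections import defaultdict


def train_lexical_model(pairs, iterations=10):
    """Estimate t(f|e) by EM from (source_tokens, target_tokens) pairs."""
    target_vocab = {f for _, tgt in pairs for f in tgt}
    uniform = 1.0 / len(target_vocab)
    t_prob = defaultdict(lambda: uniform)  # t_prob[(f, e)] = t(f|e)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e aligned to anything
        # E-step: distribute each target word over the source words.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t_prob[(f, e)] for e in src)
                for e in src:
                    fraction = t_prob[(f, e)] / norm
                    count[(f, e)] += fraction
                    total[e] += fraction
        # M-step: renormalize the expected counts.
        for (f, e), c in count.items():
            t_prob[(f, e)] = c / total[e]
    return t_prob


def align(src, tgt, t_prob):
    """Link each target word to its most probable source word."""
    return [(f, max(src, key=lambda e: t_prob[(f, e)])) for f in tgt]


# Invented toy parallel corpus (English source, German target).
PAIRS = [
    (["the", "house"], ["das", "haus"]),
    (["the", "book"], ["das", "buch"]),
    (["a", "book"], ["ein", "buch"]),
]

model = train_lexical_model(PAIRS)
print(align(["the", "house"], ["das", "haus"], model))
```

Distortion and fertility, mentioned on the previous slide, are not modelled here; only the lexical translation table is estimated.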
Evaluation
• Traditional Evaluation Metrics (Church & Hovy)
  • System-based metrics
    • easy to measure, but only meaningful for a particular system
    • e.g. 60 sub-grammars, 900 rewriting rules, …
  • Text-based metrics
    • sentence-based metrics
      • e.g. number of semantically or syntactically correct sentences
    • comprehensibility metrics
    • amount-of-post-editing metrics
  • Cost-based metrics: cost & time (per N words)
• Demos (must avoid being misleading)
• Developer’s view or Customer’s view
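An amount-of-post-editing metric can be approximated by counting the word-level edits needed to turn raw MT output into its post-edited version. The sketch below assumes word-level Levenshtein distance normalized by the length of the post-edited text; the slide does not prescribe a particular formula, and the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(
                min(
                    prev[j] + 1,             # delete x
                    curr[j - 1] + 1,         # insert y
                    prev[j - 1] + (x != y),  # substitute, or match at no cost
                )
            )
        prev = curr
    return prev[-1]


def post_edit_rate(mt_output, post_edited):
    """Word edits per word of the post-edited text (0.0 means no editing)."""
    mt_tokens, pe_tokens = mt_output.split(), post_edited.split()
    return edit_distance(mt_tokens, pe_tokens) / max(len(pe_tokens), 1)


print(post_edit_rate("the car red arrives", "the red car arrives"))  # 0.5
```

A measure like this captures typing effort only; the time a translator spends reading and deciding is closer to the cost-based metrics on the same slide.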
Some MT Problems
• Morphological ambiguity
• Lexical ambiguity and structural ambiguity
• Lexical mismatch and structural mismatch
• Idioms and collocations
• Ill-formed input
• World knowledge
CAT Tools
• Pre-editing and post-editing environments with linguistic analyses
• Translation Memory
  • "As the translator translates the text, each sentence (translation unit) is also saved automatically to a sophisticated translation unit database memory. As he translates, any similar sentence already in the memory will appear on screen for editing." (Ian Gordon)
• Alignment Tools
• Terminology Management
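The translation-memory behaviour in the quotation (save each translation unit, then surface similar units while translating) can be sketched as below. The class name, the 0.7 threshold, and the token-overlap similarity from `difflib.SequenceMatcher` are assumptions for illustration; commercial tools use their own matching schemes.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Minimal in-memory store of translation units with fuzzy lookup."""

    def __init__(self, threshold=0.7):
        self.units = {}             # source sentence -> target sentence
        self.threshold = threshold  # minimum similarity for a fuzzy match

    def store(self, source, target):
        """Save a finished translation unit."""
        self.units[source] = target

    def lookup(self, source):
        """Return (score, stored source, stored target) matches, best first."""
        scored = [
            (SequenceMatcher(None, source.split(), s.split()).ratio(), s, t)
            for s, t in self.units.items()
        ]
        return sorted((m for m in scored if m[0] >= self.threshold), reverse=True)


tm = TranslationMemory()
tm.store("Press the red button", "Appuyez sur le bouton rouge")
print(tm.lookup("Press the green button"))
# [(0.75, 'Press the red button', 'Appuyez sur le bouton rouge')]
```

The translator then edits the retrieved target sentence rather than translating from scratch, which is why repetitive material such as manuals benefits most.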
Standards
• Exchange Standard
  • (Multilingual) Text Formats
  • Lexicons
  • Knowledge Bases
  • Translation Memories
• Evaluation Standard
Future Directions
• Exploratory Research or Prototype Research?
• Modular Design (cf. Somers’ comments)
• Better Linguistic Theories
• Lexicon Construction
• Hybrid MT (Mainline MT engine + Additional Modules)
• Spoken Language (Dialog-based) MT
• MT Evaluation
• Computer-Assisted Translation / User-Friendly Environment
• Sub-language MT Systems
• Distributed MT / Networked MT
• MT on Demand
References
• Journal of Machine Translation (Kluwer)
• Proceedings of TMI, MT Summit, AMTA
• Proceedings of ACL, COLING, ROCLING
• E-Print Archive: http://xxx.lanl.gov/cmp-lg/
• AAMT: http://www.jeida.or.jp/aamt/index-e.html
• EAMT: http://www.lim.nl/eamt/
• The Association for Computational Linguistics: http://www.cs.columbia.edu/~acl/
• The LINGUIST List: http://www.linguistlist.org/
• Translation Research Group: http://www.ttt.org/index.html
• Localization Industry Standards Association (LISA): http://www.lisa.unige.ch/
References (continued)
• ISI @ USC: http://www.isi.edu/natural-language/nlp-at-isi.html
• CMU/LTI: http://www.lti.cs.cmu.edu/Research/CMT-home.html
• Verbmobil: http://www.dfki.de/verbmobil/
• C-STAR II: http://www.is.cs.cmu.edu/cstar/
• GETA: http://durian.imag.fr/
• Machine Translation at PAHO (ACG/T): http://www.paho.org/english/machine.htm
• METEO: http://padina.info.umoncton.ca/chandioux/meteoe.html
• WordNet Bibliography: http://www.cis.upenn.edu/~josephr/wn-biblio.html
References (continued)
• Globalink, Inc.: http://www.globalink.com/
• SYSTRAN: http://www.systransoft.com/
• Logos Corporation: http://www.logos-ca.com/
• TRADOS: http://www.trados.com/
• A.I.SOFT: http://www.aisoft.co.jp/
• CSK Home Page: http://www.csk.co.jp/home_e.html
• SHARP SOFT: http://www.sharp.co.jp/sc/excite/soft_map/soft.htm
• OKI Software: http://www.okisoft.co.jp/
• KODENSHA: http://www1.mesh.ne.jp/KODENSHA/
• ASTRANSAC: http://eiplaza.toshiba.co.jp/products/transac/