A roadmap for MT: four "keys" to handle more languages, for all kinds of tasks, while making it possible to improve quality (on demand). International Conference on Universal Knowledge and Language (ICUKL2002), Goa, 25-29 November 2002
A roadmap for MT: four "keys" to handle more languages, for all kinds of tasks, while making it possible to improve quality (on demand)
International Conference on Universal Knowledge and Language (ICUKL2002), Goa, 25-29 November 2002
Christian Boitet
GETA, CLIPS, IMAG, 385 av. de la bibliothèque, BP 53, F-38041 Grenoble cedex 9, France
Christian.Boitet@imag.fr, http://clips.imag.fr/geta
Outline
• Basic concepts
  • What is MT?
  • Goals: Quality / User
  • Architectures: Vauquois' triangle
• State of the art
  • MT of texts: examples, problems
  • MT of spoken dialogs
• The future of MT
  • Goals
  • 4 keys
ICUKL2002, Goa, 25-29/11/2002
What is M(a)T?
• At least 3 types of automation
  • MT = Machine Translation
  • MAT = Machine Assisted Translation
  • MAHT = Machine Aided Human Translation
• A scientific technology
  • Informatics (computer science)
  • Linguistics
  • Mathematics
Goals: Quality / User
Architectures: Vauquois' triangle
Architectures: Vauquois' triangle (larger)
Formal intermediate structures
How to produce an MT system
• Choose an architecture
• Program the "tools"
  • Specialized languages for linguistic programming (SLLP)
  • Development environment (MT shell)
• Build the "lingware"
  • Lexical data / rules / weights
  • Grammatical data / rules / weights
  • Possible specialization to a typology ("sublanguage")
• How?
  • Human work ± computer help / support
  • Automatic learning (weights, likelihoods…)
State of affairs
• Only a small number of language pairs are covered by MT systems designed for information access
  • Systran EC (2000): 19/110 language pairs, 8 OK for intended use
  • See also examples by Ronaldo Martins
• Even fewer are capable of quality translation or speech translation
• Now a few examples…
Examples: MT for access, Web (1)
Examples: MT for access, Web (2)
• FE is quite "easy" compared with EG, and especially with FG
Comparison: raw vs rough MT
Examples: MT for revisers…
…with BV-aero/FE (2)
MT of spoken dialogs
• Specialized systems are already usable
  • e.g. ATR/Matsushita, IBM, CSTAR/Nespole!…
  • Much "noise" and many "ungrammaticalities"
  • But specializing is very helpful!
• General systems are also possible
  • e.g. NEC/Xroad, Linguatec/Talk&Translate
  • Speech recognition is already good enough
  • Rough may be good enough (e.g. for chatting)
• Interpretation is different from translation…
  • …and participants are intelligent!
  • Similarity with access-oriented MT
French-Korean through IF (1)
French-Korean through IF (2)
French-Korean through IF (3)
A road map… to which goals?
• MT of adequate quality
• Not only for access
• For all languages
Four keys
• 2 on the technical side
  • Compromise: a far wider coverage, a somewhat lower asymptotic quality
  • Automatic learning techniques
  • Using non-textual pivots (intermediate formal descriptors)
• 2 on the organizational side
  • Democratization, cooperation
  • Cooperative development of open source linguistic resources on the Web
  • Towards systems where quality can be improved "on demand" by users
Learning techniques
• Extend the use of hybrid techniques
  • symbolic, numerical, or mixed
  • ==> they have demonstrated their potential at the research level
    • stochastic grammars
    • weighted (or "neural") dictionaries
• or build new tools, intrinsically numerical
  • inspiration from speech recognition
  • 2 examples
    • learning analyzers: text → semantic tree (IBM)
    • learning an implicit, very detailed DG from a treebank (NAIST)
Using non-textual pivots
• Semantico-pragmatic (ontological) pivots
  • task- and domain-oriented ==> limited applicability
• Abstract linguistic descriptors
  • the most precise, but often too sophisticated
  • depend on each language
• Anglo-semantic pivot: UNL
  • "the HTML of linguistic content"
  • in UNL, a hypergraph represents the abstract structure of a (supposedly) equivalent English utterance
  • less precise but "robust"
  • symbols constructed from English ==> usable by all developers
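A commonly cited argument for pivot architectures, not spelled out on this slide, is the module count: a transfer architecture needs one component per ordered language pair, whereas a pivot needs one analyzer and one generator per language. The Python sketch below works through that arithmetic for the language counts quoted elsewhere in these slides (11, 20, 180). The function names are illustrative, and the one-module-per-pair cost model is a simplifying assumption.

```python
def directed_pairs(n_languages: int) -> int:
    """Ordered (source, target) pairs among n languages: one transfer module each."""
    return n_languages * (n_languages - 1)


def pivot_modules(n_languages: int) -> int:
    """One analyzer into the pivot plus one generator out of it, per language."""
    return 2 * n_languages


# Language counts taken from the slides; the cost model itself is an assumption.
for n in (11, 20, 180):
    print(f"{n} languages: {directed_pairs(n)} directed pairs, "
          f"{pivot_modules(n)} pivot modules")
```

For 11 languages this gives 110 directed pairs, the same denominator as the "19/110 language pairs" figure in the State of affairs slide.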
A simple UNL graph
• Ronaldo has headed the ball into the left corner of the goal
[Figure: UNL graph. Entry node: score(icl>event,agt>human,fld>sport).@entry.@past.@complete. Other nodes: Ronaldo(icl>proper noun), head(pof>body).@def, corner(icl>thing).@def, goal(icl>abstract thing), goal(icl>concrete thing), left(aoj<thing). Relation labels: agt, obj, ins, plt, pos (twice), mod.]
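To make the graph concrete, here is a minimal Python sketch that stores it as a list of labeled edges. The node labels and relation names are copied from the slide. The edge attachments are an assumption: the diagram's arrows did not survive in this transcript, so the wiring below is only one plausible reading of the sentence. The helper names are hypothetical.

```python
from collections import defaultdict

# Node labels copied from the slide ('.@' introduces attributes).
SCORE = "score(icl>event,agt>human,fld>sport).@entry.@past.@complete"
RONALDO = "Ronaldo(icl>proper noun)"
HEAD = "head(pof>body).@def"
CORNER = "corner(icl>thing).@def"
GOAL_ABSTRACT = "goal(icl>abstract thing)"
GOAL_CONCRETE = "goal(icl>concrete thing)"
LEFT = "left(aoj<thing)"

# ASSUMPTION: which node each relation connects is reconstructed, not sourced.
EDGES = [
    ("agt", SCORE, RONALDO),
    ("obj", SCORE, GOAL_ABSTRACT),
    ("ins", SCORE, HEAD),
    ("plt", SCORE, CORNER),
    ("pos", HEAD, RONALDO),
    ("pos", CORNER, GOAL_CONCRETE),
    ("mod", CORNER, LEFT),
]


def split_label(label):
    """Separate a Universal Word from its '.@' attributes."""
    uw, *attributes = label.split(".@")
    return uw, attributes


def outgoing(edges):
    """Map each source node to its (relation, target) pairs."""
    table = defaultdict(list)
    for relation, source, target in edges:
        table[source].append((relation, target))
    return dict(table)


def entry_nodes(edges):
    """Nodes carrying the 'entry' attribute (the head of the utterance)."""
    nodes = {node for _, source, target in edges for node in (source, target)}
    return sorted(n for n in nodes if "entry" in split_label(n)[1])


for relation, target in outgoing(EDGES)[SCORE]:
    print(relation, "->", split_label(target)[0])
```

A flat edge list is enough for this example; a full UNL hypergraph also allows scopes (subgraphs used as nodes), which this sketch does not model.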
Cooperative development
• of open source linguistic resources
• on the Web
• Mutualization is necessary, at least for lexical knowledge
  • too costly even for the leaders
  • size (#entries) has to grow for each language (300K, 3M?)
  • #languages has to increase dramatically (11 → 20 → 180?)
• Integration of human- and machine-oriented knowledge is useful
  • e.g. to produce mixed MT/MAHT systems
A contribution: the Papillon project
• Goal:
  • produce many open source dictionaries from a central lexical database
• Means:
  • build rich (DiCo) monolingual dictionaries of lexies (senses)
  • interlink lexies by interlingual links (axies)
  • use XML & associated tools as a basis to generate many formats
    • for humans and for machines
  • start from (free) digital resources
  • induce "consumers" to become "producers" (contributors)
• Quality control:
  • private accounts
  • central validating/integrating group
Papillon database macrostructure
[Figure labels: Lexical Database; Dictionaries; Users; Resources; Human Contributors; Interaction with the Dictionaries; Extraction of Dictionaries; Integration of existing resources]
PAPILLON diagram
[Figure labels: French DiCo: vocable "carte" (n.f.) with lexies carte.1 (carte à jouer) and carte.2 (carte géographique). Japanese DiCo: カード, 地図. English DiCo: vocable "card" (N) with lexies card.1 (playing card) and card.2 (money card); vocable = lexie "map". Thai DiCo. Interlingual links: acception 343, UNL: card(icl>play), card(icl>thing)…; acception 345, UNL: map(fld>geography); acception 1002, UNL: card(fld>money).]
• Interlingual links based on translations = "AXIEs"
• Possibility to link 1 lexie with >1 acceptions
• References to other semantic systems: each AXIE can point to n UWs (1:n)
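The macrostructure in this diagram (monolingual lexies joined only through interlingual axies) can be sketched as a small data model. The Python below is an illustration under assumptions, not the Papillon schema: the class and function names are hypothetical, and the assignment of lexies to acceptions 343, 345 and 1002 is inferred from the glosses, since the link lines of the diagram are not recoverable here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Lexie:
    lang: str    # language code (illustrative choice of codes)
    ident: str   # lexie identifier as shown on the slide
    gloss: str   # short description of the sense


@dataclass
class Axie:
    ident: int
    uws: List[str]                       # UNL Universal Words listed on the slide
    lexies: List[Lexie] = field(default_factory=list)


# ASSUMPTION: lexie-to-axie links are inferred from the glosses.
AXIES = [
    Axie(343, ["card(icl>play)", "card(icl>thing)"], [   # slide list ends with an ellipsis
        Lexie("fra", "carte.1", "carte à jouer"),
        Lexie("jpn", "カード", ""),
        Lexie("eng", "card.1", "playing card"),
    ]),
    Axie(345, ["map(fld>geography)"], [
        Lexie("fra", "carte.2", "carte géographique"),
        Lexie("jpn", "地図", ""),
        Lexie("eng", "map", ""),
    ]),
    Axie(1002, ["card(fld>money)"], [
        Lexie("eng", "card.2", "money card"),
    ]),
]


def extract_bilingual(axies: List[Axie], src: str, tgt: str) -> Dict[str, List[str]]:
    """Derive a src->tgt dictionary by pairing lexies that share an axie."""
    result: Dict[str, List[str]] = {}
    for axie in axies:
        targets = [lx.ident for lx in axie.lexies if lx.lang == tgt]
        for lx in axie.lexies:
            if lx.lang == src and targets:
                result.setdefault(lx.ident, []).extend(targets)
    return result


print(extract_bilingual(AXIES, "fra", "jpn"))
print(extract_bilingual(AXIES, "eng", "fra"))
```

Because every bilingual dictionary is derived from the same axies, adding a language means linking its lexies to existing axies once rather than building a dictionary per language pair.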
Construct systems where quality can be improved "on demand" by users
• a priori, through interactive disambiguation in the source language
• or a posteriori, by correcting the pivot representation (UNL or other) through any language (as in MultiMeteo)
• ==> In both cases, all versions (in all languages) are improved
• possibility to merge
  • MT
  • multilingual generation
  • computer-aided authoring
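The claim that one correction improves every language version follows from all target texts being generated from the same pivot. A deliberately tiny Python sketch of that idea follows. The lookup-table "generators" are hypothetical stand-ins for real generation modules, the UW spellings are borrowed from the PAPILLON diagram slide, and the French wording for the money sense is an illustrative guess.

```python
# HYPOTHETICAL toy generators: pivot symbol -> surface form, per language.
GENERATORS = {
    "eng": {"card(icl>play)": "playing card", "card(fld>money)": "money card"},
    "fra": {"card(icl>play)": "carte à jouer", "card(fld>money)": "carte bancaire"},
}


def generate_all(pivot: str) -> dict:
    """Regenerate every language version from the single shared pivot."""
    return {lang: table[pivot] for lang, table in GENERATORS.items()}


# The analyzer picked the wrong sense; all outputs inherit the error.
document = {"pivot": "card(icl>play)"}
before = generate_all(document["pivot"])

# A posteriori correction: a reader of any one language fixes the pivot once...
document["pivot"] = "card(fld>money)"
# ...and every language version is improved on regeneration.
after = generate_all(document["pivot"])

print(before)
print(after)
```

The design point is that the correction is stored in the pivot, not in any one translation, so no per-language post-editing is needed.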
Conclusion
• 4 keys to open the door to MT of adequate quality for all languages
• On the technical side,
  • dramatically increase the use of learning techniques
  • use pivot architectures, the most universally usable pivot being UNL
• On the organizational side,
  • cooperatively develop open source linguistic resources on the Web
  • construct systems where quality can be improved "on demand" by users
• On the practical side,
  • seek keys to unlock private investment, public funding, voluntary cooperation
  • could this conference become a decisive turning point?