SYSTRAN Challenges and Recent Advances in Hybrid Machine Translation Jean Senellart, Jin Yang, Jens Stephan jyang@systransoft.com
Overview • SYSTRAN – 40 years of innovation • The MT Challenges • SYSTRANLab • Projects • Hybrid Engines • From Research to Products • CWMT08 • Conclusions
SYSTRAN • 40 years of history • Located in Paris (La Défense) and San Diego • 70+ employees: ~20 linguists, ~30 engineers, including 10 PhDs
Core Technology • Core technology is “rule-based” • Based on language description • Analysis – Transfer – Generation paradigm (sketched below) • Builds a “syntax tree” based on hierarchical constituents with multi-level relationships • Multi-pass analysis • Morphology Analysis • Homograph Resolution • Clause Boundary • Syntagm Identification • Syntactic Role Identification • … • Relies heavily on linguistic resources
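To make the paradigm concrete, here is a minimal Python sketch of how such a multi-pass Analysis – Transfer – Generation pipeline can be organized. The pass names follow the slide; the data structures, toy pass bodies, and dictionary lookup are illustrative assumptions, not SYSTRAN's actual implementation.

```python
# Minimal sketch of an Analysis - Transfer - Generation pipeline.
# Pass names follow the slide; everything else is a hypothetical toy.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Token:
    surface: str
    lemma: str = ""
    pos: str = ""      # filled in by homograph resolution
    role: str = ""     # filled in by syntactic role identification

def morphology_analysis(tokens: List[Token]) -> None:
    for t in tokens:
        t.lemma = t.surface.lower()          # toy lemmatization

def homograph_resolution(tokens: List[Token]) -> None:
    for t in tokens:
        t.pos = "NOUN"                       # toy part-of-speech decision

def syntactic_role_identification(tokens: List[Token]) -> None:
    if tokens:
        tokens[0].role = "SUBJECT"           # toy role assignment

ANALYSIS_PASSES = [
    morphology_analysis,            # Morphology Analysis
    homograph_resolution,           # Homograph Resolution
    syntactic_role_identification,  # Syntactic Role Identification
    # ... Clause Boundary, Syntagm Identification, etc. would slot in here
]

def transfer(tokens: List[Token], dictionary: Dict[str, str]) -> List[str]:
    # Map source lemmas to target equivalents via the bilingual dictionary.
    return [dictionary.get(t.lemma, t.surface) for t in tokens]

def generation(target_words: List[str]) -> str:
    # Real generation handles agreement and word order; here we just join.
    return " ".join(target_words)

def translate(sentence: str, dictionary: Dict[str, str]) -> str:
    tokens = [Token(w) for w in sentence.split()]
    for analysis_pass in ANALYSIS_PASSES:    # multi-pass analysis
        analysis_pass(tokens)
    return generation(transfer(tokens, dictionary))

print(translate("the cat sleeps", {"the": "le", "cat": "chat", "sleeps": "dort"}))
```

Each analysis pass enriches the same sentence structure in place, which is what makes the process incremental and traceable, as noted on the next slides.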
Languages • 22 source languages • 70 language pairs • Dictionaries: 200K–1M entries per LP • ~6M-entry reference multi-source / multi-target dictionary
Chinese 882      Korean 78
Arabic 422       Italian 62
Spanish 358      Ukrainian 47
English 350      Polish 42
Hindi 325        Dutch 23
Portuguese 250   Serbo-Croatian 21
Russian 170      Greek 18
French 130       Czech 12
Japanese 125     Albanian 6
Urdu 100         Slovak 6
German 100
Farsi 82
Total ≈ 3,600
SYSTRAN Activity • Retail products: Windows Desktop Product SYSTRAN Mobile on PDA Mac OS Dashboard Widget • Online Services SYSTRANBox, SYSTRANNet, SYSTRANLinks • Corporate customers Symantec, Cisco, Verizon, Ford, Daimler, Chemical Abstract… • Institutional Customers EC and US agencies • Portals - Online Translation “Babel Fish”, Google, Yahoo!, Microsoft Live, …
MT Challenges: RBMT/SMT Strengths and Weaknesses – I • Rule-based system builds a translation with available linguistic resources (dictionaries, rules) • Human-built resources • Incremental • Track the translation process • Predictable output • Some phenomena are hard to formalize • Need semantic/pragmatic knowledge • Not designed to deal with exceptions to the rules… which are very frequent
MT Challenges: RBMT/SMT Strengths and Weaknesses – II • Statistical system finds a translation within a choice of many, many possible translations • Very easy to build • Automatic training process • Knowledge acquisition is easy… • Not limited to predefined linguistic patterns – the “phrase” • …but cannot “understand” or generalize information • Not even elementary rules • Output is “unpredictable”
MT Challenges: Corpus-Based or Rule-Based Approach? • No conflict between “corpus” and “rule-based” approaches • Possible to learn rules • Already learns terminology – monolingual and multilingual • Some approaches acquire complex rules • Possible to find the best translation amongst several translations (see the sketch below) • “Decoding” can be constrained by syntactic restrictions • Linguistic rules, but the corpus drives!
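As a toy illustration of “linguistic rules, but the corpus drives”, the sketch below lets corpus statistics (a small add-alpha-smoothed bigram language model) choose among several candidate translations that a rule-based engine might legitimately produce. The corpus, candidates, and scoring code are hypothetical.

```python
# Toy illustration: rules propose candidates, corpus statistics pick one.
# The training corpus and candidate translations are invented examples.

from collections import Counter
from math import log

def train_bigram_lm(corpus_sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def score(sentence, unigrams, bigrams, alpha=0.1):
    # Add-alpha smoothed bigram log-probability of the candidate.
    words = ["<s>"] + sentence.split() + ["</s>"]
    vocab = len(unigrams)
    return sum(
        log((bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab))
        for w1, w2 in zip(words, words[1:])
    )

corpus = ["the report was published yesterday",
          "the new report is available",
          "results were published in the report"]
unigrams, bigrams = train_bigram_lm(corpus)

# Candidates a rule-based engine might produce for the same source sentence:
candidates = ["the report was published yesterday",
              "the report published was yesterday"]
best = max(candidates, key=lambda c: score(c, unigrams, bigrams))
print(best)   # the corpus-driven score prefers the fluent word order
```

The rules guarantee that every candidate is a well-formed translation hypothesis; the corpus only arbitrates between them, which is the division of labor argued for on this slide.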
SYSTRANLab • Research Projects Overview • Toward Hybrid Engines • Collaborations • Statistical Post-Edition • Lattice Decoding • Source Analysis • Adaptation • From Research to Products
Research Projects • Resources Acquisition • Consolidating a 6M entry multilingual dictionary • Acquiring more from corpus – lexicon and rules • Linguistic Development • Entity Recognition with local grammars • Autonomous Generation modules • Introduction of corpus-based technology • Applications • More interactive applications • Professional Post-Edition Module (POEM)
SYSTRANLab Research Projects: The Phoenix Project • Collaboration with P. Koehn (University of Edinburgh) • Introduce corpus-based decision modules in SYSTRAN • Specialized modules: Word Sense Disambiguation, Lattice Generation, Preposition / Determiner Choice (a toy sketch of the WSD idea follows below)
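For one of these modules, the corpus-based word-sense disambiguation idea can be reduced to a very small core: pick the translation of an ambiguous source word whose known corpus contexts overlap most with the current sentence. The sketch below illustrates only that idea; the context sets and function names are invented, and the actual Phoenix modules are certainly richer.

```python
# Hypothetical sketch of a corpus-based word sense disambiguation module:
# choose a translation of an ambiguous source word from its sentence context.

# Context words that (in some training corpus) would co-occur with each
# translation of English "bank"; the sets here are invented for illustration.
SENSE_CONTEXTS = {
    "banque": {"money", "account", "loan", "deposit", "credit"},
    "rive":   {"river", "water", "shore", "fishing", "boat"},
}

def disambiguate(ambiguous_word, sentence, sense_contexts):
    context = set(sentence.lower().split()) - {ambiguous_word}
    # Pick the translation whose known contexts overlap most with the sentence.
    return max(sense_contexts, key=lambda t: len(sense_contexts[t] & context))

print(disambiguate("bank",
                   "he opened an account at the bank to get a loan",
                   SENSE_CONTEXTS))   # -> "banque"
```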
SYSTRANLab Research Projects: The Sphinx Project • Collaboration with CNRC • Sequential use of SYSTRAN and statistical engines (Statistical Post-Edition) • GALE (DARPA project) • Participated in WMT07, NIST08
SYSTRANLab Research Projects: The Pegasus Project • Collaboration with H. Schwenk (Université du Maine) • Introduce linguistic knowledge in statistical engines • Participated in WMT08
SYSTRANLab: Hybrid Engines • Introduce self-learning capability • Learn “post-edition rules” • Deep integration of statistical decision modules • Insert linguistic knowledge in statistical engines → HYBRID
CWMT08 • Chinese-English MT evaluation • Primary system: RBMT+SPE • Contrast system: RBMT • RBMT development started in 1994; 1.2M terms; S&T focus
CWMT08: SPE Usage • SPE module trained on 1.8M sentences • CWMT08 training data not used • RBMT provides not only translation but also annotation • Dates, numerals, etc. • Transfer model is filtered • Exclusion of “bad rules” by rule-based filtering • Examples: “random” quotes, entities appearing • Some expressions are “protected” (sketched below) • Constituents are replaced with placeholders before SPE • Translated with RBMT • Re-injected into the translation after SPE • SPE model for CWMT08 trained with GIZA++, decoding with Moses (www.statmt.org/moses)
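The “protected expression” mechanism can be sketched as follows: protected constituents (here just ISO-style dates, as an assumed example) are kept in their RBMT-translated form, replaced by placeholders before the statistical step, and re-injected after post-edition. The regex, marker format, and the stand-in post-editor are illustrative assumptions, not the CWMT08 implementation.

```python
# Hypothetical sketch of placeholder protection around statistical post-edition.

import re

DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")   # assumed "protected" pattern

def protect(rbmt_output):
    """Replace protected constituents in the RBMT output with placeholders."""
    protected = {}
    def repl(match):
        key = f"__PROT{len(protected)}__"
        protected[key] = match.group(0)   # keep the RBMT-translated constituent
        return key
    return DATE_PATTERN.sub(repl, rbmt_output), protected

def statistical_post_edit(text):
    # Stand-in for the Moses-based SPE decoder (trained with GIZA++);
    # a real system would rewrite the whole sentence, not one phrase.
    return text.replace("publish in", "published on")

def unprotect(spe_output, protected):
    """Re-inject the protected constituents after statistical post-edition."""
    for key, value in protected.items():
        spe_output = spe_output.replace(key, value)
    return spe_output

rbmt_output = "The report publish in 2008-10-27 by the committee ."
masked, protected = protect(rbmt_output)
post_edited = statistical_post_edit(masked)
print(unprotect(post_edited, protected))
# -> "The report published on 2008-10-27 by the committee ."
```

The point of the masking is that the SPE model never gets a chance to corrupt material the RBMT engine already handles reliably, which supports the “degradation control” goal in the conclusions.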
Statistical Post-Edition: A Case Study • SYMANTEC – English>Chinese
Conclusions • Our approach is to start from a rule-based framework • The developed techniques give very competitive results • Major focus on “degradation” control • Learn more advanced post-edition rules • Generic Translation – still a long way to go • Is bigger still better? • Domain Translation • Quality is there – statistics provides adaptation and fluency • Need dedicated applications and workflow • Bootstrapping new language pair development