SYSTRAN Challenges and Recent Advances in Hybrid Machine Translation Jean Senellart, Jin Yang, Jens Stephan jyang@systransoft.com
Overview • SYSTRAN – 40 years of innovation • The MT Challenges • SYSTRANLab • Projects • Hybrid Engines • From Research to Products • CWMT08 • Conclusions
SYSTRAN • 40 years of history • Located in Paris (La Défense) and San Diego • 70+ employees: ~20 linguists, ~30 engineers, including 10 PhDs
Core Technology • Core technology is “rule-based” • Based on language description • Analysis – Transfer – Generation paradigm (sketched below) • Builds a “syntax tree” based on hierarchical constituents with multi-level relationships • Multi-pass analysis • Morphology Analysis • Homograph Resolution • Clause Boundary • Syntagm Identification • Syntactic Role Identification • … • Relies heavily on linguistic resources
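To make the paradigm concrete, here is a minimal Python sketch of how such a multi-pass Analysis – Transfer – Generation pipeline can be organized. The pass names follow the slide; the data structures, toy pass bodies, and dictionary lookup are illustrative assumptions, not SYSTRAN's actual implementation.

```python
# Minimal sketch of an Analysis - Transfer - Generation pipeline.
# Pass names follow the slide; everything else is a hypothetical toy.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Token:
    surface: str
    lemma: str = ""
    pos: str = ""      # filled in by homograph resolution
    role: str = ""     # filled in by syntactic role identification

def morphology_analysis(tokens: List[Token]) -> None:
    for t in tokens:
        t.lemma = t.surface.lower()          # toy lemmatization

def homograph_resolution(tokens: List[Token]) -> None:
    for t in tokens:
        t.pos = "NOUN"                       # toy part-of-speech decision

def syntactic_role_identification(tokens: List[Token]) -> None:
    if tokens:
        tokens[0].role = "SUBJECT"           # toy role assignment

ANALYSIS_PASSES = [
    morphology_analysis,            # Morphology Analysis
    homograph_resolution,           # Homograph Resolution
    syntactic_role_identification,  # Syntactic Role Identification
    # ... Clause Boundary, Syntagm Identification, etc. would slot in here
]

def transfer(tokens: List[Token], dictionary: Dict[str, str]) -> List[str]:
    # Map source lemmas to target equivalents via the bilingual dictionary.
    return [dictionary.get(t.lemma, t.surface) for t in tokens]

def generation(target_words: List[str]) -> str:
    # Real generation handles agreement and word order; here we just join.
    return " ".join(target_words)

def translate(sentence: str, dictionary: Dict[str, str]) -> str:
    tokens = [Token(w) for w in sentence.split()]
    for analysis_pass in ANALYSIS_PASSES:    # multi-pass analysis
        analysis_pass(tokens)
    return generation(transfer(tokens, dictionary))

print(translate("the cat sleeps", {"the": "le", "cat": "chat", "sleeps": "dort"}))
```

Each analysis pass enriches the same sentence structure in place, which is what makes the process incremental and traceable, as noted on the next slides.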
Languages • 22 source languages • 70 language pairs • Dictionaries: 200K–1M entries per LP • ~6M-entry reference multi-source / multi-target dictionary
Chinese 882      Korean 78
Arabic 422       Italian 62
Spanish 358      Ukrainian 47
English 350      Polish 42
Hindi 325        Dutch 23
Portuguese 250   Serbo-Croatian 21
Russian 170      Greek 18
French 130       Czech 12
Japanese 125     Albanian 6
Urdu 100         Slovak 6
German 100
Farsi 82
Total ≈ 3,600
SYSTRAN Activity • Retail products: Windows Desktop Product SYSTRAN Mobile on PDA Mac OS Dashboard Widget • Online Services SYSTRANBox, SYSTRANNet, SYSTRANLinks • Corporate customers Symantec, Cisco, Verizon, Ford, Daimler, Chemical Abstract… • Institutional Customers EC and US agencies • Portals - Online Translation “Babel Fish”, Google, Yahoo!, Microsoft Live, …
MT Challenges: RBMT/SMT Strengths and Weaknesses – I • Rule-based system builds a translation with available linguistic resources (dictionaries, rules) • Human-built resources • Incremental • Track the translation process • Predictable output • Some phenomena are hard to formalize • Need semantic/pragmatic knowledge • Not designed to deal with exceptions to the rules… which are very frequent
MT Challenges: RBMT/SMT Strengths and Weaknesses – II • Statistical system finds a translation within a choice of many, many possible translations • Very easy to build • Automatic training process • Knowledge acquisition is easy… • Not limited to predefined linguistic patterns – the “phrase” • …but cannot “understand” or generalize information • Not even elementary rules • Output is “unpredictable”
MT Challenges: Corpus-Based or Rule-Based Approach? • No conflict between “corpus” and “rule-based” approaches • Possible to learn rules • Already learns terminology – monolingual and multilingual • Some approaches acquire complex rules • Possible to find the best translation amongst several translations (see the sketch below) • “Decoding” can be constrained by syntactic restrictions • Linguistic rules, but the corpus drives!
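As a toy illustration of “linguistic rules, but the corpus drives”, the sketch below lets corpus statistics (a small add-alpha-smoothed bigram language model) choose among several candidate translations that a rule-based engine might legitimately produce. The corpus, candidates, and scoring code are hypothetical.

```python
# Toy illustration: rules propose candidates, corpus statistics pick one.
# The training corpus and candidate translations are invented examples.

from collections import Counter
from math import log

def train_bigram_lm(corpus_sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def score(sentence, unigrams, bigrams, alpha=0.1):
    # Add-alpha smoothed bigram log-probability of the candidate.
    words = ["<s>"] + sentence.split() + ["</s>"]
    vocab = len(unigrams)
    return sum(
        log((bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab))
        for w1, w2 in zip(words, words[1:])
    )

corpus = ["the report was published yesterday",
          "the new report is available",
          "results were published in the report"]
unigrams, bigrams = train_bigram_lm(corpus)

# Candidates a rule-based engine might produce for the same source sentence:
candidates = ["the report was published yesterday",
              "the report published was yesterday"]
best = max(candidates, key=lambda c: score(c, unigrams, bigrams))
print(best)   # the corpus-driven score prefers the fluent word order
```

The rules guarantee that every candidate is a well-formed translation hypothesis; the corpus only arbitrates between them, which is the division of labor argued for on this slide.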
SYSTRANLab • Research Projects Overview • Toward Hybrid Engines • Collaborations • Statistical Post-Edition • Lattice Decoding • Source Analysis • Adaptation • From Research to Products
Research Projects • Resources Acquisition • Consolidating a 6M entry multilingual dictionary • Acquiring more from corpus – lexicon and rules • Linguistic Development • Entity Recognition with local grammars • Autonomous Generation modules • Introduction of corpus-based technology • Applications • More interactive applications • Professional Post-Edition Module (POEM)
SYSTRANLab Research Projects: The Phoenix Project • Collaboration with P. Koehn (University of Edinburgh) • Introduce corpus-based decision modules in SYSTRAN • Specialized modules: Word Sense Disambiguation, Lattice Generation, Preposition / Determiner Choice (a toy sketch of the WSD idea follows below)
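For one of these modules, the corpus-based word-sense disambiguation idea can be reduced to a very small core: pick the translation of an ambiguous source word whose known corpus contexts overlap most with the current sentence. The sketch below illustrates only that idea; the context sets and function names are invented, and the actual Phoenix modules are certainly richer.

```python
# Hypothetical sketch of a corpus-based word sense disambiguation module:
# choose a translation of an ambiguous source word from its sentence context.

# Context words that (in some training corpus) would co-occur with each
# translation of English "bank"; the sets here are invented for illustration.
SENSE_CONTEXTS = {
    "banque": {"money", "account", "loan", "deposit", "credit"},
    "rive":   {"river", "water", "shore", "fishing", "boat"},
}

def disambiguate(ambiguous_word, sentence, sense_contexts):
    context = set(sentence.lower().split()) - {ambiguous_word}
    # Pick the translation whose known contexts overlap most with the sentence.
    return max(sense_contexts, key=lambda t: len(sense_contexts[t] & context))

print(disambiguate("bank",
                   "he opened an account at the bank to get a loan",
                   SENSE_CONTEXTS))   # -> "banque"
```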
SYSTRANLab Research Projects: The Sphinx Project • Collaboration with CNRC • Sequential use of SYSTRAN and statistical engines (Statistical Post-Edition) • GALE (DARPA project) • Participated in WMT07, NIST08
SYSTRANLab Research Projects: The Pegasus Project • Collaboration with H. Schwenk (Université du Maine) • Introduce linguistic knowledge in statistical engines • Participated in WMT08
SYSTRANLab: Hybrid Engines • Introduce self-learning capability • Learn “post-edition rules” • Deep integration of statistical decision modules • Insert linguistic knowledge in statistical engines → HYBRID
CWMT08 • Chinese-English MT evaluation • Primary system: RBMT+SPE • Contrast system: RBMT • RBMT development started in 1994; 1.2M terms; S&T focus
CWMT08: SPE Usage • SPE module trained on 1.8M sentences • CWMT08 training data not used • RBMT provides not only translation but also annotation • Dates, numerals, etc. • Transfer model is filtered • Exclusion of “bad rules” by rule-based filtering • Examples: “random” quotes, entities appearing • Some expressions are “protected” (sketched below) • Constituents are replaced with placeholders before SPE • Translated with RBMT • Re-injected into the translation after SPE • SPE model for CWMT08 trained with GIZA++, decoding with Moses (www.statmt.org/moses)
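The “protected expression” mechanism can be sketched as follows: protected constituents (here just ISO-style dates, as an assumed example) are kept in their RBMT-translated form, replaced by placeholders before the statistical step, and re-injected after post-edition. The regex, marker format, and the stand-in post-editor are illustrative assumptions, not the CWMT08 implementation.

```python
# Hypothetical sketch of placeholder protection around statistical post-edition.

import re

DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")   # assumed "protected" pattern

def protect(rbmt_output):
    """Replace protected constituents in the RBMT output with placeholders."""
    protected = {}
    def repl(match):
        key = f"__PROT{len(protected)}__"
        protected[key] = match.group(0)   # keep the RBMT-translated constituent
        return key
    return DATE_PATTERN.sub(repl, rbmt_output), protected

def statistical_post_edit(text):
    # Stand-in for the Moses-based SPE decoder (trained with GIZA++);
    # a real system would rewrite the whole sentence, not one phrase.
    return text.replace("publish in", "published on")

def unprotect(spe_output, protected):
    """Re-inject the protected constituents after statistical post-edition."""
    for key, value in protected.items():
        spe_output = spe_output.replace(key, value)
    return spe_output

rbmt_output = "The report publish in 2008-10-27 by the committee ."
masked, protected = protect(rbmt_output)
post_edited = statistical_post_edit(masked)
print(unprotect(post_edited, protected))
# -> "The report published on 2008-10-27 by the committee ."
```

The point of the masking is that the SPE model never gets a chance to corrupt material the RBMT engine already handles reliably, which supports the “degradation control” goal in the conclusions.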
Statistical Post-Edition: A Case Study • SYMANTEC – English>Chinese
Conclusions • Our approach is to start from a rule-based framework • The developed techniques give very competitive results • Major focus on “degradation” control • Learn more advanced post-edition rules • Generic Translation – still a long way to go • Is bigger still better? • Domain Translation • Quality is there – statistics provides adaptation and fluency • Need dedicated applications and workflow • Bootstrapping new language pair development