TectoMT

TectoMT
• two goals of TectoMT:
  • to allow experimenting with MT based on deep-syntactic (tectogrammatical) transfer
  • to create a software framework into which various NLP software components could be integrated and tested within real-life applications (such as MT)
• developed at UFAL since 2005
• around 10 programmers using (and contributing to) TectoMT in 2008

Reminder 1: MT pyramid in terms of PDT layers
• Key question in MT: optimal level of abstraction?
• Our answer: somewhere around tectogrammatics
  • high generalization over different language characteristics, but still computationally (and mentally!) tractable

Reminder 2: MT pyramid in TectoMT
[Figure: MT triangle with source-language and target-language sides; layers from top to bottom: interlingua, tectogrammatical, surface-syntactic, morphological, raw text]
• modularity is emphasized in TectoMT: the MT task is implemented as a sequence of reusable NLP modules (called blocks)
• around 80 blocks in the current version of English-Czech translation
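The "sequence of reusable blocks" idea can be illustrated with a minimal sketch. This is not TectoMT's actual API: the block names (`tokenize`, `lowercase`, `detokenize`), the dictionary-based document, and `run_scenario` are hypothetical stand-ins chosen only to show how independent modules can be chained and swapped.

```python
from typing import Callable, Dict, List

# Hypothetical document representation: a dictionary of named layers.
Document = Dict[str, object]
# A block is any function that takes a document and returns an updated copy.
Block = Callable[[Document], Document]


def tokenize(doc: Document) -> Document:
    """Toy analysis block: split the raw text on whitespace."""
    out = dict(doc)
    out["tokens"] = str(doc["text"]).split()
    return out


def lowercase(doc: Document) -> Document:
    """Toy transfer block: stands in for any token-level transformation."""
    out = dict(doc)
    out["tokens"] = [t.lower() for t in doc["tokens"]]
    return out


def detokenize(doc: Document) -> Document:
    """Toy synthesis block: join the tokens back into a sentence."""
    out = dict(doc)
    out["output"] = " ".join(doc["tokens"])
    return out


def run_scenario(blocks: List[Block], doc: Document) -> Document:
    """Apply the blocks in order; each block sees the previous block's result."""
    for block in blocks:
        doc = block(doc)
    return doc


# A scenario is just an ordered list of blocks, so an alternative
# implementation of one step can replace it without touching the others.
scenario = [tokenize, lowercase, detokenize]
result = run_scenario(scenario, {"text": "Hello World"})
print(result["output"])
```

Because each block only reads and writes named layers of the document, replacing one step (for example, one tagger with another) would leave the rest of the scenario unchanged.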

What is new in TectoMT in 2008?
• new blocks added
• new applications created
• large data processed and used

New blocks in TectoMT in 2008
• around 100 new blocks in 2008
• two types of extensions:
  • adding alternative (usually higher-performance) solutions to already implemented blocks, e.g.
    • McDonald's parser (Collins' parser and constituency-to-dependency conversion were integrated already in 2005)
    • MORCE tagger (previously integrated taggers: TnT, MxPost, Jan Hajič's tagger, Lingua::EN::Tagger, Schmid's TreeTagger)
  • blocks for new tasks
    • relatively isolated tasks such as named entity recognition in Czech and English
    • sequence of blocks for English sentence synthesis

New applications of TectoMT in 2008
• existing:
  • real-time tecto-analysis of Czech sentences integrated in tree editor TrEd
  • English sentence generator (within the Companions project)
  • sentence analysis for various purposes (intonation in TTS, information extraction)
  • segmentation of text into finite verb clauses
  • preprocessing of English text for the purpose of English-to-Hindi translation
• pilot version in the very near future:
  • simple man-machine dialog manager
  • Czech-to-English MT

Processing of large data in TectoMT
• roughly 1 GW of Czech texts
  • analyzed up to simplified tecto
  • for the purposes of modeling Czech sentences or their trees (functions as the target-side language model in our translation scenario)
• roughly 60 MW of parallel Czech-English texts from the CzEng corpus
  • analyzed up to simplified tecto and aligned
  • serves for generating several types of translation models

Plans for 2009
• introduce TectoMT to a larger audience (MT Marathon 2009)
• experiment with more sophisticated tools during the tecto-transfer phase (log-linear combinations of translation and target-language tree models, tree HMM)
• facilitate addition of new languages to be processed in TectoMT
• performance tuning (now: roughly 1 translated sentence per second)
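As a rough illustration of what a log-linear combination of a translation model and a target-language tree model could look like, the sketch below scores each candidate as a weighted sum of log-probabilities and picks the best one. The feature names (`translation`, `tree_lm`), the weights, and the probabilities are invented for the example; the slides do not specify the actual models or how their weights are set.

```python
import math
from typing import Dict, List


def loglinear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of log-domain feature values (a log-linear model score)."""
    return sum(weights[name] * value for name, value in features.items())


def best_candidate(candidates: List[dict], weights: Dict[str, float]) -> dict:
    """Return the candidate with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))


# Hypothetical weights: how much each model contributes to the final decision.
weights = {"translation": 1.0, "tree_lm": 0.5}

# Hypothetical candidates with made-up model probabilities.
candidates = [
    {"label": "A", "features": {"translation": math.log(0.6), "tree_lm": math.log(0.1)}},
    {"label": "B", "features": {"translation": math.log(0.4), "tree_lm": math.log(0.5)}},
]

best = best_candidate(candidates, weights)
print(best["label"])
```

With these invented numbers, candidate A is preferred by the translation model alone, but the target-language tree model tips the combined score towards B; setting the `tree_lm` weight to 0 reverses the choice.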
