Explore the theoretical background and practical applications of the TectoMT system, which utilizes deep syntactic dependency trees to optimize machine translation. Discover how TectoMT improves generalization across languages while remaining computationally feasible. Learn how formemes, a novel concept, can enhance the modularization and structuring of statistical models, leading to more effective utilization of limited training data.
Machine Translation using Tectogrammatics • Zdeněk Žabokrtský, ÚFAL, Charles University in Prague
Overview • Part I - theoretical background • Part II - TectoMT system
MT pyramid (in terms of PDT) • Key question in MT: optimal level of abstraction? • Our answer: somewhere around tectogrammatics • high generalization over different language characteristics, but still computationally (and mentally!) tractable
Basic facts about "Tecto" • introduced by Petr Sgall in the 1960s • implemented in the Prague Dependency Treebank 2.0 • each sentence is represented as a deep-syntactic dependency tree • function words accompanying an autosemantic word "collapse" with it into a single t-node, labeled with the autosemantic t-lemma • t-nodes are added where needed (e.g. because of pro-drop) • semantically indispensable syntactic and morphological categories are rendered by a complex system of t-node attributes (functors + subfunctors, grammatemes for tense, number, degree of comparison, etc.)
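To make the t-node idea concrete, here is a minimal Perl sketch (Perl being the system's implementation language) of a t-node as a nested hash with a t-lemma, a functor, grammatemes and children. The attribute names follow the slide, but the structure and the concrete values are only an illustration, not the actual PDT/TectoMT data model.

```perl
#!/usr/bin/perl
# Illustrative sketch only (not the actual PDT/TectoMT data model):
# a t-node as a nested hash with a t-lemma, a functor, grammatemes and children.
use strict;
use warnings;

my $t_node = {
    t_lemma     => 'deliver',   # "will be delivered" collapses into one t-node with the autosemantic t-lemma
    functor     => 'PRED',
    grammatemes => { tense => 'post', verbmod => 'ind' },
    children    => [
        { t_lemma => 'assistant', functor => 'ADDR',  grammatemes => { number => 'pl' }, children => [] },
        { t_lemma => 'meeting',   functor => 'TWHEN', grammatemes => { number => 'sg' }, children => [] },
    ],
};

# Depth-first traversal printing each t-lemma with its functor.
sub print_subtree {
    my ($node, $depth) = @_;
    print '  ' x $depth, "$node->{t_lemma} ($node->{functor})\n";
    print_subtree($_, $depth + 1) for @{ $node->{children} };
}
print_subtree($t_node, 0);
```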
SMT and limits of growth • current state-of-the-art approaches to MT • n-grams + large parallel (and also monolingual) corpora + huge computational power • n-grams are very greedy! • availability (or even existence!) of more data? • example: Czech-English parallel data • ~1 MW - easy (just download and align some tens of e-books) • ~10 MW - doable (parallel corpus CzEng) • ~100 MW - not now, but maybe in a couple of years... • ~1 GW - ? • ~10 GW (~100,000 books) - was it ever translated?
How could tecto help SMT? • the n-gram view: manifestations of lexemes are mixed with manifestations of the language means expressing relations between the lexemes and of other grammar rules • inflectional endings, agglutinative affixes, function words, word order, punctuation, orthographic rules... • It will be delivered to Mr. Green's assistants at the nearest meeting. • training data sparsity • how could tecto ideas help? • within each sentence, a clear separation of meaningful "signs" from "signs" which are only imposed by grammar (e.g. imposed by agreement) • a clear separation of the lexical, syntactic and morphological meaning components • modularization of the translation task → potential for better structuring of statistical models → more effective exploitation of the limited training data
"Semitecto" • abstract sentence representation, tailored for MT purposes • motivation: • not to make decisions which are not really necessary for the MT process (such as distinguishing between many types of temporal and directional semantic complementations) • given the target-language "semitecto" tree, we want the sentence generation to be deterministic • slightly "below" tecto (w.r.t. the abstraction axis): • adopting the idea of separating lexical, syntactical and morphological meaning components; adopting the t-tree topology principles • adopting many t-node attributes (especially grammatemes, coreference, etc.) • but (almost) no functors, no subfunctors, no WSD, no pointers to valency dictionary, no tfa... • closer to the surface-syntax • main innovation: concept of formemes
Formemes • a formeme = the morphosyntactic language means expressing the dependency relation • n:v+6 (in Czech) = a semantic noun expressed on the surface as a prepositional group in the locative with the preposition "v" • v:that+fin/a (in English) = a semantic verb expressed in the active voice as the head of a subordinate clause introduced by the subordinating conjunction "that" • obviously, the sets of formeme values are specific to each of the four semantic parts of speech • in fact, formemes are edge labels partially substituting for functors • what is NOT captured by formemes: • morphological categories imposed by grammar rules (esp. by agreement), such as gender, number and case for adjectives in attributive positions • morphological categories already represented by grammatemes, such as degree of comparison for adjectives, tense for verbs, number for nouns
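The formeme strings above have a regular shape: a semantic part of speech, an optional preposition or subordinating conjunction, the surface form, and a voice marker for verbs. The following sketch decomposes a formeme string under my reading of that notation; it is an illustration, not code taken from TectoMT.

```perl
#!/usr/bin/perl
# A minimal sketch (my own illustration, not TectoMT code) of decomposing a
# formeme string into semantic part of speech, optional preposition/conjunction,
# surface form, and voice (for verbs).
use strict;
use warnings;

sub parse_formeme {
    my ($formeme) = @_;                        # e.g. 'n:v+6', 'v:that+fin/a', 'adj:attr'
    my ($sempos, $rest) = split /:/, $formeme, 2;
    my %parsed = ( sempos => $sempos );
    if (defined $rest && $rest =~ m{^(?:([^+/]+)\+)?([^/]*)(?:/(.+))?$}) {
        $parsed{aux}   = $1;                   # preposition or subordinating conjunction, if any
        $parsed{form}  = $2;                   # e.g. '6' (locative case), 'fin', 'attr', 'X'
        $parsed{voice} = $3;                   # 'a' (active) / 'p' (passive) for verbs
    }
    return \%parsed;
}

for my $f ('n:v+6', 'v:that+fin/a', 'adj:attr', 'n:of+X') {
    my $p = parse_formeme($f);
    printf "%-14s sempos=%-4s aux=%-5s form=%-5s voice=%s\n",
        $f, $p->{sempos}, $p->{aux} // '-', $p->{form} // '-', $p->{voice} // '-';
}
```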
Formemes in the tree • Example: It is extremely important that Iraq held elections to a constitutional assembly.
Some more examples of proposed formemes (frequency · formeme) • English: 661 adj:attr, 568 n:attr, 456 n:subj, 413 n:obj, 370 v:fin/a, 273 n:of+X, 238 adv:, 160 n:poss, 160 n:in+X, 146 v:to+inf/a, 92 adj:compl, 91 n:to+X, ..., 62 v:rc/a, ..., 51 v:that+fin/a, ..., 39 v:ger/a • Czech: 968 adj:attr, 604 n:1, 552 n:2, 497 v:fin/a, 308 n:4, 260 adv:, 169 n:v+6, 133 adj:compl, 117 v:inf, 104 n:poss, 86 n:7, 82 v:že+fin/a, 77 v:rc/a, 63 n:s+7, 53 n:k+3, 53 n:attr, 50 n:na+6, 47 n:na+4, 42 v:aby+fin/a
Three-way transfer • translation process (I have been asked by him to come → Požádal mě, abych přišel): • 1. analysis of the source-language sentence up to the "semitecto" layer • 2. transfer of • lexemes (ask → požádat, come → přijít) • formemes (v:fin/p → v:fin/a, v:to+inf → v:aby+fin/a) • grammatemes (tense=past → past, Ø → verbmod=cdn) • 3. synthesis of the target-language sentence from the "semitecto" layer
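A toy sketch of this three-way transfer on the example sentence, with hand-written lexeme and formeme mappings standing in for the real statistical or dictionary components; the mappings and the verbmod rule below are illustrative assumptions, not the actual transfer rules.

```perl
#!/usr/bin/perl
# Toy sketch of the three-way transfer idea (illustration only, not the real
# TectoMT transfer): lexemes, formemes and grammatemes are mapped separately.
use strict;
use warnings;

my %lexeme_map  = ( ask => 'požádat', come => 'přijít' );
my %formeme_map = ( 'v:fin/p' => 'v:fin/a', 'v:to+inf' => 'v:aby+fin/a' );

# "I have been asked by him to come" as two simplified semitecto nodes.
my @source_nodes = (
    { lexeme => 'ask',  formeme => 'v:fin/p',  grammatemes => { tense => 'past' } },
    { lexeme => 'come', formeme => 'v:to+inf', grammatemes => {} },
);

my @target_nodes;
for my $node (@source_nodes) {
    my %t = (
        lexeme      => $lexeme_map{ $node->{lexeme} }   // $node->{lexeme},
        formeme     => $formeme_map{ $node->{formeme} } // $node->{formeme},
        grammatemes => { %{ $node->{grammatemes} } },
    );
    # Grammateme transfer: the "aby" clause calls for conditional verb mood (assumed rule).
    $t{grammatemes}{verbmod} = 'cdn' if $t{formeme} eq 'v:aby+fin/a';
    push @target_nodes, \%t;
}

printf "%-10s %s\n", $_->{lexeme}, $_->{formeme} for @target_nodes;
```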
Adding statistics... • translation model, e.g. trained on the parallel corpus CzEng (~30 MW): P(l_T | l_S), P(f_T | f_S) • "binode" language model of the target language, e.g. trained on the partially parsed Czech National Corpus (~100 MW): P(l_gov, l_dep, f)
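One plausible way to combine the two models is to score each target-lexeme candidate for a dependent node by the product of its translation-model probability and the binode probability with its already-chosen governor. The sketch below does exactly that with made-up probabilities; the combination scheme is an assumption for illustration, not necessarily the system's actual decoding strategy.

```perl
#!/usr/bin/perl
# Sketch of combining the translation model with the "binode" language model
# when choosing the target lexeme of a dependent node (all numbers made up).
use strict;
use warnings;
use List::Util qw(reduce);

# Translation model P(l_T | l_S), e.g. estimated from CzEng.
my %p_trans = ( 'election' => { 'volby' => 0.7, 'volba' => 0.3 } );

# "Binode" language model P(l_gov, l_dep, f), e.g. from a partially parsed corpus.
my %p_binode = (
    'proběhnout|volby|n:1' => 0.004,
    'proběhnout|volba|n:1' => 0.0005,
);

sub best_dependent {
    my ($src_lemma, $tgt_governor, $formeme) = @_;
    my @candidates = keys %{ $p_trans{$src_lemma} };
    return reduce {
        my $score_a = $p_trans{$src_lemma}{$a} * ($p_binode{"$tgt_governor|$a|$formeme"} // 1e-9);
        my $score_b = $p_trans{$src_lemma}{$b} * ($p_binode{"$tgt_governor|$b|$formeme"} // 1e-9);
        $score_a >= $score_b ? $a : $b;
    } @candidates;
}

print best_dependent('election', 'proběhnout', 'n:1'), "\n";   # -> volby
```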
Goals • primary goal • to build a high-quality linguistically motivated MT system using the PDT layered framework, starting with English -> Czech direction • secondary goals • to create a system for testing the true usefulness of various NLP tools within a real-life application • to exploit the abstraction power of tectogrammatics • to supply data and technology for other projects
[MT triangle figure: interlingua / tectogrammatics / surface syntax / morphology / raw text; source language on one side, target language on the other]
Main design decisions • Linux + Perl • a set of well-defined, linguistically relevant levels of language representation • neutral w.r.t. the chosen methodology (e.g. rules vs. statistics) • an in-house OO architecture as the backbone, but easy incorporation of external tools (parsers, taggers, lemmatizers, etc.) • accent on modularity: a translation scenario is a sequence of translation blocks (modules corresponding to individual NLP subtasks)
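The modularity principle, a scenario as a sequence of blocks applied to a shared document, can be illustrated in a few lines of Perl; the block names and document fields below are invented for the example and do not correspond to the real TectoMT block API.

```perl
#!/usr/bin/perl
# Illustrative sketch of the "scenario = sequence of blocks" idea
# (block names and document fields are invented, not the real TectoMT API).
use strict;
use warnings;

# Each block is a named code reference that transforms the shared document structure.
my @scenario = (
    { name => 'Tokenize',        code => sub { my ($doc) = @_; $doc->{tokens} = [ split ' ', $doc->{text} ] } },
    { name => 'TagAndLemmatize', code => sub { my ($doc) = @_; $doc->{lemmas} = [ map { lc } @{ $doc->{tokens} } ] } },
    { name => 'Parse',           code => sub { my ($doc) = @_; $doc->{tree}   = { head => $doc->{lemmas}[1] // '' } } },
);

my $document = { text => 'Iraq held elections' };
for my $block (@scenario) {
    print "Applying block: $block->{name}\n";
    $block->{code}->($document);
}
print "Lemmas: @{ $document->{lemmas} }\n";
```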
TectoMT - Example of analysis (1) • Sample sentence: It is extremely important that Iraq held elections to a constitutional assembly.
TectoMT - example of analysis (2) • phrase-structure tree:
TectoMT - example of analysis (3) • analytical tree
TectoMT - example of analysis (4) • tectogrammatical tree (with formemes)
Heuristic alignment • Sentence pair: • It is extremely important that Iraq held elections to a constitutional assembly. • Je nesmírně důležité, že v Iráku proběhly volby do ústavního shromáždění.
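A toy illustration of what a lemma-based alignment heuristic could look like for this sentence pair; the bilingual dictionary and the greedy matching below are invented stand-ins, the real aligner combines richer cues (tree structure, formemes, etc.).

```perl
#!/usr/bin/perl
# Toy sketch of heuristic node alignment for the sentence pair above
# (dictionary and matching are invented; the real aligner is richer).
use strict;
use warnings;

my @en_lemmas = qw(important Iraq hold election assembly);
my @cs_lemmas = qw(důležitý Irák proběhnout volby shromáždění);

my %dict = (
    'important' => 'důležitý',
    'Iraq'      => 'Irák',
    'election'  => 'volby',
    'assembly'  => 'shromáždění',
);

# Greedy heuristic: align on a dictionary match, otherwise leave the node unaligned.
my %alignment;
for my $en (@en_lemmas) {
    my ($match) = grep { defined $dict{$en} && $_ eq $dict{$en} } @cs_lemmas;
    $alignment{$en} = $match if defined $match;
}

printf "%-10s -> %s\n", $_, $alignment{$_} for sort keys %alignment;
```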
Formeme pairs extracted from the parallel aligned trees (frequency · Czech formeme ↔ English formeme) • 593 adj:attr ↔ adj:attr • 290 v:fin/a ↔ v:fin/a • 282 n:1 ↔ n:subj • 214 adj:attr ↔ n:attr • 165 n:2 ↔ n:of+X • 152 adv: ↔ adv: • 149 n:4 ↔ n:obj • 102 n:2 ↔ n:attr • 86 n:v+6 ↔ n:in+X • 79 n:poss ↔ n:poss • 73 n:1 ↔ n:obj • 61 n:2 ↔ n:obj • 51 v:inf ↔ v:to+inf/a • 50 adj:compl ↔ adj:compl • 39 n:2 ↔ n: • 34 n:4 ↔ n:subj • 34 n:attr ↔ n:attr • 32 v:že+fin/a ↔ v:that+fin/a • 32 n:2 ↔ n:poss • 27 n:4 ↔ n:attr • 27 n:2 ↔ n:subj • 26 adj:attr ↔ n:poss • 25 v:rc/a ↔ v:rc/a • 20 v:aby+fin/a ↔ v:to+inf/a
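Counts like these can be obtained by a single pass over the aligned node pairs. The sketch below shows just the counting step, on a handful of hard-coded pairs standing in for real aligned t-trees.

```perl
#!/usr/bin/perl
# Sketch: counting co-occurring formeme pairs over aligned node pairs
# (the hard-coded pairs are a stand-in for real aligned t-trees).
use strict;
use warnings;

my @aligned_node_pairs = (
    { cs_formeme => 'n:1',      en_formeme => 'n:subj'   },
    { cs_formeme => 'adj:attr', en_formeme => 'adj:attr' },
    { cs_formeme => 'n:1',      en_formeme => 'n:subj'   },
    { cs_formeme => 'n:v+6',    en_formeme => 'n:in+X'   },
);

my %pair_count;
$pair_count{"$_->{cs_formeme}\t$_->{en_formeme}"}++ for @aligned_node_pairs;

# Print pairs sorted by descending frequency, as on the slide.
for my $pair (sort { $pair_count{$b} <=> $pair_count{$a} } keys %pair_count) {
    printf "%4d  %s\n", $pair_count{$pair}, $pair =~ s/\t/ <-> /r;
}
```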