
Machine Translation using Tectogrammatics



  1. Machine Translation using Tectogrammatics. Zdeněk Žabokrtský, IFAL, Charles University in Prague

  2. Overview • Part I - theoretical background • Part II - TectoMT system

  3. MT pyramid (in terms of PDT) • Key question in MT: optimal level of abstraction? • Our answer: somewhere around tectogrammatics • high generalization over different language characteristics, but still computationally (and mentally!) tractable

  4. Basic facts about "Tecto" • introduced by Petr Sgall in the 1960s • implemented in the Prague Dependency Treebank 2.0 • each sentence is represented as a deep-syntactic dependency tree • functional words accompanying an autosemantic word "collapse" with it into a single t-node, labeled with the autosemantic t-lemma • t-nodes are added where needed (e.g. for pro-dropped subjects) • semantically indispensable syntactic and morphological categories are rendered by a complex system of t-node attributes (functors + subfunctors, grammatemes for tense, number, degree of comparison, etc.)
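
  To make the t-node representation concrete, here is a minimal Perl sketch of a t-node as a plain record, for the pro-drop Czech sentence "Nepřišel." ("He did not come"). The field names and values are simplified illustrations, not the actual PDT 2.0 attribute inventory:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use utf8;
      binmode STDOUT, ':utf8';

      # Illustrative t-node for Czech "Nepřišel." ("He did not come"):
      # the negative prefix collapses into the verb's t-node (rendered by a
      # grammateme), and a #PersPron t-node is added for the dropped subject.
      my $t_prijit = {
          t_lemma     => 'přijít',
          functor     => 'PRED',
          grammatemes => { tense => 'ant', negation => 'neg1' },
          children    => [
              { t_lemma => '#PersPron', functor => 'ACT', is_generated => 1 },
          ],
      };
      print "predicate t-lemma: $t_prijit->{t_lemma}\n";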

  5. SMT and limits of growth • current state-of-the-art approaches to MT: n-grams + large parallel (and also monolingual) corpora + huge computational power • n-grams are very data-greedy! • availability (or even existence!) of more data? • example: Czech-English parallel data • ~1 MW - easy (just download and align a few tens of e-books) • ~10 MW - doable (parallel corpus CzEng) • ~100 MW - not now, but maybe in a couple of years... • ~1 GW - ? • ~10 GW (~100,000 books) - has that much text ever even been translated?

  6. How could tecto help SMT? • n-gram view: manifestations of lexemes are mixed with manifestations of the language means expressing the relations between the lexemes and of other grammar rules • inflectional endings, agglutinative affixes, functional words, word order, punctuation, orthographic rules... • It will be delivered to Mr. Green's assistants at the nearest meeting. • → training data sparsity • how could tecto ideas help? • within each sentence, clear separation of meaningful "signs" from "signs" which are only imposed by grammar (e.g. imposed by agreement) • clear separation of lexical, syntactic and morphological meaning components • → modularization of the translation task → potential for better structuring of statistical models → more effective exploitation of the limited training data

  7. "Semitecto" • abstract sentence representation tailored for MT purposes • motivation: • do not make decisions which are not really necessary for the MT process (such as distinguishing between many types of temporal and directional semantic complementations) • given the target-language "semitecto" tree, we want sentence generation to be deterministic • slightly "below" tecto (w.r.t. the abstraction axis): • adopts the idea of separating lexical, syntactic and morphological meaning components; adopts the t-tree topology principles • adopts many t-node attributes (especially grammatemes, coreference, etc.) • but (almost) no functors, no subfunctors, no WSD, no pointers to a valency dictionary, no tfa (topic-focus articulation)... • closer to surface syntax • main innovation: the concept of formemes

  8. Formemes • formeme = the morphosyntactic means by which a dependency relation is expressed on the surface • n:v+6 (in Czech) = semantic noun expressed on the surface as a prepositional group in the locative with the preposition "v" • v:that+fin/a (in English) = semantic verb expressed in the active voice as the head of a subordinate clause introduced by the subordinating conjunction "that" • obviously, the sets of formeme values are specific to each of the four semantic parts of speech • in fact, formemes are edge labels partially substituting for functors • what is NOT captured by formemes: • morphological categories imposed by grammar rules (esp. by agreement), such as gender, number and case for adjectives in attributive positions • morphological categories already represented by grammatemes, such as degree of comparison for adjectives, tense for verbs, number for nouns
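
  A formeme value like n:v+6 or v:that+fin/a can be decomposed mechanically. The following Perl sketch splits a formeme string into its parts; the splitting convention is inferred from the examples above, and the field names are illustrative, not TectoMT's actual API:

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Decompose a formeme string such as "n:v+6", "v:that+fin/a" or
      # "adj:attr" into semantic part of speech, function word (preposition
      # or conjunction, if any), surface shape, and voice (for verbs).
      sub parse_formeme {
          my ($formeme) = @_;
          my ($sempos, $rest)  = split /:/, $formeme, 2;
          my ($form,   $voice) = split m{/}, ($rest // ''), 2;
          my ($fword, $shape) = $form =~ /\+/ ? split(/\+/, $form, 2)
                                              : (undef, $form);
          return { sempos => $sempos, fword => $fword,
                   shape  => $shape,  voice => $voice };
      }

      for my $f (qw(n:v+6 v:that+fin/a adj:attr)) {
          my $p = parse_formeme($f);
          printf "%-14s sempos=%s fword=%s shape=%s voice=%s\n", $f,
                 map { $_ // '-' } @$p{qw(sempos fword shape voice)};
      }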

  9. Formemes in the tree • Example: It is extremely important that Iraq held elections to a constitutional assembly.

  10. Some more examples of proposed formemes • English • 661 adj:attr • 568 n:attr • 456 n:subj • 413 n:obj • 370 v:fin/a • 273 n:of+X • 238 adv: • 160 n:poss • 160 n:in+X • 146 v:to+inf/a • 92 adj:compl • 91 n:to+X • ... • 62 v:rc/a • ... • 51 v:that+fin/a • ... • 39 v:ger/a • Czech • 968 adj:attr • 604 n:1 • 552 n:2 • 497 v:fin/a • 308 n:4 • 260 adv: • 169 n:v+6 • 133 adj:compl • 117 v:inf • 104 n:poss • 86 n:7 • 82 v:že+fin/a • 77 v:rc/a • 63 n:s+7 • 53 n:k+3 • 53 n:attr • 50 n:na+6 • 47 n:na+4 • 42 v:aby+fin/a

  11. Three-way transfer • translation process (I have been asked by him to come → Požádal mě, abych přišel): • 1. source-language sentence analysis up to the "semitecto" layer • 2. transfer of • lexemes (ask → požádat, come → přijít) • formemes (v:fin/p → v:fin/a, v:to+inf → v:aby+fin/a) • grammatemes (tense=past → tense=past, ∅ → verbmod=cdn) • 3. target-language sentence synthesis from the "semitecto" layer
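
  A toy Perl sketch of the transfer step on a single t-node, using hypothetical 1:1 mapping tables whose contents mirror the example above (real transfer is probabilistic and contextual):

      #!/usr/bin/perl
      use strict;
      use warnings;
      use utf8;
      binmode STDOUT, ':utf8';

      # Toy 1:1 transfer tables for lexemes and formemes (illustrative only).
      my %lex_map  = ( ask => 'požádat', come => 'přijít' );
      my %form_map = ( 'v:fin/p' => 'v:fin/a', 'v:to+inf' => 'v:aby+fin/a' );

      sub transfer_node {
          my ($node) = @_;
          my $formeme = $form_map{ $node->{formeme} } // $node->{formeme};
          my %gram    = %{ $node->{gram} };
          # grammateme transfer: the conditional mood (verbmod=cdn) appears
          # "from nothing" when the target formeme is an aby-clause
          $gram{verbmod} = 'cdn' if $formeme eq 'v:aby+fin/a';
          return { t_lemma => $lex_map{ $node->{t_lemma} } // $node->{t_lemma},
                   formeme => $formeme, gram => \%gram };
      }

      my $src = { t_lemma => 'come', formeme => 'v:to+inf',
                  gram    => { tense => 'past' } };
      my $tgt = transfer_node($src);
      print "$tgt->{t_lemma} / $tgt->{formeme} / verbmod=$tgt->{gram}{verbmod}\n";
      # prints: přijít / v:aby+fin/a / verbmod=cdn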

  12. Adding statistics... • translation model (e.g. from the parallel corpus CzEng, 30 MW): P(l_T | l_S), P(f_T | f_S) • "binode" language model over target-language edges (e.g. from the partially parsed Czech National Corpus, 100 MW): P(l_gov, l_dep, f)
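
  The two components can be combined multiplicatively when choosing among candidate translations of a node. A minimal Perl sketch with invented toy probabilities (the real models would be estimated from CzEng and the Czech National Corpus as noted above):

      #!/usr/bin/perl
      use strict;
      use warnings;
      use utf8;
      binmode STDOUT, ':utf8';

      # Invented toy probabilities: translation model P(l_T|l_S), P(f_T|f_S),
      # and binode language model P(l_gov, l_dep, f) over target-side edges.
      my %p_lex    = ( 'message|zpráva' => 0.7, 'message|vzkaz' => 0.3 );
      my %p_form   = ( 'n:obj|n:4' => 0.8 );
      my %p_binode = ( 'doručit|zpráva|n:4' => 0.020,
                       'doručit|vzkaz|n:4'  => 0.004 );

      # Score each candidate target lemma for source "message" (n:obj),
      # governed by the already chosen target verb "doručit".
      my ($best, $best_score) = (undef, 0);
      for my $cand (qw(zpráva vzkaz)) {
          my $score = ($p_lex{"message|$cand"}        // 1e-9)
                    * ($p_form{'n:obj|n:4'}           // 1e-9)
                    * ($p_binode{"doručit|$cand|n:4"} // 1e-9);
          ($best, $best_score) = ($cand, $score) if $score > $best_score;
      }
      print "chosen: $best\n";   # zpráva (0.0112 vs. 0.00096)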

  13. Part II - TectoMT System

  14. Goals • primary goal • to build a high-quality, linguistically motivated MT system using the PDT layered framework, starting with the English → Czech direction • secondary goals • to create a system for testing the true usefulness of various NLP tools within a real-life application • to exploit the abstraction power of tectogrammatics • to supply data and technology for other projects

  15. Main design decisions • [MT triangle figure: interlingua / tectogrammatical / surface-syntactic / morphological / raw text layers, spanning source and target language] • Linux + Perl • a set of well-defined, linguistically relevant levels of language representation • neutral w.r.t. the chosen methodology (e.g. rules vs. statistics) • in-house OO architecture as the backbone, but easy incorporation of external tools (parsers, taggers, lemmatizers, etc.) • accent on modularity: a translation scenario is a sequence of translation blocks (modules corresponding to individual NLP subtasks); a sketch of this idea follows below
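
  A minimal Perl sketch of the block/scenario architecture mentioned in the last point; the class name Block and its interface are illustrative assumptions, not TectoMT's actual API:

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Each block wraps one processing step; a scenario is just an ordered
      # list of blocks applied in sequence to a shared document structure.
      package Block;
      sub new              { my ($class, %args) = @_; bless { %args }, $class }
      sub process_document { my ($self, $doc) = @_; $self->{code}->($doc) }

      package main;

      my @scenario = (
          Block->new( name => 'SegmentToTokens',
                      code => sub { $_[0]{tokens} = [ split /\s+/, $_[0]{text} ] } ),
          Block->new( name => 'Lowercase',
                      code => sub { $_ = lc for @{ $_[0]{tokens} } } ),
      );

      my $doc = { text => 'It is extremely important .' };
      $_->process_document($doc) for @scenario;
      print join('|', @{ $doc->{tokens} }), "\n";   # it|is|extremely|important|.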

  16. TectoMT - Example of analysis (1) • Sample sentence: It is extremely important that Iraq held elections to a constitutional assembly.

  17. TectoMT - example of analysis (2) • phrase-structure tree:

  18. TectoMT - example of analysis (3) • analytical tree

  19. TectoMT - example of analysis (4) • tectogrammatical tree (with formemes)

  20. Heuristic alignment • Sentence pair: • It is extremely important that Iraq held elections to a constitutional assembly. • Je nesmírně důležité, že v Iráku proběhly volby do ústavního shromáždění.
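
  A crude Perl sketch of how such a heuristic aligner might pair t-nodes of the two trees via a bilingual lexicon; the toy dictionary and the greedy lemma-matching strategy are illustrative assumptions (a real aligner would also exploit tree structure):

      #!/usr/bin/perl
      use strict;
      use warnings;
      use utf8;
      binmode STDOUT, ':utf8';

      # Toy bilingual lexicon over t-lemmas of the example sentence pair.
      my %dict = ( important => 'důležitý', hold     => 'proběhnout',
                   election  => 'volba',    assembly => 'shromáždění' );

      my @en = qw(important Iraq hold election constitutional assembly);
      my @cs = qw(důležitý Irák proběhnout volba ústavní shromáždění);

      # Greedily align each English t-lemma to a Czech one via the lexicon.
      my %cs_seen = map { $_ => 1 } @cs;
      for my $e (@en) {
          my $c = $dict{$e};
          print "$e -> $c\n" if defined $c and $cs_seen{$c};
      }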

  21. Formeme pairs extracted from parallel aligned trees (count, Czech formeme ↔ English formeme) • 593 adj:attr ↔ adj:attr • 290 v:fin/a ↔ v:fin/a • 282 n:1 ↔ n:subj • 214 adj:attr ↔ n:attr • 165 n:2 ↔ n:of+X • 152 adv: ↔ adv: • 149 n:4 ↔ n:obj • 102 n:2 ↔ n:attr • 86 n:v+6 ↔ n:in+X • 79 n:poss ↔ n:poss • 73 n:1 ↔ n:obj • 61 n:2 ↔ n:obj • 51 v:inf ↔ v:to+inf/a • 50 adj:compl ↔ adj:compl • 39 n:2 ↔ n: • 34 n:4 ↔ n:subj • 34 n:attr ↔ n:attr • 32 v:že+fin/a ↔ v:that+fin/a • 32 n:2 ↔ n:poss • 27 n:4 ↔ n:attr • 27 n:2 ↔ n:subj • 26 adj:attr ↔ n:poss • 25 v:rc/a ↔ v:rc/a • 20 v:aby+fin/a ↔ v:to+inf/a
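
  Extracting such counts from aligned node pairs is a simple aggregation; a Perl sketch over invented input records:

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Count co-occurring (Czech formeme, English formeme) pairs over
      # aligned t-node pairs; the three records below are invented examples.
      my @aligned = ( { cs => 'n:1', en => 'n:subj' },
                      { cs => 'n:2', en => 'n:of+X' },
                      { cs => 'n:1', en => 'n:subj' } );

      my %count;
      $count{"$_->{cs}\t$_->{en}"}++ for @aligned;

      for my $pair (sort { $count{$b} <=> $count{$a} } keys %count) {
          print "$count{$pair}\t$pair\n";
      }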

  22. Processing blocks in the current prototype: 1) segment the input text into sentences 2) tokenize the sentences 3) morphological tagging 4) lemmatize each token 5) phrase-structure parsing 6) mark phrase heads 7) run the constituency → dependency transformation 8) mark subject nodes 9) derive the t-tree topology 10) label t-nodes with t-lemmas 11) assign coordination/apposition functors 12) mark finite clauses 13) detect grammatical coreference in relative clauses 14) determine the semantic part of speech 15) fill grammateme attributes (number, tense, degree...) 16) detect the sentence modality 17) detect formemes 18) clone the source-language t-tree 19) translate t-lemmas using a simple 1:1 probabilistic lexicon 20) set the gender attribute according to the noun lemma 21) set the aspect attribute according to the verb lemma 22) predict the target-language formeme 23) resolve morphological agreement 24) expand complex verb forms 25) add prepositions and conjunctions 26) perform conjugation and declension 27) resolve word order 28) add punctuation 29) perform vocalization of prepositions 30) concatenate the tokens into the final sentence string

  23. Thank you!
