Handling texts and corpuses in Ariane-G5, a complete environment for multilingual MT. ACIDCA ’2000, Monastir, 21-24/3/2000. Christian Boitet, GETA, CLIPS, IMAG, Grenoble.
Outline
• Introduction
• Multilingual MT-R (for revisors): linguistic methodology & basic software
  • Goals and linguistic methodology
  • Ariane-G5, an MT shell for building multilingual MT-R systems
  • What has been and is done with Ariane-G5: MT-R, MT-A (for authors), MT of speech
• Representation of input documents
• Structuration of corpuses
• Functionalities during processing
MULTILINGUAL MT-R: GOALS AND LINGUISTIC METHODOLOGY
• Produce RAW translation GOOD ENOUGH to be revised
• Specialize to SUBLANGUAGES and use
  • MULTILEVEL TRANSFER (semantic + traces)
  • HEURISTIC PROGRAMMING
relative to “variants” => Ariane-G5: 2 specialized DBs
• DB of lingware components
  • Declaration of variables (= typed attributes), templates…
  • Dictionaries
  • Grammars (rules = transitions of abstract automata)
• DB of texts
  • Corpuses
  • Source texts
  • Intermediate results
  • Translations (± revisions)
What has been and is done with Ariane-G5:
• MT-R (for revisors)
  • Large, operational systems: RU—>FR, FR—>EN
  • Prototypes: EN—>MY, TH, FR
  • Lots of mockups
• MT-A (for authors)
  • LIDIA mockups: FR—>DE, EN, RU (adding CH)
• MT of speech (for task-oriented dialogues)
  • CSTAR demo system (EN, DE, KR, IT, FR, JP)
MT-R examples of translation (1)
• French-English in aeronautics (before human revision)
MT-A example of a disambiguation dialogue
• Le capitaine a rapporté des tasses et des assiettes bleues
• —> The captain has brought back blue bowls and plates / bowls and blue plates
Question 1
  O des tasses bleues et des assiettes bleues
  O des assiettes bleues et des tasses
Question 2
  O capitaine de marine
  O capitaine d’aviation
  O capitaine d’artillerie
  O capitaine d’infanterie
  O capitaine de cavalerie
  O …
Interaction in source for the “quality MT for all”
• Example scenario: multilingual e-mail (UNL)
[Figure: numbered flow (1 to 10) linking an e-mail tool (nicknames + language preferences), an e-mail server, an enconversion server, an analysis server, an interactive disambiguation server, several decoding (deconversion) servers, and the addressees’ e-mail servers]
Other future possibility: production of multilingual “self-explaining documents”
Speech translation: advantages of an Interchange Format (IF)
• N target languages for the cost of one analysis
• Translating into one’s language from N source languages with one generation
• Using the same generation to “backgenerate” IF
[Figure labels: Analysis into IF, Backgeneration]
Interchange Format: example
• la semaine du 12 nous avons des chambres simples et doubles disponibles (“the week of the 12th we have single and double rooms available”)
• give-information+availability+room(room-type=(single ; double), time=(week, md12))
  • Dialogue act: give-information
  • Concepts: +availability+room
  • Arguments: (room-type=(single ; double), time=(week, md12))
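The three-part reading of the IF expression above (dialogue act, concepts, arguments) can be sketched as a small parser. This is only an illustration built from the single example on the slide: the function name `parse_if` and the returned dictionary layout are assumptions, and the argument list is kept as a raw string rather than parsed further.

```python
import re

def parse_if(expr):
    """Split an IF expression into dialogue act, concepts and raw arguments.

    Assumed layout (from the slide's example only):
    act+concept+concept(arguments)
    """
    match = re.fullmatch(r"\s*([^()]+?)\s*(?:\((.*)\))?\s*", expr)
    if match is None:
        raise ValueError(f"not an IF expression: {expr!r}")
    head, args = match.group(1), match.group(2)
    # The first '+'-separated item is the dialogue act, the rest are concepts.
    act, *concepts = head.split("+")
    return {"dialogue_act": act, "concepts": concepts, "arguments": args or ""}

parsed = parse_if(
    "give-information+availability+room"
    "(room-type=(single ; double), time=(week, md12))"
)
```

Keeping the arguments unparsed avoids guessing at a grammar that the slide does not specify.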
Interface of CLIPS++ CSTAR-II demonstrator
[Figure labels: Recognition, IF, Generation, Backgeneration (to check the “understanding”)]
Hardware architecture of the CLIPS++ CSTAR-II demonstrator
[Figure: Grenoble and Montpellier sites; labels: Control, IFF, Synthesis, VC, IU, FIF, Reco, Ethernet, ISDN (RNIS)]
Steps in translating a text
• Build its hierarchical structure
  • Chapters, sections, paragraphs, [sentences]
• Segment into translation units
  • According to current length parameter [min..max]
• Translate each segment
  • Adding segment results to text results for desired phases
• Revise (manually) the whole translations, keep the revisions
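The segmentation step above, driven by a [min..max] length parameter, might look roughly like the following. This is a sketch under stated assumptions, not Ariane-G5's actual algorithm: lengths are counted in characters, units are built from whole sentences, and the name `segment_into_units` is hypothetical.

```python
def segment_into_units(sentences, min_len, max_len):
    """Greedily group consecutive sentences into translation units.

    A unit is closed as soon as it reaches min_len characters, or earlier
    if adding the next sentence would push it past max_len. A sentence
    longer than max_len becomes a unit on its own.
    """
    units, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_len:
            # The next sentence does not fit: close the unit built so far.
            units.append(current)
            current = sentence
        else:
            current = candidate
        if len(current) >= min_len:
            units.append(current)
            current = ""
    if current:
        units.append(current)  # leftover material shorter than min_len
    return units

sentences = [
    "Short one.",
    "Another short one.",
    "A much longer sentence that goes on for a while.",
    "End.",
]
units = segment_into_units(sentences, min_len=20, max_len=40)
```

Here the two short sentences are merged into one unit, the long sentence stands alone, and the final fragment is emitted even though it is below the minimum.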
Representations of input documents
• 3 main questions:
  • how to represent the writing system,
  • whether or not to separate formatting tags from the text,
  • how to handle non-textual elements (figures, icons, or formulas) contained in utterances
• Transliterations of textual elements
• Keeping formatting tags in the texts
• Non-textual elements
Transliterations of textual elements
• Facilitate string-matching operations
• Diminish the size of dictionaries
• Represent diacritics
• Make some processing easier for some tools
  • kataba —> ktb$aaa, katub —> ktb$au- or ktb$-ua
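The root-plus-pattern transliteration in the last example can be sketched as below. The scheme is inferred from the two examples on the slide and is an assumption: one pattern slot per root consonant, holding the vowel that follows it or "-" when none follows. The vowel inventory `VOWELS` and the function name are hypothetical, and the alternative form ktb$-ua is not reproduced.

```python
VOWELS = set("aiu")  # assumption: short vowels of a simplified romanization

def root_and_pattern(word):
    """Rewrite a romanized form as '<consonant root>$<vowel pattern>'."""
    root, pattern = [], []
    for i, ch in enumerate(word):
        if ch in VOWELS:
            continue  # vowels are recorded in the slot of the preceding consonant
        root.append(ch)
        following = word[i + 1] if i + 1 < len(word) else ""
        pattern.append(following if following in VOWELS else "-")
    return "".join(root) + "$" + "".join(pattern)

kataba = root_and_pattern("kataba")
katub = root_and_pattern("katub")
```

With such a form, all words sharing the root ktb have a common prefix, which is what makes string matching easier and dictionaries smaller.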
Transliterations of textual elements (2)
• Represent writing systems using non-Roman characters
  • "мать" (mother) —> "MATQ" and not "MAT6"
  • 今日は京都へ行きます。 (Today theme Kyoto dest go.)
    —> KYOU <kj k1=kon k2=nichi> WA <hg ha> KYOUTO <kj k1=higashi k2=toukyo-no-tou> E <hg he> IKI <kj k1=iku> MASU.
Keeping formatting tags in the texts
• If the translation units get larger, almost all tags become “inside tags”
• Tags often have a linguistic role
Rendered text:
For example, a sentence may contain
  • a bullet list
  • or a numbered list
which are normally linguistically homogeneous.
Tagged source:
<P>For example, a sentence may contain</P> <UL> <LI>a bullet list <LI>or a numbered list </UL> <P>which are normally linguistically homogeneous. </P>
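One way to keep tags inside a translation unit, rather than stripping them, is to tokenize the unit into an ordered sequence of tags and text runs. The sketch below is illustrative only: the regular expression and the `tokenize` helper are assumptions and say nothing about how Ariane-G5 actually represents tags.

```python
import re

TAG = re.compile(r"(<[^>]+>)")  # assumption: tags are simple <...> spans

def tokenize(unit):
    """Return ('tag' | 'text', value) tokens in document order."""
    tokens = []
    for piece in TAG.split(unit):
        if not piece.strip():
            continue  # skip whitespace between adjacent tags
        kind = "tag" if TAG.fullmatch(piece) else "text"
        tokens.append((kind, piece.strip()))
    return tokens

unit = (
    "<P>For example, a sentence may contain</P> <UL> <LI>a bullet list "
    "<LI>or a numbered list </UL> "
    "<P>which are normally linguistically homogeneous. </P>"
)
tokens = tokenize(unit)
texts = [value for kind, value in tokens if kind == "text"]
tags = [value for kind, value in tokens if kind == "tag"]
```

Because the tags stay in sequence with the text, a later phase can still see that the two list items sit between the two halves of one sentence.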
Non-textual elements
• Formulas, figures, icons, brand names, anchors, links…
  • are often best replaced by tags or special occurrences
• The situation may be recursive (text inside figures)
*IF x2+5y>3 , x+y IS CONVENIENT .
*IF <relation 1> , <entity 2> IS CONVENIENT .
*IF $$R-1 , $$E-2 IS CONVENIENT .
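The replacement of non-textual elements by special occurrences such as $$R-1 and $$E-2 can be sketched as a mask/unmask pair. The helper names and the way elements are passed in (already identified, with a one-letter kind) are assumptions made for this illustration. Detecting the elements is a separate problem not covered here.

```python
def mask_elements(text, elements):
    """Replace each (kind, surface) element by a '$$<kind>-<n>' occurrence.

    Returns the masked text and a table for restoring the originals.
    """
    table = {}
    for n, (kind, surface) in enumerate(elements, start=1):
        key = f"$${kind}-{n}"
        text = text.replace(surface, key, 1)
        table[key] = surface
    return text, table

def unmask_elements(text, table):
    """Put the original elements back, longest keys first to avoid clashes."""
    for key in sorted(table, key=len, reverse=True):
        text = text.replace(key, table[key])
    return text

source = "*IF x2+5y>3 , x+y IS CONVENIENT ."
masked, table = mask_elements(source, [("R", "x2+5y>3"), ("E", "x+y")])
restored = unmask_elements(masked, table)
```

The same table can be applied to the translated text, so the formula and the entity reappear unchanged in the output.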
Structuration of corpuses
• Motivations for corpuses
• Segmentation and structuration
• Representation of texts, intermediate results, translations and revisions
Motivations for corpuses
• Corpus = collection of texts sharing
  • some factual characteristics:
    • natural language
    • transliteration and method for handling formatting information and non-textual elements
    • segmentation method
    • structuration method
  • some management information:
    • source (journal/volume, book/chapter…)
    • usage destination (send back, postedit, tests…)
Segmentation and structuration
• "segmentation"
  • = input texts —> words, sentences…
  • best done by the morphological analyzer
  • & units of translation
• "structuration"
  • = segmentation —> higher level units
  • paragraphs, sections, etc.
  • + production of a corresponding tree structure
• In Ariane-G5, up to 7 hierarchical separators for a given corpus
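Structuration with hierarchical separators can be pictured as a recursive split that yields a tree. In the sketch below the separator strings "##" and "//" are hypothetical, as is the `structure` function. Only the limit of 7 levels comes from the slide.

```python
MAX_LEVELS = 7  # the slide states up to 7 hierarchical separators per corpus

def structure(text, separators):
    """Split text on separators (highest level first) into nested lists."""
    if len(separators) > MAX_LEVELS:
        raise ValueError(f"at most {MAX_LEVELS} separator levels are allowed")
    if not separators:
        return text.strip()  # leaf: an unsplit stretch of text
    first, rest = separators[0], separators[1:]
    parts = [part for part in text.split(first) if part.strip()]
    return [structure(part, rest) for part in parts]

# Hypothetical markers: '##' between sections, '//' between paragraphs.
tree = structure("p1 // p2 ## p3", ["##", "//"])
```

Each nesting level of the result corresponds to one separator level, so the tree mirrors the sections and paragraphs of the input.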
Representation of texts, intermediate results, translations and revisions
• Corpus = list of text files + descriptor
• Text = (transliterated) text + descriptor
  • (+ non-textual elements replaced by tags or special occurrences)
• Intermediate result = list of decorated trees
  • + descriptor (lingware variant + interval processed)
• Translation = (transliterated) text + descriptor
  • (transliterated form may reduce morphological generation size)
• Revision = (transliterated) text + descriptor
  • (usually another, more natural transliteration)
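The objects listed above map naturally onto a few record types. The following is a minimal sketch, not Ariane-G5's storage format: all class and field names are assumptions, descriptors are plain dictionaries, and decorated trees are left opaque.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Text:
    content: str  # transliterated text, non-textual elements already replaced
    descriptor: dict = field(default_factory=dict)

@dataclass
class IntermediateResult:
    trees: list            # decorated trees (opaque in this sketch)
    lingware_variant: str  # which lingware variant produced the trees
    interval: tuple        # (first, last) translation units processed

@dataclass
class Translation(Text):
    revision: Optional[Text] = None  # human revision, if one was kept

@dataclass
class Corpus:
    descriptor: dict
    texts: list = field(default_factory=list)

corpus = Corpus(descriptor={"language": "FR", "usage": "postedit"})
corpus.texts.append(
    Text("Le capitaine a rapporté des tasses et des assiettes bleues")
)
result = IntermediateResult(trees=["<tree 1>"], lingware_variant="v1", interval=(1, 1))
translation = Translation("The captain has brought back blue bowls and plates")
```

Recording the lingware variant and the interval in each intermediate result is what later allows a system to decide whether that result can be reused.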
Functionalities during processing
• Ensuring coherence between lingware and results
• Stopping & restarting processing of a text
• Reusing intermediate results
  • recovery from interruptions
  • debugging
  • multitarget translation (analysis ≈ 2/3 of translation time)
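Reuse of intermediate results for multitarget translation can be illustrated with a small store keyed by text, phase, lingware variant and interval. This is a sketch under assumptions: the `ResultStore` class, its key layout and the placeholder "trees" are all hypothetical. Including the variant in the key is one simple way to keep results coherent with the lingware that produced them.

```python
class ResultStore:
    """Keep intermediate results so that a phase is not recomputed needlessly."""

    def __init__(self):
        self._results = {}

    def get_or_compute(self, text_id, phase, variant, interval, compute):
        key = (text_id, phase, variant, interval)
        if key not in self._results:
            self._results[key] = compute()
        return self._results[key]

calls = []

def analyse():
    calls.append("analysis")
    return ["tree-1", "tree-2"]  # placeholder for decorated trees

store = ResultStore()
outputs = {}
for target in ("EN", "DE"):
    # The analysis result is shared by every target language.
    trees = store.get_or_compute("text-1", "analysis", "v1", (1, 2), analyse)
    outputs[target] = [f"{target}:{tree}" for tree in trees]
```

With analysis taking about two thirds of translation time according to the slide, running it once for all targets is where most of the saving comes from.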
Conclusion and perspectives (1)
• Text & corpus handling in complete MT systems is quite complex and interesting…
  • handling texts and corpuses is not a straightforward problem,
  • it raises many interesting technological and scientific issues
Conclusion and perspectives (2)
• but more is coming:
• Synergy MT systems <—> TA (Translation Aids)
  • unification of the representations of texts in both worlds:
    • MT: revised texts structured as input texts, => the text data base will become a kind of multilevel translation memory (texts, translations/revisions, intermediate results)
    • TA: translation memories from "bags" to structured translation memories (keeping the sequential context)
  • both: multiple-layer translation memories
    • lemmatized forms
    • "concrete" syntactic trees & "abstract" logico-semantic trees
    • formatting tags
Conclusion and perspectives (3)
• Structuration may be used to “distribute the work” to MT and TA by segmenting according to the “best engine”
  • some sublanguages are good for MT, bad for TA
    • weather bulletins
  • others are good for TA, bad for MT
    • weather-related warnings, slightly modified versions of already translated documents
  • and others are best kept for specialists
    • fine-tuned legal sentences
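Distributing segments to the “best engine” could be as simple as a lookup from sublanguage to engine. In the sketch below the sublanguage labels, the `BEST_ENGINE` table and the fallback to a specialist are assumptions. Only the three example categories come from the slide.

```python
# Hypothetical labels; the MT / TA / specialist split follows the slide's examples.
BEST_ENGINE = {
    "weather_bulletin": "MT",
    "weather_warning": "TA",
    "near_duplicate": "TA",
    "legal": "specialist",
}

def dispatch(segments, default="specialist"):
    """Group (sublanguage, text) segments by the engine judged best for them."""
    work = {}
    for sublanguage, text in segments:
        engine = BEST_ENGINE.get(sublanguage, default)
        work.setdefault(engine, []).append(text)
    return work

work = dispatch([
    ("weather_bulletin", "segment 1"),
    ("legal", "segment 2"),
    ("weather_warning", "segment 3"),
    ("unknown", "segment 4"),
])
```

How a segment gets its sublanguage label is left open here. In practice that classification would rely on the corpus structure described earlier.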