Deep Grammars in Hybrid Machine Translation

Helge Dyvik, University of Bergen

Lexicon, Lexical Semantics, Grammar, and Translation for Norwegian
• A 4-year project (2002-2006) involving groups at:
  - The University of Oslo
  - The University of Bergen
  - NTNU (The University of Trondheim)
• Cooperation with PARC (John Maxwell) and others

The LOGON system Schematic architecture

XLE: Xerox Linguistic Environment
• A platform developed over more than 20 years at Xerox PARC (now PARC)
• Developer: John Maxwell
• LFG grammar development
• Parsing
• Generation
• Transfer
• Stochastic parse selection
• Interaction with shallow methods

An LFG analysis: Det regnet 'It rained'

ParGram: The Parallel Grammar Project
• A long-term project (1993-)
• Develops parallel grammars on XLE: English, French, German, Norwegian, Japanese, Urdu, Welsh, Malagasy, Arabic, Hungarian, Chinese, Vietnamese
• ‘Parallel grammars’ means parallel f-structures:
  - A common inventory of features
  - Common principles of analysis

LOGON Analysis
[Diagram with three columns]
• Modules: tokenization, named entities, compounds, morphology; XLE Parser (NorGram)
• Output-input: input string; string of stems and tags; c-structures; f-structures; MRSs
• Supporting knowledge base: Norsk ordbank lexicon; LFG lexicons (NKL-derived, hand-coded); lexical templates; syntactic rules; rule templates

Scope of NorGram
• Lexicon: about 80 000 lemmas. In addition:
  - Automatically analyzed compounds
  - Automatically recognized proper names
  - "Guessed" nouns
• Syntax: 229 complex rules, giving rise to about 48 000 arcs
• Semantics: Minimal Recursion Semantics projections for all readings

Coverage
• Performance on a previously unseen corpus of newspaper text:
  - 17 randomly selected pieces of text, limited to coherent text
  - comprising 1000 sentences
  - taken from 9 newspapers: Adresseavisen, Aftenposten, Aftenposten nett, Bergens Tidende, Dagbladet, Dagens Næringsliv, Dagsavisen, Fædrelandsvennen, Nordlys
  - from the editions of November 11th, 2005

The LOGON challenge: From a resource grammar based on independent linguistic principles, derive MRS structures harmonized with the MRS structures of the HPSG English Resource Grammar.

Semantics for translation: two issues
• The representational subset problem. Desirable: normalization to flat structures with unordered elements.
• Complete and detailed semantic analyses may be unnecessary. Desirable: rich possibilities of underspecification.

Basics of Minimal Recursion Semantics
• Developers: A. Copestake, D. Flickinger, R. Malouf, S. Riehemann, I. Sag
• A framework for the representation of semantic information
• Developed in the context of HPSG and machine translation (Verbmobil)
• Sources of inspiration:
  - Quasi-Logical Form (H. Alshawi): underspecification, e.g. of quantifier scope
  - Shake-and-bake translation (P. Whitelock): a bag of words as interface structure

An MRS representation
• is a bag of semantic entities (some corresponding to words, some not), each with a handle,
• plus a bag of handle constraints allowing the underspecification of scope,
• plus a handle and an index.
• Each semantic entity is referred to as an Elementary Predication (EP).
• Relations among EPs are captured by means of shared variables.
• There are three elementary variable types:
  - handles (or 'labels') (h)
  - events (e)
  - referential indices (x)
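The definition above can be sketched as a small data structure. This is a minimal illustration only: the class names, the predicate names (`_sove_v`, `_katt_n`) and the variable names are assumptions made for the example, not the LOGON or NorGram encoding.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EP:
    """Elementary Predication: a predicate with its own handle and arguments."""
    label: str          # handle (h) of this EP
    pred: str           # predicate name (hypothetical naming)
    args: tuple = ()    # handles (h), events (e) or referential indices (x)

@dataclass
class MRS:
    top: str                                    # the top handle
    index: str                                  # the index
    rels: list = field(default_factory=list)    # bag of EPs
    hcons: list = field(default_factory=list)   # bag of handle constraints

    def shared_variables(self):
        """Variables used by more than one EP; these link the EPs together."""
        seen = {}
        for ep in self.rels:
            for var in set(ep.args):
                seen.setdefault(var, []).append(ep.pred)
        return {v: preds for v, preds in seen.items() if len(preds) > 1}

# A toy bag of two EPs linked by the shared index x2.
cat_sleeps = MRS(top="h0", index="e1", rels=[
    EP("h1", "_sove_v", ("e1", "x2")),
    EP("h3", "_katt_n", ("x2",)),
])
print(cat_sleeps.shared_variables())
```

The only link between the two EPs is the variable `x2`, which is what "relations among EPs are captured by means of shared variables" amounts to in a flat bag.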

From standard logical form to MRS
«Every ferry crosses some fjord»
Two readings: [formulas not preserved in this transcript]
Replace operators with generalized quantifiers:
  every(variable, restriction, body)
  some(variable, restriction, body)
The first reading (wide-scope every): [formula annotated with var, restriction and body; not preserved in this transcript]
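A standard rendering of the two readings, given here as a reconstruction in textbook notation rather than as the slide's own formulas, would be:

```latex
\begin{align*}
\text{wide-scope \emph{every}:}\quad
  & \forall x\,[\mathit{ferry}(x) \rightarrow \exists y\,[\mathit{fjord}(y) \wedge \mathit{cross}(x,y)]] \\
\text{wide-scope \emph{some}:}\quad
  & \exists y\,[\mathit{fjord}(y) \wedge \forall x\,[\mathit{ferry}(x) \rightarrow \mathit{cross}(x,y)]]
\end{align*}
% With generalized quantifiers, the first reading becomes:
%   every(x, ferry(x), some(y, fjord(y), cross(x, y)))
```

In the generalized-quantifier form the connectives disappear: each quantifier simply takes a variable, a restriction and a body.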

Make the structure flat:
• give each EP a handle
• replace embedded EPs by their handles
• collect all EPs on the same level (understood as conjunction)
The flat structure can be given with wide-scope every, with wide-scope some, or with scope left underspecified by means of handle constraints.
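The flattening steps can be sketched in a few lines. The handle names and the dictionary layout are assumptions made for illustration, and the handle constraints are simplified to plain equalities between a restriction hole and a label. The two body holes are left open, and each way of plugging them yields one reading.

```python
from itertools import permutations

# Flat bag of EPs, keyed by handle. Quantifier EPs have the shape
# (predicate, variable, restriction hole, body hole).
eps = {
    "h1": ("every", "x", "h2", "h3"),
    "h4": ("ferry", "x"),
    "h5": ("some", "y", "h6", "h7"),
    "h8": ("fjord", "y"),
    "h9": ("cross", "x", "y"),
}
# Simplified handle constraints: each restriction hole is filled by a noun EP.
restrictions = {"h2": "h4", "h6": "h8"}
quantifiers = ["h1", "h5"]
verb = "h9"

def resolve(order):
    """Plug the open body holes so the quantifiers nest in the given order."""
    plug = dict(restrictions)
    for outer, inner in zip(order, order[1:] + (verb,)):
        plug[eps[outer][3]] = inner   # body hole of `outer` gets `inner`
    return plug

def render(label, plug):
    """Rebuild the nested formula from the flat bag and a plugging."""
    pred, *args = eps[label]
    parts = [render(plug[a], plug) if a in plug else a for a in args]
    return f"{pred}({', '.join(parts)})"

readings = [render(order[0], resolve(order)) for order in permutations(quantifiers)]
for reading in readings:
    print(reading)
```

The bag `eps` is the same for both readings; only the plugging of `h3` and `h7` differs, which is the sense in which the flat structure leaves scope underspecified.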

Norwegian translation: «Hver ferge krysser en fjord» MRS as feature structure (also adding event variables):

Projecting MRS representations from f-structures «Katten sover» 'The cat sleeps'


Composition: Top-level MRS with unions of HCONS and RELS:  
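A minimal sketch of this union step, assuming each constituent contributes a partial MRS with its own RELS and HCONS. The dictionary layout, the predicate names (`def_q`, `_katt_n`, `_sove_v`) and the `qeq` constraint label are illustrative assumptions, not the XLE projection notation.

```python
def compose(parts, top, index):
    """Build a top-level MRS whose RELS and HCONS are the unions of the parts'."""
    rels, hcons = [], []
    for part in parts:
        for ep in part["rels"]:
            if ep not in rels:        # union: keep each EP once
                rels.append(ep)
        for hc in part["hcons"]:
            if hc not in hcons:       # union: keep each constraint once
                hcons.append(hc)
    return {"top": top, "index": index, "rels": rels, "hcons": hcons}

# Hypothetical partial MRSs for the pieces of «Katten sover».
det = {"rels": [("h4", "def_q", ("x2", "h5", "h6"))],
       "hcons": [("h5", "qeq", "h3")]}
noun = {"rels": [("h3", "_katt_n", ("x2",))], "hcons": []}
verb = {"rels": [("h1", "_sove_v", ("e1", "x2"))], "hcons": []}

sentence = compose([det, noun, verb], top="h0", index="e1")
print(sentence["rels"])
print(sentence["hcons"])
```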

Post-processing this structure brings us back to the LOGON MRS format: http://decentius.aksis.uib.no/logon/xle-mrs.xml

Examples

bil 'car' (as in "Han kjøpte bil" 'He bought [a] car') No SPEC

disse hans mange spørsmål 'these his many questions' Multiple SPECs

Han jaget barnet ut nakent 'He chased the child out naked'

The Transfer Component Developer of the formalism: Stephan Oepen

Example of transfer
Source sentence: Henter han bilen sin?
Gloss: fetches he car.DEF POSS.REFL.SG.MASC
'Does he fetch his car?'
Alternative reading: 'Does he fetch the one of the car?'

Parse output:

Choosing the first reading of Henter han bilen sin? The variables have features. Interrogative is coded as [SF ques] on the event variable.

Two of four transfer outputs

Norwegian transfer input One of four English transfer outputs

Generator output from the chosen transfer output

Transfer formalism (Stephan Oepen)
The form of a transfer rule:
  C = context
  I = input
  F = filter
  O = output

Simple example: Lexical transfer rule, transferring bekk into creek No context, no filter, only the predicate is replaced.

Example with a context restriction: gå en tur (lit. 'go a trip') is transferred into the light-verb construction take a trip. In the context of _tur_n as its second argument, _gå_v is transferred to _take_v.
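The two examples can be mimicked with a toy rule applier. This is a sketch under strong simplifying assumptions: EPs are `(predicate, arguments)` pairs, a context is reduced to "some EP with this predicate fills a given argument position of the input EP", and the `Rule` class and `apply_rule` function are hypothetical names, not part of the actual transfer formalism.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    input: str                        # I: predicate to be replaced
    output: str                       # O: predicate that replaces it
    context: Optional[tuple] = None   # C: (predicate, argument position in I)
    filter: Optional[str] = None      # F: predicate that must NOT be present

def context_ok(rule, args, eps):
    """True if there is no context, or the context EP fills the required slot."""
    if rule.context is None:
        return True
    cpred, pos = rule.context
    if pos >= len(args):
        return False
    return any(p == cpred and a and a[0] == args[pos] for p, a in eps)

def apply_rule(rule, eps):
    """Return a new EP list with the rule applied wherever it matches."""
    if rule.filter is not None and any(p == rule.filter for p, _ in eps):
        return list(eps)              # the filter blocks the rule
    return [(rule.output, args) if pred == rule.input and context_ok(rule, args, eps)
            else (pred, args)
            for pred, args in eps]

# Lexical rule: no context, no filter, only the predicate is replaced.
bekk = Rule(input="_bekk_n", output="_creek_n")
# Context rule: _gå_v becomes _take_v when _tur_n is its second argument
# (position 2, counting the event variable as position 0).
gaa = Rule(input="_gå_v", output="_take_v", context=("_tur_n", 2))

walk = [("_gå_v", ("e1", "x2", "x3")), ("_tur_n", ("x3",))]
print(apply_rule(bekk, [("_bekk_n", ("x1",))]))
print(apply_rule(gaa, walk))
```

In this sketch `_tur_n` is left untouched by the context rule; a separate lexical rule would be needed to transfer it.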

The SEM-I (Semantic Interface)
The SEM-I documents the external semantic interface of a grammar and is crucial for the writer of transfer rules. To enforce maintenance of the SEM-I, LOGON parsing fails if every parse contains at least one predicate that is not in the SEM-I.
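The failure condition is easy to state in code. In this sketch each parse is reduced to the set of predicates in its MRS, and the function name and predicate names are hypothetical.

```python
def parsing_succeeds(parses, semi):
    """Succeed if at least one parse uses only predicates listed in the SEM-I.

    Equivalently: fail if every parse contains some predicate not in the SEM-I.
    """
    return any(all(pred in semi for pred in parse) for parse in parses)

semi = {"_hente_v", "_bil_n", "pron"}           # toy SEM-I
parses_ok = [{"_hente_v", "_bil_n", "pron"},    # fully covered
             {"_hente_v", "_ukjent_n"}]         # contains an unlisted predicate
parses_bad = [{"_hente_v", "_ukjent_n"},
              {"_ukjent_v", "pron"}]

print(parsing_succeeds(parses_ok, semi))
print(parsing_succeeds(parses_bad, semi))
```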

A small section of the verb part of the NorGram SEM-I Size of the Norwegian SEM-I: slightly less than 6000 entries

Parse Selection
Parsing, transfer and generation may each give many solutions, leading to a fanout tree. The outputs at each of the three stages are statistically ranked.
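One simple way to walk such a fanout tree is to keep only the best few candidates after each stage. The sketch below assumes that each stage returns scored candidates and that scores are combined by multiplication. Both are assumptions made for illustration, and the slides do not say how LOGON combines the three rankings.

```python
def fanout(source, stages, beam=2):
    """Expand `source` through the stages, keeping the `beam` best at each step.

    Each stage maps an item to a list of (score, result) pairs.
    """
    frontier = [(1.0, source)]
    for stage in stages:
        expanded = [(score * s, result)
                    for score, item in frontier
                    for s, result in stage(item)]
        frontier = sorted(expanded, key=lambda pair: pair[0], reverse=True)[:beam]
    return frontier

# Toy stages with made-up scores, standing in for the real components.
def parse(s):
    return [(0.7, s + "/p1"), (0.3, s + "/p2")]

def transfer(m):
    return [(0.6, m + "/t1"), (0.4, m + "/t2")]

def generate(m):
    return [(0.9, m + "/g1"), (0.1, m + "/g2")]

best = fanout("s", [parse, transfer, generate], beam=2)
for score, output in best:
    print(output)
```

With a beam of 2, the full tree of eight leaves is never built; only the highest-ranked branches at each stage are expanded.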

The Parsebanker
Efficient treebank building by discriminants
Developer: Paul Meurer, Bergen
Predecessors in discriminant analysis:
  - David Carter (1997)
  - Stephan Oepen, Dan Flickinger et al. (2003)
Example of a four-way ambiguity: Det regnet 'It rained'/'It calculated'/'That one calculated'/'That rain'

Packed representations and discriminants (Paul Meurer)

Clicking on one discriminant is in this case sufficient to select a unique solution:
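The idea behind discriminant-based selection can be sketched as follows. Each analysis is reduced to a set of properties, a discriminant is any property that holds in some but not all analyses, and accepting or rejecting one filters the set. The property names are invented for the example and are not the discriminants the Parsebanker actually displays.

```python
def discriminants(analyses):
    """Properties that hold in some analyses but not in all of them."""
    all_props = set().union(*analyses.values())
    common = set.intersection(*analyses.values())
    return all_props - common

def choose(analyses, prop, accept=True):
    """Keep the analyses that have (or, with accept=False, lack) the property."""
    return {name: props for name, props in analyses.items()
            if (prop in props) == accept}

# The four readings of «Det regnet», with hypothetical properties.
analyses = {
    "It rained":           {"regnet=verb", "det=expletive", "sense=rain"},
    "It calculated":       {"regnet=verb", "det=pronoun", "sense=calculate"},
    "That one calculated": {"regnet=verb", "det=demonstrative", "sense=calculate"},
    "That rain":           {"regnet=noun", "det=determiner"},
}

print(sorted(discriminants(analyses)))
print(list(choose(analyses, "det=expletive")))
print(list(choose(analyses, "regnet=verb", accept=False)))
```

Here a single click suffices because `det=expletive` holds in exactly one analysis; a discriminant such as `sense=calculate` would instead leave two candidates and require a second decision.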
