Deep Grammars in Hybrid Machine Translation
Helge Dyvik
University of Bergen
Lexicon, Lexical Semantics, Grammar, and Translation for Norwegian
• A 4-year project (2002 - 2006) involving groups at: The University of Oslo, The University of Bergen, NTNU (The University of Trondheim)
• Cooperation with PARC (John Maxwell) and others
The LOGON system Schematic architecture
XLE: Xerox Linguistic Environment
• A platform developed over more than 20 years at Xerox PARC (now PARC)
• Developer: John Maxwell
• LFG grammar development
• Parsing
• Generation
• Transfer
• Stochastic parse selection
• Interaction with shallow methods
An LFG analysis: Det regnet 'It rained'
ParGram: The Parallel Grammar Project
A long-term project (1993-)
• Develops parallel grammars on XLE: English, French, German, Norwegian, Japanese, Urdu, Welsh, Malagasy, Arabic, Hungarian, Chinese, Vietnamese
• ‘Parallel grammars’ means parallel f-structures: a common inventory of features and common principles of analysis
LOGON Analysis (diagram with three columns: Modules, Output-input, Supporting knowledge base)
• Modules: Tokenization, Named entities, Compounds, Morphology; XLE Parser
• Output-input: Input string; String of stems and tags; c-structures; f-structures; MRSs
• Supporting knowledge base: Norsk ordbank lexicon; NorGram (LFG lexicons: NKL-derived, hand coded; Lexical templates; Syntactic rules; Rule templates)
Scope of NorGram
• Lexicon: about 80 000 lemmas. In addition: automatically analyzed compounds, automatically recognized proper names, "guessed" nouns
• Syntax: 229 complex rules, giving rise to about 48 000 arcs
• Semantics: Minimal Recursion Semantics projections for all readings
Coverage
Performance on a previously unseen corpus of newspaper text:
• 17 randomly selected pieces of text, limited to coherent text, comprising 1000 sentences
• taken from 9 newspapers: Adresseavisen, Aftenposten, Aftenposten nett, Bergens Tidende, Dagbladet, Dagens Næringsliv, Dagsavisen, Fædrelandsvennen, Nordlys
• from the editions of November 11th 2005
The LOGON challenge: From a resource grammar based on independent linguistic principles, derive MRS structures harmonized with the MRS structures of the HPSG English Resource Grammar.
Semantics for translation: Two issues
• The representational subset problem. Desirable: normalization to flat structures with unordered elements.
• Complete and detailed semantic analyses may be unnecessary. Desirable: rich possibilities of underspecification.
Basics of Minimal Recursion Semantics
• Developers: A. Copestake, D. Flickinger, R. Malouf, S. Riehemann, I. Sag
• A framework for the representation of semantic information
• Developed in the context of HPSG and machine translation (Verbmobil)
• Sources of inspiration:
  - Quasi-Logical Form (H. Alshawi): underspecification, e.g. of quantifier scope
  - Shake-and-bake translation (P. Whitelock): a bag of words as interface structure
An MRS representation
• is a bag of semantic entities (some corresponding to words, some not), each with a handle,
• plus a bag of handle constraints allowing the underspecification of scope,
• plus a handle and an index.
• Each semantic entity is referred to as an Elementary Predication (EP).
• Relations among EPs are captured by means of shared variables.
• There are three elementary variable types: handles (or 'labels') (h), events (e), referential indices (x)
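The ingredients listed above can be pictured as a small data structure. The following is only a minimal sketch: the class names `EP` and `MRS`, the field names, and the predicate names `_sove_v` and `_katt_n` are illustrative assumptions, not the LOGON or NorGram encoding.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EP:
    """Elementary Predication: a labelled predicate over variables."""
    handle: str   # the EP's own handle (label), e.g. "h1"
    pred: str     # predicate name (illustrative, not from NorGram)
    args: tuple   # variables: handles (h), events (e), referential indices (x)

@dataclass
class MRS:
    top: str                                    # top handle
    index: str                                  # top index
    rels: list = field(default_factory=list)    # bag of EPs
    hcons: list = field(default_factory=list)   # bag of handle constraints

def shared_variables(mrs):
    """Variables occurring in more than one EP; these relate the EPs."""
    seen = {}
    for ep in mrs.rels:
        for var in set(ep.args):
            seen.setdefault(var, []).append(ep.pred)
    return {var: preds for var, preds in seen.items() if len(preds) > 1}

# Toy MRS for a sentence like 'The cat sleeps' (quantifier omitted for brevity).
cat_sleeps = MRS(
    top="h0",
    index="e1",
    rels=[EP("h1", "_sove_v", ("e1", "x1")),
          EP("h2", "_katt_n", ("x1",))],
)
print(shared_variables(cat_sleeps))   # {'x1': ['_sove_v', '_katt_n']}
```

The shared index `x1` is what ties the noun EP to the verb EP; no nesting is needed.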
From standard logical form to MRS
«Every ferry crosses some fjord»
Two readings:
Replace operators with generalized quantifiers:
every(variable, restriction, body)
some(variable, restriction, body)
The first reading (wide-scope every): var restriction body
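The formulas on this slide did not survive extraction. As a reconstruction from the example sentence (an assumption, not the slide's own figure), the two readings in standard first-order form, and the first reading rewritten with generalized quantifiers, would look like this:

```latex
\begin{align*}
\text{wide-scope \emph{every}:}\quad
  & \forall x\,[\mathrm{ferry}(x) \rightarrow \exists y\,[\mathrm{fjord}(y) \wedge \mathrm{cross}(x,y)]] \\
\text{wide-scope \emph{some}:}\quad
  & \exists y\,[\mathrm{fjord}(y) \wedge \forall x\,[\mathrm{ferry}(x) \rightarrow \mathrm{cross}(x,y)]] \\
\text{first reading, generalized quantifiers:}\quad
  & \mathrm{every}(x,\ \mathrm{ferry}(x),\ \mathrm{some}(y,\ \mathrm{fjord}(y),\ \mathrm{cross}(x,y)))
\end{align*}
```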
Make the structure flat:
• give each EP a handle
• replace embedded EPs by their handles
• collect all EPs on the same level (understood as conjunction)
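The three flattening steps can be sketched in a few lines. This is an illustration only: the nested-tuple input format, the function name `flatten`, and the handle numbering are assumptions made for the example.

```python
import itertools

def flatten(term, eps, counter):
    """Flatten a nested term (pred, arg, ...) into a list of labelled EPs.

    Each EP gets a fresh handle; an argument that is itself a term is an
    embedded EP and is replaced by its handle. Returns the term's handle.
    """
    handle = f"h{next(counter)}"
    pred, *args = term
    flat_args = tuple(
        flatten(a, eps, counter) if isinstance(a, tuple) else a for a in args
    )
    eps.append((handle, pred, flat_args))   # all EPs end up on one level
    return handle

# Wide-scope 'every' reading: every(x, ferry(x), some(y, fjord(y), cross(x, y)))
nested = ("every", "x",
          ("ferry", "x"),
          ("some", "y", ("fjord", "y"), ("cross", "x", "y")))

eps = []
top = flatten(nested, eps, itertools.count(1))
for ep in sorted(eps):
    print(ep)
```

The resulting list is read as a conjunction; the original nesting is recoverable only through the handles in argument positions.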
Wide scope: every
Wide scope: some
Underspecified scope by means of handle constraints:
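How one underspecified structure yields both scoped readings can be sketched as follows. This is a deliberately simplified illustration: the handle names, the dictionary layout, and the shortcut of enumerating quantifier orders are assumptions, and a real scope resolver must also check variable binding and the handle constraints in general.

```python
from itertools import permutations

# Quantifier EPs: the restriction and body arguments are open holes.
quantifiers = {
    "h1": {"pred": "every", "var": "x", "restr": "h2", "body": "h3"},
    "h4": {"pred": "some",  "var": "y", "restr": "h5", "body": "h6"},
}
labelled = {"h7": "ferry(x)", "h8": "fjord(y)", "h9": "cross(x, y)"}
hcons = {"h2": "h7", "h5": "h8"}   # handle constraints: restriction hole -> noun label
verb_label = "h9"

def readings():
    """Yield (top handle, plugging of body holes, formula) per quantifier order."""
    for order in permutations(quantifiers):
        plugging = {}
        inner = verb_label
        formula = labelled[verb_label]          # innermost: the verb EP
        for h in reversed(order):               # wrap from the inside out
            q = quantifiers[h]
            plugging[q["body"]] = inner         # fill this quantifier's body hole
            restr = labelled[hcons[q["restr"]]]
            formula = f'{q["pred"]}({q["var"]}, {restr}, {formula})'
            inner = h
        yield inner, plugging, formula

for top, plugging, formula in readings():
    print(top, plugging, formula)
```

Both readings share the same bag of EPs; they differ only in which label plugs which body hole.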
Norwegian translation: «Hver ferge krysser en fjord» MRS as feature structure (also adding event variables):
Projecting MRS representations from f-structures «Katten sover» 'The cat sleeps'
Composition: Top-level MRS with unions of HCONS and RELS:
Post-processing this structure brings us back to the LOGON MRS format: http://decentius.aksis.uib.no/logon/xle-mrs.xml
bil 'car' (as in "Han kjøpte bil" 'He bought [a] car') No SPEC
disse hans mange spørsmål 'these his many questions' Multiple SPECs
Han jaget barnet ut nakent 'He chased the child out naked'
The Transfer Component Developer of the formalism: Stephan Oepen
Example of transfer
Source sentence:
Henter han bilen sin?
fetches he car.DEF POSS.REFL.SG.MASC
'Does he fetch his car?'
Alternative reading: 'Does he fetch the one of the car?'
Choosing the first reading of Henter han bilen sin? The variables have features. Interrogative is coded as [SF ques] on the event variable.
Two of four transfer outputs
Norwegian transfer input One of four English transfer outputs
Transfer formalism (Stephan Oepen)
The form of a transfer rule:
C = context
I = input
F = filter
O = output
Simple example: Lexical transfer rule, transferring bekk into creek No context, no filter, only the predicate is replaced.
Example with a context restriction: gå en tur (lit. 'go a trip') is transferred into the light-verb construction take a trip. In the context of _tur_n as its second argument, _gå_v is transferred to _take_v.
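The two example rules can be mimicked with a toy rewriter over a bag of EPs. This is a sketch under stated assumptions, not the LOGON transfer engine: the `Rule` class, the dictionary encoding of EPs, the predicate names `_bekk_n` and `_creek_n`, and the reading of the filter as "block the rule if this predicate is present" are all illustrative. Only `_gå_v`, `_tur_n` and `_take_v` come from the slide.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Rule:
    input: str                     # I: predicate to be replaced
    output: str                    # O: replacement predicate
    context: Optional[str] = None  # C: predicate that must fill ARG2 of the input EP
    filter: Optional[str] = None   # F: (assumed) predicate whose presence blocks the rule

def apply_rule(rule, eps):
    """Return a new EP list with `rule` applied to every matching EP."""
    if rule.filter is not None and any(ep["pred"] == rule.filter for ep in eps):
        return list(eps)
    result = []
    for ep in eps:
        matches = ep["pred"] == rule.input
        if matches and rule.context is not None:
            arg2 = ep.get("ARG2")
            # The context EP is only checked, never rewritten.
            matches = arg2 is not None and any(
                c["pred"] == rule.context and c.get("ARG0") == arg2 for c in eps
            )
        result.append({**ep, "pred": rule.output} if matches else ep)
    return result

bekk_rule = Rule("_bekk_n", "_creek_n")                  # no context, no filter
walk_rule = Rule("_gå_v", "_take_v", context="_tur_n")   # context restriction

eps = [
    {"pred": "_gå_v", "ARG0": "e1", "ARG1": "x1", "ARG2": "x2"},
    {"pred": "_tur_n", "ARG0": "x2"},
]
print([ep["pred"] for ep in apply_rule(walk_rule, eps)])   # ['_take_v', '_tur_n']
```

Without `_tur_n` as the second argument, `walk_rule` leaves `_gå_v` untouched, so a separate default rule would be needed for the ordinary motion sense.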
The SEM-I (Semantic Interface)
A documentation of the external semantic interface of a grammar, crucial for the writer of transfer rules.
To enforce maintenance of the SEM-I, LOGON parsing fails if every parse contains at least one predicate that is not in the SEM-I.
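The failure condition can be stated as a short check. This is a sketch only: the function names and the toy predicate inventory are assumptions, and each parse is reduced to its list of predicate names.

```python
def unknown_predicates(parse, sem_i):
    """Predicates of one parse that are missing from the SEM-I."""
    return {pred for pred in parse if pred not in sem_i}

def parsing_fails(parses, sem_i):
    """True if every parse contains at least one predicate not in the SEM-I."""
    return all(unknown_predicates(parse, sem_i) for parse in parses)

# Toy SEM-I and parses; the predicate names are hypothetical.
sem_i = {"_gå_v", "_tur_n", "_hente_v"}
parses = [
    ["_gå_v", "_tur_n"],     # fully covered by the SEM-I
    ["_gå_v", "_ture_v"],    # contains an undocumented predicate
]
print(parsing_fails(parses, sem_i))             # False: one parse is covered
print(parsing_fails([["_ture_v"]], sem_i))      # True: no parse is covered
```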
A small section of the verb part of the NorGram SEM-I
Size of the Norwegian SEM-I: slightly fewer than 6000 entries
Parse Selection
Parsing, transfer and generation may each give many solutions, leading to a fanout tree:
The outputs at each of the three stages are statistically ranked.
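A fanout tree with ranking at each stage can be sketched as nested n-best loops. Everything here is an assumption made for illustration: the stage functions are toy stand-ins, the cutoff `n` is arbitrary, and combining stage scores by multiplication is a placeholder rather than the ranking model used in LOGON.

```python
import heapq

def n_best(candidates, n):
    """Keep the n highest-scoring (score, item) pairs, best first."""
    return heapq.nlargest(n, candidates, key=lambda c: c[0])

def fanout(sentence, parse, transfer, generate, n=2):
    """Walk the parse -> transfer -> generate tree, pruning each stage to n."""
    results = []
    for p_score, source_mrs in n_best(parse(sentence), n):
        for t_score, target_mrs in n_best(transfer(source_mrs), n):
            for g_score, text in n_best(generate(target_mrs), n):
                # Assumed combination of stage scores: a simple product.
                results.append((p_score * t_score * g_score, text))
    return sorted(results, reverse=True)

# Toy stages returning (score, output) pairs.
def toy_parse(s):
    return [(0.6, s + "/p1"), (0.3, s + "/p2"), (0.1, s + "/p3")]

def toy_transfer(m):
    return [(0.7, m + "/t1"), (0.3, m + "/t2")]

def toy_generate(m):
    return [(1.0, m + "/g1")]

ranked = fanout("s", toy_parse, toy_transfer, toy_generate, n=2)
print(ranked[0][1])   # s/p1/t1/g1
print(len(ranked))    # 4 leaves survive: 2 parses x 2 transfers x 1 realization
```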
The Parsebanker
Efficient treebank building by discriminants
Developer: Paul Meurer, Bergen
Predecessors in discriminant analysis: David Carter (1997); Stephan Oepen, Dan Flickinger et al. (2003)
Example of a four-way ambiguity: Det regnet 'It rained' / 'It calculated' / 'That one calculated' / 'That rain'
(Figure: the four analyses of Det regnet, numbered 1 to 4)
Packed representations and discriminants (Paul Meurer)
Clicking on one discriminant is in this case sufficient to select a unique solution:
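The idea of selecting by discriminants can be sketched with sets of properties. The property labels below are invented for illustration and are not the Parsebanker's actual discriminants; only the four glosses of Det regnet come from the slides.

```python
# Each analysis is described by a set of (hypothetical) distinguishing properties.
analyses = {
    "It rained":           {"regnet=verb", "det=expletive", "regne=rain"},
    "It calculated":       {"regnet=verb", "det=pronoun", "regne=calculate"},
    "That one calculated": {"regnet=verb", "det=demonstrative", "regne=calculate"},
    "That rain":           {"regnet=noun", "det=determiner"},
}

def discriminants(analyses):
    """Properties that hold in some but not all of the remaining analyses."""
    sets = list(analyses.values())
    return set.union(*sets) - set.intersection(*sets)

def choose(analyses, prop, accept=True):
    """Keep the analyses that have (accept) or lack (reject) the property."""
    return {name: props for name, props in analyses.items()
            if (prop in props) == accept}

# Accepting a single discriminant can be enough to leave one analysis.
remaining = choose(analyses, "det=expletive")
print(list(remaining))            # ['It rained']
print(discriminants(remaining))   # set(): nothing left to decide
```

Rejecting a discriminant works the same way: `choose(analyses, "regnet=verb", accept=False)` leaves only the noun-phrase reading.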