An overview of LOGON, a hybrid machine translation project for Norwegian (2002-2006), and of the ParGram project, which has used XLE to develop parallel LFG grammars for multiple languages since 1993. Topics include the scope and coverage of the NorGram grammar, the challenge of deriving MRS structures from a resource grammar built on independent linguistic principles, and the way Minimal Recursion Semantics supports underspecification and flat semantic structures for translation.
Deep Grammars in Hybrid Machine Translation Helge Dyvik University of Bergen
Lexicon, Lexical Semantics, Grammar, and Translation for Norwegian • A 4-year project (2002 - 2006) involving groups at: • The University of Oslo • The University of Bergen • NTNU (The University of Trondheim) • Cooperation with PARC (John Maxwell) and others
The LOGON system Schematic architecture
XLE: Xerox Linguistic Environment • A platform developed over more than 20 years at Xerox PARC (now PARC) • Developer: John Maxwell • LFG grammar development • Parsing • Generation • Transfer • Stochastic parse selection • Interaction with shallow methods
An LFG analysis: Det regnet 'It rained'
ParGram: The Parallel Grammar Project A long-term project (1993-) • Develops parallel grammars on XLE: • English, French, German, Norwegian, Japanese, Urdu, Welsh, Malagasy, Arabic, Hungarian, Chinese, Vietnamese • ‘Parallel grammars’ means parallel f-structures: A common inventory of features Common principles of analysis
LOGON Analysis (schematic, with columns for modules, output-input, and supporting knowledge base) • Input string • Preprocessing modules: tokenization, named entities, compounds, morphology (knowledge base: Norsk ordbank lexicon) • String of stems and tags • XLE Parser (knowledge base: NorGram, with LFG lexicons, both NKL-derived and hand coded, lexical templates, syntactic rules, and rule templates) • Outputs: c-structures, f-structures, MRSs
Scope of NorGram • Lexicon: about 80 000 lemmas. In addition: Automatically analyzed compounds Automatically recognized proper names "Guessed" nouns • Syntax: 229 complex rules, giving rise to about 48 000 arcs • Semantics: Minimal Recursion Semantics projections for all readings
Coverage • Performance on an unseen corpus of newspaper text: • 17 randomly selected pieces of text, limited to coherent text • comprising 1000 sentences • taken from 9 newspapers (Adresseavisen, Aftenposten, Aftenposten nett, Bergens Tidende, Dagbladet, Dagens Næringsliv, Dagsavisen, Fædrelandsvennen, Nordlys) • from the editions of November 11th 2005.
The LOGON challenge: From a resource grammar based on independent linguistic principles, derive MRS structures harmonized with the MRS structures of the HPSG English Resource Grammar.
Semantics for translation: two issues • The representational subset problem. Desirable: normalization to flat structures with unordered elements. • Complete and detailed semantic analyses may be unnecessary. Desirable: rich possibilities of underspecification.
Basics of Minimal Recursion Semantics • Developers: A. Copestake, D. Flickinger, R. Malouf, S. Riehemann, I. Sag • A framework for the representation of semantic information • Developed in the context of HPSG and machine translation (Verbmobil) • Sources of inspiration: - Quasi-Logical Form (H. Alshawi): underspecification, e.g. of quantifier scope - Shake-and-bake translation (P. Whitelock): a bag of words as interface structure
An MRS representation • is a bag of semantic entities (some corresponding to words, some not), each with a handle, • plus a bag of handle constraints allowing the underspecification of scope, • plus a handle and an index. • Each semantic entity is referred to as an Elementary Predication (EP). • Relations among EPs are captured by means of shared variables. • There are three elementary variable types: - handles (or 'labels') (h) - events (e) - referential indices (x)
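The definition above can be mirrored in a small data structure. The following is a minimal sketch, not the LOGON or XLE representation: the class names `EP` and `MRS`, the helper functions, the variable numbering, and the `qeq` constraint tuples are all assumptions made for illustration. It encodes «Every ferry crosses some fjord» with scope left open and shows how shared variables link EPs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EP:
    """Elementary Predication: a predicate with a handle and argument variables."""
    label: str   # handle of this EP, e.g. "h4"
    pred: str    # predicate name (illustrative spelling)
    args: tuple  # variables: handles (h), events (e), referential indices (x)


@dataclass
class MRS:
    top: str                                   # top handle
    index: str                                 # index variable
    rels: list = field(default_factory=list)   # bag of EPs
    hcons: list = field(default_factory=list)  # bag of handle constraints


def variable_type(var):
    """Classify a variable by its first letter: h, e or x."""
    return {"h": "handle", "e": "event", "x": "referential index"}[var[0]]


def shared_variables(mrs):
    """Map each non-handle variable used by several EPs to those EPs' predicates."""
    users = {}
    for ep in mrs.rels:
        for var in ep.args:
            if variable_type(var) != "handle":
                users.setdefault(var, []).append(ep.pred)
    return {var: preds for var, preds in users.items() if len(preds) > 1}


# Hypothetical encoding; handle numbering and constraint form are assumptions.
ferry_mrs = MRS(
    top="h0",
    index="e1",
    rels=[
        EP("h1", "every", ("x1", "h2", "h3")),
        EP("h4", "ferry", ("x1",)),
        EP("h5", "some", ("x2", "h6", "h7")),
        EP("h8", "fjord", ("x2",)),
        EP("h9", "cross", ("e1", "x1", "x2")),
    ],
    hcons=[("h2", "qeq", "h4"), ("h6", "qeq", "h8")],
)

links = shared_variables(ferry_mrs)
```

Here `links["x1"]` lists the quantifier, the noun and the verb, which is how the flat bag still records that the ferry is the thing doing the crossing.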
From standard logical form to MRS «Every ferry crosses some fjord» Two readings: wide-scope every and wide-scope some. • Replace operators with generalized quantifiers: every(variable, restriction, body), some(variable, restriction, body) • The first reading (wide-scope every) fills the three slots variable, restriction and body.
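For reference, the two readings can be written out as follows. This is the standard textbook rendering of the example, given here as a reconstruction; the variable names x and y and the predicate spellings are assumptions.

```latex
% Standard logical form
\text{wide-scope every:}\quad \forall x\,\bigl(\mathit{ferry}(x) \rightarrow \exists y\,(\mathit{fjord}(y) \wedge \mathit{cross}(x,y))\bigr)
\text{wide-scope some:}\quad \exists y\,\bigl(\mathit{fjord}(y) \wedge \forall x\,(\mathit{ferry}(x) \rightarrow \mathit{cross}(x,y))\bigr)

% With generalized quantifiers: quantifier(variable, restriction, body)
\text{wide-scope every:}\quad \mathit{every}\bigl(x,\ \mathit{ferry}(x),\ \mathit{some}(y,\ \mathit{fjord}(y),\ \mathit{cross}(x,y))\bigr)
\text{wide-scope some:}\quad \mathit{some}\bigl(y,\ \mathit{fjord}(y),\ \mathit{every}(x,\ \mathit{ferry}(x),\ \mathit{cross}(x,y))\bigr)
```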
Make the structure flat: • give each EP a handle • replace embedded EPs by their handles • collect all EPs on the same level (understood as conjunction)
Flat structures: • Wide scope: every • Wide scope: some • Underspecified scope by means of handle constraints:
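The three flattening steps can be sketched in a few lines. This is an illustrative sketch under stated assumptions: nested formulas are written as Python tuples, the function name `flatten` and the handle numbering are hypothetical, and only a fully scoped reading is flattened. In the underspecified variant the body arguments would instead be left open and related to EP handles by handle constraints, which is not shown here.

```python
from itertools import count


def flatten(term):
    """Flatten a nested term into a list of (handle, predicate, arguments) EPs."""
    counter = count(1)
    eps = []

    def visit(node):
        handle = f"h{next(counter)}"   # step 1: give each EP a handle
        pred, *args = node
        slot = len(eps)
        eps.append(None)               # reserve a place so outer EPs are listed first
        # step 2: replace embedded EPs by their handles
        flat_args = tuple(visit(a) if isinstance(a, tuple) else a for a in args)
        eps[slot] = (handle, pred, flat_args)
        return handle

    top = visit(term)
    return top, eps                    # step 3: all EPs collected on one level


# Wide-scope 'every' reading, written as a nested tuple (illustrative spelling).
wide_every = ("every", "x", ("ferry", "x"),
              ("some", "y", ("fjord", "y"), ("cross", "x", "y")))

top, eps = flatten(wide_every)
```

After the call, `eps` holds five EPs on one level, and the nesting of the original formula survives only in the handle arguments of `every` and `some`.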
Norwegian translation: «Hver ferge krysser en fjord» MRS as feature structure (also adding event variables):
Projecting MRS representations from f-structures «Katten sover» 'The cat sleeps'
Composition: Top-level MRS with unions of HCONS and RELS:
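The union step can be illustrated with a toy composition for «Katten sover». This is a sketch only: the dictionary layout, the function name `compose`, the predicate spellings `def_q`, `_katt_n` and `_sove_v`, and the variable numbering are assumptions, not the output of NorGram. Each part of the f-structure is taken to contribute a partial MRS, and the top-level MRS gathers their RELS and HCONS as bags.

```python
def compose(top, index, parts):
    """Build a top-level MRS whose RELS and HCONS are the unions of the parts'."""
    rels, hcons = [], []
    for part in parts:
        rels.extend(part["rels"])    # bag union of elementary predications
        hcons.extend(part["hcons"])  # bag union of handle constraints
    return {"top": top, "index": index, "rels": rels, "hcons": hcons}


# Hypothetical contributions of the subject and the verb.
subject = {
    "rels": [("h1", "def_q", ("x1", "h2", "h3")), ("h4", "_katt_n", ("x1",))],
    "hcons": [("h2", "qeq", "h4")],
}
verb = {
    "rels": [("h5", "_sove_v", ("e1", "x1"))],
    "hcons": [],
}

katten_sover = compose("h0", "e1", [subject, verb])
```

The shared variable `x1` is what ties the subject's EPs to the verb's argument once everything sits in one flat bag.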
Post-processing this structure brings us back to the LOGON MRS format: http://decentius.aksis.uib.no/logon/xle-mrs.xml
bil 'car' (as in "Han kjøpte bil" 'He bought [a] car') No SPEC
disse hans mange spørsmål 'these his many questions' Multiple SPECs
Han jaget barnet ut nakent 'He chased the child out naked'
The Transfer Component Developer of the formalism: Stephan Oepen
Example of transfer • Source sentence: Henter han bilen sin? • Gloss: fetches he car.DEF POSS.REFL.SG.MASC • Translation: 'Does he fetch his car?' • Alternative reading: 'Does he fetch the one of the car?'
Choosing the first reading of Henter han bilen sin? The variables have features. Interrogative is coded as [SF ques] on the event variable.
Two of four transfer outputs
Norwegian transfer input One of four English transfer outputs
Transfer formalism (Stephan Oepen) The form of a transfer rule: C = context I = input F = filter O = output
Simple example: Lexical transfer rule, transferring bekk into creek No context, no filter, only the predicate is replaced.
Example with a context restriction: gå en tur (lit. 'go a trip') is transferred into the light-verb construction take a trip. In the context of _tur_n as its second argument, _gå_v is transferred to _take_v.
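The rule shape (context, input, filter, output) and the two examples can be mimicked in a simplified form. This sketch is not the LOGON transfer formalism or its syntax: the class `TransferRule`, the function `apply_rule`, the field names, the argument layout (event, first argument, second argument), and the spellings `_bekk_n` and `_creek_n` are assumptions for illustration. Only `_tur_n`, `_gå_v` and `_take_v` come from the example above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TransferRule:
    input_pred: str                        # I: predicate to be rewritten
    output_pred: str                       # O: predicate that replaces it
    context: Optional[Callable] = None     # C: must hold for the rule to apply
    blocked_by: Optional[Callable] = None  # F: rule does not apply if this holds


def apply_rule(rule, eps):
    """Rewrite matching EPs; each EP is a (predicate, arguments) pair."""
    result = []
    for ep in eps:
        pred, args = ep
        applies = (
            pred == rule.input_pred
            and (rule.context is None or rule.context(ep, eps))
            and (rule.blocked_by is None or not rule.blocked_by(ep, eps))
        )
        result.append((rule.output_pred, args) if applies else ep)
    return result


def tur_is_second_argument(ep, eps):
    """Context test: the verb's second argument is the variable of a _tur_n EP."""
    _, args = ep
    return len(args) > 2 and any(p == "_tur_n" and a[0] == args[2] for p, a in eps)


# No context, no filter: only the predicate is replaced.
bekk_rule = TransferRule("_bekk_n", "_creek_n")
# Context restriction: only with _tur_n as second argument.
gaa_rule = TransferRule("_gå_v", "_take_v", context=tur_is_second_argument)

walk = [("_gå_v", ("e1", "x1", "x2")), ("_tur_n", ("x2",))]
other = [("_gå_v", ("e1", "x1", "x2")), ("_bekk_n", ("x2",))]

walk_out = apply_rule(gaa_rule, walk)
other_out = apply_rule(gaa_rule, other)
creek_out = apply_rule(bekk_rule, other)
```

In this sketch the noun `_tur_n` is left untouched; transferring it would be the job of a separate lexical rule.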
The SEM-I (Semantic Interface) A documentation of the external semantic interface for a grammar, crucial for the writer of transfer rules. To ensure that the SEM-I is kept up to date, LOGON parsing reports failure if every parse contains at least one predicate not in the SEM-I.
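The failure condition is easy to state in code. The following is a minimal sketch, assuming each parse is reduced to the list of its predicate names and the SEM-I to a set of names; the function names and the sample predicates are hypothetical, and the real check operates on full MRSs.

```python
def unknown_predicates(parse, semi):
    """Return the predicates of one parse that are missing from the SEM-I."""
    return {pred for pred in parse if pred not in semi}


def parsing_fails(parses, semi):
    """Fail when every parse has at least one predicate not in the SEM-I.

    With this formulation an empty list of parses also counts as failure.
    """
    return all(unknown_predicates(parse, semi) for parse in parses)


# Hypothetical SEM-I fragment and parses.
semi = {"_hente_v", "_bil_n", "pron"}
parses_ok = [["_hente_v", "_bil_n", "_ukjent_a"], ["_hente_v", "_bil_n", "pron"]]
parses_bad = [["_hente_v", "_ukjent_a"], ["_ukjent_n", "pron"]]

ok_fails = parsing_fails(parses_ok, semi)
bad_fails = parsing_fails(parses_bad, semi)
```

One fully documented parse is enough for success, so `ok_fails` is false even though the first parse contains an undocumented predicate.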
A small section of the verb part of the NorGram SEM-I Size of the Norwegian SEM-I: slightly less than 6000 entries
Parse Selection Parsing, transfer and generation may each give many solutions, leading to a fanout tree: The outputs at each of the three stages are statistically ranked.
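The fanout can be pictured as three nested loops with ranking at each stage. This is a rough sketch, not the LOGON ranking method: the function names, the beam size, the toy stage functions, and the choice to sum per-stage scores as if they were log-probabilities are all assumptions.

```python
def top(candidates, n):
    """Keep the n best (item, score) pairs, highest score first."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:n]


def fanout(sentence, parse, transfer, generate, beam=2):
    """Expand parse, transfer and generation candidates and rank the leaves."""
    leaves = []
    for source_mrs, p in top(parse(sentence), beam):
        for target_mrs, t in top(transfer(source_mrs), beam):
            for output, g in top(generate(target_mrs), beam):
                # Assumption: per-stage scores are simply added.
                leaves.append((output, p + t + g))
    return sorted(leaves, key=lambda leaf: leaf[1], reverse=True)


# Toy stages returning (candidate, score) pairs; purely illustrative.
def toy_parse(sentence):
    return [("mrs_a", -0.5), ("mrs_b", -2.0)]


def toy_transfer(mrs):
    return [(mrs + "_en1", -0.25), (mrs + "_en2", -1.0)]


def toy_generate(mrs):
    return [("out:" + mrs, -0.125)]


ranked = fanout("Henter han bilen sin?", toy_parse, toy_transfer, toy_generate)
best = ranked[0]
```

With two parses, two transfer outputs each, and one realization each, the tree has four leaves, and `best` is the leaf reached through the top-ranked candidate at every stage.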
The Parsebanker Efficient treebank building by discriminants • Developer: Paul Meurer, Bergen • Predecessors in discriminant analysis: David Carter (1997); Stephan Oepen, Dan Flickinger et al. (2003) • Example of a four-way ambiguity: Det regnet 'It rained'/'It calculated'/'That one calculated'/'That rain'
[Figure: the four analyses of Det regnet, numbered 1 to 4]
Packed representations and discriminants (Paul Meurer)
Clicking on one discriminant is in this case sufficient to select a unique solution:
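Discriminant-based selection can be illustrated with the four readings of Det regnet. This is a sketch under stated assumptions, not the Parsebanker implementation: each analysis is reduced to a set of hand-written property strings, and both the property names and the function names are hypothetical. A discriminant is taken to be any property that holds for some analyses but not all.

```python
def discriminants(analyses):
    """Properties that distinguish analyses: present in some but not in all."""
    every_property = set().union(*analyses.values())
    shared_by_all = set.intersection(*analyses.values())
    return every_property - shared_by_all


def select(analyses, prop, accept=True):
    """Keep the analyses that have (or, with accept=False, lack) the property."""
    return {name: props for name, props in analyses.items()
            if (prop in props) == accept}


# Hypothetical property sets for the four readings.
analyses = {
    "It rained": {"regnet:V", "pred:rain", "det:expletive"},
    "It calculated": {"regnet:V", "pred:calculate", "det:pronoun"},
    "That one calculated": {"regnet:V", "pred:calculate", "det:demonstrative"},
    "That rain": {"regnet:N", "pred:rain", "det:determiner"},
}

choices = discriminants(analyses)
after_click = select(analyses, "det:expletive")
after_verb = select(analyses, "regnet:V")
```

Choosing the hypothetical discriminant `det:expletive` leaves a single analysis, whereas `regnet:V` only rules out the noun phrase reading and leaves three.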