This talk explores Minimal Recursion Semantics (MRS) and LKB generation as part of a larger research programme on modular generation, and discusses interactions with RMRS and the SEM-I, emphasizing data-driven techniques and application independence. The architecture requires that any well-formed input be accepted and that the output be idiomatic. The talk explains why MRS is well suited to generation, pointing to its flat structures, independence of syntax, lexicalised generation, and composition by accumulation of elementary predications, and it covers Robust MRS, underspecification, and chart generation with the LKB. Lexical lookup, instantiation of signs, rule application, chart generation proper, and root conditions are described, followed by generation failures caused by various MRS problems and corpus-based techniques for improving output quality. Finally, it examines constraining generation for idiomatic output, covering intersective modifier order, adjective ordering constraints, and the difficulty of encoding such preferences in a symbolic grammar.
Aims of this talk • Discuss MRS and LKB generation • Describe larger research programme: modular generation • Mention some interactions with other work in progress: • RMRS • SEM-I
Outline of talk • Towards modular generation • Why MRS? • MRS and chart generation • Data-driven techniques • SEM-I and documentation
Modular architecture: language-independent component → meaning representation → language-dependent realization → string or speech output
Desiderata for a portable realization module • Application independent • Any well-formed input should be accepted • No grammar-specific/conventional information should be essential in the input • Output should be idiomatic
Architecture (preview): External LF → (via SEM-I) → Internal LF → specialization modules → chart generator (with control modules) → String
Why MRS? • Flat structures • independence of syntax: conventional LFs partially mirror tree structure • manipulation of individual components: can ignore scope structure etc • lexicalised generation • composition by accumulation of EPs: robust composition • Underspecification
An excursion: Robust MRS • Deep Thought: integration of deep and shallow processing via compatible semantics • All components construct RMRSs • Principled way of building robustness into deep processing • Requirements for consistency etc help human users too
Extreme flattening of deep output [figure: scope tree for `every cat chases some dog', flattened into the RMRS below]
lb1:every_q(x), RSTR(lb1,h9), BODY(lb1,h6), lb2:cat_n(x),
lb4:some_q(y), RSTR(lb4,h8), BODY(lb4,h7), lb5:dog_n_1(y),
lb3:chase_v(e), ARG1(lb3,x), ARG2(lb3,y),
h9 qeq lb2, h8 qeq lb5
Extreme Underspecification • Factorize deep representation into minimal units • Only represent what you know • Robust MRS • Separate relations • Separate arguments • Explicit equalities • Conventions for predicate names and sense distinctions • Hierarchy of sorts on variables
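To make the factorization concrete, here is one way to hold the flattened RMRS from the previous slide in memory, with relations, argument links, and qeq constraints as separate record sets (a Python sketch; the field names are my own, not the DELPH-IN serialization formats):

    # Factorized RMRS for `every cat chases some dog': each minimal
    # unit (relation, argument, scope constraint) is a separate record.
    rmrs = {
        "eps":  [("lb1", "every_q", "x"), ("lb2", "cat_n", "x"),
                 ("lb3", "chase_v", "e"), ("lb4", "some_q", "y"),
                 ("lb5", "dog_n_1", "y")],
        "args": [("RSTR", "lb1", "h9"), ("BODY", "lb1", "h6"),
                 ("RSTR", "lb4", "h8"), ("BODY", "lb4", "h7"),
                 ("ARG1", "lb3", "x"), ("ARG2", "lb3", "y")],
        "qeq":  [("h9", "lb2"), ("h8", "lb5")],
    }

Because each unit is independent, a shallow processor can emit only the records it is sure of, and equalities between variables can be added separately.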
Chart generation with the LKB • Determine lexical signs from MRS • Determine possible rules contributing EPs (`construction semantics’: compound rule etc) • Instantiate signs (lexical and rule) according to variable equivalences • Apply lexical rules • Instantiate chart • Generate by parsing without string position • Check output against input
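A minimal sketch of the core loop, assuming a caller-supplied combine function (rule application by unification) and pre-instantiated lexical edges; the real LKB implementation is in Lisp and adds packing, filtering, and the final check of output against input:

    # Chart generation: parsing without string positions. Each edge
    # carries a bitmask over the input bag of EPs; edges combine only
    # if their masks are disjoint, so no EP is realized twice.
    def chart_generate(input_eps, lexical_edges, combine):
        full = (1 << len(input_eps)) - 1
        chart = list(lexical_edges)          # edges are (sign, mask) pairs
        agenda = list(lexical_edges)
        results = [s for s, m in lexical_edges if m == full]
        while agenda:
            sign, mask = agenda.pop()
            for other, omask in list(chart):
                if mask & omask:             # overlapping EPs: skip
                    continue
                for new in (combine(sign, other), combine(other, sign)):
                    if new is None:
                        continue
                    edge = (new, mask | omask)
                    chart.append(edge)
                    agenda.append(edge)
                    if mask | omask == full: # all input EPs consumed
                        results.append(new)
        return results

The full-coverage test is the EP-consumption part of the root conditions discussed below; a fuller version would also check scope compatibility.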
Lexical lookup for generation • _like_v_1(e,x,y) – returns the lexical entry for sense 1 of the verb like • temp_loc_rel(e,x,y) – returns multiple lexical entries • multiple relations in one lexical entry: e.g., who, where • entries with null semantics: handled by heuristics
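In outline, lookup is a map from predicate names to candidate lexical entries (a toy table for the cases above; real lookup is driven by the grammar's lexicon, and the entry names here are invented):

    LEXICON = {
        "_like_v_1": ["like_v1"],                  # one sense, one entry
        "temp_loc_rel": ["in_p", "on_p", "at_p"],  # several candidates
    }

    def lexical_lookup(predicate):
        # Entries contributing several EPs (e.g. `where') must later be
        # checked against the whole input bag, and null-semantics entries
        # (e.g. infinitival `to') are added by heuristics.
        return LEXICON.get(predicate, [])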
Instantiation of entries • _like_v_1(e,x,y) & named(x,”Kim”) & named(y,”Sandy”) • find locations corresponding to `x’s in all FSs • replace all `x’s with constant • repeat for `y’s etc • Also for rules contributing construction semantics • `Skolemization’ (misleading name ...)
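A sketch of the instantiation step, modelling feature structures as plain nested dicts (the LKB uses typed feature structures):

    # `Skolemization': replace every occurrence of an input variable in
    # a sign's feature structure with a fixed constant, so that signs
    # can only combine in ways consistent with the variable equalities.
    def instantiate(fs, bindings):
        if isinstance(fs, dict):
            return {f: instantiate(v, bindings) for f, v in fs.items()}
        return bindings.get(fs, fs)     # variable 'x' becomes 'SK-x'

    entry = {"ARG1": "x", "ARG2": "y"}
    print(instantiate(entry, {"x": "SK-x", "y": "SK-y"}))
    # {'ARG1': 'SK-x', 'ARG2': 'SK-y'}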
Lexical rule application • Lexical rules that contribute EPs only used if EP is in input • Inflectional rules will only apply if variable has the correct sort • Lexical rule application does morphological generation (e.g., liked, bought)
Chart generation proper • Possible lexical signs added to a chart structure • Currently no indexing of chart edges • chart generation can use semantic indices, but current results suggest this doesn’t help • Rules applied as for chart parsing: edges checked for compatibility with input semantics (bag of EPs)
Root conditions • Complete structures must consume all the EPs in the input MRS • Should check for compatibility of scopes • precise qeq matching is (probably) too strict • exactly same scopes is (probably) unrealistic and too slow
Generation failures due to MRS issues • Well-formedness check prior to input to generator (optional) • Lexical lookup failure: predicate doesn’t match entry, wrong arity, wrong variable types • Unwanted instantiations of variables • Missing EPs in input: syntax (e.g., no noun), lexical selection • Too many EPs in input: e.g., two verbs and no coordination
Improving generation via corpus-based techniques • CONTROL: e.g. intersective modifier order: • Logical representation does not determine order • wet(x) & weather(x) & cold(x) • UNDERSPECIFIED INPUT: e.g., • Determiners: none/a/the • Prepositions: in/on/at
Constraining generation for idiomatic output • Intersective modifier order: e.g., adjectives, prepositional phrases • Logical representation does not determine order • wet(x) & weather(x) & cold(x)
Adjective ordering • Constraints / preferences • big red car • * red big car • cold wet weather • wet cold weather (OK, but dispreferred) • Difficult to encode in symbolic grammar
Corpus-derived adjective ordering • n-grams perform poorly • Thater: direct evidence plus clustering • positional probability • Malouf (2000): memory-based learning plus positional probability: 92% on BNC
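A toy version of positional-probability ordering in the spirit of the cited work; the probabilities below are invented, not BNC estimates:

    from itertools import permutations

    # P(adjective precedes | it occurs in a pre-nominal adjective pair)
    POSITIONAL = {"big": 0.9, "red": 0.2, "cold": 0.7, "wet": 0.4}

    def best_order(adjectives):
        # Score each permutation: earlier adjectives should be ones that
        # tend to precede; the last one should tend to follow.
        def score(order):
            s = 1.0
            for i, a in enumerate(order):
                p = POSITIONAL.get(a, 0.5)
                s *= p if i < len(order) - 1 else (1 - p)
            return s
        return max(permutations(adjectives), key=score)

    print(best_order(["red", "big"]))    # ('big', 'red')
    print(best_order(["wet", "cold"]))   # ('cold', 'wet')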
Underspecified input to generation: We bought a car on Friday
Accept: pron(x) & a_quant(y,h1,h2) & car(y) & buy(e_past,x,y) & on(e,z) & named(z,Friday)
and: pron(x) & general_q(y,h1,h2) & car(y) & buy(e_past,x,y) & temp_loc(e,z) & named(z,Friday)
And maybe: pron(x_1pl) & car(y) & buy(e_past,x,y) & temp_loc(e,z) & named(z,Friday)
Guess the determiner • We went climbing in _ Andes • _ president of _ United States • I tore _ pyjamas • I tore _ duvet • George doesn’t like _ vegetables • We bought _ new car yesterday
Determining determiners • Determiners are partly conventionalized, often predictable from local context • Translation from Japanese etc, speech prosthesis application • More `meaning-rich’ determiners assumed to be specified in the input • Minnen et al: 85% on WSJ (using TiMBL)
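A k-nearest-neighbour sketch of memory-based determiner guessing in the TiMBL spirit; the features, training pairs, and overlap metric are all invented for illustration:

    from collections import Counter

    TRAIN = [  # ((previous word, head noun, number), determiner)
        (("in", "Andes", "pl"), "the"),
        (("of", "States", "pl"), "the"),
        (("tore", "pyjamas", "pl"), "my"),
        (("bought", "car", "sg"), "a"),
        (("like", "vegetables", "pl"), ""),   # bare plural
    ]

    def guess_determiner(features, k=3):
        def overlap(a, b):
            return sum(x == y for x, y in zip(a, b))
        nearest = sorted(TRAIN, key=lambda ex: -overlap(ex[0], features))[:k]
        votes = Counter()
        for feats, det in nearest:
            votes[det] += overlap(feats, features)  # distance-weighted vote
        return votes.most_common(1)[0][0]

    print(guess_determiner(("bought", "car", "sg")))   # 'a'
    print(guess_determiner(("tore", "duvet", "sg")))   # 'my'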
Preposition guessing • Choice between temporal in/on/at • in the morning • in July • on Wednesday • on Wednesday morning • at three o’clock • at New Year • ERG uses hand-coded rules and lexical categories • Machine learning approach gives very high precision and recall on WSJ, good results on balanced corpus (Lin Mei, 2004, Cambridge MPhil thesis)
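The hand-coded flavour of the ERG's rules can be caricatured in a few lines; a learned classifier, as in the cited thesis, would replace these tests with features over the whole NP:

    DAYS = {"Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"}

    def guess_temporal_preposition(np):
        # Toy rules for temporal in/on/at; the word lists are illustrative.
        words = np.split()
        if any(w.capitalize() in DAYS for w in words):
            return "on"                    # on Wednesday (morning)
        head = words[-1].lower()
        if head in {"morning", "afternoon", "evening", "july", "winter"}:
            return "in"                    # in the morning, in July
        if head in {"o'clock", "noon", "midnight", "year"}:
            return "at"                    # at three o'clock, at New Year
        return "on"                        # fallback guess

    print(guess_temporal_preposition("Wednesday morning"))   # on
    print(guess_temporal_preposition("three o'clock"))       # at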
SEM-I: semantic interface • Meta-level: manually specified `grammar’ relations (constructions and closed-class) • Object-level: linked to lexical database for deep grammars • Definitional: e.g. lemma+POS+sense • Linked test suites, examples, documentation
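The rough shape of an object-level SEM-I entry (the field names here are assumptions for the sketch, not the actual SEM-I file format):

    # Definitional entry: lemma + POS + sense, argument sorts, linked
    # example; a real SEM-I also links test suites and documentation.
    SEMI_ENTRY = {
        "predicate": "_like_v_1",
        "arity": 3,
        "args": {"ARG0": "e", "ARG1": "x", "ARG2": "x"},  # variable sorts
        "example": "Kim likes Sandy.",
    }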
SEM-I development • SEM-I eventually forms the `API’: stable, changes negotiated. • SEM-I vs Verbmobil SEMDB • Technical limitations of SEMDB • Too painful! • `Munging’ rules: external vs internal • SEM-I development must be incremental
Role of SEM-I in architecture • Offline • Definition of `correct’ (R)MRS for developers • Documentation • Checking of test-suites • Online • In unifier/selector: reject invalid RMRSs • Patching up input to generation
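The online role amounts to checking each EP of the generator input against the SEM-I; a sketch using the entry shape above (field names assumed), covering the lexical-lookup failure modes listed earlier:

    def validate_eps(eps, semi):
        # eps: (predicate, args) pairs; semi: predicate -> SEM-I entry.
        # Returns a list of problems; empty means the input looks valid.
        problems = []
        for pred, args in eps:
            entry = semi.get(pred)
            if entry is None:
                problems.append("unknown predicate: " + pred)
            elif len(args) != entry["arity"]:
                problems.append(pred + ": wrong arity")
        return problems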
Goal: semi-automated documentation [diagram] • [incr tsdb()] and semantic test-suite • Lex DB • ERG • documentation strings feed the object-level SEM-I • examples auto-generated (semi-automatic) • documentation examples autogenerated on demand • meta-level SEM-I: autogenerated appendix
Robust generation • SEM-I an important preliminary • check whether generator input is semantically compatible with grammars • Eventually: hierarchy of relations outside grammars, allowing underspecification • `fill-in’ of underspecified RMRS • exploit work on determiner guessing etc
Architecture (again): External LF → (via SEM-I) → Internal LF → specialization modules → chart generator (with control modules) → String
Interface • External representation • public, documented • reasonably stable • Internal representation • syntax/semantics interface • convenient for analysis • External/Internal conversion via SEM-I
Guaranteed generation? • Given a well-formed input MRS/RMRS, with elementary predications found in SEM-I (and dependencies) • Can we generate a string? with input fix up? negotiation? • Semantically bleached lexical items: which, one, piece, do, make • Defective paradigms, negative polarity, anti-collocations etc?
Next stages • SEM-I development • Documentation and test suite integration • Generation from RMRSs produced by shallower parser (or deep/shallow combination) • Partially fixed text in generation (cogeneration) • Further statistical modules: e.g., locational prepositions, other modifiers • More underspecification • Gradually increase flexibility of interface to generation