Generation in the Context of MT Final Report
The Team • Senior members & affiliate members • Jan Hajič, Charles Univ., Prague Drago Radev, Univ. of Michigan • Gerald Penn, Univ. of Toronto Jason Eisner, Johns Hopkins Univ. • Owen Rambow, Univ. of Pennsylvania • Dan Gildea, Univ. of Pennsylvania Bonnie Dorr, Univ. of Maryland • Students: • Yuan Ding, Univ. of Pennsylvania Martin Čmejrek, Charles Univ., Prague • Terry Koo, MIT Kristen Parton, Stanford Univ. • Jan Cuřín, Charles University Ivona Kučerová, Charles University • Pre-workshop work (Charles University): • Zdeněk Žabokrtský Petr Pajas • Václav Honetschläger Alena Böhmová • Vladislav Kuboň Jiří Havelka
The Goal • Generate English (linear surface form) • from syntactic-semantic sentence representation (so-called “tectogrammatical”, or TR) • Possible application setting: • machine translation • other uses: • Front-end for QA systems, summarization • Evaluate under various circumstances
Tectogrammatical Representation (tree diagram for the example sentence): "According to his opinion, UAL's executives were misinformed about the financing of the original transaction."
Tectogrammatical Representation (the same tree with lemmatized node labels): "According to he opinion UAL's executive were misinform about the financing of the original transaction."
TR in Machine Translation (tree diagram; includes a NULL node). Czech source sentence: "Vedení UAL bylo podle jeho názoru o financování původní transakce nesprávně informováno." (= "According to his opinion, UAL's management was misinformed about the financing of the original transaction.")
The MT Framework (pipeline diagram): Source language text (CZECH) → morphology/tagging → lemmatized, POS → "surface" syntax → deep syntax (tectogrammatics, TR) → TR trees → WS'02 transfer → TR trees → "surface" syntax → word order, punctuation → morphology (generation) → Target language text (ENGLISH).
The MT Framework (simplified diagram): Czech text ↔ AR trees ↔ TR trees, transfer at the TR level, then English TR trees ↔ AR trees ↔ English text.
Tools and Data Resources • Tools: • WS98 Czech parser + other Czech tools (tagger) • GIZA (WS99) + ISI decoder • Data: • PTB (40k sentences) • PTB translation to Czech (11k sentences) • Prague Dependency Treebank 1.0 (90k sentences) • Prague Dependency Treebank 2.0 preliminary • 15k sentences manually annotated • Monolingual data
The Evaluation Metric: BLEU • Plain English output (MT, Generation): • difficult and/or expensive to evaluate subjectively • BLEU (IBM): • automatic method, score 0..1 • relative scores track subjective human evaluation • needs several reference "gold standards" • n-gram-based metric with a penalty for overly short output • Different "local" evaluations throughout, too
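BLEU itself is IBM's metric; the sketch below (Python, illustrative only) shows the general shape of such a score: clipped n-gram precisions combined with a brevity penalty against one or more reference translations. The function names and the smoothing constant are assumptions, not the official implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Illustrative BLEU-style score: clipped n-gram precision + brevity penalty.
    candidate: list of tokens; references: list of token lists."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_prec_sum += math.log(max(clipped, 1e-9) / total)
    # Brevity penalty: penalize candidates shorter than the closest reference length.
    ref_len = min((len(r) for r in references), key=lambda l: abs(l - len(candidate)))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / max(len(candidate), 1))
    return bp * math.exp(log_prec_sum / max_n)

print(bleu("the cat is on the mat".split(),
           ["the cat sat on the mat".split()]))
```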
Presentation Outline • The Systems and Their Inputs • Getting the data & tools ready • The Statistical Generation System • The channel model • Word order, Punctuation, Morphology • The Hybrid Approach • Evaluation Results • Student Project Proposals • Conclusions and Future Directions
Where are we? (pipeline diagram: CZECH deep syntax → Transfer → English TR to AR → Word Order → Punctuation → Morphology → ENGLISH)
The Systems and Their Inputs Martin Čmejrek
WS02GMT • System 1: statistical • System 2: hybrid • Output: English linear surface form • Input 1: automatically created English TR • Input 2: manually created English TR • Input 3: improved automatic English TR (PropBank) • Input 4: Czenglish TR (simple translation)
Input 1: Automatic English TR Penn Treebank v. 3 + heads (Jason Eisner’s code + modifications) + lemmatization + word IDs + rule-based transformation to English AR, TR (by Kučerová & Žabokrtský) English TR (I1), size: 40k sentences
Input 2: Manual English TR Penn Treebank v. 3 Input 1 + manual annotation (correction) (IK) including: deep word order, conversion of grammatical codes English TR (I2), size: 1.5k sentences
Input 3: Enhanced Automatic English TR Penn Treebank v. 3 Input 1 + PropBank + additional sources English TR (I3): size: 40k sentences
Input 4: Automatic Czenglish TR Linear Surface Czech + Czech tagging & lemmatization + Parsed to Czech AR, Czech TR + [Simple] Transfer (Lemma translation) - lexical replacement dictionary collected from web, MRDs + trained on TR lemmas by GIZA “Czenglish” TR (I4): 11k sentences
Dictionary Filtering (data-flow diagram): four Czech/English dictionary sources (WinGED, GNU/FDL, PCTrans, EuroWordNet) are merged and pruned using Czech and English POS information and word frequencies from an English monolingual corpus (North American News Text, 365 M words); GIZA++ training on the Czech/English parallel Penn Treebank corpus then yields the Czech/English dictionary used for transfer. (Diagram legend: input data source, output data, tools.)
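A rough sketch of the merge-and-prune step under assumed data structures (dictionary entries keyed by Czech lemma and POS, English lemma frequencies from the monolingual corpus); the threshold and field names are illustrative, not the workshop tools.

```python
from collections import defaultdict

def merge_and_prune(dict_sources, eng_freq, min_freq=5):
    """dict_sources: list of dicts mapping (czech_lemma, czech_pos)
    -> list of (english_lemma, english_pos).
    eng_freq: English lemma -> count in the monolingual corpus.
    Keeps only translations that are frequent enough in English."""
    merged = defaultdict(set)
    for source in dict_sources:
        for key, translations in source.items():
            merged[key].update(translations)
    pruned = {}
    for key, translations in merged.items():
        kept = [(e, pos) for (e, pos) in translations if eng_freq.get(e, 0) >= min_freq]
        if kept:
            # Order candidates by how common they are in English text.
            pruned[key] = sorted(kept, key=lambda t: -eng_freq.get(t[0], 0))
    return pruned

# Toy usage
src_a = {("tečka", "N"): [("dot", "N"), ("full stop", "N")]}
src_b = {("tečka", "N"): [("spot", "N")]}
freqs = {"dot": 900, "spot": 1200, "full stop": 40}
print(merge_and_prune([src_a, src_b], freqs))
```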
Word-by-word translation of TR lemmas • Word-by-word dictionary: 42,835 entries, 65,408 translations • format: <e>tečka<t>N <tr>spot<trt>N<prob>0.353598 <tr>dot<trt>N<prob>0.28792 <tr>full @stop<trt>N<prob>0.28729 • 1-1 and 1-2 translations (2-1 translations not yet implemented) • packed-forest representation for multiple translation choices • simplified version: choose the single best translation
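A minimal sketch of the simplified "first best" choice: look the Czech TR lemma up in the probabilistic dictionary and keep only the highest-probability English translation (the packed-forest variant would keep the whole candidate list on the node instead). The function name and dictionary layout are assumptions; the entry mirrors the format shown above.

```python
def translate_lemma(czech_lemma, pos, dictionary):
    """dictionary: (czech_lemma, pos) -> list of (english_lemma, prob),
    e.g. ("tečka", "N") -> [("spot", 0.3536), ("dot", 0.2879), ("full stop", 0.2873)].
    Simplified version: return only the single most probable translation."""
    candidates = dictionary.get((czech_lemma, pos))
    if not candidates:
        return czech_lemma  # out-of-vocabulary: keep the Czech lemma
    return max(candidates, key=lambda c: c[1])[0]

dictionary = {("tečka", "N"): [("spot", 0.353598), ("dot", 0.28792), ("full stop", 0.28729)]}
print(translate_lemma("tečka", "N", dictionary))  # -> "spot"
```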
Where are we? (pipeline diagram, now with additional info: CZECH deep syntax → Transfer → English TR to AR → Word Order → Punctuation → Morphology → ENGLISH)
Automatically Annotating a Tectogrammatical Corpus Owen Rambow
Goal • Use PropBank annotations to • Improve automatic construction of English TRs • Allow generation from “generic” pred-arg structures
Types of Corpus Annotation • Surface Syntax • Deep Syntax • Local Lexical Semantics • Global Lexical Semantics • Hybrid: Deep Syntactic/Global Semantic = Tectogrammatical level used here
Surface Syntax, e.g., Penn Treebank (diagram): two different surface trees, loads(subj: John, obj: hay, prepobj: into → comp: trucks) for "John loads hay into trucks", and a distinct passive tree (with "is" and "by") for "Hay is loaded into trucks by John".
Deep Syntax, e.g., TAG (diagram): a single deep tree, load(subj: John, obj: hay, obj2: truck), covers both "John loads hay into trucks" and "Hay is loaded into trucks by John".
Local Semantics: Penn PropBank, brand new (diagram): load(arg0: John, arg1: hay, arg2: truck) covers both "John loads hay into trucks" and "John loads trucks with hay".
Global Semantics: LCS, U. Md. (diagram): load and throw both take agent: John, theme: hay, goal: truck, as in "John loads hay into trucks" and "John throws hay into trucks".
Tectogrammatical Representation • First two syntactic arguments of verb: deep-syntactic • All other arguments: global semantic • (Diagram: load(act: John, pat: trucks, acmp: hay) "John loads trucks with hay"; load(act: John, pat: hay, dir3: truck) "John loads hay into trucks"; throw(act: John, pat: hay, dir3: truck) "John throws hay into trucks".)
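A toy restatement of the labeling convention above, assuming PDT-style functor names (ACT, PAT, DIR3, ...) and an invented mapping from LCS-style roles to functors; it is meant only to make the "first two arguments deep-syntactic, the rest global-semantic" rule concrete.

```python
# Map LCS-style global-semantic roles to TR functors (illustrative subset).
SEMANTIC_TO_FUNCTOR = {"goal": "DIR3", "instrument": "MEANS", "accompaniment": "ACMP"}

def tr_functors(arguments):
    """arguments: verb arguments in deep-syntactic order, each with a
    global-semantic role, e.g. [("John", "agent"), ("hay", "theme"), ("truck", "goal")].
    First two arguments -> deep-syntactic ACT / PAT; the rest -> semantic functor."""
    labels = []
    for i, (word, sem_role) in enumerate(arguments):
        if i == 0:
            labels.append((word, "ACT"))
        elif i == 1:
            labels.append((word, "PAT"))
        else:
            labels.append((word, SEMANTIC_TO_FUNCTOR.get(sem_role, sem_role.upper())))
    return labels

print(tr_functors([("John", "agent"), ("hay", "theme"), ("truck", "goal")]))
# -> [('John', 'ACT'), ('hay', 'PAT'), ('truck', 'DIR3')]
```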
Why Use TR? Research Hypothesis: • Replacing function words by TR arc labels makes transfer easier • Choice of realization: target language-dependent • Deep-syntactic labels for first two arguments: realization more verb-specific • Global semantic labels on remaining arguments: realization just label-specific
Available Resources for Input 3 • Surface syntax: PTB corpus (hand, checked) • Deep syntax: derived automatically from PTB (Chen01) • Local semantics: PropBank corpus and frame lexicon (hand, checked) • Global semantics: LCS lexicon (partially hand, partially checked) • TR: PTB subset corpus (hand), PropBank TR dictionary (hand, not checked) (I. Kučerová)
Experiment: Machine Learning of TR Labels Using Ripper • Ripper (Cohen 1996) = greedy symbolic rule learner, set- and bag-valued features • Features: • Surface, deep syntactic info • Local, global semantic info • Kučerová’s PropBank TR dictionary (hand-crafted) • Input 1 (Automatic English TR)
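Ripper itself is a symbolic rule learner and is not sketched here; as a rough stand-in, the snippet below sets up the same kind of classification (symbolic syntactic and semantic features per node, the TR functor as the class) with a scikit-learn decision tree. Feature names and values are invented for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy data points: surface/deep-syntactic and semantic features per node,
# with the gold TR functor as the class label (values are illustrative).
X_dicts = [
    {"dep_label": "subj",    "propbank_arg": "arg0", "lcs_role": "agent"},
    {"dep_label": "obj",     "propbank_arg": "arg1", "lcs_role": "theme"},
    {"dep_label": "prepobj", "propbank_arg": "arg2", "lcs_role": "goal"},
    {"dep_label": "subj",    "propbank_arg": "arg0", "lcs_role": "agent"},
]
y = ["ACT", "PAT", "DIR3", "ACT"]

vec = DictVectorizer()                 # one-hot encode the symbolic features
X = vec.fit_transform(X_dicts)
clf = DecisionTreeClassifier().fit(X, y)

test = vec.transform([{"dep_label": "obj", "propbank_arg": "arg1", "lcs_role": "theme"}])
print(clf.predict(test))               # -> ['PAT']
```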
Results (TR Label Error Rates)

                          Semantics
Syntax                    none     local    local-global   all      PB TR dict
none                      58.8%    25.9%    23.7%          22.6%    37.7%
Input 1                   19.5%    17.7%    16.3%          15.9%    17.1%
surface-deep              16.5%    16.4%    17.1%          16.7%    16.2%
surface-deep-Inp1         15.5%    15.9%    16.2%          16.1%    14.4%

Averaged over 5-fold cross-validation (1326 data points).
Conclusions • Machine learning can improve on hand-written conversion rules (= Input 1) • PropBank is useful • Best results: • All syntactic features + PropBank TR dictionary • Future work: use PropBank LCS dictionary (developed during workshop)
Where are we? (pipeline diagram: CZECH deep syntax → Transfer → English TR to AR → Word Order → Punctuation → Morphology → ENGLISH)
The MAGENTA System • Statistically based • The pipeline: • TR to AR by a channel model • Word order by reordering on dep. trees • Punctuation insertion • Morphology
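A toy sketch of the four-stage pipeline as plain function composition; the stage bodies are trivial placeholders (identity or naive linearization), not the MAGENTA models, and the TR node representation is assumed.

```python
# Toy TR node: (lemma, functor, children). The stage bodies are trivial
# placeholders just to make the pipeline runnable end to end.
def tr_to_ar(node):
    """Channel-model stage: TR tree -> surface-syntactic (AR) tree (identity here)."""
    return node

def order_words(node):
    """Word-order stage: linearize the dependency tree (children first, then head)."""
    lemma, _functor, children = node
    out = []
    for child in children:
        out.extend(order_words(child))
    out.append(lemma)
    return out

def insert_punctuation(tokens):
    """Punctuation stage: here, just add a final period."""
    return tokens + ["."]

def inflect(tokens):
    """Morphology stage: generate surface forms from lemmas (identity here)."""
    return tokens

def generate(tr_tree):
    tokens = order_words(tr_to_ar(tr_tree))
    return " ".join(inflect(insert_punctuation(tokens)))

tree = ("kiss", "PRED", [("girl", "ACT", []), ("cat", "PAT", [])])
print(generate(tree))  # -> "girl cat kiss ."
```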
Where are we? (pipeline diagram: CZECH deep syntax → Transfer → English TR to AR → Word Order → Punctuation → Morphology → ENGLISH)
The Tree-to-Tree Transductions (Jason Eisner). (Diagram: a source tree with nodes a, b, c, d, e, f transduced into a target tree with nodes A, B, C+D, E, F, plus inserted prep and det nodes.)
Translating trees (diagram): learn 2:1 mappings such as "misinform" ↔ "inform wrongly", or read them from a dictionary; also 1:2, 2:0, and 0:1 mappings (e.g., inserted prep and det nodes) and rearrangements.
Statistical: need a model of tree pairs. Mainly interested in (TR, AR) pairs, but our techniques are quite general; e.g., the pair shown (diagram), trees for "the girl kissed her kitty cat" and "the girl gave a kiss to her cat", is not a (TR, AR) pair.
Training: our team has many tree pairs, which should be nicer to model than string pairs (that's why we built them!). What Czech trees went with what English trees in training? Learn the parameters of a joint model P(T1, T2). (Diagram: paired trees for "the girl gave a kiss to her cat" and "the girl kissed her kitty cat".)
Decoding: complete a tree pair. Training: given T1 and T2, find the model parameters to maximize P(T1, T2). Decoding: given T1, find T2 to maximize P(T1, T2). Horrible sparse-data problem: can't just do tree lookup. (Diagram: the tree for "the girl kissed her kitty cat" paired with an unknown target tree, "??".)
How should a model of tree pairs look? Joint model P(T1, T2). Wise to use the noisy-channel form P(T1 | T2) * P(T2), but any joint model will do. The channel model P(T1 | T2) is trained on paired trees and could also take advantage of English-Czech dictionaries; P(T2) could be trained on zillions of individual English AR trees.
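Restating the decoding objective in symbols (a hedged paraphrase of the slides, not a formula taken from the workshop report), with T1 the given source/TR tree and T2 the target/AR tree:

```latex
\hat{T}_2 = \arg\max_{T_2} P(T_1, T_2)
          = \arg\max_{T_2} \underbrace{P(T_1 \mid T_2)}_{\text{channel model (paired trees)}}
            \; \underbrace{P(T_2)}_{\text{tree language model (English AR trees)}}
```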
How should a model P(T1, T2) of tree pairs look? Intuition: some kind of correspondence between words. Try to learn the correspondence using EM alignment (could be seeded with a dictionary). (Diagrams: two candidate word alignments between the trees for "the girl kissed her kitty cat" and "the girl gave a kiss to her cat"; the second is a different, bad alignment!)
How should a model P(T1, T2) of tree pairs look? Intuition: some kind of correspondence between words; try to learn it using EM alignment (could be seeded with a dictionary). • So the model must consider the alignment: P(T1, T2, A) • Why A is complicated: • The correspondence isn't 1-to-1 (kiss ↔ "gave a kiss", "kitty cat" ↔ cat, nothing ↔ "to") • Also need to model word order (indeed, topology)
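Since the correspondence is a hidden variable, the joint tree model would marginalize over alignments, and EM would re-estimate parameters against the posterior over alignments. This is a generic EM formulation assumed for illustration, not the exact workshop model:

```latex
P(T_1, T_2) = \sum_{A} P(T_1, T_2, A),
\qquad
\theta^{(t+1)} = \arg\max_{\theta} \sum_{A} P(A \mid T_1, T_2; \theta^{(t)}) \, \log P(T_1, T_2, A; \theta)
```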
Solution: use the right grammar formalism. Grammars can assemble words or phrases into trees. Let's work up to the "right" formalism. (Recap: the model must consider the alignment P(T1, T2, A); the correspondence isn't 1-to-1, and we also need to model word order and topology.)
Context-Free Grammar (derivation diagram for "the girl kissed her cat"): S → NP VP, NP → Det N, VP → V NP, Det → the, N → girl, etc.
Augment CFG nonterminals with headwords (derivation diagram for "the girl kissed her cat"): S,kissed → NP,girl VP,kissed; NP,girl → Det,the N,girl; VP,kissed → V,kissed NP,cat; etc.
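A small sketch of the head-lexicalization step, assuming a toy tree encoding and an invented head table; it only illustrates percolating headwords up the tree to produce nonterminals like "VP,kissed", not the workshop grammar.

```python
# A toy parse: (nonterminal, children) where leaves are (POS, word).
tree = ("S", [
    ("NP", [("Det", "the"), ("N", "girl")]),
    ("VP", [("V", "kissed"),
            ("NP", [("Det", "her"), ("N", "cat")])]),
])

# Which child supplies the head for each nonterminal (illustrative head table;
# assumes distinct child labels per rule, which holds for this toy tree).
HEAD_CHILD = {"S": "VP", "NP": "N", "VP": "V"}

def lexicalize(node):
    """Return (lexicalized node, headword), with nonterminals augmented
    by their headwords, e.g. 'VP' -> 'VP,kissed'."""
    label, children = node
    if isinstance(children, str):            # preterminal: (POS, word)
        return (f"{label},{children}", children), children
    new_children, heads = [], {}
    for child in children:
        new_child, headword = lexicalize(child)
        new_children.append(new_child)
        heads[child[0]] = headword
    headword = heads[HEAD_CHILD[label]]
    return (f"{label},{headword}", new_children), headword

print(lexicalize(tree)[0][0])  # -> "S,kissed"
```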