This final report presents the study on generating English text from syntactic-semantic sentence representation, including the applications in machine translation and QA systems. The report evaluates the tectogrammatical representation under various circumstances and outlines the tools, data resources, and evaluation metrics used in the study. It also provides an overview of the statistical generation system, hybrid approach, evaluation results, and student project proposals.
Generation in the Context of MT Final Report
The Team • Senior members & affiliate members: • Jan Hajič, Charles Univ., Prague • Drago Radev, Univ. of Michigan • Gerald Penn, Univ. of Toronto • Jason Eisner, Johns Hopkins Univ. • Owen Rambow, Univ. of Pennsylvania • Dan Gildea, Univ. of Pennsylvania • Bonnie Dorr, Univ. of Maryland • Students: • Yuan Ding, Univ. of Pennsylvania • Martin Čmejrek, Charles Univ., Prague • Terry Koo, MIT • Kristen Parton, Stanford Univ. • Jan Cuřín, Charles University • Ivona Kučerová, Charles University • Pre-workshop work (Charles University): • Zdeněk Žabokrtský • Petr Pajas • Václav Honetschläger • Alena Böhmová • Vladislav Kuboň • Jiří Havelka
The Goal • Generate English (linear surface form) • from syntactic-semantic sentence representation (so-called “tectogrammatical”, or TR) • Possible application setting: • machine translation • other uses: • Front-end for QA systems, summarization • Evaluate under various circumstances
Tectogrammatical Representation • Surface sentence: According to his opinion UAL’s executives were misinformed about the financing of the original transaction
Tectogrammatical Representation • Same sentence with words reduced to lemmas: According to he opinion UAL’s executive were misinform about the financing of the original transaction
TR in Machine Translation Vedení UAL bylo podle jeho názoru o financování původní transakce nesprávně informováno. NULL
The MT Framework • WS’02 • TR trees • transfer • deep syntax to surface syntax (tectogrammatics, TR) • word order • punctuation • “surface” syntax • lemmatized, POS • lemmatized, POS • morphology (gen.) • morphology/tagging • Target language text: ENGLISH • Source language text: CZECH
The MT Framework • TR trees • AR trees • CZECH • ENGLISH
Tools and Data Resources • Tools: • WS98 Czech parser + other Czech tools (tagger) • GIZA (WS99) + ISI decoder • Data: • PTB (40k sentences) • PTB translation to Czech (11k sentences) • Prague Dependency Treebank 1.0 (90k sentences) • Prague Dependency Treebank 2.0 preliminary • 15k sentences manually annotated • Monolingual data
The Evaluation Metric: BLEU • Plain English output (MT, Generation): • difficult and/or expensive to evaluate subjectively • BLEU (IBM): • automatic method, score 0..1 • relative scores correlate with subjective human evaluation • needs several reference “gold standards” • n-gram-based metric w/ a penalty for short output length • Different “local” evaluations throughout, too
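The ingredients named on this slide (n-gram matching against several references, a length penalty, a score between 0 and 1) can be sketched as follows. This is a minimal, unsmoothed sentence-level illustration, assuming uniform weights over 1- to 4-grams and the closest reference length for the penalty; the function names `ngrams` and `bleu` are hypothetical and this is not the scoring script used at the workshop.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU-style score in [0, 1] (illustrative, unsmoothed)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        if not cand:
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / sum(cand.values())))
    # Length penalty: compare with the reference closest in length to the candidate.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    penalty = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return penalty * math.exp(sum(log_precisions) / max_n)

reference = "the girl kissed her kitty cat".split()
perfect = bleu(reference, [reference])
shorter = bleu("the girl kissed her cat".split(), [reference])
unrelated = bleu("a dog barked at him loudly".split(), [reference])
```

Here `perfect` is the score of an output identical to its reference, `shorter` drops one word, and `unrelated` shares no words with the reference.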
Presentation Outline • The Systems and Their Inputs • Getting the data & tools ready • The Statistical Generation System • The channel model • Word order, Punctuation, Morphology • The Hybrid Approach • Evaluation Results • Student Project Proposals • Conclusions and Future Directions
Where are we? Transfer English TR to AR Deep syntax (Czech) Word Order Punctuation Morphology CZECH ENGLISH
The Systems and Their Inputs Martin Čmejrek
WS02GMT • System 1: statistical • System 2: hybrid • Output: English linear surface form • Input 1: automatically created English TR • Input 2: manually created English TR • Input 3: improved automatic English TR (PropBank) • Input 4: Czenglish TR (simple translation)
Input 1: Automatic English TR Penn Treebank v. 3 + heads (Jason Eisner’s code + modifications) + lemmatization + word IDs + rule-based transformation to English AR, TR (by Kučerová & Žabokrtský) English TR (I1), size: 40k sentences
Input 2: Manual English TR Penn Treebank v. 3 Input 1 + manual annotation (correction) (IK) including: deep word order, conversion of grammatical codes English TR (I2), size: 1.5k sentences
Input 3: Enhanced Automatic English TR Penn Treebank v. 3 Input 1 + PropBank + additional sources English TR (I3), size: 40k sentences
Input 4: Automatic Czenglish TR Linear Surface Czech + Czech tagging & lemmatization + Parsed to Czech AR, Czech TR + [Simple] Transfer (Lemma translation) - lexical replacement dictionary collected from web, MRDs + trained on TR lemmas by GIZA “Czenglish” TR (I4): 11k sentences
Dictionary Filtering • Input data sources: 4 Czech/English dictionary sources (WinGED, GNU/FDL, PCTrans, EuroWordNet); Czech/English parallel Penn TreeBank corpus; frequencies on English monolingual corpus (North American News Text, 365 M words) • Tools: merging, pruning (Czech POS, English POS); GIZA++ training • Output data: Czech/English dictionary for transfer
Word-by-word translation of TR lemmas • Word-by-word dictionary: 42,835 entries, 65,408 translations • format: <e>tečka<t>N <tr>spot<trt>N<prob>0.353598 <tr>dot<trt>N<prob>0.28792 <tr>full @stop<trt>N<prob>0.28729 • 1-1, 1-2 translations (2-1 translations not yet implemented) • packed forest representation for multiple translation choice • simplified version – choose the first best
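Reading one entry in the format shown on this slide and taking the "first best" translation could look like the sketch below. The tag layout is copied from the slide's sample entry; the regular expressions and the function names `parse_entry` and `first_best` are assumptions made for illustration, not the workshop's tools, and "first best" is interpreted here as the highest-probability translation.

```python
import re

# Patterns inferred from the single sample entry on the slide (an assumption).
ENTRY_RE = re.compile(r"<e>(?P<lemma>.*?)<t>(?P<pos>\S+)")
TRANS_RE = re.compile(r"<tr>(?P<lemma>.*?)<trt>(?P<pos>\S+?)<prob>(?P<prob>[0-9.]+)")

def parse_entry(line):
    """Return (czech_lemma, czech_pos, [(english_lemma, english_pos, prob), ...])."""
    head = ENTRY_RE.search(line)
    if head is None:
        raise ValueError("not a dictionary entry: " + line)
    translations = [
        (m["lemma"], m["pos"], float(m["prob"]))
        for m in TRANS_RE.finditer(line)
    ]
    return head["lemma"], head["pos"], translations

def first_best(translations):
    """Simplified transfer: keep only the most probable translation."""
    return max(translations, key=lambda item: item[2])

SAMPLE = ("<e>tečka<t>N <tr>spot<trt>N<prob>0.353598 "
          "<tr>dot<trt>N<prob>0.28792 <tr>full @stop<trt>N<prob>0.28729")

lemma, pos, translations = parse_entry(SAMPLE)
best = first_best(translations)
```

For the sample entry this selects `spot`, the translation with the highest listed probability. A packed-forest transfer would keep all three choices instead of discarding two of them.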
Where are we? w/additional info Transfer English TR to AR Deep syntax (Czech) Word Order Punctuation Morphology CZECH ENGLISH
Automatically Annotating a Tectogrammatical Corpus Owen Rambow
Goal • Use PropBank annotations to • Improve automatic construction of English TRs • Allow generation from “generic” pred-arg structures
Types of Corpus Annotation • Surface Syntax • Deep Syntax • Local Lexical Semantics • Global Lexical Semantics • Hybrid: Deep Syntactic/Global Semantic = Tectogrammatical level used here
Surface Syntax (e.g., Penn Treebank) • “John loads hay into trucks”: head loads; arc labels subj, obj, prepobj, comp; words John, hay, into, trucks • “Hay is loaded into trucks by John”: head loaded; arc labels subj, prepobj, prepobj, comp, comp; words hay, is, into, trucks, by, John
Deep Syntax (e.g., TAG) • One tree for both sentences: head load; arc labels subj, obj, obj2; dependents John, hay, truck • “John loads hay into trucks” • “Hay is loaded into trucks by John”
Local Semantics: Penn PropBank (brand new) • One tree for both sentences: head load; arc labels arg0, arg1, arg2; dependents John, hay, truck • “John loads hay into trucks” • “John loads trucks with hay”
Global Semantics: LCS (U. Md.) • “John loads hay into trucks”: head load; arc labels agent, theme, goal; dependents John, hay, truck • “John throws hay into trucks”: head throw; arc labels agent, theme, goal; dependents John, hay, truck
Tectogrammatical Representation • First two syntactic arguments of verb: deep-syntactic • All other arguments: global semantic • “John loads trucks with hay”: head load; arc labels act, pat, acmp • “John loads hay into trucks”: head load; arc labels act, pat, dir3 • “John throws hay into trucks”: head throw; arc labels act, pat, dir3 • Dependents in each tree: John, hay, truck
Why Use TR? Research Hypothesis: • Replacing function words by TR arc labels makes transfer easier • Choice of realization: target language-dependent • Deep-syntactic labels for first two arguments: realization more verb-specific • Global semantic labels on remaining arguments: realization just label-specific
Available Resources for Input 3 • Surface syntax: PTB corpus (hand, checked) • Deep syntax: derived automatically from PTB (Chen01) • Local semantics: PropBank corpus and frame lexicon (hand, checked) • Global semantics: LCS lexicon (partially hand, partially checked) • TR: PTB subset corpus (hand), PropBank TR dictionary (hand, not checked) (I. Kučerová)
Experiment: Machine Learning of TR Labels Using Ripper • Ripper (Cohen 1996) = greedy symbolic rule learner, set- and bag-valued features • Features: • Surface, deep syntactic info • Local, global semantic info • Kučerová’s PropBank TR dictionary (hand-crafted) • Input 1 (Automatic English TR)
Results (TR Label Error Rates)
Rows: syntactic features; columns: semantic features.

Syntax \ Semantics    none    local   local-global   all     PB TR dict
none                  58.8%   25.9%   23.7%          22.6%   37.7%
Input 1               19.5%   17.7%   16.3%          15.9%   17.1%
surface-deep          16.5%   16.4%   17.1%          16.7%   16.2%
surface-deep-Inp1     15.5%   15.9%   16.2%          16.1%   14.4%

Average error rate over 5-fold cross-validation (1326 data points)
Conclusions • Machine learning can improve on hand-written conversion rules (= Input 1) • PropBank is useful • Best results: • All syntactic features + PropBank TR dictionary • Future work: use PropBank LCS dictionary (developed during workshop)
Where are we? English TR to AR Word Order Punctuation Morphology Transfer Deep syntax (Czech) CZECH ENGLISH
The MAGENTA System • Statistically based • The pipeline: • TR to AR by a channel model • Word order by reordering on dep. trees • Punctuation insertion • Morphology
Where are we? Word Order Punctuation Morphology Transfer English TR to AR Deep syntax (Czech) CZECH ENGLISH
The Tree-to-Tree Transductions Jason Eisner [Figure: two aligned trees, one with nodes a, b, c, d, e, f and one with nodes A, B, C+D, E, F; arcs labeled prep and det]
Translating trees • misinform vs. inform wrongly: learn this 2:1 mapping (or in dictionary) • Also 1:2, 2:0, etc., & rearrangements ... • 0:1 mapping [Figure: two aligned trees, one with nodes a, b, c, d, e, f and one with nodes A, B, C+D, E, F; arcs labeled prep and det]
Translating trees [Figure: two aligned trees, one with nodes a, b, c, d, e, f and one with nodes A, B, C+D, E, F; arcs labeled prep and det]
Statistical: Need a model of tree pairs • Mainly interested in (TR,AR) pairs • But our techniques are quite general • E.g., example below is not a (TR,AR) pair: “the girl kissed her kitty cat” and “the girl gave a kiss to her cat” [Figure: dependency tree with nodes Pred,kissed; Subj,girl; Obj,cat; Det,the; Det,her; kitty, beside phrase-structure tree with nodes S,gave; NP,girl; NP,kiss; PP,to; NP,cat; Det,the; Det,a; Det,her]
Training: Our team has many tree pairs • Should be nicer to model than string pairs - why we built them! • What Czech trees went with what English trees in training? ... • Learn parameters of a joint model P(T1,T2). • “the girl kissed her kitty cat” and “the girl gave a kiss to her cat” [Figure: dependency tree with nodes Pred,kissed; Subj,girl; Obj,cat; Det,the; Det,her; kitty, beside phrase-structure tree with nodes S,gave; NP,girl; NP,kiss; PP,to; NP,cat; Det,the; Det,a; Det,her]
Decoding: Complete a tree pair • Training: given T1 and T2, find parameters to maximize P(T1,T2) • Decoding: given T1 and the parameters, find T2 to maximize P(T1,T2) • Horrible sparse data problem - can’t just do tree lookup. • “the girl kissed her kitty cat” paired with “??” [Figure: dependency tree with nodes Pred,kissed; Subj,girl; Obj,cat; Det,the; Det,her; kitty; the second tree is unknown]
How should a model of tree pairs look? • Joint model P(T1,T2). • Wise to use noisy-channel form: P(T1 | T2) * P(T2) • P(T1 | T2): train on paired trees; could also take advantage of English-Czech dictionaries • P(T2): could be trained on zillions of individual English AR trees • But any joint model will do.
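The noisy-channel form above amounts to scoring each candidate T2 by P(T1 | T2) * P(T2) and keeping the best one. The sketch below shows only that selection step. The bracketed strings standing in for trees, the two lookup tables, and all probability values are invented for illustration; a real system would compute both factors from trained models and would search a far larger candidate space.

```python
import math

def decode(t1, candidates, channel, prior):
    """Pick the candidate T2 maximizing log P(T1 | T2) + log P(T2)."""
    def score(t2):
        p_channel = channel.get((t1, t2), 0.0)   # P(T1 | T2), toy lookup table
        p_prior = prior.get(t2, 0.0)             # P(T2), toy lookup table
        if p_channel == 0.0 or p_prior == 0.0:
            return float("-inf")
        return math.log(p_channel) + math.log(p_prior)
    return max(candidates, key=score)

# Trees are written as bracketed strings purely for readability (hypothetical encoding).
T1 = "kissed(girl(the), cat(her, kitty))"
T2_A = "gave(girl(the), kiss(a), to(cat(her)))"
T2_B = "kissed(girl(the), cat(her))"

# Made-up numbers: the channel slightly prefers T2_B, the prior strongly prefers T2_A.
CHANNEL = {(T1, T2_A): 0.2, (T1, T2_B): 0.3}
PRIOR = {T2_A: 0.05, T2_B: 0.01}

best = decode(T1, [T2_A, T2_B], CHANNEL, PRIOR)
```

With these toy numbers the prior outweighs the channel, so `best` is `T2_A`. The split lets each factor be trained on the data source that suits it.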
How should a model P(T1,T2) of tree pairs look? • Intuition: some kind of correspondence between words. • Try to learn correspondence using EM alignment (could seed with a dictionary). • “the girl kissed her kitty cat” and “the girl gave a kiss to her cat” [Figure: the two trees with one candidate alignment between their nodes]
How should a model P(T1,T2) of tree pairs look? • Intuition: some kind of correspondence between words. • Try to learn correspondence using EM alignment (could seed with a dictionary). • “the girl kissed her kitty cat” and “the girl gave a kiss to her cat” • different, bad alignment! [Figure: the two trees with a different, bad alignment between their nodes]
How should a model P(T1,T2) of tree pairs look? • Intuition: some kind of correspondence between words. • Try to learn correspondence using EM alignment (could seed with a dictionary). • So model must consider alignment: P(T1,T2,A) • Why A is complicated: • The correspondence isn’t 1 to 1 • Also need to model word order (indeed topology) • Correspondences between “the girl kissed her kitty cat” and “the girl gave a kiss to her cat”: • kiss ↔ gave a kiss • kitty cat ↔ cat • (no counterpart) ↔ to
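To make "learn correspondence using EM" concrete, here is a deliberately simplified sketch. It swaps in a flat word-to-word model in the style of IBM Model 1 over bags of lemmas, so it ignores tree topology and the many-to-one cases listed above; it is not the tree-alignment model the slides are building toward. The three-pair Czech/English lemma corpus and the name `train_lexical_model` are invented for illustration.

```python
from collections import defaultdict

def train_lexical_model(pairs, iterations=10):
    """EM for t[(src, tgt)], an estimate of P(src lemma | tgt lemma)."""
    src_vocab = {s for src, _ in pairs for s in src}
    t = defaultdict(lambda: 1.0 / len(src_vocab))   # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: split each source lemma's unit of mass over the target lemmas.
        for src, tgt in pairs:
            for s in src:
                norm = sum(t[(s, g)] for g in tgt)
                for g in tgt:
                    frac = t[(s, g)] / norm
                    count[(s, g)] += frac
                    total[g] += frac
        # M-step: renormalize the expected counts per target lemma.
        t = defaultdict(float, {(s, g): c / total[g] for (s, g), c in count.items()})
    return t

# Toy lemma pairs (invented): velký = big, malý = small, dům = house, kniha = book.
PAIRS = [
    (["velký", "dům"], ["big", "house"]),
    (["velký", "kniha"], ["big", "book"]),
    (["malý", "kniha"], ["small", "book"]),
]

t = train_lexical_model(PAIRS)
```

After a few iterations `t[("kniha", "book")]` exceeds `t[("kniha", "big")]`, because "kniha" and "book" co-occur in both pairs where either appears. Seeding with a dictionary, as the slide suggests, would replace the uniform start with dictionary-derived values.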
Solution: Use the right grammar formalism • Grammars can assemble words or phrases into trees. • Let’s work up to the “right” formalism. • Model must consider alignment: P(T1,T2,A) • Why A is complicated: • The correspondence isn’t 1 to 1 • Also need to model word order (indeed topology) • Correspondences between “the girl kissed her kitty cat” and “the girl gave a kiss to her cat”: • kiss ↔ gave a kiss • kitty cat ↔ cat • (no counterpart) ↔ to
Context-Free Grammar • “the girl kissed her cat” • Rules: S → NP VP • NP → Det N • VP → V NP • Det → the • N → girl • etc. [Figure: derivation starting from S, with NP expanded to Det N]
Augment CFG nonterminals with headwords • “the girl kissed her cat” • Rules: S,kissed → NP,girl VP,kissed • NP,girl → Det,the N,girl • VP,kissed → V,kissed NP,cat • Det,the → the • N,girl → girl • etc. [Figure: derivation starting from S,kissed, with NP,girl expanded to Det,the N,girl]
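Augmenting nonterminals with headwords can be done by passing a head up from one designated child of each node. The sketch below does this for the slide's example sentence. The head table `HEAD_CHILD` (S takes its head from VP, VP from V, NP from N) is an assumption chosen to reproduce the labels on the slide, and the tuple encoding of trees is hypothetical.

```python
# Assumed head table: which child category passes its headword up to the parent.
HEAD_CHILD = {"S": "VP", "VP": "V", "NP": "N"}

def lexicalize(tree):
    """Return (tree with 'Label,headword' labels, headword of the root).

    A tree is (label, word) for a preterminal or (label, [children]) otherwise.
    """
    label, rest = tree
    if isinstance(rest, str):                 # preterminal: its word is its head
        return (f"{label},{rest}", rest), rest
    results = [lexicalize(child) for child in rest]
    # Find the child whose category is the designated head of this label.
    head_index = next(i for i, child in enumerate(rest) if child[0] == HEAD_CHILD[label])
    head = results[head_index][1]
    return (f"{label},{head}", [new_child for new_child, _ in results]), head

# "the girl kissed her cat"
TREE = ("S", [
    ("NP", [("Det", "the"), ("N", "girl")]),
    ("VP", [("V", "kissed"),
            ("NP", [("Det", "her"), ("N", "cat")])]),
])

lexicalized, root_head = lexicalize(TREE)
```

The result carries the labels S,kissed, NP,girl, VP,kissed, and NP,cat, matching the slide. Each local rule then mentions specific words, which is what lets a statistical model condition on them.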