Sense Tagging the Penn TreeBank Martha Palmer, CIS and IRCS, University of Pennsylvania Johns Hopkins University, April 18, 2000
Penn approach • Relies on lexically based linguistic analysis • Humans annotate naturally occurring text • (hand correct output of automatic parsers, e.g. Fidditch, XTAG) • Train statistical POStaggers, parsers, etc. • Common thread is predicate-argument structure Hypothesis: more linguistically sophisticated analyzers → more accurate output
Past results • XTAG project http://www.cis.upenn.edu/~xtag/ • Penn TreeBank http://www.cis.upenn.edu/~treebank/ • Enabled the development of tools: POStaggers, parsers, co-reference, etc. http://www.ircs.upenn.edu/knowledge/licensing.html
Under TIDES: • Annotations enriched with semantics and pragmatics • Provide companion lexicons for annotated corpora • Extend our coverage to other languages (Chinese, Korean) Hypothesis: Parallel annotated corpora/lexicons will enable rapid ramp-up of MT
Outline • Lexical Choice in Machine Translation • Constraints on choices - word level • Introduce verb classes – VerbNet • Constraints on choices - class level • Sense tagging data • adding semantics to the Penn TreeBank • VerbNet as a companion lexicon • Korean and Chinese TreeBanks
Machine Translation Lexical Choice- Word Sense Disambiguation • Iraq lost the battle. • Ilakuka centwey ciessta. • [Iraq ] [battle] [lost]. • John lost his computer. • John-i computer-lul ilepelyessta. • [John] [computer] [misplaced].
Cross-linguistic Information Retrieval-sense ambiguities • English • speed of light > *plea bargaining, • (speedier trials, lighter sentences) • Multilingual: French => English • saisi stupefiant > drug seizure, *grip narcotic, *understand stupefying,...
Predicate-argument structures for lose • lose1 (Agent: animate, Patient: physical-object) • lose2 (Agent: animate, Patient: competition) • Agent <=> SUBJ • Patient <=> OBJ • ACL81, ACL85, ACL86, MT90, CUP90, AIJ93
Word sense disambiguation with Source Language Semantic Class Constraints (co-occurrence patterns + backoff) • lose1 (Agent, Patient: physical-object) <=> ilepelyessta • lose2 (Agent, Patient: competition) <=> ciessta
Word sense disambiguation with Target Language Semantic Constraints • receive <=> {patassta, swusinhayssta} • patassta (Recipient, Patient: physical-object) • swusinhayssta (Recipient, Patient: communication) • TAG+94, CSLI99, AMTA94
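The same mechanism drives both the source- and target-side constraints: match the semantic class of the Patient against each sense's selectional restriction. A minimal Python sketch, using the lose and receive inventories and Korean glosses from the slides above (the flat ISA table stands in for a real ontology, and the feature encoding is illustrative):

```python
# Sense selection via a semantic-class constraint on the Patient argument.
# Sense inventories and Korean glosses follow the slides; the flat ISA
# table below is a stand-in for a real ontology such as WordNet.
LEXICON = {
    "lose": [
        {"patient": "physical-object", "korean": "ilepelyessta"},  # misplace
        {"patient": "competition",     "korean": "ciessta"},       # be defeated
    ],
    "receive": [
        {"patient": "physical-object", "korean": "patassta"},
        {"patient": "communication",   "korean": "swusinhayssta"},
    ],
}

ISA = {"computer": "physical-object", "battle": "competition",
       "letter": "communication"}

def translate_verb(verb: str, obj_noun: str) -> str:
    """Pick the target verb whose Patient restriction matches the object."""
    obj_class = ISA.get(obj_noun)
    for sense in LEXICON[verb]:
        if sense["patient"] == obj_class:
            return sense["korean"]
    raise LookupError(f"no sense of {verb!r} selects a {obj_class!r} Patient")

print(translate_verb("lose", "battle"))    # ciessta (Iraq lost the battle)
print(translate_verb("lose", "computer"))  # ilepelyessta (John lost his computer)
```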
Lexical Gaps: English to Chinese • break, smash, shatter, snap => ? • da po: irregular pieces • da sui: small pieces • pie duan: line segments
Word sense disambiguation with Source Language Neighbors and Target Language Semantic Class Constraints • break {smash, shatter, snap, etc.} <=> {da sui, da po, pie duan, di po, ...} • da sui (Agent, Patient: small and brittle) • da po (Agent, Patient: concrete, inflexible object) • pie duan (Agent, Patient: line-segment shape) • ACL94, MTJ95
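A sketch of the neighbor-set idea under the same assumptions: a verb in the break neighbor set shares one pool of Chinese candidates, and features of the Patient (illustrative names, not drawn from a real ontology) pick among them:

```python
# Neighbor-set backoff: break-class verbs share one pool of Chinese
# candidates; Patient features (illustrative names) pick among them.
NEIGHBOR_SET = {"break", "smash", "shatter", "snap"}

# Each Chinese verb with the Patient properties it requires.
CANDIDATES = {
    "da sui":   {"small", "brittle"},
    "da po":    {"concrete", "inflexible"},
    "pie duan": {"line-segment"},
}

def translate_break_verb(verb: str, patient_features: set) -> str:
    """Back off from the verb to its neighbor set, then filter by Patient."""
    if verb not in NEIGHBOR_SET:
        raise LookupError(f"{verb!r} is not in the break neighbor set")
    for target, required in CANDIDATES.items():
        if required <= patient_features:   # all required properties present
            return target
    raise LookupError("no Chinese candidate matches this Patient")

# "snap a twig": a twig is small and brittle, so da sui is chosen.
print(translate_break_verb("snap", {"small", "brittle", "wooden"}))
```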
Levin classes (3100 verbs) • 47 top level classes, 150 second and third level • Based on pairs of syntactic frames. • John broke the jar. / Jars break easily. / The jar broke. • John cut the bread. / Bread cuts easily. / *The bread cut. • John hit the wall. / *Walls hit easily. / *The wall hit. • Reflect underlying semantic components • contact, directed motion, • exertion of force, change of state • Synonyms, syntactic patterns, relations
Confusions in Levin classes? • Not semantically homogeneous • {braid, clip, file, powder, pluck, etc...} • Multiple class listings • homonymy or polysemy? • Alternation contradictions? • Carry verbs disallow the Conative, but include {push, pull, shove, kick, draw, yank, tug} • the same verbs appear in the Push/Pull class, which does take the Conative
Regular Sense Extensions • John pushed the chair. +force, +contact • arg0 arg1 • John pushed the chairs apart. +ch-state • John pushed the chairs across the room. +ch-loc • John pushed at the chair. -ch-loc • The train whistled into the station. +ch-loc • The truck roared past the weigh station. +ch-loc • AMTA98, ACL98, TAG98
Intersective Levin Classes • More syntactically and semantically coherent • sets of syntactic patterns • explicit semantic components • relations between senses • VERBNET
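Mechanically, an intersective class is just the overlap of two or more Levin classes; the sketch below computes pairwise overlaps over a small illustrative sample of memberships (not the real class lists):

```python
# An intersective class is the overlap of two or more Levin classes.
# Memberships below are a small illustrative sample, not the real lists.
from itertools import combinations

LEVIN = {
    "carry":     {"carry", "push", "pull", "shove", "kick", "draw", "yank", "tug"},
    "push/pull": {"push", "pull", "shove", "kick", "draw", "yank", "tug", "press"},
    "roll":      {"roll", "bounce", "float", "slide", "drift", "glide"},
}

def intersective_classes(classes: dict, min_size: int = 2) -> dict:
    """All pairwise class intersections with at least min_size members."""
    return {
        (a, b): classes[a] & classes[b]
        for a, b in combinations(classes, 2)
        if len(classes[a] & classes[b]) >= min_size
    }

for pair, verbs in intersective_classes(LEVIN).items():
    print(pair, sorted(verbs))
# ('carry', 'push/pull') ['draw', 'kick', 'pull', 'push', 'shove', 'tug', 'yank']
```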
Manner of Motion Verbs: • Roll verbs • The ball rolled (down the hill). • Down the hill rolled the ball. • Bill rolled the ball down the hill. • The ball rolled free. • The ball rolled 3 feet.
Manner of Motion Verbs: • Run verbs • The horse jumped (over the stream). • The horse jumped the stream. • Over the stream jumped the horse. • The rider jumped the horse over the stream. • The horse jumped himself into a lather. • The horse jumped five feet. • He made/went for a jump.
Manner of Motion Verbs: • Roll verbs • The ball rolled down the hill. arg1 argP • Bill rolled the ball down the hill. arg0 arg1 argP • Run verbs • The horse jumped. arg0 • The rider jumped the horse over the stream. argA arg0 argP
Portuguese: similar patterns, except for… • Same • Bounce/rebater • Float/flutuar • Roll/rolar • Slide/deslizar • Different (no causative) • Drift/derivar • Glide/planar
Machine Translation - Head switching • The log floated into the cave. • A madeira entrou na caverna flutuando. • [log] [entered] [cave] [floating]
Treatment for Head Switching uses: • Cross-linguistic generalizations • based on • Intersective Levin classes • implemented in • Feature-based Lexicalized Tree-Adjoining Grammars
Partial derivations for Head Switching - STAG Transfer Lexicon AMTA96, KLUWER98
Outline • Lexical Choice in Machine Translation • Constraints on choices - word level • Introduce verb classes – VerbNet • Constraints on choices - class level • Sense tagging data • adding semantics to the Penn TreeBank • VerbNet as a companion lexicon • Korean and Chinese TreeBanks
Preparing Training Data • WordNet - online lexical resource • ISA relations, part-whole, synonym sets • Poor inter-annotator agreement, 70-80% (e.g., for lose) • No predicate-argument structures or constraints • SENSEVAL/SIGLEX98 (Brighton, Sept. 1998) • Workshop on Word Sense Disambiguation • 34 words, corpus-based sense inventory • Inter-annotator agreement over 90%
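For a concrete look at the granularity problem, a WordNet verb entry can be inspected with NLTK (a modern interface that postdates this talk; the work described here used WordNet 1.6 directly):

```python
# Requires: pip install nltk && python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn

# Fine-grained distinctions like these drove agreement down to 70-80%.
for synset in wn.synsets("lose", pos=wn.VERB):
    print(synset.name(), "-", synset.definition())

# Hypernym (ISA) chains give one axis for coarsening the inventory.
print(wn.synsets("lose", pos=wn.VERB)[0].hypernyms())
```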
Shake: Hector vs. WordNet • Hector: 8 major distinctions; shake up (3), shake down (3), shake out (2), shake off (2) • WordNet: 8 major distinctions; shake up (2), shake off (2)
Approach • Revising WordNet with VerbNet • corpus-based sense distinctions • explicit predicate-argument structure and constraints • Provides more coarse-grained sense distinctions for easy mapping to other lexical resources • Sense tagging Penn TreeBank
Semantic Annotation – Hoa Dang, Joseph Rosenzweig, John Duda • Current syntactic annotation • POS, phrase structure bracketing • logical subject, locative, temporal adjuncts • New semantic augmentations • Sense tag verbs and noun arguments/adjuncts • Predicate-argument relations for verbs, label arguments (arg0, arg1, arg2)
First Experiment (SIGLEX99) • WSJ 5K-word corpus, running text • WordNet 1.6 • 2100 words sense tagged twice (10 days) • 89% inter-annotator agreement • 700 verb tokens: 81% agreement (disagreement on 90 of 350 verb tokens) • Automatic predicate-argument labeling: 81% precision on 162 structures • Hand corrected 2100 words in one day
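The agreement figures above are observed (percentage) agreement between the two taggings; a minimal sketch of the computation, over hypothetical sense labels:

```python
def observed_agreement(tags_a: list, tags_b: list) -> float:
    """Fraction of tokens to which both annotators gave the same tag."""
    assert len(tags_a) == len(tags_b), "annotations must cover the same tokens"
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)

# Hypothetical double-tagged tokens (word%sense labels are made up).
ann1 = ["shake%WN2", "shake%WN3", "rock%WN1", "lose%WN1"]
ann2 = ["shake%WN2", "shake%WN1", "rock%WN1", "lose%WN1"]
print(f"{observed_agreement(ann1, ann2):.0%}")  # 75%
```

A chance-corrected measure such as Cohen's kappa would additionally discount the agreement expected from skewed sense distributions.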
Example • I was shaking the whole time. <arg0> <WN2> <temporal> • The walls shook; the building rocked. <arg1> <WN3>; <arg1> <WN1>
Predicate-argument labels • Rosenzweig’s converter • Uses TreeBank “cues” • Consults lexical semantic KB • Verb subcategorization frames and alternations • Ontology of noun-phrase referents • Multi-word lexical items
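A skeleton of the cue-based step, assuming a toy (label, children) tree encoding; the real converter also consults the subcategorization frames and noun ontology listed above, and would restrict the bare-NP cue to object position under VP rather than applying it everywhere:

```python
# Toy Penn TreeBank-style tree: (label, children) for constituents,
# (tag, word) for leaves.
TREE = ("S",
        [("NP-SBJ", [("PRP", "I")]),
         ("VP", [("VBD", "shook"),
                 ("NP", [("DT", "the"), ("NN", "jar")]),
                 ("NP-TMP", [("DT", "the"), ("JJ", "whole"), ("NN", "time")])])])

# Function-tag cues -> argument labels (a small illustrative subset).
CUES = {"NP-SBJ": "arg0", "NP": "arg1", "NP-TMP": "temporal", "PP-LOC": "locative"}

def yield_of(tree):
    """The word string covered by a constituent."""
    _, children = tree
    if isinstance(children, str):           # leaf: (tag, word)
        return children
    return " ".join(yield_of(c) for c in children)

def label_args(tree, labels=None):
    """Collect (arg label, word span) pairs for constituents matching a cue."""
    if labels is None:
        labels = []
    node, children = tree
    if node in CUES:
        labels.append((CUES[node], yield_of(tree)))
    if isinstance(children, list):
        for child in children:
            if isinstance(child[1], list):  # recurse into constituents only
                label_args(child, labels)
    return labels

print(label_args(TREE))
# [('arg0', 'I'), ('arg1', 'the jar'), ('temporal', 'the whole time')]
```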
Predicate-Argument Labeling:one raid tree – Rosenzweig’s converter
Second Experiment: Methodology (with Christiane Fellbaum) • Sense tagging of 150K words • Two human annotators (replace one with automatic WSD if possible) • WordNet senses, but allow for revision of entries • Automatic predicate-argument labeling • hand correction, lexicon for reference • Standoff XML annotation
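A minimal sketch of the standoff idea: annotations live in a separate XML file and point into the untouched source text by token span. Element and attribute names here are hypothetical, not the project's actual DTD:

```python
import xml.etree.ElementTree as ET

text = "The walls shook ; the building rocked ."
tokens = text.split()

# Each tag references the raw text by (start, end) token offsets,
# using the labels from the earlier annotation example.
root = ET.Element("annotations", doc="example")
for start, end, label, sense in [(0, 2, "arg1", None),
                                 (2, 3, "pred", "shake%WN3"),
                                 (4, 6, "arg1", None),
                                 (6, 7, "pred", "rock%WN1")]:
    tag = ET.SubElement(root, "tag", start=str(start), end=str(end), label=label)
    if sense:
        tag.set("sense", sense)
    tag.text = " ".join(tokens[start:end])   # redundant copy, for readability

print(ET.tostring(root, encoding="unicode"))
```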
Outline • Lexical Choice in Machine Translation • Constraints on choices - word level • Introduce verb classes – VerbNet • Constraints on choices - class level • Sense tagging data • adding semantics to the Penn TreeBank • VerbNet as a companion lexicon • Korean and Chinese TreeBanks
Korean/English MT Components • Korean: morphological analyzer (POS tags), parser/generator, TreeBank, companion pred-arg lexicon • English: POStagger, parser/generator, TreeBank, companion pred-arg lexicon • Transfer lexicon links the two
Transfer lexicon entries: Mapping predicate argument structures across languages
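A sketch of what one such entry might hold, with illustrative field names; a direct mapping (lose <=> ciessta) is shown next to the head-switching float/flutuar case from the MT slides:

```python
from dataclasses import dataclass

@dataclass
class TransferEntry:
    src_pred: str          # source (English) predicate
    tgt_pred: str          # target-language predicate
    arg_map: dict          # source arg label -> target arg label
    tgt_adjunct: str = ""  # material demoted to an adjunct by head switching

# Direct mapping: argument labels line up one-to-one.
lose = TransferEntry("lose", "ciessta", {"arg0": "arg0", "arg1": "arg1"})

# Head switching: in "The log floated into the cave" ->
# "A madeira entrou na caverna flutuando", the path argument becomes the
# main verb entrar and float is demoted to the gerund adjunct flutuando.
float_pt = TransferEntry("float", "entrar",
                         {"arg1": "arg0", "argP": "locative"},
                         tgt_adjunct="flutuando")
print(float_pt)
```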
Korean/English MT – Chunghye Han, Juntae Yoon, Meesook Kim, Eonsuk Ko (CoGenTex/Penn/Systran: ARL) • Parallel TreeBanks for Korean/English enable • Training of domain-specific Korean parsers • Collins parser and SuperTagger (also English) • Alignment of Korean/English structures • Attempt automatic and semi-automatic testing and generation of transfer lexicon (with CoGenTex) • Apply statistical MT techniques • Lexical semantics (Systran, mapped to EuroWordNet-IL) should improve • Accuracy of parsers • Recovery of dropped arguments • http://www.cis.upenn.edu/~xtag/koreantag/index.html
Chinese TreeBank – DOD • Fei Xia, Nianwen Xue, Fu-Dong Chiou • http://www.ldc.upenn.edu/ctb/index.html • Workshop of interested members of the Chinese community, June '98 • Guidelines and sample files posted on the web • Segmentation, March '99 • POStagging, March '99 • Bracketing, first pass, October '99 • Bracketing, second pass, May '00 • 95%+ inter-annotator consistency • Release of 100K of annotated data, July '00 • Follow-up workshop, Hong Kong, ACL '00
Goal for Chinese • Parallel, annotated corpora: translate the CTB • Parse English with WSJ-trained parsers, correct by hand • Extend English TreeBank lexicon as needed • Parse Chinese with CTB-trained parsers, correct by hand • Start with a lexicon extracted from the CTB, extend as needed • Experiment with semi-automated techniques wherever possible to speed up the process
Conclusion • Corpus annotations can be efficiently and reliably enriched with semantics • Companion lexicons can be derived from them • Challenge: parallel corpora annotated with predicate-argument structures will improve statistical MT. Prove me wrong?