Sense Tagging the Penn TreeBank Martha Palmer, CIS and IRCS, University of Pennsylvania Johns Hopkins University, April 18, 2000
Penn approach • Relies on lexically based linguistic analysis • Humans annotate naturally occurring text • (hand correct output of automatic parsers, e.g. Fidditch, XTAG) • Train statistical POStaggers, parsers, etc. • Common thread is predicate-argument structure Hypothesis: more linguistically sophisticated analyzers → more accurate output
Past results • XTAG project http://www.cis.upenn.edu/~xtag/ • Penn TreeBank http://www.cis.upenn.edu/~treebank/ • Enabled the development of tools: POStaggers, parsers, co-reference, etc. http://www.ircs.upenn.edu/knowledge/licensing.html
Under TIDES: • Annotations enriched with semantics and pragmatics • Provide companion lexicons for annotated corpora • Extend our coverage to other languages (Chinese, Korean) Hypothesis: Parallel annotated corpora/lexicons will enable rapid ramp-up of MT
Outline • Lexical Choice in Machine Translation • Constraints on choices - word level • Introduce verb classes – VerbNet • Constraints on choices - class level • Sense tagging data • adding semantics to the Penn TreeBank • VerbNet as a companion lexicon • Korean and Chinese TreeBanks
Machine Translation Lexical Choice- Word Sense Disambiguation • Iraq lost the battle. • Ilakuka centwey ciessta. • [Iraq ] [battle] [lost]. • John lost his computer. • John-i computer-lul ilepelyessta. • [John] [computer] [misplaced].
Cross-linguistic Information Retrieval-sense ambiguities • English • speed of light > *plea bargaining, • (speedier trials, lighter sentences) • Multilingual: French => English • saisi stupefiant > drug seizure, *grip narcotic, *understand stupefying,...
Predicate-argument structures for lose • lose1 (Agent: animate, Patient: physical-object) • lose2 (Agent: animate, Patient: competition) • Agent <=> SUBJ • Patient <=> OBJ • ACL81, ACL85, ACL86, MT90, CUP90, AIJ93
Word sense disambiguation with Source Language Semantic Class Constraints (co-occurrence patterns + backoff) • lose1 (Agent, Patient: physical-object) <=> ilepelyessta • lose2 (Agent, Patient: competition) <=> ciessta
Word sense disambiguation with Target Language Semantic Constraints • receive <=> {patassta, swusinhayssta} • patassta (Recipient, Patient: physical-object) • swusinhayssta (Recipient, Patient: communication) • TAG+94, CSLI99, AMTA94
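The same mechanism drives both the source- and target-side constraints: match the semantic class of the Patient against each sense's selectional restriction. A minimal Python sketch, using the lose and receive inventories and Korean glosses from the slides above (the flat ISA table stands in for a real ontology, and the feature encoding is illustrative):

```python
# Sense selection via a semantic-class constraint on the Patient argument.
# Sense inventories and Korean glosses follow the slides; the flat ISA
# table below is a stand-in for a real ontology such as WordNet.
LEXICON = {
    "lose": [
        {"patient": "physical-object", "korean": "ilepelyessta"},  # misplace
        {"patient": "competition",     "korean": "ciessta"},       # be defeated
    ],
    "receive": [
        {"patient": "physical-object", "korean": "patassta"},
        {"patient": "communication",   "korean": "swusinhayssta"},
    ],
}

ISA = {"computer": "physical-object", "battle": "competition",
       "letter": "communication"}

def translate_verb(verb: str, obj_noun: str) -> str:
    """Pick the target verb whose Patient restriction matches the object."""
    obj_class = ISA.get(obj_noun)
    for sense in LEXICON[verb]:
        if sense["patient"] == obj_class:
            return sense["korean"]
    raise LookupError(f"no sense of {verb!r} selects a {obj_class!r} Patient")

print(translate_verb("lose", "battle"))    # ciessta (Iraq lost the battle)
print(translate_verb("lose", "computer"))  # ilepelyessta (John lost his computer)
```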
Lexical Gaps: English to Chinese • break, smash, shatter, snap => ? • da po: irregular pieces • da sui: small pieces • pie duan: line segments
Word sense disambiguation with Source Language Neighbors and Target Language Semantic Class Constraints • break {smash, shatter, snap, etc.} <=> {da sui, da po, pie duan, di po, ...} • da sui (Agent, Patient: small and brittle) • da po (Agent, Patient: concrete, inflexible object) • pie duan (Agent, Patient: line-segment shape) • ACL94, MTJ95
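A sketch of the neighbor-set idea under the same assumptions: a verb in the break neighbor set shares one pool of Chinese candidates, and features of the Patient (illustrative names, not drawn from a real ontology) pick among them:

```python
# Neighbor-set backoff: break-class verbs share one pool of Chinese
# candidates; Patient features (illustrative names) pick among them.
NEIGHBOR_SET = {"break", "smash", "shatter", "snap"}

# Each Chinese verb with the Patient properties it requires.
CANDIDATES = {
    "da sui":   {"small", "brittle"},
    "da po":    {"concrete", "inflexible"},
    "pie duan": {"line-segment"},
}

def translate_break_verb(verb: str, patient_features: set) -> str:
    """Back off from the verb to its neighbor set, then filter by Patient."""
    if verb not in NEIGHBOR_SET:
        raise LookupError(f"{verb!r} is not in the break neighbor set")
    for target, required in CANDIDATES.items():
        if required <= patient_features:   # all required properties present
            return target
    raise LookupError("no Chinese candidate matches this Patient")

# "snap a twig": a twig is small and brittle, so da sui is chosen.
print(translate_break_verb("snap", {"small", "brittle", "wooden"}))
```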
Levin classes (3100 verbs) • 47 top level classes, 150 second and third level • Based on pairs of syntactic frames. • John broke the jar. / Jars break easily. / The jar broke. • John cut the bread. / Bread cuts easily. / *The bread cut. • John hit the wall. / *Walls hit easily. / *The wall hit. • Reflect underlying semantic components • contact, directed motion, • exertion of force, change of state • Synonyms, syntactic patterns, relations
Confusions in Levin classes? • Not semantically homogeneous • {braid, clip, file, powder, pluck, etc...} • Multiple class listings • homonymy or polysemy? • Alternation contradictions? • Carry verbs disallow the Conative, but include {push, pull, shove, kick, draw, yank, tug} • the same verbs appear in the Push/Pull class, which does take the Conative
Regular Sense Extensions • John pushed the chair. +force, +contact • arg0 arg1 • John pushed the chairs apart. +ch-state • John pushed the chairs across the room. +ch-loc • John pushed at the chair. -ch-loc • The train whistled into the station. +ch-loc • The truck roared past the weigh station. +ch-loc • AMTA98, ACL98, TAG98
Intersective Levin Classes • More syntactically and semantically coherent • sets of syntactic patterns • explicit semantic components • relations between senses • VERBNET
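Mechanically, an intersective class is just the overlap of two or more Levin classes; the sketch below computes pairwise overlaps over a small illustrative sample of memberships (not the real class lists):

```python
# An intersective class is the overlap of two or more Levin classes.
# Memberships below are a small illustrative sample, not the real lists.
from itertools import combinations

LEVIN = {
    "carry":     {"carry", "push", "pull", "shove", "kick", "draw", "yank", "tug"},
    "push/pull": {"push", "pull", "shove", "kick", "draw", "yank", "tug", "press"},
    "roll":      {"roll", "bounce", "float", "slide", "drift", "glide"},
}

def intersective_classes(classes: dict, min_size: int = 2) -> dict:
    """All pairwise class intersections with at least min_size members."""
    return {
        (a, b): classes[a] & classes[b]
        for a, b in combinations(classes, 2)
        if len(classes[a] & classes[b]) >= min_size
    }

for pair, verbs in intersective_classes(LEVIN).items():
    print(pair, sorted(verbs))
# ('carry', 'push/pull') ['draw', 'kick', 'pull', 'push', 'shove', 'tug', 'yank']
```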
Manner of Motion Verbs: • Roll verbs • The ball rolled (down the hill). • Down the hill rolled the ball. • Bill rolled the ball down the hill. • The ball rolled free. • The ball rolled 3 feet.
Manner of Motion Verbs: • Run verbs • The horse jumped (over the stream). • The horse jumped the stream. • Over the stream jumped the horse. • The rider jumped the horse over the stream. • The horse jumped himself into a lather. • The horse jumped five feet. • He made/went for a jump.
Manner of Motion Verbs: • Roll verbs • The ball rolled down the hill. arg1 argP • Bill rolled the ball down the hill. arg0 arg1 argP • Run verbs • The horse jumped. arg0 • The rider jumped the horse over the stream. argA arg0 argP
Portuguese: similar patterns, except for… • Same • Bounce/rebater • Float/flutuar • Roll/rolar • Slide/deslizar • Different (no causative) • Drift/derivar • Glide/planar
Machine Translation - Head switching • The log floated into the cave. • A madeira entrou na caverna flutuando. • [log] [entered] [cave] [floating]
Treatment for Head Switching uses: • Cross-linguistic generalizations • based on • Intersective Levin classes • implemented in • Feature-based Lexicalized Tree-Adjoining Grammars
Partial derivations for Head Switching - STAG Transfer Lexicon AMTA96, KLUWER98
Outline • Lexical Choice in Machine Translation • Constraints on choices - word level • Introduce verb classes – VerbNet • Constraints on choices - class level • Sense tagging data • adding semantics to the Penn TreeBank • VerbNet as a companion lexicon • Korean and Chinese TreeBanks
Preparing Training Data • WordNet - online lexical resource • ISA relations, part-whole, synonym sets • Poor inter-annotator agreement, 70-80% (e.g., for lose) • No predicate-argument structures or constraints • SENSEVAL/SIGLEX98 (Brighton, Sept. 1998) • Workshop on Word Sense Disambiguation • 34 words, corpus-based sense inventory • Inter-annotator agreement over 90%
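For a concrete look at the granularity problem, a WordNet verb entry can be inspected with NLTK (a modern interface that postdates this talk; the work described here used WordNet 1.6 directly):

```python
# Requires: pip install nltk && python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn

# Fine-grained distinctions like these drove agreement down to 70-80%.
for synset in wn.synsets("lose", pos=wn.VERB):
    print(synset.name(), "-", synset.definition())

# Hypernym (ISA) chains give one axis for coarsening the inventory.
print(wn.synsets("lose", pos=wn.VERB)[0].hypernyms())
```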
Shake: Hector vs. WordNet • Hector: 8 major distinctions; shake up (3), shake down (3), shake out (2), shake off (2) • WordNet: 8 major distinctions; shake up (2), shake off (2)
Approach • Revising WordNet with VerbNet • corpus-based sense distinctions • explicit predicate-argument structure and constraints • Provides more coarse-grained sense distinctions for easy mapping to other lexical resources • Sense tagging Penn TreeBank
Semantic Annotation – Hoa Dang, Joseph Rosenzweig, John Duda • Current syntactic annotation • POS, phrase structure bracketing • logical subject, locative, temporal adjuncts • New semantic augmentations • Sense tag verbs and noun arguments/adjuncts • Predicate-argument relations for verbs, label arguments (arg0, arg1, arg2)
First Experiment (SIGLEX99) • WSJ 5K-word corpus, running text • WordNet 1.6 • 2100 words sense tagged twice (10 days) • 89% inter-annotator agreement • 700 verb tokens: 81% agreement (disagreement on 90 of 350 verb tokens) • Automatic predicate-argument labeling: 81% precision on 162 structures • Hand corrected 2100 words in one day
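The agreement figures above are observed (percentage) agreement between the two taggings; a minimal sketch of the computation, over hypothetical sense labels:

```python
def observed_agreement(tags_a: list, tags_b: list) -> float:
    """Fraction of tokens to which both annotators gave the same tag."""
    assert len(tags_a) == len(tags_b), "annotations must cover the same tokens"
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)

# Hypothetical double-tagged tokens (word%sense labels are made up).
ann1 = ["shake%WN2", "shake%WN3", "rock%WN1", "lose%WN1"]
ann2 = ["shake%WN2", "shake%WN1", "rock%WN1", "lose%WN1"]
print(f"{observed_agreement(ann1, ann2):.0%}")  # 75%
```

A chance-corrected measure such as Cohen's kappa would additionally discount the agreement expected from skewed sense distributions.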
Example • I was shaking the whole time. <arg0> <WN2> <temporal> • The walls shook; the building rocked. <arg1> <WN3>; <arg1> <WN1>
Predicate-argument labels • Rosenzweig’s converter • Uses TreeBank “cues” • Consults lexical semantic KB • Verb subcategorization frames and alternations • Ontology of noun-phrase referents • Multi-word lexical items
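A skeleton of the cue-based step, assuming a toy (label, children) tree encoding; the real converter also consults the subcategorization frames and noun ontology listed above, and would restrict the bare-NP cue to object position under VP rather than applying it everywhere:

```python
# Toy Penn TreeBank-style tree: (label, children) for constituents,
# (tag, word) for leaves.
TREE = ("S",
        [("NP-SBJ", [("PRP", "I")]),
         ("VP", [("VBD", "shook"),
                 ("NP", [("DT", "the"), ("NN", "jar")]),
                 ("NP-TMP", [("DT", "the"), ("JJ", "whole"), ("NN", "time")])])])

# Function-tag cues -> argument labels (a small illustrative subset).
CUES = {"NP-SBJ": "arg0", "NP": "arg1", "NP-TMP": "temporal", "PP-LOC": "locative"}

def yield_of(tree):
    """The word string covered by a constituent."""
    _, children = tree
    if isinstance(children, str):           # leaf: (tag, word)
        return children
    return " ".join(yield_of(c) for c in children)

def label_args(tree, labels=None):
    """Collect (arg label, word span) pairs for constituents matching a cue."""
    if labels is None:
        labels = []
    node, children = tree
    if node in CUES:
        labels.append((CUES[node], yield_of(tree)))
    if isinstance(children, list):
        for child in children:
            if isinstance(child[1], list):  # recurse into constituents only
                label_args(child, labels)
    return labels

print(label_args(TREE))
# [('arg0', 'I'), ('arg1', 'the jar'), ('temporal', 'the whole time')]
```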
Predicate-Argument Labeling:one raid tree – Rosenzweig’s converter
Second Experiment: Methodology (with Christiane Fellbaum) • Sense tagging of 150K words • Two human annotators (replace one with automatic WSD if possible) • WordNet senses, but allow for revision of entries • Automatic predicate-argument labeling • hand correction, lexicon for reference • Standoff XML annotation
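A minimal sketch of the standoff idea: annotations live in a separate XML file and point into the untouched source text by token span. Element and attribute names here are hypothetical, not the project's actual DTD:

```python
import xml.etree.ElementTree as ET

text = "The walls shook ; the building rocked ."
tokens = text.split()

# Each tag references the raw text by (start, end) token offsets,
# using the labels from the earlier annotation example.
root = ET.Element("annotations", doc="example")
for start, end, label, sense in [(0, 2, "arg1", None),
                                 (2, 3, "pred", "shake%WN3"),
                                 (4, 6, "arg1", None),
                                 (6, 7, "pred", "rock%WN1")]:
    tag = ET.SubElement(root, "tag", start=str(start), end=str(end), label=label)
    if sense:
        tag.set("sense", sense)
    tag.text = " ".join(tokens[start:end])   # redundant copy, for readability

print(ET.tostring(root, encoding="unicode"))
```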
Outline • Lexical Choice in Machine Translation • Constraints on choices - word level • Introduce verb classes – VerbNet • Constraints on choices - class level • Sense tagging data • adding semantics to the Penn TreeBank • VerbNet as a companion lexicon • Korean and Chinese TreeBanks
Korean/English MT Components • Korean: morphological analyzer (POS tags), parser/generator, TreeBank, companion pred-arg lexicon • English: POStagger, parser/generator, TreeBank, companion pred-arg lexicon • Transfer lexicon links the two
Transfer lexicon entries: Mapping predicate argument structures across languages
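A sketch of what one such entry might hold, with illustrative field names; a direct mapping (lose <=> ciessta) is shown next to the head-switching float/flutuar case from the MT slides:

```python
from dataclasses import dataclass

@dataclass
class TransferEntry:
    src_pred: str          # source (English) predicate
    tgt_pred: str          # target-language predicate
    arg_map: dict          # source arg label -> target arg label
    tgt_adjunct: str = ""  # material demoted to an adjunct by head switching

# Direct mapping: argument labels line up one-to-one.
lose = TransferEntry("lose", "ciessta", {"arg0": "arg0", "arg1": "arg1"})

# Head switching: in "The log floated into the cave" ->
# "A madeira entrou na caverna flutuando", the path argument becomes the
# main verb entrar and float is demoted to the gerund adjunct flutuando.
float_pt = TransferEntry("float", "entrar",
                         {"arg1": "arg0", "argP": "locative"},
                         tgt_adjunct="flutuando")
print(float_pt)
```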
Korean/English MT – Chunghye Han, Juntae Yoon, Meesook Kim, Eonsuk Ko (CoGenTex/Penn/Systran: ARL) • Parallel TreeBanks for Korean/English enable • Training of domain-specific Korean parsers • Collins parser and SuperTagger (also English) • Alignment of Korean/English structures • Attempt automatic and semi-automatic testing and generation of transfer lexicon (with CoGenTex) • Apply statistical MT techniques • Lexical semantics (Systran, mapped to EuroWordNet-IL) should improve • Accuracy of parsers • Recovery of dropped arguments • http://www.cis.upenn.edu/~xtag/koreantag/index.html
Chinese TreeBank – DOD • Fei Xia, Nianwen Xue, Fu-Dong Chiou • http://www.ldc.upenn.edu/ctb/index.html • Workshop of interested members of the Chinese community, June '98 • Guidelines and sample files posted on the web • Segmentation, March '99 • POStagging, March '99 • Bracketing, first pass, October '99 • Bracketing, second pass, May '00 • 95%+ inter-annotator consistency • Release of 100K of annotated data, July '00 • Follow-up workshop, Hong Kong, ACL '00
Goal for Chinese • Parallel, annotated corpora: translate the CTB • Parse English with WSJ-trained parsers, correct by hand • Extend English TreeBank lexicon as needed • Parse Chinese with CTB-trained parsers, correct by hand • Start with a lexicon extracted from the CTB, extend as needed • Experiment with semi-automated techniques wherever possible to speed up the process
Conclusion • Corpus annotations can be efficiently and reliably enriched with semantics • Companion lexicons can be derived from them • Challenge: parallel corpora annotated with predicate-argument structures will improve statistical MT. Prove me wrong?