Exploring PropBanks for English and Hindi Ashwini Vaidya Dept of Linguistics University of Colorado, Boulder
Why is semantic information important? • Imagine an automatic question answering system • Who created the first effective polio vaccine? • Two possible choices: • Becton Dickinson created the first disposable syringe for use with the mass administration of the first effective polio vaccine • The first effective polio vaccine was created in 1952 by Jonas Salk at the University of Pittsburgh
Question Answering • Who created the first effective polio vaccine? • Becton Dickinson created the first disposable syringe for use with the mass administration of the first effective polio vaccine • The first effective polio vaccine was created in 1952 by Jonas Salk at the University of Pittsburgh
Question Answering • Who created the first effective polio vaccine? • [Becton Dickinson] created the [first disposable syringe] for use with the mass administration of the first effective polio vaccine • [The first effective polio vaccine] was created in 1952 by [Jonas Salk] at the University of Pittsburgh
Question Answering • Who created the first effective polio vaccine? • [Becton Dickinson]_agent created the [first disposable syringe]_theme for use with the mass administration of the first effective polio vaccine • [The first effective polio vaccine]_theme was created in 1952 by [Jonas Salk]_agent at the University of Pittsburgh
Question Answering • We need semantic information to prefer the right answer • The theme of create should be ‘the first effective polio vaccine’ • The theme in the first sentence was ‘the first disposable syringe’ • We can filter out the wrong answer
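A toy sketch of this filtering step (plain Python; the role-labelled candidate structures are invented for illustration and are not part of any PropBank release):

```python
# Hypothetical role-labelled candidates for
# "Who created the first effective polio vaccine?"
candidates = [
    {"predicate": "create",
     "agent": "Becton Dickinson",
     "theme": "the first disposable syringe"},
    {"predicate": "create",
     "agent": "Jonas Salk",
     "theme": "the first effective polio vaccine"},
]

question_theme = "the first effective polio vaccine"

# Keep only candidates whose theme matches what the question asks about,
# then read off the agent as the answer.
answers = [c["agent"] for c in candidates
           if c["predicate"] == "create" and c["theme"] == question_theme]
print(answers)  # ['Jonas Salk']
```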
We need semantic information • To find out about events and their participants • To capture semantic information across syntactic variation
Semantic information • Semantic information about verbs and participants expressed through semantic roles • Agent, Experiencer, Theme, Result etc. • However, difficult to have a standard set of thematic roles
Proposition Bank • Proposition Bank (PropBank) provides a way to carry out general-purpose semantic role labelling • A PropBank is a large annotated corpus of predicate-argument information • A set of semantic roles is defined for each verb • A syntactically parsed corpus is then tagged with verb-specific semantic role information
Outline • English PropBank • Background • Annotation • Frame files & Tagset • Hindi PropBank development • Adapting Frame files • Light verbs • Mapping from dependency labels
Proposition Bank • The first (English) PropBank was created on a 1-million-word syntactically parsed Wall Street Journal corpus • PropBank annotation has also been done on different genres, e.g. web text and biomedical text • Arabic, Chinese & Hindi PropBanks have been created
English PropBank • English PropBank was envisioned as the next level of the Penn Treebank (Kingsbury & Palmer, 2003) • Added a layer of predicate-argument information to the Penn Treebank • Broad in its coverage, covering every instance of a verb and its semantic arguments in the corpus • Amenable to collecting representative statistics
English PropBank Annotation • Two steps are involved in annotation • Choose a sense ID for the predicate • Annotate the arguments of that predicate with semantic roles • This requires two components: frame files and the PropBank tagset
PropBank Frame files • PropBank defines semantic roles on a verb-by-verb basis • This is defined in a verb lexicon consisting of frame files • Each predicate will have a set of roles associated with a distinct usage • A polysemous predicate can have several rolesets within its frame file
An example • John rings the bell
An example • John rings the bell • Tall aspen trees ring the lake
An example • [John] rings [the bell] (Ring.01) • [Tall aspen trees] ring [the lake] (Ring.02)
An example • [John]_Arg0 rings [the bell]_Arg1 (Ring.01) • [Tall aspen trees]_Arg1 ring [the lake]_Arg2 (Ring.02)
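A toy sketch of how a frame file entry and the two annotations above could be represented (plain Python; the real frame files are XML, and the role descriptions here are illustrative assumptions):

```python
# Toy frame file for 'ring' with two rolesets (descriptions are illustrative).
ring_frame = {
    "lemma": "ring",
    "rolesets": {
        "ring.01": {"Arg0": "ringer", "Arg1": "thing rung"},   # 'John rings the bell'
        "ring.02": {"Arg1": "thing forming the ring",          # 'Trees ring the lake'
                    "Arg2": "thing encircled"},
    },
}

# The two example sentences annotated against these rolesets.
annotations = [
    ("ring.01", {"Arg0": "John", "Arg1": "the bell"}),
    ("ring.02", {"Arg1": "Tall aspen trees", "Arg2": "the lake"}),
]
for roleset, args in annotations:
    print(roleset, {label: ring_frame["rolesets"][roleset][label] for label in args})
```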
Frame files • The Penn Treebank had about 3185 unique lemmas (Palmer, Gildea & Kingsbury, 2005) • Most frequently occurring verb: say • A small number of verbs had several framesets, e.g. go, come, take, make • Most others had only one frameset per file
English PropBank Tagset • Numbered arguments: Arg0, Arg1, and so on up to Arg5 • Modifiers with function tags, e.g. ArgM-LOC (location), ArgM-TMP (time), ArgM-PRP (purpose) • Modifiers give additional information about when, where or how the event occurred
PropBank tagset • Numbered arguments correspond to the valency requirements of the verb • Or to arguments that occur with high frequency with that verb
PropBank tagset • 15 modifier labels for English PropBank • [He]_Arg0 studied [economic growth]_Arg1 [in India]_ArgM-LOC
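A toy sketch of one fully labelled proposition with both numbered arguments and a modifier (plain Python; the frameset id study.01 is assumed here for illustration):

```python
# One labelled proposition for the example sentence, as a toy record.
proposition = {
    "roleset": "study.01",       # frameset id (assumed for illustration)
    "Arg0": "He",                # the one studying
    "Arg1": "economic growth",   # the thing studied
    "ArgM-LOC": "in India",      # modifier: where the event occurred
}
print(proposition)
```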
PropBank tagset • Verb-specific and more generalized • Arg0 and Arg1 correspond to Dowty's Proto-Roles • Leverage the commonalities among semantic roles • Agents, causers, experiencers: Arg0 • Undergoers, patients, themes: Arg1
PropBank tagset • While annotating Arg0 and Arg1: • Unaccusative verbs take Arg1 as their subject argument: [The window]_Arg1 broke • Unergatives take Arg0: [John]_Arg0 sang • A distinction is also made between internally caused events (blush: Arg0) and externally caused events (redden: Arg1)
PropBank tagset • How might these map to the more familiar thematic roles? • Yi, Loper & Palmer (2007) describe such a mapping to VerbNet roles
The more frequent Arg0 and Arg1 (85% of arguments) are learned more easily by automatic systems • Arg2 is less frequent and maps to more than one thematic role • Arg3-Arg5 are even more infrequent
Using PropBank • As a computational resource • Train semantic role labellers (Pradhan et al., 2005) • Question answering systems (with FrameNet) • Project semantic roles onto a parallel corpus in another language (Padó & Lapata, 2005) • For linguists, to study various phenomena related to predicate-argument structure
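For English, the released annotations can also be inspected programmatically; a minimal sketch using NLTK's PropBank corpus reader (assuming NLTK and its bundled PropBank sample are installed, one-time setup via nltk.download('propbank')):

```python
# A minimal sketch of browsing the English PropBank annotations with NLTK.
from nltk.corpus import propbank

instances = propbank.instances()          # predicate-argument annotation records
inst = instances[0]
print(inst.fileid, inst.sentnum)          # which WSJ sentence is annotated
print(inst.roleset)                       # frameset id, e.g. 'say.01'
for pointer, label in inst.arguments:     # tree pointers paired with Arg labels
    print(label, pointer)

# Role definitions come from the frame files; list one roleset's roles.
for role in propbank.roleset('say.01').findall('roles/role'):
    print(role.attrib['n'], role.attrib['descr'])
```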
Outline • English PropBank • Background • Annotation • Frame files & Tagset • Hindi PropBank development • Adapting Frame files • Light verbs • Mapping from dependency labels
Developing PropBank for Hindi-Urdu • Hindi-Urdu PropBank is part of a project to develop a Multi-layered and multi-representational treebank for Hindi-Urdu • Hindi Dependency Treebank • Hindi PropBank • Hindi Phrase Structure Treebank • Ongoing project at CU-Boulder
Hindi-Urdu PropBank • Corpus of 400,000 words for Hindi • Smaller corpus of 150,000 words for Urdu • Hindi corpus consists of newswire text from 'Amar Ujala' • So far: • 220 verb frames • ~100K words annotated
Developing Hindi PropBank • Making a PropBank resource for a new language • Linguistic differences • Capturing relevant language-specific phenomena • Annotation practices • Maintain similar annotation practices • Consistency across PropBanks
Developing Hindi PropBank • PropBank annotation for English, Chinese & Arabic was done on top of phrase structure trees • Hindi PropBank is annotated on dependency trees
Dependency tree • Represents relations that hold between constituents (chunks) • Karaka labels show the relations between the head verb and its dependents • Example: दिये 'gave' heads राम ने 'Raam erg' (k1), पैसे 'money' (k2) and औरत को 'woman dat' (k4)
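A toy sketch of that example tree as a data structure (plain Python; only meant to show the head verb, its dependents and their karaka labels, not the treebank's actual file format):

```python
# The example tree as a toy structure: the head verb and its dependents,
# each paired with a karaka label.
tree = {
    "head": "दिये",                    # 'gave'
    "dependents": [
        ("राम ने", "k1"),              # Raam erg
        ("पैसे", "k2"),                # money
        ("औरत को", "k4"),              # woman dat
    ],
}
for word, karaka in tree["dependents"]:
    print(f"{word} --{karaka}--> {tree['head']}")
```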
Hindi PropBank • There are three components to the annotation • Hindi Frame file creation • Insertion of empty categories • Semantic role labelling
Hindi PropBank • There are three components to the annotation • Hindi Frame file creation • Insertion of empty categories • Semantic role labelling • Both frame creation and labelling require new strategies for Hindi
Hindi PropBank • Hindi frame files were adapted to include • Morphological causatives • Unaccusative verbs • Experiencers • Additionally, changes had to be made to analyze the large number (nearly 40%) of light verbs
Causatives • Verbs that are related via morphological derivation are grouped together as individual predicates in the same frame file • E.g. Cornerstone's multi-lemma mode is used for the verbs खा (KA, 'eat'), खिला (KilA, 'feed') and खिलवा (KilvA, 'cause to feed')
Causatives
• [raam ne]_Arg0 [khaana]_Arg1 khaayaa
  Ram erg food eat-perf
  'Ram ate the food'
• [mohan ne]_Arg0 [raam ko]_Arg2 [khaana]_Arg1 khilaayaa
  Mohan erg Ram dat food eat-caus-perf
  'Mohan made Ram eat the food'
• [sita ne]_ArgC [mohan se]_ArgA [raam ko]_Arg2 [khaana]_Arg1 khilvaayaa
  Sita erg Mohan instr Ram acc food eat-ind.caus-caus-perf
  'Sita, through Mohan, made Ram eat the food'
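A toy sketch of how the three morphologically related lemmas might sit in one multi-lemma frame, using the labels from the examples above (plain Python; the role descriptions are illustrative assumptions, not the released Hindi frame files):

```python
# Toy multi-lemma frame for KA 'eat' / KilA 'feed' / KilvA 'cause to feed'.
# ArgC (causer) and ArgA (intermediary) follow the labels in the examples above;
# the prose descriptions are illustrative assumptions.
eat_frame = {
    "KA (khaa, 'eat')": {
        "Arg0": "eater", "Arg1": "food"},
    "KilA (khilaa, 'feed')": {
        "Arg0": "feeder", "Arg2": "one caused to eat", "Arg1": "food"},
    "KilvA (khilvaa, 'cause to feed')": {
        "ArgC": "causer", "ArgA": "intermediary agent",
        "Arg2": "one caused to eat", "Arg1": "food"},
}
for lemma, roles in eat_frame.items():
    print(lemma, roles)
```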
Unaccusatives • PropBank needs to distinguish proto-agents and proto-patients • Sudha danced: unergative, animate agentive argument (Arg0) • The door opened: unaccusative, non-animate patient argument (Arg1) • For English, these distinctions were available in VerbNet • For Hindi, various diagnostic tests are applied to make this distinction
Unaccusativity diagnostics (flowchart, summarised)
• Five diagnostics are applied in stages to verbs undergoing the transitivity alternation: the cognate object and ergative case tests first eliminate verbs that take (mostly) animate subjects (e.g. naac, dauRa, bEtha), while verbs such as khulnaa and barasnaa come out as unaccusative
• For non-alternating verbs and any that remain, the verb is unaccusative if: impersonal passives are not possible; use of 'huaa' (past participial relative) is possible; the inanimate subject appears without an overt genitive
• Otherwise, take the majority vote on the tests
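A toy sketch of the majority-vote step for the remaining verbs (plain Python; the diagnostic names and example outcomes are invented for illustration):

```python
# Toy majority vote over the unaccusativity diagnostics sketched above.
# True means the test points towards an unaccusative analysis.
def is_unaccusative(test_results):
    votes_for = sum(test_results.values())
    return votes_for > len(test_results) / 2

# Invented outcomes for khulnaa 'open', purely for illustration.
khulnaa_tests = {
    "impersonal_passive_impossible": True,
    "huaa_participial_relative_possible": True,
    "inanimate_subject_without_genitive": True,
}
print(is_unaccusative(khulnaa_tests))  # True -> label subject argument as Arg1
```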
Unaccusativity • This information is then captured in the frame
Experiencers • Arg0 includes agents, causers, experiencers • In Hindi, experiencer subjects occur with dikhnaa (to glimpse), milnaa (to find), suujhnaa (to be struck with), etc. • Typically marked with dative case • Mujh-ko chaand dikhaa (I.dat moon glimpse.pf) 'I glimpsed the moon' • PropBank labels these as Arg0, with an additional descriptor field 'experiencer' • Internally caused events
Complex predicates • A large number of complex predicates in Hindi • Noun-verb complex predicates • vichaar karnaa; think do (to think) • dhyaan denaa; attention give (to pay attention) • Adjectives can also occur • acchaa lagnaa: good seem (to like) • The predicating element is not the verb alone, but forms a composite with the noun/adjective
Complex predicates • There are also a number of verb-verb complex predicates • These combine with the bare form of the verb to convey different meanings • ro paRaa; cry lie (burst out crying) • They add some aspectual meaning to the verb • As these occur with only a certain set of verbs, we find them automatically, based on part-of-speech tags
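A toy sketch of that automatic step (plain Python over hypothetical POS-tagged tokens; the tag names and the set of light/vector verbs are assumptions for illustration):

```python
# Find verb-verb complex predicate candidates: a verb followed by a member of a
# small, closed set of light/vector verbs. Tags and the verb set are assumptions.
LIGHT_VERBS = {"paRaa", "gayaa", "diyaa", "liyaa"}   # illustrative set only

def find_vv_candidates(tagged_tokens):
    """tagged_tokens: list of (word, pos) pairs."""
    pairs = zip(tagged_tokens, tagged_tokens[1:])
    return [(w1, w2) for (w1, t1), (w2, t2) in pairs
            if t1.startswith("V") and t2.startswith("V") and w2 in LIGHT_VERBS]

print(find_vv_candidates([("vah", "PRP"), ("ro", "VM"), ("paRaa", "VAUX")]))
# [('ro', 'paRaa')]
```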
Complex predicates • However, the noun-verb complex predicates are more pervasive • They occur with a large number of nouns • Frame files for the nominal predicates need to be created • E.g. in a sample of 90K words, the light verb kar 'do' occurred 1738 times with 298 different nominals
Complex predicates • Annotation strategy • Additional resources (frame files) • Cluster the large number of nominals?
Light verb annotation • A convention for the annotation of light verbs has been adopted across the Hindi, Arabic, Chinese and English PropBanks (Hwang et al., 2010) • Annotation is carried out in three passes • Manual identification of the light verb • Annotation of arguments based on the frame file • Deterministic merging of the light verb and the 'true' predicate • In Hindi, this process may be simplified because the dependency label 'pof' already identifies a light verb
Light verb annotation (example)
• Identify the N+V sequences that are complex predicates; annotate the predicating expression with ARG-PRX (REL: kar, ARG-PRX: corii)
• Annotate the arguments and modifiers of the complex predicate with a nominal predicate frame
• Automatically merge the nominal with the light verb (REL: corii_kiye, Arg0: raam, Arg1: paese)
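A toy sketch of the deterministic merging pass over the corii/kar example (plain Python; the field names are assumptions, and the inflected form kiye is used so the merged REL matches the slide):

```python
# Toy version of the final pass: merge the nominal (marked ARG-PRX on the light
# verb) with the light verb to form the composite predicate.
light_verb_prop = {"REL": "kiye", "ARG-PRX": "corii"}   # identification pass
nominal_prop = {"Arg0": "raam", "Arg1": "paese"}        # nominal-frame pass

merged = {"REL": light_verb_prop["ARG-PRX"] + "_" + light_verb_prop["REL"]}
merged.update(nominal_prop)                             # deterministic merge
print(merged)  # {'REL': 'corii_kiye', 'Arg0': 'raam', 'Arg1': 'paese'}
```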
Hindi PropBank tagset • 24 labels