Semi-automatic Annotation of the Romanian TimeBank 1.2. Corina Forăscu, Radu Ion, Dan Tufiş. Faculty of Computer Science, Al.I. Cuza University of Iaşi, Romania & Research Institute for Artificial Intelligence of the Romanian Academy. corinfor@info.uaic.ro, {radu, tufis}@racai.ro
Outline
• Fundamentals
• TimeML & TimeBank
• Corpus processing: translation, pre-processing
• Alignment
• Annotation import
• Conclusions
Fundamentals
Temporal information in natural language:
• Time-denoting expressions: references to a calendar or clock system, expressed by NPs, PPs, or AdvPs: the 23rd of May, 1998; Monday; tomorrow; the second semester
• Event-denoting expressions: references to an event, expressed by
  - sentences (more precisely their syntactic head, the main verb): John listens to the music.
  - noun phrases: Israel will ask the USA to delay a military strike against Iraq.
Motivation (1)
NLP applications that benefit from temporal annotation:
• lexicon induction and linguistic investigation, using very large annotated corpora;
• question answering (questions like when, how often or how long);
• information extraction and information retrieval;
• machine translation (translated and normalized temporal references; mappings between the different behavior of tenses from language to language);
• discourse processing: temporal structure of discourse and summarization.
Motivation (2)
Acum îşi dădea seama că tocmai din cauza acestui incident se hotărâse el brusc să vină acasă şi să-şi înceapă jurnalul taman astăzi.
Now he realised that exactly because of this incident he had suddenly decided to come home and to begin his journal exactly today.
Motivation (3)
<TIMEX3 temporalFunction="true" tid="t152" type="TIME" value="PRESENT_REF">Acum</TIMEX3> îşi
<EVENT aspect="PROGRESSIVE" class="OCCURRENCE" eid="e153" tense="PAST">dădea</EVENT><MAKEINSTANCE eiid="ei59" eid="e153" cardinality="1" /> seama
<SIGNAL sid="s154">că</SIGNAL> tocmai din cauza acestui
<EVENT aspect="NONE" class="OCCURRENCE" eid="e156" tense="NONE">incident</EVENT><MAKEINSTANCE eiid="ei60" eid="e156" cardinality="1" /> se
<EVENT aspect="PERFECTIVE" class="I_ACTION" eid="e157" tense="PAST">hotărâse</EVENT><MAKEINSTANCE eiid="ei61" eid="e157" cardinality="1" /> el brusc
<SIGNAL sid="s54">să</SIGNAL> <EVENT aspect="NONE" class="OCCURRENCE" eid="e159" tense="PRESENT">vină</EVENT><MAKEINSTANCE eiid="ei62" eid="e159" cardinality="1" /> acasă
<SIGNAL sid="s160">şi</SIGNAL> <SIGNAL sid="s55">să</SIGNAL>-şi
<EVENT aspect="NONE" class="ASPECTUAL" eid="e161" tense="PRESENT">înceapă</EVENT><MAKEINSTANCE eiid="ei63" eid="e161" cardinality="1" /> jurnalul taman
<TIMEX3 temporalFunction="true" tid="t162" type="DATE" value="1984-04-04">astăzi</TIMEX3>.
<TLINK eventInstanceID="ei59" relatedToTime="t152" relType="SIMULTANEOUS" />
<TLINK eventInstanceID="ei60" relatedToEvent="e157" relType="BEFORE" />
Acum îşi dădea seama că tocmai din cauza acestui incident se hotărâse el brusc să vină acasă şi să-şi înceapă jurnalul taman astăzi.
State of the Art
[Timeline, most recent first; year axis: 2006, 2005, 2004, 2002, 2001, 2000, 1998, 1947]
• ACL-COLING WS: ARTE, Annotating and Reasoning about Time and Events (2006)
• Time Symposium 2005
• ACL 2005: TARSQI system
• ACE – TERN: TIMEX2 v.1.2
• LREC 2002: Annotation Standards for Temporal Information in Natural Language
• TERQAS: TimeML v.1.0
• TARSQI: TimeML v.1.2
• DAML-Time
• ACE – TERN: TIMEX2 v.1.1
• STAG (Setzer)
• TIDES 2001: TIMEX2 v.1.0.2
• ACL: Temporal and Spatial Information Processing
• TIMEX (MUC 7, 1998)
• Reichenbach: The tenses of verbs (1947)
TERQAS 2002 • TimeML v.1.0 metadata standard for: • marking events, • their temporal anchoring and • links in news articles • TimeBank corpus v.1.0. • guidelines for temporal annotation
TimeML v.1.2 • A metadata standard developed especially for news articles, for marking • Events: EVENT, MAKEINSTANCE • temporal anchoring of events: TIMEX3, SIGNAL • links between events and/or timexes: TLINK, ALINK, SLINK
Events (1)
• Situations that happen or occur; states or circumstances in which something obtains or holds true
• Expressed by tensed verbs, adjectives, nominalizations
• The oat-bran craze[e190] has cost[e189] the world's largest cereal maker market share.
• 7 classes of EVENTs: OCCURRENCE, PERCEPTION, REPORTING, ASPECTUAL, STATE, I_STATE, I_ACTION
Events (2)
The oat-bran <EVENT eid="e190" class="OCCURRENCE">craze</EVENT> has <EVENT eid="e189" class="OCCURRENCE">cost</EVENT> the world's largest cereal maker market share.
Analysts <EVENT eid="e28" class="REPORTING">say</EVENT> much of Kellogg's <EVENT eid="e204" class="OCCURRENCE">erosion</EVENT> has been in such core brands as Corn Flakes, ...
Instances
• Based on the event annotation: how many different instances or realizations a given event has (at least one)
• Carries the tense and aspect of the verb-denoted event
• John learns[e1] twice on Monday.
• <MAKEINSTANCE eiid="ei1" eventID="e1" signalID="s1" cardinality="2" aspect="NONE" tense="PRESENT" />
Temporal expressions: TIMEX3 (1)
Explicit and implicit temporal expressions:
• Times: 11 o'clock; midnight
• Dates:
  - fully specified (May 23, 2006; winter, 2005)
  - underspecified (Monday; next week; last month; two years ago)
• Durations: two months; three hours
• Sets: every week; every Tuesday
Temporal expressions: TIMEX3 (2)
<TIMEX3 tid="t192" type="DATE" temporalFunction="false" functionInDocument="CREATION_TIME" value="1989-10-30">10/30/89</TIMEX3>
<TIMEX3 mod="APPROX" tid="t220" type="DURATION" temporalFunction="true" functionInDocument="NONE" value="P2Y" anchorTimeID="t192">the next two years or so</TIMEX3>
<TIMEX3 tid="t207" type="DATE" temporalFunction="true" functionInDocument="NONE" value="FUTURE_REF" anchorTimeID="t192">soon</TIMEX3>
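A minimal sketch of how the three TIMEX3 examples could be read programmatically, using Python's standard `xml.etree.ElementTree`. The `<doc>` wrapper element and the `read_timexes` helper are assumptions made for illustration; they are not part of TimeML or of the authors' tools.

```python
import xml.etree.ElementTree as ET

# The three TIMEX3 elements from the slide, wrapped in a hypothetical
# <doc> root so that they form a single well-formed XML document.
SAMPLE = """<doc>
<TIMEX3 tid="t192" type="DATE" temporalFunction="false" functionInDocument="CREATION_TIME" value="1989-10-30">10/30/89</TIMEX3>
<TIMEX3 mod="APPROX" tid="t220" type="DURATION" temporalFunction="true" functionInDocument="NONE" value="P2Y" anchorTimeID="t192">the next two years or so</TIMEX3>
<TIMEX3 tid="t207" type="DATE" temporalFunction="true" functionInDocument="NONE" value="FUTURE_REF" anchorTimeID="t192">soon</TIMEX3>
</doc>"""


def read_timexes(xml_text):
    """Return a dict mapping each tid to its text, type, value and anchor."""
    root = ET.fromstring(xml_text)
    timexes = {}
    for timex in root.iter("TIMEX3"):
        timexes[timex.get("tid")] = {
            "text": timex.text,
            "type": timex.get("type"),
            "value": timex.get("value"),
            # anchorTimeID is absent on the document creation time
            "anchor": timex.get("anchorTimeID"),
        }
    return timexes


timexes = read_timexes(SAMPLE)
for tid, info in timexes.items():
    print(tid, info)
```

Both relative expressions (t220, t207) are anchored to the document creation time t192, which is the only expression here with `temporalFunction="false"`.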
Temporal signals: SIGNAL
Function words that indicate how temporal objects are to be related to each other:
• temporal prepositions (on, in, at, from, to, before, after, during), conjunctions (before, after, while, when) and/or modifiers
• negative expressions
• modal verbs
• prepositions signaling modality ("to")
• special characters denoting ranges in temporal expressions: "-" and "/"
Dependencies: LINKs
• Temporal relations: TLINK
  - anchors to time
  - orders between times and events
• Aspectual relations: ALINK
  - phases of an event
• Subordinating relations: SLINK
  - events that syntactically subordinate other events
Temporal relations: TLINK (1)
• Temporal relation between two temporal elements (event-event, event-timex)
• EVENTs take part through their INSTANCEs
• 13 relTypes, as in Allen's relations:
  - simultaneous
  - identical
  - one before (/after) the other
  - one immediately before (/after) the other
  - one including / being included in the other
  - one holding during the duration of the other
  - one being the beginning (/ending) of the other
  - one being begun (/ended) by the other
Temporal relations: TLINK (2)
The oat-bran craze[e190/ei1994] has cost[e189/ei1995] the world's largest cereal maker market share. The company's president quit[e3/ei1996] suddenly.
[Diagram: the time expression t192 (10/30/89) and the event instances ei1994 (craze), ei1995 (cost) and ei1996 (quit)]
Temporal relations: TLINK (3)
[Diagram: the time expression t192 (10/30/89) and the event instances ei1994 (craze), ei1995 (cost) and ei1996 (quit)]
<TLINK relatedToEventInstance="ei1995" eventInstanceID="ei1994" relType="BEFORE" />
<TLINK relatedToTime="t192" eventInstanceID="ei1996" relType="BEFORE" />
<TLINK relatedToEventInstance="ei1995" eventInstanceID="ei1996" relType="IS_INCLUDED" />
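The three TLINKs can be read as (source, relType, target) triples, which is a convenient form for building a temporal graph. The sketch below uses Python's standard `xml.etree.ElementTree`; the `<doc>` wrapper and the `read_tlinks` helper are hypothetical names introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# The TLINKs from the slide, wrapped in a hypothetical <doc> root.
SAMPLE = """<doc>
<TLINK relatedToEventInstance="ei1995" eventInstanceID="ei1994" relType="BEFORE" />
<TLINK relatedToTime="t192" eventInstanceID="ei1996" relType="BEFORE" />
<TLINK relatedToEventInstance="ei1995" eventInstanceID="ei1996" relType="IS_INCLUDED" />
</doc>"""


def read_tlinks(xml_text):
    """Return TLINKs as (source, relType, target) triples."""
    triples = []
    for link in ET.fromstring(xml_text).iter("TLINK"):
        # The source is an event instance or a time expression.
        source = link.get("eventInstanceID") or link.get("timeID")
        # The target is likewise an event instance or a time expression.
        target = link.get("relatedToEventInstance") or link.get("relatedToTime")
        triples.append((source, link.get("relType"), target))
    return triples


tlinks = read_tlinks(SAMPLE)
for source, rel_type, target in tlinks:
    print(source, rel_type, target)
```

Read this way, the example says: the craze (ei1994) is before the cost (ei1995), the quitting (ei1996) is before the document date t192, and the quitting is included in the cost.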
Aspectual relations: ALINK
Relationship between an aspectual event and its argument event:
• Initiation: John started[ei5] to read[ei6].
  <ALINK eventInstanceID="ei5" relatedToEventInstance="ei6" relType="INITIATES"/>
• Culmination: John finished[ei5] assembling[ei6] the table.
  <ALINK eventInstanceID="ei5" relatedToEventInstance="ei6" relType="CULMINATES"/>
• Termination: John stopped talking.
• Continuation: John kept talking.
Subordination relations: SLINK
For contexts introducing relations between two events, of type:
• Modal: John should have bought some wine.
• Factive: John forgot that he was in Boston yesterday.
• Counterfactive: John prevented the divorce.
• Evidential: John said he bought some wine.
• Negative evidential: John denied he bought only beer.
• Conditional: If John leaves today, Mary will cry.
TimeBank 1.2
• 183 English news report documents, annotated with TimeML and distributed through LDC
• 4715 sentences with 10586 unique lexical units, from a total of 61042 lexical units
• Non-TimeML markup in TimeBank 1.1:
  - structure information: header
  - named entity recognition: <ENAMEX>, <NUMEX>, <CARDINAL>
  - sentence boundary information: <s>
TimeBank 1.2: number of TimeML tags
• events: 7935
• instances: 7940
• timexes: 1414
• signals: 688
• alinks: 265
• slinks: 2932
• tlinks: 6418
• TOTAL: 27592
Translation
• 2 "trained translators"; one final correction
• Translation desiderata:
  - 1-1 sentence alignment
  - preserving POS
  - verb tense mapped onto Romanian
  - format of dates, moments of the day and numbers conforming to the norms of written Romanian
• Result: 4715 sentences (translation units), 65375 lexical tokens (including punctuation marks), representing 12640 lexical types
Preprocessing the corpus
• Tokenisation: MtSeg, with idiomatic expressions and clitic splitting
• POS-tagging: TnT, adapted and improved to determine the POS of unknown words
• Lemmatisation: probabilistic, based on a lexicon
• Chunking: REs over POS tags to determine non-recursive NPs, APs, AdvPs, PPs
Alignment
• YAWA: 4 stages, evaluated on the data of the Shared Task on Word Alignment, Romanian-English track, organized at ACL 2005
• Current performance: P = 88.80%, R = 74.83%, F = 81.22%
• 91714 alignments, manually checked, of which 25346 are NULL alignments
Alignment
1. Content words alignment: based on the translation lexicons; P = 94.08%, R = 34.99%, F = 51.00%
2. Inside-chunks alignment: simple empirical rules to align the words within the corresponding chunks; P = 89.90%, R = 53.90%, F = 67.40%
3. Alignment in contiguous sequences of unaligned words: using the POS-affinities of the unaligned words and their relative positions
4. Correction phase: the wrong links, introduced mainly in stage 3, are removed
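The F values reported for the aligner are consistent with the balanced F-measure, the harmonic mean of precision and recall. The sketch below assumes that definition (the slides do not state which F variant was used); `f_measure` is a name chosen here for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall YAWA figures from the slide, in percent.
overall = f_measure(88.80, 74.83)
print(round(overall, 2))
```

Recomputing F from the rounded P and R of the individual stages can differ from the reported value in the last digit, presumably because the reported scores were computed from unrounded counts.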
Alignment
The parallel corpus = 183 files in XCES format
<tu id="1">
<seg lang="en">
<s id="Timex.en.1">
<w lemma="on_the_other_hand" ana="14+,ADVE" chunk="Ap#1">On_the_other_hand</w>
<c>,</c>
<w lemma="it" ana="13+,PPER3" chunk="Vp#1">it</w>
<w lemma="be" ana="3+,AUX3" chunk="Vp#1">'s</w>
<w lemma="turn" ana="1+,PPRE" chunk="Vp#1">turning</w>
<w lemma="out" ana="5+,PREP">out</w>
<w lemma="to" ana="15+,TO" chunk="Vp#2">to</w>
<w lemma="be" ana="1+,VINF" chunk="Vp#2">be</w>
<w lemma="another" ana="22+,PI">another</w>
<w lemma="very" ana="14+,ADVE" chunk="Ap#2">very</w>
<w lemma="bad" ana="1+,ADJE" chunk="Ap#2,Np#1">bad</w>
<w lemma="financial" ana="1+,ADJE" chunk="Ap#2,Np#1">financial</w>
<w lemma="week" ana="1+,NN" chunk="Np#1">week</w>
…
</s>
</seg>
<seg lang="ro">
<s id="Timex.ro.1">
<w lemma="pe_de_altă_parte" ana="14+,R" chunk="Ap#1">Pe_de_altă_parte</w>
<c>,</c>
<w lemma="sine" ana="12+,PXA" chunk="Vp#1">se</w>
<w lemma="dovedi" ana="1+,V3" chunk="Vp#1">dovedeşte</w>
<w lemma="a" ana="15+,QN" chunk="Vp#2">a</w>
<w lemma="fi" ana="1+,VN" chunk="Vp#2">fi</w>
<w lemma="alt" ana="22+,PI" chunk="Np#1">altă</w>
<w lemma="săptămână" ana="1+,NSRN" chunk="Np#1">săptămână</w>
<w lemma="financiar" ana="1+,ASN" chunk="Np#1,Ap#2">financiară</w>
<w lemma="foarte" ana="14+,R" chunk="Np#1,Ap#2">foarte</w>
<w lemma="prost" ana="1+,ASN" chunk="Np#1,Ap#2">proastă</w>
…
</s>
</seg>
</tu>
Annotation import Based on the Romanian-English lexical alignment
Annotation import
For every pair of sentences Sro and Sen from the TimeBank parallel corpus, together with the equivalent English sentence Ten:
1. Construct a list E of pairs of English text fragments and sequences of English indexes, from Sen and Ten.
E = {<"In the"; 1,2>, <"Philippines"; 3>, <", a"; 4,5>, <"four"; 6>, <"year"; 7>, <"low ."; 8,9>}.
Annotation import
2. Add to every element of E the XML context in which that text fragment appeared in the original English TimeBank.
E' = {<"In the"; 1,2; s>, <"Philippines"; 3; s, ENAMEX>, …}
3. Construct the list RW of Romanian words along with the transferred XML contexts, using E' and the lexical alignment between Sro and Sen. If a word in Sro is not aligned, the top context for it, namely s, is considered.
RW = {<"În"; s>, <"Filipine"; s, ENAMEX>, …}.
Annotation import
4. Construct the final list R of Romanian text fragments from RW by conflating adjacent elements of RW that appear in the same XML context. Output the list in XML format.
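Steps 3 and 4 can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the function name `transfer_contexts`, the representation of the alignment as a dictionary from Romanian to English token positions (one English position per Romanian word, 1-based), and the alignment used in the example are all hypothetical. Only the first two fragments of E' are used, since the slide elides the rest, and the final XML serialisation is left out.

```python
from itertools import groupby

# Sentence-level context given to Romanian words with no alignment.
TOP_CONTEXT = ("s",)


def transfer_contexts(e_prime, ro_tokens, alignment):
    """Build RW (step 3) and conflate it into R (step 4).

    e_prime   -- list of (fragment, english_indexes, xml_context) triples
    ro_tokens -- Romanian words of the sentence, in order
    alignment -- dict: Romanian position -> English position (1-based);
                 unaligned Romanian words are simply missing (assumption)
    """
    # Map every English token index to the XML context of its fragment.
    context_of = {}
    for _fragment, indexes, context in e_prime:
        for index in indexes:
            context_of[index] = tuple(context)

    # Step 3: each Romanian word inherits the context of its English
    # counterpart, or the top context "s" if it is not aligned.
    rw = []
    for ro_index, word in enumerate(ro_tokens, start=1):
        en_index = alignment.get(ro_index)
        rw.append((word, context_of.get(en_index, TOP_CONTEXT)))

    # Step 4: conflate adjacent words that share the same XML context.
    r = []
    for context, group in groupby(rw, key=lambda pair: pair[1]):
        r.append((" ".join(word for word, _ in group), context))
    return r


# First two elements of E' from the slide.
e_prime = [("In the", [1, 2], ("s",)), ("Philippines", [3], ("s", "ENAMEX"))]
# Hypothetical alignment: În <-> In (1), Filipine <-> Philippines (3).
result = transfer_contexts(e_prime, ["În", "Filipine"], {1: 1, 2: 3})
print(result)
```

Representing the XML context as a tuple of tag names makes "same context" a simple equality test, which is what lets `groupby` do the conflation of step 4 in one pass.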
Annotation import
Offline markup (MAKEINSTANCE, ALINK, TLINK and SLINK tags): the transfer kept only those XML tags from the English version whose IDs belong to XML structures that have been transferred to Romanian.
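One way to read this filtering rule is sketched below: a MAKEINSTANCE survives if its event was transferred, and a link survives if everything it refers to survived. The helper `filter_offline`, the `<doc>` wrapper, the list of reference attributes that is checked, and the choice of which IDs count as transferred in the example are assumptions for illustration; the slides do not specify these details.

```python
import xml.etree.ElementTree as ET

# Attributes through which offline tags point at other annotations
# (assumed list, based on TimeML attribute names).
REFERENCE_ATTRIBUTES = (
    "eventInstanceID", "timeID", "signalID",
    "relatedToEventInstance", "relatedToTime", "subordinatedEventInstance",
)


def filter_offline(elements, transferred_ids):
    """Keep only offline tags whose referenced IDs were transferred."""
    known = set(transferred_ids)
    kept = []
    # MAKEINSTANCE first, so that links can refer to surviving instances.
    for element in elements:
        if element.tag == "MAKEINSTANCE" and element.get("eventID") in known:
            known.add(element.get("eiid"))
            kept.append(element)
    for element in elements:
        if element.tag in ("TLINK", "ALINK", "SLINK"):
            refs = [element.get(name) for name in REFERENCE_ATTRIBUTES
                    if element.get(name) is not None]
            if refs and all(ref in known for ref in refs):
                kept.append(element)
    return kept


SAMPLE = """<doc>
<MAKEINSTANCE eiid="ei1994" eventID="e190" />
<MAKEINSTANCE eiid="ei1995" eventID="e189" />
<MAKEINSTANCE eiid="ei1996" eventID="e3" />
<TLINK relatedToEventInstance="ei1995" eventInstanceID="ei1994" relType="BEFORE" />
<TLINK relatedToTime="t192" eventInstanceID="ei1996" relType="BEFORE" />
</doc>"""

# Hypothetical outcome of the transfer: event e3 did not reach Romanian.
kept = filter_offline(list(ET.fromstring(SAMPLE)), {"e190", "e189", "t192"})
print([(element.tag, element.attrib) for element in kept])
```

In this example the instance of e3 and the TLINK that depends on it are dropped, while the TLINK between the two transferred events is preserved.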
Conclusions & future work
• improve and evaluate the annotation transfer
• study the adequacy of temporal theories to Romanian
• (semi-)automatic mark-up of the temporal information in Romanian texts (news + literature)
Thank you! (Temporal) Questions???