This project aims to provide a corpus of annotated parallel and comparable Danish and Italian texts, exploring the systematic relation between various types of referring expressions and different discourse transition states. It aims to identify similarities and differences in the use of referring expressions in Danish and Italian.
Centre for Language Technology Co-referential chains and discourse topic shifts in parallel and comparable corpora Costanza Navarretta costanza@hum.ku.dk
Outline
• Motivation
• Preceding studies/projects
• Background
• The data
• The annotation
• Problems
• Some results
Motivation
• to provide a corpus of parallel and comparable Danish and Italian texts annotated with (co)reference and with discourse topic shifts (language studies, anaphora resolution, MT, generation)
• to investigate whether there is a systematic relation between the use of various types of referring expression and different discourse transition states in the two languages
• to identify similarities and differences in the use of various referring expressions in Danish and Italian
Previous work
• Study on the use and resolution of pronouns in Danish
• MULINCO project (Maegaard et al. 2006)
• DAD project (Navarretta & Olsen 2008)
• Annotation seminar at University of Copenhagen (September 2008)
Issues to investigate
• Referring expressions are used differently in English, Danish and Italian (theoretical and practical problems)
• Differences in the way the three languages use various types of pronoun in abstract reference
• Impression that Danish and Italian use different strategies in reference, especially in relation to topic shifts
Background: relation between reference and discourse structure
• Kuno 1972, Halliday and Hasan 1976
• Hobbs 1982 (coherence relations + reference resolution in an abductive framework)
• Givón 1983 (major and minor junctures in dialogue transcriptions)
• Cristea et al. 1998: Veins Theory inside Rhetorical Structure Theory (Mann and Thompson 1987)
Background - continued
• Centering framework (Grosz et al. 1995):
  • presupposes global coherence: Grosz and Sidner 1986
  • is about local coherence
  • mainly regards pronouns
  • compatible with cognitive models of reference of nominal expressions, among others Givón 1983, Gundel et al. 1993, Prince 1981: the use of referring expressions reflects the assumptions made by speakers about the addressees’ mental state at that point in discourse
Local transitions (Brennan et al. 1987, Fais 2004, Poesio et al. 2004)
• Presence/absence of backward-looking center (Cb)
• Nature of instantiation of discourse entities
Background: Salience and nominal expressions
• Gundel et al. (1993): in focus > activated > familiar > uniquely identifiable > type identifiable
  • in focus: it
  • activated: that, this, this N
  • familiar: that N
  • uniquely identifiable: the N
  • type identifiable: a N
• Ariel (1988, 1994): zero pronouns < cliticized pronouns < unstressed pronouns < stressed pronouns < stressed pronouns + gestures < proximal demonstrative < distal demonstratives < first name or last name < definite description < full name
The annotated corpora
• Parallel corpora
  • European law texts
  • Short stories and translations (Pirandello)
  • Short stories and translations (Villy Sørensen?)
• Comparable corpora
  • Financial newspapers (Il Sole 24 Ore, Børsen)
  • Newspaper articles
• So far:
  • approx. 24,000 words for Italian
  • approx. 19,000 words for Danish
The annotation
• (Co)reference
  • Annotation of (co)reference by nouns added on a small subset of the DAD corpus (already annotated with pronominal abstract anaphora and 3rd person singular neuter pronouns)
• Annotators: Italian (6 on a first subset of the data, then divided into groups of 2)
• Annotators: Danish (4, then 2)
The annotation – continued
• builds upon the MATE/GNOME annotation (Poesio 2004)
• includes both reference to objects introduced in discourse by nominal phrases and reference to objects introduced by, among others, verbal phrases, clauses, discourse segments and predicates in copula constructions
The annotation – continued
• function of pronouns (pleonastic, cataphoric, deictic, anaphoric, individual, abstract, textual deictic, vague, abandoned)
• information about type of referring expression (type of NP, see also Poesio et al. 2004)
• type of relation between referring expression and antecedent (identity/non-identity/other?)
• syntactic type of antecedent (e.g. type of clause, discourse segment, other…)
• semantic type of abstract referents (Asher 1993, Gundel et al. 2003, Navarretta & Olsen 2008)
Some problems
• pronouns (referring expressions in general) can be multifunctional (anaphoric and cataphoric)
• definition of clauses in Danish and Italian
• reference relations: non-identity is too general (it covers both antecedents and referring expressions that are merely related and cases where the context determines a semantic difference between the referents)
• possessives
• granularity of semantic types
• direct speech, deictic I, you…
Discourse topics
• Global level (all files):
  • paragraphs are considered to be starting a topic, then subtopic and subsubtopic (Rocha 1997)
• Local level (only part of the data): continue/retain/smooth shift/rough shift
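The four local transition labels can be derived from the backward-looking center (Cb) and the preferred center (Cp) of adjacent utterances. The following is a minimal sketch, assuming the standard centering definitions (Brennan et al. 1987, with the later smooth/rough shift terminology). The function name, the string labels and the treatment of a missing Cb as a NULL transition are assumptions made for illustration, not the project's actual annotation procedure.

```python
from typing import Optional


def classify_transition(cb_prev: Optional[str],
                        cb_curr: Optional[str],
                        cp_curr: Optional[str]) -> str:
    """Label the transition into the current utterance.

    cb_prev: backward-looking center of the previous utterance (None if undefined)
    cb_curr: backward-looking center of the current utterance (None if undefined)
    cp_curr: preferred (highest-ranked forward-looking) center of the current utterance
    """
    # Assumption: an utterance without a Cb gets the NULL transition.
    if cb_curr is None:
        return "NULL"
    # The Cb is "kept" if it is unchanged or if the previous Cb was undefined.
    cb_kept = cb_prev is None or cb_curr == cb_prev
    if cb_kept:
        return "CONTINUE" if cb_curr == cp_curr else "RETAIN"
    return "SMOOTH-SHIFT" if cb_curr == cp_curr else "ROUGH-SHIFT"


if __name__ == "__main__":
    # Hypothetical entity names, used only to exercise the function.
    print(classify_transition("Acqua Marcia", "Acqua Marcia", "Acqua Marcia"))
    print(classify_transition("Acqua Marcia", "Romagnoli", "fallimento"))
```

Keeping the classifier separate from the computation of Cb and Cp makes it usable with any ranking of forward-looking centers, which is where languages such as Danish and Italian may differ.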
The annotation schemes
• two slightly different annotation schemes, the Italian scheme accounting for zero anaphora (Italian is a subject pro-drop language), clitic pronouns and reference to PPs
• de, seg elements as in MATE/GNOME
• added explet, abandoned, chunk
• added seg1 for clitics and zero anaphora (Italian)
• added a number of extra attributes
• tool PALinkA (Orasan 2003): anchor+ref substituted by link (attributes identity/non-identity, and dislink to annotate discontinuous elements)
Interannotator agreement: Italian
• 6 annotators on the first 4000 words
• weighted kappa statistics (Cohen, 1968): PRAM http://www.geocities.com/skymegsoftware/pram.html
• Between 0.75 (abstract reference by NPs) and 0.95
• On the rest of the data, agreement varies (depends on annotators, data etc.)
• Humans are not machines: a number of referring expressions are “forgotten” by one or both annotators, plus other distraction errors.
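As an illustration of the statistic, weighted kappa for two annotators can be written as 1 minus the ratio of weighted observed disagreement to weighted chance disagreement. The sketch below is not the PRAM implementation; the function name, the default linear weights and the example labels are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def weighted_kappa(labels_a: Sequence[str],
                   labels_b: Sequence[str],
                   categories: Sequence[str],
                   weight: Optional[Callable[[int, int], float]] = None) -> float:
    """Weighted kappa for two annotators over the same items.

    `weight(i, j)` is the disagreement weight between categories i and j
    (0 on the diagonal). Default: linear weights |i - j| over the order
    given in `categories` (an assumption; 0/1 weights give plain kappa).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    if weight is None:
        weight = lambda i, j: abs(i - j)

    n = len(labels_a)
    observed = Counter(zip(labels_a, labels_b))   # joint label counts
    marginal_a = Counter(labels_a)                # per-annotator label counts
    marginal_b = Counter(labels_b)

    observed_dis = 0.0   # weighted observed disagreement
    expected_dis = 0.0   # weighted disagreement expected by chance
    for i, cat_a in enumerate(categories):
        for j, cat_b in enumerate(categories):
            w = weight(i, j)
            observed_dis += w * observed[(cat_a, cat_b)] / n
            expected_dis += w * (marginal_a[cat_a] / n) * (marginal_b[cat_b] / n)

    if expected_dis == 0:
        raise ValueError("kappa is undefined: no chance disagreement is possible")
    return 1.0 - observed_dis / expected_dis


if __name__ == "__main__":
    # Hypothetical labels for four items, not taken from the corpus.
    cats = ["low", "mid", "high"]
    ann1 = ["low", "mid", "high", "mid"]
    ann2 = ["low", "high", "high", "mid"]
    print(round(weighted_kappa(ann1, ann2, cats), 3))
```

With more than two annotators, as in the first 4000 words here, one common option is to compute the statistic for each annotator pair and report the range or the mean.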
<P id="p35" topic="t35.1">
  <S id="s35.1"> <transition ttype="TNULL"/>
    <de ID="n173" syn-type="NPR">
      <link Ltype="ident" POINT-BACK="n172"/>
      <W id="w35.1.1">La</W><W id="w35.1.2">Acqua</W><W id="w35.1.3">Marcia</W>
    </de>
    <W id="w35.1.4">può</W><W id="w35.1.5">evitare</W>
    <de ID="n521" syn-type="DNP">
      <W id="w35.1.6">il</W><W id="w35.1.7">fallimento</W>
    </de>
    <W id="w35.1.8">.</W>
  </S>
  <S id="s35.2"> <transition ttype="CONTINUE"/>
    <de ID="n174" syn-type="DNP+GP">
      <link Ltype="ident" POINT-BACK="n173"/>
      <W id="w35.2.1">La</W><W id="w35.2.2">finanziaria</W> <W id="w35.2.3">di</W>
      <de ID="n522" syn-type="NPR">
        <W id="w35.2.4">Vincenzo</W> <W id="w35.2.5">Romagnoli</W>
      </de></de>....
  </S>...</P>
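The annotated fragment above can be read back into co-referential chains by following the POINT-BACK attribute of each identity link. This is a minimal sketch using Python's standard xml.etree.ElementTree; the function name is hypothetical, and the sample string is a cleaned copy of the fragment with the elided parts ("....") left out so that it parses.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Cleaned copy of the slide's fragment (straight quotes, elisions dropped).
SAMPLE = """<P id="p35" topic="t35.1">
<S id="s35.1"><transition ttype="TNULL"/>
<de ID="n173" syn-type="NPR"><link Ltype="ident" POINT-BACK="n172"/>
<W id="w35.1.1">La</W><W id="w35.1.2">Acqua</W><W id="w35.1.3">Marcia</W></de>
<W id="w35.1.4">può</W><W id="w35.1.5">evitare</W>
<de ID="n521" syn-type="DNP"><W id="w35.1.6">il</W><W id="w35.1.7">fallimento</W></de>
<W id="w35.1.8">.</W></S>
<S id="s35.2"><transition ttype="CONTINUE"/>
<de ID="n174" syn-type="DNP+GP"><link Ltype="ident" POINT-BACK="n173"/>
<W id="w35.2.1">La</W><W id="w35.2.2">finanziaria</W><W id="w35.2.3">di</W>
<de ID="n522" syn-type="NPR"><W id="w35.2.4">Vincenzo</W><W id="w35.2.5">Romagnoli</W></de></de>
</S></P>"""


def coreference_chains(xml_text: str) -> dict:
    """Group discourse entities (de) into chains via identity links.

    Returns a dict mapping the first mention of each chain to the list
    of later mentions that point back to it, directly or indirectly.
    """
    root = ET.fromstring(xml_text)

    # Map each mention ID to the ID of its antecedent (identity links only).
    antecedent = {}
    for de in root.iter("de"):
        for link in de.findall("link"):  # direct children, not nested de
            if link.get("Ltype") == "ident":
                antecedent[de.get("ID")] = link.get("POINT-BACK")

    # Follow the pointers back to the head of each chain.
    chains = defaultdict(list)
    for mention in antecedent:
        head, seen = mention, set()
        while head in antecedent and head not in seen:  # guard against cycles
            seen.add(head)
            head = antecedent[head]
        chains[head].append(mention)
    return dict(chains)


if __name__ == "__main__":
    print(coreference_chains(SAMPLE))
```

On this fragment both linked mentions end up in one chain headed by n172, an entity that lies outside the fragment shown. Chain length and the distance between mention and antecedent can then be computed per chain.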
First results – genre differences
• (co)referential chains are much longer in literary texts than in the financial articles, where coherence is often given by domain knowledge
• pronouns are used more frequently in literary texts
• the distance between referring expression and antecedent is extremely large in literary texts (is there coreference when the distance is more than 50 clauses?)
Differences between the two languages
• Inferable entities are more often anchored to known entities by genitives in Danish
• Fin dal primo giorno, Bartolino Fiorenzo s’era sentito dire dalla promessa sposa…
• Fra første dag havde Bartolino Fiorenzo hørt sin tilkommende sige…
• (From the very first day Bartolino Fiorenzo had heard (his/the) fiancée say)
• Pirandello, La buona anima
Differences in use of proximal/distal demonstrative + N'
• Italian quel/quello/quella (that) + N' used if:
  • there are other clauses or nominals in between referring expression and antecedent
  • there is a temporal or spatial distance from the antecedent
• Danish denne (this) + N': in the same contexts (there can be clauses and nominals in between, but no competing antecedents, i.e. antecedents of the same semantic type)
  • quella donna/denne kvinde (woman)
  • quella sciagura/denne ulykke (calamity)
  • quella gioia/denne glæde (joy)
• questo ragionamento/dette argument (this argument/this reasoning) when the antecedent is the immediately preceding discourse segment
Transition states and referring expressions - Italian
• Continue: Zero > Pronoun > the N
• Retain: Pronoun > Proper Name > … > Zero
• Smooth Shift: Proper Name > the N > Pronoun
• Rough Shift: the N > genitive N > Proper Name (+ the N) > distal N > a N > Pronoun
• NULL: Proper Name > the N
Conclusion
• A lot of work remains to be done
• Many aspects of which we do not yet have “knowledge”
• Language differences which must be accounted for
• Important for theoretical studies of language as well as for:
  • Rule-based systems
  • Machine learning