DiscAn: Towards a Discourse Annotation system for Dutch language corpora, or why and how we would want to annotate corpora on the discourse level
Ted Sanders, Utrecht institute of Linguistics, Universiteit Utrecht
Coherence in discourse
• Referential coherence: "Many tourists come to Switzerland. They want to see the mountains."
• Relational coherence: "Many tourists come to Switzerland because they want to see the mountains."
• "John was happy. It was a Saturday."
• We do not need explicit linguistic indicators.
Coherence in discourse, 2
• Coherence is a cognitive phenomenon.
• Coherence relations are conceptual relations that constitute coherence between discourse segments (minimally clauses).
• Connectives, cue phrases and other lexical markers can, but need not, make this coherence explicit.
• Coherence relations (causal, contrastive, additive) are the building blocks of discourse structure.
In annotated corpora?
• The discourse level is largely lacking in annotated Dutch corpora.
• There is an international tendency towards discourse annotation:
  • The Penn Discourse Treebank (Prasad, Joshi, Webber et al.)
  • The Potsdam Corpus (Stede et al.)
• At the same time, we do have much data on Dutch:
  • on connectives
  • mainly causal
  • across media (various written genres, spoken, chat)
  • at various stages of annotation
Larger research issues in the field
• To be answered on the basis of annotated corpora
• The meaning and use of connectives vary across languages: omdat vs. parce que vs. weil
• Semantic-pragmatic restrictions on use
• Similarities and differences in acquisition
• We will start discourse annotation with a study on the category of causals
Annotation
Some criteria:
• Order: cause – consequence and vice versa
• Subjectivity: want, puisque, since, denn vs. omdat, parce que, because, weil
• Linguistic marking: yes/no, perspective, etc.
• Characteristics of the segments: propositional attitude, modality, tense, syntax…
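The subjectivity criterion can be sketched as a small lookup table. This is a minimal illustration, not part of the DiscAn system: the eight connectives and their grouping come from the slide, while the labels "subjective" and "objective", the names `Connective`, `lookup` and `counterparts`, and the treatment of the contrast as a strict two-way split are assumptions made for the example.

```python
from typing import List, NamedTuple, Optional


class Connective(NamedTuple):
    form: str
    language: str
    profile: str  # "subjective" or "objective"; label names are assumptions


# Grouping as listed on the slide: the first four connectives versus the last four.
CAUSAL_CONNECTIVES = [
    Connective("want", "Dutch", "subjective"),
    Connective("puisque", "French", "subjective"),
    Connective("since", "English", "subjective"),
    Connective("denn", "German", "subjective"),
    Connective("omdat", "Dutch", "objective"),
    Connective("parce que", "French", "objective"),
    Connective("because", "English", "objective"),
    Connective("weil", "German", "objective"),
]


def lookup(form: str) -> Optional[Connective]:
    """Return the table entry for a connective, or None if it is not listed."""
    wanted = form.strip().lower()
    for connective in CAUSAL_CONNECTIVES:
        if connective.form == wanted:
            return connective
    return None


def counterparts(form: str) -> List[str]:
    """List connectives from other languages that share the same profile."""
    entry = lookup(form)
    if entry is None:
        return []
    return [
        c.form
        for c in CAUSAL_CONNECTIVES
        if c.profile == entry.profile and c.language != entry.language
    ]
```

Calling `counterparts("omdat")` returns the other three connectives in the same group. Whether such cross-linguistic pairs are true equivalents is exactly what annotated corpora would have to show.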
Current situation: 15 studies…
Corpus | connective | fragment no. | s1 | s2 | modality s1 | modality s2 | protag. s1 | s2 | relation
7 | omdat | 2502 | 176 | 176 | 1 | 1 | irrelevant, because fact | 6 | 1 | 1 | 1 | irrelevant, because fact | irrelevant, because fact | 1
7 | omdat | 2502b | 177 | 177 | 2 | 1 | speaker/author | 6 | 2 | 1 | 1 | explicitly present | irrelevant, because fact | 1
7 | omdat | 2509 | 707 | 707 | 1 | 1 | irrelevant, because fact | 6 | 1 | 1 | 1 | irrelevant, because fact | irrelevant, because fact | 1
7 | omdat | 2539 | 3320 | 3320 | 1 | 1 | irrelevant, because fact | 6 | 1 | 1 | 1 | irrelevant, because fact | irrelevant, because fact | 1
7 | omdat | 2546 | 3810 | 3810 | 1 | 2 | irrelevant, because fact | 33 | 2 | 3 | 1 | irrelevant, because fact | implicit | 19
7 | omdat | 2551 | 4357 | 4357 | 1 | 2 | irrelevant, because fact | 31 | 2 | 1 | 1 | irrelevant, because fact | explicitly present | 1
7 | omdat | 2525 | 2547 | 2547 | 3 | 1 | speaker/author | 6 | 2 | 1 | 1 | explicitly present | irrelevant, because fact | 1
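A rough sketch of how coded rows like these could be tallied once they are in a uniform format. It uses only the first seven values of each sample row, on the assumption that they line up with the first seven header labels. The field names, the "|" delimiter and the helpers `parse_rows` and `tally` are hypothetical, not the project's format.

```python
from collections import Counter

# Assumed names for the first seven columns of the sample rows.
FIELDS = ["corpus", "connective", "fragment", "s1", "s2", "modality_s1", "modality_s2"]

# First seven values of each sample row, re-typed with "|" as delimiter.
RAW_ROWS = """\
7|omdat|2502|176|176|1|1
7|omdat|2502b|177|177|2|1
7|omdat|2509|707|707|1|1
7|omdat|2539|3320|3320|1|1
7|omdat|2546|3810|3810|1|2
7|omdat|2551|4357|4357|1|2
7|omdat|2525|2547|2547|3|1
"""


def parse_rows(raw):
    """Turn delimited lines into dicts keyed by FIELDS; reject malformed rows."""
    records = []
    for line in raw.strip().splitlines():
        values = line.split("|")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}: {line!r}")
        records.append(dict(zip(FIELDS, values)))
    return records


def tally(records, field):
    """Count how often each code occurs in one column."""
    return Counter(record[field] for record in records)


records = parse_rows(RAW_ROWS)
modality_s1_counts = tally(records, "modality_s1")
modality_s2_counts = tally(records, "modality_s2")
```

The codes are kept as strings because fragment numbers such as 2502b are not purely numeric, and the numeric codes are category labels rather than quantities.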
The DiscAn project has five main goals:
• standardize and open up an existing set of Dutch corpus analyses of coherence relations and discourse connectives;
• develop the foundations for a discourse annotation system;
• improve the metadata by investigating existing CMDI profiles or adding new profiles suited for this type of analysis;
• inventory the required categories and investigate to what extent these could be included in ISOcat categories for discourse;
• form an interdisciplinary discourse community of text, corpus and computational linguists to initiate further research in a European context.
A model of analysis
• Var 1: Name of the coder (values: the names of the two authors)
• Var 2: Number of the fragment (the values were present in the fragments)
• Var 3: Utterance number(s) of the segment preceding want (S1)
• Var 4: Utterance number(s) of the segment following want (S2)
• Var 5: Propositional attitude of S1 (values: action, fact, opinion, observation, knowledge, experience)
• Var 6: Propositional attitude of S2 (values: action, fact, opinion, observation, knowledge, experience)
• Var 7: Identity of the conceptualizer in S1 (values: speaker/1st person, second person, third person (nominal or pronominal), generic person)
• Var 8: Identity of the conceptualizer in S2 (values: speaker/1st person, second person, third person (nominal or pronominal), generic person)
• Var 9: Type of relation expressed by want (values: non-volitional content, volitional content, explanation of a mental state, epistemic, textual, speech act)
• Var 10: Syntactic modification of want (values: no modification, coordinating conjunction, intensifier, focus element)
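The ten variables above could be represented as one validated record per occurrence of want. The following is a minimal sketch under assumptions, not the project's data format: the class name `WantAnnotation`, its field names and the example values are hypothetical, and "third person" and "generic person" are treated as two separate conceptualizer values.

```python
from dataclasses import dataclass
from typing import Tuple

# Value sets taken from the variable list (Var 5-10).
PROPOSITIONAL_ATTITUDES = {
    "action", "fact", "opinion", "observation", "knowledge", "experience",
}
# Assumption: "third person" and "generic person" are distinct values.
CONCEPTUALIZERS = {
    "speaker/1st person", "second person", "third person", "generic person",
}
RELATION_TYPES = {
    "non-volitional content", "volitional content",
    "explanation of a mental state", "epistemic", "textual", "speech act",
}
MODIFICATIONS = {
    "no modification", "coordinating conjunction", "intensifier", "focus element",
}


@dataclass(frozen=True)
class WantAnnotation:
    coder: str                      # Var 1
    fragment: str                   # Var 2
    s1_utterances: Tuple[int, ...]  # Var 3
    s2_utterances: Tuple[int, ...]  # Var 4
    attitude_s1: str                # Var 5
    attitude_s2: str                # Var 6
    conceptualizer_s1: str          # Var 7
    conceptualizer_s2: str          # Var 8
    relation: str                   # Var 9
    modification: str               # Var 10

    def __post_init__(self):
        # Reject any categorical value that is not in the coding scheme.
        checks = [
            ("attitude_s1", PROPOSITIONAL_ATTITUDES),
            ("attitude_s2", PROPOSITIONAL_ATTITUDES),
            ("conceptualizer_s1", CONCEPTUALIZERS),
            ("conceptualizer_s2", CONCEPTUALIZERS),
            ("relation", RELATION_TYPES),
            ("modification", MODIFICATIONS),
        ]
        for name, allowed in checks:
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}: {value!r} is not in the coding scheme")


# Hypothetical example record; none of these values come from the corpus.
example = WantAnnotation(
    coder="coder A",
    fragment="0001",
    s1_utterances=(12,),
    s2_utterances=(13, 14),
    attitude_s1="opinion",
    attitude_s2="fact",
    conceptualizer_s1="speaker/1st person",
    conceptualizer_s2="third person",
    relation="epistemic",
    modification="no modification",
)
```

Validating at construction time means a misspelled category is caught when a record is entered, instead of surfacing later as a spurious extra category in the counts.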