Annotating the HKCSE Pragmatically • Martin Weisser • Visiting Professor • School of English and Education • Guangdong University of Foreign Studies • mail: weissermar@gmail.com • web: martinweisser.org
Outline • The Conversion Process • Pre-processing Requirements • Annotation & Post-processing • Searching & Exploring the Corpus • Conclusion
The Conversion Process I – Issues • how to convert to DART XML format? • identify original conventions • some documented in Cheng et al. (2008) • some undocumented • use tone unit marking? • unfortunately tone units in Brazil’s system for ‘discourse intonation’ ≠ C-units • → no ‘sentence’ intonation inferable directly • remove prosodic information, apart from stress and tone movements, to ensure readability • handle overlap • exact extent not marked or inferable • → better to delete • etc.
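As a rough illustration of the clean-up step listed above, the sketch below strips tone-unit boundaries and overlap markers while leaving stress (upper case) and tone-movement symbols in place. The symbols used here are assumptions made up for the example; they are not taken from the documented HKCSE conventions, and the name `strip_markup` is hypothetical.

```python
import re

# Hypothetical notation, assumed purely for illustration:
#   { }      tone-unit boundaries
#   *        overlap markers
#   [ ] < >  other prosodic brackets
# Upper case (stress) and the tone symbols \ / = are left untouched.
DROP = re.compile(r"[{}\[\]<>*]")

def strip_markup(line: str) -> str:
    """Remove boundary and overlap symbols, then normalise whitespace."""
    cleaned = DROP.sub(" ", line)
    return " ".join(cleaned.split())

sample = "{ \\ i THINK } * { = it's [ < FINE > ] }"
print(strip_markup(sample))
```

Deleting the overlap markers outright, rather than trying to convert them, mirrors the point that the exact extent of overlap is neither marked nor inferable.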
The Conversion Process III – the Conversion Editor [Screenshot of the conversion editor, with labels: save output • original input file • conversion result view • conversion script editor]
The Conversion Process IV – Conversion Results • converted to DART XML format • retained stress marking • converted & moved tone marking • converted ‘non-speech’ to comments • added gender attribute • added speaker type attribute • moved pauses to next turn
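The kind of output described above might be pictured with the following sketch, which builds one turn element carrying gender and speaker-type attributes and stores a non-speech event as an XML comment. The element name `turn`, the attribute names, and all values are placeholders chosen for the example; the actual DART XML schema may well differ.

```python
import xml.etree.ElementTree as ET

def make_turn(n, speaker, gender, speaker_type, text, non_speech=None):
    """Build a turn element; element and attribute names are assumptions."""
    turn = ET.Element(
        "turn",
        {"n": str(n), "speaker": speaker, "gender": gender, "type": speaker_type},
    )
    turn.text = text
    if non_speech:
        # Non-speech events become XML comments instead of inline codes.
        turn.append(ET.Comment(non_speech))
    return turn

# Placeholder values, not taken from the corpus.
turn = make_turn(1, "a", "f", "x", "THANK you", non_speech="laugh")
print(ET.tostring(turn, encoding="unicode"))
```

Keeping metadata in attributes and non-speech events in comments leaves the text content of each turn free for the annotation itself.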
Pre-processing Requirements • creating new resources in/for DART • adapt DART modules to handle mixed case • ‘synthesise’ domain-specific lexicon • create domain-specific topic ‘thesaurus’ • pre-processing • fix conversion errors • identify/mark incomplete words • split turns • add punctuation, partly based on original prosodic features • etc.
Annotation & Post-processing I – Steps • annotation in DART • fully automated • less than 80 sec for • 24 files • ~72,100 words • ~10,300 C-units • post-processing to fix potential errors on the levels of • syntax: potentially missing syntax rules • pragmatics: missing inferencing rules or modes (‘IFIDs’) • semantics: incorrectly identified topics
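To put the figures above in perspective, this short calculation derives approximate rates from them. Because the run took less than 80 seconds and the counts are rounded, the rates are lower bounds and the per-file time is an upper bound; the variable names are only illustrative.

```python
# Figures as reported on the slide (approximate counts).
files, words, c_units, seconds = 24, 72_100, 10_300, 80

words_per_c_unit = words / c_units   # average C-unit length
words_per_sec = words / seconds      # lower bound on throughput
c_units_per_sec = c_units / seconds  # lower bound on throughput
sec_per_file = seconds / files       # upper bound on time per file

print(f"{words_per_c_unit:.1f} words per C-unit")
print(f"at least {words_per_sec:.0f} words/sec")
print(f"at least {c_units_per_sec:.0f} C-units/sec")
print(f"at most {sec_per_file:.1f} sec per file")
```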
Annotation & Post-processing II – Annotation Result • automatically split off DM • identified syntactic category • annotated identifiable speech acts
Searching the Corpus • easily searchable via DART • speech act stats hyperlinked to concordancer • formulaic patterns or disfluencies via n-grams • manual searches in concordancer for specific • speech acts • syntactic categories + speech acts • speech acts + speaker types • speech acts + gender • responses to questions • searches for specific tone features
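The n-gram idea mentioned above can be sketched independently of DART: count word sequences within each unit and look at the most frequent ones, since repeated sequences can point to formulaic patterns or disfluencies. The sample units are invented for illustration, and this is a generic sketch rather than DART's own n-gram module.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented sample units, not corpus data.
units = [
    "i mean i mean it is fine",
    "you know i mean it is late",
]

counts = Counter()
for unit in units:
    # Count within each unit so n-grams never span unit boundaries.
    counts.update(ngrams(unit.split(), 2))

for gram, freq in counts.most_common(3):
    print(" ".join(gram), freq)
```

Counting per unit rather than over the running text keeps sequences from straddling two speakers' contributions.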
Conclusion • DART annotation enriches the HKCSE through • adding syntactic and pragmatic annotation • ability to analyse features based on (functional) C-units, rather than intonation units • new search options based on the above features
References • Cheng, W., Greaves, C. and Warren, M. 2008. A Corpus-driven Study of Discourse Intonation: the Hong Kong Corpus of Spoken English (prosodic). Amsterdam/Philadelphia: John Benjamins. • Weisser, M. 2010. Annotating Dialogue Corpora Semi-Automatically: a Corpus-Linguistic Approach to Pragmatics. Unpublished Habilitation (professorial) thesis, University of Bayreuth. • Weisser, M. 2012; forthcoming 2014. Pragmatic annotation. In: Aijmer, K. & Rühlemann, C. (Eds.). Corpus Pragmatics: a Handbook. Cambridge: CUP. • Weisser, M. 2014. The DART Manual. • Weisser, M. (in progress). DART – the Dialogue Annotation and Research Tool.