Pragmatic Annotation & Analysis in DART

Martin Weisser
School of English & Education
Guangdong University of Foreign Studies
weissermar@gmail.com
martinweisser.org

Outline
• Getting DART
• Design Background
• DART Annotation Scheme
• Basic Automated Annotation
• Speech-Act Analysis
• N-Gram Analysis
• Creating & Editing Resources

Getting DART
• go to http://martinweisser.org/ling_soft.html#DART
• download & run installer (currently 64-bit Windows only)

Design Background (1)
• 1997–1998: Expert Advisory Group on Language Engineering Standards (EAGLES) WP4
  • guidelines for the representation and annotation of dialogue
• 2001–2002: SPAAC (A Speech-Act Annotated Corpus of Dialogues) Project
  • annotation of some 1,200 task-oriented dialogue files (Trainline + BT)
  • need to annotate and post-edit corpus efficiently and consistently on multiple levels → SPAACy

Design Background (2)
• colour coding helps to identify syntactic patterns
• post-processing constrained through fixed options
• resources loaded automatically

Design Background (3)
• flaws in SPAACy
  • monolithic, i.e. no separation of ‘linguistic intelligence’ & output display
  • hard to improve linguistic analysis
  • processing & editing of single files only
  • other interface issues, e.g. too many buttons
• development of DART
  • modularisation
  • strict separation of processing and linguistic analysis routines
  • enhanced options for analysis and creation of resources

DART Annotation Scheme (1) – Basic Input Format
• text with optional punctuation ‘tags’ or embedded comments
• optional stylesheet reference
• basic skeleton can be created via ‘File→New’ (Ctrl + n)

DART Annotation Scheme (2) – Output Format
• syntactic category
• mode = semantico-pragmatic markers/‘IFIDs’
• speech act generally inferred from combination of syntax + mode
• topic = semantic info
• speech act(s)
• (surface) polarity

Basic Automated Annotation
• to load single file, press Ctrl + a (for whole directory, Ctrl + d)
• single file loaded; to pre-edit, click hyperlink; to annotate pragmatically, press Ctrl + a
• single file processed; to post-edit, click hyperlink
• input files workspace
• output files workspace
• debugging output; ignore if annotation completes successfully

Speech-Act Analysis
• generate frequency list of syntactic category + speech act(s) from ‘Analysis→Speech act stats’
• click hyperlinked speech act (combination) to prime concordancer
• investigate results
• if necessary, correct speech act tag by clicking the hyperlink to the file and editing it
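To make the idea of a "syntactic category + speech act(s)" frequency list concrete, here is a minimal Python sketch. It is not DART's own code. The unit format it reads (an XML-like element whose name is the syntactic category and which carries an `sp-act` attribute) is an assumption made for illustration, as are the function name and the sample data.

```python
import re
from collections import Counter

# ASSUMPTION: each annotated unit opens with a tag such as
#   <decl n="1" sp-act="state" polarity="positive">
# where the element name is the syntactic category. This layout is
# hypothetical and only serves to illustrate the counting step.
UNIT_TAG = re.compile(r'<(?P<cat>[\w-]+)\b[^>]*?\bsp-act="(?P<acts>[^"]*)"')


def speech_act_stats(annotated_text):
    """Count (syntactic category, speech act(s)) pairs, most frequent first."""
    counts = Counter()
    for match in UNIT_TAG.finditer(annotated_text):
        counts[(match.group("cat"), match.group("acts"))] += 1
    return counts.most_common()


# Hypothetical sample data, not taken from any DART corpus.
SAMPLE_DIALOGUE = """<decl n="1" sp-act="state" polarity="positive">
i want to go to london
</decl>
<q-yn n="2" sp-act="reqInfo" polarity="positive">
is that a single
</q-yn>
<decl n="3" sp-act="state" polarity="positive">
it is a return
</decl>"""

SAMPLE_STATS = speech_act_stats(SAMPLE_DIALOGUE)
for (category, acts), freq in SAMPLE_STATS:
    print(f"{category}\t{acts}\t{freq}")
```

Keeping the category and the speech-act string together as one key mirrors the slide's "speech act (combination)": a unit with several speech acts is counted as one combined label rather than split.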

N-Gram Analysis
• useful for determining formulaic expressions for modes or topic patterns (or in general)
• predefined options for uni- to tri-grams
• optionally also freely definable n-grams
• frequency lists display absolute & relative frequencies
• hyperlink again primes concordancer
  • for all n>1 with interpolated optional fillers
• because mixed-case data is accommodated, the ‘case insensitive’ flag is sometimes required
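The frequency lists described above can be approximated with a few lines of Python. This is a sketch under stated assumptions, not DART's implementation: the tokenisation rule, the function name `ngram_frequencies`, and the `case_insensitive` parameter are all illustrative choices.

```python
import re
from collections import Counter


def ngram_frequencies(text, n=2, case_insensitive=False):
    """Return (n-gram, absolute frequency, relative frequency) triples.

    ASSUMPTION: tokens are runs of word characters, apostrophes and
    hyphens; DART's own tokenisation may differ.
    """
    if case_insensitive:
        text = text.lower()
    tokens = re.findall(r"[\w'-]+", text)
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    # The list is empty when the text has fewer than n tokens,
    # so no division by zero can occur here.
    return [(gram, freq, freq / total) for gram, freq in counts.most_common()]


# Hypothetical mixed-case sample.
SAMPLE_TEXT = "Thank you. thank you very much"
for gram, abs_freq, rel_freq in ngram_frequencies(SAMPLE_TEXT, n=2, case_insensitive=True):
    print(f"{gram}\t{abs_freq}\t{rel_freq:.2f}")
```

With `case_insensitive=False`, "Thank you" and "thank you" are counted as two different bigrams, which illustrates why the flag is sometimes needed for mixed-case data.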

Creating & Editing Resources (1)
• mostly done via ‘Edit resources’ menu…
• … apart from creating new files
• to create new corpus
  • choose ‘Edit configuration’
  • click ‘Add corpus entry’
  • fill in corpus, lexicon, and topic filename (usually identical, apart from extension)
  • click ‘Save configuration’
• new resources created
  • data folder for corpus
    • three subfolders: ‘info’, ‘notes’, and ‘stats’
  • dummy lexicon & topics files (in relevant program folders)

Creating & Editing Resources (2)
• existing resources can be edited…
  • generally via relevant entry in the ‘Edit resources’ menu
  • lexica & topic files via hyperlinks in configuration editor
• safest to edit only dialogue, lexica & topic files…
• … unless you really know what you’re doing :-)
• lexica can also be ‘synthesised’ from corpus data

Creating & Editing Resources (3) – Lexica
• very simple format
  • word (base form) + space + tag + optional comment (preceded by #)
• special DART tagset
  • allows for lexical polysemy
    • uppercase tag name = unambiguous
    • lowercase tag name = predominantly tag X
  • tooltips on tag buttons provide explanations while editing
• synthesising lexicon works by
  • creating word list from corpus
  • ‘subtracting’ items from general lexicon
  • suggesting possible candidates after morphological analysis
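The lexicon format and the first two synthesis steps are simple enough to sketch in Python. The code below is an illustration only: the function names are hypothetical, the tags `N` and `v` are placeholders rather than real DART tags, and the morphological analysis step is left out entirely.

```python
def parse_lexicon_line(line):
    """Parse 'word tag # optional comment' into a dict, or None if malformed."""
    entry, _, comment = line.partition("#")
    parts = entry.split()
    if len(parts) != 2:
        return None  # blank line, comment-only line, or unexpected layout
    word, tag = parts
    return {
        "word": word,
        "tag": tag,
        # Per the slide: uppercase tag = unambiguous,
        # lowercase tag = word is only predominantly of that category.
        "unambiguous": tag.isupper(),
        "comment": comment.strip() or None,
    }


def candidate_words(corpus_tokens, general_lexicon):
    """Word list from the corpus minus items already in the general lexicon.

    ASSUMPTION: comparison is done on lowercased forms; the later
    morphological analysis that DART applies is not modelled here.
    """
    return sorted({token.lower() for token in corpus_tokens} - set(general_lexicon))


# Placeholder entries; 'N' and 'v' are NOT taken from the DART tagset.
print(parse_lexicon_line("ticket N # placeholder tag"))
print(parse_lexicon_line("book v"))
print(candidate_words(["Ticket", "to", "Euston"], {"to", "ticket"}))
```

Returning `None` for lines that do not fit the two-field layout keeps the sketch tolerant of blank or comment-only lines; a real editor would more likely report them.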

Creating & Editing Resources (4) – Topic Files
• syntax more complex than for lexica
• combination of topic labels, space, double colon, space, associated (representative) patterns
• patterns expressed as
  • regexes
  • individual sub-patterns separated by 3 underscores
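A topic entry of the shape described above (label, space, double colon, space, patterns joined by three underscores) could be read as follows. This is a minimal sketch, not DART's parser: the function names and sample entries are hypothetical, Python's `re` syntax stands in for whatever regex dialect DART uses, and treating the sub-patterns as alternatives (any one match assigns the topic) is an assumption.

```python
import re


def parse_topic_line(line):
    """Split 'label :: pat1___pat2' into (label, [compiled regexes])."""
    label, separator, patterns = line.strip().partition(" :: ")
    if not separator:
        raise ValueError(f"missing ' :: ' separator in: {line!r}")
    return label, [re.compile(p) for p in patterns.split("___") if p]


def topics_for(unit, topic_table):
    """Return labels whose patterns match the unit, in table order.

    ASSUMPTION: sub-patterns are alternatives, so one match is enough.
    """
    return [
        label
        for label, regexes in topic_table
        if any(regex.search(unit) for regex in regexes)
    ]


# Hypothetical topic entries for illustration only.
TOPIC_LINES = [
    r"time :: \b(?:morning|afternoon)\b___\bo'clock\b",
    r"location :: \bfrom \w+___\bto [A-Z]\w+",
]
TOPIC_TABLE = [parse_topic_line(line) for line in TOPIC_LINES]

print(topics_for("leaving at nine o'clock", TOPIC_TABLE))
print(topics_for("leaving from london in the morning", TOPIC_TABLE))
```

Compiling the patterns once when the file is read, rather than per unit, keeps repeated lookups cheap and surfaces malformed regexes immediately.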
