
Pragmatic Annotation & Analysis in DART

Pragmatic Annotation & Analysis in DART. Martin Weisser School of English & Education Guangdong University of Foreign Studies weissermar@gmail.com martinweisser.org. Outline. Getting DART Design Background DART Annotation Scheme Basic Automated Annotation Speech-Act Analysis



  1. Pragmatic Annotation & Analysis in DART • Martin Weisser, School of English & Education, Guangdong University of Foreign Studies • weissermar@gmail.com • martinweisser.org

  2. Outline • Getting DART • Design Background • DART Annotation Scheme • Basic Automated Annotation • Speech-Act Analysis • N-Gram Analysis • Creating & Editing Resources

  3. Getting DART • go to http://martinweisser.org/ling_soft.html#DART • download & run the installer (currently 64-bit Windows only)

  4. Design Background (1) • 1997–1998: Expert Advisory Group on Language Engineering Standards (EAGLES) WP4 • guidelines for the representation and annotation of dialogue • 2001–2002: SPAAC (A Speech-Act Annotated Corpus of Dialogues) Project • annotation of some 1,200 task-oriented dialogue files (Trainline + BT) • need to annotate and post-edit corpus efficiently and consistently on multiple levels → SPAACy

  5. Design Background (2) • colour coding helps to identify syntactic patterns • post-processing constrained through fixed options • resources loaded automatically

  6. Design Background (3) • flaws in SPAACy • monolithic, i.e. no separation of ‘linguistic intelligence’ & output display • hard to improve linguistic analysis • processing & editing of single files only • other interface issues, e.g. too many buttons • development of DART • modularisation • strict separation of processing and linguistic analysis routines • enhanced options for analysis and creation of resources

  7. DART Annotation Scheme (1) – Basic Input Format • text with optional punctuation ‘tags’ or embedded comments • optional stylesheet reference • basic skeleton can be created via ‘File→New’ (Ctrl + n)

  8. DART Annotation Scheme (2) – Output Format • syntactic category • mode = semantico-pragmatic markers/‘IFIDs’ • speech act generally inferred from combination of syntax + mode • topic = semantic info • speech act(s) • (surface) polarity

  9. Basic Automated Annotation • to load single file, press Ctrl + a (for whole directory, Ctrl + d) • single file loaded; to pre-edit, click hyperlink; to annotate pragmatically, press Ctrl + a • single file processed; to post-edit, click hyperlink • input files workspace • output files workspace • debugging output; ignore if annotation completes successfully

  10. Speech-Act Analysis • generate frequency list of syntactic category + speech act(s) from ‘Analysis→Speech act stats’ • click hyperlinked speech act (combination) to prime concordancer • investigate results • if necessary, correct speech act tag by clicking the hyperlink to the file and editing it
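
The frequency list described on this slide can be approximated outside DART. The following is only a sketch under stated assumptions: it assumes annotated files are XML in which each unit is an element named after its syntactic category and carrying a speech-act attribute. The element names (`decl`, `q-wh`), the attribute name `sp-act`, the speech-act labels, and the sample dialogue are illustrative placeholders, not taken from the slides.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotated dialogue; element and attribute names are assumptions.
SAMPLE = """<dialogue>
  <turn n="1" speaker="A">
    <decl n="1" sp-act="state">i need a ticket to london</decl>
    <q-wh n="2" sp-act="reqInfo">when does it leave</q-wh>
  </turn>
  <turn n="2" speaker="B">
    <decl n="3" sp-act="state">it leaves at nine</decl>
  </turn>
</dialogue>"""


def speech_act_stats(xml_text, attr="sp-act"):
    """Count (syntactic category, speech act) pairs in one annotated file."""
    root = ET.fromstring(xml_text)
    return Counter(
        (element.tag, element.get(attr))
        for element in root.iter()
        if element.get(attr) is not None  # skip structural elements such as turns
    )


stats = speech_act_stats(SAMPLE)
for (category, speech_act), count in stats.most_common():
    print(f"{category}\t{speech_act}\t{count}")
```

Each row of such a list corresponds to one hyperlinked combination in the slide's workflow; a concordance would then show the units behind a given count.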

  11. N-Gram Analysis • useful for determining formulaic expressions for modes or topic patterns (or in general) • predefined options for uni- to tri-grams • optionally also freely definable n-grams • frequency lists display abs. & rel. frequencies • hyperlink again primes concordancer • for all n>1 with interpolated optional fillers • due to accommodating mixed-case data, sometimes ‘case insensitive’ flag required
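
To make the absolute and relative frequencies concrete, here is a minimal sketch of an n-gram frequency list with an optional case-insensitive flag. It is not DART's implementation; the function name `ngram_frequencies` and the whitespace tokenisation are assumptions made for illustration.

```python
from collections import Counter


def ngram_frequencies(tokens, n, case_insensitive=False):
    """Return {ngram: (absolute frequency, relative frequency)} for a token list."""
    if case_insensitive:
        tokens = [token.lower() for token in tokens]
    # Slide a window of size n over the tokens.
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    total = len(ngrams)
    return {gram: (count, count / total) for gram, count in counts.items()}


tokens = "Thank you very much thank you".split()
freqs = ngram_frequencies(tokens, 2, case_insensitive=True)
for gram, (absolute, relative) in sorted(freqs.items(), key=lambda item: -item[1][0]):
    print(f"{gram}\t{absolute}\t{relative:.2f}")
```

Without the flag, "Thank you" and "thank you" are counted as two different bigrams, which is the mixed-case effect the last bullet refers to.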

  12. Creating & Editing Resources (1) • mostly done via ‘Edit resources’ menu… • … apart from creating new files • to create new corpus • choose ‘Edit configuration’ • click ‘Add corpus entry’ • fill in corpus, lexicon, and topic filename (usually identical, apart from extension) • click ‘Save configuration’ • new resources created • data folder for corpus • three subfolders: ‘info’, ‘notes’, and ‘stats’ • dummy lexicon & topics files (in relevant program folders)

  13. Creating & Editing Resources (2) • existing resources can be edited… • generally via relevant entry in the ‘Edit resources’ menu • lexica & topic files via hyperlinks in configuration editor • safest to edit only dialogue, lexica & topic files… • … unless you really know what you’re doing :-) • lexica can also be ‘synthesised’ from corpus data

  14. Creating & Editing Resources (3) – Lexica • very simple format • word (base form) + space + tag + optional comment (preceded by #) • special DART tagset • allows for lexical polysemy • uppercase tag name = unambiguous • lowercase tag name = predominantly tag X • tooltips on tag buttons provide explanations while editing • synthesising lexicon works by • creating word list from corpus • ‘subtracting’ items from general lexicon • suggesting possible candidates after morphological analysis
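
The lexicon line format and the first two synthesis steps can be sketched as follows. This is an illustration under assumptions, not DART's own code: the tags `NOUN` and `verb` are placeholders rather than members of the DART tagset, the function names are hypothetical, and the morphological analysis step is left out.

```python
def parse_lexicon_line(line):
    """Split 'word tag # comment' into (word, tag, comment).

    Returns None for blank or comment-only lines.
    """
    entry, _, comment = line.partition("#")
    parts = entry.split()
    if len(parts) != 2:
        return None
    word, tag = parts
    return word, tag, comment.strip()


def is_unambiguous(tag):
    """Uppercase tag name = unambiguous; lowercase = predominantly that tag."""
    return tag.isupper()


def candidate_words(corpus_tokens, general_lexicon):
    """Build a word list from the corpus and 'subtract' known lexicon items."""
    return sorted({token.lower() for token in corpus_tokens} - set(general_lexicon))


# Placeholder entries; the tags are NOT real DART tags.
lines = ["ticket NOUN # placeholder tag", "book verb", "# comment-only line"]
entries = [parsed for parsed in map(parse_lexicon_line, lines) if parsed]
for word, tag, comment in entries:
    print(word, tag, is_unambiguous(tag), comment)

known_words = {word for word, _, _ in entries}
print(candidate_words("Book a ticket to Leeds".split(), known_words))
```

The words left over after the subtraction are the ones a morphological analysis would then need to examine before suggesting candidates.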

  15. Creating & Editing Resources (4) – Topic Files • syntax more complex than for lexica • combination of topic labels, space, double colon, space, associated (representative) patterns • patterns expressed as • regexes • individual sub-patterns separated by 3 underscores
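
A topic line of the shape described above can be read roughly like this. The sketch assumes one entry per line and treats everything before the double colon as a single label string; the label `time`, the sample patterns, and the function names are hypothetical, and DART's own regex dialect may differ from Python's `re`.

```python
import re


def parse_topic_line(line):
    """Split 'label :: pattern___pattern' into (label, [compiled regexes])."""
    label, separator, patterns = line.strip().partition(" :: ")
    if not separator:
        raise ValueError(f"no ' :: ' separator in line: {line!r}")
    # Sub-patterns are separated by three underscores.
    return label, [re.compile(pattern) for pattern in patterns.split("___")]


def topics_for(text, topic_table):
    """Return the labels whose patterns match somewhere in the text."""
    return [
        label
        for label, patterns in topic_table
        if any(pattern.search(text) for pattern in patterns)
    ]


# Hypothetical entry, not taken from a real DART topic file.
table = [parse_topic_line(r"time :: \b(?:morning|evening)\b___\bo'clock\b")]
print(topics_for("i'd like to leave at nine o'clock", table))
print(topics_for("a single to leeds please", table))
```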
