430 likes | 544 Views
The NITE XML Toolkit. Jean Carletta University of Edinburgh HCRC Language Technology Group. NITE XML Toolkit. http://www.ltg.ed.ac.uk/NITE Edinburgh, Stuttgart, DFKI NOT the NITE Workbench for Windows from the University of Southern Denmark. The NITE XML Toolkit.
E N D
The NITE XML Toolkit Jean Carletta University of Edinburgh HCRC Language Technology Group
NITE XML Toolkit http://www.ltg.ed.ac.uk/NITE Edinburgh, Stuttgart, DFKI • NOT the NITE Workbench for Windows from the University of Southern Denmark
The NITE XML Toolkit • integrated support for creating and searching different kinds of annotation on the same speech and video data • data format that allows for distributed data production • some standard GUIs, data utilities • support for writing high quality hand-annotation tools for new tasks quickly
NXT corpus design • data model is multi-rooted tree with arbitrary graph structure over the top • each node has one set of children, multiple parents • annotations often naturally map to a tree • design task is deciding where trees intersect • NXT can represent arbitrary graphs but the more the data has this character, the less useful the search is
Only configuration needed to: • search/index data in NXT format • display data in a standardized (ugly) way • (NXT 1.3.0) do an increasing number of "usual" annotation tasks • dialogue act • named entity • time-stamped labelling like The Observer
Programming tailored interfaces • development time is 1.5 days - 2 weeks depending on • how clear the spec is • complexity of the interface • familiarity with Swing • NXT 1.3.0 will include middleware reducing this and making typical program ~200 lines of code
Recommended Data Paths (1) • Transcribe data outside NXT • Transcriber or multi-channel version of it • Create timestamped base layers either in NXT or in your favourite other tool • The Observer, Anvil, TASX, EventEditor
Recommended Data Paths (2) • Use NXT as a reference storage format for shared data • everyone contributes data to a CVS repository from which different versions of the corpus can be built • work in NXT natively when sensible • to create annotations structured over base layers • search/index • Use NXT's generic utilities (or roll your own) to export data, run it through some machine process, and re-import the result • POS, morphology, automatic annotation based on statistical model
Up-translation into NXT format • existing translations for several common tools • take .5-4 days to write, depending on • documentation of input format • complexity of mapping • complete lattice output of SR takes thought
Why NXT? • best support for distributed creation of hand-annotations structured over transcription • best search facility for integrated data set any other approach takes more dedicated development time; main task here is corpus design and up-translation
Reported Problems at Installation • won't run • zip file truncated during download • forgot to set classpath • don't have Java • can't get signal to play • video codec not installed/not registered in JMF • format not supported by JMF • no one thing to run
Stand-off XML extract from Bdb001.A.speech-quality.xml <speechquality nite:id="Bdb001.emphasis.16" type="emphasis"> <nite:child href="Bdb001.A.words.xml#id(Bdb001.w.1,342)..id(Bdb001.w.1,344)" /> </speechquality> extract from Bdb001.A.words.xml <w nite:id="Bdb001.w.1,342" starttime="356.39" endtime="" c="W">time</w> <w nite:id="Bdb001.w.1,343" starttime="" endtime="" c="HYPH">-</w> <w nite:id="Bdb001.w.1,344" starttime="" endtime="356.59" c="W">line</w>
GUI support (low level) • a central clock keeps data displays/signal in synch • pre-defined display widgets for text areas, trees, grids • interfaces that displays can implement • in order to stay synchronized with clock • to allow search results to be highlighted • predefined GUIs for displaying a dialogue, searching a corpus that work for anything
Metadata file • Equivalent to set of DTDs for the XML files plus: • connections between the files • list of "observations" (coded dialogues/group discussions/texts) • catalog for finding signals and data on disk
Data Handling API • Load corpus or meaningful subparts of a corpus (down to individual XML file) • Data access, traversal, and manipulation with most important validation done on-line • Serialization with choice of standoff syntax • Off-line procedure for full validation All data is held in memory; "dump-n-reload" memory management planned
Simple example query ($w word)($r reference): ($w@POS = “NN”) && ($r ^ $w) Match pairs of words and referring expressions where the word’s part of speech is NN and the word is in the referring expression.
General features of the language • Match variable by no type, single type, or disjunctive type • The usual boolean operators plus some syntactic sugar, like -> • Quantifiers forall and exists (which do not contribute to the n-tuple returned)
Attribute and content tests • Existence • Ordering and equality against numbers and strings • Match to regexp
Temporal tests • Whether data object is timed • Start or end time before, after, same as given time • Same temporal extent, inclusion, abutment, overlap temporal precedence • Start and end times treated as special attributes, for finer comparisons
Structural tests • Identity • Dominance (traceable through 0 to n children) • Precedence (before in some tree ordering) • Relationship via a role, which must be named • Some distance/tree-limited functionality
Complex queries • Evaluate first query, and carry over resulting bindings when evaluating second • Result is a tree • Any n-tuples from the first query that have no matches for the second are removed • Faster to run, more intuitive to write, easier to perform frequency counts
Example complex query ($a w):(TEXT($a) ~ /th.*/):: ($s speechquality):($s ^ $a) && ($s@type="emphasis") • Find instances of words starting with “th” • For each find instances of speech quality tags of type "emphasis" that dominate the word • Discard words that are not dominated by at least one such tag
Uses for queries • Exploring the data • Basic frequency counts • Verifying data quality • Indexing complexes for further use • Finding things for screen rendering in GUI
Warts • Currently builds in-memory representation of complete data set being loaded • work-arounds: process one dialogue at a time, don't load the annotations you don't need • lazy loading and better memory management under development • In large, distributed corpora, pain to assemble the subcorpus you want • build mechanism under development • Some useful things missing from query language • arithmetic • distance-limited precedence