NXT meets the ICSI Corpus
Jean Carletta and Jonathan Kilgour
University of Edinburgh, HCRC Language Technology Group
ICSI Meeting Corpus
• 75 natural meetings from research groups
• close-talking and far-field microphones
• orthographic transcription
• "speech quality" tags (e.g., emphasis)
• dialogue acts using MRDA
• hot spots
The NITE XML Toolkit
• library support for data handling and search using a data model that can express both timing and complex structure
• multiple file stand-off XML data storage
• some standard GUIs, data utilities
• library support for writing tailored GUIs
Stand-off XML
extract from Bdb001.A.speech-quality.xml
  <speechquality nite:id="Bdb001.emphasis.16" type="emphasis">
    <nite:child href="Bdb001.A.words.xml#id(Bdb001.w.1,342)..id(Bdb001.w.1,344)" />
  </speechquality>
extract from Bdb001.A.words.xml
  <w nite:id="Bdb001.w.1,342" starttime="356.39" endtime="" c="W">time</w>
  <w nite:id="Bdb001.w.1,343" starttime="" endtime="" c="HYPH">-</w>
  <w nite:id="Bdb001.w.1,344" starttime="" endtime="356.59" c="W">line</w>
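To make the stand-off idea concrete, here is a minimal Python sketch that resolves the href range in the speech-quality extract to the words it points at. It is not NXT's own loader. The `nite:root` wrapper element, the namespace URI bound to the `nite:` prefix, and the `resolve_children` helper are assumptions made so that the two extracts parse on their own. Reading the span's timing from the first word's start and the last word's end is likewise an interpretation of the extract, not documented NXT behaviour.

```python
import re
import xml.etree.ElementTree as ET

# Assumed namespace URI for the nite: prefix; the slides do not show it.
NITE_NS = "http://nite.sourceforge.net/"

# The two extracts from the slide, wrapped in an assumed root element
# so that each is a well-formed document.
WORDS_XML = f"""<nite:root xmlns:nite="{NITE_NS}">
  <w nite:id="Bdb001.w.1,342" starttime="356.39" endtime="" c="W">time</w>
  <w nite:id="Bdb001.w.1,343" starttime="" endtime="" c="HYPH">-</w>
  <w nite:id="Bdb001.w.1,344" starttime="" endtime="356.59" c="W">line</w>
</nite:root>"""

QUALITY_XML = f"""<nite:root xmlns:nite="{NITE_NS}">
  <speechquality nite:id="Bdb001.emphasis.16" type="emphasis">
    <nite:child href="Bdb001.A.words.xml#id(Bdb001.w.1,342)..id(Bdb001.w.1,344)" />
  </speechquality>
</nite:root>"""

ID_ATTR = f"{{{NITE_NS}}}id"
CHILD_TAG = f"{{{NITE_NS}}}child"

# Matches "file#id(first)" with an optional "..id(last)" range end.
HREF_RE = re.compile(
    r"^(?P<file>[^#]+)#id\((?P<first>[^)]+)\)(?:\.\.id\((?P<last>[^)]+)\))?$"
)


def resolve_children(tag, word_files):
    """Return the word elements a stand-off tag points at, in file order."""
    resolved = []
    for child in tag.findall(CHILD_TAG):
        match = HREF_RE.match(child.get("href"))
        words = word_files[match["file"]]
        ids = [w.get(ID_ATTR) for w in words]
        first = ids.index(match["first"])
        last = ids.index(match["last"] or match["first"])
        resolved.extend(words[first:last + 1])
    return resolved


# Map each referenced file name to its list of <w> elements.
word_files = {"Bdb001.A.words.xml": list(ET.fromstring(WORDS_XML))}
tag = ET.fromstring(QUALITY_XML).find("speechquality")

covered = resolve_children(tag, word_files)
text = "".join(w.text for w in covered)
start = covered[0].get("starttime")
end = covered[-1].get("endtime")
print(tag.get("type"), text, start, end)
```

The tag file stores only the pointer, so several annotation layers can refer to the same words file without copying or modifying it.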
Tasks
• pre-NXT: up-translation and tokenization
• hand annotation (topic segmentation, dialogue acts, extractive summaries, ...)
• automatic annotation/indexing by query match
Queries in NXT
  ($a w):(TEXT($a) ~ /th.*/):: ($s speechquality):($s ^ $a) && ($s@type="emphasis")
• Find instances of words starting with “th”
• For each, find instances of speech quality tags of type "emphasis" that dominate the word
• Discard words that are not dominated by at least one such tag
Use queries to understand data, verify quality, and index.
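The three steps of the query can be imitated in plain Python over a toy data set. This is only an illustration of the logic, not the NXT query engine: the `Word` and `SpeechQuality` classes, the sample words, and the `"other"` tag type are hypothetical. Dominance is simplified to direct parent-child links, and the regular expression is assumed to have to match the whole word text.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Word:
    id: str
    text: str


@dataclass
class SpeechQuality:
    id: str
    type: str
    children: list = field(default_factory=list)  # Word objects under this tag


def emphasised_th_words(words, tags):
    """Pair each word starting with 'th' with the emphasis tags above it."""
    results = []
    for a in words:
        # Step 1: keep words whose text matches /th.*/ (whole-string match assumed).
        if not re.fullmatch(r"th.*", a.text):
            continue
        # Step 2: collect emphasis tags that dominate the word.
        dominating = [
            s for s in tags
            if s.type == "emphasis" and any(child is a for child in s.children)
        ]
        # Step 3: discard words with no such tag.
        if dominating:
            results.append((a, dominating))
    return results


# Hypothetical sample data, not taken from the corpus.
w1, w2, w3 = Word("w1", "the"), Word("w2", "time"), Word("w3", "this")
tags = [
    SpeechQuality("q1", "emphasis", [w2, w3]),
    SpeechQuality("q2", "other", [w1]),
]

matches = emphasised_th_words([w1, w2, w3], tags)
for word, found in matches:
    print(word.text, [s.id for s in found])
```

Here "the" is dropped because its only tag is not of type "emphasis", and "time" is dropped because it does not start with "th".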
NXT as Meeting Browser
• Browser = display + signal indexing + search
• NXT data displays:
  • synchronize with signal
  • highlight search results
Issues
• Already can't load all the ICSI data at once on some machines
• NXT supports display of one meeting at a time, but browsing may be over several meetings
• Really complicated queries are often too slow for browser response times
Key: pre-indexing of query results and tailored data builds
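One way to picture pre-indexing of query results is sketched below: run the slow query once per meeting offline, store the hits in a small file, and let the browser read that file and skip loading the full data. This is a hedged sketch, not how NXT implements it. The JSON layout, the function names, the `fake_run_query` stand-in, and the `MeetingX` name are all hypothetical. The single stored hit reuses the first word under the emphasis tag in the stand-off extract.

```python
import json
import tempfile
from pathlib import Path


def build_index(meetings, run_query, index_path):
    """Run the slow query once per meeting and store all hits in one JSON file."""
    index = {meeting: run_query(meeting) for meeting in meetings}
    Path(index_path).write_text(json.dumps(index), encoding="utf-8")


def load_index(index_path):
    """Read the stored hits; no meeting data needs to be loaded."""
    return json.loads(Path(index_path).read_text(encoding="utf-8"))


def hits_for(index, meetings):
    """Collect stored hits across several meetings at once."""
    return [(m, hit) for m in meetings for hit in index.get(m, [])]


def fake_run_query(meeting):
    # Hypothetical stand-in for a real query run over one loaded meeting.
    canned = {"Bdb001": [{"word": "Bdb001.w.1,342", "start": 356.39}]}
    return canned.get(meeting, [])


# Offline step: build the index once (written to a temporary directory here).
index_path = Path(tempfile.mkdtemp()) / "emphasis-index.json"
build_index(["Bdb001", "MeetingX"], fake_run_query, index_path)

# Browse-time step: look up hits over several meetings from the index alone.
index = load_index(index_path)
print(hits_for(index, ["Bdb001", "MeetingX"]))
```

The stored start time is what a browser would need to jump to the right point in the signal without re-running the query.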
Conclusions
• NXT available, free, open source, useful in a surprising number of ways
http://www.ltg.ed.ac.uk/NITE