Audio Indexing as a first step in an Audio Information Retrieval System
Jean-Pierre Martens, An Vandecatseye, Frederik Stouten
ELIS-DSSP, Sint-Pietersnieuwstraat 41, B-9000 Gent
CAIR Twente (10/10/2003)
Information retrieval from audio: general scheme
[Diagram: audio signal → audio indexing → speech transcription → information querying → info; audio indexing produces time stamps and audio labels, transcription adds text (summary) and topic labels. "This talk" covers audio indexing; the later stages are covered in the talks of Steve & Roeland.]
CAIR Twente (10/10/2003)
Why audio indexing?
• Extract extra-linguistic information: commercial, intro, football report, etc.
• Save time: let the speech recognizer process only those parts that are expected to contain speech
• Raise speech transcription accuracy: allow the speech recognizer to select the right models at the right time
CAIR Twente (10/10/2003)
Audio indexing in the ATRANOS project
• Project name
• Main project objectives
  • automatic segmentation/labeling of audio files
  • automatic transcription of the speech parts
  • conversion (normalization) of transcriptions for an application (captioning = test vehicle in this project)
• Partners: ESAT+CCL/KULeuven, ELIS/UGent, CNTS/UIA
• Status: entering its final year
CAIR Twente (10/10/2003)
Audio indexing in the ATRANOS project
• Mark parts which need no transcription
  • speech / non-speech segmentation
• Detect important change points in speech
  • change of speaker or acoustics (bandwidth, background)
  • segment between change points = speaker turn
• Assign a speaker label to each turn
  • all frames of one speaker get the same label
• Assign a speech mode to each turn
  • prepared versus spontaneous speech
CAIR Twente (10/10/2003)
Audio indexing in ATRANOS
• Additional design goals
  • aim for continuous input processing (stream-based)
  • restrict computational load (real-time on a PC)
  • restrict maximum delay (memory)
  • aim for language independence
• Evaluation data
  • American Broadcast News database (LDC)
  • Pan-European Broadcast News database (COST278)
  • Spoken Dutch Corpus (CGN)
CAIR Twente (10/10/2003)
1. Speech / non-speech segmentation
• Approach
  • construct statistical models (GMMs) for typical situations
  • let these models score individual audio frames (see the sketch below)
  • group the frames on the basis of these scores
• Which models to build?
  • one clean speech model
  • some common background models (e.g. music)
  • corresponding speech + common background models
  • a garbage model for all the rest
CAIR Twente (10/10/2003)
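The frame-scoring step can be pictured with a minimal sketch, assuming per-frame MFCC features are already extracted and using scikit-learn's GaussianMixture in place of whatever toolkit the authors actually used; the model labels and mixture size are illustrative:

```python
# Hedged sketch: one GMM per 'situation', then per-frame log-likelihoods.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_situation_models(training_sets, n_components=16):
    """Fit one GMM per labeled situation (clean speech, music,
    speech+music, garbage, ...). training_sets maps a label to an
    (n_frames, n_mfcc) array of feature vectors."""
    models = {}
    for label, frames in training_sets.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        models[label] = gmm.fit(frames)
    return models

def score_frames(models, frames):
    """Per-frame log-likelihood of 'frames' under every situation model;
    these scores feed the loop-model decoder on the next slide."""
    return np.stack([m.score_samples(frames) for m in models.values()], axis=1)
```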
1. Speech / non-speech segmentation
[Figure: loop model with the situation models 1-4 as parallel branches between begin node B and end node E, with transition penalty Pt]
• How to group frames?
  • put the models in a loop model (with a transition penalty between models)
  • compute the best state sequence (on-line Viterbi algorithm with forced decisions; see the sketch below)
  • perform some post-processing on the output sequence
CAIR Twente (10/10/2003)
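A minimal off-line rendering of the grouping step (the real system makes forced on-line decisions once all surviving paths agree on a past frame; the penalty value here is an assumption):

```python
import numpy as np

def viterbi_loop(frame_scores, penalty=50.0):
    """Group frames by the best state sequence in a loop of models.

    frame_scores: (n_frames, n_models) log-likelihoods from the GMMs.
    penalty: cost paid when switching to a different model, which
             discourages rapid label changes.
    """
    n_frames, n_models = frame_scores.shape
    delta = frame_scores[0].copy()                 # best path score per model
    back = np.zeros((n_frames, n_models), dtype=int)
    for t in range(1, n_frames):
        # staying in a model is free; switching pays the transition penalty
        trans = delta[:, None] - penalty * (1 - np.eye(n_models))
        back[t] = np.argmax(trans, axis=0)
        delta = trans[back[t], np.arange(n_models)] + frame_scores[t]
    # trace back the winning sequence of model labels
    path = np.empty(n_frames, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```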
1. Speech / non-speech segmentation
• Evaluation results (7 data sets)
  • training and parameter setting on American BN
  • performance degrades for unseen situations (e.g. football reports)
[Results chart omitted]
CAIR Twente (10/10/2003)
2. Speaker segmentation
• Objective: detect changes in speaker/acoustics
• Approach
  • identify change points by comparing properties of the observations in two intervals on either side of each candidate point
  • advantage: self-organizing (no speaker models needed)
CAIR Twente (10/10/2003)
2. Speaker segmentation
• Step 1: potential change position detection
  • select candidate positions on a grid (to save CPU time)
  • determine a fixed-length left/right context around each candidate position n
  • build 3 models for the data: M(both), M(left), M(right)
  • retain significant maxima of LLR(n; two models vs one) as candidates (see the sketch below)
[Figure: candidate position n with fixed-length left and right contexts, processed in blocks of 10 frames]
CAIR Twente (10/10/2003)
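A sketch of the LLR computation, assuming each of the three models is a single full-covariance Gaussian fit by maximum likelihood (the slide does not specify the model family) and using illustrative window sizes:

```python
import numpy as np

def gauss_loglik(X):
    """Total log-likelihood of X under a Gaussian ML-fitted to X itself.
    With the ML covariance, the quadratic term sums to exactly n*d."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(d)  # regularized
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (n * logdet + n * d * (1.0 + np.log(2.0 * np.pi)))

def llr_curve(frames, context=100, grid=10):
    """LLR(n) = LL(M(left)) + LL(M(right)) - LL(M(both)) on a grid of
    candidate positions; significant local maxima become candidates."""
    llr = {}
    for n in range(context, len(frames) - context, grid):
        left = frames[n - context:n]
        right = frames[n:n + context]
        both = frames[n - context:n + context]
        llr[n] = gauss_loglik(left) + gauss_loglik(right) - gauss_loglik(both)
    return llr
```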
2. Speaker segmentation
• Step 2: boundary elimination
  • pool all boundaries in a speech part (up to Tmax, or until the end of the speech part)
  • evaluate a variable-length context of n using BIC:
    ΔBIC(n) = LL(M2) - LL(M1) - λ [#par(M2) - #par(M1)] log N(n)
  • select the n with minimal ΔBIC(n)
  • if ΔBIC(n) < 0: eliminate n and reiterate
  • if ΔBIC(n) ≥ 0: move to the next speech part
A sketch of the ΔBIC computation follows.
[Figure: speech part between non-speech (NS) segments, with boundary n and its left/right contexts pooled up to Tmax]
CAIR Twente (10/10/2003)
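The ΔBIC test is a direct transcription of the formula above, where M2 models the left and right contexts separately and M1 models their union; this sketch reuses gauss_loglik from the previous one and again assumes single full-covariance Gaussians:

```python
import numpy as np

def n_params_gauss(d):
    """Free parameters of one full-covariance Gaussian: mean + covariance."""
    return d + d * (d + 1) // 2

def delta_bic(left, right, lam=1.0):
    """ΔBIC(n) = LL(M2) - LL(M1) - λ [#par(M2) - #par(M1)] log N(n).
    #par(M2) - #par(M1) equals the parameters of one extra Gaussian."""
    both = np.vstack([left, right])
    ll_m2 = gauss_loglik(left) + gauss_loglik(right)
    ll_m1 = gauss_loglik(both)
    penalty = lam * n_params_gauss(both.shape[1]) * np.log(len(both))
    return ll_m2 - ll_m1 - penalty
```

A boundary is eliminated when its ΔBIC comes out negative, exactly as in the elimination loop on the slide.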
2. Speaker segmentation
• Evaluation (7 data sets)
  • recall: how many real changes were detected?
  • precision: how many detected changes are real?
[Results chart omitted; annotated "5 out of 7"]
CAIR Twente (10/10/2003)
3. Speaker labeling
• Objective: assign the same label to all turns of the same speaker
• Approach
  • on-line clustering, fully integrated in the segmentation
  • BIC as decision criterion
• Clustering strategy (sketched below)
  • for all turns in a speech part: compute ΔBIC between the turn and the 'closest' cluster center
  • select the turn with maximal ΔBIC:
    • if ΔBIC > 0: take the turn as a new cluster
    • else: take the turn with the smallest ΔBIC and add it to its closest cluster
CAIR Twente (10/10/2003)
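The clustering strategy might look like the following sketch, where a cluster is represented simply by the pooled frames of its turns (the paper's actual cluster representation is not specified) and delta_bic is the function from the speaker-segmentation sketch:

```python
import numpy as np

def cluster_turns(turns, clusters, lam=1.0):
    """On-line clustering of the speaker turns in one speech part.

    turns:    list of (n_frames, dim) arrays still to be assigned
    clusters: list of (n_frames, dim) arrays of already-pooled frames
    """
    while turns:
        # ΔBIC between every remaining turn and its closest cluster
        scored = []
        for t in turns:
            if not clusters:
                scored.append((np.inf, 0))          # first turn opens a cluster
            else:
                d = [delta_bic(t, c, lam) for c in clusters]
                i = int(np.argmin(d))
                scored.append((d[i], i))
        j = int(np.argmax([s[0] for s in scored]))  # turn with maximal ΔBIC
        if scored[j][0] > 0:                        # clearly a new speaker
            clusters.append(turns.pop(j))
        else:                                       # merge the best match
            j = int(np.argmin([s[0] for s in scored]))
            ci = scored[j][1]
            clusters[ci] = np.vstack([clusters[ci], turns.pop(j)])
    return clusters
```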
3. Speaker labeling
• Evaluation methodology
  • step 1: assign the official speaker label to each cluster
  • step 2: cluster purity = % of frames with the correct label
  • step 3: ideal cluster purity = purity for an ideal clustering
    • per speaker: 1 cluster with the label of that speaker
    • per frame in a turn: select the label of the dominant speaker in the turn
[Figure: official vs computed label sequences (A B A B A), with error zones where they disagree]
CAIR Twente (10/10/2003)
3. Speaker labeling
• Evaluation results (7 data sets)
  • training and parameter setting on American BN
  • still room for improvement (the number of clusters is also larger than ideal)
CAIR Twente (10/10/2003)
Demonstration
CAIR Twente (10/10/2003)
4. Speech mode labeling
• Objective
  • spontaneous versus prepared speech
  • how: presence of disfluencies (detected prior to recognition)
• Disfluencies
  • filled pauses (uh's, abnormally lengthened sounds)
  • repetitions of words or word groups
  • abbreviations of words
• At present
  • no speech mode labeling results yet
  • therefore, the remainder focuses on disfluency detection itself
CAIR Twente (10/10/2003)
4. Disfluency detection
• Approach
  • perform segmentation into phoneme-sized parts on the basis of a cepstral difference measure (see the sketch below)
  • identify features revealing the FP/NFP nature of these parts
  • supply these features to a statistical classifier
  • keep everything stream-based (to fit with the rest)
• Feature identification
  • CGN (Spoken Dutch Corpus): conversational speech
  • bootstrap data set (11 h)
  • 3255 annotated uh's
  • manual word alignments available (location of the uh's)
CAIR Twente (10/10/2003)
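The segmentation step could be approximated as follows; the exact cepstral difference measure and the window sizes are assumptions, not taken from the work itself:

```python
import numpy as np

def phoneme_sized_segments(mfcc, width=3, min_gap=5):
    """Boundary candidates at local maxima of a cepstral difference measure.

    mfcc: (n_frames, n_coeffs) cepstral vectors. The measure here compares
    mean cepstra in small windows before and after each frame.
    """
    diff = np.zeros(len(mfcc))
    for t in range(width, len(mfcc) - width):
        before = mfcc[t - width:t].mean(axis=0)
        after = mfcc[t:t + width].mean(axis=0)
        diff[t] = np.linalg.norm(after - before)
    # keep local maxima, enforcing a minimal distance between boundaries
    bounds, last = [], -min_gap
    for t in range(1, len(diff) - 1):
        if diff[t] >= diff[t - 1] and diff[t] > diff[t + 1] and t - last >= min_gap:
            bounds.append(t)
            last = t
    return bounds
```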
4. Disfluency detection
• Feature detection on bootstrap data
  • segment duration
  • spectral stability
  • stable interval durations
  • silence present
  • center of gravity
  • output of a simple spectral FP model (GMM on 12 MFCCs)
• Result: 12 useful features identified in total
CAIR Twente (10/10/2003)
4. Disfluency detection
• Statistical classifier
  • MLP to estimate P(FP|x) (x = 12 features + 12 MFCCs)
  • problem: very low P(FP) (on the order of 1 %)
  • therefore: design a filter that first eliminates the most certain NFP segments
• GMM-based filter (sketched below)
  • two GMMs: P(x|FP) and P(x|NFP) (x = 12 features)
  • combined with the prior P(FP) = 0.01 to obtain the posterior P(FP|x)
  • retain a segment if P(FP|x) > threshold
  • results: 90 % of NFPs and < 10 % of FPs removed; P(FP) raised from 1 % to 10 %
CAIR Twente (10/10/2003)
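A sketch of the GMM-based filter using the numbers on this slide (prior 0.01); the mixture sizes are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_fp_filter(fp_feats, nfp_feats, prior_fp=0.01):
    """Fit class-conditional GMMs P(x|FP) and P(x|NFP); Bayes' rule then
    gives the posterior P(FP|x) used to discard the most certain NFPs."""
    gmm_fp = GaussianMixture(8, covariance_type="diag").fit(fp_feats)
    gmm_nfp = GaussianMixture(8, covariance_type="diag").fit(nfp_feats)

    def posterior_fp(x):
        lfp = gmm_fp.score_samples(x) + np.log(prior_fp)
        lnfp = gmm_nfp.score_samples(x) + np.log(1.0 - prior_fp)
        m = np.maximum(lfp, lnfp)                 # for numerical stability
        return np.exp(lfp - m) / (np.exp(lfp - m) + np.exp(lnfp - m))

    return posterior_fp

# only segments with posterior_fp(x) > threshold are passed on to the MLP
```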
4. Disfluency detection
• Evaluation on an independent test set
  • size: 47 min containing 415 FPs
  • available information
    • all uh's (including word-internal ones) were annotated
    • all abnormal sound lengthenings were annotated
    • all corresponding time intervals were manually checked
CAIR Twente (10/10/2003)
4. Disfluency detection
• Evaluation on test data
  • recall-precision (ROC) curves:
    • our method: R = 75 %, P = 85 %
    • Gabrea method: R = 60 %, P = 65 %
  • embedded training (15 h of unlabeled data) does not help
CAIR Twente (10/10/2003)
4. Disfluency detection and ASR
• Baseline system
  • 40K lexicon + uh (FP), trigram LM
  • WER = 51.3 % (spontaneous dialogues, CGN; uh excluded)
• Cheating experiment
  • remove the manually labeled FP segments from the input
  • equivalent to: recognize FPs, then ignore them in the LM context
  • equivalent to: remove the correct FPs from the input stream
  • WER = 47.6 % (7.5 % relative gain, 1.25 word corrections per FP)
• First real experiment
  • remove the detected FP segments from the input
  • WER = 49.4 % (3.7 % relative gain, 0.62 word corrections per FP)
CAIR Twente (10/10/2003)
Conclusions
• There exist good audio indexing techniques:
  • speech / non-speech segmentation
  • speaker turn segmentation
  • speaker identity labeling
  • filled-pause detection
• These techniques can be used
  • to extract extra-linguistic information for AIR
  • to guide the speech transcription module
CAIR Twente (10/10/2003)