Audio Indexing as a first step in an Audio Information Retrieval System

Presentation Transcript


  1. Audio Indexing as a first step in an Audio Information Retrieval System
     Jean-Pierre Martens, An Vandecatseye, Frederik Stouten
     ELIS-DSSP, Sint-Pietersnieuwstraat 41, B-9000 Gent
     CAIR Twente (10/10/2003)

  2. Information retrieval from audio: general scheme
     [Block diagram: audio signal → audio indexing → speech transcription → information querying → info;
      intermediate outputs include time stamps, audio labels, text (summary) and topic labels;
      audio indexing is the subject of this talk, the later stages are covered in the talks of Steve & Roeland]

  3. Why audio indexing?
     • Extract extra-linguistic information: commercial, intro, football report, etc.
     • Save time: let the speech recognizer only process the parts that are expected to contain speech
     • Raise speech transcription accuracy: allow the speech recognizer to select the right models at the right time

  4. Audio indexing in the ATRANOS project
     • Project name
     • Main project objectives
       • automatic segmentation/labeling of audio files
       • automatic transcription of the speech parts
       • conversion (normalization) of the transcriptions for an application (captioning = test vehicle in this project)
     • Partners: ESAT+CCL/KULeuven, ELIS/UGent, CNTS/UIA
     • Status: entering its final year

  6. Audio indexing in the ATRANOS project
     • Mark parts which need no transcription
       • speech / non-speech segmentation
     • Detect important change points in speech
       • change of speaker or acoustics (bandwidth, background)
       • segment between change points = speaker turn
     • Assign a speaker label to each turn
       • all frames of one speaker get the same label
     • Assign a speech mode to each turn
       • prepared versus spontaneous speech

  7. Audio indexing in ATRANOS
     • Additional design goals
       • aim for continuous input processing (stream-based)
       • restrict the computational load (real-time on a PC)
       • restrict the maximum delay (memory)
       • aim for language independence
     • Evaluation data
       • American Broadcast News database (LDC)
       • pan-European Broadcast News database (COST278)
       • Spoken Dutch Corpus (CGN)

  8. 1. Speech / non-speech segmentation
     • Approach
       • construct statistical models (GMMs) for typical situations
       • let these models score individual audio frames (sketched below)
       • group the frames on the basis of these scores
     • Which models to build?
       • one clean speech model
       • some common background models (e.g. music)
       • corresponding speech + common background models
       • a garbage model for all the rest
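
A minimal sketch of the scoring step just described, assuming scikit-learn style GMMs trained on MFCC-like feature frames; the model inventory, feature choice and component counts are illustrative assumptions, not the exact ATRANOS configuration.

    # Sketch: train one GMM per audio class and score individual frames with each of them.
    from sklearn.mixture import GaussianMixture

    def train_frame_models(training_sets, n_components=8):
        """training_sets: dict mapping a class label (e.g. 'speech', 'music',
        'speech+music', 'garbage') to an (N, D) array of feature frames."""
        models = {}
        for label, frames in training_sets.items():
            models[label] = GaussianMixture(n_components=n_components,
                                            covariance_type='diag').fit(frames)
        return models

    def frame_scores(models, frames):
        """Per-frame log-likelihood of every model for a (T, D) block of frames."""
        return {label: gmm.score_samples(frames) for label, gmm in models.items()}

The per-frame scores then feed the grouping step of the next slide.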

  9. 1. Speech / non-speech segmentation
     [Figure: the models arranged as states 1-4 in a loop between a begin node B and an end node E, with a transition penalty Pt]
     • How to group frames?
       • put the models in a loop model (with a transition penalty)
       • compute the best state sequence (on-line Viterbi algorithm with forced decisions); see the sketch below
       • perform some post-processing on the output sequence
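
The grouping can be pictured as a best-path search over the per-frame scores with a penalty for switching models. Below is a minimal offline Viterbi sketch; the on-line variant with forced decisions and the post-processing from the slide are not shown, and the penalty value is a placeholder.

    # Sketch: choose one model label per frame, penalizing switches between models.
    import numpy as np

    def best_label_sequence(score_matrix, labels, switch_penalty=20.0):
        """score_matrix: (T, K) per-frame log-likelihoods for the K class models."""
        T, K = score_matrix.shape
        delta = score_matrix[0].copy()               # best score of a path ending in each state
        backptr = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            stay = delta                              # remain in the same state: no penalty
            switch = delta.max() - switch_penalty     # come from the best other state: penalty
            backptr[t] = np.where(stay >= switch, np.arange(K), delta.argmax())
            delta = np.maximum(stay, switch) + score_matrix[t]
        path = [int(delta.argmax())]                  # trace back the best state sequence
        for t in range(T - 1, 0, -1):
            path.append(int(backptr[t, path[-1]]))
        return [labels[k] for k in reversed(path)]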

  10. 1. Speech / non-speech segmentation
     • Evaluation results (7 data sets)
       • training and parameter setting on American BN
       • performance degrades for unseen situations
     [Results chart per data set; the football reports are singled out]

  11. 2. Speaker segmentation
     • Objective: detect changes in speaker/acoustics
     • Approach
       • identify change points by comparing properties of the observations in two intervals on both sides of such a point
       • advantage: self-organizing (no speaker models needed)

  12. 2. Speaker segmentation
     [Figure: candidate position n with a fixed-length left and right context built from blocks of 10 frames; models are trained on the left part, the right part and both]
     • Step 1: potential change position detection
       • select positions on a grid (to limit CPU time)
       • determine a fixed-length left/right context
       • build 3 models for the data: M(both), M(left), M(right)
       • retain significant maxima of LLR(n; two models vs. one); see the sketch below
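
An illustration of the "two models versus one" likelihood-ratio test, assuming single full-covariance Gaussians for M(left), M(right) and M(both); the context length and any peak-picking threshold are arbitrary placeholders.

    # Sketch: log-likelihood ratio LLR(n) = LL(left) + LL(right) - LL(both) around position n.
    import numpy as np

    def gauss_loglik(x):
        """Total log-likelihood of the rows of x under one full-covariance Gaussian
        fitted on x itself (ML estimate); only the log-determinant term matters then."""
        n, d = x.shape
        cov = np.cov(x, rowvar=False, bias=True) + 1e-6 * np.eye(d)   # regularized ML covariance
        return -0.5 * n * (d * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1] + d)

    def llr_two_vs_one(frames, n, context=100):
        """Compare modelling the data around n with two Gaussians versus one."""
        left, right = frames[n - context:n], frames[n:n + context]
        both = frames[n - context:n + context]
        return gauss_loglik(left) + gauss_loglik(right) - gauss_loglik(both)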

  13. 2. Speaker segmentation
     [Figure: a speech part bounded by non-speech (NS), with a candidate boundary n and contexts of at most Tmax]
     • Step 2: boundary elimination
       • pool all boundaries in the speech part, over at most Tmax or until the end of the speech part (EO-S)
       • evaluate a variable-length context of n using BIC (sketched below):
         ΔBIC(n) = LL(M2) - LL(M1) - λ [#par(M2) - #par(M1)] log N(n)
       • select the n with minimal ΔBIC(n)
       • if ΔBIC(n) < 0: eliminate n and reiterate
       • if ΔBIC(n) ≥ 0: move to the next speech part
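
The ΔBIC formula on the slide can be made concrete with the same single full-covariance Gaussian models as above; λ = 1.0 below is only a placeholder for the tuned penalty weight.

    # Sketch: Delta-BIC between one Gaussian (M1) and two Gaussians split at n (M2).
    import numpy as np

    def gauss_loglik(x):                              # same ML log-likelihood as in the LLR sketch
        n, d = x.shape
        cov = np.cov(x, rowvar=False, bias=True) + 1e-6 * np.eye(d)
        return -0.5 * n * (d * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1] + d)

    def n_params(d):
        """Free parameters of one full-covariance Gaussian: mean plus covariance."""
        return d + d * (d + 1) // 2

    def delta_bic(frames, n, lam=1.0):
        """Delta-BIC(n) = LL(M2) - LL(M1) - lambda * (#par(M2) - #par(M1)) * log N(n)."""
        N, d = frames.shape
        ll_m2 = gauss_loglik(frames[:n]) + gauss_loglik(frames[n:])
        ll_m1 = gauss_loglik(frames)
        # M2 has twice the parameters of M1, so the difference is n_params(d)
        return ll_m2 - ll_m1 - lam * n_params(d) * np.log(N)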

  14. 2. Speaker segmentation
     • Evaluation (7 data sets); see the sketch below for the measures
       • recall: how many of the real changes are detected?
       • precision: how many of the detected changes are real?
     [Results chart; annotated "5 out of 7"]
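
For completeness, one simple way to turn detected and reference change points into the recall and precision figures of this evaluation; the tolerance window is an assumption, not a value from the talk.

    # Sketch: recall and precision of detected change points against the reference changes.
    def boundary_recall_precision(reference, detected, tolerance=1.0):
        """reference, detected: lists of change times in seconds."""
        hit = lambda a, others: any(abs(a - b) <= tolerance for b in others)
        recall = sum(hit(r, detected) for r in reference) / len(reference) if reference else 0.0
        precision = sum(hit(d, reference) for d in detected) / len(detected) if detected else 0.0
        return recall, precision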

  15. 3. Speaker labeling
     • Objective: assign the same label to all turns of the same speaker
     • Approach
       • on-line clustering, fully integrated in the segmentation
       • BIC as the decision criterion
     • Clustering strategy (sketched below)
       • for all turns in a speech part: compute the ΔBIC between the turn and the 'closest' cluster center
       • select the turn with maximal ΔBIC:
         • if ΔBIC > 0: take the turn as a new cluster
         • else: take the turn with the smallest ΔBIC and add it to its closest cluster
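
A rough sketch of the clustering idea, written as a simplified greedy per-turn variant (the strategy on the slide ranks all turns of a speech part before deciding); the dbic helper stands for a turn-versus-cluster ΔBIC like the one sketched earlier and is an assumption, not the project code.

    # Sketch: greedy speaker clustering with a Delta-BIC decision criterion.
    def cluster_turns(turns, dbic):
        """turns: list of per-turn feature arrays.
        dbic(turn, cluster): Delta-BIC between a turn and an existing cluster;
        positive means the turn does not fit the cluster."""
        clusters = [[turns[0]]]                       # the first turn opens the first cluster
        for turn in turns[1:]:
            scores = [dbic(turn, cl) for cl in clusters]
            closest = min(range(len(clusters)), key=lambda i: scores[i])
            if scores[closest] > 0:                   # even the closest cluster is a bad fit
                clusters.append([turn])               # new cluster = new speaker label
            else:
                clusters[closest].append(turn)        # reuse the label of the closest cluster
        return clusters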

  16. 3. Speaker labeling
     [Figure: official label sequence (A B A B A) versus computed label sequence, with error zones around the boundaries]
     • Evaluation methodology (sketched below)
       • step 1: assign an official speaker label to each cluster
       • step 2: cluster purity = % of frames with the correct label
       • step 3: ideal cluster purity = purity for an ideal clustering
         • per speaker: 1 cluster with the label of that speaker
         • per frame in a turn: select the label of the dominant speaker in that turn
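
The purity measure of steps 1 and 2 can be computed roughly as below, assuming the official and computed labels are available as two parallel per-frame sequences.

    # Sketch: frame-level cluster purity against the official speaker labels.
    from collections import Counter

    def cluster_purity(official, computed):
        """official, computed: equal-length sequences of per-frame labels."""
        mapping = {}                                           # step 1: label each cluster
        for cluster in set(computed):
            overlap = Counter(o for o, c in zip(official, computed) if c == cluster)
            mapping[cluster] = overlap.most_common(1)[0][0]    # most frequent official label
        correct = sum(mapping[c] == o for o, c in zip(official, computed))
        return correct / len(official)                         # step 2: % correctly labeled frames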

  17. 3. Speaker labeling
     • Evaluation results (7 data sets)
       • training and parameter setting on American BN
       • still room for improvement (the number of clusters is also larger than ideal)

  18. Demonstration

  19. 4. Speech mode labeling
     • Objective
       • spontaneous versus prepared speech
       • how: presence of disfluencies (prior to recognition)
     • Disfluencies
       • filled pauses (uh's, abnormally lengthened sounds)
       • repetitions of words or word groups
       • abbreviations of words
     • At present
       • no speech mode labeling results yet
       • therefore ...

  20. 4. Disfluency detection
     • Objectives
       • spontaneous versus prepared speech
       • how: presence of disfluencies (prior to recognition)
     • Disfluencies
       • filled pauses (uh's, abnormally lengthened sounds)
       • repetitions of words or word groups
       • abbreviations of words
     • At present
       • no speech mode labeling results yet
       • therefore ...

  21. 4. Disfluency detection
     • Approach
       • perform a segmentation into phoneme-sized parts on the basis of a cepstral difference measure (sketched below)
       • identify features revealing the FP/NFP (filled pause / non filled pause) nature of these parts
       • supply these features to a statistical classifier
       • keep everything stream-based (to fit with the rest)
     • Feature identification
       • CGN (Spoken Dutch Corpus): conversational speech
       • bootstrap data set (11 h)
       • 3255 annotated uh's
       • manual word alignments available (location of the uh's)
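
A minimal sketch of cutting the stream into phoneme-sized parts at peaks of a cepstral difference measure; the window span and threshold are placeholders, and the exact difference measure used in the project may differ.

    # Sketch: phoneme-sized segmentation from peaks in a cepstral difference measure.
    import numpy as np

    def cepstral_difference(mfcc, span=3):
        """Distance between the mean cepstra just after and just before each frame."""
        T = len(mfcc)
        diff = np.zeros(T)
        for t in range(span, T - span):
            diff[t] = np.linalg.norm(mfcc[t:t + span].mean(axis=0) -
                                     mfcc[t - span:t].mean(axis=0))
        return diff

    def segment_boundaries(diff, threshold=1.0):
        """Keep local maxima of the difference measure that exceed the threshold."""
        return [t for t in range(1, len(diff) - 1)
                if diff[t] > threshold and diff[t] >= diff[t - 1] and diff[t] >= diff[t + 1]]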

  22. 4. Disfluency detection
     • Feature detection on the bootstrap data
       • segment duration

  23. 4. Disfluency detection
     • Feature detection on the bootstrap data
       • segment duration
       • spectral stability

  24. 4. Disfluency detection
     • Feature detection on the bootstrap data
       • segment duration
       • spectral stability
       • stable interval durations

  25. 4. Disfluency detection
     • Feature detection on the bootstrap data
       • segment duration
       • spectral stability
       • stable interval durations
       • silence present

  26. 4. Disfluency detection
     • Feature detection on the bootstrap data
       • segment duration
       • spectral stability
       • stable interval durations
       • silence present
       • center of gravity

  27. 4. Disfluency detection
     • Feature detection on the bootstrap data
       • segment duration
       • spectral stability
       • stable interval durations
       • silence present
       • center of gravity
       • output of a simple spectral FP model (GMM on 12 MFCCs)
     • Result: 12 useful features identified in total (a few are sketched below)
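
A sketch of how a few of the listed features could be computed for one segment; the definitions of spectral stability and silence detection below are illustrative guesses, not the exact definitions used in the project.

    # Sketch: some illustrative per-segment features for filled-pause (FP) detection.
    import numpy as np

    def segment_features(mfcc_segment, log_energy, frame_shift=0.01, silence_floor=-5.0):
        """mfcc_segment: (T, D) MFCC frames of one segment; log_energy: (T,) frame log-energies."""
        step = np.linalg.norm(np.diff(mfcc_segment, axis=0), axis=1)
        return {
            'duration': len(mfcc_segment) * frame_shift,                      # filled pauses tend to be long
            'spectral_stability': float(step.mean()) if len(step) else 0.0,   # low = stationary, uh-like
            'silence_present': bool((log_energy < silence_floor).any()),      # silence in/around the segment
        }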

  28. 4. Disfluency detection
     • Statistical classifier
       • MLP to estimate P(FP|x) (x = 12 features + 12 MFCCs)
       • problem: very low P(FP) (on the order of 1%)
       • therefore: design a filter to eliminate the most certain NFP segments
     • GMM-based filter (sketched below)
       • two GMMs → P(x|FP) and P(x|NFP) (x = 12 features)
       • prior probability P(FP) = 0.01 → P(FP|x)
       • retain a segment if P(FP|x) > threshold
       • result: 90% of the NFP segments and < 10% of the FP segments removed, and P(FP) raised from 1% to 10%
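
The GMM-based filter amounts to a Bayes posterior with a small prior; a minimal sketch assuming scikit-learn GMMs for P(x|FP) and P(x|NFP), with illustrative component counts and threshold.

    # Sketch: keep only segments whose posterior P(FP | x) exceeds a threshold.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_filter(fp_features, nfp_features, n_components=4):
        gmm_fp = GaussianMixture(n_components=n_components).fit(fp_features)
        gmm_nfp = GaussianMixture(n_components=n_components).fit(nfp_features)
        return gmm_fp, gmm_nfp

    def fp_posterior(gmm_fp, gmm_nfp, x, prior_fp=0.01):
        """Bayes rule: P(FP|x) = p(x|FP) P(FP) / (p(x|FP) P(FP) + p(x|NFP) (1 - P(FP)))."""
        log_fp = gmm_fp.score_samples(x) + np.log(prior_fp)
        log_nfp = gmm_nfp.score_samples(x) + np.log(1.0 - prior_fp)
        return 1.0 / (1.0 + np.exp(log_nfp - log_fp))

    def passes_filter(posteriors, threshold=0.02):
        """Only the retained segments are passed on to the MLP classifier."""
        return posteriors > threshold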

  29. 4. Disfluency detection
     • Evaluation on an independent test set
       • size: 47 min, containing 415 FPs
       • available information
         • all uh's (including word-internal ones) were annotated
         • all abnormal sound lengthenings were annotated
         • all corresponding time intervals were manually checked

  30. 4. Disfluency detection
     • Evaluation on the test data
       • recall-precision (ROC) curves
         • our method: R = 75%, P = 85%
         • Gabrea method: R = 60%, P = 65%
       • embedded training (15 h of unlabeled data) does not help

  31. 4. Disfluency detection and ASR
     • Baseline system
       • 40K lexicon + uh (FP), trigram LM
       • WER = 51.3% (spontaneous dialogues from CGN, uh excluded)
     • Cheating experiment
       • remove the manually labeled FP segments from the input
       • equivalent to: recognize the FPs but ignore them in the LM context
       • equivalent to: remove the correct FPs from the input stream
       • WER = 47.6% (7.5% relative gain, 1.25 word corrections per FP)
     • First real experiment
       • remove the detected FP segments from the input
       • WER = 49.4% (3.7% relative gain, 0.62 word corrections per FP); see the note below
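
For clarity, "relative gain" here is the relative reduction of the word error rate; a one-line check with the baseline and first-experiment figures from the slide (the helper name is mine).

    # Relative WER reduction: (baseline - new) / baseline, in percent.
    def relative_gain(wer_baseline, wer_new):
        return 100.0 * (wer_baseline - wer_new) / wer_baseline

    print(round(relative_gain(51.3, 49.4), 1))   # 3.7, the relative gain quoted on the slide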

  32. Conclusions
     • Good audio indexing techniques exist
       • speech / non-speech segmentation
       • speaker turn segmentation
       • speaker identity labeling
       • filled-pause detection
     • These techniques can be used
       • to extract extra-linguistic information for AIR
       • to guide the speech transcription module
