

“Pushing the Envelope”: A six month report. By the Novel Approaches team, with site leaders: Nelson Morgan, ICSI; Hynek Hermansky, OGI; Dan Ellis, Columbia; Kemal Sönmez, SRI; Mari Ostendorf, UW; Hervé Bourlard, IDIAP/EPFL; George Doddington, NA-sayer. Overview: Nelson Morgan, ICSI.


Presentation Transcript


  1. “Pushing the Envelope” • A six month report • By the Novel Approaches team • With site leaders: Nelson Morgan, ICSI; Hynek Hermansky, OGI; Dan Ellis, Columbia; Kemal Sönmez, SRI; Mari Ostendorf, UW; Hervé Bourlard, IDIAP/EPFL; George Doddington, NA-sayer

  2. Overview • Nelson Morgan, ICSI

  3. The Current Cast of Characters • ICSI: Morgan, Q. Zhu, B. Chen, G. Doddington • UW: M. Ostendorf, Ö. Çetin • OGI: H. Hermansky, S. Sivadas, P. Jain • Columbia: D. Ellis, M. Athineos • SRI: K. Sönmez • IDIAP: H. Bourlard, J. Ajmera, V. Tyagi

  4. Rethinking Acoustic Processing for ASR • Escape dependence on spectral envelope • Use multiple front-ends across time/freq • Modify statistical models to accommodate new front-ends • Design optimal combination schemes for multiple models

  5. Task 1: Pushing the Envelope (aside) • Problem: Spectral envelope is a fragile information carrier • Solution: Probabilities from multiple time-frequency patches • OLD: estimate of sound identity from a 10 ms frame • PROPOSED: i-th, k-th, … n-th estimates of sound identity from patches spanning up to 1 s of time, combined by information fusion

  6. Task 2: Beyond Frames… • Problem: Features & models interact; new features may require different models • Solution: Advanced features require advanced models, free of fixed-frame-rate paradigm • OLD: short-term features, conventional HMM • PROPOSED: advanced features, multi-rate, dynamic-scale classifier

  7. Today’s presentation • Infrastructure: training, testing, software • Initial Experiments: pilot studies • Directions: where we’re headed

  8. Infrastructure Kemal Sönmez, SRI (SRI/UW/ICSI effort)

  9. Initial Experimental Paradigm • Focus on a small task to facilitate exploratory work (later move to CTS) • Choose a task where LM is fixed & plays a minor role (to focus on acoustics) • Use mismatched train/test data: • To avoid tuning to the task • To facilitate later move to CTS • Task: OGI numbers / Train: swbd + macrophone

  10. Hub5 “Short” Training Set • Composition (total ~ 60 hours) * subset of SWB-1 hand-checked at SRI for accuracy of transcriptions and segmentations • WER 2-4% higher vs. full 250+ hour training

  11. Reduced UW Training Set • A reduced training set to shorten experiment turn-around time • Choose training utterances with per-frame likelihood scores close to the training set average • One quarter of the original training set • Statistics (gender, data set constituencies) are similar to those of the full training set • For OGI Numbers, no significant WER sacrifice in the baseline HMM system (worse for Hub5)
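
The selection rule on this slide can be sketched as follows. This is only an illustration, assuming one average per-frame log-likelihood per utterance and an unweighted mean over utterances; the function name `select_typical` and the toy scores are hypothetical, not taken from the UW setup.

```python
import numpy as np

def select_typical(scores, fraction=0.25):
    """Return indices of the utterances whose score is closest to the mean score.

    scores   : one per-frame (length-normalized) log-likelihood per utterance
    fraction : share of the training set to keep (one quarter on the slide)
    """
    scores = np.asarray(scores, dtype=float)
    n_keep = max(1, int(round(fraction * len(scores))))
    # Distance of each utterance from the training-set average (assumed unweighted).
    distance = np.abs(scores - scores.mean())
    closest = np.argsort(distance, kind="stable")[:n_keep]
    return np.sort(closest)

# Toy scores for eight utterances (made-up numbers, for illustration only).
toy_scores = [-80, -60, -61, -59, -40, -62, -58, -100]
kept = select_typical(toy_scores)
```

Outliers on either side of the average (very clean or very poorly matched utterances) are dropped, which is one plausible reading of why the gender and data-set statistics stay close to those of the full set.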

  12. Development Test Sets • A “Core-Subset” of OGI’s Numbers95 corpus: telephone speech of people reciting addresses, telephone numbers, zip codes, or other miscellaneous items • “Core-Subset” or “CS” consists of utterances that were phonetically hand-transcribed, intelligible, and contained only numbers • Vocabulary size: 32 words (digits + eleven, twelve… twenty… hundred… thousand, etc.)

  13. Statistical Modeling Tools • HTK (Hidden Markov Model Toolkit) for establishing an HMM baseline, debugging • GMTK (Graphical Models Toolkit) for implementing advanced models with multiple feature/state streams • Allows direct dependencies across streams • Not limited by single-rate, single-stream paradigm • Rapid model specification/training/testing • SRI Decipher system for providing lattices to rescore (later in CTS expts) • Neural network tools from ICSI for posterior probability estimation, other statistical software from IDIAP

  14. Baseline SRI Recognizer for the numbers task • Bottom-up state-clustered Gaussian mixture HMMs for acoustic modeling • Acoustic adaptation to speakers using affine mean and variance transforms [Not used for numbers] • Vocal-tract length normalization using maximum likelihood estimation [Not helpful for numbers] • Progressive search with lattice recognition and N-best rescoring [To be used in later work] • Bigram LM

  15. Initial Experiments Barry Chen, ICSI Hynek Hermansky, OHSU (OGI) Özgür Çetin, UW

  16. Goals of Initial Experiments • Establish performance baselines • HMM + standard features (MFCC, PLP) • HMM + current best from ICSI/OGI • Develop infrastructure for new models • GMTK for multi-stream & multi-rate features • Novel features based on large timespans • Novel features based on temporal fine structure • Provide fodder for future error analysis

  17. ICSI Baseline experiments • PLP based - SRI system • “Tandem” PLP-based ANN + SRI system • Initial combination approach

  18. Development Baseline: Gender Independent PLP System

  19. Phonetically Trained Neural Net • Multi-Layer Perceptron (input, hidden, and output layer) • Trained Using Error-Backpropagation Technique – outputs interpreted as posterior probabilities of target classes • Training Targets: 47 mono-phone targets from forced alignment using SRI Eval 2002 system • Training Utterances: UW Reduced Hub5 Set • Training Features: PLP12+e+d+dd, mean & variance normalized on per-conversation side basis • MLP Topology: • 9 Frame Context Window (4 frames in past + current frame + 4 frames in future) • 351 Input Units, 1500 Hidden Units, and 47 Output Units • Total Number of Parameters: ~600k
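
The topology numbers on this slide can be cross-checked with a little arithmetic. The sketch below assumes PLP12+e means 13 static coefficients per frame (39 with deltas and double deltas) and that each hidden and output unit has a bias term; the bias assumption is ours, not stated on the slide.

```python
# Input layer: 9-frame context window over 39-dimensional feature vectors.
context_frames = 4 + 1 + 4            # past + current + future
static_coeffs = 12 + 1                # PLP12 plus energy (assumed reading of "PLP12+e")
features_per_frame = static_coeffs * 3  # statics, deltas, double deltas
input_units = context_frames * features_per_frame

hidden_units = 1500
output_units = 47

# Fully connected weights between consecutive layers.
weights = input_units * hidden_units + hidden_units * output_units
# One bias per hidden and output unit (assumption).
biases = hidden_units + output_units
total_parameters = weights + biases

print(input_units, weights, total_parameters)
```

Under these assumptions the count comes to 598,547 parameters, which agrees with the "~600k" quoted on the slide.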

  20. Baseline ICSI Tandem • Outputs of Neural Net before final softmax non-linearity used as inputs to PCA • PCA without dimensionality reduction • 4.1% Word and 11.7% Sentence Error Rate on Numbers95-CS test set

  21. Baseline ICSI Tandem+PLP • PLP Stream concatenated with neural net posteriors stream • PCA reduces dimensionality of posteriors stream to 16 (keeping 95% of overall variance) • 3.3% Word and 9.5% Sentence Error Rate on Numbers95-CS test set
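
The feature pipeline described on the two tandem slides can be sketched roughly as below: PCA is fitted on the net outputs, the leading components covering 95% of the variance are kept, and the result is concatenated with the PLP stream. The function names, the random stand-in data, and the use of pre-softmax outputs for the Tandem+PLP case are assumptions made for illustration; this is not the ICSI code.

```python
import numpy as np

def fit_pca(x, variance_kept=1.0):
    """Fit PCA on rows of x; keep the leading components covering variance_kept."""
    mean = x.mean(axis=0)
    cov = np.cov(x - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]               # re-sort to descending
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    n_keep = int(np.searchsorted(ratio, variance_kept - 1e-12)) + 1
    n_keep = min(n_keep, len(eigvals))
    return mean, eigvecs[:, :n_keep]

def tandem_plus_plp(plp, net_outputs, variance_kept=0.95):
    """Concatenate PLP features with PCA-reduced neural net outputs, frame by frame."""
    mean, basis = fit_pca(net_outputs, variance_kept)
    return np.hstack([plp, (net_outputs - mean) @ basis])

# Random stand-ins: 500 frames, 39-dim PLP, 47 net outputs with decaying variance.
rng = np.random.default_rng(0)
plp = rng.standard_normal((500, 39))
net_outputs = rng.standard_normal((500, 47)) * 0.8 ** np.arange(47)
features = tandem_plus_plp(plp, net_outputs)
```

Calling `fit_pca` with `variance_kept=1.0` keeps every component, which corresponds to the "PCA without dimensionality reduction" of the plain tandem baseline.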

  22. Word and String Error Rates on Numbers95-CS Test Set

  23. OGI Experiments: New Features in EARS • Develop on home-grown ASR system (phoneme-based HTK) • Pass the most promising to ICSI for running in SRI LVCSR system • So far: • new features match the performance of the baseline PLP features but do not exceed it • advantage seen in combination with the baseline

  24. Looking to the human auditory system for design inspiration • Psychophysics: • Components within certain frequency range (several critical bands) interact [e.g. frequency masking] • Components within certain time span (a few hundreds of ms) interact [e.g. temporal masking] • Physiology: • 2-D (time-frequency) matched filters for activity in auditory cortex [cortical receptive fields]

  25. TRAP-based HMM-NN hybrid ASR • [Diagram] 101 point input (mean & variance normalized, Hamming windowed critical band trajectory) • Multilayer Perceptrons (MLPs) • Posterior probabilities of phonemes • Search for the best match
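
The 101-point input described on this slide could be prepared along the following lines. This is a sketch under assumptions: normalization is done per trajectory (the slide does not say whether it is per trajectory or per utterance), the frame step is taken to be 10 ms so that 101 points span about one second, and `trap_input` is a hypothetical name.

```python
import numpy as np

def trap_input(band_energy, center, length=101):
    """Cut one critical-band trajectory around `center` and condition it for an MLP.

    band_energy : 1-D array of (log) energies of a single critical band, one per frame
    center      : index of the frame being classified
    length      : trajectory length in frames (101 on the slide)
    """
    half = length // 2
    if center - half < 0 or center + half >= len(band_energy):
        raise ValueError("trajectory would run past the ends of the utterance")
    trajectory = np.asarray(band_energy[center - half : center + half + 1], dtype=float)
    # Mean and variance normalization (assumed here to be per trajectory).
    trajectory = (trajectory - trajectory.mean()) / (trajectory.std() + 1e-8)
    # Hamming window de-emphasizes frames far from the center frame.
    return trajectory * np.hamming(length)

rng = np.random.default_rng(0)
band_energy = rng.standard_normal(300)   # stand-in for one band of a 3 s utterance
mlp_input = trap_input(band_energy, center=150)
```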

  26. Feature estimation from linearly transformed temporal patterns • [Diagram] transform, MLP, TANDEM, HMM ASR • Open question on the slide: which transform? (? ? ?)

  27. Preliminary TANDEM/TRAP results (OGI-HTK) • WER% on OGI numbers, training on UW reduced training set, monophone models

  28. Features from more than one critical-band temporal trajectory • Studying KLT-derived basis functions, we observe: cosine transform + frequency derivative, average

  29. UW Baseline Experiments • Constructed an HTK-based HMM system that is competitive with the SRI system • Replicated the HMM system in GMTK • Move on to models which integrate information from multiple sources in a principled manner: • Multiple feature streams (multi-stream models) • Different time scales (multi-rate models) • Focus on statistical models not on feature extraction

  30. HTK HMM Baseline • An HTK-based standard HMM system: • 3-state triphones with decision-tree clustering • Mixture of diagonal Gaussians as state output distributions • No adaptation, fixed LM • Dimensions explored: • Front-end: PLP vs. MFCC, VTLN • Gender dependent vs. independent modeling • Conclusions: • No significant performance differences • Decided on PLPs, no VTLN, gender-independent models for simplicity

  31. HMM Baselines (cont.) • Replicated HTK baseline with equivalent results in GMTK • To reduce experiment turn-around time, wanted to reduce the training set • For HMMs and Numbers95, three quarters of the training data can be safely ignored:

  32. Multi-stream Models • Information fusion from multiple streams of features • Partially asynchronous state sequences • [Figure] STATE TOPOLOGY: states of stream X, states of stream Y • [Figure] GRAPHICAL MODEL: feature stream X, state seq. of stream X, state seq. of stream Y, feature stream Y

  33. Temporal envelope features (Columbia) • Temporal fine structure is lost (deliberately) in STFT features • [Figure label: 10 ms windows] • Need a compact, parametric description...

  34. Frequency-Domain Linear Prediction (FDLP) • Extend LPC with LP model of spectrum • ‘Poles’ represent temporal peaks • Features ~ pole bandwidth, ‘frequency’ • TD-LP: $y[n] = \sum_i a_i \, y[n-i]$ • FD-LP (after DFT): $Y[k] = \sum_i b_i \, Y[k-i]$
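
To make the idea of predicting along the frequency axis concrete, here is a minimal sketch. It is not the Columbia implementation: it swaps in a DCT for the slide's DFT so the coefficients stay real, uses a plain autocorrelation-method solver, and all function names, the model order, and the synthetic burst are assumptions. The all-pole model fitted to the transform coefficients is evaluated on the unit circle, where angle plays the role of time, so its peaks should line up with temporal peaks of the signal.

```python
import numpy as np

def dct_ii(x):
    """Unnormalized DCT-II via an explicit cosine matrix (adequate for short frames)."""
    n = len(x)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (m + 0.5) / n) @ x

def lp_coefficients(seq, order):
    """Autocorrelation-method linear prediction: seq[k] ~ sum_i b[i-1] * seq[k-i]."""
    n = len(seq)
    r = np.correlate(seq, seq, mode="full")[n - 1 : n + order]   # lags 0..order
    lags = np.abs(np.arange(order)[:, None] - np.arange(order)[None, :])
    toeplitz = r[lags] + 1e-9 * r[0] * np.eye(order)              # light regularization
    return np.linalg.solve(toeplitz, r[1 : order + 1])

def fdlp_envelope(signal, order=8, n_points=200):
    """All-pole temporal envelope estimate from LP applied to transform coefficients."""
    coeffs = dct_ii(np.asarray(signal, dtype=float))
    b = lp_coefficients(coeffs, order)
    omega = np.pi * np.arange(n_points) / n_points   # angle 0..pi maps to time 0..N
    i = np.arange(1, order + 1)
    a_of_omega = 1.0 - np.exp(-1j * np.outer(omega, i)) @ b
    return 1.0 / np.abs(a_of_omega) ** 2

# Synthetic test signal: a noise burst centered 70% of the way through the frame.
rng = np.random.default_rng(0)
t = np.arange(400)
burst = np.exp(-0.5 * ((t - 280) / 12.0) ** 2) * rng.standard_normal(400)
envelope = fdlp_envelope(burst)
peak_position = np.argmax(envelope) / len(envelope)   # relative position in the frame
```

In this picture the pole angles correspond to the "frequency" feature named on the slide (where in time a peak sits) and the pole radii to its bandwidth (how sharp the peak is).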

  35. Preliminary FDLP Results • Distribution of pole magnitudes for different phone classes (in 4 bands): • NN Classifier Frame Accuracies:

  36. Directions • Dan Ellis, Columbia (SRI/UW/Columbia work) • Nelson Morgan, ICSI (OGI/IDIAP/ICSI work + summary)

  37. Multi-rate Models (UW) • Integrate acoustic information from different time scales • Account for dependencies across scales • Better robustness against time- and/or frequency-localized interferences • Reduced redundancy gives better confidence estimates • [Diagram] Cross-scale dependencies (example): long-term features, coarse state chain, fine state chain, short-term features

  38. SRI Directions • Task 1: Signal-adaptive weighting of time-frequency patches • Basis-entropy based representation • Matching pursuit search for optimal weighting of patches • Optimality based on minimum entropy criterion • Task 2: Graphical models of patch combinations • Tiling-driven dependency modeling • GM combines across patch selections • Optimality based on information in representation

  39. Data-derived phonetic features (Columbia) • Find a set of independent attributes to account for phonetic (lexical) distinctions • phones replaced by feature streams • Will require new pronunciation models • asynchronous feature transitions (no phones) • mapping from phonetics (for unseen words) • Joint work with Eric Fosler-Lussier

  40. ICA for feature bases • PCA finds decorrelated bases; ICA finds independent bases • Lexically-sufficient ICA basis set?

  41. OGI Directions: Targets in sub-bands • Initially context-independent and band-specific phonemes • Gradually shifted to band-specific 6 broad phonetic classes (stops, fricatives, nasals, vowels, silence, flaps) • Moving towards band-independent speech classes (vocalic-like, fricative-like, plosive-like, ???)

  42. More than one temporal pattern? • [Diagram] Mean & variance normalized, Hamming windowed critical band trajectory (101 dim) • KLT1 … KLTn • MLP • MLP

  43. Pre-processing by 2-D operators with subsequent TRAP-TANDEM • [Diagram] Operators applied on the time-frequency plane: • differentiate f, average t • differentiate t, average f • diff upwards, av downwards • diff downwards, av upwards

  44. IDIAP Directions: Phase AutoCorrelation Features • Traditional features: autocorrelation based; very sensitive to additive noise and other variations • Phase AutoCorrelation (PAC): defined from the autocorrelation coefficients derived from a frame of fixed length [symbols and PAC equations not preserved in transcript]
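
The formulas for this slide did not survive in the transcript, so the sketch below rests on an assumption: it follows the definition commonly given for phase autocorrelation, in which the circular autocorrelation of a frame is written as the frame energy times the cosine of an angle, and that angle is used in place of the autocorrelation coefficient. The function name is hypothetical and the IDIAP system may differ in detail.

```python
import numpy as np

def phase_autocorrelation(frame):
    """Angles between a frame and its circularly shifted copies (assumed PAC definition).

    R[k] = sum_n x[n] * x[(n + k) mod N] = ||x||^2 * cos(theta[k]),  PAC[k] = theta[k]
    """
    x = np.asarray(frame, dtype=float)
    energy = np.dot(x, x)
    if energy == 0.0:
        raise ValueError("frame has zero energy")
    # Circular autocorrelation: the shifted copy keeps the same norm as the frame.
    r = np.array([np.dot(x, np.roll(x, -k)) for k in range(len(x))])
    return np.arccos(np.clip(r / energy, -1.0, 1.0))

rng = np.random.default_rng(0)
frame = rng.standard_normal(64)
pac = phase_autocorrelation(frame)
pac_scaled = phase_autocorrelation(3.0 * frame)
```

Because the angle depends only on the direction of the frame vector and not on its length, the values are unchanged when the frame is rescaled, which is one intuition for why such features might be less sensitive to energy changes than raw autocorrelation coefficients.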

  45. Entropy Based Multi-Stream Combination • Combination of evidence from more than one expert to improve performance • Entropy as a measure of confidence • Experts with low entropy are more reliable than experts with high entropy • Inverse entropy weighting criterion • Relationship between entropy of the resulting (recombined) classifier and recognition rate
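
Inverse entropy weighting for a single frame can be sketched as below, assuming each expert outputs a posterior distribution over the same classes and that the combination is a weighted sum of posteriors. The function names and example numbers are hypothetical; the slide does not specify the exact combination rule used at IDIAP.

```python
import numpy as np

def entropy(posteriors):
    """Shannon entropy (in nats) of each row of a posterior matrix."""
    p = np.clip(posteriors, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def inverse_entropy_combine(posteriors):
    """Combine expert posteriors for one frame with weights proportional to 1/entropy.

    posteriors : array of shape (n_experts, n_classes), each row summing to 1
    returns    : (combined posterior, expert weights)
    """
    posteriors = np.asarray(posteriors, dtype=float)
    inverse = 1.0 / (entropy(posteriors) + 1e-12)
    weights = inverse / inverse.sum()
    return weights @ posteriors, weights

# Expert 0 is confident (low entropy); expert 1 is nearly uniform (high entropy).
experts = np.array([[0.90, 0.05, 0.05],
                    [0.40, 0.30, 0.30]])
combined, weights = inverse_entropy_combine(experts)
```

Since the weights sum to one and each row is a distribution, the combined vector is again a valid posterior, dominated by the more confident expert.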

  46. ICSI Directions: Posterior Combination Framework • Combination of Several Discriminative Probability Streams

  47. Improvement of the Combo Infrastructure • Improve basic features: • Add prosodic features: voicing level, energy continuity • Improve PLP by further removing the pitch difference among speakers • Tandem • Different targets, different training features, e.g. word boundary • Improve TRAP (OGI) • Combination • Entropy-based or accuracy-based stream weighting or stream selection

  48. New types of tandem features: Possible word/syllable boundary • [Diagram] Input feature, NN Processing, Target posterior • Input feature: • Traditional or improved PLP • Spectral continuity • Voicing, voicing continuity • Formant continuity feature • …more • Target posterior: • Phonemes • Word/syllable boundary • Broad phoneme classes • Manner / place / articulation… etc.

  49. Data Driven Subword Unit Generation (IDIAP/ICSI) • Motivation: • Phoneme-based units may not be optimal for ASR • Approach (based on speaker segmentation method): • Initial segmentation: large number of clusters • Is thresholdless BIC-like merging criterion met? • Yes: merge, re-segment, and re-estimate • No: stop

  50. Summary • Staff and tools in place to proceed with core experiments • Pilot experiments provided coherent substrate for cooperation between 6 sites • Future directions for individual sites are all over the map, which is what we want • Possible exploration of collaborations w/MS in this meeting
