1 / 24

Speech Recognition Introduction I

Speech Recognition Introduction I. E.M. Bakker. Speech Recognition. Some Applications An Overview General Architecture Speech Production Speech Perception. Words. Speech Recognition. “How are you?”. Speech Signal. Speech Recognition.

pdozier
Download Presentation

Speech Recognition Introduction I

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Speech RecognitionIntroduction I E.M. Bakker LML Speech Recognition 2008

  2. Speech Recognition • Some Applications • An Overview • General Architecture • Speech Production • Speech Perception LML Speech Recognition 2008

  3. Words Speech Recognition “How are you?” Speech Signal Speech Recognition Goal: Automatically extract the string of words spoken from the speech signal LML Speech Recognition 2008

  4. Words Speech Recognition “How are you?” Speech Signal Speech Recognition Goal: Automatically extract the string of words spoken from the speech signal • How is SPEECH produced? • Characteristics of Acoustic Signal LML Speech Recognition 2008

  5. Words Speech Recognition “How are you?” Speech Signal Speech Recognition Goal: Automatically extract the string of words spoken from the speech signal How is SPEECH perceived? => Important Features LML Speech Recognition 2008

  6. Words Speech Recognition “How are you?” Speech Signal Speech Recognition Goal: Automatically extract the string of words spoken from the speech signal What LANGUAGE is spoken? => Language Model LML Speech Recognition 2008

  7. Input Speech Acoustic Front-end Acoustic Models P(A/W) Words Speech Recognition “How are you?” Speech Signal Language Model P(W) Search Recognized Utterance Speech Recognition Goal: Automatically extract the string of words spoken from the speech signal What is in the BOX? LML Speech Recognition 2008

  8. Important Componentsof General SR Architecture • Speech Signals • Signal Processing Functions • Parameterization • Acoustic Modeling (Learning Phase) • Language Modeling (Learning Phase) • Search Algorithms and Data Structures • Evaluation LML Speech Recognition 2008

  9. Recognition ArchitecturesA Communication Theoretic Approach Message Source Linguistic Channel Articulatory Channel Acoustic Channel Features Observable: Message Words Sounds Speech Recognition Problem: P(W|A), where A is acoustic signal, W words spoken Objective: minimize the word error rate Approach: maximize P(W|A) during training • Bayesian formulation for speech recognition: • P(W|A) = P(A|W) P(W) / P(A), A is acoustic signal, W words spoken • Components: • P(A|W) : acoustic model (hidden Markov models, mixtures) • P(W) : language model (statistical, finite state networks, etc.) • The language model typically predicts a small set of next words based on • knowledge of a finite number of previous words (N-grams). LML Speech Recognition 2008

  10. Recognition Architectures • The signal is converted to a sequence of feature vectors based on spectral and temporal measurements. • Acoustic models represent sub-word units, such as phonemes, as a finite-state machine in which: • states model spectral structure and • transitions model temporal structure. Acoustic Front-end Acoustic Models P(A|W) • The language model predicts the next • set of words, and controls which models are hypothesized. Search • Search is crucial to the system, since • many combinations of words must be • investigated to find the most probable • word sequence. Recognized Utterance Input Speech Language Model P(W) LML Speech Recognition 2008

  11. ASR Architecture Evaluators Feature Extraction Recognition: Searching Strategies Speech Database, I/O HMM Initialisation and Training Common BaseClasses Configuration and Specification Language Models LML Speech Recognition 2008

  12. Signal ProcessingFunctionality • Acoustic Transducers • Sampling and Resampling • Temporal Analysis • Frequency Domain Analysis • Ceps-tral Analysis • Linear Prediction and LP-Based Representations • Spectral Normalization LML Speech Recognition 2008

  13. Acoustic Modeling: Feature Extraction • Incorporate knowledge of the • nature of speech sounds in • measurement of the features. • Utilize rudimentary models of • human perception. • Measure features 100 times per sec. • Use a 25 msec window forfrequency domain analysis. • Include absolute energy and 12 spectral measurements. • Time derivatives to model spectral change. Fourier Transform Input Speech Cepstral Analysis Perceptual Weighting Time Derivative Time Derivative Delta Energy + Delta Cepstrum Delta-Delta Energy + Delta-Delta Cepstrum Energy + Mel-Spaced Cepstrum LML Speech Recognition 2008

  14. Acoustic Modeling • Dynamic Programming • Markov Models • Parameter Estimation • HMM Training • Continuous Mixtures • Decision Trees • Limitations and Practical Issues of HMM LML Speech Recognition 2008

  15. Acoustic ModelingHidden Markov Models • Acoustic models encode the temporal evolution of the features (spectrum). • Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation. • Phonetic model topologies are simple left-to-right structures. • Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models. • Sharing model parameters is a common strategy to reduce complexity. LML Speech Recognition 2008

  16. Acoustic Modeling: Parameter Estimation • Initialization • Single • Gaussian • Estimation • 2-Way Split • Mixture • Distribution • Reestimation • 4-Way Split • Reestimation ••• • Closed-loop data-driven modeling supervised from a word-level transcription. • The expectation/maximization (EM) algorithm is used to improve our parameter estimates. • Computationally efficient training algorithms (Forward-Backward) have been crucial. • Batch mode parameter updates are typically preferred. • Decision trees are used to optimize parameter-sharing, system complexity, and the use of additional linguistic knowledge. LML Speech Recognition 2008

  17. Language Modeling • Formal Language Theory • Context-Free Grammars • N-Gram Models and Complexity • Smoothing LML Speech Recognition 2008

  18. Language Modeling LML Speech Recognition 2008

  19. Language Modeling: N-Grams • Unigrams (SWB): • Most Common: “I”, “and”, “the”, “you”, “a” • Rank-100: “she”, “an”, “going” • Least Common: “Abraham”, “Alastair”, “Acura” • Bigrams (SWB): • Most Common: “you know”, “yeah SENT!”, • “!SENT um-hum”, “I think” • Rank-100: “do it”, “that we”, “don’t think” • Least Common: “raw fish”, “moisture content”, • “Reagan Bush” • Trigrams (SWB): • Most Common: “!SENT um-hum SENT!”, “a lot of”, “I don’t know” • Rank-100: “it was a”, “you know that” • Least Common: “you have parents”, “you seen Brooklyn” LML Speech Recognition 2008

  20. LM: Integration of Natural Language • Natural language constraints • can be easily incorporated. • Lack of punctuation and search • space size pose problems. • Speech recognition typically • produces a word-level • time-aligned annotation. • Time alignments for other levels • of information also available. LML Speech Recognition 2008

  21. Search Algorithms and Data Structures • Basic Search Algorithms • Time Synchronous Search • Stack Decoding • Lexical Trees • Efficient Trees LML Speech Recognition 2008

  22. Dynamic Programming-Based Search • Search is time synchronous and left-to-right. • Arbitrary amounts of silence must be permitted between each word. • Words are hypothesized many times with different start/stop times, which significantly increases search complexity. LML Speech Recognition 2008

  23. Recognition Architectures • The signal is converted to a sequence of feature vectors based on spectral and temporal measurements. • Acoustic models represent sub-word units, such as phonemes, as a finite-state machine in which: • states model spectral structure and • transitions model temporal structure. Acoustic Front-end Acoustic Models P(A|W) • The language model predicts the next • set of words, and controls which models are hypothesized. Search • Search is crucial to the system, since • many combinations of words must be • investigated to find the most probable • word sequence. Recognized Utterance Input Speech Language Model P(W) LML Speech Recognition 2008

  24. Words Speech Recognition “How are you?” Speech Signal Speech Recognition Goal: Automatically extract the string of words spoken from the speech signal How is SPEECH produced? LML Speech Recognition 2008

More Related