Seminar: Speech Recognition, a Short Overview
E.M. Bakker
LIACS Media Lab, Leiden University
Introduction: What is Speech Recognition?

[Diagram: Speech Signal → Speech Recognition → Words, e.g. "How are you?"]

Goal: automatically extract the string of words spoken from the speech signal.

Other interesting areas:
• Who the talker is (speaker recognition, identification)
• Speech output (speech synthesis)
• What the words mean (speech understanding, semantics)
Recognition Architectures: A Communication-Theoretic Approach

[Diagram: Message Source → Linguistic Channel → Articulatory Channel → Acoustic Channel; observables along the way: message, words, sounds, features.]

• Bayesian formulation for speech recognition:
  P(W|A) = P(A|W) P(W) / P(A)
• Objective: minimize the word error rate.
• Approach: maximize P(W|A); since P(A) does not depend on the word sequence W, this is equivalent to maximizing P(A|W) P(W).
• Components:
  • P(A|W): acoustic model (hidden Markov models, Gaussian mixtures)
  • P(W): language model (statistical, finite-state networks, etc.)
• The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
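As a toy illustration of this decision rule (not from the original slides): since P(A) is the same for every candidate W, a decoder simply maximizes log P(A|W) + log P(W). The candidate sequences and scores below are invented.

```python
import math

# Hypothetical candidates with acoustic log-likelihoods log P(A|W)
# and language-model log-probs log P(W). P(A) is constant over W,
# so it can be dropped from the argmax.
candidates = {
    ("how", "are", "you"):  {"log_p_a_given_w": -120.3, "log_p_w": math.log(1e-4)},
    ("how", "art", "thou"): {"log_p_a_given_w": -118.9, "log_p_w": math.log(1e-9)},
}

def decode(candidates):
    """Return the word sequence maximizing log P(A|W) + log P(W)."""
    return max(candidates,
               key=lambda w: candidates[w]["log_p_a_given_w"]
                             + candidates[w]["log_p_w"])

print(decode(candidates))   # ('how', 'are', 'you'): the language model
                            # outweighs the slightly worse acoustic score
```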
Recognition Architectures: Incorporating Multiple Knowledge Sources

[Diagram: Input Speech → Acoustic Front-end → Acoustic Models P(A|W) → Search (constrained by the Language Model P(W)) → Recognized Utterance]

• Acoustic front-end: the signal is converted to a sequence of feature vectors based on spectral and temporal measurements.
• Acoustic models P(A|W) represent sub-word units, such as phonemes, as finite-state machines in which states model spectral structure and transitions model temporal structure.
• The language model P(W) predicts the next set of words and controls which models are hypothesized.
• Search is crucial to the system, since many combinations of words must be investigated to find the most probable word sequence.
Acoustic Modeling: Feature Extraction

• Incorporate knowledge of the nature of speech sounds in the measurement of the features.
• Utilize rudimentary models of human perception.
• Measure features 100 times per second.
• Use a 25 msec window for frequency-domain analysis.
• Include absolute energy and 12 spectral measurements.
• Use time derivatives to model spectral change.

[Diagram: Input Speech → Fourier Transform → Cepstral Analysis → Perceptual Weighting → Energy + Mel-Spaced Cepstrum; a first Time Derivative yields Delta Energy + Delta Cepstrum, a second yields Delta-Delta Energy + Delta-Delta Cepstrum.]
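A rough sketch of such a front-end, assuming 16 kHz audio: 25 ms Hamming windows every 10 ms, log energy plus 12 cepstral coefficients, and first and second time derivatives. Mel-frequency warping and perceptual weighting are deliberately omitted to keep the sketch short; all names and parameters are illustrative.

```python
import numpy as np

def features(signal, sr=16000, win=0.025, hop=0.010, n_cep=12):
    """Simplified front-end: log energy + 12 cepstral coefficients
    per 25 ms frame (100 frames/sec), plus delta and delta-delta
    features. Mel warping and perceptual weighting are omitted."""
    wlen, step = int(win * sr), int(hop * sr)
    window = np.hamming(wlen)
    frames = [signal[i:i + wlen] * window
              for i in range(0, len(signal) - wlen + 1, step)]
    feats = []
    for f in frames:
        power = np.abs(np.fft.rfft(f)) ** 2 + 1e-10
        cep = np.fft.irfft(np.log(power))[1:n_cep + 1]   # real cepstrum
        energy = np.log(np.sum(f ** 2) + 1e-10)
        feats.append(np.concatenate(([energy], cep)))
    feats = np.array(feats)
    delta = np.gradient(feats, axis=0)    # first time derivative
    delta2 = np.gradient(delta, axis=0)   # second time derivative
    return np.hstack([feats, delta, delta2])

obs = features(np.random.randn(16000))    # 1 second of noise
print(obs.shape)                          # (98, 39): 39 features per frame
```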
Acoustic Modeling: Hidden Markov Models

• Acoustic models encode the temporal evolution of the features (spectrum).
• Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
• Phonetic model topologies are simple left-to-right structures.
• Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models.
• Sharing model parameters is a common strategy to reduce complexity.
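A minimal sketch of such a topology as a transition matrix: three emitting states, self-loops for time-warping, and one skip transition. The probabilities are invented, not trained values.

```python
import numpy as np

# Left-to-right HMM topology. Row i holds the transition
# probabilities out of state i; rows sum to 1.
A = np.array([
    # stay  next  skip
    [0.6,   0.3,  0.1],   # state 0: self-loop, advance, or skip state 1
    [0.0,   0.7,  0.3],   # state 1: self-loop or advance
    [0.0,   0.0,  1.0],   # state 2: absorbing final state (simplified)
])
assert np.allclose(A.sum(axis=1), 1.0)
```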
Acoustic Modeling: Parameter Estimation

[Diagram: Initialization → Single Gaussian Estimation → 2-Way Split → Mixture Distribution Reestimation → 4-Way Split → Reestimation → ...]

• Closed-loop, data-driven modeling, supervised only by a word-level transcription.
• The expectation-maximization (EM) algorithm is used to improve the parameter estimates.
• Computationally efficient training algorithms (Forward-Backward) have been crucial.
• Batch-mode parameter updates are typically preferred.
• Decision trees are used to optimize parameter-sharing, system complexity, and the use of additional linguistic knowledge.
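The mixture-splitting step in this procedure can be sketched as follows (a hedged illustration under simple assumptions, not the original implementation): each Gaussian is split in two by perturbing its mean, the weights are halved, and EM reestimation then refines the doubled mixture.

```python
import numpy as np

def split_mixture(means, variances, weights, eps=0.2):
    """2-way split of a diagonal-covariance Gaussian mixture: perturb
    each mean by +/- eps standard deviations, halve the weights, and
    keep the variances. EM reestimation then refines the new mixture."""
    std = np.sqrt(variances)
    return (np.concatenate([means - eps * std, means + eps * std]),
            np.concatenate([variances, variances]),
            np.concatenate([weights / 2.0, weights / 2.0]))

# Start from a single Gaussian and split twice: 1 -> 2 -> 4 components.
m, v, w = np.zeros((1, 39)), np.ones((1, 39)), np.array([1.0])
for _ in range(2):
    m, v, w = split_mixture(m, v, w)
    # ... EM reestimation of (m, v, w) would run here ...
print(m.shape, w)   # (4, 39) [0.25 0.25 0.25 0.25]
```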
Language Modeling: N-Grams: The Good, The Bad, and The Ugly

Unigrams (SWB):
• Most common: "I", "and", "the", "you", "a"
• Rank 100: "she", "an", "going"
• Least common: "Abraham", "Alastair", "Acura"

Bigrams (SWB):
• Most common: "you know", "yeah SENT!", "!SENT um-hum", "I think"
• Rank 100: "do it", "that we", "don't think"
• Least common: "raw fish", "moisture content", "Reagan Bush"

Trigrams (SWB):
• Most common: "!SENT um-hum SENT!", "a lot of", "I don't know"
• Rank 100: "it was a", "you know that"
• Least common: "you have parents", "you seen Brooklyn"

(SWB = the Switchboard corpus; !SENT and SENT! denote sentence-boundary tokens.)
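As a sketch of where such statistics come from (not the slides' actual toolkit), the snippet below computes maximum-likelihood bigram estimates from counts; `<s>` and `</s>` stand in for the sentence-boundary tokens, and real systems add smoothing for unseen bigrams.

```python
from collections import Counter

def bigram_lm(sentences):
    """Maximum-likelihood bigram estimates with sentence-boundary
    tokens. Unsmoothed: unseen bigrams get probability zero."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

p = bigram_lm([["you", "know"], ["you", "know", "I", "think"]])
print(p("you", "know"))   # 1.0: "know" always follows "you" in this data
print(p("know", "I"))     # 0.5
```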
Language Modeling: Integration of Natural Language

• Natural language constraints can be easily incorporated.
• Lack of punctuation and search-space size pose problems.
• Speech recognition typically produces a word-level, time-aligned annotation.
• Time alignments for other levels of information are also available.
Implementation Issues: Search Is Resource Intensive

• Typical LVCSR systems have about 10M free parameters, which makes training a challenge.
• Large speech databases are required (several hundred hours of speech).
• Tying, smoothing, and interpolation are required.
Implementation Issues: Dynamic Programming-Based Search

• Dynamic programming is used to find the most probable path through the network.
• Beam search is used to control resources.
• Search is time-synchronous and left-to-right.
• Arbitrary amounts of silence must be permitted between words.
• Words are hypothesized many times with different start/stop times, which significantly increases search complexity.
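As a concrete illustration of the recursion (not the slides' actual decoder), the sketch below runs a time-synchronous Viterbi pass over a single HMM with full search; beam pruning and the per-word start/stop bookkeeping mentioned above are omitted. All names and values are illustrative.

```python
import numpy as np

def viterbi(log_b, log_A, log_pi):
    """Time-synchronous Viterbi decoding for one HMM.
    log_b:  (T, N) per-frame state log-likelihoods log b_j(o_t)
    log_A:  (N, N) transition log-probabilities
    log_pi: (N,)   initial state log-probabilities
    Returns the most probable state sequence."""
    T, N = log_b.shape
    score = log_pi + log_b[0]              # best score ending in each state
    back = np.zeros((T, N), dtype=int)     # best-predecessor pointers
    for t in range(1, T):                  # left-to-right over frames
        cand = score[:, None] + log_A      # cand[i, j]: leave i, enter j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_b[t]
    path = [int(score.argmax())]           # backtrace from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Illustrative run: 50 random frames through a 3-state left-to-right model.
rng = np.random.default_rng(0)
A = np.array([[0.6, 0.3, 0.1], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]])
path = viterbi(np.log(rng.uniform(0.1, 1.0, (50, 3))),
               np.log(A + 1e-12),
               np.log(np.array([1.0, 0.0, 0.0]) + 1e-12))
print(path[:10])
```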
Implementation Issues: Cross-Word Decoding Is Expensive

• Cross-word decoding: since word boundaries don't occur in spontaneous speech, we must allow for sequences of sounds that span word boundaries.
• Cross-word decoding significantly increases memory requirements.
Applications: Conversational Speech

• Conversational speech collected over the telephone contains background noise, music, fluctuations in the speech rate, laughter, partial words, hesitations, mouth noises, etc.
• The WER (word error rate) has decreased from 100% to 30% in six years.
• Typical phenomena: laughter, singing, unintelligible speech, spoonerisms, background speech, no pauses, restarts, vocalized noise, coinage.
Applications: Audio Indexing of Broadcast News

Broadcast news offers some unique challenges:
• Lexicon: important information is carried by infrequently occurring words.
• Acoustic modeling: variations in channel, particularly within the same segment ("in the studio" vs. "on location").
• Language model: must adapt ("Bush," "Clinton," "Bush," "McCain," "???").
• Language: multilingual systems? Language-independent acoustic modeling?
Applications: Real-Time Translation

Imagine a world where:
• You book a travel reservation from your cellular phone while driving in your car, without ever talking to a human (database query).
• You converse with someone in a foreign country and neither speaker speaks a common language (universal translator).
• You place a call to your bank to inquire about your bank account and never have to remember a password (transparent telephony).
• You can ask questions by voice and your Internet browser returns answers to your questions (intelligent query).

From President Clinton's State of the Union address (January 27, 2000):
"These kinds of innovations are also propelling our remarkable prosperity... Soon researchers will bring us devices that can translate foreign languages as fast as you can talk... molecular computers the size of a tear drop with the power of today's fastest supercomputers."

Human language engineering: a sophisticated integration of many speech- and language-related technologies... a science for the next millennium.
Speech Recognition
Erwin M. Bakker
Leiden University
THE SPEECH RECOGNITION PROBLEM

• Boundaries between words or phonemes are not evident in the signal.
• Large variations in speaking rates; in fluent speech, words and word endings are less pronounced.
• A great deal of inter- as well as intra-speaker variability.
• Quality of the speech signal.
• Task-inherent syntactic-semantic constraints should be exploited.
STATISTICAL METHODS IN SPEECH RECOGNITION

• The Bayesian Approach
• Acoustic Models
• Language Models
Acoustic Models (HMM)

[Figure: some typical HMM topologies used for acoustic modeling in large-vocabulary speech recognition: (a) typical triphone, (b) short pause, (c) silence. The shaded states denote the start and stop states of each model.]
SEARCH ALGORITHMS

• The Complexity of Search
• Typical Search Algorithms:
  • Viterbi Search
  • Stack Decoders
  • Multi-Pass Search
  • Forward-Backward Search
Complexity of Search

• Lexicon: contains all the words in the system's vocabulary along with their pronunciations (often there are multiple pronunciations per word).
• Acoustic models: HMMs that represent the basic sound units the system is capable of recognizing.
• Language model: determines the possible word sequences allowed by the system (encodes knowledge of the syntax and semantics of the language).
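A toy illustration of the lexicon component; the words, phone symbols, and data layout are invented:

```python
# Lexicon: every word in the vocabulary mapped to one or more
# pronunciations, each a sequence of basic sound units (phones).
LEXICON = {
    "the":    [["dh", "ah"], ["dh", "iy"]],
    "tomato": [["t", "ah", "m", "ey", "t", "ow"],
               ["t", "ah", "m", "aa", "t", "ow"]],
}
for word, prons in LEXICON.items():
    print(word, "->", [" ".join(p) for p in prons])
```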
References

• N. Deshmukh, A. Ganapathiraju, and J. Picone, "Hierarchical Search for Large Vocabulary Conversational Speech Recognition," IEEE Signal Processing Magazine, September 1999.
• H. Ney and S. Ortmanns, "Dynamic Programming Search for Continuous Speech Recognition," IEEE Signal Processing Magazine, September 1999.
• V. Zue, "Talking with Your Computer," Scientific American, August 1999.
[Figure: relative complexity of the search problem for large-vocabulary conversational speech recognition.]
A TIME-SYNCHRONOUS VITERBI-BASED DECODER

• Complexity of Search
• Network Decoding
• N-Gram Decoding
• Cross-Word Acoustic Models
• Search Space Organization
• Lexical Trees
• Language Model Lookahead
• Acoustic Evaluation
Network decoding using word-internal context-dependent models:
• the word network providing the linguistic constraints,
• the pronunciation lexicon for the words involved,
• the network expanded using the corresponding word-internal triphones derived from the pronunciations of the words.
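A sketch of the expansion step, using one common naming convention for context-dependent units ("left-center+right"); the boundary marker "#" and the function name are illustrative:

```python
def word_internal_triphones(phones):
    """Expand a pronunciation into word-internal context-dependent
    units, leaving word boundaries open ("#"); cross-word decoding
    would fill those contexts in from the neighboring words."""
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "#"
        right = phones[i + 1] if i < len(phones) - 1 else "#"
        out.append(f"{left}-{p}+{right}")
    return out

print(word_internal_triphones(["hh", "aw"]))   # ['#-hh+aw', 'hh-aw+#']
```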
A TIME-SYNCHRONOUS VITERBI-BASED DECODER

Search Space Reduction

• Pruning:
  • setting pruning beams based on the hypothesis score
  • limiting the total number of model instances active at a given time
  • setting an upper bound on the number of words allowed to end at a given frame
• Path Merging
• Word Graph Compaction
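A minimal sketch of the first two pruning criteria listed above (a score beam relative to the best hypothesis, then a cap on the number of active instances); the data layout and threshold values are invented:

```python
def prune(hypotheses, beam=200.0, max_active=1000):
    """hypotheses maps a state/model instance id to its current log
    score. Keep only hypotheses within `beam` of the best score,
    then cap the number of survivors at `max_active`."""
    best = max(hypotheses.values())
    survivors = {h: s for h, s in hypotheses.items() if s >= best - beam}
    ranked = sorted(survivors.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:max_active])

active = {"inst_3": -110.2, "inst_7": -150.9, "inst_2": -410.0}
print(prune(active, beam=200.0, max_active=2))
# {'inst_3': -110.2, 'inst_7': -150.9}: inst_2 falls outside the beam
```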
A TIME-SYNCHRONOUS VITERBI-BASED DECODER

• System Architecture

PERFORMANCE ANALYSIS

• A substitution error refers to the case where the decoder misrecognizes a word in the reference sequence as another word in the hypothesis.
• A deletion error occurs when no word is recognized corresponding to a word in the reference transcription.
• An insertion error corresponds to the case where the hypothesis contains an extra word that has no counterpart in the reference.
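These three error counts come from a Levenshtein (edit-distance) alignment of the hypothesis against the reference; the word error rate is (S + D + I) divided by the number of reference words. A minimal sketch:

```python
def wer_counts(ref, hyp):
    """Count substitutions, deletions, and insertions via a standard
    Levenshtein alignment of reference and hypothesis word lists."""
    R, H = len(ref), len(hyp)
    # d[i][j] = (total errors, subs, dels, ins) aligning ref[:i] to hyp[:j]
    d = [[None] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = (i, 0, i, 0)          # i deletions
    for j in range(H + 1):
        d[0][j] = (j, 0, 0, j)          # j insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]            # exact match, no error
            else:
                best = min(d[i - 1][j - 1], d[i - 1][j], d[i][j - 1])
                e, s, dl, ins = best
                if best == d[i - 1][j - 1]:
                    d[i][j] = (e + 1, s + 1, dl, ins)    # substitution
                elif best == d[i - 1][j]:
                    d[i][j] = (e + 1, s, dl + 1, ins)    # deletion
                else:
                    d[i][j] = (e + 1, s, dl, ins + 1)    # insertion
    return d[R][H]

errs, subs, dels, ins = wer_counts("the cat sat".split(), "a cat sat down".split())
print(subs, dels, ins, f"WER = {errs / 3:.0%}")   # 1 0 1 WER = 67%
```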
A TIME-SYNCHRONOUS VITERBI-BASED DECODER

• Scalability: Can the algorithm scale gracefully from small, constrained tasks to large, unconstrained tasks?
• Recognition accuracy: How accurate is the best word sequence found by the system?
• Word graph accuracy: Can the system generate alternate choices that contain the correct word sequence? How large must this list of choices be?
• Memory: What memory is required to achieve optimal performance? How does performance vary with the amount of memory required?
• Run-time: How many seconds of CPU time per second of speech are required (xRT) to achieve optimal performance? How does run-time vary with performance (run-time should decrease significantly as error rates increase)?
A TIME-SYNCHRONOUS VITERBI-BASED DECODER

• Evaluation tasks: Alphadigits, Switchboard
• Pruning strategies: Beam Pruning, MAPMI Pruning
Comparisons performed on a 333 MHz Pentium II processor with 512 MB RAM.