Evaluation Metrics: Evolution
• Spontaneous telephone speech is still a "grand challenge".
• Telephone-quality speech is still central to the problem.
• The vision for speech technology continues to evolve.
• Broadcast news is a very dynamic domain.
[Figure: Word Error Rate (0% to 40%) vs. level of difficulty, ordered from Command and Control, Digits, Letters and Numbers, and Continuous Digits through Read Speech and Broadcast News up to Conversational Speech.]
Evaluation Metrics: Human Performance
• Human performance exceeds machine performance by a factor ranging from 4x to 10x depending on the task.
• On some tasks, such as credit card number recognition, machine performance exceeds humans due to human memory retrieval capacity.
• The nature of the noise is as important as the SNR (e.g., cellular phones).
• A primary failure mode for humans is inattention.
• A second major failure mode is lack of familiarity with the domain (i.e., business terms and corporation names).
[Figure: Word Error Rate vs. speech-to-noise ratio (quiet, 22 dB, 16 dB, 10 dB) on Wall Street Journal data with additive noise, comparing machines with a committee of human listeners.]
Evaluation Metrics: Machine Performance
• Common evaluations fuel technology development.
• Tasks become progressively more ambitious and challenging.
• A Word Error Rate (WER) below 10% is considered acceptable.
• Performance in the field is typically 2x to 4x worse than performance on an evaluation.
[Figure: WER history, 1988-2003, on a log scale from 100% down to 1%, covering read speech (1k, 5k, and 20k word vocabularies; noisy and foreign-language conditions), varied microphones, spontaneous speech, broadcast speech, and conversational speech.]
Evaluation Metrics: Beyond WER: Named Entity F-Measure
• Information extraction is the analysis of natural language to collect information about specified types of entities.
• An example of named entity annotation:
  Mr. <en type="person">Sears</en> bought a new suit at <en type="org">Sears</en> in <en type="location">Washington</en> <time type="date">yesterday</time>
• Evaluation metrics (a scoring sketch follows this slide):
  Recall = (# slots correctly filled) / (# slots filled in the key)
  Precision = (# slots correctly filled) / (# slots filled by the system)
  F-Measure = (2 x Recall x Precision) / (Recall + Precision)
• As the focus shifts to providing enhanced annotations, WER may not be the most appropriate measure of performance (content-based scoring).
[Figure: Named entity F-measure vs. Word Error Rate on the Hub-4 Eval '98 data.]
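A minimal sketch of the slot-based scoring formulas above; the slot counts are hypothetical and this is not the official Hub-4 scorer.

```python
# Minimal sketch of the slot-based scoring defined above (hypothetical counts).

def ne_scores(num_correct, num_key_slots, num_system_slots):
    """Return recall, precision, and F-measure for named entity slot filling."""
    recall = num_correct / num_key_slots          # correct slots / slots in the key
    precision = num_correct / num_system_slots    # correct slots / slots the system filled
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure

# Example: 80 correctly filled slots, 100 slots in the reference key,
# 90 slots hypothesized by the system.
print(ne_scores(80, 100, 90))   # -> (0.8, 0.888..., 0.842...)
```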
Recognition Architectures: Why Is Speech Recognition So Difficult?
• Comparison of "aa" in "lOck" vs. "iy" in "bEAt" for conversational speech (SWB).
• Our measurements of the signal are ambiguous.
• The region of overlap represents classification errors.
• Reduce overlap by introducing acoustic and linguistic context (e.g., context-dependent phones).
[Figure: Scatter plot of two acoustic measurements (Feature No. 1 vs. Feature No. 2) for three phone classes (Ph_1, Ph_2, Ph_3) with overlapping distributions.]
Recognition Architectures: A Communication Theoretic Approach
• The message source passes through a linguistic channel (words), an articulatory channel (sounds), and an acoustic channel; only the acoustic features are observable.
• Bayesian formulation for speech recognition: P(W|A) = P(A|W) P(W) / P(A)
• Objective: minimize the word error rate.
• Approach: maximize P(W|A) during training (a decoding sketch follows this slide).
• Components:
  P(A|W): acoustic model (hidden Markov models, mixtures)
  P(W): language model (statistical, finite state networks, etc.)
• The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
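A minimal sketch of the decision rule implied by the Bayesian formulation above: since P(A) is the same for every hypothesis, the recognizer picks the word sequence W that maximizes P(A|W) P(W). The hypotheses and log scores below are hypothetical, not the output of an actual system.

```python
# Choose the word sequence maximizing log P(A|W) + log P(W); P(A) cancels out.

def decode(acoustic_logprob, language_logprob):
    """Return the hypothesis with the highest combined log score."""
    return max(acoustic_logprob,
               key=lambda w: acoustic_logprob[w] + language_logprob[w])

# Hypothetical log probabilities for three competing transcriptions.
acoustic = {"recognize speech": -12.0, "wreck a nice beach": -11.5, "recognise peach": -14.0}
language = {"recognize speech": -3.0, "wreck a nice beach": -6.0, "recognise peach": -9.0}

print(decode(acoustic, language))   # -> "recognize speech"
```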
Recognition Architectures: Incorporating Multiple Knowledge Sources
• The signal is converted to a sequence of feature vectors based on spectral and temporal measurements.
• Acoustic models represent sub-word units, such as phonemes, as a finite-state machine in which states model spectral structure and transitions model temporal structure.
• The language model predicts the next set of words and controls which models are hypothesized.
• Search is crucial to the system, since many combinations of words must be investigated to find the most probable word sequence.
[Figure: System block diagram: Input Speech -> Acoustic Front-end -> Search -> Recognized Utterance, with the Acoustic Models P(A|W) and the Language Model P(W) feeding the search.]
Acoustic Modeling: Feature Extraction
• Incorporate knowledge of the nature of speech sounds in the measurement of the features.
• Utilize rudimentary models of human perception.
• Measure features 100 times per second.
• Use a 25 msec window for frequency domain analysis.
• Include absolute energy and 12 spectral measurements.
• Use time derivatives to model spectral change (a front-end sketch follows this slide).
[Figure: Front-end block diagram: Input Speech -> Fourier Transform -> Cepstral Analysis -> Perceptual Weighting, producing Energy + Mel-Spaced Cepstrum; first and second time derivatives yield Delta Energy + Delta Cepstrum and Delta-Delta Energy + Delta-Delta Cepstrum.]
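A minimal sketch of such a front-end, assuming the librosa library as a stand-in for the mel-spaced cepstral analysis described above; the original system used its own signal processing, and the file name and sample rate here are hypothetical.

```python
import numpy as np
import librosa

# 13 base features (an energy-like C0 plus 12 mel-spaced cepstral coefficients),
# computed 100 times per second over a 25 msec window, with first and second
# time derivatives appended to capture spectral change.
y, sr = librosa.load("speech.wav", sr=16000)   # hypothetical input file
hop = sr // 100                                # 10 msec hop -> 100 frames per second
win = int(0.025 * sr)                          # 25 msec analysis window

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop, win_length=win)
delta = librosa.feature.delta(mfcc)            # delta (first derivative) features
delta2 = librosa.feature.delta(mfcc, order=2)  # delta-delta features

features = np.vstack([mfcc, delta, delta2])    # 39 features per frame
print(features.shape)                          # (39, number_of_frames)
```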
Acoustic Modeling: Hidden Markov Models
• Acoustic models encode the temporal evolution of the features (spectrum).
• Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
• Phonetic model topologies are simple left-to-right structures (a topology sketch follows this slide).
• Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of the models.
• Sharing model parameters is a common strategy to reduce complexity.
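A minimal sketch of one such left-to-right topology; the transition probabilities below are illustrative values, not trained parameters.

```python
import numpy as np

# Three-state left-to-right phone model: each state may loop on itself
# (modeling duration), advance to the next state, or skip ahead one state
# (time-warping).  No backward transitions are allowed.
transitions = np.array([
    [0.6, 0.3, 0.1],   # state 0: self-loop, next state, skip
    [0.0, 0.7, 0.3],   # state 1: self-loop, next state
    [0.0, 0.0, 1.0],   # state 2: final state of the phone model
])

assert np.allclose(transitions.sum(axis=1), 1.0)    # each row is a distribution
assert np.allclose(np.tril(transitions, -1), 0.0)   # strictly left-to-right
```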
Acoustic Modeling: Parameter Estimation
• Closed-loop, data-driven modeling supervised only by a word-level transcription.
• Mixtures are grown by successive splitting: initialization, single Gaussian estimation, 2-way split, mixture distribution reestimation, 4-way split, reestimation, and so on (a splitting sketch follows this slide).
• The expectation-maximization (EM) algorithm is used to improve our parameter estimates.
• Computationally efficient training algorithms (Forward-Backward) have been crucial.
• Batch-mode parameter updates are typically preferred.
• Decision trees are used to optimize parameter sharing, system complexity, and the use of additional linguistic knowledge.
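A minimal sketch of the mixture-splitting step in that schedule: each Gaussian is split into two components by perturbing its mean, after which the enlarged mixture would be reestimated with EM (the reestimation itself is omitted, and the perturbation size is an assumed value).

```python
import numpy as np

def split_mixture(means, variances, weights, epsilon=0.2):
    """Double the number of Gaussian components by perturbing each mean
    along its standard deviation and halving the component weights."""
    offset = epsilon * np.sqrt(variances)
    new_means = np.concatenate([means - offset, means + offset])
    new_variances = np.concatenate([variances, variances])
    new_weights = np.concatenate([weights / 2.0, weights / 2.0])
    return new_means, new_variances, new_weights

# Start from a single Gaussian and split 1 -> 2 -> 4 components,
# reestimating with EM (not shown) after each split.
m, v, w = np.array([0.0]), np.array([1.0]), np.array([1.0])
for _ in range(2):
    m, v, w = split_mixture(m, v, w)
print(len(m))   # -> 4
```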
Language Modeling: N-Grams: The Good, The Bad, and The Ugly
• Unigrams (SWB):
  Most Common: "I", "and", "the", "you", "a"
  Rank-100: "she", "an", "going"
  Least Common: "Abraham", "Alastair", "Acura"
• Bigrams (SWB):
  Most Common: "you know", "yeah SENT!", "!SENT um-hum", "I think"
  Rank-100: "do it", "that we", "don't think"
  Least Common: "raw fish", "moisture content", "Reagan Bush"
• Trigrams (SWB):
  Most Common: "!SENT um-hum SENT!", "a lot of", "I don't know"
  Rank-100: "it was a", "you know that"
  Least Common: "you have parents", "you seen Brooklyn"
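A minimal sketch of how such n-gram statistics are gathered: count consecutive word tuples over a corpus and rank them by frequency. The toy corpus below is made up; the actual tables above come from the Switchboard (SWB) corpus.

```python
from collections import Counter
from itertools import islice

def ngrams(tokens, n):
    """Yield consecutive n-grams from a list of tokens."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

corpus = "you know I think you know I do not know".split()   # toy corpus
bigram_counts = Counter(ngrams(corpus, 2))

print(bigram_counts.most_common(3))
# e.g. [(('you', 'know'), 2), (('know', 'I'), 2), (('I', 'think'), 1)]
```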
Language Modeling: Integration of Natural Language
• Natural language constraints can be easily incorporated.
• Lack of punctuation and the size of the search space pose problems.
• Speech recognition typically produces a word-level, time-aligned annotation (a sketch follows this slide).
• Time alignments for other levels of information are also available.
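A minimal sketch, with hypothetical words and times, of the word-level time-aligned annotation described above; other levels of information (phones, named entities) can be aligned the same way.

```python
from dataclasses import dataclass

@dataclass
class WordToken:
    word: str
    start: float   # start time in seconds
    end: float     # end time in seconds

# Hypothetical recognizer output for one utterance.
hypothesis = [
    WordToken("the", 0.21, 0.33),
    WordToken("meeting", 0.33, 0.80),
    WordToken("starts", 0.80, 1.25),
    WordToken("at", 1.25, 1.38),
    WordToken("noon", 1.38, 1.90),
]

for token in hypothesis:
    print(f"{token.start:5.2f} {token.end:5.2f}  {token.word}")
```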
Implementation Issues: Search Is Resource Intensive
• Typical LVCSR systems have about 10M free parameters, which makes training a challenge.
• Large speech databases are required (several hundred hours of speech).
• Tying, smoothing, and interpolation are required.
Implementation Issues: Dynamic Programming-Based Search
• Dynamic programming is used to find the most probable path through the network.
• Beam search is used to control resources (a pruning sketch follows this slide).
• Search is time-synchronous and left-to-right.
• Arbitrary amounts of silence must be permitted between each word.
• Words are hypothesized many times with different start/stop times, which significantly increases search complexity.
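A minimal sketch of time-synchronous Viterbi decoding with beam pruning, assuming toy per-frame acoustic scores and a small transition table rather than a full LVCSR network.

```python
import math

def beam_search(frames, transitions, beam=10.0):
    """frames: list of dicts mapping state -> log acoustic score for that frame.
       transitions: dict mapping (previous_state, state) -> log transition score.
       At each frame, keep the best score per state and prune hypotheses that
       fall more than `beam` below the current best."""
    active = {"<start>": 0.0}
    for frame in frames:
        scores = {}
        for prev_state, prev_score in active.items():
            for state, acoustic in frame.items():
                trans = transitions.get((prev_state, state), -math.inf)
                score = prev_score + trans + acoustic
                if score > scores.get(state, -math.inf):
                    scores[state] = score                               # dynamic programming step
        best = max(scores.values())
        active = {s: v for s, v in scores.items() if v > best - beam}   # beam pruning
    return max(active, key=active.get)

# Toy two-frame example with two states.
frames = [{"s1": -1.0, "s2": -2.0}, {"s1": -1.5, "s2": -0.5}]
transitions = {("<start>", "s1"): -0.1, ("<start>", "s2"): -0.5,
               ("s1", "s1"): -0.2, ("s1", "s2"): -0.3, ("s2", "s2"): -0.2}
print(beam_search(frames, transitions))   # -> "s2"
```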
Implementation Issues: Cross-Word Decoding Is Expensive
• Cross-word decoding: since clear acoustic boundaries between words do not occur in spontaneous speech, we must allow for sequences of sounds that span word boundaries.
• Cross-word decoding significantly increases memory requirements.
Technology: Conversational Speech
• Conversational speech collected over the telephone contains background noise, music, fluctuations in the speech rate, laughter, partial words, hesitations, mouth noises, etc.
• WER has decreased from 100% to 30% in six years.
• Typical phenomena include laughter, singing, unintelligible speech, spoonerisms, background speech, missing pauses, restarts, vocalized noise, and coinages.
Technology: Audio Indexing of Broadcast News
• Broadcast news offers some unique challenges:
• Lexicon: important information is carried by infrequently occurring words.
• Acoustic Modeling: variations in channel, particularly within the same segment ("in the studio" vs. "on location").
• Language Model: must adapt ("Bush," "Clinton," "Bush," "McCain," "???").
• Language: multilingual systems? language-independent acoustic modeling?
Technology: Real-Time Translation
• Imagine a world where:
• You book a travel reservation from your cellular phone while driving in your car without ever talking to a human (database query).
• You converse with someone in a foreign country and neither speaker speaks a common language (universal translator).
• You place a call to your bank to inquire about your bank account and never have to remember a password (transparent telephony).
• You can ask questions by voice and your Internet browser returns answers to your questions (intelligent query).
• From President Clinton's State of the Union address (January 27, 2000): "These kinds of innovations are also propelling our remarkable prosperity... Soon researchers will bring us devices that can translate foreign languages as fast as you can talk... molecular computers the size of a tear drop with the power of today's fastest supercomputers."
• Human Language Engineering: a sophisticated integration of many speech and language related technologies... a science for the next millennium.
Technology: Future Directions
• What have we learned?
  Supervised training is a good machine learning technique.
  Large databases are essential for the development of robust statistics.
• What are the challenges?
  Discrimination vs. representation.
  Generalization vs. memorization.
  Pronunciation modeling.
  Human-centered language modeling.
• What are the algorithmic issues for the next decade?
  Better features by extracting articulatory information?
  Bayesian statistics? Bayesian networks?
  Decision trees? Information-theoretic measures?
  Nonlinear dynamics? Chaos?
[Figure: Timeline, 1960 to 2000, of dominant approaches: Analog Filter Banks, Dynamic Time-Warping, Hidden Markov Models.]
To Probe Further: References

Journals and Conferences:
[1] N. Deshmukh, et al., "Hierarchical Search for Large Vocabulary Conversational Speech Recognition," IEEE Signal Processing Magazine, vol. 16, no. 5, pp. 84-107, September 1999.
[2] N. Deshmukh, et al., "Benchmarking Human Performance for Continuous Speech Recognition," Proceedings of the Fourth International Conference on Spoken Language Processing, pp. SuP1P1.10, Philadelphia, Pennsylvania, USA, October 1996.
[3] R. Grishman, "Information Extraction and Speech Recognition," presented at the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, Virginia, USA, February 1998.
[4] R. P. Lippmann, "Speech Recognition by Machines and Humans," Speech Communication, vol. 22, pp. 1-15, July 1997.
[5] M. Maybury (editor), "News on Demand," Communications of the ACM, vol. 43, no. 2, February 2000.
[6] D. Miller, et al., "Named Entity Extraction from Broadcast News," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[7] D. Pallett, et al., "Broadcast News Benchmark Test Results," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[8] J. Picone, "Signal Modeling Techniques in Speech Recognition," Proceedings of the IEEE, vol. 81, no. 9, pp. 1215-1247, September 1993.
[9] P. Robinson, et al., "Overview: Information Extraction from Broadcast News," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[10] F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1998.

URLs and Resources:
[11] "Speech Corpora," The Linguistic Data Consortium, http://www.ldc.upenn.edu.
[12] "Technology Benchmarks," Spoken Natural Language Processing Group, The National Institute of Standards and Technology, http://www.itl.nist.gov/iaui/894.01/index.html.
[13] "Signal Processing Resources," Institute for Signal and Information Processing, Mississippi State University, http://www.isip.msstate.edu.
[14] "Internet-Accessible Speech Recognition Technology," http://www.isip.msstate.edu/projects/speech/index.html.
[15] "A Public Domain Speech Recognition System," http://www.isip.msstate.edu/projects/speech/software/index.html.
[16] "Remote Job Submission," http://www.isip.msstate.edu/projects/speech/experiments/index.html.
[17] "The Switchboard Corpus," http://www.isip.msstate.edu/projects/switchboard/index.html.