Progress & Challenges in Automatic Recognition and Understanding of Spoken Language

Progress & Challenges in Automatic Recognition and Understanding of Spoken Language
B.H. Juang, bjuang@lucent.com
Sadaoki Furui, furui@cs.titech.ac.jp

Purpose of the Workshop
• To find out:
  • Where we are
    - What problems have been solved?
    - What problems are still open?
    - How to use the current technology?
  • Which way to go
    - Which research direction to take?
    - What needs to be done?
    - How to get there?
Juang, Workshop-2000, Summit, NJ

Motivation
• Worldwide investment in speech recognition and synthesis is estimated at $400M annually - 2000 people in the field.
• IBM demonstrated Chinese speech dictation software at the Great Hall of the People in China - speech software is ubiquitous.
• Trade magazines cautioned users to lower expectations of PC voice-recognition software - “Treat it like your dog”?
• Application programmers complain about the “bugs” - same voice commands, different results at different times.
• Many people turn off PC speech recognition or synthesis features after < 1 week of use - not sticky enough?
• So what is going on?

General Landscape & Talk Outline
[Figure labels: Human Machine Interaction; Spoken Language Processing; Speech Recognition & Understanding; Statistical & Computational Methods]
• Progress in spoken language technology
• Paths passed
• Forward looking issues

History of DARPA Speech Recognition Benchmark Tests
[Figure: word error rate (1% to 100%) versus year (1988 to 2003). Task and condition labels: Resource Management, ATIS, WSJ, NAB, Switchboard; Read Speech, Spontaneous Speech, Broadcast Speech, Conversational Speech; 1k, 5k, 20k; Noisy, Varied Microphone, foreign. Courtesy NIST 1999 DARPA HUB-4 Report, Pallett et al.]

Speech/Speaker Recognition Technology
[Figure: Speaking Style (Isolated Word, Connected Speech, Read Speech, Fluent Speech, Spontaneous Speech) versus Vocabulary Size (2, 20, 200, 2000, 20000), with year markers 1980, 1990, 1998, 2000. Applications shown: voice commands, directory assistance, speaker verification, name dialing, digit strings, form fill by voice, office dictation, word spotting, system driven dialog, transcription, network agent & intelligent messaging, 2-way dialog, natural conversation.]

ASR Technology Progress
[Figure: relative performance reliability (accuracy, range of operating conditions), low to high, versus flexibility & degree of difficulty (vocabulary size, perplexity, speaking style), from few words in isolation to conversation by voice. ASR performance ranges are shown for the 1980’s, 1990’s and 200X against a deployable performance threshold.]

Challenge - Issue #1
• Which applications are viable given the current state of the art?
• Sub-issues:
  • What are the deployable thresholds for various applications?
  • Which applications can be supported by the current technology?
  • Are we seeing widespread use of these applications? If not, why not?
  • Is it possible to categorize these applications in terms of the value proposition to the users as well as the providers? If yes, how?
  • What are the most useful applications of spoken language technologies that have yet to be developed? And why?

Turning Sounds into Words - Current Norm
• X = acoustic signal sequence; W = word sequence
• $P(W|X) = P_X(X|W)\,P_W(W) / P(X)$
• objective: maximize the average performance (accuracy rate)
  • $\max P(W|X)$ during training
  • $\max_W P(W|X)$ during decoding
• $P_W(W)$
  • statistical language models (mostly for large vocabulary ASR)
  • grammar expressions (finite-state, context-free, ..)
• $P_X(X|W)$
  • hidden Markov model
  • mixture density - close approx. to arbitrary distribution
• Data-driven methods led to major advances in speech recognition.
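The decoding rule on this slide can be sketched in a few lines. The example below is only an illustration under stated assumptions: the candidate word sequences and their scores are invented, and a real recognizer searches a far larger hypothesis space. It shows that P(X) does not depend on W, so decoding only needs log P(X|W) + log P(W), while the posterior itself needs the normalizer.

```python
import math

def decode(acoustic_loglik, lm_logprob):
    """Return argmax_W [log P(X|W) + log P(W)]; P(X) is the same for every W."""
    return max(acoustic_loglik, key=lambda w: acoustic_loglik[w] + lm_logprob[w])

def posteriors(acoustic_loglik, lm_logprob):
    """P(W|X) over the candidate set; the sum of joint scores stands in for P(X)."""
    joint = {w: math.exp(acoustic_loglik[w] + lm_logprob[w]) for w in acoustic_loglik}
    total = sum(joint.values())
    return {w: p / total for w, p in joint.items()}

# Hypothetical scores for one utterance X and three candidate word sequences.
acoustic_loglik = {
    "recognize speech": -12.0,
    "wreck a nice beach": -11.5,
    "recognize peach": -13.0,
}
lm_logprob = {
    "recognize speech": math.log(0.6),
    "wreck a nice beach": math.log(0.1),
    "recognize peach": math.log(0.3),
}

best = decode(acoustic_loglik, lm_logprob)
post = posteriors(acoustic_loglik, lm_logprob)
```

Here the language model overrides a slightly better acoustic score for the second candidate, which is the role the slide assigns to $P_W(W)$.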

The Path We Have Walked Through
[Diagram boxes, in extraction order:]
• System design based on distribution estimation & the need of a distribution formalism that closely supports the data (HMM, mixture density, N-gram LM)
• Spoken language recognition & understanding
• Refinement on use of data and the matching model (adaptation algorithms)
• Problem formulation based on Bayes decision theory and adoption of error rate as the performance criterion
• Recognition of words, phrases, and meaning from a finite set of choices
• Identification of linguistic events by association & critical feature, with assumed structure
• Template matching & use of reference patterns as representation of knowledge with embedded critical features (distance measure, Cepstrum)
• Refinement on distribution structure (phone model generalization)
• Dynamic programming algorithms with broadened search scope & improved efficiency (Level-building, stack algorithm, Beam search, N-best)
• Study of acoustic-phonetics & linguistic theories to establish knowledge for performing identification (spectrogram reading, A-P systems, invariant features)
• Adoption of general performance criteria and optimization methods (discriminative modeling)
• Treatment of temporal variation (dynamic programming time warping)

Role of Bayes Decision Theory
• Bayes Decision Theory
  • “non-parametric”: Template Matching & Clustering, Feature Representation, Temporal Alignment, Critical Events
  • “parametric”: Distribution Estimation, Hidden Markov Model, Language Models, Search Algorithms
• Differentiation:
  - Parameterization;
  - Involvement of language;
  - Emphasis of temporal resolution & critical events;
  - handling of search context & complexity compromises
• Issues: form of distribution; representation of spoken language; representation of context; adaptation of distribution & structural expression; treatments of critical events
• Generalized optimization methods for system design support parametric as well as non-parametric approach: minC mine{C(x, )}

Automatic Speech Recognition and Understanding - A Communication-Theoretic Approach
[Diagram: message source -> linguistic channel -> articulatory channel -> acoustic channel -> transmission channel -> speech recognizer]
• message source, P(M): message M
• linguistic channel, P(W|M): message M realized as a word sequence W
• articulatory channel, P(S|W): words realized as a sequence of sounds S
• acoustic channel, P(A|S): sounds received by transducer through acoustic ambient A
• transmission channel, P(X|A): signal converted from acoustic to electric, transmitted, distorted and received for processing as X
• Diagram annotations: “Focus of traditional acoustic-phonetic study”; “?”; “?”
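Written out, the chain of conditional distributions on this slide suggests the factorization below. This is a sketch rather than a formula from the slides: it assumes each stage depends only on the stage immediately before it, and the decision rule for the message is an assumed extension of the word-level rule $\max_W P(W|X)$ to the message level.

```latex
P(M, W, S, A, X) = P(M)\,P(W \mid M)\,P(S \mid W)\,P(A \mid S)\,P(X \mid A)

\hat{M} = \arg\max_{M} \; P(M) \sum_{W} P(W \mid M) \sum_{S} P(S \mid W) \sum_{A} P(A \mid S)\,P(X \mid A)
```

Under these assumptions, recognizing words alone corresponds to stopping the chain at W and treating M as unobserved.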

Assumptions in the Communication-Theoretic Approach
• A classification & decision theoretic framework
  - prescribed finite alphabet;
  - prescribed finite set of understandable concepts
• Domain knowledge exists and is shared between system builder & user
  - the alphabet
  - the vocabulary
  - the set of “concepts”
  - the expected actions

Challenge - Issue #2
• Is this formulation sufficient for human-machine interaction?
• Sub-issues:
  • Given the limitation of this formulation, can we still build something useful for human-machine interaction?
  • How to manage the domain knowledge? How rigid is it?
  • How to use the prior knowledge? How to adapt?
  • Can this formulation be broadened, to become a general framework for the design of a communicating machine?
  • If not, what is the alternative? Can it be a module of a bigger system?
  • What is the significance of inference & other logical operations in spoken language communication?
    • importance of pragmatics & prosody
    • knowledge representation

Challenge - Issue #3
• Within the computationally feasible framework, what is the objective of the system design and how to choose it?
  • recognition
    • choosing the right one from a finite set of options
    • Bayes decision theory
  • identification
    • the ability to register and apply the knowledge of critical features to the identification of linguistic events
  • detection & extraction
    • determining if a target event is present and where it is
    • Neyman-Pearson Lemma
• Different goals entail different solution formulations.
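To make the detection objective concrete, here is a minimal sketch of a likelihood-ratio test, the form of detector that the Neyman-Pearson lemma identifies as most powerful at a fixed false-alarm rate. The Gaussian score models and their parameters are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def detect(score, target, background, log_threshold=0.0):
    """Declare the target present when the log likelihood ratio
    log p(score | target) - log p(score | background) exceeds the threshold."""
    llr = gaussian_logpdf(score, *target) - gaussian_logpdf(score, *background)
    return llr > log_threshold

TARGET = (2.0, 1.0)      # hypothetical (mean, std) of the score when the keyword is present
BACKGROUND = (0.0, 1.0)  # hypothetical (mean, std) of the score when it is absent

present = detect(1.5, TARGET, BACKGROUND)                      # ratio favors the target
absent = detect(0.5, TARGET, BACKGROUND)                       # ratio favors the background
strict = detect(1.5, TARGET, BACKGROUND, log_threshold=2.0)    # stricter operating point
```

Raising the threshold trades missed detections for fewer false alarms, which is a different design goal from the minimum-error choice among a finite set of options used in recognition.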

Challenge - Issue #4
• How to measure the performance of the system and how to incorporate the performance metric in the solution?
• Performance objectives
  • error rate (overall average, average over talkers, critical user)
  • information transfer rate (b/s) between human and machine
  • other measures (degree of satisfaction, productivity, recall, ..)
• The performance objective affects the formulation of the problem.
• Human factors/engineering is necessary in defining the objective.
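As an illustration of the error-rate objective, the sketch below computes word error rate in the usual way: the minimum number of substitutions, deletions and insertions needed to turn the reference into the hypothesis, divided by the reference length. The function name and the example sentences are made up for this example, and the reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or a match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)

exact = word_error_rate("call home now", "call home now")
one_sub = word_error_rate("call home now", "call phone now")
one_ins = word_error_rate("call home", "call the home")
```

Averaging this quantity over all words pooled together or per talker gives different numbers, which is the distinction the slide draws between the overall average and the average over talkers.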

Difficulties in Automatic Speech Recognition
• lack of systematic understanding of variability
  - structural or functional variability
  - parametric variability
• lack of complete structural representations of speech
• lack of data for understanding of non-structural variability

Transducer Characteristics in Telephone Sets (ICASSP-93, Wang, Chen & Yang)
[Figure: transducer response in dB versus frequency (kHz)]

[Diagram: Acoustic Signal -> Electroacoustic Transducer -> Transmission Channel -> To Speech Recognizer]
• Electroacoustic Transducer: microphone frequency response
• Transmission Channel (wired or wireless transmission): transmission line response; codec; packet loss/delay

Transformation of Distribution
[Figure: distributions A, Cha and C in the cepstral vector space, shown under two kinds of change: translation, scaling & rotation; and translation alone]
• Issues:
  • How can linear transformation and bias be reliably and efficiently estimated?
  • Is it possible to have an invariant feature that does not change with operating and environmental conditions?

The Usual Transmission Channel
[Diagram: speech S passes through the Acoustic Channel (response Ha, noise Na) and then the Transmission Channel (response Hc, noise Nc)]
The acoustic signal is convolved with the channel response Hc and contaminated with channel noise Nc.
• Characteristics of transfer functions of acoustic and transmission channels and those of the noise are very different;
• A simplified model, lumping various channel components together, is usually used; it may be inadequate;
• Mixing non-linear auditory and cognitive features, and linear channel models is not straightforward.
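One widely used way to handle the purely convolutional part of such a channel is cepstral mean subtraction, sketched below. The technique is not named on the slides and is brought in here only as an illustration: a fixed channel response becomes, to a good approximation, a constant offset added to every cepstral vector, so subtracting the per-utterance mean removes it. The frame values and the offset are invented, and additive noise (Na, Nc) is not handled.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-dimension mean over the utterance from every frame."""
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

# Hypothetical two-dimensional cepstral frames for a clean utterance.
clean = [[1.0, -0.5], [2.0, 0.5], [3.0, 1.5]]
# Hypothetical constant offset introduced by a fixed channel response.
channel_bias = [0.7, -0.2]
received = [[c + b for c, b in zip(f, channel_bias)] for f in clean]

normalized_clean = cepstral_mean_subtraction(clean)
normalized_received = cepstral_mean_subtraction(received)
```

After normalization the clean and received features coincide, which corresponds to undoing a pure translation; scaling and rotation of the distribution would need a richer transformation.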

Impact of Non-linear Distortion Due to CODEC
[Diagram labels: radio; GSM, TDMA, CDMA; Base Station & MTSO; PSTN Switch; network; 64 kb/s μ/A-PCM]
[Table: word accuracy (%) by SNR (dB), speed (mph) and speech feature. Conditions listed: SNR 17 and 13; speed 60 and 30 at each SNR; no TDMA codec. Row labels: Coding effect only; Coding+error concealment. Values as listed: 98.6, 98.5, 97.6, 96.9, 98.5 and 95.2, 96.8, 92.3, 96.1, 98.5.]
• Issues:
  • coding distortion; trans-coding and tandeming
  • possibility of data-tunneling (wireless codec bits embedded in DS0 without decoding into PCM during transmission)

Challenge - Issue #5
• Are there alternative structural representations of speech beyond hidden Markov model and finite state grammar that are suitable for handling stochastic variation to achieve robust performance?
• Speech representations & adaptation
  • front-end measurements & representation
  • role of critical events and feature
  • sequence representation
  • pronunciation
  • syntax, context, and grammar
  • concept/message

Example: Dealing with Phonemic Variations
[Figure: COMPOSITE FSN, a network running from a beginning state to a final state over the phone labels sil sh ow sil aw l sil ax l er t s sil]
• enrich the distribution form - use mixture density HMM
• introduce context - use context-dependent phoneme-like models
• context-dependent phoneme model examples: -sh-ow, -ax-l, ax-l-er, l-er-t
• Issue: what contexts to consider?
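The context-dependent unit names on this slide follow a left-center-right pattern, which can be generated mechanically. The sketch below assumes that pattern, with an empty left or right field at the boundary of the phone sequence; the boundary convention for the last unit (for example er-t-) is a guess, since the slide only shows units at the start and in the middle of a sequence.

```python
def context_dependent_units(phones):
    """Name each phone by its left neighbor, itself, and its right neighbor."""
    units = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else ""
        right = phones[i + 1] if i + 1 < len(phones) else ""
        units.append(f"{left}-{p}-{right}")
    return units

units = context_dependent_units(["ax", "l", "er", "t"])
```

With immediate neighbors as context, the number of distinct units grows roughly with the cube of the phone inventory, which is one way to read the slide's question about which contexts to consider.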

Pronunciation Variations - in a typical Switchboard data set
• Reference Dictionary - constructed from Callhome and Switchboard
  • 3M words training set of 28,000 distinct words, 3500 of which have multiple pronunciations.
• Test Data Set
  • 4700 word tokens; 900 distinct words
  • 2100 pronunciations according to phonetic transcription
  • 2200 tokens (47%) pronounced “properly” according to dictionary
  • 1500 new pronunciations emerge for complete coverage
• Other attributes:
  • 650 words with single pronunciation
  • “the” has 36 pronunciations
  • schwa is a pronunciation of 27 words; 38 pronunciations are homonymic with more than 5 words
  • “the” and “to” are most confusable with 7 pronunciations in common
• Pronunciation modeling is critically needed. It also affects phoneme models.

Representations of Sequence
• List
  • $W = (w_1, w_2, w_3, \ldots, w_L)$
• Finite State Grammar $(V, \Gamma, G, f, \delta)$
  • $V = \{w\}$, the vocabulary or alphabet
  • $\Gamma = \{G\}$, the set of states
  • $G$: the state, $G_t = f(w_{t-N}, w_{t-N+1}, \ldots, w_{t-1})$
  • $f$: the “function of context”; $(w_{t-N}, w_{t-N+1}, \ldots, w_{t-1})$ is the context
  • $\delta$: the next state function, $G_{t+1} = \delta(G_t, w_t)$
• Various grammar rules and inference for $V^* = \bigcup_{i=1}^{\infty} V^i$
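A small sketch may help connect the state, the function of context and the next state function. It assumes the simplest choice, in which the state is the context itself (the last N words); the helper names are made up for this example.

```python
def next_state(state, word, n=2):
    """Next state function: append the new word and keep the most recent n words."""
    return (state + (word,))[-n:]

def state_sequence(words, n=2):
    """States visited while reading the word sequence, starting from the empty context."""
    state = ()
    states = [state]
    for w in words:
        state = next_state(state, w, n=n)
        states.append(state)
    return states

states = state_sequence(["show", "all", "alerts"], n=2)
```

With this choice the grammar behaves like an N-gram model: the number of possible states grows with the vocabulary size raised to the power N, which motivates more compact functions of context.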

State and Context
• Context - condition formed by the text sequence
• State - implied characteristics derived from the context
• Examples of State:
  • the context itself, $G_t = (w_{t-N}, w_{t-N+1}, \ldots, w_{t-1})$
  • sequence of the corresponding broad phonetic classes
  • word class sequence, e.g. verb, adjective, noun, ..
  • grammar elements, e.g. noun phrase, verb phrase, ..
  • derived notions, e.g. date, time, number, dialog notions
• Issues:
  - representation of state and the function of context
  - number of states to consider, $\Gamma = \{G\}$
  - coverage of probabilistic measure
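The effect of the function of context on the number of states can be seen in a toy comparison. The word-class table and the sentences below are hypothetical; the sketch counts distinct states when the state is the two-word context itself and when it is the corresponding word class sequence.

```python
# Hypothetical word-class assignments for a tiny vocabulary.
WORD_CLASS = {
    "show": "verb", "list": "verb",
    "all": "adjective", "new": "adjective",
    "alerts": "noun", "messages": "noun",
}

def context_states(sentences, n, f):
    """Collect the distinct states f(context) over all n-word contexts."""
    states = set()
    for s in sentences:
        words = s.split()
        for t in range(n, len(words) + 1):
            states.add(f(tuple(words[t - n:t])))
    return states

def identity(context):
    """State is the context itself."""
    return context

def to_classes(context):
    """State is the word class sequence of the context."""
    return tuple(WORD_CLASS[w] for w in context)

sentences = ["show all alerts", "list new messages", "show new alerts"]
word_states = context_states(sentences, 2, identity)
class_states = context_states(sentences, 2, to_classes)
```

Six word-level states collapse to two class-level states here; fewer states are easier to estimate but distinguish fewer contexts, which is the trade-off behind the issues listed on the slide.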

Representation, Coverage and Probability Measure
[Figure: units of increasing span - Single Symbol (w), Symbol Pair (w w), Symbol Triplet (w w w)]
• Issues:
  • Symbol dependency may have long span;
  • $W \in V^i$, i large, is hard to access (as hard as determining if W is a valid sentence);
  • Grammatical inference tends to over-generate;
  • Probability assignment/estimation of P(W) or P(W|G) is difficult.
• Question:
  • Is there an intermediate (variable length) set {W} or {G} that meets the need?
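The difficulty of assigning P(W) shows up even with symbol pairs. The sketch below estimates bigram probabilities by relative frequency from three invented training sentences, with no smoothing and no end-of-sentence symbol; it is meant only to illustrate the coverage issue, not to describe any particular system.

```python
from collections import Counter

def bigram_model(sentences):
    """Relative-frequency bigram estimates P(w | prev), with <s> marking the sentence start."""
    contexts, pairs = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split()
        contexts.update(words[:-1])         # every word that is followed by another word
        pairs.update(zip(words, words[1:]))
    def prob(prev, w):
        return pairs[(prev, w)] / contexts[prev] if contexts[prev] else 0.0
    return prob

def sentence_prob(prob, sentence):
    """Product of bigram probabilities along the sentence."""
    words = ["<s>"] + sentence.split()
    total = 1.0
    for prev, w in zip(words, words[1:]):
        total *= prob(prev, w)
    return total

p = bigram_model(["show all alerts", "show new alerts", "list all messages"])
seen = sentence_prob(p, "show all alerts")
unseen_valid = sentence_prob(p, "list new messages")
recombined = sentence_prob(p, "show all messages")
```

A plausible sentence with one unseen pair gets probability zero, while a sentence never observed as a whole gets the same probability as a training sentence because its pairs were all seen; both effects relate to the coverage and over-generation issues on the slide.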

Challenge - Issue #6
• Can we make spoken language applications easy to develop? What kind of tools are needed?
• Sub-issues
  • What are the deployment requirements?
  • How rigidly is the technology tied to the application domain? How robust is it?
  • What are the desired characteristics in the application development environments?
  • Is it possible to have a common architecture and interface for application developments?

Summary
• Spoken language processing technology has made significant progress, with many potential applications.
• A communication-theoretic formulation of the speech production chain provides a framework for extending the present speech communication research and developments.
• Further improvements to the technology need to focus on the issues of increased accuracy and robustness in performance, natural human-machine interaction, and ease in application developments.
• Current technology developments are based on a “focused” problem formulation; many fundamental issues remain open in spoken language recognition and understanding.
