Human and Machine Performance in Speech Processing Louis C.W. Pols, Institute of Phonetic Sciences / ACLC, University of Amsterdam, The Netherlands (Apologies: this presentation resembles the keynote at ICPhS’99, San Francisco, CA)
IFA, Herengracht 338, Amsterdam. Welcome to the Heraeus-Seminar “Speech Recognition and Speech Understanding”, April 3-5, 2000, Physikzentrum Bad Honnef, Germany
Overview • Phonetics and speech technology • Do recognizers need ‘intelligent ears’? • What is knowledge? • How good is human/machine speech recognition? • How good is synthetic speech? • Pre-processor characteristics • Useful (phonetic) knowledge • Computational phonetics • Discussion/conclusions
Machine performance is more difficult if … • the test condition deviates from the training condition, because of: • nativeness and age of speakers • size and content of vocabulary • speaking style, emotion, rate • microphone, background noise, reverberation, communication channel • non-availability of certain features • however, machines never get tired, bored or distracted
Do recognizers need intelligent ears? • ‘intelligent ears’ = front-end pre-processor • only if it improves performance • humans are generally better speech processors than machines, so perhaps system developers can learn from human behavior • robustness is at stake (noise, reverberation, incompleteness, restoration, competing speakers, variable speaking rate, context, dialects, non-nativeness, style, emotion)
What is knowledge? • phonetic knowledge • probabilistic knowledge from databases • fixed set of features vs. adaptable set • trading relations, selectivity • knowledge of the world, expectation • global vs. detailed: see video (with permission from Interbrew Nederland NV)
Video is a metaphor for: • from global to detail (world → Europe → Holland → North Sea coast → Scheveningen beach → young lady drinking Dommelsch beer) • sound → speech → speaker → English utterance • ‘recognize speech’ or ‘wreck a nice beach’ • zoom in on whatever information is available • make an intelligent interpretation, given context • beware of distractors!
Human auditory sensitivity • stationary vs. dynamic signals • simple vs. spectrally complex • detection threshold • just noticeable differences
Detection thresholds and JNDs: for multi-harmonic, simple, stationary signals, frequency JND ≈ 1.5 Hz; for single-formant-like periodic signals, F2 JND 3-5% and bandwidth JND 20-40% (Table 3 in Proc. ICPhS’99 paper)
[Figure: difference limens (DL) for short speech-like transitions, comparing complex vs. simple signals and short vs. longer transitions. Adapted from van Wieringen & Pols (Acta Acustica ’98)]
How good is human / machine speech recognition? • machine speech recognition is surprisingly good for certain tasks • it could be better for many others • robustness, outliers • what are the limits of human performance? • in noise • for degraded speech • with missing information (trading)
[Figure: human word intelligibility vs. noise: at noise levels where humans start to have some trouble, recognizers already have serious trouble. Adapted from Steeneken (1992)]
Robustness to degraded speech • speech = a time-modulated signal in frequency bands • relatively insensitive to (spectral) distortions • a prerequisite for digital hearing aids • modulating the spectral slope: -5 to +5 dB/oct, 0.25-2 Hz • temporal smearing of envelope modulation • ca. 4 Hz maximum in the modulation spectrum ≈ syllable rate • low-pass filtering above 4 Hz and high-pass filtering below 8 Hz have little effect on intelligibility • spectral envelope smearing • for BW > 1/3 oct the masked SRT starts to degrade (for references, see paper in Proc. ICPhS’99; a smearing sketch follows below)
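To make the envelope-smearing idea concrete, here is a minimal Python sketch, assuming a mono signal `x` at sampling rate `fs`; the function name `smear_envelope` and its parameters are mine, and the cited experiments smeared envelopes per octave band rather than full-band as done here:

```python
# A minimal sketch of temporal envelope smearing (illustration only).
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def smear_envelope(x, fs, cutoff_hz=4.0):
    """Low-pass filter the amplitude envelope, keep the fine structure."""
    analytic = hilbert(x)
    envelope = np.abs(analytic)              # Hilbert envelope
    fine = np.cos(np.angle(analytic))        # carrier / fine structure
    b, a = butter(4, cutoff_hz / (fs / 2))   # 4th-order Butterworth low-pass
    smeared = filtfilt(b, a, envelope)       # zero-phase envelope smoothing
    return np.clip(smeared, 0, None) * fine  # re-impose smeared envelope

# Example: a 500 Hz tone modulated at 3 Hz; modulations above cutoff_hz are
# removed. Per the slide, intelligibility suffers little until such slow
# (syllable-rate) modulations are affected.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 500 * t) * (1 + 0.8 * np.sin(2 * np.pi * 3 * t))
y = smear_envelope(x, fs, cutoff_hz=4.0)
```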
Robustness to degraded speech and missing information • partly reversed speech (Saberi & Perrott, Nature, 4/99) • fixed-duration segments time-reversed or shifted in time • perfect sentence intelligibility up to 50 ms (demo: every 50 ms reversed vs. original; a segment-reversal sketch follows below) • low-frequency modulation envelope (3-8 Hz) vs. acoustic spectrum • the syllable as information unit? (S. Greenberg) • gap and click restoration (Warren) • gating experiments
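The Saberi & Perrott manipulation is easy to sketch in Python, again assuming a mono signal `x` at rate `fs` (helper name `reverse_segments` is mine): time-reverse every fixed-duration segment and listen.

```python
# A minimal sketch of locally time-reversed speech: up to ~50 ms segments,
# sentences reportedly remain fully intelligible.
import numpy as np

def reverse_segments(x, fs, seg_ms=50.0):
    """Reverse each consecutive segment of seg_ms milliseconds."""
    seg_len = max(1, int(round(fs * seg_ms / 1000.0)))
    y = x.copy()
    for start in range(0, len(x), seg_len):
        y[start:start + seg_len] = x[start:start + seg_len][::-1]
    return y

# Usage: y = reverse_segments(x, fs=16000, seg_ms=50.0)
```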
How good is synthetic speech? (not the main theme of this seminar, but synthesis and dialogue still deserve attention) • good enough for certain applications • could be better in most others • evaluation: application-specific • or a multi-tier evaluation is required • interesting experience: the Synthesis workshop at Jenolan Caves, Australia, Nov. 1998
Workshop evaluation procedure • participants acted as native listeners • DARPA-type procedures in data preparation • balanced listening design • no detailed results made public • 3 text types: • newspaper sentences • semantically unpredictable sentences • telephone directory entries • 42 systems in 8 languages tested
Some global results • it worked, but with many practical problems (for a demo see http://www.fon.hum.uva.nl) • this seems the way to proceed and to expand • global rating (poor to excellent) • of text analysis, prosody & signal processing • and/or more detailed scores • transcriptions subjectively judged • major/minor/no problems per entry • web-site access to several systems (http://www.ldc.upenn.edu/ltts/)
Phonetic knowledge to improve speech synthesis (assuming concatenative synthesis) • control of emotion, style, voice characteristics • perceptual implications of • parameterization (LPC, PSOLA) • discontinuities (spectral, temporal, prosodic; see the crossfade sketch below) • improved naturalness (prosody!) • active adaptation to other conditions • hyper/hypo speech, noise, communication channel, listener impairment • systematic evaluation
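As one illustration of smoothing a temporal discontinuity at a concatenation point, here is a minimal Python sketch of a linear crossfade between two waveform units; the helper name `concatenate_units` is hypothetical, and real diphone systems align joins pitch-synchronously (e.g. PSOLA) rather than with a plain crossfade.

```python
# A minimal sketch: join two units `a` and `b` with a linear crossfade,
# hiding the abrupt waveform jump a hard splice would produce.
import numpy as np

def concatenate_units(a, b, fs, fade_ms=10.0):
    """Join two units with a linear crossfade of fade_ms milliseconds."""
    n = min(int(fs * fade_ms / 1000.0), len(a), len(b))
    fade_out = np.linspace(1.0, 0.0, n)
    fade_in = 1.0 - fade_out
    overlap = a[-n:] * fade_out + b[:n] * fade_in
    return np.concatenate([a[:-n], overlap, b[n:]])
```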
Desired pre-processor characteristics in automatic speech recognition • basic sensitivity to stationary and dynamic sounds • robustness to degraded speech • rather insensitive to spectral and temporal smearing • robustness to noise and reverberation • filter characteristics • are BP, PLP, MFCC, RASTA, TRAPS good enough? (see the MFCC sketch below) • lateral inhibition (spectral sharpening); dynamics • what can be neglected? • non-linearities, limited dynamic range, active elements, co-modulation, secondary pitch, etc.
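For reference, the MFCC front end the slide questions can be computed in a few lines of Python; this sketch assumes the third-party `librosa` library and an audio file `utterance.wav`, both my assumptions rather than anything from the talk.

```python
# A minimal sketch of a standard MFCC front end (13 coefficients,
# 25 ms window, 10 ms hop) plus dynamic (delta) features.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)        # mono, 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms / 10 ms
delta = librosa.feature.delta(mfcc)                     # velocity features
print(mfcc.shape, delta.shape)                          # (13, n_frames) each
```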
Caricature of a present-day speech recognizer • trained on a variety of speech input • much global information, no interrelations • monaural, uni-modal input • pitch extractor generally not operational • performs well on average behavior • does poorly on any type of outlier (OOV, non-native, fast or whispered speech, other communication channel) • neglects lots of useful (phonetic) information • relies heavily on the language model
Useful (phonetic) knowledge neglected so far • pitch information • (systematic) durational variability • spectral reduction/coarticulation (other than multiphone) • intelligent selection from multiple features • quick adaptation to speaker, style & channel • communicative expectations • multi-modality • binaural hearing
[Figure: useful information: durational variability, adapted from Wang (1998); mean segment durations: overall average = 95 ms; normal rate = 95 ms; primary stress = 104 ms; word-final = 136 ms; utterance-final = 186 ms]
Useful information: V and C reduction, coarticulation • spectral variability is not random but, at least partly, speaker-, style-, and context-specific • read vs. spontaneous; stressed vs. unstressed • not just for vowels, but also for consonants • duration • spectral balance • intervocalic sound energy difference • F2 slope difference • locus equation
[Figure: mean consonant duration vs. mean error rate for consonant identification; 791 VCV pairs (read & spontaneous; stressed & unstressed segments; one male speaker); consonant identification by 22 Dutch subjects. Adapted from van Son & Pols (Eurospeech’97)]
Other useful information: • pronunciation variation (ESCA workshop) • acoustic attributes of prominence (B. Streefkerk) • speech efficiency (post-doc project R. v. Son) • confidence measure • units in speech recognition • rather than PLU, perhaps syllables (S. Greenberg) • quick adaptation • prosody-driven recognition / understanding • multiple features
Speech efficiency • speech is most efficient if it contains only the information needed to understand it: “Speech is the missing information” (Lindblom, JASA ‘96) • less information is needed for more predictable things: • shorter duration and more spectral reduction for frequently occurring syllables and words • C-confusion correlates with acoustic factors (duration, CoG) and with information content (syllable/word frequency) • I(x) = -log2(Prob(x)) in bits; a small computation sketch follows below (see van Son, Koopmans-van Beinum & Pols (ICSLP’98))
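The information measure on the slide is easy to compute; in this minimal Python sketch the syllable counts are invented purely for illustration.

```python
# I(x) = -log2(P(x)) in bits, from (hypothetical) corpus counts.
import math

counts = {"de": 12000, "het": 8000, "scheve": 40}  # invented counts
total = sum(counts.values())

for syllable, n in counts.items():
    p = n / total
    bits = -math.log2(p)                           # information content in bits
    print(f"{syllable!r}: P = {p:.4f}, I = {bits:.2f} bits")

# Frequent syllables carry few bits, so they can be shortened and spectrally
# reduced with little loss: the efficiency argument on the slide.
```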
[Figure: correlation between consonant confusion and the four measures indicated; Dutch male speaker, 20 min read/spontaneous speech, 12k syllables, 8k words; 791 VCVs (read/spontaneous; 308 lexically stressed (+), 483 unstressed (–)); consonant identification by 22 subjects; significance marked at p ≤ 0.01 and p ≤ 0.001. Adapted from van Son et al. (Proc. ICSLP’98)]
Computational Phonetics (first suggested by R. Moore, ICPhS’95, Stockholm) • duration modeling (see the sketch below) • optimal unit selection (as in concatenative synthesis) • pronunciation variation modeling (SpeCom, Nov. ‘99) • vowel reduction models • computational prosody • information measures for confusion • speech efficiency models • modulation transfer function for speech
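As a taste of what "duration modeling" could look like in implementable form, here is a minimal Python sketch of multiplicative duration rules (Klatt-style), seeded with the mean durations from the Wang (1998) slide; the rule structure and the multiplicative combination of factors are my assumptions.

```python
# A minimal multiplicative duration model. Factor values are derived from
# the Wang (1998) means quoted earlier; combining them by multiplication
# is an assumption for illustration, not a claim from the talk.
BASE_MS = 95.0                          # overall average segment duration

FACTORS = {
    "primary_stress": 104.0 / 95.0,     # ~1.09
    "word_final": 136.0 / 95.0,         # ~1.43
    "utterance_final": 186.0 / 95.0,    # ~1.96
}

def predict_duration(flags):
    """Multiply the base duration by every applicable lengthening factor."""
    d = BASE_MS
    for flag in flags:
        d *= FACTORS.get(flag, 1.0)
    return d

print(predict_duration({"primary_stress", "word_final"}))  # ≈ 149 ms
```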
Discussion / Conclusions • speech technology needs further improvement for certain tasks (flexibility, robustness) • phonetic knowledge can help if provided in an implementable form; computational phonetics is probably a good way to achieve that • phonetics and speech/language technology should work together more closely, to their mutual benefit • this Heraeus-Seminar is a possible platform for that discussion