From last time: ASR System Architecture (block diagram): Speech Signal → Signal Processing → Cepstrum → Probability Estimator → Probabilities (e.g., “z” = 0.81, “th” = 0.15, “t” = 0.03) → Decoder → Recognized Words (“zero” “three” “two”), with the Decoder also consulting a Pronunciation Lexicon and a Grammar.
A Few Points about Human Speech Recognition (See Chapter 18 for much more on this)
Human Speech Recognition
• Experiments dating from 1918 on noise and reduced bandwidth (Fletcher)
• Statistics of CVC perception
• Comparisons between human and machine speech recognition
• A few thoughts
Assessing Recognition Accuracy
• Intelligibility
• Articulation: the Fletcher experiments
• CVC, VC, CV syllables in carrier sentences
• Tests over different SNRs and frequency bands
• Example: “The first group is ‘mav’” (forced choice between “mav” and “nav”)
• Used sharp lowpass and/or highpass filtering. For equal energy in the two bands, the crossover is 450 Hz; for equal articulation, 1550 Hz.
Results
• s = v·c² (CVC syllable accuracy = vowel accuracy × consonant accuracy squared, assuming independent phone errors)
• Articulation Index (the original “AI”)
• Errors are independent between bands
• An articulatory band spans ~1 mm along the basilar membrane
• 20 filters between 300 and 8000 Hz
• A single zero-error band -> no error!
• Robustness to a range of problems
• AI = (1/K) ∑k (SNRk / 30), where each band SNR saturates at 0 and 30 dB (see the sketch after this list)
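A minimal Python sketch of these two ideas, with invented numbers (the band SNRs and error rates below are illustrative, not data from the lecture): the SNR-based AI with per-band saturation, and the band-independence error product in which a single zero-error band drives the total error to zero.

```python
import numpy as np

def articulation_index(band_snrs_db):
    """AI = (1/K) * sum_k (SNR_k / 30), with each band's SNR
    clipped ("saturated") to the range [0, 30] dB."""
    snrs = np.clip(np.asarray(band_snrs_db, dtype=float), 0.0, 30.0)
    return float(np.mean(snrs / 30.0))

def combined_error(band_errors):
    """Fletcher's independence assumption: the total error is the
    product of per-band errors, so one zero-error band -> no error."""
    return float(np.prod(band_errors))

# 20 hypothetical articulatory bands between 300 and 8000 Hz
rng = np.random.default_rng(0)
band_snrs = rng.uniform(-5.0, 40.0, size=20)
print(f"AI = {articulation_index(band_snrs):.2f}")

print(combined_error([0.3, 0.5, 0.2]))  # 0.03
print(combined_error([0.3, 0.0, 0.2]))  # 0.0: one perfect band
```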
AI additivity
• s(a,b) = phone accuracy for the band from a to b, with a < b < c
• Error independence across bands: (1 − s(a,c)) = (1 − s(a,b)) (1 − s(b,c))
• Taking logs: log10(1 − s(a,c)) = log10(1 − s(a,b)) + log10(1 − s(b,c))
• Define AI(s) = log10(1 − s) / log10(1 − smax)
• Then AI(s(a,c)) = AI(s(a,b)) + AI(s(b,c)), so AI is additive across bands (numerical check below)
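A quick numerical check of the additivity identity. The band accuracies and smax value are invented for illustration; only the formulas on this slide are used.

```python
import math

S_MAX = 0.985   # illustrative asymptotic accuracy; not from the slides

def AI(s):
    """AI(s) = log10(1 - s) / log10(1 - s_max)."""
    return math.log10(1.0 - s) / math.log10(1.0 - S_MAX)

s_ab, s_bc = 0.60, 0.75                     # hypothetical sub-band accuracies
s_ac = 1.0 - (1.0 - s_ab) * (1.0 - s_bc)    # error independence across bands

print(f"AI(s(a,c))            = {AI(s_ac):.4f}")
print(f"AI(s(a,b))+AI(s(b,c)) = {AI(s_ab) + AI(s_bc):.4f}")  # identical: AI adds
```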
Jont Allen’s Interpretation: The Big Idea
• Humans don’t use frame-like spectral templates
• Instead, partial recognition occurs within bands
• Partial results are combined for phonetic (syllabic?) recognition
• Important for three reasons:
  • Based on decades of listening experiments
  • Based on a theoretical structure that matched the results
  • Different from what ASR systems do
Questions about AI
• Based on phones: are they the right unit for fluent speech?
• Is the correlation between distant bands lost?
• Lippmann experiments with disjoint bands
  • Signal above 8 kHz helps a lot in combination with signal below 800 Hz
Human SR vs ASR: Quantitative Comparisons
• Lippmann’s compilation (see book): humans typically show ~10× lower WER
• This gap hasn’t changed much since his study
• One caveat: “human” scores are ideals; under sustained real conditions people don’t pay perfect attention (especially after lunch)
Human SR vs ASR: Quantitative Comparisons (2)
(Figure: word error rates for a 5000-word Wall Street Journal read-speech task with additive automotive noise; old numbers, ASR would be a bit better now.)
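Since these comparisons are stated in terms of word error rate, here is a minimal sketch of the standard WER computation (the textbook Levenshtein-distance definition, not something specific to this lecture):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / #reference
    words, computed with Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[-1][-1] / len(ref)

print(wer("zero three two", "zero tree two"))  # 1 substitution / 3 words ≈ 0.33
```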
Human SR vs ASR: Qualitative Comparisons
• Signal processing
• Subword recognition
• Temporal integration
• Higher-level information
Human SR vs ASR: Signal Processing
• Many maps vs. one
• Sampled across time and frequency vs. sampled in time only
• Some hearing-based signal processing is already used in ASR
Human SR vs ASR: Subword Recognition
• Knowing what is important (from the maps)
• Combining it optimally
Human SR vs ASR: Temporal Integration
• Using or ignoring duration (e.g., voice onset time)
• Compensating for rapid speech
• Incorporating multiple time scales
Human SR vs ASR: Higher Levels
• Syntax
• Semantics
• Pragmatics
• Getting the gist
• Dialog to learn more
Human SR vs ASR: Conclusions
• When we pay attention, human SR is much better than ASR
• Some aspects of human models are making their way into ASR
• There is probably much more to do, once we learn how to do it right