Dealing with Unknown Unknowns (in Speech Recognition) Hynek Hermansky

Processing speech in multiple parallel streams, which attend to different parts of the signal space and use different strengths of prior top-down knowledge, is proposed for dealing with unexpected signal distortions and unexpected lexical items. Some preliminary results in machine recognition of speech are presented.

There are things we do not know we don't know. -Donald Rumsfeld

Letter to the Editor, J. Acoust. Soc. Am.
Research field of “mad inventors or untrustworthy engineers”
“Funding artificial intelligence is real stupidity”
“After growing wildly for years, the field of computing appears to be reaching its infancy.”
John Pierce:
• supervised the Bell Labs team which built the first transistor
• President’s Science Advisory Committee
• developed the concept of pulse code modulation
• designed and launched the first active communications satellite
.... should people continue work towards speech recognition by machine? Perhaps it is for people in the field to decide.

Why am I working in this field?
Spoken language is one of the most amazing accomplishments of the human race.
• Why did I climb Mt. Everest?
• Because it is there!
• -Sir Edmund Hillary
• access to information
• voice interactions with machines
• extracting information from speech data
Problems faced in machine recognition of speech reveal basic limitations of all information technology!

We speak in order to be heard, in order to be understood. -Roman Jakobson
production, perception, cognition,..
knowledge / data
Speech recognition …a problem of maximum likelihood decoding -Frederick Jelinek
Hidden Markov Model

Stochastic recognition of speech
$\hat{W} = \arg\max_{W} \, p(x \mid W) \, P(W)$
• $\hat{W}$ - estimated speech utterance
• $p(x \mid W)$ - likelihoods of acoustic models of speech sounds; the models are derived by training on very large amounts of speech data
• $P(W)$ - prior probabilities of speech utterances (language model); the model is estimated from large amounts of data (typically text)
“Unknown unknowns” in machine recognition of speech
• distortions not seen in the training data of the acoustic model
• words that are not expected by the language model
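The decoding rule above can be illustrated with a minimal Python sketch. The candidate utterances, the acoustic log-likelihoods, and the prior values are all made up for illustration, and the function name `decode` is hypothetical; a real recognizer searches a very large hypothesis space rather than a three-entry dictionary. The sketch also shows the second kind of unknown unknown: a hypothesis that the language model assigns zero prior can never be selected, however well it fits the acoustics.

```python
import math

def decode(acoustic_loglik, lm_prior):
    """Return the utterance W maximizing log p(x|W) + log P(W), and its score."""
    best_w, best_score = None, -math.inf
    for w, loglik in acoustic_loglik.items():
        prior = lm_prior.get(w, 0.0)
        if prior <= 0.0:
            # The language model does not expect this utterance at all,
            # so it is excluded regardless of its acoustic score.
            continue
        score = loglik + math.log(prior)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

# Hypothetical log-likelihoods log p(x|W) for one acoustic observation x.
acoustic_loglik = {
    "recognize speech": -11.8,
    "wreck a nice beach": -11.5,
    "recognize peach": -11.0,  # best acoustic fit, but unknown to the language model
}
# Hypothetical language-model priors P(W).
lm_prior = {"recognize speech": 0.6, "wreck a nice beach": 0.4}

best, score = decode(acoustic_loglik, lm_prior)
print(best, round(score, 3))
```

Working in the log domain turns the product p(x|W) P(W) into a sum, which avoids numerical underflow when the likelihoods are very small.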

One possible way of dealing with unknown unknowns
Information in speech is coded in many redundant dimensions. Not all dimensions get corrupted at the same time.
• Parallel information-providing streams, each carrying different redundant dimensions of a given target.
• A strategy for comparing the streams.
• A strategy for selecting “reliable” streams.
[Figure labels: signal, information fusion, decision]
Comparing the streams?
• various correlation (distance) measures
Stream formation
• Different perceptual modalities
• Different processing channels within each modality
• Bottom-up and top-down dominated channels
Selecting reliable streams ?????
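One way to picture "comparing the streams" is a distance between the posterior distributions that two streams produce for the same frames. The sketch below uses the symmetric Kullback-Leibler divergence averaged over frames; this particular measure, the function names, and the posterior values are assumptions chosen for illustration, not necessarily the measure used in the talk.

```python
import math

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between two discrete distributions."""
    return sum((pi - qi) * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def stream_distance(post_a, post_b):
    """Average frame-by-frame symmetric KL divergence between two posteriograms."""
    return sum(sym_kl(fa, fb) for fa, fb in zip(post_a, post_b)) / len(post_a)

# Hypothetical posteriors over three speech sounds for two frames.
stream_a = [[0.80, 0.10, 0.10], [0.10, 0.80, 0.10]]
stream_b = [[0.70, 0.20, 0.10], [0.20, 0.70, 0.10]]  # agrees with stream_a
stream_c = [[0.34, 0.33, 0.33], [0.33, 0.34, 0.33]]  # imagined corrupted stream

d_ab = stream_distance(stream_a, stream_b)
d_ac = stream_distance(stream_a, stream_c)
print(d_ab < d_ac)
```

A large distance between streams that normally agree is one possible signal that some stream has met an input it was not trained for.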

Perceptual Data
Fletcher et al: The probability of error in recognition of full-band speech is given by the product of the probabilities of error in the subbands.
Boothroyd and Nittrouer: The probability of error in recognition in context is given by the product of the probability of error in recognition without context and the probability of error in the channel which provides information about the context.
Final error is dominated by the channel with the smallest error!
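The product rule can be checked with a small worked example. The error probabilities below are invented for illustration and the function name `combined_error` is hypothetical; the sketch only shows the arithmetic, including the fact that the combined error can never exceed the error of the best channel.

```python
import math

def combined_error(channel_errors):
    """Product-of-errors rule: the overall error is the product of per-channel errors."""
    return math.prod(channel_errors)

# Hypothetical error probabilities in three frequency subbands.
subband_errors = [0.5, 0.4, 0.1]
full_band_error = combined_error(subband_errors)  # 0.5 * 0.4 * 0.1

# Hypothetical errors without context and in the context-providing channel.
in_context_error = combined_error([0.3, 0.5])  # 0.3 * 0.5

print(round(full_band_error, 4), round(in_context_error, 4))
```

Because every factor is at most 1, the product is bounded by its smallest factor, which is the sense in which the channel with the smallest error dominates.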

Evidence for different processing strategies
Processing streams
• A large number of parallel processing streams
• Different carrier frequencies
• Different carrier bandwidths
• Different spectral and temporal resolutions
• Different modalities
• Different prior biases
[Figure: auditory cortical receptive fields with different carrier frequencies, different temporal resolutions, and different spectral resolutions; axes: frequency vs. time [s]; from N. Mesgarani]

Evidence for equally powerful bottom-up and top-down streams?
From the subjective point of view, nothing special differentiates the top-down and bottom-up dominated processing streams. All streams provide information for a decision.
When all streams provide non-conflicting information, all of this information is used for the decision.
When the context allows for multiple interpretations of the sensory input, the bottom-up processing stream dominates.
When the sensory input is corrupted by noise, the top-down dominated stream fills in for the corrupted bottom-up input.
Hermansky 2013

Monitoring Performance
$P_{\mathrm{miss}} = (1 - P_1)(1 - P_2)$
[Figure labels: $P_1$, $P_2$, observer]
With an observer, false positives and negatives are possible:
$P_{\mathrm{miss,observed}} \neq (1 - P_1)(1 - P_2)$
Could it be that we know when we know?

Performance Monitoring in Sensory Perception
Human judgment (adapted from Smith et al 2003); similar data are available for monkeys, dolphins, rats,…
[Figure: share of “sparse”, “not sure”, and “dense” judgments (0 % to 100 %) as a function of picture density (low to high)]
Machine?
[Figure labels: training data, classifier, model of the output; testing data, classifier, model of the output; compare models; update; judgement]
Knowing when one knows!

[Figure labels: data, preprocessing, spectrogram (frequency vs. time, up to 1 s), artificial neural network (ANN) trained on large amounts of labeled data, posteriogram (phoneme posteriors vs. time), fusion]

Fusion of streams of different carrier frequencies [Hermansky et al 1996, Li et al 2013]

Preliminary results using multi-stream speech recognition on noisy TIMIT data
• Processing is done in multiple parallel streams
• Signal corruption affects only some streams
• Performance monitor selects N best streams for further processing
[Table: phoneme recognition error rates on noisy TIMIT data]
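The three steps above (parallel streams, corruption of only some streams, selection of the N best) can be sketched as follows. This is a toy illustration under loud assumptions: the monitor score used here is the average entropy of each stream's frame posteriors, which is only a crude stand-in for a real performance monitor, the fusion is a plain average, and all names and numbers are hypothetical.

```python
import math

def mean_entropy(posteriogram):
    """Average per-frame entropy; lower values mean more peaked posteriors."""
    total = 0.0
    for frame in posteriogram:
        total -= sum(p * math.log(p) for p in frame if p > 0.0)
    return total / len(posteriogram)

def select_streams(streams, n_best):
    """Keep the n_best streams with the lowest monitor score (assumed proxy: entropy)."""
    ranked = sorted(streams, key=lambda name: mean_entropy(streams[name]))
    return ranked[:n_best]

def fuse(streams, selected):
    """Average the posteriors of the selected streams, frame by frame."""
    n_frames = len(streams[selected[0]])
    n_classes = len(streams[selected[0]][0])
    return [
        [sum(streams[s][t][k] for s in selected) / len(selected) for k in range(n_classes)]
        for t in range(n_frames)
    ]

# Hypothetical posteriors over three speech sounds, two frames per stream.
streams = {
    "low":  [[0.90, 0.05, 0.05], [0.10, 0.80, 0.10]],
    "mid":  [[0.80, 0.10, 0.10], [0.20, 0.70, 0.10]],
    "high": [[0.34, 0.33, 0.33], [0.30, 0.40, 0.30]],  # imagined noise-corrupted band
}

selected = select_streams(streams, n_best=2)
fused = fuse(streams, selected)
decisions = [frame.index(max(frame)) for frame in fused]
print(selected, decisions)
```

Entropy alone cannot detect a stream that is confidently wrong, which is one reason a practical monitor would compare the behavior of each stream on test data against its behavior on training data.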

[Figure: conventional “deep” net: all available frequency components over up to 100 ms of time, many processing layers, (transformed) posterior probabilities of speech sounds]
[Figure: “long, wide and deep” net: high, mid, and low frequency components over up to 1000 ms of time, many processing layers, “get info 1” … “get info i” … “get info N”, “smart” fusion, (transformed) posterior probabilities of speech sounds]

Conclusions we would eventually like to make
• Recognition should be done in parallel processing streams, each attending to a particular aspect of the signal and using different levels of top-down expectations
• Discrepancy among the streams indicates an unexpected signal
• Suppressing corrupted streams can increase robustness to unexpected inputs

Machine Emulation of Human Speech Communication
John Pierce: ..devise clear, simple, definitive experiments. So a science of speech can grow, certain step by certain step.
Fred Jelinek: Speech recognition …a problem of maximum likelihood decoding
tools: information and communication theory, machine learning, large data,….
Roman Jakobson: We speak in order to be heard, in order to be understood
also John Pierce: (Speech recognition is so far (1969) a field of) mad inventors or untrustworthy engineers (because the machine needs) intelligence and knowledge of language comparable to those of a native speaker.
human communication, speech production, perception, neuroscience, cognitive science,..
Gordon Moore: The complexity for minimum component costs has increased at a rate of roughly a factor of two per year…
Sounds like a good goal to aim at!

THANKS! Hamed Ketabdar, Misha Pavel, Jont Allen, Nima Mesgarani, Feipeng Li, Vijay Peddinti, Ehsan Variani, Harish Mallidi, Samuel Thomas
