Statistical Speech Technology Group
http://mickey.ifp.uiuc.edu/wiki/

Jui-Ting Huang, Arthur Kantor, Mark Hasegawa-Johnson, Sarah Borys, Erica Lynn, Jeremy Tidemann, Xiaodan Zhuang, Bowon Lee, Harsh Vardhan Sharma, Laehoon Kim, Su-Youn Yoon, Kyungtae Kim, Heejin Kim, Bryce Lobdell, Chi Hu, Rahul Yargop, and David Harwath
Current Sources of Funding

Collaborations with Departments of Linguistics and Psychology:
• NSF 07-03624 RI-Collaborative Research: Landmark-based speech recognition using prosody-guided models of speech variability
• NSF 06-23805 DHB: An interdisciplinary study of the dynamics of second language fluency

Collaborations with Dept. of Speech and Hearing Sciences, DRES, and IFP:
• NSF 05-34106: Audiovisual distinctive-feature-based recognition of dysarthric speech
• NIH R21-DC008090A: Description and recognition of audible and visible dysarthric phonology

Collaborations with Depts. of Communications, CS, Civil Engineering, and IFSI:
• CIRS: Instrumenting research on interaction groups in complex social contexts

Collaborations with Beckman Fellows, Illinois Simulator Lab, and IFP:
• NSF 08-07329 FODAVA-Partner: Visualizing audio for anomaly detection
• NSF 08-03219 RI-Medium: Audio diarization: Towards comprehensive description of audio events
Current Doctoral Theses in Progress

• Bryce Lobdell: Models of human speech perception in noise based on intelligibility predictors and information theory
• Lae-Hoon Kim: Statistical model based multi-microphone statistical speech processing
• Arthur Kantor: ASR using segmental models and context-dependent pronunciation models
• Xiaodan Zhuang: Acoustic event detection and audio search
• Sarah Borys: Auditory modeling for landmark detection
• Jui-Ting Huang: Semi-supervised learning of multilingual speech acoustics
Research Overview: Bryce Lobdell
An information theoretic analysis of level and spectral shape factors in speech perception
Representation of speech in noise by humans

• Task is isolated phone transcription, based on 142,848 human responses.
• Human and machine classifications are compared using the Kullback-Leibler divergence (a minimal sketch of this comparison follows below).
• Model speech and noise representations are plausible in light of years of research on human speech perception.
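As a hedged illustration of the comparison step, here is a minimal sketch, assuming human and model responses for a stimulus are summarized as discrete response distributions over the phone set (the example distributions below are hypothetical):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between two discrete
    response distributions over the phone set."""
    p = np.asarray(p, dtype=float) + eps  # smooth to avoid log(0)
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical example: response distributions over three phones
# for one noisy stimulus, from human listeners and from a model.
human = [0.70, 0.20, 0.10]
model = [0.60, 0.30, 0.10]
print(kl_divergence(human, model))  # smaller = more human-like
```

Under this kind of comparison, a model representation whose classifier responses minimize the divergence is, in that sense, the most human-like.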
Representation of speech in noise by humans (cont.)

• Two particular representations offer the advantage of noise generality with respect to similarity with human responses.
• Level resolution seems to be largely irrelevant to human behavior.
• Other perceptual experiments suggest that the representation of speech in noise is context dependent.
Research Overview: Lae-Hoon Kim
Statistical Model Based Multi-Microphone Statistical Speech Processing
Goals

• Multi-microphone optimal speech enhancement (an illustrative baseline is sketched below)
• Multi-microphone robust speech recognition
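The thesis method itself is not shown on this slide; purely as an illustrative multi-microphone enhancement baseline, here is a minimal delay-and-sum beamformer sketch (the signals and delays are hypothetical, and the delays are assumed known):

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Align each microphone signal by an integer-sample delay and
    average, reinforcing the target direction relative to noise.

    signals: list of equal-length 1-D arrays, one per microphone
    delays:  per-microphone delays in samples toward the source
    """
    out = np.zeros_like(signals[0])
    for sig, d in zip(signals, delays):
        out += np.roll(sig, -d)  # crude circular integer alignment
    return out / len(signals)

# Hypothetical two-microphone example with a 3-sample inter-mic delay.
t = np.arange(8000) / 8000.0
clean = np.sin(2 * np.pi * 440 * t)
mic1 = clean + 0.3 * np.random.randn(t.size)
mic2 = np.roll(clean, 3) + 0.3 * np.random.randn(t.size)
enhanced = delay_and_sum([mic1, mic2], delays=[0, 3])
```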
Research Overview: Arthur Kantor
ASR using segmental models and context-dependent pronunciation models
Mistake-driven selection of units of speech in Large-Vocabulary ASR

• Co-articulation is a serious problem for speech recognition.
  • The model of concatenated phonemes in the dictionary pronunciation of "i don't know" gives a low likelihood for the common "ahdonno" pronunciation.
• Typically handled by triphone models: phonetic models that differ based on the phonetic context.
  • The n-t model in "i don't know" is different from the n-t+i model in "dentist", so the different pronunciations of 't' are more likely in their respective contexts.
• My approach: learn longer-context units in situations where triphones make mistakes, e.g. AY D OW N T N OW -> I_DONT_KNOW (a toy sketch follows below).
• The interesting questions are: What units? In which context?
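A toy sketch of the mistake-driven selection idea, under the assumption that recognition errors have already been localized to word spans by forced alignment (all names and counts below are hypothetical):

```python
from collections import Counter

def propose_units(error_spans, min_count=10):
    """Given word spans that the triphone system repeatedly gets
    wrong, propose frequent spans as candidate longer-context units
    (e.g. ('i', 'dont', 'know') -> a single I_DONT_KNOW unit)."""
    counts = Counter(error_spans)
    return [span for span, c in counts.items() if c >= min_count]

# Hypothetical alignment errors: "i don't know" keeps being
# misrecognized, so it becomes a candidate whole-phrase unit.
errors = [("i", "dont", "know")] * 12 + [("what", "time")] * 3
print(propose_units(errors))  # [('i', 'dont', 'know')]
```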
Segmental models - Time shrinking

• Speech is typically represented as a sequence of overlapping frames, which are used to train an HMM, where the phoneme is the hidden state.
  • Consecutive speech frames are assumed to be independent given the phoneme.
  • This is a well-known false assumption: a frame is often similar to its neighbors given the phone.
• My approach (sketched below):
  • Use a classifier (e.g. a neural net) to classify frames into phones.
  • Map stretches of similar frames into a single representative frame, and train/decode on representative frames.
• Preliminary results show a 10% relative improvement in word error rate from throwing out half of the frames.
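A minimal sketch of the time-shrinking step, assuming per-frame phone decisions are already available from some classifier (the feature dimensions and labels below are hypothetical):

```python
import numpy as np

def shrink_frames(frames, phone_labels):
    """Collapse runs of consecutive frames that share the same
    classifier phone label into one representative (mean) frame.

    frames:       (T, D) array of acoustic feature vectors
    phone_labels: length-T sequence of per-frame phone decisions
    """
    reps, run = [], [0]
    for t in range(1, len(phone_labels)):
        if phone_labels[t] == phone_labels[t - 1]:
            run.append(t)
        else:
            reps.append(frames[run].mean(axis=0))
            run = [t]
    reps.append(frames[run].mean(axis=0))
    return np.stack(reps)

# Hypothetical example: six 13-dim frames labeled AA AA AA B B AA
# shrink to three representative frames.
X = np.random.randn(6, 13)
labels = ["AA", "AA", "AA", "B", "B", "AA"]
print(shrink_frames(X, labels).shape)  # (3, 13)
```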
Research Overview: Xiaodan Zhuang
Acoustic event detection and audio search
General Event Detection

• Acoustics: data-driven feature selection; descriptors inspired by 2D cues in the spectrogram.
• Visual cues: discriminative global description; detection and localized description.
• Modeling for event detection: ANN, DBN, HMM supervectors; tandem models; rescoring.

(Figure: a timeline of example acoustic events, such as door slams, cup jingle, key jingle, and applause, against background speech/noise.)
(Figure: the audio search architecture. Spectral features are extracted on short-time windows from an archive of thousands of audio files; speech recognition produces lattices; FSA generation and FSM-based indexing build per-group indices. A query such as "CRUDE PRICES" (k r u: d p r ai s ^ z) goes through FSM-based query construction, informed by empirical and knowledge-based phone confusions, and FSM-based retrieval returns the top N file IDs.)

• Recognizing Gestural Pattern Vectors for ASR, motivated by articulatory phonology: speech features (tract variable time functions) => Gestural Pattern Vectors => gesture score => words, rather than speech features => phones => phone sequence => words.
• FSM-based retrieval on multi-lingual/cross-lingual speech, with word, triphone, phone, or other acoustic units (a simplified sketch follows below).
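The real system indexes recognition lattices with finite-state machines; as a much-simplified stand-in for that pipeline, here is a sketch of an inverted index over phone n-grams with confusion-based query expansion (the archive, phone strings, and confusion table are all hypothetical):

```python
from collections import defaultdict

def build_index(archive, n=3):
    """Inverted index mapping phone n-grams to file IDs.
    archive maps file IDs to 1-best phone strings, a stand-in
    for the lattices used in the real FSM-based system."""
    index = defaultdict(set)
    for file_id, phones in archive.items():
        for i in range(len(phones) - n + 1):
            index[tuple(phones[i:i + n])].add(file_id)
    return index

def expand_query(phones, confusions):
    """Yield single-substitution query variants from a phone
    confusion table, mimicking the confusion FSMs above."""
    yield tuple(phones)
    for i, p in enumerate(phones):
        for alt in confusions.get(p, []):
            yield tuple(phones[:i]) + (alt,) + tuple(phones[i + 1:])

archive = {"f1": ["k", "r", "u:", "d", "p", "r", "ai", "s"],
           "f2": ["g", "r", "u:", "d", "p", "r", "ai", "s"]}
index = build_index(archive)
confusions = {"k": ["g"]}  # hypothetical confusion pair
hits = set()
for variant in expand_query(["k", "r", "u:"], confusions):
    hits |= index.get(variant, set())
print(hits)  # {'f1', 'f2'}
```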
Research Overview: Sarah Borys
Landmark-based speech recognition
Phonetic Features

Type                       | Describes...                                                  | Examples
Feature Transition (FT)    | Changes in perceptually salient characteristics               | -+continuant, +-sonorant, -+speech
Manner Class (MC)          | Phonetic class                                                | nasal, stop, vowel
Place of Articulation (PA) | Whether or not a given articulator was used during production | alveolar, labial, strident, voice, front, high, low, round

Vocal tract image borrowed from http://ocw.mit.edu/OcwWeb/Linguistics-and-Philosophy/24-910Spring-2007/CourseHome/
(Figures: phone recognition accuracy and word recognition accuracy results.)
Research Overview: Jui-Ting Huang
Semi-supervised learning of speech
Semi-supervised learning of speech

• Motivation:
  • Unlabeled data are easy and cheap to obtain.
  • Labeled data can provide guidance.
• Acoustic data and model:
  • Continuous real-valued feature vectors (spectral features of speech sounds).
  • Gaussian mixture model.
• Common practice:
  • Iteratively transcribe unlabeled speech and retrain the model on the newly transcribed data (assuming the labels are correct).
  • This requires a confidence threshold for filtering classifier output.
• We propose to learn from both labeled and unlabeled data simultaneously:
  • The training criterion includes both the labeled and unlabeled sets.
  • Purely generative training, or
  • Discriminative training on labeled data + generative training on unlabeled data.
Semi-supervised learning of speech (cont.)

• The objective function to optimize during training (a plausible form is sketched below).
• Results (phone classification accuracy, %):

  Training criterion   | DL:DU = 19.5:1 | DL = Male; DU = Female
  MLE                  | 55.75          | 54.96
  MMIE                 | 55.75          | 55.28
  MMIE + U-smoothing   | 58.79          | 60.31

• Ongoing work:
  • Extend to hidden Markov models in speech recognition.
  • Deeper analysis of the relation between learning capability and the size/characteristics of the unlabeled data.
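The equation itself did not survive this transcript; what follows is only a hedged reconstruction consistent with the bullets above (discriminative training on the labeled set D_L plus generative training on the unlabeled set D_U), with the interpolation weight alpha being an assumption:

```latex
\mathcal{F}(\theta) =
  \underbrace{\sum_{(x_i, y_i) \in D_L} \log P_\theta(y_i \mid x_i)}_{\text{discriminative, labeled}}
  \;+\; \alpha
  \underbrace{\sum_{x_j \in D_U} \log P_\theta(x_j)}_{\text{generative, unlabeled}}
```

Read this way, the MMIE + U-smoothing row in the table plausibly corresponds to a criterion of this kind: a discriminative (MMI) term on the labeled set, smoothed by the likelihood of the unlabeled data.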