

  1. Statistical Speech Technology Group
  http://mickey.ifp.uiuc.edu/wiki/
  Jui-Ting Huang, Arthur Kantor, Mark Hasegawa-Johnson, Sarah Borys, Erica Lynn, Jeremy Tidemann, Xiaodan Zhuang, Bowon Lee, Harsh Vardhan Sharma, Lae-Hoon Kim, Su-Youn Yoon, Kyungtae Kim, Heejin Kim, Bryce Lobdell, Chi Hu, Rahul Yargop, and David Harwath

  2. Current Sources of Funding
  • Collaborations with Departments of Linguistics and Psychology
    • NSF 07-03624 RI-Collaborative Research: Landmark-based speech recognition using prosody-guided models of speech variability
    • NSF 06-23805 DHB: An interdisciplinary study of the dynamics of second language fluency
  • Collaborations with Dept. of Speech and Hearing Sciences, DRES, and IFP
    • NSF 05-34106: Audiovisual distinctive-feature-based recognition of dysarthric speech
    • NIH R21-DC008090A: Description and recognition of audible and visible dysarthric phonology
  • Collaborations with Depts. of Communications, CS, Civil Engineering, and IFSI
    • CIRS: Instrumenting research on interaction groups in complex social contexts
  • Collaborations with Beckman Fellows, Illinois Simulator Lab, and IFP
    • NSF 08-07329 FODAVA-Partner: Visualizing audio for anomaly detection
    • NSF 08-03219 RI-Medium: Audio diarization: Towards comprehensive description of audio events

  3. Current Doctoral Theses in Progress
  • Bryce Lobdell: Models of human speech perception in noise based on intelligibility predictors and information theory
  • Lae-Hoon Kim: Statistical model based multi-microphone statistical speech processing
  • Arthur Kantor: ASR using segmental models and context-dependent pronunciation models
  • Xiaodan Zhuang: Acoustic event detection and audio search
  • Sarah Borys: Auditory modeling for landmark detection
  • Jui-Ting Huang: Semi-supervised learning of multilingual speech acoustics

  4. Research Overview: Bryce Lobdell
  An information-theoretic analysis of level and spectral shape factors in speech perception

  5. Representation of speech in noise by humans
  • Task is isolated phone transcription, based on 142,848 human responses.
  • Human and machine classifications are compared using the Kullback-Leibler divergence.
  • Model speech and noise representations are plausible in light of years of research on human speech perception.
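The transcript names the comparison but not its computation. As a minimal sketch, a per-stimulus comparison of human and machine response distributions might look like the following (the phone set and counts are hypothetical; the real study used 142,848 responses):

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-12):
    """D(p || q) in nats between two discrete response distributions
    over the same phone set, estimated from raw response counts."""
    p = np.asarray(p_counts, dtype=float) + eps
    q = np.asarray(q_counts, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical response counts over candidate phones /p t k/ for one
# noisy stimulus:
human = [120, 15, 7]    # listener transcriptions
model = [100, 30, 12]   # machine classifier outputs
print(kl_divergence(human, model))  # small value => similar behavior
```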

  6. Representation of speech in noise by humans
  • Two particular representations offer the advantage of noise generality with respect to similarity with human responses.
  • Level resolution seems to be largely irrelevant to human behavior.
  • Other perceptual experiments suggest that the representation of speech in noise is context dependent.

  7. Research Overview: Lae-Hoon Kim
  Statistical Model Based Multi-Microphone Statistical Speech Processing

  8. Example situation

  9. Goals
  • Multi-microphone optimal speech enhancement
  • Multi-microphone robust speech recognition

  10. Research Overview: Arthur Kantor
  ASR using segmental models and context-dependent pronunciation models

  11. Mistake-driven selection of units of speech in large-vocabulary ASR
  • Co-articulation is a serious problem for speech recognition:
    • The concatenated-phoneme model of the dictionary pronunciation of "i don't know" gives a low likelihood to the common "ahdonno" pronunciation.
  • This is typically handled by triphone models: phonetic models that differ based on the phonetic context.
    • The n-t model in "i don't know" is different from the n-t+i model in "dentist", so the different pronunciations of 't' are more likely in their respective contexts.
  • My approach: learn longer-context units in situations where triphones make mistakes, e.g. AY D OW N T N OW → I_DONT_KNOW (see the sketch after this slide).
  • The interesting questions are: Which units? In which contexts?
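The slide does not spell out how candidate units are found. The sketch below is one plausible reading of the bullet, not the thesis's actual algorithm: count phone n-grams that recur in misrecognized regions, and promote the frequent ones to single multi-phone units (function names, thresholds, and the toy data are mine):

```python
from collections import Counter

def candidate_long_units(error_regions, min_count=50, max_len=7):
    """
    Count phone n-grams (longer than a triphone) that recur in regions
    where the triphone system made mistakes, and return frequent ones
    as candidate multi-phone units, longest first.
    error_regions: list of phone sequences from misrecognized regions.
    """
    counts = Counter()
    for phones in error_regions:
        for n in range(4, max_len + 1):
            for i in range(len(phones) - n + 1):
                counts[tuple(phones[i:i + n])] += 1
    keep = [unit for unit, c in counts.items() if c >= min_count]
    return sorted(keep, key=len, reverse=True)

# Hypothetical usage: misrecognitions of "i don't know"
regions = [["AY", "D", "OW", "N", "T", "N", "OW"]] * 60
print(candidate_long_units(regions)[0])
# ('AY', 'D', 'OW', 'N', 'T', 'N', 'OW') -> the I_DONT_KNOW unit
```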

  12. Segmental models: time shrinking
  • Speech is typically represented as a sequence of overlapping frames, which are used to train an HMM with the phoneme as the hidden state.
  • Consecutive speech frames are assumed to be independent given the phoneme.
    • This is a well-known false assumption: a frame is often similar to its neighbors given the phone.
  • My approach:
    • Use a classifier (e.g. a neural net) to classify frames into phones.
    • Map stretches of similar frames onto a single representative frame, and train/decode on the representative frames (see the sketch below).
  • Preliminary results show a 10% relative improvement in word error rate while throwing out half of the frames.
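A minimal sketch of the frame-collapsing step, assuming per-frame phone posteriors from some neural-net classifier (the "mean of the run" choice of representative frame is an assumption; the thesis may use a different statistic):

```python
import numpy as np

def shrink_frames(features, posteriors):
    """
    Collapse runs of consecutive frames whose most likely phone
    (argmax of the classifier posteriors) is the same into a single
    representative frame: here, the mean of the run's feature vectors.
    features:   (T, D) array of acoustic frames
    posteriors: (T, P) array of per-frame phone posteriors
    """
    labels = posteriors.argmax(axis=1)
    reps, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            reps.append(features[start:t].mean(axis=0))
            start = t
    return np.stack(reps)

# Example: 6 frames, 3 phone classes; two runs collapse to 2 frames
feats = np.random.randn(6, 13)                      # e.g. 13 MFCCs per frame
post = np.array([[.8, .1, .1]] * 3 + [[.1, .8, .1]] * 3)
print(shrink_frames(feats, post).shape)             # (2, 13)
```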

  13. Research Overview: Xiaodan Zhuang
  Acoustic event detection and audio search

  14. General Event Detection
  • Acoustics: data-driven feature selection; descriptors inspired by 2D cues in the spectrogram
  • Visual cues: discriminative global description; detection and localized description
  • Modeling for event detection: ANN, DBN, HMM supervectors; tandem models, rescoring
  [Timeline figure: example event classes such as background speech/noise, door slam, cup jingle, applause, key jingle]

  15. Audio search
  [System diagram: thousands of audio files in an archive → feature extraction (spectral features on short-time windows) → speech recognition lattices → FSA generation → FSM indexes, merged into group indices; a query such as "CRUDE PRICES" (k r u: d p r ai s ^ z) → FSM-based query construction → FSM-based retrieval → top-N file IDs; retrieval incorporates empirical and knowledge-based phone confusions]
  • Recognizing Gestural Pattern Vectors for ASR, motivated by articulatory phonology: speech features (tract variable time functions) => Gestural Pattern Vector => gesture score => words
  • FSM-based retrieval on multi-lingual/cross-lingual speech: speech features => phones => phone sequence => words
  • Indexing units: word, triphone, phone, other acoustic units
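The diagram's FSA/FSM machinery is too large to reproduce here. As a much-simplified stand-in (names, the n-gram order, and the toy archive are mine, and a real system indexes lattices with phone-confusion models rather than 1-best phone strings), an inverted index over phone n-grams captures the indexing/retrieval split:

```python
from collections import defaultdict

def build_index(phone_strings, n=3):
    """Inverted index: phone n-gram -> set of file IDs containing it.
    phone_strings: dict of file_id -> recognized phone list (1-best)."""
    index = defaultdict(set)
    for file_id, phones in phone_strings.items():
        for i in range(len(phones) - n + 1):
            index[tuple(phones[i:i + n])].add(file_id)
    return index

def retrieve(index, query_phones, n=3):
    """Rank files by how many of the query's phone n-grams they contain."""
    scores = defaultdict(int)
    for i in range(len(query_phones) - n + 1):
        for file_id in index.get(tuple(query_phones[i:i + n]), ()):
            scores[file_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

# The slide's example query "CRUDE PRICES" as a phone string
archive = {"f1": ["k", "r", "u:", "d", "p", "r", "ai", "s", "^", "z"],
           "f2": ["d", "p", "r", "ai", "s"]}
idx = build_index(archive)
print(retrieve(idx, ["k", "r", "u:", "d"]))  # ['f1']
```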

  16. Research Overview: Sarah Borys
  Landmark-based speech recognition

  17. Phonetic Features

  Type                          Describes…                                         Examples
  Feature Transition (FT)       Changes in perceptually salient characteristics    -+continuant, +-sonorant, -+speech
  Manner Class (MC)             Phonetic class                                     nasal, stop, vowel
  Place of Articulation (PA)    Whether or not a given articulator was used        alveolar, labial, strident, voice,
                                during production                                  front, high, low, round

  Vocal tract image borrowed from http://ocw.mit.edu/OcwWeb/Linguistics-and-Philosophy/24-910Spring-2007/CourseHome/
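To make the FT row concrete, here is a toy sketch of detecting sonorant transitions at phone boundaries; the phone inventory, manner table, and binary sonorant mapping are illustrative simplifications, not the project's actual feature set:

```python
# Toy illustration of "feature transitions": mark boundaries where a
# binary distinctive feature (here, sonorant) changes value.
MANNER = {"n": "nasal", "t": "stop", "iy": "vowel", "s": "fricative"}
SONORANT = {"nasal": True, "vowel": True, "stop": False, "fricative": False}

def feature_transitions(phones):
    """Return (-+ / +-) sonorant transitions between adjacent phones."""
    out = []
    for a, b in zip(phones, phones[1:]):
        sa, sb = SONORANT[MANNER[a]], SONORANT[MANNER[b]]
        if sa != sb:
            out.append(("+-sonorant" if sa else "-+sonorant", a, b))
    return out

print(feature_transitions(["s", "iy", "t"]))
# [('-+sonorant', 's', 'iy'), ('+-sonorant', 'iy', 't')]
```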

  18. [Charts: phone recognition accuracy and word recognition accuracy]

  19. Research Overview: Jui-Ting Huang
  Semi-supervised learning of speech

  20. Semi-supervised learning of speech
  • Motivation:
    • Unlabeled data are easy and cheap to obtain.
    • Labeled data can provide guidance.
  • Acoustic data and model:
    • Continuous real-valued feature vectors (spectral features of speech sounds)
    • Gaussian mixture model
  • The usual approach:
    • Iteratively transcribe unlabeled speech and retrain the model on the newly transcribed data (assuming the new labels are correct).
    • A confidence threshold must be defined for filtering the classifier output.
  • We propose to learn from labeled and unlabeled data simultaneously: the training criterion includes both the labeled and the unlabeled set.
    • Purely generative training (a sketch follows this slide)
    • Discriminative training on labeled data + generative training on unlabeled data
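A minimal sketch of the "purely generative training on both sets" option, simplified to one 1-D Gaussian per class rather than the mixtures the slide uses; the function names and single-Gaussian simplification are mine:

```python
import numpy as np

def gauss(x, mean, var):
    """1-D Gaussian density."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def semi_supervised_em_step(x_l, y_l, x_u, means, variances, priors):
    """
    One EM-style update for class-conditional 1-D Gaussians in which
    labeled points contribute hard (known) class assignments and
    unlabeled points contribute soft posteriors, so a single generative
    criterion covers both sets.
    """
    K = len(means)
    # E-step on unlabeled data: posterior responsibility of each class
    lik = np.stack([priors[k] * gauss(x_u, means[k], variances[k])
                    for k in range(K)])          # (K, N_u)
    resp_u = lik / lik.sum(axis=0)
    resp_l = np.eye(K)[:, y_l]                   # (K, N_l) one-hot labels
    # M-step on the pooled statistics
    x = np.concatenate([x_l, x_u])
    r = np.concatenate([resp_l, resp_u], axis=1)
    nk = r.sum(axis=1)
    means_new = (r * x).sum(axis=1) / nk
    vars_new = (r * (x - means_new[:, None]) ** 2).sum(axis=1) / nk
    return means_new, vars_new, nk / nk.sum()

# Toy usage: two labeled points, three unlabeled points, two classes
m, v, p = semi_supervised_em_step(
    np.array([0.1, 1.9]), np.array([0, 1]), np.array([0.0, 2.1, -0.2]),
    np.array([0.0, 2.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5]))
```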

  21. Semi-supervised learning of speech (cont.)
  • The objective function to optimize during training combines the labeled and unlabeled sets (a sketch follows this slide).
  • Results (phone classification accuracy, %):

    Training criterion      DL:DU = 19.5:1    DL = Male; DU = Female
    MLE                     55.75             54.96
    MMIE                    55.75             55.28
    MMIE + U-smoothing      58.79             60.31

  • Ongoing work:
    • Extend to hidden Markov models in speech recognition.
    • Deeper analysis of the relation between learning capability and the size/characteristics of the unlabeled data.
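The transcript drops the formula itself. Given the description above (discriminative training on labeled data plus generative training on unlabeled data) and the "MMIE + U-smoothing" row, a plausible form, offered as an assumption rather than the slide's actual equation, is:

```latex
\mathcal{F}(\lambda)
  = \underbrace{\sum_{(x,y)\in D_L}
      \log \frac{p_\lambda(x \mid y)\,P(y)}
                {\sum_{y'} p_\lambda(x \mid y')\,P(y')}}_{\text{MMIE term on } D_L}
  \;+\; \alpha \underbrace{\sum_{x\in D_U}
      \log \sum_{y'} p_\lambda(x \mid y')\,P(y')}_{\text{generative smoothing on } D_U}
```

Here \(\alpha\) would weight the unlabeled-data smoothing term against the discriminative term.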
