
Introduction


Presentation Transcript


  1. Introduction

  2. C.V. • Juan Arturo Nolazco-Flores • Associate Professor Computer Science Department, ITESM, campus Monterrey, México • Courses: • Speech Processing, Computer Networks. • E-mail: jnolazco@itesm.mx. • Office: Biocomputing • Phone: ext. 2726

  3. Plan of Work • Introduction (0.5 hours) • Signal Processing and Analysis Methods (1 hour) • Bank-of-filters • Windowing • LPC • Cepstral Coefficients • Vector Quantization

  4. Plan of Work • Speech Recognition (6 hours) • HMM Basics (0.5 hours) • Isolated Word Recognition (1.5 hours) • Acoustic Modelling using HMM • Evaluation • Training D-HMM, CD-HMM • Language Model • Continuous Word Recognition Using HMM (1.0 hour) • Evaluation • Training D-HMM, CD-HMM • Language Model

  5. Introduction • What is speech recognition? • It is the identification of the words in an utterance (speech -> orthographic transcription). • Based on pattern-matching techniques. • Knowledge is learned from data, usually using stochastic techniques. • It uses powerful algorithms to optimise a mathematical model for a given task.

  6. Notes • Do not confuse with speech understanding, which is the identification of the utterance's meaning. • Do not confuse with speaker recognition: • Do not confuse with speaker identification, which is the identification of a speaker within a set of speakers. • Main problem: the speaker may not want to be identified. • Do not confuse with speaker verification, which verifies whether a speaker is the one he (she) says he (she) is. • Main problem: the speaker may have a pharyngeal problem.

  7. ASR Architecture [Block diagram: Speech Recognition System modules, with a database of speech and text, and a scoring module]

  8. Speech Recognition Disciplines • Signal Processing: spectral analysis. • Physics (Acoustics): human hearing studies. • Pattern Recognition: data clustering. • Communication and Information Theory: statistical models, the Viterbi algorithm, etc. • Linguistics: grammar and language parsing. • Physiology: knowledge-based systems. • Computer Science: efficient algorithms, UNIX, the C language.

  9. Task classification

  10. History (50's and 60's) • Speaker dependent: • Isolated digit recognition system (Bell Labs, 1952). • Phone recogniser (4 vowels and 9 consonants) (UCL, 1959). • Speaker independent: • 10-vowel recognition (MIT, 1959). • Hardware implementations of small I-SD systems (60s, Japan).

  11. History • DTW (to handle timing variability) and LPC in ASR (70's). • Connected word recognition (80's). • HMMs and Neural Networks in ASR. • Large-vocabulary, continuous ASR. • Standard databases (90's): • DARPA (Defence Advanced Research Projects Agency) project, 1000-word database. • Wall Street Journal (read-speech database). • http://www.ldc.upenn.edu/Catalog/ • Spontaneous speech (90's)

  12. Database [Block diagram as in slide 7: database of speech and text, and scoring]

  13. Database • Contains: • Waveform files stored in a specific format (e.g. PCM, µ-law, A-law, GSM). • SPHINX: /net/alf33/usr1/databases/TELEFONICA_1/WAV_FILES • Every waveform file has a transcription file (either phonemes or words). • SPHINX: • ../../training_input/train.lsn
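
A minimal Python sketch of reading one entry of such a database, assuming a 16-bit PCM WAV file and a SPHINX-style .lsn transcription file; the file names utt001.wav and train.lsn are hypothetical stand-ins, not the site-specific paths above:

```python
import wave

import numpy as np

def load_pcm_wav(path):
    # Read a 16-bit PCM WAV file into an array of samples.
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float64)
    return samples, sample_rate

def load_transcriptions(path):
    # One utterance transcription per line, e.g. "HELLO WORLD (utt001)".
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

samples, sr = load_pcm_wav("utt001.wav")          # hypothetical file name
transcripts = load_transcriptions("train.lsn")    # hypothetical file name
print(len(samples), "samples at", sr, "Hz;", transcripts[0])
```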

  14. History (Present) • Domain-dependent ASR. • Experimentation with new stochastic modelling. • Speech recognition in noise. • Speech recognition for distorted speech (cellular phones, VoIP). • Experimentation with new ways to characterise speech.

  15. Why is ASR difficult? • Speech is a complex combination of information from different levels, used to convey information. • Signal variability: • Intra-speaker variability • emotional state, environment (Lombard effect) • Inter-speaker variability • physiological differences, accent, dialect, etc. • Acoustic channel • telephone channel, background noise/speech, etc.

  16. Database [Block diagram as in slide 7: database of speech and text, and scoring]

  17. Acoustic Processing (Speech Processing Front End) • Converts the speech waveform into some type of parametric representation: s_k -> Speech Processing Front End • Parametric representations: • zero-crossing rate, • short-time energy, • short-time spectral envelope, etc. (A sketch of the first two appears below.)
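
The first two representations are simple to compute on a single frame; a minimal sketch, where the 16 kHz sampling rate and 440 Hz test tone are arbitrary illustration values:

```python
import numpy as np

def short_time_energy(frame):
    # Sum of squared samples within one analysis frame.
    return np.sum(frame.astype(np.float64) ** 2)

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose signs differ.
    signs = np.sign(frame)
    return np.mean(signs[:-1] != signs[1:])

sr = 16000                                   # arbitrary sampling rate
t = np.arange(int(0.025 * sr)) / sr          # one 25 ms frame
frame = np.sin(2 * np.pi * 440 * t)          # a 440 Hz test tone
print(short_time_energy(frame), zero_crossing_rate(frame))
```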

  18. Speech Analysis • What can we observe from this speech waveform? [Figure: speech waveform]

  19. The speech signal is a non-stationary signal.

  20. Acoustic Processing (Signal Processing Front End) • Converts the speech waveform into some type of parametric representation: s_k -> Speech Processing Front End -> time-dependent parametric representation O(t, features)

  21. Examples • Articulation position vs. time • Signal power vs. time (Cohen time-frequency analysis). • However, trying to track changes over the continuous feature and time spaces is impossible (it becomes possible under some assumptions, but many of the results are not useful from an engineering point of view, e.g. negative power spectra).

  22. Fortunately, if we take small segments of speech, we can assume the speech is stationary within these small segments (quasi-stationary).

  23. Short-Time Analysis (discrete-time time-frequency function) [Figure: waveform divided into successive analysis frames, producing the observations o(1) o(2) o(3) o(4)]

  24. Changing time resolution [Figure: the same waveform with a smaller frame shift, producing more observations: o(1) o(2) ... o(8)]

  25. In speech, normally: • the size of the segments is between 15 and 25 msec, • the frame shift (the analysis sampling period) is 10 msec. (See the framing sketch below.)
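
A sketch of this framing step, assuming the 25 ms / 10 ms values above; the random signal is only a stand-in for real speech:

```python
import numpy as np

def frame_signal(samples, sr, frame_ms=25.0, shift_ms=10.0):
    # Split a waveform into overlapping, quasi-stationary frames.
    # Number of frames: 1 + floor((N - L) / S) for frame length L, shift S.
    frame_len = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    n_frames = 1 + (len(samples) - frame_len) // shift
    return np.stack([samples[i * shift : i * shift + frame_len]
                     for i in range(n_frames)])

samples = np.random.randn(16000)    # 1 s of stand-in "speech" at 16 kHz
frames = frame_signal(samples, 16000)
print(frames.shape)                 # (98, 400): 98 frames of 400 samples each
```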

  26. Acoustic Processing (Signal Processing Front End) s_k -> Signal Processing Front End -> O = o(1) o(2) ... o(T), where: o(t) = [o(t,f_1), o(t,f_2), ..., o(t,f_P)]

  27. Acoustic Processing (Signal Processing Front End) s_k -> Signal Processing Front End -> O = o(1) o(2) ... o(T), where the front end can be implemented as: • MFCC processing, • LP cepstra processing, • physiological modelling processing.
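
As one example, an MFCC front end; a sketch using the librosa library (one library choice among several), with a random stand-in signal and the 25 ms / 10 ms framing from above; the 16 kHz rate and 13 coefficients are illustration choices:

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float32)   # stand-in for a real waveform

# 13 MFCCs per frame: 25 ms window (400 samples), 10 ms shift (160 samples).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, win_length=400, hop_length=160)
print(mfcc.shape)                            # (13, T): column t is o(t)
```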

  28. Dynamic Features • In order to incorporate dynamic features of the speech (contextual information), the first and/or second derivatives of the static features can be used. • For example, the regression-based delta computation sketched below.
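
A sketch of one common choice, the linear-regression delta formula (used, e.g., in HTK): d(t) = sum_k k*(o(t+k) - o(t-k)) / (2*sum_k k^2) over a window of +/-K frames. Applying it twice gives the second derivative, matching the extended vector on the next slide; the feature dimensions here are hypothetical:

```python
import numpy as np

def delta(features, K=2):
    # First-order dynamic features by linear regression over +/-K frames:
    # d(t) = sum_k k * (o(t+k) - o(t-k)) / (2 * sum_k k^2).
    # `features` has shape (T, P); edges are padded by repeating the ends.
    T, _ = features.shape
    padded = np.pad(features, ((K, K), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, K + 1))
    d = np.zeros_like(features, dtype=np.float64)
    for k in range(1, K + 1):
        d += k * (padded[K + k : K + k + T] - padded[K - k : K - k + T])
    return d / denom

o = np.random.randn(100, 13)                          # hypothetical static features
o_ext = np.hstack([o, delta(o), delta(delta(o))])     # shape (100, 39)
```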

  29. Acoustic Processing (Signal Processing Front End) s_k -> Signal Processing Front End -> O = o(1) o(2) ... o(T), where: o(t) = [o(t,f_1), o(t,f_2), ..., o(t,f_P), o'(t,f_1), o'(t,f_2), ..., o'(t,f_P), o''(t,f_1), o''(t,f_2), ..., o''(t,f_P)]
