Speech Processing • Presented by Erin Palmer
What constitutes Speech Processing? • Speech processing is widely used today • Can you think of some examples? • Phone dialog systems (bank, Amtrak) • Computer’s dictation feature • Amazon’s Kindle (TTS) • Cell phone • GPS • Others? • Speech processing: • Speech Recognition • Speech Generation (Text to Speech)
Speech Representation • Text? • Easy: each letter is an entity, words are composed of letters • Computer stores each letter (character) to form words (strings) • Images? • Slightly more complicated: each pixel has RGB values, stored in a 2D array • But what about speech?
Speech Representation • Unit: the phoneme • A phoneme is an interval of speech that represents a single unit of sound • Denoted by slashes: /k/ in kit • In English the correspondence between phonemes and letters is not one-to-one • /k/ is the same sound in kit and cat • /∫/ is the first sound in shell (two letters, one phoneme)
All Phonemes of the English Language: • The English language has a total of: 26 letters, but 43 phonemes
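As a quick illustration, phoneme strings can be looked up programmatically in the CMU Pronouncing Dictionary. This is a minimal sketch assuming NLTK and its cmudict corpus are installed (nltk.download('cmudict')); note CMUdict uses ARPAbet symbols rather than IPA slashes.

from nltk.corpus import cmudict

d = cmudict.dict()        # word -> list of possible phoneme sequences
print(d["kit"][0])        # ['K', 'IH1', 'T']
print(d["cat"][0])        # ['K', 'AE1', 'T']   same /k/ despite different letters
print(d["shell"][0])      # ['SH', 'EH1', 'L']  two letters "sh", one phoneme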
Speech Representation • Waveform • Constructed from raw speech by sampling the air pressure at each point in time (how often depends on the sample rate) • The samples are connected by a curve • The signal is quantized, so it needs to be smoothed; the result is the waveform that is displayed • Spectrogram • Shows energy (amplitude) as a function of both time and frequency • time (x-axis) vs. frequency (y-axis) • Gray-scale indicates the energy at each particular point • so color is the 3rd dimension • Areas of the spectrogram look denser where the amplitudes are greater • The densest regions are the areas where the vowels were pronounced, for example /ee/ in “speech” • The spectrogram also has very distinct patterns for the individual phonemes
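To make the time/frequency/energy picture concrete, here is a minimal sketch of computing and plotting a gray-scale spectrogram with SciPy and Matplotlib. The sine wave is just a stand-in for a real speech signal.

import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

fs = 16000                                    # 16 kHz sample rate
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 440 * t)               # stand-in for a speech signal

f, times, Sxx = signal.spectrogram(x, fs=fs)  # energy per (frequency, time) cell
plt.pcolormesh(times, f, 10 * np.log10(Sxx + 1e-12), cmap="gray_r")
plt.xlabel("time (s)")                        # x-axis: time
plt.ylabel("frequency (Hz)")                  # y-axis: frequency; darkness = energy
plt.show()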
Speech Representation • Intensity • A measure of the loudness of the speech • Over the course of a word, the intensity rises and then falls • Between words, the intensity drops to zero • Pitch • A measure of the fundamental frequency of the speaker’s voice • It is measured within a word • The pitch doesn’t change too drastically within a word • A good way to detect an error is to check how drastically the pitch changes • In statements the pitch stays roughly constant; in a question or an exclamation it rises on the thing being asked or exclaimed about
Waveform • The waveform is used to do various speech-related tasks on a computer • .wav format • Speech recognition and TTS both use this representation, as all other information can be derived from it
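For illustration, a minimal sketch of reading a .wav file into an array of samples with the standard library plus NumPy. The file name is hypothetical, and the code assumes 16-bit mono PCM.

import wave
import numpy as np

with wave.open("speech.wav", "rb") as w:      # hypothetical file name
    fs = w.getframerate()                     # sample rate in Hz
    raw = w.readframes(w.getnframes())        # raw PCM bytes

x = np.frombuffer(raw, dtype=np.int16)        # assumes 16-bit mono PCM
print(f"{len(x)} samples at {fs} Hz = {len(x) / fs:.2f} s of audio")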
How would a machine recognize speech? • The problem of language understanding is very difficult! • Training is required • What constitutes good training? • Depends on what you want! • Better recognition = more samples • Speaker-specific models: 1 speaker generates lots of examples • Good for this speaker, but horrible for everyone else • More general models: Area-specific • The more speakers the better, but limited in scope, for instance only technical language
What Goes into Recognition? • Speech recognition consists of 2 parts: • 1. Recognition of the phonemes • 2. Recognition of the words • The two parts are done using the following techniques: • Method 1: Recognition by template • Method 2: Using a combination of: • HMM (Hidden Markov Models) • Language Models
Recognition by Template Matching • How is it done? • Record templates from a user & store them in a library • At recognition time, record the sample and compare it against the library examples • Select the closest example • Uses: • Voice dialing on a cell phone • Simple command and control • Speaker ID
Recognition by Template Matching • Matching is done in the frequency domain • Different utterances of the same word still vary quite a bit in timing • Solution: use shift-matching • For each square compute the cumulative cost: • D[i][j] = Dist(template[i], sample[j]) + smallest_of( • D[i-1][j], • D[i][j-1], • D[i-1][j-1]) • Remember which choice you took so the matching path can be traced back (see the sketch below)
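A minimal sketch of this shift-matching recurrence (also known as dynamic time warping), assuming each frame is a NumPy feature vector and using Euclidean distance as Dist. The function and variable names are illustrative, not from any particular library.

import numpy as np

def dtw_distance(template, sample):
    """Cumulative warped distance between two sequences of feature frames."""
    n, m = len(template), len(sample)
    D = np.full((n + 1, m + 1), np.inf)       # cumulative cost table
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(template[i - 1] - sample[j - 1])
            # recurrence from the slide: local cost + smallest neighbor
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(sample, library):
    """Pick the library template with the smallest warped distance."""
    return min(library, key=lambda name: dtw_distance(library[name], sample))

Keeping an explicit backpointer table instead of only the min would recover the alignment path mentioned on the slide.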
Recognition by Template Matching • Issues • What happens with no matches? • Need to handle the “none of the above” case (e.g. reject when even the best distance is large) • What happens when there are a lot of templates? • Harder to choose • Costly • Choose templates that are very different from each other
Recognition by Template Matching • Advantages • Works well for a small number of templates (<20) • Language Independent • Speaker Specific • Easy to Train (the end user controls it) • Disadvantages • Limited by the number of templates • Speaker specific • Needs actual training examples
Extension to Template Matching • Main problem: there are a lot of words! • What if we used one template per phoneme? • That would work better in terms of generality, but some issues still remain • A better model: HMMs for the acoustic model, plus language models
Speech Recognition • Want to go from Acoustics to Text • Acoustic Modeling: • Recognize all forms of phonemes • Probability of phonemes given acoustics • Language Modeling • Expectation of what might be said • Probability of word strings • Need both to do recognition
Acoustic Models • Similar to templates for each phoneme • Each phoneme can be said very many ways • Can average over multiple examples • Different phonetic contexts • Ex. “sow” vs. “see” • Different people • Different acoustic environments • Different channels
HMMs • Markov Process: • The future depends only on the present state, not the full history • P(Xt+1 | Xt, Xt-1, … , X1) = P(Xt+1 | Xt) • Hidden Markov Models • The state is unknown (hidden) • Each state emits observations with some probability • So: given observation O and model M • Efficiently find P(O|M) • This is the sum of all path probabilities • Each path probability is the product of the transition and emission probabilities along its state sequence • Computed with dynamic programming (the forward algorithm; see the sketch below) • Decoding instead finds the single best path
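A minimal sketch of that dynamic program (the forward algorithm), assuming discrete observation symbols and made-up model parameters pi (initial state probabilities), A (transition matrix), and B (emission matrix).

import numpy as np

def forward_prob(obs, pi, A, B):
    """P(O | M): the sum over all state paths, computed incrementally.

    obs: sequence of observation symbol indices
    pi:  (N,)   initial state probabilities
    A:   (N, N) transition probabilities, A[i, j] = P(j | i)
    B:   (N, K) emission probabilities, B[i, k] = P(symbol k | state i)
    """
    alpha = pi * B[:, obs[0]]                 # probability of each start state
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # extend every path by one step
    return alpha.sum()                        # total probability of O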
HMM Recognition • Use one HMM for each phone type • Each observation gives a • Probability distribution over possible phone types • Thus we can find the most probable sequence • The Viterbi algorithm is used to find the best path (a sketch follows below)
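A minimal Viterbi sketch in the same notation as the forward algorithm above, working in log space to avoid underflow; again, the parameters are illustrative.

import numpy as np

def viterbi(obs, pi, A, B):
    """Most probable state (phone) sequence for the observations."""
    T = len(obs)
    logA = np.log(A)
    delta = np.log(pi) + np.log(B[:, obs[0]])  # best log-prob ending in each state
    back = np.zeros((T, len(pi)), dtype=int)   # backpointers
    for t in range(1, T):
        scores = delta[:, None] + logA         # scores[i, j]: come from i into j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):              # trace the backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]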
Combining Language and Acoustic Models • Not all phone sequences are equi-probable! • Find word sequences that maximize: P(W | O) • Bayes’ Law: P(W | O) = P(W)P(O|W) / P(O) • P(O) is the same for every candidate W, so it can be ignored in the maximization • HMMs give us P(O|W) • Language model: P(W)
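In practice the two models are combined in log space. A minimal sketch, where acoustic_logprob and lm_logprob are hypothetical stand-ins for the HMM score log P(O|W) and the language-model score log P(W):

def total_score(W, acoustic_logprob, lm_logprob):
    # log P(W | O) = log P(O | W) + log P(W) - log P(O);
    # P(O) is the same for every candidate, so it drops out of the argmax.
    return acoustic_logprob(W) + lm_logprob(W)

# best = max(candidates, key=lambda W: total_score(W, acoustic_logprob, lm_logprob))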
Language Models • What are the most common words? • Different domains have different distributions • Computer Science Textbook • Kids Books • Context helps prediction
Language Models • Suppose you have the following data: • Source: “Goodnight Moon” by Margaret Wise Brown

In the great green room
There was a telephone
And a red balloon
And a picture of –
The cow jumping over the moon
…
Goodnight room
Goodnight moon
Goodnight cow jumping over the moon
Language Models • Let’s build a language model! • Can have uni-gram (1-word) and bi-gram (2-word) models • But first we have to preprocess the data!
Language Models • Data Preprocessing: • First remove all line breaks and punctuation • In the great green room There was a telephone And a red balloon And a picture of The cow jumping over the moon Goodnight room Goodnight moon Goodnight cow jumping over the moon • For the purposes of speech recognition we don’t care about capitalization, so get rid of that! • in the great green room there was a telephone and a red balloon and a picture of the cow jumping over the moon goodnight room goodnight moon goodnight cow jumping over the moon • Now we have our training data! • Note for text recognition things like sentences and punctuation matter, but we usually replace those with tags, ex <sentence>I have a cat</sentence>
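A minimal sketch of that preprocessing step (strip line breaks and punctuation, lowercase, tokenize); the regular expression is one reasonable choice, not the only one.

import re

def preprocess(text):
    """Lowercase, drop punctuation, and split into word tokens."""
    text = text.lower()                       # ignore capitalization
    text = re.sub(r"[^a-z\s]", " ", text)     # remove punctuation and digits
    return text.split()                       # also collapses line breaks

tokens = preprocess("""In the great green room There was a telephone
And a red balloon And a picture of The cow jumping over the moon
Goodnight room Goodnight moon Goodnight cow jumping over the moon""")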
Language Models • Now count up how many of each word we have (uni-gram) • Then compute probabilities of each word and voila!
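Continuing the sketch above, the unigram model is just relative frequency (tokens comes from the preprocess function defined earlier):

from collections import Counter

counts = Counter(tokens)                      # word -> count
total = sum(counts.values())
unigram = {w: c / total for w, c in counts.items()}  # word -> P(word)
print(unigram["the"])                         # 4/33 for this tiny corpus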
Language Models • What are bigram models? And what are they good for? • They are more dependent on the context, so they would avoid word combinations like • “telephone room” • “I green like” • Can also use grammars, but the process of generating those is pretty complex
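The bigram model conditions each word on the previous one. In this sketch, P(w2 | w1) = count(w1 w2) / count(w1), reusing tokens and counts from the sketches above:

from collections import Counter

bigram_counts = Counter(zip(tokens, tokens[1:]))      # adjacent word pairs
bigram = {(w1, w2): c / counts[w1]                    # P(w2 | w1)
          for (w1, w2), c in bigram_counts.items()}
# e.g. P(moon | goodnight) = 1/3 here, while "telephone room" never
# occurs, so (without smoothing) its probability is zero.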
Language Models • How can we improve? • Look at more than just 2 words (tri-grams, etc.) • Replace words with types • “I am going to <City>” instead of “I am going to Paris”
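A minimal sketch of the word-to-type replacement, with a hypothetical class list:

CITIES = {"paris", "london", "tokyo"}         # hypothetical class members

def classify(tokens):
    """Replace class members with their type tag."""
    return ["<City>" if t in CITIES else t for t in tokens]

print(classify("i am going to paris".split()))
# ['i', 'am', 'going', 'to', '<City>']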
Example • Microsoft’s Dictation tool
Text To Speech • Speech Synthesis • Text Analysis • From strings of characters to words • Linguistic Analysis • From words to pronunciations and prosody • Waveform Synthesis • From pronunciations to waveforms
Text-To-Speech • What can pose difficulties? • Numbers • Abbreviations and letter sequences • Spelling errors • Punctuation • Text layout
Example! • AT&T’s speech synthesizer • http://www.research.att.com/~ttsweb/tts/demo.php#top • Windows TTS
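To try the operating system's synthesizer from code, a minimal sketch assuming the third-party pyttsx3 package is installed (on Windows it wraps SAPI5, the same engine behind Windows TTS):

import pyttsx3                                # third-party: pip install pyttsx3

engine = pyttsx3.init()                       # picks the platform's TTS engine
engine.say("Hello, this is text to speech.")
engine.runAndWait()                           # block until speech finishes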
Sources • Some of the slides were adapted from: www.speech.cs.cmu.edu/15-492 • Wikipedia.com • Amanda Stent’s Speech Processing slides