A brief overview of Speech Recognition and Spoken Language Processing

Advanced NLP Guest Lecture, August 31. Andrew Rosenberg.

Presentation Transcript


  1. A brief overview of Speech Recognition and Spoken Language Processing • Advanced NLP Guest Lecture • August 31 • Andrew Rosenberg

  2. Speech and NLP • Communication in Natural Language • Text: • Carefully prepared • Grammatical • Machine readable • Typos • Sometimes OCR or handwriting issues

  3. Speech and NLP • Communication in Natural Language • Speech: • Spontaneous • Less Grammatical • Machine readable, with > 10% error when using speech recognition.
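
The "> 10% error" figure above is usually reported as a word error rate. The slides do not define the metric, so the following is only a minimal sketch under the common assumption that it is the word-level edit distance (substitutions, insertions, deletions) divided by the reference length. The function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
wer = word_error_rate("you can transfer at park street",
                      "you can transfer at bark street")
```

Because insertions count as errors, this ratio can exceed 1.0 for very poor hypotheses.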

  4. NLP Tasks • Parsing • Name Tagging • Sentiment Analysis • Entity Coreference • Relation Extraction • Machine Translation

  5. Speech Tasks • Parsing • Speech isn’t always grammatical • Name Tagging • If a name isn’t “in vocabulary” what do you do? • Sentiment Analysis • How the words are spoken helps. • Entity Coreference • Relation Extraction • Machine Translation • How can these handle misrecognition errors?

  6. Speech Tasks • Speech Synthesis • Text Normalization • Dialog Management • Topic Segmentation • Language Identification • Speaker Identification and Verification • Authorship and security

  7. The traditional view • Training: Text Documents • Text Processing System: Named Entity Recognizer • Application: Text Documents

  8. The simplest approach • Training: Text Documents • Text Processing System: Named Entity Recognizer • Application: Transcribed Documents

  9. Speech is errorful text • Training: Transcribed Documents • Text Processing System: Named Entity Recognizer • Application: Transcribed Documents

  10. Speech signal can be used • Training: Transcribed Documents • Text Processing System: Named Entity Recognizer • Application: Transcribed Documents

  11. Hybrid speech signal and text • Training: Transcribed Documents, Text Documents • Text Processing System: Named Entity Recognizer • Application: Transcribed Documents

  12. Speech Recognition • Standard HMM speech recognition. • Front End • Acoustic Model • Pronunciation Model • Language Model • Decoding

  13. Speech Recognition • Front End → Acoustic Feature Vector → Acoustic Model → Phone Likelihoods → Pronunciation Model → Word Likelihoods → Language Model → Word Sequence

  14. Speech Recognition • Front End: Convert sounds into a sequence of observation vectors • Language Model: Calculate the probability of a sequence of words • Pronunciation Model: The probability of a pronunciation given a word • Acoustic Model: The probability of a set of observations given a phone label

  15. Front End • How do we convert a waveform into a useful representation? • We are looking for a vector of numbers which describes the acoustic content. • Assuming 22 kHz, 16-bit sound, modeling this directly is not feasible.

  16. Discrete Cosine Transform • Every wave can be decomposed into component sine or cosine waves. • The Fast Fourier Transform is used to do this efficiently.
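
The decomposition described above can be illustrated with a direct discrete Fourier transform. This is only a sketch: it uses the naive O(n²) definition for readability, whereas the Fast Fourier Transform computes the same coefficients in O(n log n). The function names and the 32-sample test frame are illustrative choices, not taken from the lecture.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of a list of real samples."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def magnitude_spectrum(frame):
    """Magnitudes of the non-redundant bins (0 .. n/2) for a real signal."""
    n = len(frame)
    return [abs(c) for c in dft(frame)[: n // 2 + 1]]


# A cosine that completes exactly 3 cycles within a 32-sample frame.
n_samples = 32
frame = [math.cos(2 * math.pi * 3 * t / n_samples) for t in range(n_samples)]
mags = magnitude_spectrum(frame)
peak_bin = max(range(len(mags)), key=lambda k: mags[k])
```

All of the energy of the test cosine lands in a single bin, which is the sense in which the transform recovers the component waves.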

  17. Overlapping frames • Spectrograms allow for visual inspection of spectral information. • We are looking for a compact, numerical representation. • [Figure: a sequence of frames, each labeled 10ms]
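
Cutting a signal into overlapping frames can be sketched as follows. The 10 ms step comes from the slide; the 25 ms window length is an assumption chosen for illustration (the next slide's example uses a 12.8 ms analysis window), and the sample rate is rounded to 22,000 Hz to keep the arithmetic simple.

```python
def frame_signal(samples, sample_rate, frame_ms=25.0, step_ms=10.0):
    """Split samples into overlapping frames.

    frame_ms is the window length and step_ms the hop between frame starts;
    frames overlap whenever step_ms < frame_ms. Trailing samples that do not
    fill a whole frame are dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000))
    step = int(round(sample_rate * step_ms / 1000))
    frames = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frames.append(samples[start:start + frame_len])
    return frames


# One second of placeholder "audio": the sample index stands in for amplitude.
one_second = list(range(22000))
frames = frame_signal(one_second, sample_rate=22000)
```

With these settings each frame holds 550 samples and a new frame starts every 220 samples, so one second of audio becomes 98 frames instead of 22,000 raw values.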

  18. Single Frame of FFT • Australian male /i:/ from “heed” • FFT analysis window 12.8ms • http://clas.mq.edu.au/acoustics/speech_spectra/fft_lpc_settings.html

  19. Example Spectrogram

  20. “Standard” Representation • Mel Frequency Cepstral Coefficients (MFCC) • Pipeline: Pre-Emphasis → window → FFT → Mel-Filter Bank → log → FFT⁻¹ → 12 MFCC → Deltas, with energy computed alongside • Features: 12 MFCC, 12 ∆ MFCC, 12 ∆∆ MFCC, 1 energy, 1 ∆ energy, 1 ∆∆ energy
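
The ∆ and ∆∆ features listed above are time derivatives of the static features. The sketch below assumes the simplest possible estimate, a two-point central difference with the edge frames repeated; real front ends often use a regression over a wider window. The 13 static values per frame stand for 12 MFCC plus 1 energy, and the input numbers are placeholders rather than real cepstra.

```python
def deltas(frames):
    """Central-difference delta of each coefficient, repeating edge frames."""
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out


def add_deltas(static):
    """Concatenate static, delta and delta-delta features for every frame."""
    first = deltas(static)
    second = deltas(first)
    return [s + d + dd for s, d, dd in zip(static, first, second)]


# Five frames of 13 placeholder static features that rise linearly in time.
static = [[float(t)] * 13 for t in range(5)]
features = add_deltas(static)
```

Each frame ends up with 13 + 13 + 13 = 39 values, matching the feature count on the slide.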

  21. Speech Recognition • Front End: Convert sounds into a sequence of observation vectors • Language Model: Calculate the probability of a sequence of words • Pronunciation Model: The probability of a pronunciation given a word • Acoustic Model: The probability of a set of observations given a phone label

  22. Language Model • What is the probability of a sequence of words? • Assume you have a vocabulary of V words. • How many possible sequences of N words are there?
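
As a worked answer to the question above: with V choices at each of N positions there are V^N possible sequences, and the exact probability of one sequence follows from the chain rule. The numeric values below (V = 10,000 and N = 5) are assumed purely for illustration.

```latex
P(w_1, \dots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_1, \dots, w_{i-1}),
\qquad
\#\{\text{sequences of } N \text{ words}\} = V^{N},
\qquad
\text{e.g. } V = 10^{4},\ N = 5 \ \Rightarrow\ V^{N} = 10^{20}.
```

A table with one entry per sequence is therefore out of reach even for short sequences.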

  23. N-gram Language Modeling • Simplify the calculation. • Big simplifying assumption: Each word is only dependent on the previous N-1 words.
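
For N = 2 the assumption above gives a bigram model, where each word depends only on the one before it. The sketch below estimates the probabilities by maximum likelihood from counts, with no smoothing, so any unseen bigram gets probability zero. The two training sentences and the `<s>` and `</s>` boundary markers are illustrative assumptions.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams, padding each sentence with boundary marks."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return context_counts, bigram_counts


def bigram_prob(context_counts, bigram_counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev is unseen."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


def sentence_prob(context_counts, bigram_counts, sentence):
    """Product of bigram probabilities over the padded sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= bigram_prob(context_counts, bigram_counts, prev, word)
    return prob


contexts, bigrams = train_bigram(["the orange line", "the red line"])
p_red = sentence_prob(contexts, bigrams, "the red line")
p_unseen = sentence_prob(contexts, bigrams, "the blue line")
```

The zero assigned to the unseen sentence is the main practical weakness of raw counts and the reason real systems add smoothing or back-off.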

  24. N-gram Language Modeling • Same question. Assume a V word vocabulary, and an N word sequence. How many “counts” are necessary?

  25. General Language Modeling • Any probability calculation can be used here. • Class based language models. • e.g. Recurrent neural networks

  26. Speech Recognition • Front End: Convert sounds into a sequence of observation vectors • Language Model: Calculate the probability of a sequence of words • Pronunciation Model: The probability of a pronunciation given a word • Acoustic Model: The probability of a set of observations given a phone label

  27. Pronunciation Modeling • Identify the likelihood of a phone sequence given a word sequence. • There are many simplifying assumptions in pronunciation modeling. • The pronunciation of each word is independent of the previous and following words.

  28. Dictionary as Pronunciation Model • Assume each word has a single pronunciation

  29. Weighted Dictionary as Pronunciation Model • Allow multiple pronunciations and weight each by its likelihood
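
A weighted dictionary can be sketched as a mapping from each word to a list of (phone sequence, probability) pairs. The entries, phone symbols and weights below are invented for illustration; a real lexicon would estimate the weights from data.

```python
# Hypothetical lexicon: every word maps to alternative pronunciations whose
# weights sum to 1. A plain (unweighted) dictionary is the special case of a
# single pronunciation with weight 1.0.
LEXICON = {
    "tomato": [
        (("t", "ah", "m", "ey", "t", "ow"), 0.7),
        (("t", "ah", "m", "aa", "t", "ow"), 0.3),
    ],
    "line": [
        (("l", "ay", "n"), 1.0),
    ],
}


def pronunciation_prob(word, phones):
    """P(phones | word) under the lexicon; 0.0 for anything not listed."""
    for candidate, weight in LEXICON.get(word, []):
        if candidate == tuple(phones):
            return weight
    return 0.0


def best_pronunciation(word):
    """Most likely phone sequence for a word, or None if out of vocabulary."""
    entries = LEXICON.get(word)
    if not entries:
        return None
    return max(entries, key=lambda entry: entry[1])[0]
```

Returning `None` for an out-of-vocabulary word makes the gap explicit; that gap is what grapheme-to-phoneme conversion is meant to fill.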

  30. Grapheme to Phoneme conversion • What about words that you have never seen before? • What if you don’t think you’ve seen every possible pronunciation? • How do you pronounce: “McKayla”? or “Zoomba”? • Try to learn the phonetics of the language.

  31. Letter to Sound Rules • Manually written rules that are able to convert one or more letters to one or more sounds. • T -> /t/ • H -> /h/ • TH -> /dh/ • E -> /e/ • These rules can get complicated based on the surrounding context. • K is silent when word initial and followed by N.
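
Hand-written rules of this kind can be sketched as a longest-match lookup plus explicit context checks. The T, H, TH and E rules and the silent-K rule come from the slide; the K and N entries are assumptions added so that the silent-K example can run, and the phone symbols are informal.

```python
# Letter (or letter pair) to sound rules; longer patterns are tried first.
RULES = {
    "TH": ["dh"],
    "T": ["t"],
    "H": ["h"],
    "E": ["e"],
    "K": ["k"],  # assumed default for K, not on the slide
    "N": ["n"],  # assumed, needed for the silent-K example
}


def letters_to_sounds(word):
    """Convert a word to phones with greedy longest-match rules."""
    word = word.upper()
    phones = []
    i = 0
    while i < len(word):
        # Context rule: K is silent when word initial and followed by N.
        if i == 0 and word[i] == "K" and word[1:2] == "N":
            i += 1
            continue
        for length in (2, 1):
            chunk = word[i:i + length]
            if len(chunk) == length and chunk in RULES:
                phones.extend(RULES[chunk])
                i += length
                break
        else:
            raise KeyError(f"no rule for {word[i]!r} in {word!r}")
    return phones
```

Trying two-letter patterns before single letters is what lets TH win over T followed by H; every further exception needs another hand-written check, which is why such rule sets grow complicated.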

  32. Automatic learning of Letter to Sound rules • First: Generate an alignment of letters and sounds

  33. Automatic learning of Letter to Sound rules • Second: Try to learn the mapping automatically. • Generate “Features” from the letter sequence • Use these features to predict sounds • Almost any machine learning technique can be used. • We’ll use decision trees as an example.
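
One plausible way to generate features from the letter sequence is a fixed window of neighbouring letters around each position. The sketch assumes a window of two letters on each side, with L1 and R1 as the immediate neighbours, and uses "#" as a padding symbol at word boundaries; both the padding symbol and the function name are illustrative.

```python
def letter_contexts(word, pad="#"):
    """One feature dict per letter: two letters of left and right context."""
    letters = word.lower()
    padded = pad * 2 + letters + pad * 2
    features = []
    for i, letter in enumerate(letters):
        j = i + 2  # position of this letter inside the padded string
        features.append({
            "L2": padded[j - 2],
            "L1": padded[j - 1],
            "letter": letter,
            "R1": padded[j + 1],
            "R2": padded[j + 2],
        })
    return features


photo_features = letter_contexts("photo")
```

Paired with an aligned phone for each letter, every dict becomes one training example for whichever classifier is chosen.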

  34. Decision Trees example • Context: L1, L2, p, R1, R2 • Root: R1 = “h”? • Yes (P loophole, F physics, F telephone, F graph, F photo): L1 = “o”? Yes: P loophole; No: F physics, F telephone, F graph, F photo • No (P peanut, P pay, P apple, ø apple, ø psycho, ø pterodactyl, ø pneumonia): R1 = consonant? Yes: P apple, ø psycho, ø pterodactyl, ø pneumonia; No: P peanut, P pay

  35. Decision Trees example • Context: L1, L2, p, R1, R2 • Try “PARIS” • Root: R1 = “h”? • Yes (P loophole, F physics, F telephone, F graph, F photo): L1 = “o”? Yes: P loophole; No: F physics, F telephone, F graph, F photo • No (P peanut, P pay, P apple, ø apple, ø psycho, ø pterodactyl, ø pneumonia): R1 = consonant? Yes: P apple, ø psycho, ø pterodactyl, ø pneumonia; No: P peanut, P pay

  36. Decision Trees example • Context: L1, L2, p, R1, R2 • Now try “GOPHER” • Root: R1 = “h”? • Yes (P loophole, F physics, F telephone, F graph, F photo): L1 = “o”? Yes: P loophole; No: F physics, F telephone, F graph, F photo • No (P peanut, P pay, P apple, ø apple, ø psycho, ø pterodactyl, ø pneumonia): R1 = consonant? Yes: P apple, ø psycho, ø pterodactyl, ø pneumonia; No: P peanut, P pay
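
The tree on these slides can be written out as nested conditions. This is a hand transcription of the tree as read from the slide text, not a learned model. Two assumptions are made: the mixed leaf under "R1 = consonant" predicts its majority label (silent, ø), and any letter other than a, e, i, o, u counts as a consonant.

```python
VOWELS = set("aeiou")


def predict_p(word, index):
    """Predict the sound of the letter 'p' at word[index]: 'P', 'F' or 'ø'."""
    letters = word.lower()
    if letters[index] != "p":
        raise ValueError("this tree only covers the letter 'p'")
    left = letters[index - 1] if index > 0 else "#"
    right = letters[index + 1] if index + 1 < len(letters) else "#"
    if right == "h":
        # R1 = "h": /p/ after "o" (loophole), otherwise /f/ (physics, photo).
        return "P" if left == "o" else "F"
    if right.isalpha() and right not in VOWELS:
        # R1 = consonant: mostly silent in the training words (psycho, pneumonia).
        return "ø"
    # Remaining cases (peanut, pay).
    return "P"


paris = predict_p("paris", 0)
gopher = predict_p("gopher", 2)
```

For “PARIS” the tree answers P. For “GOPHER” it follows the same path as “loophole” and also answers P, although the word is pronounced with /f/, which shows how a tree fit to a handful of examples can generalize badly.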

  37. Speech Recognition • Front End: Convert sounds into a sequence of observation vectors • Language Model: Calculate the probability of a sequence of words • Pronunciation Model: The probability of a pronunciation given a word • Acoustic Model: The probability of a set of observations given a phone label

  38. Acoustic Modeling • Hidden Markov model. • Used to model the relationship between two sequences.

  39. Hidden Markov model • In a Hidden Markov Model the state sequence is unobserved. • Only an observation sequence is available. • [Diagram: hidden states q1, q2, q3 with observations x1, x2, x3]

  40. Hidden Markov model • Observations are MFCC vectors • States are phone labels • Each state (phone) has an associated GMM modeling the MFCC likelihood • [Diagram: hidden states q1, q2, q3 with observations x1, x2, x3]
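
Recovering the most likely hidden state sequence from the observations is commonly done with the Viterbi algorithm, sketched below in log space. Everything in the toy setup is an assumption made for illustration: two phone-like states, scalar observations instead of MFCC vectors, a single Gaussian per state instead of a GMM, and hand-picked transition probabilities.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most likely state sequence for the observations."""
    best = [{s: log_start[s] + log_emit(s, observations[0]) for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this time step.
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit(s, obs)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy model: each state emits scalars around its own mean with variance 1.
STATES = ["aa", "iy"]
MEANS = {"aa": 0.0, "iy": 5.0}
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {
    "aa": {"aa": math.log(0.8), "iy": math.log(0.2)},
    "iy": {"aa": math.log(0.2), "iy": math.log(0.8)},
}


def log_emit(state, x):
    return log_gauss(x, MEANS[state], 1.0)


path = viterbi([0.1, -0.2, 4.9, 5.3], STATES, LOG_START, LOG_TRANS, log_emit)
```

Swapping `log_emit` for a per-state GMM over MFCC vectors gives the arrangement described on the slide without changing the search itself.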

  41. Training acoustic models • TIMIT • close, manual phonetic transcription • 2342 sentences • Extract MFCC vectors from each frame within each phone • For each phone, train a GMM using Expectation Maximization. • These GMMs are the Acoustic Model. • Common to use 8 or 16 Gaussian Mixture Components.
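
Expectation Maximization for a GMM alternates between soft-assigning each data point to the components and re-estimating each component from its weighted points. The sketch below is deliberately reduced: one-dimensional data and two components rather than MFCC vectors with 8 or 16 components, evenly spaced initial means, and a small variance floor. All of these are assumptions made for illustration.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def fit_gmm_1d(data, n_components=2, n_iter=30, var_floor=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    lo, hi = min(data), max(data)
    overall_mean = sum(data) / len(data)
    overall_var = max(sum((x - overall_mean) ** 2 for x in data) / len(data), var_floor)
    if n_components == 1:
        means = [overall_mean]
    else:
        means = [lo + (hi - lo) * k / (n_components - 1) for k in range(n_components)]
    variances = [overall_var] * n_components
    weights = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point (log space).
        resp = []
        for x in data:
            logs = [math.log(w) + log_gauss(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            top = max(logs)
            probs = [math.exp(value - top) for value in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(n_components):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk,
                var_floor,
            )
    return weights, means, variances


# Two well-separated clusters standing in for frames of one phone.
weights, means, variances = fit_gmm_1d([-0.2, 0.0, 0.2, 4.8, 5.0, 5.2])
```

On this toy data the two means settle on the cluster centres near 0 and 5; with real MFCC frames the same loop runs over vectors, typically with diagonal covariances.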

  42. Gaussian Mixture Model

  43. HMM Topology for Training • Rather than having one GMM per phone, it is common for acoustic models to represent each phone as 3 sub-phone states • [Diagram: HMM for /r/ with states S1, S2, S3, S4, S5]

  44. Speech in Natural Language Processing • ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH WHAT’S THE STATION NAME DOWNTOWN CROSSING UM AND THAT’LL GET YOU BACK TO THE RED LINE JUST AS EASILY

  45. Speech in Natural Language Processing • Also, from the North Station... (I think the Orange Line runs by there too so you can also catch the Orange Line... ) And then instead of transferring (um I- you know, the map is really obvious about this but) Instead of transferring at Park Street, you can transfer at (uh what’s the station name) Downtown Crossing and (um) that’ll get you back to the Red Line just as easily.

  46. Spoken Language Processing • NLP system • IR • IE • QA • Summarization • Topic Modeling • Speech Recognition

  47. Spoken Language Processing • NLP system • IR • IE • QA • Summarization • Topic Modeling • ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH WHAT’S THE STATION NAME DOWNTOWN CROSSING UM AND THAT’LL GET YOU BACK TO THE RED LINE JUST AS EASILY

  48. Dealing with Speech Errors • Robust NLP system • IR • IE • QA • Summarization • Topic Modeling • ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH WHAT’S THE STATION NAME DOWNTOWN CROSSING UM AND THAT’LL GET YOU BACK TO THE RED LINE JUST AS EASILY

  49. Automatic Speech Recognition Assumption • ASR produces a “transcript” of Speech. • ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH WHAT’S THE STATION NAME DOWNTOWN CROSSING UM AND THAT’LL GET YOU BACK TO THE RED LINE JUST AS EASILY

  50. Automatic Speech Recognition Assumption • ASR produces a “transcript” of Speech. • Also, from the North Station... (I think the Orange Line runs by there too so you can also catch the Orange Line... ) And then instead of transferring (um I- you know, the map is really obvious about this but) Instead of transferring at Park Street, you can transfer at (uh what’s the station name) Downtown Crossing and (um) that’ll get you back to the Red Line just as easily. • “Rich Transcription”
