Automatic Speech Recognition (ASR)
HISTORY, ARCHITECTURE, COMMON APPLICATIONS AND THE MARKETPLACE
Omar Khalil Gómez – Università di Pisa
What is ASR?
• Spoken language understanding is a difficult task
  • "I will become a pirate" vs "I will become a pilot"
• ASR "addresses" this task computationally
  • A mapping from an acoustic signal to a string of words
• Automatic speech understanding (ASU) is the goal
  • Understand the sentence rather than just knowing the words
• Other related fields
  • Speech synthesis, text-to-speech
ASR then and… tomorrow?
Origin
• Why should I need ASR?
• First electric implements (1800)
• Can we emulate human behaviour?
Future
• Strong AI
• Commercial applications in telecommunications
• Defensive purposes
History of Automatic Speech Recognition
From speech production to the acoustic-language model
History of ASR: From Speech Production Models to Spectral Representations
• First attempts to mimic human speech communication
  • The interest was in creating a speaking machine
  • In 1773 Kratzenstein succeeded in producing vowel sounds with tubes and pipes
  • In 1791 Kempelen in Vienna constructed an "Acoustic-Mechanical Speech Machine"
  • In the mid-1800s Charles Wheatstone built a version of von Kempelen's speaking machine
• In the first half of the 20th century, researchers at Bell Laboratories found relationships between a given speech spectrum and its sound characteristics
  • The distribution of power of a speech sound across frequency is the main concept used to model speech
  • In the 1930s Homer Dudley (Bell Labs) developed a speech synthesizer called the VODER based on that research
  • Speech pioneers like Harvey Fletcher and Homer Dudley firmly established the importance of the signal spectrum for reliable identification of the phonetic nature of a speech sound
History of ASR: Early Automatic Speech Recognizers
• Early attempts to design systems for automatic speech recognition were mostly guided by the theory of acoustic-phonetics
  • Analyze phonetic elements of speech: how are they acoustically realized?
  • Relation between place/manner of articulation and the digitized speech
• First advances:
  • Good results in digit recognition (1952)
  • Recognition of continuous speech with vowels and numbers (isolated word detection) (60s)
  • First uses of statistical syntax at the phoneme level (60s)
• But these models did not take into account the temporal non-uniformity of speech events
  • In the 70s dynamic programming (Viterbi) arrived
History of ASR: Technology Drivers since the 1970s (I)
• Tom Martin developed the first commercial ASR system, used in a few applications:
  • FedEx
• DARPA
  • Harpy: recognized speech using a vocabulary of 1,011 words
    • Phone template matching
    • The speech recognition language is represented by a connected network
      • Syntactic production rules
      • Word boundary rules
  • Hearsay
    • Generates hypotheses from information provided by parallel sources
  • HWIM
    • Phonological rules -> phoneme recognition accuracy
History of ASR: Technology Drivers since the 1970s (II)
• IBM's Tangora
  • Speaker-dependent system for a voice-activated typewriter
  • Structure of the language model represented by statistical and syntactic rules: the n-gram
  • Claude Shannon's word games strongly validated the power of the n-gram
• AT&T Bell Labs
  • Speaker-independent applications for automated telecommunication services
  • Notable work on acoustic variability and the acoustic model
  • This led to the creation of speech clustering algorithms for sound reference patterns
  • Keyword spotting was also used for training
• These two approaches had a profound influence on the evolution of human-machine speech communication
  • The rapid development of statistical methods in the 80s then caused a certain degree of convergence in system design
History of ASR: Technology Directions in the 1980s and 1990s
• Speech recognition shifted in methodology
  • From the template-based approach
  • To a rigorous statistical modeling framework (HMM)
• The application of the HMM became the preferred method in the mid 80s
  • Other systems, like ANNs, were also used
    • Less successful because of the temporal variation of speech
• In the 90s the problem was recast as an optimization problem
  • Kernel-based methods such as support vector machines
• Real applications emerged in the 90s
  • Individual research programs all over the world
  • Open-source software, APIs
  • …
Architecture of an ASR system
Designing the acoustic-language model
Architecture of an ASR system: The Noisy Channel model
• The noisy channel metaphor
  • Know how the channel distorts the source
  • Then use this knowledge to compute the most likely string over the language that best fits the input
• Best fits the input? We need a metric for similarity
• Over the whole language? We need an efficient search
Architecture of an ASR system
• To pick the sentence that best matches the noisy input
  • Bayesian inference and the HMM
    • Each state of the HMM is a type of phone
    • The connections impose constraints given the lexicon
    • Compute the probabilities of transitions over time
• The search for that sentence must be efficient
  • Viterbi decoding algorithm for HMMs
Architecture of an ASR system: Bayesian inference
• What is the most likely sentence out of all sentences in the language L given some acoustic input O?
  • Acoustic input as a sequence of individual "symbols" or "observations"
  • Sentence as a string of words
• Bayesian inference to address this problem:
  • Likelihood: computed by the acoustic model
  • Prior probability: computed by the language model
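The Bayesian decomposition the slide alludes to can be written compactly; a minimal LaTeX sketch using the slide's own symbols (O for the acoustic observation sequence, W for a candidate word string, L for the language):

```latex
\hat{W} = \operatorname*{argmax}_{W \in L} P(W \mid O)
        = \operatorname*{argmax}_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)}
        = \operatorname*{argmax}_{W \in L} \underbrace{P(O \mid W)}_{\text{acoustic model}}\;\underbrace{P(W)}_{\text{language model}}
```

P(O) can be dropped because it is the same for every candidate W, which is why only the acoustic likelihood and the language model prior remain.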
Architecture of an ASR system: The HMM
• Feature extraction
  • The acoustic waveform is sampled in frames
  • Each time window is represented with a vector of features
• Gaussian model to compute p(o|q)
  • q: a state of the HMM
  • o: an observation, i.e. a vector of features
  • This produces a vector of probabilities for each frame
    • Each component is the probability that a given phone or subphone corresponds to these features
• The HMM
  • Phonetic dictionary or lexicon
  • N-gram representation
  • Use the Viterbi algorithm
The acoustic model
Feature extraction and likelihood calculation
The acoustic model
• Likelihood: the Acoustic Model (AM)
  • Extract features of the sounds
    • The sound is processed and we get a convenient representation: MFCC
  • Gaussian mixture model to compute the likelihood of that representation for a phone (word)
    • Compute p(o|q): how likely it is that a phone or subphone corresponds to a state q in our HMM
Extracting features
• Transform the input waveform into a sequence of acoustic feature vectors: MFCC
  • Each vector represents the information in a small time window of the signal
  • Mel frequency cepstral coefficients are the common choice in speech recognition
    • Based on the idea of the cepstrum
• The first step is to convert the analog representation into a digital signal
  • Sampling: measure the amplitude at a particular time (sampling rate)
  • Quantization: represent and store the samples
• We are then ready to extract the MFCC features
Extracting features: Pre-emphasis
• Input: waveform
• Output: the waveform with the high frequencies boosted
• Reason: high frequencies carry a lot of information
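A minimal sketch of this step, assuming the usual first-order filter y[n] = x[n] - alpha * x[n-1]; the coefficient 0.97 is a common default, not something stated on the slide:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Boost high frequencies with a first-order high-pass filter: y[n] = x[n] - alpha*x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```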
Extracting features: Windowing
• Input: waveform with boosted high frequencies
• Output: framed waveform
• Reason:
  • The waveform changes very quickly
  • Its properties are not constant through time
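A sketch of the framing step; the 25 ms window, 10 ms shift and Hamming window are common assumptions, not values given on the slide:

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Slice the waveform into overlapping frames and apply a Hamming window to each."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    return np.stack([
        signal[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
```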
Extracting features: Discrete Fourier transform
• Input: windowed signal
• Output: for each of N discrete frequency bands we get the sound pressure (energy)
• Reason: get new information
  • The amount of energy at each frequency
  • Useful, for example, to distinguish vowels
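A sketch of the per-frame spectrum computation, using NumPy's real FFT; the 512-point FFT size is an assumption:

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Per-frame magnitude-squared spectrum: energy in each discrete frequency band."""
    spectrum = np.fft.rfft(frames, n=n_fft)     # DFT of each windowed frame
    return (np.abs(spectrum) ** 2) / n_fft      # power per frequency bin
```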
Extracting features: Mel filterbank and log
• Input: information about the amount of energy per frequency
• Output: log of the mel-warped frequency bands
  • Warping with the mel scale
  • The log makes the data easier to interpret
• Reason: the interesting frequency bands lie in a limited interval
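A sketch of the mel warping and log step, using the standard mel formula mel(f) = 2595 * log10(1 + f/700) and a triangular filterbank; the number of filters (26) and FFT size are assumptions:

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_energies(power_spec: np.ndarray, sample_rate: int,
                     n_filters: int = 26, n_fft: int = 512) -> np.ndarray:
    """Warp the linear power spectrum onto the mel scale and take the log."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):                 # build triangular filters
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return np.log(power_spec @ fbank.T + 1e-10)       # log compresses the dynamic range
```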
Extracting features: Inverse Discrete Fourier Transform
• Input: information about the amount of energy per frequency in the interesting intervals (the spectrum)
• Output: the cepstrum, i.e. the spectrum of the log of the spectrum (first 12 cepstral values)
• Reason: more information and useful processing advantages
  • Improves phone recognition
  • Separates the vocal tract filter from the pitch source
Extracting features: Deltas and energy
• Input: cepstral features
• Output: deltas for each 12-value cepstral vector in a window, plus the energy of the window
• Reason:
  • Energy is useful to detect stops, and from there syllables and phones
  • Delta (velocity): represents changes between windows (including energy)
  • Double delta (acceleration): change between frames in the corresponding delta feature
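A sketch of a common regression-style delta computation; the window half-width of 2 is an assumption, and double deltas can be obtained by applying the same function to the deltas:

```python
import numpy as np

def deltas(features: np.ndarray, width: int = 2) -> np.ndarray:
    """Delta ("velocity") features: d_t = sum_n n*(c_{t+n}-c_{t-n}) / (2*sum_n n^2).

    features: (n_frames, n_coeffs); returns an array of the same shape.
    """
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    num = sum(n * (padded[width + n: len(features) + width + n] -
                   padded[width - n: len(features) + width - n])
              for n in range(1, width + 1))
    return num / (2 * sum(n * n for n in range(1, width + 1)))
```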
Extracting features: MFCC
• 12 cepstral coefficients
• 12 delta cepstral coefficients
• 12 double delta cepstral coefficients
• 1 energy coefficient
• 1 delta energy coefficient
• 1 double delta energy coefficient
• = 39 MFCC features
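The whole pipeline can also be sketched with an off-the-shelf library; this assumes librosa is available, uses a hypothetical file path, and approximates the 12+energy layout above with 13 MFCCs whose first coefficient plays the energy-like role:

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)       # hypothetical file, 16 kHz is a common ASR rate

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # 12 cepstra + an energy-like c0
d1 = librosa.feature.delta(mfcc)                      # velocity
d2 = librosa.feature.delta(mfcc, order=2)             # acceleration

features = np.vstack([mfcc, d1, d2]).T                # shape: (n_frames, 39)
print(features.shape)
```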
Acoustic likelihoods: different approaches
• We have to compute the likelihood of these feature vectors given an HMM state
  • Given q and o, get p(o|q)
• For part-of-speech tagging each observation is a discrete symbol
  • For speech recognition we deal with real-valued vectors: discretize?
• The same problem appears in decoding and training
  • We need to estimate the matrix B and then change the training algorithm
• Different approaches
  • Vector quantization
  • Gaussian PDFs
  • ANNs, SVMs, kernel methods
Acoustic likelihoods: Vector quantization
• A useful pedagogical step
  • Not used in real systems
• Cluster the feature vectors
  • Get prototype vectors (a codebook)
  • Compute distances with a metric
    • Euclidean
    • Mahalanobis
  • Train with an algorithm
    • k-NN
    • k-means
• Get the most probable symbol given an observation, b(i)
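A sketch of the vector quantization idea, assuming scikit-learn for k-means (not named on the slide) and placeholder random data in place of real MFCC frames:

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for MFCC vectors pooled from the training set.
train_feats = np.random.randn(5000, 39)

# Build a codebook of 256 prototype vectors (Euclidean distance, k-means training).
codebook = KMeans(n_clusters=256, n_init=10, random_state=0).fit(train_feats)

# Quantize new observations: each frame becomes the index of its nearest prototype,
# so the state's observation distribution b reduces to a discrete table over 256 symbols.
new_frames = np.random.randn(100, 39)
symbols = codebook.predict(new_frames)
```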
Acoustic likelihoods: Gaussian PDFs
• Speech is simply not a categorical, symbolic process
  • We must compute the observation probabilities directly on the feature vectors
  • Probability density function over a continuous space
• Univariate Gaussians
  • The simplest use of a Gaussian probability estimator
  • Probability: area under the curve = 1
  • One Gaussian tells us how likely the value of a feature is to be generated by an HMM state
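A minimal sketch of the univariate case: the likelihood of one feature value under a single Gaussian attached to an HMM state.

```python
import numpy as np

def gaussian_likelihood(o: float, mean: float, var: float) -> float:
    """Likelihood of a single feature value o under a univariate Gaussian."""
    return np.exp(-((o - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
```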
Acoustic likelihoods: Gaussian PDFs
• Multivariate Gaussians
  • From a single cepstral feature to a 39-dimensional vector: a new dimensionality
  • Use a Gaussian for each feature, assuming its distribution
• Gaussian mixture models
  • A particular cepstral value might have a non-normal distribution
  • Weighted mixture of multivariate Gaussians
  • Trained with the Baum-Welch algorithm
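A hedged sketch of a per-state Gaussian mixture, assuming scikit-learn (not mentioned on the slide) and placeholder data; here the mixture is fitted with plain EM on frames already assigned to one state, whereas in a full recognizer the mixtures are re-estimated inside Baum-Welch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder: 39-dim MFCC frames assumed to be aligned to one HMM state (e.g. a subphone).
state_frames = np.random.randn(500, 39)

gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(state_frames)

o = np.random.randn(1, 39)            # one observation vector
log_b = gmm.score_samples(o)[0]       # log observation likelihood for this state
```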
Acoustic likelihoods: Probabilities and distance functions
• Log probability is much easier to work with than probability
  • Multiplying many probabilities results in very small numbers: underflow
  • The log of a number is much easier to work with
  • Computational speed, because we are adding instead of multiplying
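A tiny numerical illustration of the underflow point with made-up probabilities:

```python
import numpy as np

probs = np.full(200, 1e-3)        # 200 independent probabilities of 0.001
print(np.prod(probs))             # 0.0: the product underflows in float64
print(np.sum(np.log(probs)))      # about -1381.6: log-probabilities simply add up
```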
The language model
N-gram and lexicon
The language model
• Prior: the Language Model (LM)
  • How likely a string of words is to be a real English sentence
  • N-gram approach
• We can see the HMM as a network given the lexicon…
  • A list of words: pronunciation dictionaries
    • According to basic phones
    • Phonetic resources on the web for several languages
    • Useful in other fields
  • More than pronunciation
    • Stress level
    • Morphological information
    • Part-of-speech information
    • …
• …and the N-gram
The language model: HMM and Lexicon
• Sequences of HMM states concatenated
  • Left-to-right HMM
• Simple ASR tasks can use a direct representation
• For LVCSR we need more granularity because of the changes across frames
  • A phone can last up to a second; with a 10 ms sampling step that is about 100 frames for one phone, each potentially different
The language model: The N-gram
• Assign a probability to a sentence
  • 3-grams or 4-grams
    • Depending on the application
    • Depending on the vocabulary size
• Working with text we want to know the probability of a word given some history
  • Working with speech we want to know the probability of a phone given some history
• Chain-rule probability
  • The length of the history is N
• These features are also valid for speech recognition
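A minimal bigram (N=2) sketch of the chain-rule idea on a toy corpus; the corpus, the maximum-likelihood estimates and the lack of smoothing are all simplifications for illustration:

```python
from collections import Counter

# Toy corpus; a real LM is trained on very large text collections.
corpus = [["i", "will", "become", "a", "pilot"],
          ["i", "will", "become", "a", "pirate"],
          ["i", "will", "call", "a", "pilot"]]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens[:-1], tokens[1:]))

def bigram_prob(w_prev, w):
    """Maximum-likelihood P(w | w_prev); real systems add smoothing and back-off."""
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0

def sentence_prob(sentence):
    """Chain rule with the bigram history approximation."""
    tokens = ["<s>"] + sentence + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        p *= bigram_prob(prev, cur)
    return p

print(sentence_prob(["i", "will", "become", "a", "pilot"]))
```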
Decoding and searching
Putting it all together
Decoding a sentence: joint probabilities
• We have to combine all the probability estimators to solve the decoding problem
  • Produce the most probable string of words
• Modifications are needed in our Bayesian inference
  • Incorrect independence assumptions
  • We are underestimating the probability of each subphone
  • Reweight the probabilities by adding a language model scaling factor
• Reweighting requires one more change:
  • P(W) has a side effect as a penalty for inserting words
  • A sentence with N words collects the word insertion penalty N times
  • The more words in the sentence, the more times this penalty is taken and the less probable the sentence will be
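A hedged LaTeX sketch of the rescaled decoding criterion described above, using the conventional names LMSF (language model scaling factor) and WIP (word insertion penalty), neither of which is spelled out on the slide; N is the number of words in W:

```latex
\hat{W} = \operatorname*{argmax}_{W}\; P(O \mid W)\, P(W)^{\mathrm{LMSF}}\, \mathrm{WIP}^{N}
\;\Longleftrightarrow\;
\hat{W} = \operatorname*{argmax}_{W}\; \log P(O \mid W) + \mathrm{LMSF}\cdot\log P(W) + N\cdot\log \mathrm{WIP}
```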
Decoding a sentence: Again the HMM
• A difficult task
• HMM
  • Set of states
  • Matrix A of transition probabilities
  • A set B of observation likelihoods
• Suppose A and B are trained
  • Use the Viterbi algorithm to search efficiently
Searching: The Viterbi algorithm
• Possible combinations for the word "five"
• Viterbi trellis
  • Represents the probability that the HMM is in state j after seeing the first t observations and passing through the most likely state sequence
  • One value per state
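A compact sketch of the Viterbi trellis in log space; the matrices are assumed to be already estimated (A and pi from training, B from the acoustic model), and the state/frame counts are whatever the lexicon network dictates:

```python
import numpy as np

def viterbi(log_A: np.ndarray, log_B: np.ndarray, log_pi: np.ndarray):
    """Most likely state sequence for an HMM, computed in log space.

    log_A:  (N, N) log transition probabilities
    log_B:  (T, N) log observation likelihoods per frame and state
    log_pi: (N,)   log initial state probabilities
    """
    T, N = log_B.shape
    trellis = np.full((T, N), -np.inf)
    backptr = np.zeros((T, N), dtype=int)
    trellis[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = trellis[t - 1][:, None] + log_A      # (N, N): score of moving from i to j
        backptr[t] = np.argmax(scores, axis=0)
        trellis[t] = np.max(scores, axis=0) + log_B[t]
    # Follow the back-pointers from the best final state.
    path = [int(np.argmax(trellis[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(np.max(trellis[-1]))
```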
Training
Embedded and Viterbi training
Training: Embedded Training
• How is an HMM-based speech recognizer trained?
  • Simplest: hand-labeled isolated words
    • Train A and B separately
    • Phones hand-segmented
    • Just train by counting on the training set
    • Too expensive and slow
  • Better way: train each phone HMM embedded in an entire sentence
    • Hand phone segmentation still plays some role
    • A transcription and a wave file are enough to train
    • Baum-Welch algorithm
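A rough sketch of fitting one phone/word HMM from whole utterances, assuming the third-party hmmlearn package (not mentioned on the slide) and placeholder random features in place of real MFCC frames; hmmlearn's fit() runs EM (Baum-Welch) given only the concatenated frames and the utterance lengths:

```python
import numpy as np
from hmmlearn import hmm   # third-party package, assumed available

# Placeholder data: MFCC frames of two utterances of the same unit, concatenated.
utt1 = np.random.randn(120, 39)
utt2 = np.random.randn(95, 39)
X = np.vstack([utt1, utt2])
lengths = [len(utt1), len(utt2)]

# 3-state HMM with 2-component diagonal GMMs per state (ergodic by default;
# a left-to-right topology would further constrain the transition matrix).
model = hmm.GMMHMM(n_components=3, n_mix=2, covariance_type="diag", n_iter=20)
model.fit(X, lengths)
print(model.score(utt1))   # log-likelihood of the first utterance under the trained model
```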
Evaluation
Word error rate and the McNemar test
Evaluation: Error Rate
• Standard metric: error rate
  • Difference between the predicted string and the expected string: minimum edit distance, for WER
• Sentence error rate
• Minimum edit distance computed by a free script available from the National Institute of Standards and Technology
• Confusion matrices
  • Useful for testing whether a change to the system helped
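A small sketch of WER as minimum edit distance (substitutions, insertions, deletions) divided by the reference length, reusing the pirate/pilot example from the opening slide:

```python
import numpy as np

def wer(reference, hypothesis):
    """Word error rate: minimum edit distance over the number of reference words."""
    R, H = len(reference), len(hypothesis)
    d = np.zeros((R + 1, H + 1), dtype=int)
    d[:, 0] = np.arange(R + 1)
    d[0, :] = np.arange(H + 1)
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + sub)  # substitution or match
    return d[R, H] / R

print(wer("i will become a pilot".split(), "i will become a pirate".split()))  # 0.2
```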
Evaluation: McNemar Test
• MAPSSWE or McNemar test
  • Looks at the differences between the number of word errors of the two systems
  • Averaged across a number of segments
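A simplified matched-pairs sketch of the idea with made-up per-segment error counts; the z-statistic on the mean per-segment difference is an approximation of the MAPSSWE procedure, not the NIST implementation:

```python
import numpy as np
from scipy import stats

# Toy per-segment word-error counts for two systems evaluated on the same segments.
errors_a = np.array([3, 1, 4, 0, 2, 5, 1, 2])
errors_b = np.array([2, 1, 3, 1, 1, 4, 0, 2])

diff = errors_a - errors_b                               # per-segment error difference
z = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))
p_value = 2 * stats.norm.sf(abs(z))                      # two-sided p-value
print(f"z = {z:.2f}, p = {p_value:.3f}")
```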
Applications of ASR
Principal commercialized applications