Automatic Speech Recognition (ASR)
HISTORY, ARCHITECTURE, COMMON APPLICATIONS AND THE MARKETPLACE
Omar Khalil Gómez – Università di Pisa
What is ASR?
• Spoken language understanding is a difficult task
  • "I will become a pirate" vs "I will become a pilot"
• ASR "addresses" this task computationally
  • A mapping from an acoustic signal to a string of words
• Automatic speech understanding (ASU) is the goal
  • Understand the sentence rather than just knowing the words
• Other related fields
  • Speech synthesis, text-to-speech
ASR then and… tomorrow?
Origin
• Why should I need ASR?
• First electric implements (1800)
• Can we emulate human behaviour?
Future
• Strong AI
• Commercial applications in telecommunications
• Defensive purposes
History of Automatic Speech Recognition
From speech production to the acoustic-language model
History of ASR: From Speech Production Models to Spectral Representations
• First attempts to mimic human speech communication
  • The interest was in creating a speaking machine
  • In 1773 Kratzenstein succeeded in producing vowel sounds with tubes and pipes
  • In 1791 Kempelen in Vienna constructed an "Acoustic-Mechanical Speech Machine"
  • In the mid-1800s Charles Wheatstone built a version of von Kempelen's speaking machine
• In the first half of the 20th century, researchers at Bell Laboratories found relationships between a given speech spectrum and its sound characteristics
  • The distribution of power of a speech sound across frequency is the main concept used to model speech
  • In the 1930s Homer Dudley (Bell Labs) developed a speech synthesizer called the VODER based on that research
  • Speech pioneers like Harvey Fletcher and Homer Dudley firmly established the importance of the signal spectrum for reliable identification of the phonetic nature of a speech sound
History of ASR: Early Automatic Speech Recognizers
• Early attempts to design systems for automatic speech recognition were mostly guided by the theory of acoustic-phonetics
  • Analyze phonetic elements of speech: how are they acoustically realized?
  • Relation between place/manner of articulation and the digitized speech
• First advances:
  • Good results in digit recognition (1952)
  • Recognition of continuous speech with vowels and numbers (isolated word detection) (60s)
  • First uses of statistical syntax at the phoneme level (60s)
• But these models did not take into account the temporal non-uniformity of speech events
  • In the 70s dynamic programming (Viterbi) arrived
History of ASR: Technology Drivers since the 1970s (I)
• Tom Martin developed the first commercial ASR system, used in a few applications:
  • FedEx
• DARPA
  • Harpy: recognized speech using a vocabulary of 1,011 words
    • Phone template matching
    • The speech recognition language is represented by a connected network
      • Syntactic production rules
      • Word boundary rules
  • Hearsay
    • Generates hypotheses from information provided by parallel sources
  • HWIM
    • Phonological rules -> phoneme recognition accuracy
History of ASR: Technology Drivers since the 1970s (II)
• IBM's Tangora
  • Speaker-dependent system for a voice-activated typewriter
  • Structure of the language model represented by statistical and syntactic rules: the n-gram
  • Claude Shannon's word games strongly validated the power of the n-gram
• AT&T Bell Labs
  • Speaker-independent applications for automated telecommunication services
  • Notable work on acoustic variability and the acoustic model
  • This led to the creation of speech clustering algorithms for sound reference patterns
  • Keyword spotting was also used for training
• These two approaches had a profound influence on the evolution of human-machine speech communication
  • The rapid development of statistical methods in the 80s then caused a certain degree of convergence in system design
History of ASR: Technology Directions in the 1980s and 1990s
• Speech recognition shifted in methodology
  • From the template-based approach
  • To a rigorous statistical modeling framework (HMM)
• The application of the HMM became the preferred method in the mid 80s
  • Other systems, like ANNs, were also used
    • Less successful because of the temporal variation of speech
• In the 90s the problem was recast as an optimization problem
  • Kernel-based methods such as support vector machines
• Real applications emerged in the 90s
  • Individual research programs all over the world
  • Open-source software, APIs
  • …
Architecture of an ASR system
Designing the acoustic-language model
Architecture of an ASR system: The Noisy Channel model
• The noisy channel metaphor
  • Know how the channel distorts the source
  • Then use this knowledge to compute the most likely string over the language that best fits the input
• Best fits the input? We need a metric for similarity
• Over the whole language? We need an efficient search
Architecture of an ASR system
• To pick the sentence that best matches the noisy input
  • Bayesian inference and the HMM
    • Each state of the HMM is a type of phone
    • The connections impose constraints given the lexicon
    • Compute the probabilities of transitions over time
• The search for that sentence must be efficient
  • Viterbi decoding algorithm for HMMs
Architecture of an ASR system: Bayesian inference
• What is the most likely sentence out of all sentences in the language L given some acoustic input O?
  • Acoustic input as a sequence of individual "symbols" or "observations"
  • Sentence as a string of words
• Bayesian inference to address this problem:
  • Likelihood: computed by the acoustic model
  • Prior probability: computed by the language model
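The Bayesian decomposition the slide alludes to can be written compactly; a minimal LaTeX sketch using the slide's own symbols (O for the acoustic observation sequence, W for a candidate word string, L for the language):

```latex
\hat{W} = \operatorname*{argmax}_{W \in L} P(W \mid O)
        = \operatorname*{argmax}_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)}
        = \operatorname*{argmax}_{W \in L} \underbrace{P(O \mid W)}_{\text{acoustic model}}\;\underbrace{P(W)}_{\text{language model}}
```

P(O) can be dropped because it is the same for every candidate W, which is why only the acoustic likelihood and the language model prior remain.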
Architecture of an ASR system: The HMM
• Feature extraction
  • The acoustic waveform is sampled in frames
  • Each time window is represented with a vector of features
• Gaussian model to compute p(o|q)
  • q: a state of the HMM
  • o: an observation, i.e. a vector of features
  • This produces a vector of probabilities for each frame
    • Each component is the probability that a given phone or subphone corresponds to these features
• The HMM
  • Phonetic dictionary or lexicon
  • N-gram representation
  • Use the Viterbi algorithm
The acoustic model
Feature extraction and likelihood calculation
The acoustic model
• Likelihood: the Acoustic Model (AM)
  • Extract features of the sounds
    • The sound is processed and we get a convenient representation: MFCC
  • Gaussian mixture model to compute the likelihood of that representation for a phone (word)
    • Compute p(o|q): how likely it is that a phone or subphone corresponds to a state q in our HMM
Extracting features
• Transform the input waveform into a sequence of acoustic feature vectors: MFCC
  • Each vector represents the information in a small time window of the signal
  • Mel frequency cepstral coefficients are the common choice in speech recognition
    • Based on the idea of the cepstrum
• The first step is to convert the analog representation into a digital signal
  • Sampling: measure the amplitude at a particular time (sampling rate)
  • Quantization: represent and store the samples
• We are then ready to extract the MFCC features
Extracting features: Pre-emphasis
• Input: waveform
• Output: the waveform with the high frequencies boosted
• Reason: high frequencies carry a lot of information
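A minimal sketch of this step, assuming the usual first-order filter y[n] = x[n] - alpha * x[n-1]; the coefficient 0.97 is a common default, not something stated on the slide:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Boost high frequencies with a first-order high-pass filter: y[n] = x[n] - alpha*x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```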
Extracting features: Windowing
• Input: waveform with boosted high frequencies
• Output: framed waveform
• Reason:
  • The waveform changes very quickly
  • Its properties are not constant through time
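A sketch of the framing step; the 25 ms window, 10 ms shift and Hamming window are common assumptions, not values given on the slide:

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Slice the waveform into overlapping frames and apply a Hamming window to each."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    return np.stack([
        signal[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
```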
Extracting features: Discrete Fourier transform
• Input: windowed signal
• Output: for each of N discrete frequency bands we get the sound pressure (energy)
• Reason: get new information
  • The amount of energy at each frequency
  • Useful, for example, to distinguish vowels
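A sketch of the per-frame spectrum computation, using NumPy's real FFT; the 512-point FFT size is an assumption:

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Per-frame magnitude-squared spectrum: energy in each discrete frequency band."""
    spectrum = np.fft.rfft(frames, n=n_fft)     # DFT of each windowed frame
    return (np.abs(spectrum) ** 2) / n_fft      # power per frequency bin
```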
Extracting features: Mel filterbank and log
• Input: information about the amount of energy per frequency
• Output: log of the mel-warped frequency bands
  • Warping with the mel scale
  • The log makes the data easier to interpret
• Reason: the interesting frequency bands lie in a limited interval
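A sketch of the mel warping and log step, using the standard mel formula mel(f) = 2595 * log10(1 + f/700) and a triangular filterbank; the number of filters (26) and FFT size are assumptions:

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_energies(power_spec: np.ndarray, sample_rate: int,
                     n_filters: int = 26, n_fft: int = 512) -> np.ndarray:
    """Warp the linear power spectrum onto the mel scale and take the log."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):                 # build triangular filters
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return np.log(power_spec @ fbank.T + 1e-10)       # log compresses the dynamic range
```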
Extracting features: Inverse Discrete Fourier Transform
• Input: information about the amount of energy per frequency in the interesting intervals (the spectrum)
• Output: the cepstrum, i.e. the spectrum of the log of the spectrum (first 12 cepstral values)
• Reason: more information and useful processing advantages
  • Improves phone recognition
  • Separates the vocal tract filter from the pitch source
Extracting features: Deltas and energy
• Input: cepstral features
• Output: deltas for each 12-value cepstral vector in a window, plus the energy of the window
• Reason:
  • Energy is useful to detect stops, and from there syllables and phones
  • Delta (velocity): represents changes between windows (including energy)
  • Double delta (acceleration): change between frames in the corresponding delta feature
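A sketch of a common regression-style delta computation; the window half-width of 2 is an assumption, and double deltas can be obtained by applying the same function to the deltas:

```python
import numpy as np

def deltas(features: np.ndarray, width: int = 2) -> np.ndarray:
    """Delta ("velocity") features: d_t = sum_n n*(c_{t+n}-c_{t-n}) / (2*sum_n n^2).

    features: (n_frames, n_coeffs); returns an array of the same shape.
    """
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    num = sum(n * (padded[width + n: len(features) + width + n] -
                   padded[width - n: len(features) + width - n])
              for n in range(1, width + 1))
    return num / (2 * sum(n * n for n in range(1, width + 1)))
```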
Extracting features: MFCC
• 12 cepstral coefficients
• 12 delta cepstral coefficients
• 12 double delta cepstral coefficients
• 1 energy coefficient
• 1 delta energy coefficient
• 1 double delta energy coefficient
• = 39 MFCC features
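The whole pipeline can also be sketched with an off-the-shelf library; this assumes librosa is available, uses a hypothetical file path, and approximates the 12+energy layout above with 13 MFCCs whose first coefficient plays the energy-like role:

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)       # hypothetical file, 16 kHz is a common ASR rate

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # 12 cepstra + an energy-like c0
d1 = librosa.feature.delta(mfcc)                      # velocity
d2 = librosa.feature.delta(mfcc, order=2)             # acceleration

features = np.vstack([mfcc, d1, d2]).T                # shape: (n_frames, 39)
print(features.shape)
```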
Acoustic likelihoods: different approaches
• We have to compute the likelihood of these feature vectors given an HMM state
  • Given q and o, get p(o|q)
• For part-of-speech tagging each observation is a discrete symbol
  • For speech recognition we deal with real-valued vectors: discretize?
• The same problem appears in decoding and training
  • We need to estimate the matrix B and then change the training algorithm
• Different approaches
  • Vector quantization
  • Gaussian PDFs
  • ANNs, SVMs, kernel methods
Acoustic likelihoods: Vector quantization
• A useful pedagogical step
  • Not used in real systems
• Cluster the feature vectors
  • Get prototype vectors (a codebook)
  • Compute distances with a metric
    • Euclidean
    • Mahalanobis
  • Train with an algorithm
    • k-NN
    • k-means
• Get the most probable symbol given an observation, b(i)
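A sketch of the vector quantization idea, assuming scikit-learn for k-means (not named on the slide) and placeholder random data in place of real MFCC frames:

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for MFCC vectors pooled from the training set.
train_feats = np.random.randn(5000, 39)

# Build a codebook of 256 prototype vectors (Euclidean distance, k-means training).
codebook = KMeans(n_clusters=256, n_init=10, random_state=0).fit(train_feats)

# Quantize new observations: each frame becomes the index of its nearest prototype,
# so the state's observation distribution b reduces to a discrete table over 256 symbols.
new_frames = np.random.randn(100, 39)
symbols = codebook.predict(new_frames)
```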
Acoustic likelihoods: Gaussian PDFs
• Speech is simply not a categorical, symbolic process
  • We must compute the observation probabilities directly on the feature vectors
  • Probability density function over a continuous space
• Univariate Gaussians
  • The simplest use of a Gaussian probability estimator
  • Probability: area under the curve = 1
  • One Gaussian tells us how likely the value of a feature is to be generated by an HMM state
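A minimal sketch of the univariate case: the likelihood of one feature value under a single Gaussian attached to an HMM state.

```python
import numpy as np

def gaussian_likelihood(o: float, mean: float, var: float) -> float:
    """Likelihood of a single feature value o under a univariate Gaussian."""
    return np.exp(-((o - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
```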
Acoustic likelihoods: Gaussian PDFs
• Multivariate Gaussians
  • From a single cepstral feature to a 39-dimensional vector: a new dimensionality
  • Use a Gaussian for each feature, assuming its distribution
• Gaussian mixture models
  • A particular cepstral value might have a non-normal distribution
  • Weighted mixture of multivariate Gaussians
  • Trained with the Baum-Welch algorithm
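A hedged sketch of a per-state Gaussian mixture, assuming scikit-learn (not mentioned on the slide) and placeholder data; here the mixture is fitted with plain EM on frames already assigned to one state, whereas in a full recognizer the mixtures are re-estimated inside Baum-Welch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder: 39-dim MFCC frames assumed to be aligned to one HMM state (e.g. a subphone).
state_frames = np.random.randn(500, 39)

gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(state_frames)

o = np.random.randn(1, 39)            # one observation vector
log_b = gmm.score_samples(o)[0]       # log observation likelihood for this state
```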
Acoustic likelihoods: Probabilities and distance functions
• Log probability is much easier to work with than probability
  • Multiplying many probabilities results in very small numbers: underflow
  • The log of a number is much easier to work with
  • Computational speed, because we are adding instead of multiplying
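A tiny numerical illustration of the underflow point with made-up probabilities:

```python
import numpy as np

probs = np.full(200, 1e-3)        # 200 independent probabilities of 0.001
print(np.prod(probs))             # 0.0: the product underflows in float64
print(np.sum(np.log(probs)))      # about -1381.6: log-probabilities simply add up
```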
The language model
N-gram and lexicon
The language model
• Prior: the Language Model (LM)
  • How likely a string of words is to be a real English sentence
  • N-gram approach
• We can see the HMM as a network given the lexicon…
  • A list of words: pronunciation dictionaries
    • According to basic phones
    • Phonetic resources on the web for several languages
    • Useful in other fields
  • More than pronunciation
    • Stress level
    • Morphological information
    • Part-of-speech information
    • …
• …and the N-gram
The language model: HMM and Lexicon
• Sequences of HMM states concatenated
  • Left-to-right HMM
• Simple ASR tasks can use a direct representation
• For LVCSR we need more granularity because of the changes across frames
  • A phone can last up to a second; with a 10 ms sampling step that is about 100 frames for one phone, each potentially different
The language model: The N-gram
• Assign a probability to a sentence
  • 3-grams or 4-grams
    • Depending on the application
    • Depending on the vocabulary size
• Working with text we want to know the probability of a word given some history
  • Working with speech we want to know the probability of a phone given some history
• Chain-rule probability
  • The length of the history is N
• These features are also valid for speech recognition
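A minimal bigram (N=2) sketch of the chain-rule idea on a toy corpus; the corpus, the maximum-likelihood estimates and the lack of smoothing are all simplifications for illustration:

```python
from collections import Counter

# Toy corpus; a real LM is trained on very large text collections.
corpus = [["i", "will", "become", "a", "pilot"],
          ["i", "will", "become", "a", "pirate"],
          ["i", "will", "call", "a", "pilot"]]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens[:-1], tokens[1:]))

def bigram_prob(w_prev, w):
    """Maximum-likelihood P(w | w_prev); real systems add smoothing and back-off."""
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0

def sentence_prob(sentence):
    """Chain rule with the bigram history approximation."""
    tokens = ["<s>"] + sentence + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        p *= bigram_prob(prev, cur)
    return p

print(sentence_prob(["i", "will", "become", "a", "pilot"]))
```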
Decoding and searching
Putting it all together
Decoding a sentence: joint probabilities
• We have to combine all the probability estimators to solve the decoding problem
  • Produce the most probable string of words
• Modifications are needed in our Bayesian inference
  • Incorrect independence assumptions
  • We are underestimating the probability of each subphone
  • Reweight the probabilities by adding a language model scaling factor
• Reweighting requires one more change:
  • P(W) has a side effect as a penalty for inserting words
  • A sentence with N words collects the word insertion penalty N times
  • The more words in the sentence, the more times this penalty is taken and the less probable the sentence will be
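A hedged LaTeX sketch of the rescaled decoding criterion described above, using the conventional names LMSF (language model scaling factor) and WIP (word insertion penalty), neither of which is spelled out on the slide; N is the number of words in W:

```latex
\hat{W} = \operatorname*{argmax}_{W}\; P(O \mid W)\, P(W)^{\mathrm{LMSF}}\, \mathrm{WIP}^{N}
\;\Longleftrightarrow\;
\hat{W} = \operatorname*{argmax}_{W}\; \log P(O \mid W) + \mathrm{LMSF}\cdot\log P(W) + N\cdot\log \mathrm{WIP}
```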
Decoding a sentence: Again the HMM
• A difficult task
• HMM
  • Set of states
  • Matrix A of transition probabilities
  • A set B of observation likelihoods
• Suppose A and B are trained
  • Use the Viterbi algorithm to search efficiently
Searching: The Viterbi algorithm
• Possible combinations for the word "five"
• Viterbi trellis
  • Represents the probability that the HMM is in state j after seeing the first t observations and passing through the most likely state sequence
  • One value per state
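A compact sketch of the Viterbi trellis in log space; the matrices are assumed to be already estimated (A and pi from training, B from the acoustic model), and the state/frame counts are whatever the lexicon network dictates:

```python
import numpy as np

def viterbi(log_A: np.ndarray, log_B: np.ndarray, log_pi: np.ndarray):
    """Most likely state sequence for an HMM, computed in log space.

    log_A:  (N, N) log transition probabilities
    log_B:  (T, N) log observation likelihoods per frame and state
    log_pi: (N,)   log initial state probabilities
    """
    T, N = log_B.shape
    trellis = np.full((T, N), -np.inf)
    backptr = np.zeros((T, N), dtype=int)
    trellis[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = trellis[t - 1][:, None] + log_A      # (N, N): score of moving from i to j
        backptr[t] = np.argmax(scores, axis=0)
        trellis[t] = np.max(scores, axis=0) + log_B[t]
    # Follow the back-pointers from the best final state.
    path = [int(np.argmax(trellis[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(np.max(trellis[-1]))
```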
Training
Embedded and Viterbi training
Training: Embedded Training
• How is an HMM-based speech recognizer trained?
  • Simplest: hand-labeled isolated words
    • Train A and B separately
    • Phones hand-segmented
    • Just train by counting on the training set
    • Too expensive and slow
  • Better way: train each phone HMM embedded in an entire sentence
    • Hand phone segmentation still plays some role
    • A transcription and a wave file are enough to train
    • Baum-Welch algorithm
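A rough sketch of fitting one phone/word HMM from whole utterances, assuming the third-party hmmlearn package (not mentioned on the slide) and placeholder random features in place of real MFCC frames; hmmlearn's fit() runs EM (Baum-Welch) given only the concatenated frames and the utterance lengths:

```python
import numpy as np
from hmmlearn import hmm   # third-party package, assumed available

# Placeholder data: MFCC frames of two utterances of the same unit, concatenated.
utt1 = np.random.randn(120, 39)
utt2 = np.random.randn(95, 39)
X = np.vstack([utt1, utt2])
lengths = [len(utt1), len(utt2)]

# 3-state HMM with 2-component diagonal GMMs per state (ergodic by default;
# a left-to-right topology would further constrain the transition matrix).
model = hmm.GMMHMM(n_components=3, n_mix=2, covariance_type="diag", n_iter=20)
model.fit(X, lengths)
print(model.score(utt1))   # log-likelihood of the first utterance under the trained model
```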
Evaluation
Word error rate and the McNemar test
Evaluation: Error Rate
• Standard metric: error rate
  • Difference between the predicted string and the expected string: minimum edit distance, for WER
• Sentence error rate
• Minimum edit distance computed by a free script available from the National Institute of Standards and Technology
• Confusion matrices
  • Useful for testing whether a change to the system helped
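A small sketch of WER as minimum edit distance (substitutions, insertions, deletions) divided by the reference length, reusing the pirate/pilot example from the opening slide:

```python
import numpy as np

def wer(reference, hypothesis):
    """Word error rate: minimum edit distance over the number of reference words."""
    R, H = len(reference), len(hypothesis)
    d = np.zeros((R + 1, H + 1), dtype=int)
    d[:, 0] = np.arange(R + 1)
    d[0, :] = np.arange(H + 1)
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + sub)  # substitution or match
    return d[R, H] / R

print(wer("i will become a pilot".split(), "i will become a pirate".split()))  # 0.2
```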
Evaluation: McNemar Test
• MAPSSWE or McNemar test
  • Looks at the differences between the number of word errors of the two systems
  • Averaged across a number of segments
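A simplified matched-pairs sketch of the idea with made-up per-segment error counts; the z-statistic on the mean per-segment difference is an approximation of the MAPSSWE procedure, not the NIST implementation:

```python
import numpy as np
from scipy import stats

# Toy per-segment word-error counts for two systems evaluated on the same segments.
errors_a = np.array([3, 1, 4, 0, 2, 5, 1, 2])
errors_b = np.array([2, 1, 3, 1, 1, 4, 0, 2])

diff = errors_a - errors_b                               # per-segment error difference
z = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))
p_value = 2 * stats.norm.sf(abs(z))                      # two-sided p-value
print(f"z = {z:.2f}, p = {p_value:.3f}")
```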
Applications of ASR
Principal commercialized applications