LING 439/539: Statistical Methods in Speech and Language Processing
Ying Lin, Department of Linguistics, University of Arizona
Welcome!
• Get the syllabus
• Fill out and return the information sheet
• Email: yinglin@email.arizona.edu
• Office: Douglass 224
• Office hours: MW 2:00-3:00 or by appointment (I am also teaching another undergraduate class)
• Course webpage: see the syllabus
• Listserv coming soon
438/538 and 439/539
• LING 438/538 (Computational Linguistics):
  • Symbolic representations (mostly syntax), e.g. FSA, CFG
  • Focus on logic
  • Simple probabilistic models, e.g. N-grams
438/538 and 439/539
• This class complements 438/538:
  • Numerical representations (speech signals): need digital signal processing
  • Focus on statistics/learning
  • More sophisticated probabilistic models, e.g. HMM, PCFG
Main reference texts (!)
• Huang, Acero and Hon (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice-Hall.
• Manning and Schütze (1999). Foundations of Statistical Natural Language Processing. MIT Press.
• Rabiner and Juang (1993). Fundamentals of Speech Recognition. Prentice-Hall.
• Duda, Hart and Stork (2001). Pattern Classification (2nd ed.). John Wiley & Sons.
• Rabiner and Schafer (1978). Digital Processing of Speech Signals. Prentice-Hall.
• Hastie, Tibshirani and Friedman (2001). The Elements of Statistical Learning. Springer.
Guidelines for course reading
• No single book covers all of our material
• Most books are written for an EE or CS audience only
• A few chapters are selected from each book (see the reading list); lecture notes will summarize the reading
• Expect a rough ride this first time through -- feedback is greatly appreciated!
Three skills for this class
• 1. Linguistics: understanding the sources of particular patterns
• 2. Math/statistics: the principles underlying the models
• 3. Programming: implementation
• This class emphasizes 2, because:
  • The models are built from simple structures
  • Programming skill comes only with much practice
What is the "statistical approach"?
• Narrow sense: work that uses statistical principles, i.e. is based on the probability calculus or other theories of inductive inference
  • Compare logic: deductive inference
• Broad sense: any work that uses a quantitative measure of success
• Relevant to both language engineering and linguistic science
Language engineering: speech recognition
• Tasks at increasing levels of difficulty (figure: word error rate by task)
A brief history of speech recognition
• 1950s: the U.S. government started funding research on automatic recognition of speech
• 1960s-70s: isolated words, digit strings
  • Debate: rules vs. statistics
  • Dynamic time warping
• 1980-now: continuous speech, speech understanding, spoken dialog
  • Hidden Markov models dominate
Why didn't the rules work?
• A completely bottom-up approach:
  • Rules are hand-coded by experts
• Problem: variability in speech
• Sophisticated symbolic rules are not flexible enough to handle continuous speech
(diagram: phonetic and phonological rules mapping "How are you?" to a phone string)
The rise of statistical methods in speech
• Initial solution: hire many linguists to keep improving the rule system
  • This turned out to be costly and slow, failing to meet the high expectations
• Advantages of statistical models:
  • Can be trained on different data: flexible, scalable
  • Computing power is much cheaper than expert labor
  • Drives the move to less and less constrained tasks
• Bitterness: "Every time I fire a linguist, the performance of the recognizer goes up" -- F. Jelinek (IBM)
The rise of statistics in NLP
• A very similar scenario played out in NLP:
  • E.g. tagging, parsing, machine translation
• "Old" NLP: deductive systems, hand-coded
• "New" NLP: broad-coverage, corpus-based, with an emphasis on training and evaluation
• Speech is now merging with NLP
  • Many tools originated in speech, then were copied to NLP
  • New tasks keep emerging: the web as an (unstructured) data source
Basic architecture of today's ASR system
• Audio speech → feature extraction → features X
• Acoustic models give the likelihoods p(X|M1), p(X|M2)
• The language model gives the priors p(M1), p(M2)
• Scoring ranks the hypotheses; the top-ranked one is the answer
• Model parameters are trained offline:
  M1 = "I recognize speech"
  M2 = "I wreck a nice beach"
  …
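As a sketch of the scoring step: the two candidate sentences come from the slide, while the log-likelihoods and log-priors below are invented for illustration.

```python
# Hypothetical scores for one utterance X, in the log domain to avoid underflow.
# The sentences come from the slide; the numeric values are invented.
log_likelihood = {                    # log p(X | M), from the acoustic model
    "I recognize speech":   -1200.0,
    "I wreck a nice beach": -1195.0,
}
log_prior = {                         # log p(M), from the language model
    "I recognize speech":    -8.0,
    "I wreck a nice beach": -20.0,
}

def score(m):
    # Bayes decision rule: maximize p(X|M) * p(M), i.e. the sum of logs.
    return log_likelihood[m] + log_prior[m]

answer = max(log_likelihood, key=score)
print(answer)   # "I recognize speech": the LM prior outweighs the acoustic edge
```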
Component 1: signal processing / feature extraction
• The first 1/3 of the course (also useful for understanding synthesis)
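To make the term concrete, here is a minimal sketch of the very first stage of feature extraction: framing, windowing, and a short-time magnitude spectrum. The sampling rate and frame/hop sizes are typical choices, not prescriptions.

```python
import numpy as np

def frame_spectra(signal, sr=16000, frame_ms=25, hop_ms=10):
    """Split a waveform into overlapping frames and return the magnitude
    spectrum of each frame -- the rows of a basic spectrogram."""
    frame_len = int(sr * frame_ms / 1000)    # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)            # 160 samples at 16 kHz
    window = np.hamming(frame_len)           # taper to reduce spectral leakage
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)))
    return np.array(spectra)

# Toy input: one second of a 440 Hz sine wave sampled at 16 kHz.
t = np.arange(16000) / 16000.0
spec = frame_spectra(np.sin(2 * np.pi * 440 * t))
print(spec.shape)    # (number of frames, frame_len // 2 + 1 frequency bins)
```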
Component 2: Acoustic models
• Mixture of Gaussians: p(o_t | q_i) = Σ_m c_im · N(o_t; μ_im, Σ_im)
• Dimension reduction: principal component analysis, linear discriminant analysis, parameter tying
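A numpy sketch of the mixture-of-Gaussians emission density above, assuming diagonal covariances; the weights, means, and variances are toy values.

```python
import numpy as np

def gmm_loglik(o, weights, means, variances):
    """log p(o | q) for one feature vector o under a diagonal-covariance
    mixture of Gaussians with given mixture weights, means, and variances."""
    o = np.asarray(o, dtype=float)
    # Log density of each component, summing over feature dimensions.
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances)
                               + (o - means) ** 2 / variances, axis=1))
    # log-sum-exp over components, for numerical stability.
    m = log_comp.max()
    return m + np.log(np.sum(np.exp(log_comp - m)))

# Two components in a 3-dimensional feature space; all parameters are toy values.
w   = np.array([0.4, 0.6])
mu  = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
var = np.ones((2, 3))
print(gmm_loglik([0.5, 0.5, 0.5], w, mu, var))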
Component 3: Pronunciation modeling
• Model for the different pronunciations of "you" in continuous speech (network: start → j → {ou, a} → end)
• Other types of units: triphones, syllables
• Each unit is an HMM
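A toy version of the "you" network, with invented transition and acoustic probabilities; for brevity it enumerates the two paths directly rather than running the forward algorithm.

```python
# A toy network for "you": the first arc emits /j/; the second emits either
# /ou/ (careful speech) or /a/ (reduced speech). All probabilities are invented.
trans = {"start": {"j": 1.0},
         "j":     {"ou": 0.7, "a": 0.3},
         "ou":    {"end": 1.0},
         "a":     {"end": 1.0}}
acoustic = {"j": 0.9, "ou": 0.2, "a": 0.6}   # per-phone scores p(o | phone)

def path_prob(path):
    """Joint probability of one path through the network and its observations."""
    p, prev = 1.0, "start"
    for phone in path:
        p *= trans[prev][phone] * acoustic[phone]
        prev = phone
    return p * trans[prev]["end"]

# Summing over both paths gives p(observations | word = "you").
paths = [["j", "ou"], ["j", "a"]]
print(sum(path_prob(p) for p in paths))      # 0.126 + 0.162 = 0.288
```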
Component 4: Language model
• Provides the prior probability p(M) of a word sequence, to combine with the acoustic model p(X|M)
• Common: N-grams with smoothing and backoff -- a hard and highly specialized business
• Parsing is just starting to be integrated
• Fundamental equation: M* = argmax_M p(M|X) = argmax_M p(X|M) p(M)
• Search: Viterbi, beam, A*, N-best
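A toy N-gram language model with the simplest smoothing scheme (add-one), standing in for the smoothing/backoff machinery real systems use; the corpus is invented.

```python
import math
from collections import Counter

# Toy training corpus; real language models are trained on vastly more text.
corpus = "i recognize speech . i wreck a nice beach .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)                       # vocabulary size

def p_bigram(w_prev, w):
    """p(w | w_prev) with add-one (Laplace) smoothing, so unseen
    bigrams still get nonzero probability."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

def logprob(words):
    return sum(math.log(p_bigram(a, b)) for a, b in zip(words, words[1:]))

print(logprob("i recognize speech".split()))
print(logprob("i wreck a nice beach".split()))
```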
ASR: an example of a generative model
• Components 2+3+4 together form a generative model:
  • The language model generates word sequences
  • Word sequences generate pronunciations
  • Pronunciations generate acoustic features
• Unsupervised learning/training:
  • Maximum likelihood estimation
  • The Expectation-Maximization algorithm (in its different incarnations)
• The main focus of this class
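A compact illustration of EM, here fitting a one-dimensional mixture of two Gaussians to unlabeled data; the HMM case uses the same E-step/M-step idea with more bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(0)
# Unlabeled data drawn from two Gaussians; EM has to recover them.
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 300)])

mu    = np.array([-1.0, 1.0])     # deliberately poor initial guesses
sigma = np.array([1.0, 1.0])
w     = np.array([0.5, 0.5])

for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    dens = (w / (sigma * np.sqrt(2 * np.pi))
            * np.exp(-0.5 * ((data[:, None] - mu) / sigma) ** 2))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters as responsibility-weighted averages.
    n = resp.sum(axis=0)
    mu = (resp * data[:, None]).sum(axis=0) / n
    sigma = np.sqrt((resp * (data[:, None] - mu) ** 2).sum(axis=0) / n)
    w = n / len(data)

print(mu)    # close to the true means 0 and 4
```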
Other models to look at
• Descriptive/maximum entropy models
  • Started in vision, then copied to speech, then NLP
• Discriminative models: directly use the data to construct classifiers, with weak assumptions about the probability distribution
  • Supervised learning, focused on the classification perspective
• "Machine learning approach to NLP": input string → (count) → feature vector → (classifier) → output labels
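A minimal sketch of the string → feature vector → classifier pipeline, using word counts as features and a perceptron as the classifier; the data and labels are invented.

```python
from collections import Counter

# Toy task: classify a string as English (0) or French (1).
# Training data and labels are invented for illustration.
train = [("the cat sat", 0), ("le chat noir", 1),
         ("a dog ran", 0), ("le chien court", 1)]

vocab = sorted({w for s, _ in train for w in s.split()})

def features(s):
    counts = Counter(s.split())
    return [counts[w] for w in vocab]          # word-count feature vector

weights = [0.0] * len(vocab)
for _ in range(10):                            # perceptron training epochs
    for s, y in train:
        x = features(s)
        pred = 1 if sum(wi * xi for wi, xi in zip(weights, x)) > 0 else 0
        if pred != y:                          # update weights on mistakes only
            sign = 1 if y == 1 else -1
            weights = [wi + sign * xi for wi, xi in zip(weights, x)]

x = features("le chat court")
print(1 if sum(wi * xi for wi, xi in zip(weights, x)) > 0 else 0)   # 1 (French)
```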
Problem solved?
• No: improvements are mostly due to larger training sets and faster machines
• Driven by Moore's law?
Challenges
• Environment distortion (microphone, noise, the cocktail party effect) breaks feature extraction
  • Acoustic condition mismatch
• Between- and within-speaker variability breaks pronunciation and acoustic modeling
• Conversational speech breaks the language model
• Understanding these problems is crucial for improving ASR performance
Dreaming
• "2001: A Space Odyssey" (1968)
  Dave: "Open the pod bay doors, HAL."
  HAL 9000: "I'm sorry, Dave. I'm afraid I can't do that."
The reality, before the problem is solved
• Speech is used as a user interface only when people can't use their hands:
  • Driving a car (use speech to drive?)
  • Devices too small (cellphones)
  • Customer service (who will tolerate touch-tone menus?)
  • Dictation (how many people actually use it?)
For next time
• We will start with signal processing
• It uses engineering math: power series (including convergence), trigonometric functions, integration, and the representation of complex numbers
• If you have forgotten or never learned this material, please find references and review it before class
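As a quick self-check, here are two identities of the kind the signal-processing unit will lean on, written with the engineering convention j for the imaginary unit.

```latex
% Euler's formula, which links complex exponentials to trigonometric functions,
% and the geometric series, which converges for |r| < 1:
\[
  e^{j\omega} = \cos\omega + j\sin\omega ,
  \qquad
  \sum_{n=0}^{\infty} r^{n} = \frac{1}{1-r} \quad (|r| < 1)
\]
```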