The Acoustic/Lexical model: Exploring the phonetic units; Triphones/Senones in action.
Ofer M. Shir
Speech Recognition Seminar, 15/10/2003
Leiden Institute of Advanced Computer Science
Theoretical Background - Unit Selection
When selecting the basic unit of acoustic information, we want it to be accurate, trainable and generalizable.
• Words are good units for small-vocabulary SR, but a poor choice for large-vocabulary continuous SR:
  • Each word is treated individually, with no data sharing, which implies a large amount of training data and storage.
  • The recognition vocabulary may contain words that never appeared in the training data.
  • Interword coarticulation effects are expensive to model.
Theoretical Background - Phonemes
The alternative unit is the phoneme. Phonemes are more trainable (there are only about 50 phonemes in English, for example) and generalizable (vocabulary independent).
However, a word is not simply a sequence of independent phonemes! Our articulators move continuously from one position to another. The realization of a particular phoneme is affected by its phonetic neighbourhood, as well as by local stress effects and similar factors.
Different realizations of a phoneme are called allophones.
Theoretical Background - Triphones
The triphone model is a phonetic model which takes into consideration both the left and the right neighbouring phonemes. Triphones are an example of allophones.
This model captures the most important coarticulatory effects, which makes it a very powerful model.
The cost: context-dependent models generally increase the number of parameters, so training becomes much harder. Notice that in English there are more than 100,000 triphones!
So far, however, we have assumed that every triphone context is different. This motivates us to find instances of similar contexts and merge them.
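As a rough illustration of the triphone idea, the sketch below expands a phoneme sequence into triphones and computes the upper bound on distinct triphones for a 50-phoneme inventory. The `left-base+right` notation, the `SIL` boundary symbol and the function name are assumptions made for this example; they do not come from the slides.

```python
def to_triphones(phones, boundary="SIL"):
    """Expand a phoneme sequence into left-base+right triphone labels.

    `boundary` is a hypothetical symbol standing in for the word edge.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# The word ONE is pronounced W AH N.
one_triphones = to_triphones(["W", "AH", "N"])
print(one_triphones)

# With about 50 phonemes, every (left, base, right) combination gives
# an upper bound on the number of distinct triphones.
NUM_PHONEMES = 50
max_triphones = NUM_PHONEMES ** 3
print(max_triphones)
```

Not every combination occurs in real speech, so the bound of 125,000 is only meant to show the order of magnitude behind the "more than 100,000" figure.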
Theoretical Background - Senones
Recall that each allophone model is an HMM, made of states, transitions and probability distributions; the bottom line is that some distributions can be tied.
The basic idea is clustering, but rather than clustering the HMM models themselves, we cluster only the HMM states.
Each cluster represents a set of similar Markov states, and is called a senone.
Senones provide not only improved recognition accuracy, but also a pronunciation-optimization capability.
Theoretical Background - Senonic Trees
Reminder: a decision tree is a binary tree which classifies target objects by asking Yes/No questions in a hierarchical manner.
The senonic decision tree classifies the Markov states of the triphones represented in the training data by asking linguistic questions.
=> The leaves of the senonic trees are the possible senones.
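The classification step can be sketched as a walk down a small binary tree. Everything concrete here is an assumption for illustration: the phone classes, the two questions, the senone labels, and the choice to model one tree for one state of one base phone. A real tree would be grown from training data.

```python
# Hypothetical phone classes used by the linguistic questions.
NASALS = {"M", "N", "NG"}
VOWELS = {"AA", "AH", "IY", "UW", "IX", "AX"}

# Internal node: (question, yes_subtree, no_subtree). Leaf: a senone label.
TREE = (
    lambda left, right: left in NASALS,       # "Is the left phone a nasal?"
    "senone_0",
    (
        lambda left, right: right in VOWELS,  # "Is the right phone a vowel?"
        "senone_1",
        "senone_2",
    ),
)


def classify(tree, left, right):
    """Ask Yes/No questions about the context until a leaf is reached."""
    while not isinstance(tree, str):
        question, yes_branch, no_branch = tree
        tree = yes_branch if question(left, right) else no_branch
    return tree


print(classify(TREE, "N", "T"))   # left context is a nasal
print(classify(TREE, "S", "AA"))  # right context is a vowel
print(classify(TREE, "S", "T"))   # neither question answers Yes
```

Because the questions refer to classes of phones and not to specific triphones, a context that never appeared in training still ends up in some leaf.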
Sphinx III, A Short Review - Front End Feature Extraction
[Figure: a 7-frame speech window yields, for the current frame, a 39-element feature vector: cepstrum (12 elements), time-derivative cepstrum (12 elements), second time-derivative cepstrum (12 elements) and power (3 elements). The Gaussian mixtures (mean, variance, determinant) turn it into senone data (the scoring table).]
Feature vectors and their analysis are the inputs to the Gaussian mixture fitting process.
Phonetic data (senones!) is then fetched from these Gaussian mixtures using the well-trained machine.
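The two stages of the front end can be sketched as follows: concatenating the four streams into a 39-element vector, then scoring that vector against a Gaussian mixture. The function names are hypothetical, and the diagonal-covariance form is an assumption of this sketch; it is not taken from the Sphinx III sources.

```python
import math


def feature_vector(cepstrum, delta, delta2, power):
    """Concatenate the four streams into one 39-element vector."""
    if not (len(cepstrum) == len(delta) == len(delta2) == 12 and len(power) == 3):
        raise ValueError("expected 12 + 12 + 12 + 3 elements")
    return list(cepstrum) + list(delta) + list(delta2) + list(power)


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian (assumed form)."""
    log_det = sum(math.log(v) for v in var)  # log-determinant term
    dist = sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var))
    return -0.5 * (len(x) * math.log(2 * math.pi) + log_det + dist)


def senone_score(x, weights, means, variances):
    """Log-likelihood of x under a mixture, via log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gaussian(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


x = feature_vector([0.0] * 12, [0.0] * 12, [0.0] * 12, [0.0] * 3)
score = senone_score(x, [1.0], [[0.0] * 39], [[1.0] * 39])
print(len(x), score)
```

Evaluating one such score per senone for each frame would fill one row of the scoring table mentioned above.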
Sphinx III - the implementation
Handling a single word: each HMM is evaluated against the input using the Viterbi Search. Every phone of the word gets a 5-state HMM, whose states map to senones:
[Figure: three word models built from 5-state HMMs: ONE = W AH N, TWO = T UW, THREE = TH R IY.]
The Viterbi Search - basics
• Instantaneous score: how well a given HMM state matches the feature vector.
• Path: a sequence of HMM states traversed during a given segment of feature vectors.
• Path-score: the product of the instantaneous scores and state transition probabilities corresponding to a given path.
• The Viterbi search: an efficient lattice structure and algorithm for computing the best path-score for a given segment of feature vectors.
The Viterbi Search - demo
[Figure: lattice of HMM states against time; the initial state is initialized with path-score = 1.0.]
The Viterbi Search (demo-contd.)
$$P_j(t) = \max_i \left[ P_i(t-1)\, a_{ij}\, b_j(t) \right]$$
where $P_j(t)$ is the total path-score ending up at state $j$ at time $t$, $a_{ij}$ is the state transition probability from $i$ to $j$, and $b_j(t)$ is the score for state $j$ given the input at time $t$.
[Figure: state lattice against time. Legend: state with best path-score; state with path-score < best; state without a valid path-score.]
The Viterbi Search (demo-contd.)
[Figure: the state lattice, continued along the time axis.]
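The recurrence for the best path-score can be sketched in a few lines. The function name, the argument layout and the toy 3-state left-to-right model are assumptions for illustration, as is the convention of applying the instantaneous score already at the first frame.

```python
def viterbi(init, trans, emit):
    """Best path-score and state sequence for a segment of frames.

    init[j]     : initial path-score of state j (1.0 for the entry state)
    trans[i][j] : a_ij, transition probability from state i to state j
    emit[t][j]  : b_j(t), instantaneous score of state j at frame t
    """
    n = len(init)
    score = [init[j] * emit[0][j] for j in range(n)]
    back = []  # back[t-1][j] = best predecessor of state j at frame t
    for t in range(1, len(emit)):
        prev, score, pointers = score, [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] * trans[i][j])
            score.append(prev[best_i] * trans[best_i][j] * emit[t][j])
            pointers.append(best_i)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda j: score[j])
    best = score[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return best, path[::-1]


# Toy left-to-right model: each state either loops or moves one step right.
init = [1.0, 0.0, 0.0]
trans = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
emit = [[0.9, 0.1, 0.1], [0.2, 0.8, 0.1], [0.1, 0.2, 0.9]]
best, path = viterbi(init, trans, emit)
print(best, path)
```

Products of many probabilities underflow quickly, so practical decoders typically work with log scores and replace the products by sums; the sketch keeps plain products to stay close to the formula.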
Continuous Speech Recognition
Add transitions from word ends to word beginnings, and run the Viterbi Search.
[Figure: word HMMs ONE = W AH N, TWO = T UW, THREE = TH R IY, with transitions from word ends back to word beginnings.]
Cross-Word Triphone Modeling
Sphinx III uses “triphone” or “phoneme-in-context” HMMs; remember to inject the left context into the entry state.
[Figure: the word ONE (W AH N). Labels: context-dependent AH HMM; separate N HMM instances for each possible right context; the inherited left context is propagated along with the path-scores and dynamically modifies the state model.]
Sphinx-III - Lexical Tree Structure
Nodes are shared if the triphone Senone-Sequence-ID (SSID) is identical:
START     S-T-AA-R-TD
STARTING  S-T-AA-R-DX-IX-NG
STARTED   S-T-AA-R-DX-IX-DD
STARTUP   S-T-AA-R-T-AX-PD
START-UP  S-T-AA-R-T-AX-PD
[Figure: lexical tree rooted at S-T-AA-R, branching to TD (start), DX-IX-NG (starting), DX-IX-DD (started) and T-AX-PD (startup, start-up).]
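The sharing in the lexical tree can be approximated by a plain prefix tree over the pronunciations. This is a simplification and an assumption of the sketch: nodes here are shared whenever the phone prefix is identical, whereas Sphinx-III shares them by SSID, which also depends on the triphone context. The dictionary layout and function names are hypothetical.

```python
LEXICON = {
    "START":    ["S", "T", "AA", "R", "TD"],
    "STARTING": ["S", "T", "AA", "R", "DX", "IX", "NG"],
    "STARTED":  ["S", "T", "AA", "R", "DX", "IX", "DD"],
    "STARTUP":  ["S", "T", "AA", "R", "T", "AX", "PD"],
    "START-UP": ["S", "T", "AA", "R", "T", "AX", "PD"],
}


def build_lextree(lexicon):
    """Build a prefix tree; each node keeps the words that end at it."""
    root = {"children": {}, "words": []}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node["children"].setdefault(
                phone, {"children": {}, "words": []}
            )
        node["words"].append(word)
    return root


def count_nodes(node):
    """Number of phone nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node["children"].values())


tree = build_lextree(LEXICON)
flat_nodes = sum(len(phones) for phones in LEXICON.values())
tree_nodes = count_nodes(tree)
print(flat_nodes, tree_nodes)
```

For these five words the flat lexicon needs 33 phone models while the prefix tree needs 12, which is the saving the tree structure is after.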
Cross-Word Triphones (left context)
Root nodes are replicated for each left context. Nodes are shared if their SSIDs are identical.
[Figure: the lexical tree for start, starting, started, startup and start-up, with separate S-models for different left contexts at the root, leading to the rest of the lextree.]
Cross-Word Triphones (right context)
[Figure: the HMM states of the triphones for all right contexts are picked and combined into a composite SSID model. Leaf nodes use composite states, each the average of its component states.]
Sphinx III, the Acoustic Model - File List Summary
• mdef.c: definition of the basic phone and triphone HMMs, and the mapping of each HMM state to a senone and to its transition matrix.
• dict.c: pronunciation dictionary structure.
• hmm.c: HMM evaluation using the Viterbi Search, which means fetching the best senone score. Note that the HMM data structures, defined in hmm.h, are hardwired to 2 possible HMM topologies: 3-state and 5-state left-to-right HMMs.
• lextree.c: lexical tree search.
Presentation Resources:
• Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Raj Reddy: Spoken Language Processing: A Guide to Theory, Algorithm and System Development (hardcover, 980 pages; Prentice Hall PTR; ISBN 0130226165; 1st edition, April 25, 2001). Chapters 9, 13.
• Hwang, M., Huang, X., Alleva, F.: “Predicting Unseen Triphones with Senones”, 1993.
• Hwang et al.: “Shared Distribution Hidden Markov Models for Speech Recognition”, 1993.
• Hwang et al.: “Subphonetic Modeling with Markov States - Senones”, 1992.
• Sphinx-III documentation: a presentation made by Mosur Ravishankar; found in the /doc/ folder of the Sphinx-III package.
• “Sphinx-III bible”: a presentation made by Edward Lin; http://www.ece.cmu.edu/~ffang/sis/documents/S3Bible.ppt
“I shall never believe that God plays dice with the world, but maybe machines should play dice with human capabilities…” John Doe