
The Acoustic/Lexical model: Exploring the phonetic units; Triphones/Senones in action. Ofer M. Shir

Speech Recognition Seminar, 15/10/2003, Leiden Institute of Advanced Computer Science.


Presentation Transcript


  1. The Acoustic/Lexical model: Exploring the phonetic units; Triphones/Senones in action. Ofer M. Shir. Speech Recognition Seminar, 15/10/2003, Leiden Institute of Advanced Computer Science

  2. Theoretical Background – Unit Selection
     When selecting the basic unit of acoustic information, we want it to be accurate, trainable and generalizable. Words are good units for small-vocabulary SR, but not a good choice for large-vocabulary continuous SR:
     • Each word is treated individually, with no data sharing, which implies a large amount of training data and storage.
     • The recognition vocabulary may contain words that never appeared in the training data.
     • Modeling inter-word coarticulation effects is expensive.

  3. Theoretical Background - Phonemes
     The alternative unit is the phoneme. Phonemes are more trainable (there are only about 50 phonemes in English, for example) and generalizable (vocabulary independent). However, a word is not a sequence of independent phonemes! Our articulators move continuously from one position to another, so the realization of a particular phoneme is affected by its phonetic neighbourhood, as well as by local stress effects etc. Different realizations of a phoneme are called allophones.

  4. Theoretical Background - Triphones
     The triphone model is a phonetic model which takes into consideration both the left and the right neighbouring phonemes. Triphones are an example of allophones. This model captures the most important coarticulatory effects, which makes it a very powerful model. The cost: context-dependent models generally increase the number of parameters, so training becomes much harder. Notice that in English there are more than 100,000 triphones! So far, however, we have assumed that every triphone context is different. We are motivated to find instances of similar contexts and merge them.
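
To make the triphone idea of slide 4 concrete, here is a minimal sketch that expands a phone sequence into (left, centre, right) triphones and computes the upper bound on the number of distinct triphones. The function names and the `SIL` boundary filler are assumptions made for illustration; they are not taken from the slides or from Sphinx.

```python
def to_triphones(phones, boundary="SIL"):
    """Expand a phone sequence into (left, centre, right) triphones.

    `boundary` is a hypothetical filler used as context at the word edges.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


def count_possible_triphones(num_phonemes):
    """Upper bound: every phoneme in every left and every right context."""
    return num_phonemes ** 3


one = to_triphones(["W", "AH", "N"])
print(one)                           # three triphones for the word ONE
print(count_possible_triphones(50))  # 125000
```

With about 50 phonemes the bound is 50^3 = 125,000, in line with the "more than 100,000" figure on the slide; many of these contexts never occur in the training data, which is the motivation for merging similar contexts.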

  5. Theoretical Background - Senones
     Recall that each allophone model is an HMM, made of states, transitions and probability distributions; the bottom line is that some distributions can be tied. The basic idea is clustering, but rather than clustering the HMMs themselves, we cluster only the HMM states. Each cluster represents a set of similar Markov states and is called a senone. Senones provide not only improved recognition accuracy, but also a pronunciation-optimization capability.

  6. Theoretical Background – Senonic Trees
     Reminder: a decision tree is a binary tree which classifies target objects by asking Yes/No questions in a hierarchical manner. The senonic decision tree classifies the Markov states of the triphones represented in the training data by asking linguistic questions. => The leaves of the senonic trees are the possible senones.
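
The senonic tree of slide 6 can be pictured with a small sketch: a binary tree whose internal nodes ask yes/no questions about the left and right context and whose leaves carry senone IDs. The phone classes, the questions and the senone numbers below are invented for illustration; a real tree is grown from training data.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed phone classes, for illustration only.
NASALS = {"M", "N", "NG"}
VOWELS = {"AA", "AH", "IY", "UW", "IX", "AX"}


@dataclass
class Node:
    question: Optional[Callable[[str, str], bool]] = None  # None marks a leaf
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    senone: Optional[int] = None  # set on leaves only


def classify(node, left, right):
    """Ask yes/no questions about the context until a leaf is reached."""
    while node.question is not None:
        node = node.yes if node.question(left, right) else node.no
    return node.senone


# Hypothetical tree for one Markov state of the phone AH.
tree = Node(
    question=lambda left, right: left in NASALS,
    yes=Node(senone=0),
    no=Node(
        question=lambda left, right: right in VOWELS,
        yes=Node(senone=1),
        no=Node(senone=2),
    ),
)

print(classify(tree, "N", "T"))  # 0
print(classify(tree, "W", "N"))  # 2
```

Because the questions concern phone classes rather than individual triphones, a context that never appeared in training still reaches some leaf and is assigned a senone.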

  7. Sphinx III, A Short Review – Front End Feature Extraction
     [Diagram: a 7-frame speech window yields, for the current frame, a 39-element feature vector: cepstrum (12 elements), first time-derivative of the cepstrum (12 elements), second time-derivative of the cepstrum (12 elements) and power (3 elements). The Gaussian mixtures are described by mean, variance and determinant. The output is the senone data (scoring table).]
     Feature vectors and their analysis are the inputs to the Gaussian mixture fitting process. Phonetic data (senones!) is then fetched from these Gaussian mixtures, using the well-trained machine.
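
As a rough illustration of how the 39 elements of slide 7 add up (12 + 12 + 12 + 3), the sketch below assembles one feature vector from a 7-frame window. It assumes each input frame holds 13 values (a power term c0 followed by 12 cepstral coefficients) and uses simple frame differences for the derivatives. Both the difference formulas and the element ordering are assumptions; the slides do not specify them.

```python
def assemble_feature(cep, t):
    """Build a 39-element vector for frame t from frames t-3 .. t+3.

    cep[u] is assumed to be [c0, c1, ..., c12], with c0 the power term.
    """
    def delta(u):
        # Assumed first derivative: difference two frames ahead and behind.
        return [a - b for a, b in zip(cep[u + 2], cep[u - 2])]

    d = delta(t)
    # Assumed second derivative: difference of neighbouring first derivatives.
    dd = [a - b for a, b in zip(delta(t + 1), delta(t - 1))]

    cepstrum = cep[t][1:]             # 12 elements
    d_cepstrum = d[1:]                # 12 elements
    dd_cepstrum = dd[1:]              # 12 elements
    power = [cep[t][0], d[0], dd[0]]  # 3 elements
    return cepstrum + d_cepstrum + dd_cepstrum + power


# Synthetic 7-frame window: frame f holds the values f, f+1, ..., f+12.
frames = [[float(f + k) for k in range(13)] for f in range(7)]
vector = assemble_feature(frames, 3)
print(len(vector))  # 39
```

The window must span three frames on each side of the current frame, which matches the 7-frame speech window in the diagram.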

  8. Sphinx III – the implementation
     Handling a single word: each HMM is evaluated against the input using the Viterbi Search. Every senone gets an HMM:
     [Diagram: word models ONE (W AH N), TWO (T UW) and THREE (TH R IY), built from 5-state HMMs.]

  9. The Viterbi Search - basics
     • Instantaneous score: how well a given HMM state matches the feature vector.
     • Path: a sequence of HMM states traversed during a given segment of feature vectors.
     • Path-score: the product of the instantaneous scores and state transition probabilities corresponding to a given path.
     • The Viterbi search: an efficient lattice structure and algorithm for computing the best path-score for a given segment of feature vectors.
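
One common way to obtain an instantaneous score is the log-likelihood of the feature vector under the Gaussian mixture attached to a state. The sketch below assumes diagonal covariances and illustrative function names; it is not taken from the Sphinx III sources.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    log_det = sum(math.log(v) for v in var)  # log-determinant of the covariance
    dist = sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var))
    return -0.5 * (len(x) * math.log(2 * math.pi) + log_det + dist)


def senone_score(x, weights, means, variances):
    """Log-likelihood of x under a Gaussian mixture (via log-sum-exp)."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(term - peak) for term in terms))


# One-dimensional toy mixture with two equally weighted components.
score = senone_score([0.0], [0.5, 0.5], [[0.0], [4.0]], [[1.0], [1.0]])
print(score)
```

Each component is fully described by its mean and variance; the log-determinant term depends only on the variances, so it can be computed once and stored with the model.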

  10. The Viterbi Search - demo
      [Diagram: lattice of HMM states over time. The initial state is initialized with path-score = 1.0.]

  11. The Viterbi Search (demo-contd.)
      [Diagram: lattice of HMM states over time. Legend: state with best path-score; state with path-score < best; state without a valid path-score.]
      $P_j(t) = \max_i \left[ P_i(t-1)\, a_{ij}\, b_j(t) \right]$
      where $P_j(t)$ is the total path-score ending up at state j at time t, $a_{ij}$ is the state transition probability from i to j, and $b_j(t)$ is the score for state j given the input at time t.

  12. The Viterbi Search (demo-contd.)
      [Diagram: lattice of HMM states over time.]
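
The recursion of slide 11 (each state keeps the best incoming path-score times the transition probability times its own score) can be written as a short sketch. The 3-state left-to-right model and all the numbers below are made up for illustration.

```python
def viterbi(init, trans, emit):
    """Best path through an HMM.

    init[j]    : initial path-score of state j
    trans[i][j]: transition probability from state i to state j
    emit[t][j] : instantaneous score of state j at time t
    Returns (best final path-score, best state sequence).
    """
    n = len(init)
    score = [init[j] * emit[0][j] for j in range(n)]
    back = []  # back[t-1][j] = best predecessor of state j at time t
    for t in range(1, len(emit)):
        prev, score, ptr = score, [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] * trans[i][j])
            score.append(prev[best_i] * trans[best_i][j] * emit[t][j])
            ptr.append(best_i)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n), key=lambda j: score[j])
    best = score[state]
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return best, path[::-1]


init = [1.0, 0.0, 0.0]  # only the initial state has a valid path-score
trans = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
emit = [[0.9, 0.1, 0.1], [0.2, 0.8, 0.1], [0.1, 0.3, 0.9]]
best, path = viterbi(init, trans, emit)
print(path)  # [0, 1, 2]
```

Products of many probabilities underflow quickly, so practical decoders usually work with log scores and replace the products by sums.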

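  13. Continuous Speech Recognition
      Add transitions from word ends to word beginnings, and run the Viterbi Search.
      [Diagram: the word models ONE (W AH N), TWO (T UW) and THREE (TH R IY), with transitions from word ends back to word beginnings.]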
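
A minimal sketch of the network implied by slide 13: the word models are laid side by side, and an arc is added from the end of every word to the beginning of every word. For brevity each phone is reduced to a single state instead of a multi-state HMM, and the function name is hypothetical.

```python
def build_word_loop(lexicon):
    """Return (states, arcs) for a flat word-loop network.

    states: list of (word, phone) pairs, one simplified state per phone
    arcs  : set of (from_state, to_state) index pairs
    """
    states, arcs, first, last = [], set(), {}, {}
    for word, phones in lexicon.items():
        ids = []
        for phone in phones:
            ids.append(len(states))
            states.append((word, phone))
        first[word], last[word] = ids[0], ids[-1]
        for a in ids:
            arcs.add((a, a))        # self-loop: stay in the same phone
        for a, b in zip(ids, ids[1:]):
            arcs.add((a, b))        # advance within the word
    for end in last.values():
        for start in first.values():
            arcs.add((end, start))  # word end -> any word beginning
    return states, arcs


lexicon = {"ONE": ["W", "AH", "N"], "TWO": ["T", "UW"], "THREE": ["TH", "R", "IY"]}
states, arcs = build_word_loop(lexicon)
print(len(states), len(arcs))  # 8 22
```

Running the Viterbi search over such a network lets the best path pass through any sequence of words, which is what turns isolated-word decoding into continuous recognition.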

  14. Cross-Word Triphone Modeling
      Sphinx III uses “triphone” or “phoneme-in-context” HMMs; remember to inject the left context into the entry state.
      [Diagram: the word ONE (W AH N). Annotations: context-dependent AH HMM; separate N HMM instances for each possible right context; the inherited left context is propagated along with the path-scores and dynamically modifies the state model.]

  15. Sphinx-III - Lexical Tree Structure
      Nodes are shared if the triphone Senone-Sequence-ID (SSID) is identical:
      START      S-T-AA-R-TD
      STARTING   S-T-AA-R-DX-IX-NG
      STARTED    S-T-AA-R-DX-IX-DD
      STARTUP    S-T-AA-R-T-AX-PD
      START-UP   S-T-AA-R-T-AX-PD
      [Diagram: lexical tree. The nodes S, T, AA, R are shared; R branches to TD (start), to DX and IX followed by NG (starting) or DD (started), and to T and AX followed by PD (startup) or PD (start-up).]
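
The sharing shown on slide 15 can be sketched as a prefix tree over the pronunciations. Two simplifications are assumed here: nodes are shared when their phone labels match (standing in for identical SSIDs), and the last phone of each word gets its own leaf, as the two separate PD leaves for STARTUP and START-UP in the diagram suggest. The dictionary-based node layout is purely illustrative.

```python
def build_lextree(lexicon):
    """Prefix tree over pronunciations with word-specific leaves."""
    root = {"children": {}, "leaves": []}
    for word, phones in lexicon.items():
        node = root
        for phone in phones[:-1]:
            # Reuse the child if this phone prefix was already seen.
            node = node["children"].setdefault(
                phone, {"children": {}, "leaves": []}
            )
        node["leaves"].append((phones[-1], word))
    return root


def count_nodes(node):
    """Number of phone nodes (internal nodes plus leaves) below `node`."""
    return len(node["leaves"]) + sum(
        1 + count_nodes(child) for child in node["children"].values()
    )


lexicon = {
    "START": ["S", "T", "AA", "R", "TD"],
    "STARTING": ["S", "T", "AA", "R", "DX", "IX", "NG"],
    "STARTED": ["S", "T", "AA", "R", "DX", "IX", "DD"],
    "STARTUP": ["S", "T", "AA", "R", "T", "AX", "PD"],
    "START-UP": ["S", "T", "AA", "R", "T", "AX", "PD"],
}
tree = build_lextree(lexicon)
print(count_nodes(tree))                        # 13
print(sum(len(p) for p in lexicon.values()))    # 33
```

The tree needs 13 phone nodes where a flat lexicon needs 33, so the search evaluates far fewer HMMs for the shared word beginnings.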

  16. Cross-Word Triphones (left context)
      Root nodes are replicated for each left context. Nodes are shared if their SSIDs are identical.
      [Diagram: the lexical tree of slide 15 with several S-models at the root, one per left context, all leading to the rest of the lextree.]

  17. Cross-Word Triphones (right context)
      [Diagram: the HMM states of the triphones for all right contexts are picked and combined into a composite SSID model. Leaf nodes use composite states, each the average of its component states.]

  18. Sphinx III, the Acoustic Model – File List Summary
      • mdef.c: definition of the basic phone and triphone HMMs, and the mapping of each HMM state to a senone and to its transition matrix.
      • dict.c: pronunciation dictionary structure.
      • hmm.c: HMM evaluation using the Viterbi Search, which means fetching the best senone score. Note that the HMM data structures, defined in hmm.h, are hardwired to 2 possible HMM topologies: 3-state and 5-state left-to-right HMMs.
      • lextree.c: lexical tree search.

  19. Presentation Resources:
      • Spoken Language Processing: A Guide to Theory, Algorithm and System Development, by Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Raj Reddy (hardcover, 980 pages; Prentice Hall PTR; ISBN 0130226165; 1st edition, April 25, 2001). Chapters 9, 13.
      • Hwang, M., Huang, X., Alleva, F.: “Predicting Unseen Triphones with Senones”, 1993.
      • Hwang et al.: “Shared Distribution Hidden Markov Models for Speech Recognition”, 1993.
      • Hwang et al.: “Subphonetic Modeling with Markov States – Senones”, 1992.
      • Sphinx-III documentation: a presentation by Mosur Ravishankar, found in the /doc/ folder of the Sphinx-III package.
      • “Sphinx-III bible”: a presentation by Edward Lin; http://www.ece.cmu.edu/~ffang/sis/documents/S3Bible.ppt

  20. “I shall never believe that God plays dice with the world, but maybe machines should play dice with human capabilities…” John Doe
