Using SONIC to build a speech recognizer
Pellom & Hacioglu, "Sonic: The University of Colorado Continuous Speech Recognizer," Center for Spoken Language Research Technical Report TR-CSLR-2001-01, U. Colorado, 2003.
Presented by Yang Shao, CIS788K04 Wi04
Performance on standard tasks
• On a 1.7GHz Pentium 4
Procedures
• Preparation
  • identify the goal;
  • decide the recognition unit: phoneme, syllable, word, etc.;
  • prepare the corpus: training, development, and testing sets;
  • label part of the training data (optional);
  • etc.
Procedures cont.
• Training
  • Acoustic model training;
  • Language model training;
• Adaptation
  • Speaker adaptation (VTLN, MLLR, MAP);
  • Environment adaptation (mismatch between training and testing conditions);
• Testing
Acoustic model training
• Feature extraction, followed by iterative steps of Viterbi state-based alignment and model estimation;
• Outputs a set of decision-tree state-clustered HMMs;
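The alignment half of the loop above can be illustrated with a minimal Viterbi forced-alignment sketch. This is not Sonic's trainer code: it assumes a strict left-to-right state sequence, ignores transition probabilities, and takes a precomputed (hypothetical) table `loglik[t][s]` of per-frame log-likelihoods for each state.

```python
def viterbi_align(loglik):
    """Assign each frame to one state of a left-to-right HMM.

    loglik[t][s] is the log-likelihood of frame t under state s.
    The path must start in state 0, end in the last state, and may
    only stay in a state or advance to the next one (assumption).
    Requires at least as many frames as states.
    """
    T, S = len(loglik), len(loglik[0])
    NEG = float("-inf")
    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = loglik[0][0]

    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else NEG
            if move > stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            score[t][s] = best + loglik[t][s]

    # Trace back from the final state at the last frame.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path
```

In a training loop, the frames assigned to each state would then be used to re-estimate that state's parameters before aligning again.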
Feature extraction (PMVDR)
• Perceptual Minimum Variance Distortionless Response cepstral coefficients;
• fea [options] speechfile.raw featurefile.fea
• Dynamic features;
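Dynamic features are usually time derivatives of the static coefficients. The sketch below uses the common regression formula over a window of ±2 frames; the slides do not say which formula or window the `fea` tool uses, so both are assumptions made for illustration.

```python
def delta(frames, window=2):
    """Regression-based delta features.

    frames is a list of feature vectors (one list of floats per frame).
    Frames beyond either end are replaced by the nearest edge frame.
    """
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(t + k, n - 1)][d]
                prv = frames[max(t - k, 0)][d]
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out
```

Applying `delta` to its own output gives second-order (acceleration) features, which are often appended to the static and delta coefficients.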
Language model I
• Finite-state grammar expressed as a regular expression;
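Language model II
• Language model:
  • P(W) = P(w_1, w_2, …, w_m) gives the probability of a given word sequence;
  • expanded by the chain rule as P(W) = P(w_1) P(w_2|w_1) … P(w_m|w_1, …, w_{m-1});
  • N-gram approximation: P(w_i|w_1, …, w_{i-1}) ≈ P(w_i|w_{i-N+1}, …, w_{i-1});
  • calculated from counts, e.g. for a bigram: P(w_i|w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1});
• Bigram example: P(Mary loves that person) = P(Mary|<s>) P(loves|Mary) P(that|loves) P(person|that)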
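A minimal sketch of the bigram case, estimating probabilities by relative counts with no smoothing. The three-sentence corpus in the usage note is made up for illustration, and, like the slide's example, the sketch uses a start symbol `<s>` but no end-of-sentence symbol.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams over whitespace-tokenized sentences."""
    history = Counter()   # how often each word occurs as a bigram history
    bigrams = Counter()   # how often each (previous, current) pair occurs
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        for prev, cur in zip(tokens, tokens[1:]):
            history[prev] += 1
            bigrams[(prev, cur)] += 1
    return history, bigrams


def sentence_prob(sentence, history, bigrams):
    """Product of maximum-likelihood bigram probabilities."""
    tokens = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if history[prev] == 0:
            return 0.0  # unseen history: unsmoothed estimate is zero
        p *= bigrams[(prev, cur)] / history[prev]
    return p
```

For the toy corpus `["Mary loves that person", "Mary loves that dog", "John loves Mary"]`, `sentence_prob("Mary loves that person", ...)` multiplies 2/3 · 1 · 2/3 · 1/2. A practical model would add smoothing or back-off so that unseen word pairs do not get zero probability.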
Recognition overview
• Speech-enabled applications can be built by calling functions within the Sonic API.
• Sonic_batch -c config.txt [-l]
Configuration file
• A text file listing parameters, each followed by its arguments, that establish the basic settings of the recognizer:
  • location of the acoustic model files;
  • location of the language model file;
  • location of the pronunciation lexicon;
  • recognizer settings such as search beams, pruning settings, etc.;
  • (optional) a pointer to a control file containing a list of audio files to process.
Components
• Audio file format:
  • 16-bit linear PCM format (raw);
  • sampling rate is configurable (8 kHz by default);
• Phoneme configuration file format
  • supports the 55-phoneme symbol set adopted by the CMU Sphinx-II speech recognizer.
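A headerless 16-bit linear PCM file can be read with the Python standard library alone, as in the sketch below. The slides do not state the byte order, so little-endian is an assumption here (exposed as the `byte_order` parameter), and the function name is hypothetical.

```python
import struct


def read_raw_pcm(path, sample_rate=8000, byte_order="<"):
    """Read a headerless 16-bit linear PCM file.

    byte_order is "<" for little-endian (assumed default) or ">" for
    big-endian. Returns the samples and the duration in seconds.
    """
    with open(path, "rb") as f:
        data = f.read()
    n = len(data) // 2                    # two bytes per sample
    samples = struct.unpack(f"{byte_order}{n}h", data[: 2 * n])
    return list(samples), n / sample_rate
```

Because a raw file has no header, the sampling rate passed here must match the rate set for the recognizer; nothing in the file itself records it.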
Components cont.
• LM format
  • supports up to 4-gram language models
• Pronunciation lexicon format
• Acoustic model format
  • uses binary files produced by the trainer;
  • <phoneme>.<state>-<context>, e.g. AA.1-l;
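The `<phoneme>.<state>-<context>` naming pattern can be split into its three fields with a regular expression. This is only a sketch based on the single example `AA.1-l`; the allowed characters in each field and the meaning of the context field are assumptions, and `parse_unit_name` is a hypothetical helper, not part of Sonic.

```python
import re

# Assumed shape: letters, a dot, a state number, a hyphen, a context label.
UNIT_NAME = re.compile(r"^(?P<phoneme>[A-Za-z]+)\.(?P<state>\d+)-(?P<context>\S+)$")


def parse_unit_name(name):
    """Split a name such as 'AA.1-l' into (phoneme, state, context)."""
    match = UNIT_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized model unit name: {name!r}")
    return match.group("phoneme"), int(match.group("state")), match.group("context")
```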
Discussion
• Unlike HTK, the trainer code estimates models for one base phone at a time. Potential problem?