This document describes the procedures and components involved in building a speech recognizer with SONIC. It covers preparation, acoustic and language model training, adaptation, testing, feature extraction, and an overview of recognition, and it details components such as the audio file format, phoneme configuration, language model format, pronunciation lexicon, and acoustic model format.
Using SONIC to build a speech recognizer. Pellom & Hacioglu, "SONIC: The University of Colorado Continuous Speech Recognizer," Center for Spoken Language Research Technical Report TR-CSLR-2001-01, University of Colorado, 2003. Presented by Yang Shao, CIS788K04, Wi04.
Performance on standard tasks • On a 1.7 GHz Pentium 4
Procedures • Preparation • identify the goal; • decide the recognition unit: phoneme, syllable, word, etc.; • prepare the corpus: training, development, and test sets; • label part of the training data (optional); • etc.
Procedures cont. • Training • Acoustic model training; • Language model training; • Adaptation • Speaker adaptation (VTLN, MLLR, MAP); • Environment adaptation (to compensate for mismatch between training and testing conditions); • Testing
Acoustic model training • Feature extraction followed by iterative Viterbi state-based alignment and model re-estimation; • Outputs a set of decision-tree state-clustered HMMs.
Feature extraction (PMVDR) • Perceptual Minimum Variance Distortionless Response cepstral coefficients; • fea [options] speechfile.raw featurefile.fea • Dynamic features;
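A minimal usage sketch for the fea tool above, assuming the raw audio files sit in the current directory and the default options are acceptable (the individual option flags are described in the SONIC documentation and are not shown here):

    # extract PMVDR features for every raw audio file in the corpus
    for f in *.raw; do
        fea "$f" "${f%.raw}.fea"
    done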
Language Model I • Finite state grammar in terms of a regular expression;
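As a purely illustrative example (the word list is hypothetical, and this is not necessarily SONIC's exact grammar-file syntax), a small confirmation task could be described by a regular expression over words:

    ( yes | yeah | correct ) | ( no | nope | wrong )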
Language model II • Language model: • P(W) = P(w1, w2, …, wm) gives the probability of a given word sequence; • expanded by the chain rule as P(W) = P(w1)P(w2|w1)…P(wm|w1, …, wm-1); • N-gram approximation: P(wi|w1, …, wi-1) ≈ P(wi|wi-N+1, …, wi-1); • calculated from training-text counts as P(wi|wi-N+1, …, wi-1) = C(wi-N+1, …, wi) / C(wi-N+1, …, wi-1); • Bigram example: P(Mary loves that person) = P(Mary|<s>)P(loves|Mary)P(that|loves)P(person|that)
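For instance, with hypothetical training-text counts C(Mary loves) = 4 and C(Mary) = 20, the maximum-likelihood bigram estimate would be P(loves|Mary) = 4/20 = 0.2.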
Recognition overview • Speech-enabled applications can be built by calling functions in the Sonic API; • batch recognition is run from the command line: Sonic_batch -c config.txt [-l]
Configuration file • A text file containing a set of parameters, each followed by its arguments, that establishes the basic settings of the recognizer: • location of the acoustic model files; • location of the language model file; • location of the pronunciation lexicon; • recognizer settings such as search beams, pruning settings, etc.; • (optional) a pointer to a control file listing the audio files to process.
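A minimal batch-run sketch, assuming the control file simply lists one raw audio file per line (file names are placeholders; the configuration keys themselves are defined in the SONIC documentation and are not reproduced here):

    # control.txt (assumed format: one raw audio file per line)
    #   utt001.raw
    #   utt002.raw
    # config.txt points to the acoustic model, language model, lexicon, and control.txt
    Sonic_batch -c config.txt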
Components • Audio file format: • 16-bit linear PCM format (raw); • sampling rate is configurable (8 kHz default); • Phoneme configuration file format: • supports the 55-phoneme symbol set adopted by the CMU Sphinx-II speech recognizer.
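An existing WAV recording can be converted to the raw audio format above with a standard tool such as sox (file names here are placeholders):

    # convert to headerless 16-bit signed linear PCM, mono, 8 kHz
    sox input.wav -t raw -r 8000 -c 1 -b 16 -e signed-integer output.raw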
Components cont. • LM format: • supports up to 4-gram language models; • Pronunciation lexicon format; • Acoustic model format: • uses binary files produced by the trainer; • model files are named <phoneme>.<state>-<context>, e.g., AA.1-l.
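As an illustration of the kind of entries a pronunciation lexicon contains, mapping words to phoneme strings from the Sphinx-II set (the exact file syntax used by SONIC may differ):

    HELLO   HH AH L OW
    WORLD   W ER L D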
Discussion • Unlike HTK, the trainer code estimates models for one base phone at a time. Potential problem?