This document describes the procedures and components involved in building a speech recognizer with SONIC. It covers preparation, acoustic and language model training, adaptation, testing, feature extraction, and an overview of recognition, and it details components such as the audio file format, phoneme configuration, language model format, pronunciation lexicon, and acoustic model format.
Using SONIC to build a speech recognizer. Pellom & Hacioglu, "SONIC: The University of Colorado Continuous Speech Recognizer," Center for Spoken Language Research Technical Report TR-CSLR-2001-01, University of Colorado, 2003. Presented by Yang Shao, CIS788K04, Wi04.
Performance on standard tasks • On a 1.7 GHz Pentium 4
Procedures • Preparation • identify the goal; • decide the recognition unit: phoneme, syllable, word, etc.; • prepare the corpus: training, development, and test sets; • label part of the training data (optional); • etc.
Procedures cont. • Training • Acoustic model training; • Language model training; • Adaptation • Speaker adaptation (VTLN, MLLR, MAP); • Environment adaptation (to compensate for mismatch between training and testing conditions); • Testing
Acoustic model training • Feature extraction followed by iterative Viterbi state-based alignment and model re-estimation; • Outputs a set of decision-tree state-clustered HMMs.
Feature extraction (PMVDR) • Perceptual Minimum Variance Distortionless Response cepstral coefficients; • fea [options] speechfile.raw featurefile.fea • Dynamic features;
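A minimal usage sketch for the fea tool above, assuming the raw audio files sit in the current directory and the default options are acceptable (the individual option flags are described in the SONIC documentation and are not shown here):

    # extract PMVDR features for every raw audio file in the corpus
    for f in *.raw; do
        fea "$f" "${f%.raw}.fea"
    done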
Language Model I • Finite state grammar in terms of a regular expression;
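As a purely illustrative example (the word list is hypothetical, and this is not necessarily SONIC's exact grammar-file syntax), a small confirmation task could be described by a regular expression over words:

    ( yes | yeah | correct ) | ( no | nope | wrong )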
Language model II • Language model: • P(W) = P(w1, w2, …, wm) gives the probability of a given word sequence; • expanded by the chain rule as P(W) = P(w1)P(w2|w1)…P(wm|w1, …, wm-1); • N-gram approximation: P(wi|w1, …, wi-1) ≈ P(wi|wi-N+1, …, wi-1); • calculated from training-text counts as P(wi|wi-N+1, …, wi-1) = C(wi-N+1, …, wi) / C(wi-N+1, …, wi-1); • Bigram example: P(Mary loves that person) = P(Mary|<s>)P(loves|Mary)P(that|loves)P(person|that)
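For instance, with hypothetical training-text counts C(Mary loves) = 4 and C(Mary) = 20, the maximum-likelihood bigram estimate would be P(loves|Mary) = 4/20 = 0.2.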
Recognition overview • Speech-enabled applications can be built by calling functions in the Sonic API; • batch recognition is run from the command line: Sonic_batch -c config.txt [-l]
Configuration file • A text file containing a set of parameters, each followed by its arguments, that establishes the basic settings of the recognizer: • location of the acoustic model files; • location of the language model file; • location of the pronunciation lexicon; • recognizer settings such as search beams, pruning settings, etc.; • (optional) a pointer to a control file listing the audio files to process.
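A minimal batch-run sketch, assuming the control file simply lists one raw audio file per line (file names are placeholders; the configuration keys themselves are defined in the SONIC documentation and are not reproduced here):

    # control.txt (assumed format: one raw audio file per line)
    #   utt001.raw
    #   utt002.raw
    # config.txt points to the acoustic model, language model, lexicon, and control.txt
    Sonic_batch -c config.txt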
Components • Audio file format: • 16-bit linear PCM format (raw); • sampling rate is configurable (8 kHz default); • Phoneme configuration file format: • supports the 55-phoneme symbol set adopted by the CMU Sphinx-II speech recognizer.
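An existing WAV recording can be converted to the raw audio format above with a standard tool such as sox (file names here are placeholders):

    # convert to headerless 16-bit signed linear PCM, mono, 8 kHz
    sox input.wav -t raw -r 8000 -c 1 -b 16 -e signed-integer output.raw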
Components cont. • LM format: • supports up to 4-gram language models; • Pronunciation lexicon format; • Acoustic model format: • uses binary files produced by the trainer; • model files are named <phoneme>.<state>-<context>, e.g., AA.1-l.
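As an illustration of the kind of entries a pronunciation lexicon contains, mapping words to phoneme strings from the Sphinx-II set (the exact file syntax used by SONIC may differ):

    HELLO   HH AH L OW
    WORLD   W ER L D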
Discussion • Unlike HTK, the trainer code estimates models for one base phone at a time. Potential problem?