Building a Speech Recognizer Using SONIC: Detailed Procedures and Components

This document provides detailed procedures and components for building a speech recognizer using SONIC technology. It covers preparation steps, training processes, adaptation techniques, testing procedures, feature extraction methods, language modeling, and recognition overview. Components such as audio file format, phoneme configuration, LM format, pronunciation lexicon, and acoustic model are discussed with relevant details.

Presentation Transcript


  1. Using SONIC to build a speech recognizer • Pellom & Hacioglu, "Sonic: The University of Colorado Continuous Speech Recognizer," Center for Spoken Language Research Technical Report TR-CSLR-2001-01, U. Colorado, 2003 • Presented by Yang Shao, CIS788K04 Wi04

  2. Performance on standard tasks • On a 1.7GHz Pentium 4

  3. Procedures • Preparation • identify the goal; • decide the recognition unit: phoneme, syllable, word, etc.; • prepare the corpus: training, development, and testing sets; • label part of the training data (optional); • etc.
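The corpus preparation step above can be sketched as a reproducible split of utterance IDs into training, development, and testing sets. This is only an illustration: the function name `split_corpus`, the 10% fractions, and the fixed seed are assumptions, not anything prescribed by the slides or by Sonic.

```python
import random


def split_corpus(utterances, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle utterance IDs and split them into train/dev/test lists.

    The fractions and the seed are illustrative defaults (assumptions).
    """
    items = list(utterances)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_dev = int(len(items) * dev_frac)
    n_test = int(len(items) * test_frac)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test


if __name__ == "__main__":
    ids = [f"utt{i:03d}" for i in range(100)]
    train, dev, test = split_corpus(ids)
    print(len(train), len(dev), len(test))
```

In practice the split is often done by speaker rather than by utterance, so that no test speaker is seen in training.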

  4. Procedures cont. • Training • Acoustic model training; • Language model training; • Adaptation • Speaker adaptation (VTLN, MLLR, MAP); • Environment adaptation (mismatch between training and testing conditions); • Testing

  5. Acoustic model training • Feature extraction, followed by iterative steps of Viterbi state-based alignment and model estimation; • Outputs a set of decision-tree state-clustered HMMs.
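The Viterbi state-based alignment step can be illustrated with a minimal sketch. It is not Sonic's trainer: it assumes a strictly left-to-right HMM in which each frame either stays in the current state or advances by one, treats all transitions as equally likely, and takes the per-frame state log-likelihoods as given. The name `viterbi_align` is hypothetical.

```python
import math


def viterbi_align(log_likes):
    """Return the best state index for each frame.

    log_likes[t][s] is the log-likelihood of frame t under state s.
    Assumptions: the path starts in state 0, ends in the last state,
    and moves by at most one state per frame (needs frames >= states).
    """
    n_frames = len(log_likes)
    n_states = len(log_likes[0])
    neg_inf = -math.inf
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_likes[0][0]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                score[t][s] = move + log_likes[t][s]
                back[t][s] = s - 1
            else:
                score[t][s] = stay + log_likes[t][s]
                back[t][s] = s
    # Trace back from the final state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    frames = [[0, -5, -5], [0, -5, -5], [-5, 0, -5], [-5, -5, 0], [-5, -5, 0]]
    print(viterbi_align(frames))
```

In an iterative training loop, the frames assigned to each state by such an alignment would be used to re-estimate that state's parameters, and the alignment would then be recomputed with the updated models.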

  6. Feature extraction (PMVDR) • Perceptual Minimum Variance Distortionless Response cepstral coefficients; • fea [options] speechfile.raw featurefile.fea • Dynamic features;
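The slide does not say how the dynamic features are computed. A common choice, assumed here purely for illustration, is the regression-based delta over a window of ±2 frames with edge frames repeated at the boundaries. The function name `delta` and the window size are assumptions, not part of the `fea` tool.

```python
def delta(frames, window=2):
    """Compute regression-based delta features for a list of frames.

    frames is a list of equal-length lists of static coefficients.
    Boundary frames are repeated so every frame has a full window.
    """
    n_frames = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(n_frames):
        row = []
        for d in range(dim):
            num = 0.0
            for n in range(1, window + 1):
                nxt = frames[min(t + n, n_frames - 1)][d]
                prv = frames[max(t - n, 0)][d]
                num += n * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out


if __name__ == "__main__":
    ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
    print(delta(ramp))
```

Applying the same function to its own output gives acceleration (delta-delta) features.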

  7. Language Model I • Finite state grammar in terms of a regular expression;
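To make the idea of a grammar written as a regular expression concrete, here is a small sketch using Python's `re` module as a stand-in. The slide does not show Sonic's grammar syntax, so both the notation and the toy command grammar below are assumptions.

```python
import re

# Hypothetical toy grammar: a verb, a name, and an optional "please".
GRAMMAR = re.compile(r"(call|dial) (mary|john)( please)?")


def accepts(sentence):
    """Return True if the whole word sequence is allowed by the grammar."""
    return GRAMMAR.fullmatch(sentence) is not None


if __name__ == "__main__":
    for s in ["call mary", "dial john please", "mary call"]:
        print(s, accepts(s))
```

A grammar like this restricts the recognizer to a finite set of word sequences, in contrast to an N-gram model, which assigns a probability to any sequence.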

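  8. Language model II • Language model: • P(W) = P(w1, w2, …, wm) gives the probability of a given word sequence; • expanded by the chain rule as P(W) = P(w1) P(w2|w1) … P(wm|w1, …, wm−1); • N-gram: each word is conditioned only on the previous N−1 words, P(wi|wi−N+1, …, wi−1); • Calculated from counts, e.g. for a bigram P(wi|wi−1) = C(wi−1 wi) / C(wi−1); • Bigram example: P(Mary loves that person) = P(Mary|<s>)P(loves|Mary)P(that|loves)P(person|that)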
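The bigram example can be reproduced with a short sketch that estimates probabilities from counts. It assumes plain maximum-likelihood estimates with no smoothing or back-off and, like the slide's example, omits an end-of-sentence term. The three-sentence corpus and the function names are invented for illustration.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and how often each word occurs as a history."""
    history_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        tokens = ["<s>"] + words
        for prev, cur in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts


def sentence_prob(words, history_counts, bigram_counts):
    """Multiply P(cur | prev) over the sentence; unseen histories give 0."""
    tokens = ["<s>"] + words
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if history_counts[prev] == 0:
            return 0.0
        prob *= bigram_counts[(prev, cur)] / history_counts[prev]
    return prob


if __name__ == "__main__":
    corpus = [
        ["Mary", "loves", "that", "person"],
        ["Mary", "loves", "John"],
        ["John", "loves", "that", "person"],
    ]
    hist, bi = train_bigram(corpus)
    print(sentence_prob(["Mary", "loves", "that", "person"], hist, bi))
```

On this toy corpus the four factors are 2/3, 1, 2/3, and 1. A practical language model would add smoothing so that unseen bigrams do not receive zero probability.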

  9. Recognition overview • Speech-enabled applications can be built by calling functions within the Sonic API. • Sonic_batch -c config.txt [-l]

  10. Configuration file • A text file containing a set of parameters, each followed by its arguments, that establish the basic settings of the recognizer: • location of the acoustic model files; • location of the language model file; • location of the pronunciation lexicon; • recognizer settings such as search beams, pruning settings, etc.; • (optional) a pointer to a control file containing a list of audio files to process.

  11. Components • Audio file format: • 16-bit linear PCM format (raw); • sampling rate is configurable (8 kHz default); • Phoneme configuration file format • supports the 55-phoneme symbol set adopted by the CMU Sphinx-II speech recognizer.
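A headerless 16-bit linear PCM file can be read with the Python standard library alone, as in the sketch below. The slide does not state the byte order, so little-endian is an assumption here, exposed as a parameter. The function names are hypothetical, and the 8000 Hz default mirrors the default sampling rate mentioned above.

```python
import array
import sys


def read_raw_pcm16(path, little_endian=True):
    """Read a headerless file of signed 16-bit samples.

    Byte order is an assumption; pass little_endian=False for big-endian data.
    """
    samples = array.array("h")  # signed 16-bit integers
    with open(path, "rb") as f:
        samples.frombytes(f.read())
    if (sys.byteorder == "little") != little_endian:
        samples.byteswap()  # convert file byte order to native order
    return samples


def duration_seconds(samples, rate_hz=8000):
    """Duration of the audio at the given sampling rate."""
    return len(samples) / rate_hz
```

Because a raw file carries no header, the sampling rate and byte order must be known from elsewhere, which is why the recognizer needs the rate as a setting.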

  12. Components cont. • LM format • supports up to 4-gram language models • Pronunciation lexicon format • Acoustic model format • uses binary files produced by the trainer; • named <phoneme>.<state>-<context>, e.g. AA.1-l;
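The `<phoneme>.<state>-<context>` naming pattern can be split into its three fields with a small parser. This sketch relies only on the pattern and the `AA.1-l` example from the slide. It assumes the state is a decimal integer and does not interpret what the context field means. The name `parse_unit` is hypothetical.

```python
import re

# Pattern taken from the slide: <phoneme>.<state>-<context>
UNIT = re.compile(r"^(?P<phoneme>[^.]+)\.(?P<state>\d+)-(?P<context>.+)$")


def parse_unit(name):
    """Split a model name into (phoneme, state, context)."""
    match = UNIT.match(name)
    if match is None:
        raise ValueError(f"not a <phoneme>.<state>-<context> name: {name!r}")
    return match["phoneme"], int(match["state"]), match["context"]


if __name__ == "__main__":
    print(parse_unit("AA.1-l"))
```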

  13. Discussion • Unlike HTK, the trainer code estimates models for one base phone at a time. Potential problem?
