A new framework for Language Model Training
David Huggins-Daines
January 19, 2006
Overview • Current tools • Requirements for new framework • User Interface Examples • Design and API
Current status of LM training • The CMU SLM toolkit • Efficient implementation of basic algorithms • Doesn't handle all tasks of building an LM • Text normalization • Vocabulary selection • Interpolation/adaptation • Requires an expert to "put the pieces together" • Lots of scripts • SimpleLM, Communicator, CALO, etc. • Other LM toolkits • SRILM, Lemur, others?
Requirements • LM training should be • Repeatable • An “end-to-end” rebuild should produce the same result • Configurable • It should be easy to change parameters and rebuild the entire model to see their effect • Flexible • Should support many types of source texts, methods of training • Extensible • Modular structure to allow new methods and data sources to be easily implemented
Tasks of building an LM • Normalize source texts • They come in many different formats! • LM toolkit expects a stream of words • What is a “word”? • Compound words, acronyms • Non-lexemes (filler words, pauses, disfluencies) • What is a “sentence”? • Segmentation of input data • Annotate source texts with class tags • Select a vocabulary • Determine optimal vocabulary size • Collect words from training texts • Define vocabulary classes • Vocabulary closure • Build a dictionary (pronunciation modeling)
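As a concrete illustration of the cutoff-based vocabulary selection above (and of the cutoff="1" attribute in the Switchboard example below), here is a minimal Perl sketch. It is not the toolkit's code; it assumes one already-normalized, whitespace-tokenized sentence per input line:

    use strict;

    my $cutoff = 1;    # exclude words seen $cutoff times or fewer (singletons)
    my %count;

    # Count word frequencies over the normalized word stream on stdin.
    while (my $line = <STDIN>) {
        chomp $line;
        $count{$_}++ for split ' ', $line;
    }

    # Emit the vocabulary: every word above the cutoff, most frequent first.
    for my $word (sort { $count{$b} <=> $count{$a} } keys %count) {
        print "$word\n" if $count{$word} > $cutoff;
    }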
Tasks, continued • Estimate N-Gram model(s) • Choose the appropriate smoothing parameters • Find the appropriate divisions of the training set • Interpolate N-Gram models • Use a held-out set representative of the test set • Find weights for different models which maximize likelihood (minimize perplexity) on this domain • Evaluate language model • Jointly minimize perplexity and OOV rate • (they tend to move in opposite directions)
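For reference, the standard formulation behind those interpolation bullets (not spelled out on the slide): the combined model is a weighted mixture

$$P(w \mid h) = \sum_i \lambda_i\, P_i(w \mid h), \qquad \lambda_i \ge 0, \quad \sum_i \lambda_i = 1,$$

where the weights $\lambda_i$ are fit (typically by EM) to maximize the likelihood of the held-out set, and perplexity is the quantity being minimized:

$$\mathrm{PP} = \exp\!\Bigl(-\frac{1}{N} \sum_{t=1}^{N} \log P(w_t \mid h_t)\Bigr).$$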
A Simple Switchboard Example

    <!-- Top-level tag: there must be only one -->
    <NGramModel>
      <!-- A set of transcripts -->
      <Transcripts name="swb.files">
        <!-- The input filter to use -->
        <InputFilter::SWB>
          <!-- A list of files -->
          <Transcripts list="swb.files"/>
        </InputFilter::SWB>
      </Transcripts>
      <!-- cutoff="1" excludes singletons -->
      <Vocabulary cutoff="1">
        <!-- Backreference to the named object above -->
        <Transcripts name="swb.files"/>
      </Vocabulary>
    </NGramModel>
A More Complicated Example (interpolation of ICSI and Switchboard)

    <NGramModel name="interp.test">
      <!-- Files can be listed directly in element contents -->
      <Transcripts name="swb.test">
        swb.test.lsn
      </Transcripts>
      <Transcripts name="icsi.test">
        <InputFilter::ICSI>
          icsi.test.mrt
        </InputFilter::ICSI>
      </Transcripts>
      <!-- Vocabularies can be nested (merged) -->
      <Vocabulary name="icsi.swb1">
        <Vocabulary cutoff="1">
          <Transcripts name="swb.test"/>
        </Vocabulary>
        <Vocabulary>
          <Transcripts name="icsi.test"/>
        </Vocabulary>
        <!-- Words can be listed directly in element contents -->
        BRAZIL
      </Vocabulary>
      <NGramModel name="swb.test">
        <Transcripts name="swb.test"/>
        <Vocabulary name="icsi.swb1"/>
      </NGramModel>
      <NGramModel name="icsi.test">
        <Transcripts name="icsi.test"/>
        <Vocabulary name="icsi.swb1"/>
      </NGramModel>
      <Interpolation>
        <!-- Held-out set for interpolation -->
        <InputFilter::CMU>
          cmu.test.trs
        </InputFilter::CMU>
        <!-- Interpolate the previously named LMs -->
        <NGramModel name="swb.test"/>
        <NGramModel name="icsi.test"/>
      </Interpolation>
    </NGramModel>
Command-line Interface • lm_train • “Runs” an XML configuration file • build_vocab • Build vocabularies, normalize transcripts • ngram_train • Train individual N-Gram models • ngram_test • Evaluate N-Gram models • ngram_interpolate • Interpolate and combine N-Gram models • ngram_pronounce • Build a pronunciation lexicon from a language model or vocabulary
Programming Interface • NGramFactory • Builds an NGramModel from an XML specification (as seen previously) • NGramModel • Trains a single N-Gram LM from some transcripts • Vocabulary • Builds a vocabulary from transcripts or other vocabularies • InputFilter • Subclassed into InputFilter::CMU, InputFilter::ICSI, InputFilter::HUB5, InputFilter::ISL, etc. • Reads transcripts in some format and outputs a word stream
Design in Plain English • NGramFactory builds an NGramModel • NGramModel has a Vocabulary • NGramModel and Vocabulary can have Transcripts • NGramModel and Vocabulary use an InputFilter (or maybe they don’t) • NGramModel can merge two other NGramModels using a set of Transcripts • Vocabulary can merge another Vocabulary
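A hypothetical usage sketch of this design (the class names come from the slides, but the constructor and method names new, create, and save are assumptions, not the toolkit's documented API):

    use strict;
    use NGramFactory;

    # Build an entire model from an XML specification, as lm_train does.
    my $factory = NGramFactory->new();
    my $lm = $factory->create("swb.xml");  # assumed: returns an NGramModel
    $lm->save("swb.arpa");                 # assumed method and output format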
A very simple InputFilter (InputFilter/Simple.pm)

    # This is just good practice, please!!!
    use strict;

    # Subclass of InputFilter
    package InputFilter::Simple;
    require InputFilter;
    use base 'InputFilter';

    sub process_transcript {
        my ($self, $file) = @_;
        local ($_, *FILE);
        # Read the input file
        open FILE, "<$file" or die "Failed to open $file: $!";
        while (<FILE>) {
            # Tokenize, normalize, etc.
            chomp;
            my @words = split;
            # Pass each sentence to output_sentence
            $self->output_sentence(\@words);
        }
    }

    1;
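Supporting a new corpus format should then only need another small subclass overriding process_transcript. A hypothetical sketch for transcripts with "SPEAKER: utterance" lines; the format name and the normalization steps are illustrative, not from the slides:

    package InputFilter::SpeakerColon;   # hypothetical format, for illustration
    use strict;
    use base 'InputFilter';

    sub process_transcript {
        my ($self, $file) = @_;
        open my $fh, '<', $file or die "Failed to open $file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            $line =~ s/^\s*\S+:\s*//;         # strip the "SPEAKER:" label
            my @words = split ' ', lc $line;  # lowercase as trivial normalization
            $self->output_sentence(\@words) if @words;
        }
    }

    1;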
Where to get it • Currently in CVS on fife.speech • :ext:fife.speech.cs.cmu.edu:/home/CVS • module LMTraining • Future: CPAN and cmusphinx.org • Possibly integrated with the CMU SLM toolkit
Stuff TODO • Class LM support • Communicator-style class tags are recognized and supported • NGramModel will build .lmctl and .probdef files • However, this requires normalizing the files to a transcript first, then running the semi-automatic Communicator tagger • Automatic tagging would be nice… • Support for languages other than English • Text normalization conventions • Word segmentation (for Asian languages) • Character set support (case conversions, etc.) • Unicode (also a CMU-SLM problem)