A new framework for Language Model Training
David Huggins-Daines
January 19, 2006
Overview • Current tools • Requirements for new framework • User Interface Examples • Design and API
Current status of LM training • The CMU SLM toolkit • Efficient implementation of basic algorithms • Doesn't handle all tasks of building an LM • Text normalization • Vocabulary selection • Interpolation/adaptation • Requires an expert to "put the pieces together" • Lots of scripts • SimpleLM, Communicator, CALO, etc. • Other LM toolkits • SRILM, Lemur, others?
Requirements • LM training should be • Repeatable • An “end-to-end” rebuild should produce the same result • Configurable • It should be easy to change parameters and rebuild the entire model to see their effect • Flexible • Should support many types of source texts, methods of training • Extensible • Modular structure to allow new methods and data sources to be easily implemented
Tasks of building an LM • Normalize source texts • They come in many different formats! • LM toolkit expects a stream of words • What is a “word”? • Compound words, acronyms • Non-lexemes (filler words, pauses, disfluencies) • What is a “sentence”? • Segmentation of input data • Annotate source texts with class tags • Select a vocabulary • Determine optimal vocabulary size • Collect words from training texts • Define vocabulary classes • Vocabulary closure • Build a dictionary (pronunciation modeling)
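As a concrete illustration of the cutoff-based vocabulary selection above (and of the cutoff="1" attribute in the Switchboard example below), here is a minimal Perl sketch. It is not the toolkit's code; it assumes one already-normalized, whitespace-tokenized sentence per input line:

    use strict;

    my $cutoff = 1;    # exclude words seen $cutoff times or fewer (singletons)
    my %count;

    # Count word frequencies over the normalized word stream on stdin.
    while (my $line = <STDIN>) {
        chomp $line;
        $count{$_}++ for split ' ', $line;
    }

    # Emit the vocabulary: every word above the cutoff, most frequent first.
    for my $word (sort { $count{$b} <=> $count{$a} } keys %count) {
        print "$word\n" if $count{$word} > $cutoff;
    }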
Tasks, continued • Estimate N-Gram model(s) • Choose the appropriate smoothing parameters • Find the appropriate divisions of the training set • Interpolate N-Gram models • Use a held-out set representative of the test set • Find weights for different models which maximize likelihood (minimize perplexity) on this domain • Evaluate language model • Jointly minimize perplexity and OOV rate • (they tend to move in opposite directions)
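For reference, the standard formulation behind those interpolation bullets (not spelled out on the slide): the combined model is a weighted mixture

$$P(w \mid h) = \sum_i \lambda_i\, P_i(w \mid h), \qquad \lambda_i \ge 0, \quad \sum_i \lambda_i = 1,$$

where the weights $\lambda_i$ are fit (typically by EM) to maximize the likelihood of the held-out set, and perplexity is the quantity being minimized:

$$\mathrm{PP} = \exp\!\Bigl(-\frac{1}{N} \sum_{t=1}^{N} \log P(w_t \mid h_t)\Bigr).$$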
A Simple Switchboard Example

    <!-- Top-level tag: there must be only one -->
    <NGramModel>
      <!-- A set of transcripts -->
      <Transcripts name="swb.files">
        <!-- The input filter to use -->
        <InputFilter::SWB>
          <!-- A list of files -->
          <Transcripts list="swb.files"/>
        </InputFilter::SWB>
      </Transcripts>
      <!-- cutoff="1" excludes singletons -->
      <Vocabulary cutoff="1">
        <!-- Backreference to the named object above -->
        <Transcripts name="swb.files"/>
      </Vocabulary>
    </NGramModel>
A More Complicated Example (interpolation of ICSI and Switchboard)

    <NGramModel name="interp.test">
      <!-- Files can be listed directly in element contents -->
      <Transcripts name="swb.test">
        swb.test.lsn
      </Transcripts>
      <Transcripts name="icsi.test">
        <InputFilter::ICSI>
          icsi.test.mrt
        </InputFilter::ICSI>
      </Transcripts>
      <!-- Vocabularies can be nested (merged) -->
      <Vocabulary name="icsi.swb1">
        <Vocabulary cutoff="1">
          <Transcripts name="swb.test"/>
        </Vocabulary>
        <Vocabulary>
          <Transcripts name="icsi.test"/>
        </Vocabulary>
        <!-- Words can be listed directly in element contents -->
        BRAZIL
      </Vocabulary>
      <NGramModel name="swb.test">
        <Transcripts name="swb.test"/>
        <Vocabulary name="icsi.swb1"/>
      </NGramModel>
      <NGramModel name="icsi.test">
        <Transcripts name="icsi.test"/>
        <Vocabulary name="icsi.swb1"/>
      </NGramModel>
      <Interpolation>
        <!-- Held-out set for interpolation -->
        <InputFilter::CMU>
          cmu.test.trs
        </InputFilter::CMU>
        <!-- Interpolate the previously named LMs -->
        <NGramModel name="swb.test"/>
        <NGramModel name="icsi.test"/>
      </Interpolation>
    </NGramModel>
Command-line Interface • lm_train • “Runs” an XML configuration file • build_vocab • Build vocabularies, normalize transcripts • ngram_train • Train individual N-Gram models • ngram_test • Evaluate N-Gram models • ngram_interpolate • Interpolate and combine N-Gram models • ngram_pronounce • Build a pronunciation lexicon from a language model or vocabulary
Programming Interface • NGramFactory • Builds an NGramModel from an XML specification (as seen previously) • NGramModel • Trains a single N-Gram LM from some transcripts • Vocabulary • Builds a vocabulary from transcripts or other vocabularies • InputFilter • Subclassed into InputFilter::CMU, InputFilter::ICSI, InputFilter::HUB5, InputFilter::ISL, etc. • Reads transcripts in some format and outputs a word stream
Design in Plain English • NGramFactory builds an NGramModel • NGramModel has a Vocabulary • NGramModel and Vocabulary can have Transcripts • NGramModel and Vocabulary use an InputFilter (or maybe they don’t) • NGramModel can merge two other NGramModels using a set of Transcripts • Vocabulary can merge another Vocabulary
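A hypothetical usage sketch of this design (the class names come from the slides, but the constructor and method names new, create, and save are assumptions, not the toolkit's documented API):

    use strict;
    use NGramFactory;

    # Build an entire model from an XML specification, as lm_train does.
    my $factory = NGramFactory->new();
    my $lm = $factory->create("swb.xml");  # assumed: returns an NGramModel
    $lm->save("swb.arpa");                 # assumed method and output format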
A very simple InputFilter (InputFilter/Simple.pm)

    # This is just good practice, please!!!
    use strict;

    # Subclass of InputFilter
    package InputFilter::Simple;
    require InputFilter;
    use base 'InputFilter';

    sub process_transcript {
        my ($self, $file) = @_;
        local ($_, *FILE);
        # Read the input file
        open FILE, "<$file" or die "Failed to open $file: $!";
        while (<FILE>) {
            # Tokenize, normalize, etc.
            chomp;
            my @words = split;
            # Pass each sentence to output_sentence
            $self->output_sentence(\@words);
        }
    }

    1;
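Supporting a new corpus format should then only need another small subclass overriding process_transcript. A hypothetical sketch for transcripts with "SPEAKER: utterance" lines; the format name and the normalization steps are illustrative, not from the slides:

    package InputFilter::SpeakerColon;   # hypothetical format, for illustration
    use strict;
    use base 'InputFilter';

    sub process_transcript {
        my ($self, $file) = @_;
        open my $fh, '<', $file or die "Failed to open $file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            $line =~ s/^\s*\S+:\s*//;         # strip the "SPEAKER:" label
            my @words = split ' ', lc $line;  # lowercase as trivial normalization
            $self->output_sentence(\@words) if @words;
        }
    }

    1;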
Where to get it • Currently in CVS on fife.speech • :ext:fife.speech.cs.cmu.edu:/home/CVS • module LMTraining • Future: CPAN and cmusphinx.org • Possibly integrated with the CMU SLM toolkit
Stuff TODO • Class LM support • Communicator-style class tags are recognized and supported • NGramModel will build .lmctl and .probdef files • However, this requires normalizing the files to a transcript first, then running the semi-automatic Communicator tagger • Automatic tagging would be nice… • Support for languages other than English • Text normalization conventions • Word segmentation (for Asian languages) • Character set support (case conversions, etc.) • Unicode (also a CMU-SLM problem)