SRILM Based Language Model Name: Venkata Subramanyan Sundaresan Instructor: Dr. Veton Kepuska
N-GRAM Concept • The idea of word prediction is formalized with a probabilistic model called the N-gram. • Statistical models of word sequences are also called language models, or LMs. • The idea of the N-gram model is to approximate the history by just the last few words.
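The approximation above can be sketched with a tiny bigram (N = 2) model. This is an illustrative example, not SRILM code; the toy corpus and function names are made up:

```python
from collections import Counter

# Toy corpus (hypothetical) used to estimate bigram probabilities
# P(w_n | w_{n-1}) = count(w_{n-1} w_n) / count(w_{n-1}).
corpus = "i am sam sam i am i do not like green eggs and ham".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# "i am" occurs 2 times and "i" occurs 3 times, so P(am | i) = 2/3.
print(bigram_prob("i", "am"))
```

The model conditions only on the single previous word, which is exactly the "last few words" approximation the slide describes.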
CORPUS • Counting things in natural language is based on a corpus. • What is a corpus? • It is an online collection of text or speech. • There are two popular corpora: • Brown (a 1-million-word collection) • Switchboard (a collection of 2,430 telephone conversations)
Perplexity • Perplexity is interpreted as the weighted average branching factor of a language. • The branching factor of a language is the number of possible next words that can follow any word. • Perplexity is the most common evaluation metric for N-gram language models. • An improvement in perplexity does not guarantee an improvement in speech recognition performance. • It is commonly used as a quick check of an algorithm.
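The branching-factor interpretation can be checked with a small sketch. Perplexity of a test sequence is PP(W) = P(w1 ... wN)^(-1/N); the probabilities below are made-up illustrative values:

```python
import math

def perplexity(probs):
    """PP(W) = P(w_1 ... w_N) ** (-1/N), computed in log space
    to avoid underflow on long sequences."""
    n = len(probs)
    log_prob = sum(math.log(p) for p in probs)
    return math.exp(-log_prob / n)

# A uniform model over a 10-word vocabulary assigns p = 0.1 to every
# word, so its perplexity equals the branching factor, 10.
print(perplexity([0.1] * 5))
```

This is why perplexity is read as a branching factor: a uniform model over V words has perplexity exactly V.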
SMOOTHING • It is the process of flattening the probability distribution implied by a language model, so that all reasonable word sequences can occur with some probability.
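The simplest illustration of this flattening is Laplace (add-one) smoothing. This is not one of the SRILM methods used later (those are Good-Turing and absolute discounting); it is only a minimal sketch of how smoothing gives unseen words non-zero probability:

```python
from collections import Counter

# Observed unigram counts (hypothetical); "dog" was never seen.
counts = Counter({"the": 3, "cat": 1})
vocab = ["the", "cat", "dog"]
total = sum(counts.values())

def smoothed_prob(word):
    """Laplace smoothing: add 1 to every count, renormalize."""
    return (counts[word] + 1) / (total + len(vocab))

# The unseen word now gets probability 1/7 instead of 0.
print(smoothed_prob("dog"))
```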
Aspiration • To use the SRI-LM (LM: Language Modeling) toolkit to build different language models. • The following language models are built: • Good-Turing smoothing • Absolute discounting
Linux Environment in Windows • To get a Linux environment on the Windows operating system, we have to install Cygwin. • This is open-source software and can be downloaded from www.cygwin.com. • Another main reason for installing Cygwin is that SRI-LM can be run on the Cygwin platform.
Cygwin Installation Procedure • Go to the web page above. • Download the setup file. • Select "Install from Internet". • Give the destination folder where Cygwin should be installed. • There will be many download sites to choose from. • Select one site and install all the packages.
SRILM • Download the SRILM toolkit, srilm.tgz, from the following source: • http://www.speech.sri.com/projects/srilm/ • Run the Cygwin terminal window. • SRILM is downloaded as a compressed tar archive. • Extract the SRILM archive inside the Cygwin environment. • Extraction can be done with the following command: tar zxvf srilm.tgz
SRILM Installation • Once the download is completed, we have to edit the Makefile in the Cygwin folder. • Once the editing is done, we run the following in Cygwin to install SRILM: $ make World
Functions of SRILM • Generate N-gram counts from the corpus. • Train a language model based on the N-gram count file. • Use the trained language model to calculate test-data perplexity.
Lexicon • A lexicon is a container of the words belonging to a language. • Reference: Wikipedia
Lexicon Generation • Use the "wordtokenization.pl" script to generate the lexicon for our requirement. • Generate the lexicon using the following commands: • cat train/en_*.txt > corpus.txt • perl wordtokenization.pl < corpus.txt | sort | uniq > lexicon.txt
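In effect, the `tokenize | sort | uniq` pipeline produces a sorted list of the unique word types in the corpus. A Python sketch of that effect (the real wordtokenization.pl script may tokenize differently, e.g. handle punctuation):

```python
def build_lexicon(text):
    """Return the sorted unique word types of a text,
    mimicking `tokenize | sort | uniq`."""
    tokens = text.lower().split()  # simplistic tokenization (assumption)
    return sorted(set(tokens))

corpus = "the cat sat on the mat"
print(build_lexicon(corpus))  # each word type appears once, sorted
```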
Count File • Generate the 3-gram count file by using the following command: • $ ./ngram-count -vocab lexicon.txt -text corpus.txt -order 3 -write count.txt -unk
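Conceptually, ngram-count with -order 3 tallies every 1-, 2-, and 3-gram in the text. A minimal Python sketch of that counting (not the SRILM implementation):

```python
from collections import Counter

def ngram_counts(tokens, order=3):
    """Count all n-grams of length 1..order, as ngram-count does
    conceptually before writing them to the count file."""
    counts = Counter()
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

tokens = "a b a b c".split()
counts = ngram_counts(tokens)
print(counts[("a", "b")])  # the bigram "a b" occurs twice
```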
Good-Turing Language Model $ ./ngram-count -read project/count.txt -order 3 -lm project/gtlm.txt -gt1min 1 -gt1max 3 -gt2min 1 -gt2max 3 -gt3min 1 -gt3max 3 • This command has to be typed at the terminal prompt. • -lm lmfile: estimate a backoff N-gram model from the total counts and write it to lmfile.
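The core Good-Turing re-estimation that this command applies (within the -gtNmin/-gtNmax count range) replaces a raw count c with c* = (c + 1) * N_{c+1} / N_c, where N_c is the number of distinct n-grams seen exactly c times. An illustrative sketch with made-up count-of-count values:

```python
def good_turing_count(c, freq_of_freqs):
    """Good-Turing adjusted count: c* = (c + 1) * N_{c+1} / N_c."""
    return (c + 1) * freq_of_freqs.get(c + 1, 0) / freq_of_freqs[c]

# Suppose 10 n-gram types were seen once and 4 were seen twice.
freq_of_freqs = {1: 10, 2: 4}

# An n-gram seen once gets adjusted count 2 * 4 / 10 = 0.8,
# i.e. singletons are discounted to free mass for unseen events.
print(good_turing_count(1, freq_of_freqs))
```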
Absolute Discounting Language Model $ ./ngram-count -read project/count.txt -order 3 -lm adlm.txt -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5 • Here the order N can be anything between 1 and 9.
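Absolute discounting subtracts a fixed amount D from every observed count (D = 0.5 here, matching -cdiscount1 0.5) and reserves the freed probability mass for the backoff distribution. A sketch with hypothetical bigram counts for a single context:

```python
# Counts of bigrams starting with "the" (hypothetical).
D = 0.5
bigram_counts = {("the", "cat"): 3, ("the", "dog"): 1}
context_total = sum(bigram_counts.values())  # count("the") here: 4

def discounted_prob(bigram):
    """Absolute discounting: subtract D from each observed count."""
    return max(bigram_counts[bigram] - D, 0) / context_total

# Mass freed by discounting, redistributed via backoff.
reserved_mass = D * len(bigram_counts) / context_total

print(discounted_prob(("the", "cat")), reserved_mass)
```

Note that the discounted probabilities plus the reserved mass still sum to 1, so the model remains a proper distribution.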