>65536 Arthur Chan May 4, 2006
What's so special about 65536? • 65536 = 2 ^ 16 • Do you know? • Sphinx III did not support language models with more than 65536 (2^16) words • CMU-Cambridge LM Toolkit V2 is not happy about text with more than 65536 unique words either • Though a word's count can exceed 65536 (with -four_byte_counts)
Why was 65536 the limit? • Both Sphinx III and CMU-Cambridge LM Toolkit V2 were written in 1995-99 • A time when having 64M of RAM was extravagant • (Now 64G seems to be the number.) • (At that time, a Pentium 166 or 200 was hot) • Programmers therefore designed clever structures to deal with memory issues • In Sphinx, the DMP format was invented (word ID: 16 bits) • In the CMU-Cambridge LM Toolkit, 16-bit data types were used for word IDs.
This Talk (30-35 pages) • Describes our effort to break the 16-bit limit in • Sphinx 3 • CMU-Cambridge LM Toolkit V2 • Half a talk • Features not yet fully tested in real life • But the talk itself is quite long. • Technical details of the changes • Sphinx III: the easy part (9 pages) • CMU-Cambridge LM Toolkit: the tough part (10 pages) • The Root of the Evil (11 pages) • Why does this problem exist? Why does it persist? • What if similar problems appear? How do we solve and avoid them?
Disclaimer about the Speaker • Notorious for being negative about language modeling techniques • Symptom 1: Yells at others when his LM code has bugs. • Symptom 2: Yells at others <period> • He should be forgiven because • His Master's thesis supervisor taught him that when he was young • Prof. Ronald Rosenfeld taught him the same • He also read Dr. Joshua Goodman's papers
Terminology • "Probability" actually means • The estimate of the probability • Back-off weight means • When some n-gram is unseen in the training data • Back off to the (n-1)-gram "probability" times a weight • According to Manning, four-gram should be tetragram and bigram should be digram • Well, it's lucky that doesn't matter to us today
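For concreteness (this recursion is not spelled out on the original slide), the back-off scheme referred to here is the standard one:

\[
P(w_i \mid w_{i-n+1}^{i-1}) =
\begin{cases}
\hat{P}(w_i \mid w_{i-n+1}^{i-1}) & \text{if } c(w_{i-n+1}^{i}) > 0 \\
\alpha(w_{i-n+1}^{i-1}) \, P(w_i \mid w_{i-n+2}^{i-1}) & \text{otherwise}
\end{cases}
\]

where \hat{P} is the discounted ("probability") estimate and \alpha is the back-off weight attached to the (n-1)-gram context.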
LM Component of Sphinx III The Easy Part
What Sphinx 3.6 RC1 supports • ARPA LM • DMP LM • A memory-efficient version of the ARPA LM • Can be run in disk mode as well • Class-based LM • Multiple LMs, with dynamic LM switching • lm_convert • (new in 3.6!) A conversion tool for ARPA and DMP LMs
A note on the DMP format • A tree-like format • Bigrams are indexed by their prefix unigram • Trigrams are indexed by their prefix bigram • Bigram and trigram probabilities and back-off weights • Quantized to 4 decimal places, so you see the following statements in the code:
Funny C statements in the Code
/* HACK!! to quantize probs to 4 decimal digits */
p = p3 * 10000;   /* p is an integer, so everything past 4 decimal digits is truncated */
p3 = p * 0.0001;  /* back to floating point, now quantized to 4 decimal places */
If you delete this, the LM will be larger, because quantization is not done (more distinct probability values have to be stored).
Reasons why Sphinx III only supports fewer than 65536 words • 16-bit data structures (sketched below) for • Bigrams • Trigrams • The cache structure
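To make the limit concrete, here is a minimal sketch of a 16-bit bigram entry (the field names are hypothetical, not Sphinx III's exact lm.h declarations): with a 16-bit word ID, only 2^16 = 65536 distinct words can ever be addressed.

#include <stdint.h>

typedef struct {
    uint16_t wid;      /* word ID: at most 2^16 = 65536 distinct words */
    uint16_t prob_id;  /* index into a table of quantized probabilities */
    uint16_t bo_id;    /* index into a table of quantized back-off weights */
    uint16_t first_tg; /* offset of this bigram's first trigram successor */
} bigram16_t;          /* 8 bytes per entry: very compact, but vocabulary-limited */

Widening wid (and its trigram and cache counterparts) to 32 bits is exactly the surgery discussed next.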
Bogus Reason Why Sphinx III Doesn't Support More Than 65536 Words • A very bad misconception • "The decoding is constrained by the dictionary" • WRONG • In both flat and tree lexicon search, only LM words are traversed. • RIGHT • Generally, decoding is constrained by the intersection of the LM words and the dictionary words
Several Proposed Surgical Procedures • 1. Rewrite the whole LM routine • Oops! That takes too much time, • and the old routine is very memory-efficient • 2. Replace the old LM by just switching the type of the data structure • Problem: all the binary LMs we have generated use the old layout. • We would lose backward compatibility very badly
Final Solution • lm now supports two data structures: 16-bit and 32-bit • lm_convert and decode support two types of binary LM • DMP, which has the 16-bit layout • DMP32, which has the 32-bit layout • A magic version number decides which layout to use (see the sketch below) • Regression tests can ensure no bad code check-ins • Which format is in use is hidden from • anyone calling the lm routines (with a few exceptions)
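A minimal sketch of the dispatch idea (the constants and function names here are hypothetical, not Sphinx III's actual symbols): the loader peeks at the header's version field and routes the rest of the file to the matching reader, so callers never see the difference.

#include <stdio.h>
#include <stdint.h>

#define DMP16_MAGIC 1   /* hypothetical marker for the legacy 16-bit layout */
#define DMP32_MAGIC 2   /* hypothetical marker for the new 32-bit layout */

static int lm_read_dmp16(FILE *fp) { (void)fp; /* ... parse 16-bit word IDs ... */ return 0; }
static int lm_read_dmp32(FILE *fp) { (void)fp; /* ... parse 32-bit word IDs ... */ return 0; }

int lm_read(const char *path)
{
    FILE *fp = fopen(path, "rb");
    int32_t magic;
    if (fp == NULL)
        return -1;
    if (fread(&magic, sizeof magic, 1, fp) != 1) {
        fclose(fp);
        return -1;
    }
    /* the magic version number alone decides the layout */
    int ret = (magic == DMP32_MAGIC) ? lm_read_dmp32(fp) : lm_read_dmp16(fp);
    fclose(fp);
    return ret;
}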
Partial Verification of the Code • The 16-bit and 32-bit code produce exactly the same decoding results for • decode • decode_anytopo • (allphone's trigram mode was probably left untested.) • A faked LM with more than 65536 words can be built and run in decode
Current Practical Limit • The lm data structure in lm.h • Theoretically supports LMs with • Fewer than 4 billion unigrams • Fewer than 4 billion bigrams • Fewer than 4 billion trigrams • What if we have an n-gram count larger than 4 billion? • Answer: we are dead people • Further answer: it is easily fixable • Other data structures from Sphinx 3? • hash.c doesn't return prime numbers larger than 900001 • Further answer: that is easily fixable as well
Conclusion • Technically • Sphinx III's 32-bit mode was not that difficult to take care of. • The problem was confined to one data structure • Thanks to the modular design of Sphinx III • Pretty easy to solve. • Sphinx III's decision to use a binary format • If I were Ravi, I would have done the same • Much faster loading time for large models.
CMU-Cambridge LM Toolkit V2 The Tough Part
CMU-Cambridge LM Toolkit Version 2 • LM support in CMU-Cambridge LM Toolkit Version 2 • LM training • Parameter estimation with back-off weight computation • Supports both • LMs in ARPA format • LMs in BINLM format • BINLM is not the same as the DMP format. • binlm2arpa can translate BINLM to ARPA
Purpose of the toolkit • Training LMs for • Speech recognition • Statistical machine translation • Document classification • Handwriting recognition • A note: • Occasionally, speech recognition really isn't everything
Standard Procedure of Training • In V2's time, • David Huggins-Daines wasn't at CMU • Training is separated into 4 stages (see the pipeline sketch below) • text2wfreq: build the word frequency table • wfreq2vocab • Find the vocabulary we need (smaller than the frequency table) • text2idngram • Convert the text into a stream of n-grams and their counts (idngram) • The n-gram word IDs are sorted alphabetically • idngram2lm • Gather the counts, compute the discounted estimates and the back-off weights.
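As a usage sketch, the canonical 4-stage pipeline looks roughly like this (file names are illustrative, and the flags are quoted from memory of the V2 documentation, so treat them as assumptions rather than exact invocations):

cat corpus.text  | text2wfreq                        > corpus.wfreq
cat corpus.wfreq | wfreq2vocab -top 20000            > corpus.vocab
cat corpus.text  | text2idngram -vocab corpus.vocab  > corpus.idngram
idngram2lm -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.arpa

Each stage reads the previous stage's output, which is why the intermediate files (especially the idngram stream) dominate the disk usage discussed later.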
Reasons why V2 doesn't support more than 65536 words • There is one single file that typedefs many data structures • But those typedefs are not used very often • Most variables don't use the typedefs • Many of them are declared directly as • unsigned short • int
Another Issue… • What if we have more than 4 billion n-grams this time? • e.g. if n > 5 • Not forgivable in LM training because • MT people are already having this problem (their unigram size is 5 million)
Strategy • Spent 90% of the time making sure the data types were declared correctly • Gave up on supporting both the 16-bit and the 32-bit binary layouts together • A compile-time switch (THIRTYTWOBITS) is provided instead (see the sketch below) • Reasonable because users seldom used BINLM anyway • Users need to use the DMP format in Sphinx III • That tool chain is now complete • The number of n-grams is a 64-bit number
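A minimal sketch of the compile-time switch (the type names below are hypothetical; THIRTYTWOBITS is the switch named on the slide above):

/* selected at build time, e.g. cc -DTHIRTYTWOBITS ... */
#ifdef THIRTYTWOBITS
typedef unsigned int        wordid_t;   /* 32-bit word IDs: up to ~4 billion words */
#else
typedef unsigned short      wordid_t;   /* legacy 16-bit word IDs: up to 65536 words */
#endif
typedef unsigned long long  ngram_sz_t; /* the number of n-grams is 64-bit either way */

One binary supports one layout; that is the price of a compile-time (rather than run-time) switch, and why the 16-bit and 32-bit BINLM formats are not interchangeable.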
What we support now • One can train an LM with more than 65536 words • text2wfreq, wfreq2vocab, text2idngram, idngram2lm are fixed • One can convert an LM • binlm2arpa, ngram2mgram are fixed • One can compute the perplexity of an LM and some statistics from the text • evallm, idngram2stats are fixed
Other Devils in the Details • V2's hash table uses a very bad hash function • Many collisions • A legacy from the pre-90s • One could take a 4-hour nap while the word list loads when training a 500k-word model. • After switching to Daniel J. Bernstein's hash function (sketched below), • the load time is acceptable (<1 min) • The binary layout was one of the most time-consuming parts of development
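For reference, this is Bernstein's well-known djb2 string hash, shown as a minimal sketch (the slides don't say exactly which variant was adopted, so take this as illustrative):

/* djb2: start from 5381 and fold in each byte with a multiply by 33 */
unsigned long djb2_hash(const char *s)
{
    unsigned long h = 5381;
    int c;
    while ((c = (unsigned char)*s++) != 0)
        h = h * 33 + c;   /* often written as ((h << 5) + h) + c */
    return h;
}

Its spread over hash buckets is far better than the pre-90s-style functions it replaced, which is where the 4-hour-nap load time went.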
Verification • The 16-bit and 32-bit code produce exactly the same results • The 32-bit code could train an LM from a faked corpus with 10M unique words. • Note: we are talking about unique words. • Both Dave's Perl tests and Arthur's tests all pass. • So things like LM interpolation are actually working too.
Current Limitation • Theoretically supports • 1.84 x 10^19 n-grams (i.e., 2^64). • The 4-step procedure uses too much space • Training on 100M words requires • 10G of hard disk • 1-2G of RAM • Training on 1G words requires more • 100G of hard disk and 20G of RAM?
So, we still have issues when… • In ascending order of difficulty • What if the MT people asked us to run their LM in our recognizer (1M limit)? • What if we need to run decoding for 10 languages, each with 100k words? • What if we need to train on an N-word corpus (N = 1 billion) and there are N*N*N possible trigrams? • What if Prof. Jim Baker came back? • What if there were aliens?
Deliver Us From Evil Why wasn't this feature implemented in 2001?
An Important Observation • There is an implicit development deadlock among • Sphinx III • CMU-Cambridge LM Toolkit • SphinxTrain
General Pattern (part I) • The decoder's developers think • "Feature X is not implemented in the trainer" • "That is to say, there will be no use if we implement feature X" • => Give up feature X
General Pattern (part II) • The trainer's developers think • "Feature X is not implemented in the decoder" • "That is to say, there will be no use if we implement feature X" • => Give up feature X
Why wasn't Feature X implemented in the first place? • Possible reason 1: • In the past, someone analyzed some results and concluded that feature X is not useful • Possible reason 2: • Because of theoretical reasons Y and Z, someone concluded that feature X is not useful • Possible reason 3: • Past hardware limitations
In Reality… • Feature X could turn out to be very useful, • e.g. • More than 65k words in the LM • N-grams with N > 3 • Interpolation (instead of back-off) in N-gram models
Another Important Observation • Constantly giving up new features • Eventually means giving up the whole software's development • Look at CMU-Cambridge LM Toolkit V2
How should we deal with this problem? • 1. Know that this is a problem • (From anonymous self-help books.) • 2. We need a joint understanding of both the decoder and the trainer(s) • Question to ask: is it really correct to always develop the decoder first? • 3. New features of the trainer can always be tested in cheap ways • N-best and lattice rescoring • Then the deadlock will be broken on one side
A Unified View of Our Software • [Diagram: "The Suite" = CMU-Cambridge LM Toolkit + SphinxTrain + the Sphinx brothers ({2,3,4}, depending on where you live)]
Issue 1 • Q: "Do we have the right to change the LM Toolkit?" • A: "Yes. According to the license, if we open the source for research purposes, we may change and distribute the code. • Our changes are endorsed by Prof. Rosenfeld (CMU), Dr. Clarkson (Cambridge), and Prof. Robinson (Cambridge)"
Issue 2 • Q: "Do we have anything new in LM?" • A: "That depends on the brilliance of our students and staff. • Also, generally, on the brilliance of the public • They have the right to contribute • Actually, in the past 10 years, • a lot of new things were done at CMU in LM • Just no one collected them and put them together."
Issue 3 • Q: "Are you just getting yourself into a lot of trouble?" • A: "The troubles are always there; we just never face them."
Digression: Project L • News: some folks are working on the LM toolkit now! • Project code: L • Three key supporters • A young professor (or Prof. AB) • Hint: he is not exactly young • A young student (or DH) • A young staff member (or AC) • Gathering code from around the world • Thanks in advance to Professor Yannick from LIUM • Thanks to contributor AT
Conclusion • 32-bit data structures are now supported in both Sphinx III and the CMU-Cambridge LM Toolkit. • This brings up a lot of development issues • Maybe we should take the LM toolkit more seriously • Maintenance (a must) • New feature development (if we have time)
Preview of the Next 2 Talks • Project L • The Story of the Three Young Developers • Development Progress of Sphinx 3.X (from X=3 to X=6) • What is the big picture of Sphinx?