From Grammar to N-grams: Estimating N-grams From a Context-Free Grammar and Sparse Data. Thomas K Harris, May 16, 2002
Motivation • Recognizers typically use n-grams. • Systems are typically defined by CFGs. • Data collection is difficult. • Goal: To have a language model that benefits from the grammar and the priors of the parses.
Other Approaches • Ignore data, use a language model derived from the grammar alone. • Ignore grammar, use a language model derived from the data alone. • Interpolate between these two models.
PCFG Strategy • Train grammar with some data. • Smooth grammar. • Compute n-grams. [Diagram: the CFG and the data feed into a PCFG, from which the n-grams are computed.]
The Software • Work in progress - available at http://www.cs.cmu.edu/~tkharris/pcfg • Written in C++ • A library (API) consisting of a PCFG class and an n-gram class. • A program which uses the library to create n-grams from Phoenix grammars and data. • A make script to automate building and testing.
Procedure • Read Phoenix grammar file. • Convert to Chomsky Normal Form. • Read data and train grammar. • Smooth the grammar. • Compute n-grams from the smoothed PCFG.
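As a rough orientation, the procedure above might be driven end to end as in the sketch below. The class names, method names, and file names are hypothetical illustrations, not the actual interface of the pcfg library.

```cpp
// Hypothetical driver for the five-step procedure above.
// All identifiers and file names are illustrative assumptions,
// not the real API of the pcfg library.
#include "pcfg.h"
#include "ngram.h"

int main() {
    PCFG grammar;
    grammar.readPhoenix("movieline.gra");     // 1. read the Phoenix grammar file
    grammar.toCNF();                          // 2. convert to Chomsky Normal Form
    grammar.train("train.txt");               // 3. inside-outside (EM) training on data
    grammar.smooth(0.05);                     // 4. redistribute mass over unseen rules
    NGram bigram = grammar.computeNGrams(2);  // 5. compute n-grams from the smoothed PCFG
    bigram.write("movieline.2gram");
    return 0;
}
```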
Reading Phoenix Formats • Doesn’t handle the #include directive. • Doesn’t handle the +* (Kleene closure) marker. • The net/rewrite distinction is ignored. • + and * markers are rewritten as rules. • Conversion to CNF permanently mangles rules.
Chomsky Normal Form • Remove ε-transitions. • Remove unit productions. • Change all rules A->βaγ of length >1 to A->βNγ and N->a. • Recursively shorten all rules A->βBC of length >2 to A->βN and N->BC.
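A minimal C++ sketch of the last step (recursively shortening long rules), assuming a simple Rule struct and a made-up example rule; this illustrates the transformation only and is not the library's actual data structure.

```cpp
// Sketch of the rule-shortening step: A -> beta B C (length > 2)
// becomes A -> beta N and N -> B C, with N a fresh nonterminal.
#include <iostream>
#include <string>
#include <vector>

struct Rule {
    std::string lhs;               // left-hand side nonterminal
    std::vector<std::string> rhs;  // right-hand side symbols
};

std::vector<Rule> shorten(Rule r) {
    std::vector<Rule> out;
    int counter = 0;
    while (r.rhs.size() > 2) {
        std::string n = r.lhs + "_N" + std::to_string(counter++);
        std::string c = r.rhs.back(); r.rhs.pop_back();
        std::string b = r.rhs.back(); r.rhs.pop_back();
        out.push_back({n, {b, c}});  // N -> B C
        r.rhs.push_back(n);          // A -> beta N
    }
    out.push_back(r);
    return out;
}

int main() {
    // Made-up rule; terminals were already lifted to their own
    // nonterminals by the previous step.
    Rule r{"[Time]", {"[At]", "[Hour]", "[Minute]", "[Meridian]"}};
    for (const Rule& s : shorten(r)) {
        std::cout << s.lhs << " ->";
        for (const std::string& sym : s.rhs) std::cout << ' ' << sym;
        std::cout << '\n';
    }
    return 0;
}
```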
Training • Initialize rule probabilities. • For each sentence, • Use CYK chart parser to compute inside and outside probabilities. • Use those probabilities to determine the expected number of times the rule is used in the sentence. • Use the expectations to get a new set of rule probabilities. • Repeat until the corpus likelihood appears to asymptote.
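The slide does not spell the update out; the standard inside-outside (EM) re-estimation it refers to, for a sentence w1…wm with inside probabilities α and outside probabilities β from the CYK chart, is

$$
\hat{c}(A \to B\,C) = \frac{1}{P(w_1 \dots w_m)} \sum_{1 \le i \le k < j \le m} \beta_A(i,j)\, P(A \to B\,C)\, \alpha_B(i,k)\, \alpha_C(k{+}1,j)
$$

$$
P_{\text{new}}(A \to B\,C) = \frac{\hat{c}(A \to B\,C)}{\sum_{\gamma} \hat{c}(A \to \gamma)}
$$

where $\alpha_X(i,j)$ is the probability that $X$ derives $w_i \dots w_j$ and $\beta_X(i,j)$ is the probability of deriving $w_1 \dots w_{i-1}\, X\, w_{j+1} \dots w_m$ from $S$.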
Smoothing • A user-specified probability mass can be redistributed over unseen rules. • At the bottom of the tree this generalizes a class-based model. • This only smoothes the trained grammar over other grammatical sentences.
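The slides leave the exact redistribution scheme open; one simple reading (an assumption, with $\lambda$ as the user-specified mass) is to discount each nonterminal's trained rules and spread $\lambda$ uniformly over that nonterminal's unseen rules:

$$
P_{\text{smoothed}}(A \to \alpha) =
\begin{cases}
(1-\lambda)\, P_{\text{trained}}(A \to \alpha) & \text{if } A \to \alpha \text{ has nonzero trained probability},\\[4pt]
\lambda \,/\, |\text{unseen}(A)| & \text{otherwise.}
\end{cases}
$$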
Precise N-grams • Precise n-grams can be computed from a PCFG. • $P(w_n \mid w_1 \dots w_{n-1}) = E(w_1 \dots w_n \mid S) \,/\, E(w_1 \dots w_{n-1} \mid S)$, where $E(\cdot \mid S)$ is the expected number of times the word string occurs in a sentence derived from the start symbol $S$.
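For instance, in the bigram case this is just a ratio of expected substring counts under sentences generated by the PCFG:

$$
P(w_2 \mid w_1) = \frac{E(w_1 w_2 \mid S)}{E(w_1 \mid S)}.
$$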
Divide and Conquer • [Diagram: parse trees rooted at S, with subtrees headed by nonterminals A and B spanning parts of the word string w1…wn, illustrating how the expected counts are computed by breaking the problem down over the grammar.]
Data • USI MovieLine oracle transcripts • 2,000 sentences • Used only parsable sentences (85%) • Divided into 60% training, 40% test
Conclusions • Lower perplexities than pure-grammar method, comparable perplexities to pure-data method. • More flexible and cheaper than pure-data methods.
Future Directions • More smoothing work needs to be done. • Different smoothing over different classes. • Other smoothing methods? • Trigrams. • Testing for word error rate improvements. • Adapting to modified grammars.