Using CTW as a language modeler in Dasher
Phil Cowans, Martijn van Veen
25-04-2007
Inference Group, Department of Physics, University of Cambridge
Language Modelling
• Goal is to produce a generative model over strings
• Typically sequential predictions, one symbol at a time (see the formulas below)
• Finite context models: condition only on the most recent symbols
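The last two bullets end where the slide's formulas presumably were; a plausible reconstruction of the standard definitions (notation mine, not necessarily the slide's):

    P(x_1, \ldots, x_N) = \prod_{i=1}^{N} P(x_i \mid x_1, \ldots, x_{i-1})

    P(x_i \mid x_1, \ldots, x_{i-1}) \approx P(x_i \mid x_{i-D}, \ldots, x_{i-1}) \quad \text{(order-}D\text{ finite context model)}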
Dasher: Language Model
• Conditional probability for each alphabet symbol, given the previous symbols
• Similar to compression methods
• Requirements:
  • Sequential
  • Fast
  • Adaptive
  • Model is trained
• Better compression -> faster text input
Basic Language Model
• Independent distributions for each context
• Use a Dirichlet prior (see the sketch below)
• Makes poor use of data: we intuitively expect similarities between similar contexts
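A sketch of the per-context estimate under a symmetric Dirichlet prior with parameter \alpha (notation mine: c_s(x) is the count of symbol x in context s, N_s the total count in that context, |\mathcal{A}| the alphabet size):

    P(x \mid s) = \frac{c_s(x) + \alpha}{N_s + \alpha\,|\mathcal{A}|}

Because each context is estimated from its own counts alone, rarely seen contexts fall back to the near-uniform prior, which is what "makes poor use of data" refers to.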
Prediction By Partial Match
• Associate a generative distribution with each leaf in the context tree
• Share information between nodes using a hierarchical Dirichlet (or Pitman-Yor) prior
• In practice, use a fast but generally good approximation (a sketch follows)
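A minimal Python sketch of PPM-style blending across context lengths. The escape form with parameters alpha and beta below is an assumption (the slide does not spell out the approximation); the parameter values are the ones listed under Experimental Parameters.

    def ppm_probability(symbol, contexts, alphabet_size, alpha=0.49, beta=0.77):
        # `contexts` holds one dict of symbol counts per context length,
        # ordered from the empty context (order 0) up to the longest match.
        p = 1.0 / alphabet_size              # base case: uniform over the alphabet
        for counts in contexts:
            total = sum(counts.values())
            if total == 0:
                continue                     # nothing observed here; keep backed-off p
            unique = len(counts)             # number of distinct symbols seen
            escape = (alpha + beta * unique) / (total + alpha)
            own = max(counts.get(symbol, 0) - beta, 0.0) / (total + alpha)
            p = own + escape * p             # interpolate with the shorter-context estimate
        return p

At each context length the probability mass not claimed by observed symbols (the escape mass) is passed to the shorter-context estimate, so the result still sums to one over the alphabet.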
Context Tree Weighting
• Combine nodes in the context tree
• Tree structure treated as a random variable
• Contexts associated with each leaf have the same generative distribution
• Contexts associated with different leaves are independent
• Dirichlet prior on generative distributions
CTW: Tree Model
• Source structure is captured by the tree model; the parameters at each leaf are memoryless
Recursive Definition
• At each node, either the children share one distribution, or the children are distributed independently
• Weighting these two cases at every node gives the recursive CTW mixture (see below)
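In standard CTW notation the recursion being illustrated is (P_e^s is the memoryless Dirichlet/KT estimate from the counts at node s; classical CTW fixes the weight at 1/2, while the experiments later use a tunable weight w):

    P_w^s = \begin{cases}
        P_e^s & \text{if } s \text{ is a leaf at maximum depth} \\
        w\, P_e^s + (1 - w) \prod_{c \in \mathrm{children}(s)} P_w^c & \text{otherwise}
    \end{cases}

The first term corresponds to "children share one distribution", the product term to "children distributed independently".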
Observations So Far
• No clear overall winner without modification
• PPM does better with small alphabets?
• PPM initially learns faster?
• CTW is more forgiving with redundant symbols?
CTW for Text
Properties of text-generating sources:
• Large alphabet, but in any given context only a small subset is used
  • Wastes code space: many probabilities that should be zero
  • Solution: adjust the zero-order estimator to decrease the probability of unlikely events (see below), and use a binary decomposition
• Only locally stationary
  • Solution: limit the counts to increase adaptivity (Bell, Cleary, Witten 1989)
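For the binary decomposition, the adjusted zero-order estimator presumably takes the form below, with a and b the zero/one counts at a node and \alpha < 1/2 shrinking the probability of rarely seen events (the Experimental Parameters slide lists \alpha = 1/128; the exact estimator is an assumption):

    P(\text{next bit} = 1 \mid a, b) = \frac{b + \alpha}{a + b + 2\alpha}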
Binary Decomposition
• Decomposition tree: each alphabet symbol is coded as a sequence of binary decisions, one per edge on its path through the tree (a code sketch follows)
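A minimal Python sketch of predicting one alphabet symbol as a chain of binary decisions. The decomposition tree here is simply the bit pattern of the symbol index, and the BinaryModel class is illustrative, not the actual Dasher/CTW implementation:

    class BinaryModel:
        """One adaptive binary predictor per node of the decomposition tree."""
        def __init__(self, alpha=1.0 / 128):
            self.counts = [0, 0]
            self.alpha = alpha

        def prob(self, bit):
            total = self.counts[0] + self.counts[1] + 2 * self.alpha
            return (self.counts[bit] + self.alpha) / total

        def update(self, bit):
            self.counts[bit] += 1

    def symbol_probability(symbol, bits_per_symbol, models):
        """Multiply the probabilities of the binary decisions along the path."""
        p = 1.0
        path = ()
        for i in reversed(range(bits_per_symbol)):
            bit = (symbol >> i) & 1
            model = models.setdefault(path, BinaryModel())
            p *= model.prob(bit)     # probability of taking this branch
            model.update(bit)        # adapt the node to the observed bit
            path = path + (bit,)
        return p

With this decomposition, a large alphabet becomes a set of small binary estimation problems, which avoids spreading code space over the many symbols that never occur in a given context.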
Binary Decomposition
• Results found by Aberg and Shtarkov
• All tests with the full ASCII alphabet
Count Halving
• If one count reaches a maximum, divide both counts by 2 (sketched below)
• This forgets older input data and increases adaptivity
• In Dasher, user input is predicted with a model based on a training text, so adaptivity is even more important
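A sketch of count halving for a binary node; MAX_COUNT is an assumed limit, the slides do not give the actual value:

    MAX_COUNT = 255                      # assumed limit, not taken from the slides

    def update_with_halving(counts, bit):
        """Record `bit`; when either count reaches the limit, halve both.

        Halving discards half of the older evidence, so recent text dominates
        the estimate and the model adapts faster."""
        counts[bit] += 1
        if counts[bit] >= MAX_COUNT:
            counts[0] = (counts[0] + 1) // 2
            counts[1] = (counts[1] + 1) // 2
        return counts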
Combining PPM and CTW
• Select the locally best model, or weight the models together (one option is sketched below)
• More alpha parameters for PPM, learned from data
• PPM-like sharing, with a prior over context trees, as in CTW
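The slide leaves open how the two models would be weighted together; one natural option is a Bayesian mixture whose weights track how well each model has predicted the text so far. The sketch below is entirely an assumption, not something taken from the slides:

    import math

    class TwoModelMixture:
        """Mix PPM and CTW predictions, weighting each model by the
        probability it has assigned to the text seen so far."""
        def __init__(self):
            self.log_evidence = {"ppm": 0.0, "ctw": 0.0}

        def predict(self, p_ppm, p_ctw):
            m = max(self.log_evidence.values())
            w_ppm = math.exp(self.log_evidence["ppm"] - m)
            w_ctw = math.exp(self.log_evidence["ctw"] - m)
            return (w_ppm * p_ppm + w_ctw * p_ctw) / (w_ppm + w_ctw)

        def observe(self, p_ppm_of_symbol, p_ctw_of_symbol):
            # update each model's evidence with the probability it gave
            # to the symbol the user actually entered
            self.log_evidence["ppm"] += math.log(p_ppm_of_symbol)
            self.log_evidence["ctw"] += math.log(p_ctw_of_symbol)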
Conclusions
• PPM and CTW have different strengths, so it makes sense to try combining them
• Decomposition and count scaling may give clues for improving PPM
• Look at performance on out-of-domain text in more detail
Experimental Parameters
• Context depth: 5
• Smoothing: 5%
• PPM – alpha: 0.49, beta: 0.77
• CTW – w: 0.05, alpha: 1/128
Comparing Language Models
• PPM
  • Quickly learns repeating strings
• CTW
  • Works on a set of all possible tree models
  • Not sensitive to the parameter D, the maximum model depth
  • Easy to increase adaptivity
  • The weight factor (escape probability) is strictly defined