LING 696B: Maximum-Entropy and Random Fields
Review: two worlds
• Statistical model and OT seem to ask different questions about learning
• UG: what is possible/impossible?
• Hard-coded generalizations
• Combinatorial optimization (sorting)
• Statistical: among the things that are possible, what is likely/unlikely?
• Soft-coded generalizations
• Numerical optimization
• Marriage of the two?
Review: two worlds
• OT: relate possible/impossible patterns in different languages through constraint reranking
• Stochastic OT: consider a distribution over all possible grammars to generate variation
• Today: model frequency of input/output pairs (among the possible) directly using a powerful model
Maximum entropy and OT
• Imaginary data:
• Stochastic OT: let *[+voice] >> Ident(voice) and Ident(voice) >> *[+voice] each hold 50% of the time
• Maximum-Entropy (using positive weights):
p([bab]|/bap/) ~ (1/Z) exp{-(2*w1)}
p([pap]|/bap/) ~ (1/Z) exp{-(w2)}
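A minimal sketch of this calculation in Python, assuming nothing beyond the violation vectors used on these slides; the weight values for w1 (*[+voice]) and w2 (Ident(voice)) are made up for illustration, and Z is the normalization constant discussed on the next slide:

```python
import numpy as np

# Hypothetical constraint weights: w1 for *[+voice], w2 for Ident(voice).
# The values are illustrative, not fitted to any data.
w = np.array([1.0, 0.5])

# Violation vectors for the two candidates of /bap/ (as given on the slides):
# [bab] has two *[+voice] violations, [pap] has one Ident(voice) violation.
violations = {"bab": np.array([2, 0]),
              "pap": np.array([0, 1])}

# Unnormalized scores exp{-(violations . weights)}
scores = {cand: np.exp(-(v @ w)) for cand, v in violations.items()}
Z = sum(scores.values())                     # normalization constant

probs = {cand: s / Z for cand, s in scores.items()}
print(probs)                                 # conditional distribution p(. | /bap/)
```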
Maximum entropy
• Why have Z?
• We need a conditional distribution: p([bab]|/bap/) + p([pap]|/bap/) = 1
• So Z = exp{-(2*w1)} + exp{-(w2)} (the same for all candidates) -- called a normalization constant
• Z can quickly become difficult to compute when the number of candidates is large
• A very similar proposal appears in Smolensky, 86
• How to get w1, w2?
• Learned from data (by calculating gradients)
• Need: frequency counts and violation vectors (same as stochastic OT)
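A gradient-descent sketch for learning w1 and w2, assuming the imaginary data are an even 50/50 split between [bab] and [pap] (the split that the stochastic OT account above generates; the slide's own data table is not reproduced here). The starting weights, learning rate, and iteration count are arbitrary:

```python
import numpy as np

counts = np.array([50.0, 50.0])              # observed counts of [bab], [pap] (assumed)
V = np.array([[2, 0],                        # [bab]: two *[+voice] violations
              [0, 1]], dtype=float)          # [pap]: one Ident(voice) violation

w, lr = np.array([1.0, 0.2]), 0.05           # arbitrary starting point and step size
for _ in range(2000):
    scores = np.exp(-(V @ w))
    p = scores / scores.sum()                # model's conditional distribution
    # Gradient of the (average) negative log-likelihood:
    # observed violation counts minus model-expected violation counts.
    grad = (counts @ V - counts.sum() * (p @ V)) / counts.sum()
    w = np.clip(w - lr * grad, 0, None)      # keep weights positive, as the slide stipulates

print(w)   # any w with 2*w1 == w2 reproduces the 50/50 frequencies
```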
Maximum entropy
• Why do exp{.}?
• It is like taking the maximum, but "soft" -- easy to differentiate and optimize
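A quick numerical illustration of the "soft" maximum: as the scores are scaled up, the exp-normalized distribution approaches a hard argmax over candidates (the numbers are arbitrary):

```python
import numpy as np

harmonies = np.array([-2.0, -0.5])           # -(violations . weights) for two candidates

for scale in [1, 5, 50]:
    p = np.exp(scale * harmonies)
    p /= p.sum()
    print(scale, p)   # larger scale: nearly all mass on the better-scoring candidate
```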
Maximum entropy and OT
• Inputs are violation vectors: e.g. x = (2,0) and (0,1)
• Outputs are one of K winners -- essentially a classification problem
• Violating a constraint works against the candidate (prob ~ exp{-(x1*w1 + x2*w2)})
• Crucial difference: candidates are ordered by a single score, not by lexicographic order
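The contrast between lexicographic ordering and a single weighted score can be made concrete. The sketch below uses made-up violation profiles and weights: under the strict ranking A >> B, the candidate with no A-violations wins, but under a single weighted score the three B-violations can "gang up":

```python
import numpy as np

# Hypothetical violation profiles on constraints (A, B), with A ranked above B.
X = np.array([1, 0])      # one violation of the higher-ranked A
Y = np.array([0, 3])      # three violations of the lower-ranked B

def ot_winner(a, b):
    # Lexicographic comparison, constraints listed in ranking order.
    for va, vb in zip(a, b):
        if va != vb:
            return "X" if va < vb else "Y"
    return "tie"

w = np.array([2.0, 1.0])                 # illustrative weights respecting the ranking
harmony = lambda v: -(v @ w)             # single score, as in Max-Ent

print("OT winner:      ", ot_winner(X, Y))                          # Y
print("Weighted winner:", "X" if harmony(X) > harmony(Y) else "Y")  # X
```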
Maximum entropy
• Ordering discrete outputs from input vectors is a common problem
• Also called Logistic Regression (recall Nearey)
• Explaining the name: let P = p([bab]|/bap/); then log[P/(1-P)] = w2 - 2*w1, i.e. the logistic transform of P is a linear regression on the weights
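A quick numerical check of that identity (the weight values are arbitrary):

```python
import numpy as np

w1, w2 = 1.3, 0.4                                        # illustrative weights
P = np.exp(-2 * w1) / (np.exp(-2 * w1) + np.exp(-w2))    # p([bab] | /bap/)

print(np.log(P / (1 - P)))   # the log-odds (logistic transform) ...
print(w2 - 2 * w1)           # ... equals the linear predictor w2 - 2*w1
```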
The power of Maximum Entropy
• Max-Ent/logistic regression is widely used in many areas with interacting, correlated inputs
• Recall Nearey: phones, diphones, …
• NLP: tagging, labeling, parsing … (anything with a discrete output)
• Easy to learn: there is only a global maximum, so optimization is efficient
• Isn't this the greatest thing in the world?
• We need to understand the story behind the exp{} (in a few minutes)
Demo: Spanish diminutives
• Data from Arbisi-Kelm
• Constraints: ALIGN(TE,Word,R), MAX-OO(V), DEP-IO and BaseTooLittle
Stochastic OT and Max-Ent
• Is better fit always a good thing?
• Should model-fitting become a new fashion in phonology?
The crucial difference
• What are the possible distributions of p(.|/bap/) in this case?
• Max-Ent considers a much wider range of distributions
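One standard way to see the gap (not necessarily the exact configuration on the slide) involves a harmonically bounded candidate: no ranking can ever make it win, so stochastic OT assigns it probability 0, while Max-Ent gives it some probability mass. The sketch below simplifies stochastic OT to a uniform choice over total rankings:

```python
import numpy as np
from itertools import permutations

# Three candidates, two constraints; c3's violations are a superset of c1's,
# so c3 is harmonically bounded and can never win under any ranking.
V = np.array([[1, 0],    # c1
              [0, 1],    # c2
              [2, 1]])   # c3 (harmonically bounded)

# "Stochastic OT" here: average the winner over all total rankings, equally weighted.
wins = np.zeros(3)
for ranking in permutations(range(2)):
    order = sorted(range(3), key=lambda c: tuple(V[c, k] for k in ranking))
    wins[order[0]] += 1
p_ot = wins / wins.sum()

# Max-Ent with illustrative weights: every candidate gets nonzero probability.
w = np.array([1.0, 1.0])
scores = np.exp(-(V @ w))
p_me = scores / scores.sum()

print("Stochastic OT:", p_ot)   # c3: probability 0 under every ranking
print("Max-Ent:      ", p_me)   # c3: small but nonzero probability
```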
What is Maximum Entropy anyway?
• Jaynes, 57: the most ignorant state corresponds to the distribution with the most entropy
• Given a die, which distribution has the largest entropy?
• Add constraints to the distribution: the average of some feature functions is required to equal its observed value
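For the die, the uniform distribution maximizes entropy; a short check:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

uniform = np.full(6, 1 / 6)
skewed  = np.array([0.5, 0.3, 0.1, 0.05, 0.03, 0.02])   # any non-uniform die

print(entropy(uniform))   # log(6), the maximum possible for six faces
print(entropy(skewed))    # strictly smaller
```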
What is Maximum Entropy anyway?
• Examples of features: violations, word counts, N-grams, co-occurrences, …
• The constraints change the shape of the maximum-entropy distribution
• Solve the constrained optimization problem
• This leads to p(x) ~ exp{Σk wk*fk(x)}
• Very general (see later), with many choices of fk
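A worked example of how such a constraint shapes the max-ent solution, sketched for the die with a single feature f(x) = x whose average is required to be 4.5: the answer has the exponential form above, and the weight can be found numerically.

```python
import numpy as np

faces = np.arange(1, 7)
target_mean = 4.5                            # the constrained (observed) average of f(x) = x

def mean_under(w):
    p = np.exp(w * faces)                    # max-ent form p(x) ~ exp{w * f(x)}
    p /= p.sum()
    return p @ faces

# The mean is monotone increasing in w, so simple bisection finds the right weight.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_under(mid) < target_mean:
        lo = mid
    else:
        hi = mid

w = (lo + hi) / 2
p = np.exp(w * faces); p /= p.sum()
print(w, p, p @ faces)                       # an exponential-family die with mean 4.5
```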
The basic intuition
• Stay as "ignorant" as possible (maximum entropy), as long as the chosen distribution matches certain "descriptions" of the empirical data (the statistics of fk(x))
• Approximation property: any distribution can be approximated by a max-ent distribution with a sufficient number of features (Cramér and Wold)
• Common practice in NLP
• This is better seen as a "descriptive" model
Going towards Markov random fields
• Maximum entropy applied to conditional/joint distributions: p(y|x) or p(x,y) ~ exp{Σk wk*fk(x,y)}
• There can be many creative ways of extracting features fk(x,y)
• One way is to let a graph structure guide the calculation of features, e.g. by neighborhood/clique
• Known as a Markov network / Markov random field
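A tiny sketch of a joint Markov random field in that spirit, using a three-node chain graph A - B - C and one made-up "agreement" feature per edge (brute-force normalization, so only feasible for toy sizes):

```python
import numpy as np
from itertools import product

edges = [(0, 1), (1, 2)]          # the graph's edges (cliques of size 2)
w_agree = 1.2                     # illustrative weight on the feature 1[x_i == x_j]

def unnorm(x):
    feats = sum(1.0 for i, j in edges if x[i] == x[j])
    return np.exp(w_agree * feats)

states = list(product([0, 1], repeat=3))
Z = sum(unnorm(x) for x in states)            # brute-force normalization constant
p = {x: unnorm(x) / Z for x in states}

print(p[(0, 0, 0)], p[(0, 1, 0)])   # configurations whose neighbors agree get more mass
```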
Conditional random field
• Impose a chain-structured graph and assign features to its edges: m(yi, yi+1) between adjacent outputs and f(xi, yi) linking each input to its output
• Still a max-ent model, same calculation
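A minimal linear-chain CRF sketch along these lines, with made-up node weights for f(xi, yi) and edge weights for m(yi, yi+1), and brute-force normalization over all label sequences (fine for a toy chain; real implementations use the forward algorithm):

```python
import numpy as np
from itertools import product

# Binary inputs x_i and outputs y_i; the weight tables are illustrative, not fitted.
node_w = np.array([[0.2, 1.0],    # weight of f(x_i=0, y_i=0/1)
                   [1.5, 0.1]])   # weight of f(x_i=1, y_i=0/1)
edge_w = np.array([[0.8, -0.3],   # weight of m(y_i=0, y_{i+1}=0/1)
                   [-0.3, 0.8]])  # adjacent labels that agree are rewarded

def score(x, y):
    s = sum(node_w[xi, yi] for xi, yi in zip(x, y))        # f(x_i, y_i) terms
    s += sum(edge_w[yi, yj] for yi, yj in zip(y, y[1:]))   # m(y_i, y_{i+1}) terms
    return s

def p_y_given_x(x, y):
    Z = sum(np.exp(score(x, yy)) for yy in product([0, 1], repeat=len(x)))
    return np.exp(score(x, y)) / Z

x = (1, 0, 0, 1)
print(p_y_given_x(x, (0, 1, 1, 0)))   # probability of one label sequence given x
```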
Wilson’s idea
• Isn’t this a familiar picture in phonology?
• m(yi, yi+1) -- Markedness, relating surface forms
• f(xi, yi) -- Faithfulness, relating underlying form to surface form
The story of smoothing
• In Max-Ent models, the weights can get very large and "over-fit" the data (see demo)
• It is common to penalize (smooth) this with a new objective function: new objective = old objective + parameter * magnitude of weights
• Wilson's claim: this smoothing parameter has to do with substantive bias in phonological learning
• Constraints that force less similarity --> a higher penalty for their weights to change value
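A sketch of the earlier gradient loop with an L2 (Gaussian-prior) smoothing term added. Giving each constraint its own smoothing parameter is one way to express a Wilson-style bias: a larger value makes that constraint's weight resist moving. The 80/20 counts are made up so the effect is visible (the 50/50 data from the earlier slide are already fit by zero weights, leaving smoothing nothing to do):

```python
import numpy as np

counts = np.array([80.0, 20.0])               # made-up counts of [bab], [pap]
V = np.array([[2, 0], [0, 1]], dtype=float)   # violation vectors as before
lam = np.array([0.1, 1.0])                    # Ident(voice)'s weight is smoothed much harder

w, lr = np.array([1.0, 1.0]), 0.05
for _ in range(5000):
    p = np.exp(-(V @ w)); p /= p.sum()
    grad_nll = (counts @ V - counts.sum() * (p @ V)) / counts.sum()
    grad = grad_nll + 2 * lam * w             # new objective = old objective + lam * |w|^2
    w = np.clip(w - lr * grad, 0, None)

p = np.exp(-(V @ w)); p /= p.sum()
print(w, p)   # p([bab]) falls short of the observed 0.8: smoothing resists over-fitting
```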