
LING 696B: Maximum-Entropy and Random Fields


Presentation Transcript


  1. LING 696B: Maximum-Entropy and Random Fields

  2. Review: two worlds
  • Statistical models and OT seem to ask different questions about learning
  • UG: what is possible/impossible?
    • Hard-coded generalizations
    • Combinatorial optimization (sorting)
  • Statistical: among the things that are possible, what is likely/unlikely?
    • Soft-coded generalizations
    • Numerical optimization
  • Marriage of the two?

  3. Review: two worlds
  • OT: relate possible/impossible patterns in different languages through constraint reranking
  • Stochastic OT: consider a distribution over all possible grammars to generate variation
  • Today: model the frequency of input/output pairs (among the possible) directly, using a powerful model

  4. Maximum entropy and OT
  • Imaginary data:
  • Stochastic OT: let *[+voice] >> Ident(voice) and Ident(voice) >> *[+voice] 50% of the time each
  • Maximum Entropy (using positive weights):
    p([bab]|/bap/) = (1/Z) exp{-(2*w1)}
    p([pap]|/bap/) = (1/Z) exp{-(w2)}
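
A minimal sketch of this calculation in Python. Only the violation vectors come from the example above; the weight values for w1 and w2 are made up for illustration:

```python
import math

# Candidates for /bap/ with violation counts of (*[+voice], Ident(voice)).
violations = {"bab": (2, 0),   # [bab] violates *[+voice] twice
              "pap": (0, 1)}   # [pap] violates Ident(voice) once
w = (1.0, 0.5)                 # hypothetical constraint weights (w1, w2)

# Score each candidate as exp of the negative weighted violation sum, then normalize.
scores = {cand: math.exp(-sum(wk * vk for wk, vk in zip(w, v)))
          for cand, v in violations.items()}
Z = sum(scores.values())       # normalization constant
probs = {cand: s / Z for cand, s in scores.items()}
print(probs)                   # p([bab]|/bap/) and p([pap]|/bap/) sum to 1
```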

  5. Maximum entropy
  • Why have Z?
    • The model needs to be a conditional distribution: p([bab]|/bap/) + p([pap]|/bap/) = 1
    • So Z = exp{-(2*w1)} + exp{-(w2)} (the same for all candidates) -- called a normalization constant
    • Z can quickly become difficult to compute when the number of candidates is large
    • A very similar proposal appears in Smolensky (1986)
  • How to get w1, w2?
    • Learned from data (by calculating gradients)
    • Needed: frequency counts and violation vectors (same as stochastic OT)
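
A hedged sketch of the gradient-based learning mentioned above. The frequency counts, learning rate, and iteration count are made up (with the 50/50 imaginary data the zero initial weights already fit perfectly), and no positivity constraint is imposed on the weights in this simplification:

```python
import math

violations = {"bab": (2, 0), "pap": (0, 1)}
counts = {"bab": 25, "pap": 75}     # hypothetical frequency counts
w = [0.0, 0.0]                      # weights (w1, w2), initialized at zero
lr = 0.1

def model_probs(w):
    scores = {c: math.exp(-sum(wk * vk for wk, vk in zip(w, v)))
              for c, v in violations.items()}
    Z = sum(scores.values())
    return {c: s / Z for c, s in scores.items()}

N = sum(counts.values())
for _ in range(1000):
    p = model_probs(w)
    for k in range(len(w)):
        observed = sum(counts[c] * violations[c][k] for c in counts) / N
        expected = sum(p[c] * violations[c][k] for c in p)
        # Gradient of the average negative log-likelihood with respect to w[k]:
        w[k] -= lr * (observed - expected)
print(w, model_probs(w))
```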

  6. Maximum entropy
  • Why the exp{.}?
    • It is like taking the maximum, but “soft” -- easy to differentiate and optimize
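
A small illustration of that point, with made-up candidate scores; the "sharpness" scaling factor is not from the slides, it just shows the soft choice approaching the hard maximum:

```python
import math

scores = [-2.0, -1.0]              # e.g. negative weighted violations of two candidates
for sharpness in (1, 5, 20):
    exps = [math.exp(sharpness * s) for s in scores]
    Z = sum(exps)
    print(sharpness, [round(e / Z, 4) for e in exps])
# As the scores are scaled up, the distribution concentrates on the best candidate,
# but for any finite scaling it stays smooth and differentiable.
```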

  7. Maximum entropy and OT
  • Inputs are violation vectors: e.g. x = (2,0) and (0,1)
  • Outputs are one of K winners -- essentially a classification problem
  • Violating a constraint works against the candidate: prob ~ exp{-(x1*w1 + x2*w2)}
  • Crucial difference: candidates are ordered by a single score, not by lexicographic order

  8. Maximum entropy
  • Ordering discrete outputs from input vectors is a common problem
    • Also called Logistic Regression (recall Nearey)
  • Explaining the name:
    • Let P = p([bab]|/bap/); then log[P/(1-P)] = w2 - 2*w1
    • The left-hand side is the logistic (log-odds) transform; the right-hand side is a linear regression in the weights
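
A quick numerical check of this identity, reusing the made-up weights from the earlier sketch:

```python
import math

w1, w2 = 1.0, 0.5                  # hypothetical weights
P = math.exp(-2 * w1) / (math.exp(-2 * w1) + math.exp(-w2))   # P = p([bab]|/bap/)
print(math.log(P / (1 - P)))       # the log-odds (logistic transform) of P
print(w2 - 2 * w1)                 # the linear expression in the weights -- identical
```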

  9. The power of Maximum Entropy
  • Max-Ent/logistic regression is widely used in many areas with interacting, correlated inputs
    • Recall Nearey: phones, diphones, …
    • NLP: tagging, labeling, parsing … (anything with a discrete output)
  • Easy to learn: there is only a global maximum, and optimization is efficient
  • Isn't this the greatest thing in the world?
    • We need to understand the story behind the exp{} (in a few minutes)

  10. Demo: Spanish diminutives
  • Data from Arbisi-Kelm
  • Constraints: ALIGN(TE,Word,R), MAX-OO(V), DEP-IO and BaseTooLittle

  11. Stochastic OT and Max-Ent
  • Is a better fit always a good thing?

  12. Stochastic OT and Max-Ent
  • Is a better fit always a good thing?
  • Should model-fitting become a new fashion in phonology?

  13. The crucial difference
  • What are the possible distributions of p(.|/bap/) in this case?

  14. The crucial difference
  • What are the possible distributions of p(.|/bap/) in this case?
  • Max-Ent considers a much wider range of distributions

  15. What is Maximum Entropy anyway?
  • Jaynes (1957): the most ignorant state corresponds to the distribution with the most entropy
  • Given a die, which distribution has the largest entropy?

  16. What is Maximum Entropy anyway?
  • Jaynes (1957): the most ignorant state corresponds to the distribution with the most entropy
  • Given a die, which distribution has the largest entropy?
  • Add constraints to the distribution: the average of some feature functions is assumed to be fixed at its observed value, E_p[fk(x)] = (observed value of fk)

  17. What is Maximum Entropy anyway?
  • Examples of features: violations, word counts, N-grams, co-occurrences, …
  • The constraints change the shape of the maximum-entropy distribution
    • Solve a constrained optimization problem
    • This leads to p(x) ~ exp{Σk wk*fk(x)}
  • Very general (see later); many choices of fk
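
A sketch of the die example under such a constraint: with a single feature f(x) = x and a fixed average face value, the maximum-entropy distribution has exactly the exponential form above. The target average of 4.5 is a made-up "observed value", and the weight is found by a simple one-dimensional search:

```python
import math

faces = range(1, 7)
target = 4.5                       # hypothetical observed average face value

def mean_for(lam):
    weights = [math.exp(lam * f) for f in faces]
    Z = sum(weights)
    return sum(f * wt for f, wt in zip(faces, weights)) / Z

# Bisection on lambda until the model's expected face value matches the target.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if mean_for(mid) < target else (lo, mid)
lam = (lo + hi) / 2

Z = sum(math.exp(lam * f) for f in faces)
print([round(math.exp(lam * f) / Z, 4) for f in faces])   # p(x) ~ exp{lam * f(x)}
# With no constraint (lam = 0) this reduces to the uniform (maximum-entropy) distribution.
```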

  18. The basic intuition
  • Begin as “ignorant” as possible (with maximum entropy), so long as the chosen distribution matches certain “descriptions” of the empirical data (the statistics of fk(x))
  • Approximation property: any distribution can be approximated by a max-ent distribution with a sufficient number of features (Cramér and Wold)
  • Common practice in NLP
  • This is better seen as a “descriptive” model

  19. Going towards Markov random fields
  • Maximum entropy applied to conditional/joint distributions: p(y|x) or p(x,y) ~ exp{Σk wk*fk(x,y)}
  • There can be many creative ways of extracting features fk(x,y)
  • One way is to let a graph structure guide the calculation of features, e.g. neighborhoods/cliques
  • Known as a Markov network / Markov random field

  20. Conditional random field
  • Impose a chain-structured graph, and assign features to the edges:
    • m(yi, yi+1) on edges between adjacent outputs
    • f(xi, yi) on edges linking each input to its output
  • Still a max-ent model, same calculation
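
A toy sketch of such a chain model. The label set, the feature definitions, and their values are all made up for illustration; normalization is done by brute force over all label sequences, which is fine for a toy example (a real CRF would use the forward algorithm):

```python
import math
from itertools import product

LABELS = ["voiced", "voiceless"]

def m(y_prev, y_next):             # edge feature between adjacent outputs
    return 0.0 if y_prev == y_next else -1.0

def f(x, y):                       # edge feature linking an input to its output
    return 0.0 if x == y else -2.0

def score(xs, ys):
    s = sum(f(xi, yi) for xi, yi in zip(xs, ys))
    s += sum(m(ys[i], ys[i + 1]) for i in range(len(ys) - 1))
    return s

def prob(xs, ys):
    Z = sum(math.exp(score(xs, cand)) for cand in product(LABELS, repeat=len(xs)))
    return math.exp(score(xs, ys)) / Z

xs = ["voiced", "voiced", "voiceless"]
print(prob(xs, ("voiced", "voiced", "voiceless")))
print(prob(xs, ("voiceless", "voiceless", "voiceless")))
```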

  21. Wilson’s idea
  • Isn’t this a familiar picture in phonology?
    • m(yi, yi+1), between surface forms -- Markedness
    • f(xi, yi), between underlying and surface forms -- Faithfulness

  22. The story of smoothing
  • In Max-Ent models, the weights can get very large and “over-fit” the data (see demo)
  • It is common to penalize (smooth) this with a new objective function:
    new objective = old objective + parameter * magnitude of weights
  • Wilson’s claim: this smoothing parameter has to do with substantive bias in phonological learning
    • Constraints that force less similarity --> a higher penalty for their weights to change value
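
A minimal sketch of one common way to implement such smoothing, an L2 (Gaussian-prior) penalty; the per-constraint sigma values are made up, with a smaller sigma standing in for a stronger bias against moving that constraint's weight:

```python
def penalized_objective(neg_log_lik, weights, sigmas):
    # new objective = old objective + penalty on the magnitude of the weights
    penalty = sum(wk ** 2 / (2 * s ** 2) for wk, s in zip(weights, sigmas))
    return neg_log_lik + penalty

def penalty_gradient(weights, sigmas):
    # Extra gradient term: each weight is pulled toward zero, more strongly
    # when its sigma is small (a per-constraint bias, as in Wilson's proposal).
    return [wk / s ** 2 for wk, s in zip(weights, sigmas)]

print(penalized_objective(10.0, [0.4, -0.2], sigmas=[1.0, 0.3]))
print(penalty_gradient([0.4, -0.2], sigmas=[1.0, 0.3]))
```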

  23. Wilson’s model fitted to the velar palatalization data
