LING 696B: Maximum-Entropy and Random Fields
Review: two worlds
• Statistical model and OT seem to ask different questions about learning
• UG: what is possible/impossible?
• Hard-coded generalizations
• Combinatorial optimization (sorting)
• Statistical: among the things that are possible, what is likely/unlikely?
• Soft-coded generalizations
• Numerical optimization
• Marriage of the two?
Review: two worlds
• OT: relate possible/impossible patterns in different languages through constraint reranking
• Stochastic OT: consider a distribution over all possible grammars to generate variation
• Today: model frequency of input/output pairs (among the possible) directly using a powerful model
Maximum entropy and OT
• Imaginary data:
• Stochastic OT: let *[+voice] >> Ident(voice) and Ident(voice) >> *[+voice] each hold 50% of the time
• Maximum-Entropy (using positive weights):
p([bab]|/bap/) ~ (1/Z) exp{-(2*w1)}
p([pap]|/bap/) ~ (1/Z) exp{-(w2)}
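A minimal sketch of this calculation in Python, assuming nothing beyond the violation vectors used on these slides; the weight values for w1 (*[+voice]) and w2 (Ident(voice)) are made up for illustration, and Z is the normalization constant discussed on the next slide:

```python
import numpy as np

# Hypothetical constraint weights: w1 for *[+voice], w2 for Ident(voice).
# The values are illustrative, not fitted to any data.
w = np.array([1.0, 0.5])

# Violation vectors for the two candidates of /bap/ (as given on the slides):
# [bab] has two *[+voice] violations, [pap] has one Ident(voice) violation.
violations = {"bab": np.array([2, 0]),
              "pap": np.array([0, 1])}

# Unnormalized scores exp{-(violations . weights)}
scores = {cand: np.exp(-(v @ w)) for cand, v in violations.items()}
Z = sum(scores.values())                     # normalization constant

probs = {cand: s / Z for cand, s in scores.items()}
print(probs)                                 # conditional distribution p(. | /bap/)
```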
Maximum entropy
• Why have Z?
• We need a conditional distribution: p([bab]|/bap/) + p([pap]|/bap/) = 1
• So Z = exp{-(2*w1)} + exp{-(w2)} (the same for all candidates) -- called a normalization constant
• Z can quickly become difficult to compute when the number of candidates is large
• A very similar proposal appears in Smolensky, 86
• How to get w1, w2?
• Learned from data (by calculating gradients)
• Need: frequency counts and violation vectors (same as stochastic OT)
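A gradient-descent sketch for learning w1 and w2, assuming the imaginary data are an even 50/50 split between [bab] and [pap] (the split that the stochastic OT account above generates; the slide's own data table is not reproduced here). The starting weights, learning rate, and iteration count are arbitrary:

```python
import numpy as np

counts = np.array([50.0, 50.0])              # observed counts of [bab], [pap] (assumed)
V = np.array([[2, 0],                        # [bab]: two *[+voice] violations
              [0, 1]], dtype=float)          # [pap]: one Ident(voice) violation

w, lr = np.array([1.0, 0.2]), 0.05           # arbitrary starting point and step size
for _ in range(2000):
    scores = np.exp(-(V @ w))
    p = scores / scores.sum()                # model's conditional distribution
    # Gradient of the (average) negative log-likelihood:
    # observed violation counts minus model-expected violation counts.
    grad = (counts @ V - counts.sum() * (p @ V)) / counts.sum()
    w = np.clip(w - lr * grad, 0, None)      # keep weights positive, as the slide stipulates

print(w)   # any w with 2*w1 == w2 reproduces the 50/50 frequencies
```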
Maximum entropy
• Why do exp{.}?
• It is like taking the maximum, but "soft" -- easy to differentiate and optimize
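A quick numerical illustration of the "soft" maximum: as the scores are scaled up, the exp-normalized distribution approaches a hard argmax over candidates (the numbers are arbitrary):

```python
import numpy as np

harmonies = np.array([-2.0, -0.5])           # -(violations . weights) for two candidates

for scale in [1, 5, 50]:
    p = np.exp(scale * harmonies)
    p /= p.sum()
    print(scale, p)   # larger scale: nearly all mass on the better-scoring candidate
```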
Maximum entropy and OT
• Inputs are violation vectors: e.g. x = (2,0) and (0,1)
• Outputs are one of K winners -- essentially a classification problem
• Violating a constraint works against the candidate (prob ~ exp{-(x1*w1 + x2*w2)})
• Crucial difference: candidates are ordered by a single score, not by lexicographic order
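The contrast between lexicographic ordering and a single weighted score can be made concrete. The sketch below uses made-up violation profiles and weights: under the strict ranking A >> B, the candidate with no A-violations wins, but under a single weighted score the three B-violations can "gang up":

```python
import numpy as np

# Hypothetical violation profiles on constraints (A, B), with A ranked above B.
X = np.array([1, 0])      # one violation of the higher-ranked A
Y = np.array([0, 3])      # three violations of the lower-ranked B

def ot_winner(a, b):
    # Lexicographic comparison, constraints listed in ranking order.
    for va, vb in zip(a, b):
        if va != vb:
            return "X" if va < vb else "Y"
    return "tie"

w = np.array([2.0, 1.0])                 # illustrative weights respecting the ranking
harmony = lambda v: -(v @ w)             # single score, as in Max-Ent

print("OT winner:      ", ot_winner(X, Y))                          # Y
print("Weighted winner:", "X" if harmony(X) > harmony(Y) else "Y")  # X
```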
Maximum entropy
• Ordering discrete outputs from input vectors is a common problem
• Also called Logistic Regression (recall Nearey)
• Explaining the name: let P = p([bab]|/bap/); then log[P/(1-P)] = w2 - 2*w1, i.e. the logistic transform of P is a linear regression on the weights
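A quick numerical check of that identity (the weight values are arbitrary):

```python
import numpy as np

w1, w2 = 1.3, 0.4                                        # illustrative weights
P = np.exp(-2 * w1) / (np.exp(-2 * w1) + np.exp(-w2))    # p([bab] | /bap/)

print(np.log(P / (1 - P)))   # the log-odds (logistic transform) ...
print(w2 - 2 * w1)           # ... equals the linear predictor w2 - 2*w1
```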
The power of Maximum Entropy
• Max-Ent/logistic regression is widely used in many areas with interacting, correlated inputs
• Recall Nearey: phones, diphones, …
• NLP: tagging, labeling, parsing … (anything with a discrete output)
• Easy to learn: there is only a global maximum, so optimization is efficient
• Isn't this the greatest thing in the world?
• We need to understand the story behind the exp{} (in a few minutes)
Demo: Spanish diminutives
• Data from Arbisi-Kelm
• Constraints: ALIGN(TE,Word,R), MAX-OO(V), DEP-IO and BaseTooLittle
Stochastic OT and Max-Ent
• Is better fit always a good thing?
• Should model-fitting become a new fashion in phonology?
The crucial difference
• What are the possible distributions of p(.|/bap/) in this case?
• Max-Ent considers a much wider range of distributions
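One standard way to see the gap (not necessarily the exact configuration on the slide) involves a harmonically bounded candidate: no ranking can ever make it win, so stochastic OT assigns it probability 0, while Max-Ent gives it some probability mass. The sketch below simplifies stochastic OT to a uniform choice over total rankings:

```python
import numpy as np
from itertools import permutations

# Three candidates, two constraints; c3's violations are a superset of c1's,
# so c3 is harmonically bounded and can never win under any ranking.
V = np.array([[1, 0],    # c1
              [0, 1],    # c2
              [2, 1]])   # c3 (harmonically bounded)

# "Stochastic OT" here: average the winner over all total rankings, equally weighted.
wins = np.zeros(3)
for ranking in permutations(range(2)):
    order = sorted(range(3), key=lambda c: tuple(V[c, k] for k in ranking))
    wins[order[0]] += 1
p_ot = wins / wins.sum()

# Max-Ent with illustrative weights: every candidate gets nonzero probability.
w = np.array([1.0, 1.0])
scores = np.exp(-(V @ w))
p_me = scores / scores.sum()

print("Stochastic OT:", p_ot)   # c3: probability 0 under every ranking
print("Max-Ent:      ", p_me)   # c3: small but nonzero probability
```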
What is Maximum Entropy anyway?
• Jaynes, 57: the most ignorant state corresponds to the distribution with the most entropy
• Given a die, which distribution has the largest entropy?
• Add constraints to the distribution: the average of some feature functions is required to equal its observed value
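For the die, the uniform distribution maximizes entropy; a short check:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

uniform = np.full(6, 1 / 6)
skewed  = np.array([0.5, 0.3, 0.1, 0.05, 0.03, 0.02])   # any non-uniform die

print(entropy(uniform))   # log(6), the maximum possible for six faces
print(entropy(skewed))    # strictly smaller
```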
What is Maximum Entropy anyway?
• Examples of features: violations, word counts, N-grams, co-occurrences, …
• The constraints change the shape of the maximum-entropy distribution
• Solve the constrained optimization problem
• This leads to p(x) ~ exp{Σk wk*fk(x)}
• Very general (see later), with many choices of fk
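A worked example of how such a constraint shapes the max-ent solution, sketched for the die with a single feature f(x) = x whose average is required to be 4.5: the answer has the exponential form above, and the weight can be found numerically.

```python
import numpy as np

faces = np.arange(1, 7)
target_mean = 4.5                            # the constrained (observed) average of f(x) = x

def mean_under(w):
    p = np.exp(w * faces)                    # max-ent form p(x) ~ exp{w * f(x)}
    p /= p.sum()
    return p @ faces

# The mean is monotone increasing in w, so simple bisection finds the right weight.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_under(mid) < target_mean:
        lo = mid
    else:
        hi = mid

w = (lo + hi) / 2
p = np.exp(w * faces); p /= p.sum()
print(w, p, p @ faces)                       # an exponential-family die with mean 4.5
```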
The basic intuition
• Stay as "ignorant" as possible (maximum entropy), as long as the chosen distribution matches certain "descriptions" of the empirical data (the statistics of fk(x))
• Approximation property: any distribution can be approximated by a max-ent distribution with a sufficient number of features (Cramér and Wold)
• Common practice in NLP
• This is better seen as a "descriptive" model
Going towards Markov random fields
• Maximum entropy applied to conditional/joint distributions: p(y|x) or p(x,y) ~ exp{Σk wk*fk(x,y)}
• There can be many creative ways of extracting features fk(x,y)
• One way is to let a graph structure guide the calculation of features, e.g. by neighborhood/clique
• Known as a Markov network / Markov random field
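A tiny sketch of a joint Markov random field in that spirit, using a three-node chain graph A - B - C and one made-up "agreement" feature per edge (brute-force normalization, so only feasible for toy sizes):

```python
import numpy as np
from itertools import product

edges = [(0, 1), (1, 2)]          # the graph's edges (cliques of size 2)
w_agree = 1.2                     # illustrative weight on the feature 1[x_i == x_j]

def unnorm(x):
    feats = sum(1.0 for i, j in edges if x[i] == x[j])
    return np.exp(w_agree * feats)

states = list(product([0, 1], repeat=3))
Z = sum(unnorm(x) for x in states)            # brute-force normalization constant
p = {x: unnorm(x) / Z for x in states}

print(p[(0, 0, 0)], p[(0, 1, 0)])   # configurations whose neighbors agree get more mass
```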
Conditional random field
• Impose a chain-structured graph and assign features to its edges: m(yi, yi+1) between adjacent outputs and f(xi, yi) linking each input to its output
• Still a max-ent model, same calculation
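A minimal linear-chain CRF sketch along these lines, with made-up node weights for f(xi, yi) and edge weights for m(yi, yi+1), and brute-force normalization over all label sequences (fine for a toy chain; real implementations use the forward algorithm):

```python
import numpy as np
from itertools import product

# Binary inputs x_i and outputs y_i; the weight tables are illustrative, not fitted.
node_w = np.array([[0.2, 1.0],    # weight of f(x_i=0, y_i=0/1)
                   [1.5, 0.1]])   # weight of f(x_i=1, y_i=0/1)
edge_w = np.array([[0.8, -0.3],   # weight of m(y_i=0, y_{i+1}=0/1)
                   [-0.3, 0.8]])  # adjacent labels that agree are rewarded

def score(x, y):
    s = sum(node_w[xi, yi] for xi, yi in zip(x, y))        # f(x_i, y_i) terms
    s += sum(edge_w[yi, yj] for yi, yj in zip(y, y[1:]))   # m(y_i, y_{i+1}) terms
    return s

def p_y_given_x(x, y):
    Z = sum(np.exp(score(x, yy)) for yy in product([0, 1], repeat=len(x)))
    return np.exp(score(x, y)) / Z

x = (1, 0, 0, 1)
print(p_y_given_x(x, (0, 1, 1, 0)))   # probability of one label sequence given x
```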
Wilson’s idea
• Isn’t this a familiar picture in phonology?
• m(yi, yi+1) -- Markedness, relating surface forms
• f(xi, yi) -- Faithfulness, relating underlying form to surface form
The story of smoothing
• In Max-Ent models, the weights can get very large and "over-fit" the data (see demo)
• It is common to penalize (smooth) this with a new objective function: new objective = old objective + parameter * magnitude of weights
• Wilson's claim: this smoothing parameter has to do with substantive bias in phonological learning
• Constraints that force less similarity --> a higher penalty for their weights to change value
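A sketch of the earlier gradient loop with an L2 (Gaussian-prior) smoothing term added. Giving each constraint its own smoothing parameter is one way to express a Wilson-style bias: a larger value makes that constraint's weight resist moving. The 80/20 counts are made up so the effect is visible (the 50/50 data from the earlier slide are already fit by zero weights, leaving smoothing nothing to do):

```python
import numpy as np

counts = np.array([80.0, 20.0])               # made-up counts of [bab], [pap]
V = np.array([[2, 0], [0, 1]], dtype=float)   # violation vectors as before
lam = np.array([0.1, 1.0])                    # Ident(voice)'s weight is smoothed much harder

w, lr = np.array([1.0, 1.0]), 0.05
for _ in range(5000):
    p = np.exp(-(V @ w)); p /= p.sum()
    grad_nll = (counts @ V - counts.sum() * (p @ V)) / counts.sum()
    grad = grad_nll + 2 * lam * w             # new objective = old objective + lam * |w|^2
    w = np.clip(w - lr * grad, 0, None)

p = np.exp(-(V @ w)); p /= p.sum()
print(w, p)   # p([bab]) falls short of the observed 0.8: smoothing resists over-fitting
```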