Maximum Entropy (ME) • Maximum Entropy Markov Model (MEMM) • Conditional Random Field (CRF)
Boltzmann-Gibbs Distribution • Given: • States s_1, s_2, …, s_n • Density p(s) = p_s • Maximum entropy principle: • Without any further information, one chooses the density p_s that maximizes the entropy H(p) = −Σ_s p_s log p_s subject to the constraints Σ_s p_s = 1 and Σ_s p_s f_i(s) = F_i (prescribed averages of the features f_i)
Boltzmann-Gibbs (Cont'd) • Consider the Lagrangian L = −Σ_s p_s log p_s + Σ_i λ_i (Σ_s p_s f_i(s) − F_i) + μ (Σ_s p_s − 1) • Taking partial derivatives of L with respect to p_s and setting them to zero, we obtain the Boltzmann-Gibbs density p_s = exp(Σ_i λ_i f_i(s)) / Z, where Z = Σ_s exp(Σ_i λ_i f_i(s)) is the normalizing factor
Exercise • From the Lagrangian, derive the Boltzmann-Gibbs form p_s = exp(Σ_i λ_i f_i(s)) / Z
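A worked sketch of this derivation, adding an explicit multiplier μ for the normalization constraint (the notation below is assumed, not given on the slides):

L = -\sum_s p_s \log p_s + \sum_i \lambda_i \Big( \sum_s p_s f_i(s) - F_i \Big) + \mu \Big( \sum_s p_s - 1 \Big)

\frac{\partial L}{\partial p_s} = -\log p_s - 1 + \sum_i \lambda_i f_i(s) + \mu = 0

\Rightarrow\quad p_s = e^{\mu - 1} \exp\Big( \sum_i \lambda_i f_i(s) \Big) = \frac{1}{Z} \exp\Big( \sum_i \lambda_i f_i(s) \Big), \qquad Z = \sum_s \exp\Big( \sum_i \lambda_i f_i(s) \Big)

The constraint \sum_s p_s = 1 fixes e^{\mu - 1} = 1/Z.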
Boltzmann-Gibbs (Cont'd) • Classification Rule • Use the Boltzmann-Gibbs density as the prior distribution • Compute the posterior for the given observed data and features f_i • Use the optimal posterior to classify
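As an illustration only, a minimal Python sketch of this classification rule; the states, feature function, and weights below are hypothetical, not taken from the slides:

import math

def classify(x, states, features, weights):
    # p(s | x) is proportional to exp(sum_i w_i * f_i(x, s))  (Boltzmann-Gibbs form)
    scores = {s: math.exp(sum(w * f(x, s) for f, w in zip(features, weights)))
              for s in states}
    Z = sum(scores.values())                          # normalizing factor
    posterior = {s: v / Z for s, v in scores.items()}
    return max(posterior, key=posterior.get), posterior

# Hypothetical example: one indicator feature, two candidate states
features = [lambda x, s: 1.0 if (s == "en" and "April" in x) else 0.0]
weights = [1.5]
print(classify(["in", "April"], ["en", "dans"], features, weights))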
Boltzmann-Gibbs (Cont'd) • Maximum Entropy (ME) • The posterior is the state probability density p(s | X), where X = (x_1, x_2, …, x_n) • Maximum Entropy Markov Model (MEMM) • The posterior consists of transition probability densities p(s | s′, X)
Boltzmann-Gibbs (Cont'd) • Conditional Random Field (CRF) • The posterior consists of both transition probability densities p(s | s′, X) and state probability densities p(s | X)
References • R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Ed., Wiley Interscience, 2001. • T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001. • P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach, The MIT Press, 2001.
An Example • Five possible French translations of the English word in: • dans, en, à, au cours de, pendant • Certain constraints are obeyed, for example: • When April follows in, the proper translation is en • How do we choose the proper French translation y given an English context x?
Formalism • Probability assignment p(y | x): • y: French word, x: English context • Indicator function of a context feature, e.g. f(x, y) = 1 if y = en and April follows in within the context x, and f(x, y) = 0 otherwise
Expected Values of f • The expected value of f with respect to the empirical distribution: p̃(f) = Σ_{x,y} p̃(x, y) f(x, y) • The expected value of f with respect to the conditional probability p(y | x): p(f) = Σ_{x,y} p̃(x) p(y | x) f(x, y)
Constraint Equation • Set the two expected values equal: Σ_{x,y} p̃(x, y) f(x, y) = Σ_{x,y} p̃(x) p(y | x) f(x, y), or equivalently, p̃(f) = p(f)
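A small sketch of the two expectations being equated, on a hypothetical three-example training set (the data and feature are made up for illustration):

from collections import Counter

# Hypothetical (English context, French translation) pairs
data = [("in April", "en"), ("in April", "en"), ("in the box", "dans")]

def f(x, y):                      # indicator feature: "April follows in" -> en
    return 1.0 if ("April" in x and y == "en") else 0.0

N = len(data)
joint = Counter(data)             # empirical counts of (x, y)
marg = Counter(x for x, _ in data)

empirical = sum(c / N * f(x, y) for (x, y), c in joint.items())   # p~(f)

def model_expectation(p_y_given_x, ys):                           # p(f)
    return sum(marg[x] / N * p_y_given_x(y, x) * f(x, y)
               for x in marg for y in ys)

uniform = lambda y, x: 0.5        # a (poor) model that ignores the context
print(empirical, model_expectation(uniform, ["en", "dans"]))

The constraint equation asks for a model p(y | x) whose second number matches the first.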
Maximum Entropy Principle • Given n feature functions f_i, we want p(y | x) to maximize the entropy measure H(p) = −Σ_{x,y} p̃(x) p(y | x) log p(y | x), where p is chosen from the constrained set C = { p | p(f_i) = p̃(f_i), i = 1, …, n }
Constrained Optimization Problem • The Lagrangian: L(p, λ) = H(p) + Σ_i λ_i (p(f_i) − p̃(f_i)) • Solution: p_λ(y | x) = exp(Σ_i λ_i f_i(x, y)) / Z_λ(x), where Z_λ(x) = Σ_y exp(Σ_i λ_i f_i(x, y))
Iterative Solution • Compute the expectation of each f_i under the current estimate of the probability function • Update the Lagrange multipliers λ_i • Update the probability function
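The slides do not spell the updates out; one standard choice is Generalized Iterative Scaling (GIS). The sketch below assumes the features have been padded so that they sum to a constant C for every (x, y); it is a simplified illustration, not the authors' exact procedure.

import math

def gis(xs, labels, feats, empirical, C, iters=200):
    # Fit p(y | x) proportional to exp(sum_i lam_i * f_i(x, y)) by GIS
    lam = [0.0] * len(feats)
    for _ in range(iters):
        model = [0.0] * len(feats)            # model expectations of each feature
        for x in xs:
            scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lam, feats)))
                      for y in labels}
            Z = sum(scores.values())
            for y, s in scores.items():
                for i, f in enumerate(feats):
                    model[i] += (s / Z) * f(x, y) / len(xs)
        # multiplier update: lam_i += (1/C) * log(empirical_i / model_i)
        lam = [l + math.log(e / m) / C if e > 0 and m > 0 else l
               for l, e, m in zip(lam, empirical, model)]
    return lam

Here xs are the training contexts (with repeats), labels the candidate outputs, and empirical the precomputed p̃(f_i).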
Feature Selection • Motivation: • From a large collection of candidate features, we want to select a small, informative subset • Approach: incremental growth — add one feature at a time, keeping the candidate that most improves the model
Approximation • Computing a full maximum entropy model for each candidate f is costly • Simplifying assumption: • The multipliers λ associated with the active set S do not change when f is added to S, so only the new feature's multiplier needs to be fitted (made concrete below)
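One way to make this approximation concrete (the notation here is assumed; it follows the standard incremental feature selection scheme): with the multipliers of S frozen, adding f introduces a single new weight α, and each candidate is scored by the best gain in empirical log-likelihood over that one-parameter family.

p_{\alpha}^{S \cup f}(y \mid x) = \frac{p_S(y \mid x)\, e^{\alpha f(x, y)}}{Z_\alpha(x)},
\qquad
Z_\alpha(x) = \sum_{y'} p_S(y' \mid x)\, e^{\alpha f(x, y')}

\mathrm{Gain}(S, f) \;\approx\; \max_\alpha \sum_{x, y} \tilde p(x, y) \log \frac{p_{\alpha}^{S \cup f}(y \mid x)}{p_S(y \mid x)}

Because this is a one-dimensional concave maximization in α, each candidate can be scored cheaply.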
Difference from MEMM • If the state features are dropped from a CRF, we obtain an MEMM • Drawbacks of MEMM: • The state probabilities are not learned directly, but inferred • Bias can arise, since the transition features dominate the training
Difference from HMM • HMM is a generative model • In order to define a joint distribution, this model must enumerate all possible observation sequences and their corresponding label sequences • This task is intractable, unless observation elements are represented as isolated units
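To make the contrast of the last two slides concrete, here are the linear-chain forms side by side (notation assumed, with states s_t and observations X = (x_1, …, x_n)):

\text{HMM (generative, joint):}\quad p(X, S) = \prod_t p(s_t \mid s_{t-1})\, p(x_t \mid s_t)

\text{MEMM (conditional, locally normalized):}\quad p(S \mid X) = \prod_t p(s_t \mid s_{t-1}, X)

\text{CRF (conditional, globally normalized):}\quad p(S \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_t \sum_i \lambda_i f_i(s_{t-1}, s_t, X, t) \Big)

The single global normalizer Z(X) is what lets the CRF trade probability mass between transition and state features, rather than normalizing each step separately.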
CRF Training Methods • CRF training requires intensive numerical computation • Preconditioned conjugate gradient: • Instead of searching along the gradient, conjugate gradient searches along a carefully chosen linear combination of the gradient and the previous search direction • Limited-memory quasi-Newton: • Limited-memory BFGS (L-BFGS) is a second-order method that estimates the curvature from previous gradients and updates, avoiding an explicit computation of the inverse Hessian (a simplified sketch follows below) • Voted perceptron
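For a flavor of the second item, here is a simplified sketch: not the full linear-chain CRF objective, but a conditional maximum entropy log-likelihood minimized with SciPy's L-BFGS implementation, showing the usual gradient (empirical minus expected feature counts). The shapes and data are hypothetical.

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(w, F, y):
    # F[n, k, :] is the feature vector f(x_n, label k); y[n] is the true label index
    scores = F @ w                                  # (N, K) unnormalized log-scores
    logZ = np.logaddexp.reduce(scores, axis=1)      # per-example log normalizer
    ll = scores[np.arange(len(y)), y] - logZ
    p = np.exp(scores - logZ[:, None])              # model probabilities p(k | x_n)
    grad = F[np.arange(len(y)), y].sum(axis=0) - np.einsum('nk,nkd->d', p, F)
    return -ll.sum(), -grad                         # objective and its gradient

# Hypothetical toy problem: 4 examples, 2 labels, 3 features
rng = np.random.default_rng(0)
F = rng.normal(size=(4, 2, 3))
y = np.array([0, 1, 0, 1])
res = minimize(neg_log_likelihood, np.zeros(3), args=(F, y),
               jac=True, method="L-BFGS-B")
print(res.x)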
Voted Perceptron • Like the perceptron algorithm, this algorithm scans through the training instances, updating the weight vector λ_t when a prediction error is detected • Instead of taking just the final weight vector, the voted perceptron algorithm takes the average of the λ_t
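A minimal sketch of the averaged variant for a generic multiclass or structured setting (the function names and setup are hypothetical); the fully voted version keeps every intermediate weight vector, and averaging them is the usual practical shortcut:

import numpy as np

def averaged_perceptron(examples, labels, feature, epochs=5):
    # examples: list of (x, y_true); feature(x, y) returns a NumPy feature vector
    w = np.zeros_like(feature(*examples[0]), dtype=float)
    w_sum = np.zeros_like(w)
    for _ in range(epochs):
        for x, y_true in examples:
            y_hat = max(labels, key=lambda y: w @ feature(x, y))  # current prediction
            if y_hat != y_true:                                   # update only on error
                w = w + feature(x, y_true) - feature(x, y_hat)
            w_sum = w_sum + w                                     # accumulate the lambda_t
    return w_sum / (epochs * len(examples))                       # averaged weight vector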
References • A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra, A maximum entropy approach to natural language processing, Computational Linguistics, 1996 • A. McCallum, D. Freitag, and F. Pereira, Maximum entropy Markov models for information extraction and segmentation, ICML 2000 • H. M. Wallach, Conditional random fields: an introduction, 2004 • J. Lafferty, A. McCallum, and F. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data, ICML 2001 • F. Sha and F. Pereira, Shallow parsing with conditional random fields, HLT-NAACL 2003