Chapter 6. Hidden Markov and Maximum Entropy Models Daniel Jurafsky and James H. Martin 2008
Introduction • Maximum Entropy (MaxEnt) • More widely known as multinomial logistic regression • We begin with a non-sequential classifier • A probabilistic classifier • An exponential or log-linear classifier • Applications: text classification • Sentiment analysis (positive or negative opinion) • Sentence boundary detection
Linear Regression • x^(j): a particular instance in the training set • y^(j)_obs: the observed label of x^(j) in the training set • y^(j)_pred: the value predicted by the linear regression model • The weights are chosen to minimize the sum-squared error: cost(W) = Σ_j ( y^(j)_pred − y^(j)_obs )²
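A minimal sketch (assuming numpy and a made-up toy dataset) of fitting the weights by minimizing the sum-squared error with ordinary least squares:

```python
import numpy as np

# Toy training data (hypothetical): one feature per instance plus a bias column.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])              # column 0 is the intercept feature
y_obs = np.array([1.1, 2.9, 5.2, 6.8])  # observed labels y^(j)_obs

# Ordinary least squares: choose W to minimize sum_j (y_pred - y_obs)^2.
W, *_ = np.linalg.lstsq(X, y_obs, rcond=None)

y_pred = X @ W
sse = np.sum((y_pred - y_obs) ** 2)     # the sum-squared error being minimized
print(W, sse)
```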
Logistic Regression – simplest case of binary classification • Consider whether x is in the class (1, true) or not (0, false) • P(y = true | x) ∈ [0, 1], but the linear combination w·f ∈ (−∞, ∞), so the two cannot be equated directly • The odds ratio P(y = true | x) / (1 − P(y = true | x)) ∈ [0, ∞) • Its logarithm, the logit, ∈ (−∞, ∞), which can be equated with w·f
Logistic Regression – simplest case of binary classification • Setting the logit equal to the linear combination, ln[ P(y = true | x) / (1 − P(y = true | x)) ] = w·f, and solving for the probability gives the logistic (sigmoid) function: P(y = true | x) = e^(w·f) / (1 + e^(w·f)) = 1 / (1 + e^(−w·f))
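A minimal sketch (plain Python, with made-up weights and feature values) of turning the dot product w·f into a probability with the logistic function:

```python
import math

def logistic(z):
    """Map a real-valued score z in (-inf, inf) to a probability in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and binary feature values for one observation x.
w = [1.5, -0.8, 0.3]
f = [1.0, 0.0, 1.0]

z = sum(wi * fi for wi, fi in zip(w, f))   # w . f
p_true = logistic(z)                       # P(y = true | x)
print(p_true, 1.0 - p_true)                # P(true), P(false)
```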
Maximum Entropy Modeling • Input: x (a word to be tagged or a document to be classified) • Features, e.g.: the word ends in –ing; the previous word is “the” • Each feature fi has a weight wi • The model assigns a probability to each particular class c • Z is a normalizing factor, used to make the probabilities sum to 1
Maximum Entropy Modeling • C = {c1, c2, …, cC} • p(c|x) = exp( Σ_i wi fi(c, x) ) / Z, with the normalization factor Z = Σ_{c′∈C} exp( Σ_i wi fi(c′, x) ) • A feature fi that only takes on the values 0 and 1 is also called an indicator function • In MaxEnt, instead of the notation fi, we will often use the notation fi(c, x), meaning a feature fi for a particular class c and a given observation x
Maximum Entropy Modeling Assume C = {NN, VB}
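A minimal sketch (plain Python, with hypothetical weights) of the two-class example: indicator features fi(c, x) fire for a (class, observation) pair, and Z normalizes over C = {NN, VB}:

```python
import math

CLASSES = ["NN", "VB"]

def features(c, x):
    """Indicator features f_i(c, x); each takes the value 0 or 1."""
    return [
        1.0 if c == "VB" and x["word"].endswith("ing") else 0.0,
        1.0 if c == "NN" and x["prev_word"] == "the" else 0.0,
    ]

# Hypothetical weights w_i, one per feature above.
w = [0.8, 1.2]

def maxent_prob(x):
    # Unnormalized score exp(sum_i w_i f_i(c, x)) for each class,
    # then divide by Z so the probabilities sum to 1.
    scores = {c: math.exp(sum(wi * fi for wi, fi in zip(w, features(c, x))))
              for c in CLASSES}
    Z = sum(scores.values())
    return {c: s / Z for c, s in scores.items()}

x = {"word": "racing", "prev_word": "the"}
print(maxent_prob(x))   # probabilities for NN and VB
```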
HMM vs. MEMM • An MEMM can condition on any useful feature of the input observation; in an HMM this isn’t possible • (Figure: graphical structures of the HMM and the MEMM, with word and class nodes)
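A worked comparison of the two decompositions (standard formulation; the notation Q for the state/class sequence and O for the observed word sequence is assumed rather than taken from the slide):

```latex
% HMM: generative -- reaches P(Q | O) through the likelihood and the prior
\hat{Q} = \arg\max_{Q} P(Q \mid O)
        = \arg\max_{Q} \prod_{i} P(o_i \mid q_i)\, P(q_i \mid q_{i-1})

% MEMM: discriminative -- conditions each state directly on the observation,
% which is what lets it use arbitrary features of the input
\hat{Q} = \arg\max_{Q} P(Q \mid O)
        = \arg\max_{Q} \prod_{i} P(q_i \mid q_{i-1}, o_i)
```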
Conditional Random Fields (CRFs) • CRFs (Lafferty, McCallum, and Pereira, 2001) constitute another conditional model based on maximum entropy • Like MEMMs, CRFs are able to accommodate many possibly correlated features of the observation • However, CRFs are better able to trade off decisions at different sequence positions • MEMMs were found to suffer from the label bias problem
Label Bias • The problem appears when the MEMM contains states with differing numbers of outgoing transitions (out-degrees) • Because the probabilities of the transitions out of any given state must sum to 1, transitions from lower-degree states receive higher probabilities than transitions from higher-degree states • In the extreme case, a transition from a state with out-degree 1 always gets probability 1, effectively ignoring the observation • CRFs do not have this problem because they define a single ME-based distribution over the whole label sequence
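A sketch of the globally normalized distribution referred to in the last bullet (standard linear-chain CRF form; the feature functions f_k and weights w_k are assumed notation, not taken from the slide):

```latex
% Linear-chain CRF: one exponential distribution over the entire label sequence,
% normalized by Z(x), which sums over all label sequences rather than per state.
P(y \mid x) = \frac{1}{Z(x)} \exp\!\Big( \sum_{t} \sum_{k} w_k\, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{t} \sum_{k} w_k\, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Because Z(x) normalizes over whole label sequences, probability mass is not forced to sum to 1 at each state, which is why the label bias problem does not arise.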