300 likes | 421 Views
Why does it work?. We have not addressed the question of why does this c lassifier performs well, given that the assumptions are unlikely to be satisfied. The linear form of the classifiers provides some hints. Projects Presentation on 12/15 9am Updates (see web site)
E N D
Why does it work? • We have not addressed the question of why does this classifier • performs well, given that the assumptions are unlikely to be • satisfied. • The linear form of the classifiers provides some hints. • Projects • Presentation on 12/15 9am • Updates (see web site) • Final Exam 12/11, in class • No class on Thursday • Happy Thanksgiving
Naïve Bayes: Two Classes • In the case of two classes we have that: • but since • We get (plug in (2) in (1); some algebra): • Which is simply the logistic (sigmoid) function used in the • neural network representation. We have: A = 1-B; Log(B/A) = -C. Then: Exp(-C) = B/A = = (1-A)/A = 1/A – 1 = + Exp(-C) = 1/A A = 1/(1+Exp(-C))
Note this is a bit different than the previous linearization. Rather than a single function, here we have argmax over several different functions. Graphical model. It encodes the NB independence assumption in the edge structure (siblings are independent given parents) Another look at Naive Bayes Linear Statistical Queries Model
s1 s2 s3 s4 s5 s6 o1 o2 o3 o4 o5 o6 Hidden Markov Model (HMM) • A probabilistic generative model: models the generation of an observed sequence. • At each time step, there are two variables: Current state (hidden), Observation • Elements • Initial state probability P(s1) (|S| parameters) • Transition probability P(st|st-1) (|S|^2 parameters) • Observation probability P(ot|st) (|S|x |O| parameters) • As before, the graphical model is an encoding of the independence assumptions: • P(st|st-1, st-2,…s1) =P(st|st-1) • P(ot| sT,…,st,…s1, oT,…,ot,…o1 )=P(ot|st) • Examples: POS tagging, Sequential Segmentation
s1=B s2=I s3=O s4=B s5=I s6=O o1 Mr. o2 Brown o3 blamed o4 Mr. o5 Bob o6 for HMM for Shallow Parsing • States: • {B, I, O} • Observations: • Actual words and/or part-of-speech tags
s1=B s2=I s3=O s4=B s5=I s6=O o1 Mr. o2 Brown o3 blamed o4 Mr. o5 Bob o6 for HMM for Shallow Parsing • Given a sentences, we can ask what the most likely state sequence is Transition probabilty: P(st=B|st-1=B),P(st=I|st-1=B),P(st=O|st-1=B), P(st=B|st-1=I),P(st=I|st-1=I),P(st=O|st-1=I), … Initial state probability: P(s1=B),P(s1=I),P(s1=O) Observation Probability: P(ot=‘Mr.’|st=B),P(ot=‘Brown’|st=B),…, P(ot=‘Mr.’|st=I),P(ot=‘Brown’|st=I),…, …
0.2 0.5 0.5 0.5 B I I I B 0.5 0.25 0.25 0.25 0.25 0.4 a a c d d Three Computational Problems • Decoding – finding the most likely path • Have: model, parameters, observations (data) • Want: most likely states sequence • Evaluation – computing observation likelihood • Have: model, parameters, observations (data) • Want: the likelihood to generate the observed data • In both cases – a simple minded solution depends on |S|T steps • Training – estimating parameters • Supervised: Have: model, annotated data(data + states sequence) • Unsupervised: Have: model, data • Want: parameters
( ) P s ; s ; : : : ; s ; o ; o ; : : : ; o 1 1 1 1 k k ¡ k k ¡ = ( ) j P o o ; o ; : : : ; o ; s ; s ; : : : ; s 1 2 1 1 1 k k ¡ k ¡ k k ¡ ( ) ¢ P o ; o ; : : : ; o ; s ; s ; : : : ; s 1 1 1 2 1 k ¡ k ¡ k k ¡ = ( ) ( ) j ¢ P o s P o ; o ; : : : ; o ; s ; s ; : : : ; s 1 1 1 2 1 k k k ¡ k ¡ k k ¡ = ( ) ( ) j ¢ j P o s P s s ; s ; : : : ; s ; o ; o ; : : : ; o 1 1 1 2 1 2 k k k k ¡ k ¡ k ¡ k ¡ ( ) ¢ P s ; s ; : : : ; s ; o ; o ; : : : ; o 1 1 1 2 1 2 k ¡ k ¡ k ¡ k ¡ = ( ) ( ) j ¢ j P o s P s s 1 k k k k ¡ ( ) ¢ P s ; s ; : : : ; s ; o ; o ; : : : ; o 1 2 1 1 2 1 k ¡ k ¡ k ¡ k ¡ 1 k ¡ Y = ( ) [ ( ) ( )] ( ) j ¢ j ¢ j ¢ P o s P s s P o s P s +1 1 t t t k k t =1 t Finding most likely state sequence in HMM (1)
a rg max ( ) j P s ; s ; : : : ; s o ; o ; : : : ; o 1 1 1 1 k k ¡ k k ¡ s ;s ;::: ;s 1 1 k k ¡ ( ) P s ; s ; : : : ; s ; o ; o ; : : : ; o 1 1 1 1 k k ¡ k k ¡ = a rg max ( ) s ;s ;::: ;s P o ; o ; : : : ; o 1 1 1 1 k k ¡ k k ¡ = a rg max ( ) P s ; s ; : : : ; s ; o ; o ; : : : ; o 1 1 1 1 k k ¡ k k ¡ s ;s ;::: ;s 1 1 k k ¡ 1 k ¡ Y = a rg max ( ) [ ( ) ( )] ( ) j ¢ j ¢ j ¢ P o s P s s P o s P s 1 t t t +1 k k t s ;s ;::: ;s 1 1 k k ¡ =1 t Finding most likely state sequence in HMM (2)
1 k ¡ Y max ( ) [ ( ) ( )] ( ) j ¢ j ¢ j ¢ P o s P s s P o s P s 1 +1 t t t k k t s ;s ;::: ;s 1 1 k k ¡ =1 t 1 k ¡ Y = max ( ) max [ ( ) ( )] ( ) j ¢ j ¢ j ¢ P o s P s s P o s P s 1 +1 t t t k k t s s ;::: ;s 1 1 k k ¡ =1 t = max ( ) max [ ( ) ( )] j ¢ j ¢ j P o s P s s P o s 1 1 1 k k k k ¡ k ¡ k ¡ s s 1 k k ¡ 2 k ¡ Y max [ ( ) ( )] ( ) ¢ j ¢ j ¢ P s s P o s P s +1 1 t t t t s ;::: ;s 1 2 k ¡ =1 t = max ( ) max [ ( ) ( )] j ¢ j ¢ j P o s P s s P o s 1 1 1 k k k k ¡ k ¡ k ¡ s s 1 k k ¡ max [ ( ) ( )] ¢ j ¢ j ¢ P s s P o s : : : 1 2 2 2 k ¡ k ¡ k ¡ k ¡ s 2 k ¡ max [ ( ) ( )] ( ) ¢ j ¢ j ¢ P s s P o s P s 2 1 1 1 1 s 1 Finding most likely state sequence in HMM (3) A function of sk
max ( ) max [ ( ) ( )] j ¢ j ¢ j P o s P s s P o s 1 1 1 k k k k ¡ k ¡ k ¡ s s 1 k k ¡ max [ ( ) ( )] ¢ j ¢ j ¢ P s s P o s : : : 1 2 2 2 k ¡ k ¡ k ¡ k ¡ s 2 k ¡ max [ ( ) ( )] ¢ j ¢ j ¢ P s s P o s 3 2 2 2 s 2 max [ ( ) ( )] ( ) ¢ j ¢ j ¢ P s s P o s P s 2 1 1 1 1 s 1 Finding most likely state sequence in HMM (4) • Viterbi’s Algorithm • Dynamic Programming
Learning the Model • Estimate • Initial state probability P(s1) • Transition probability P(st|st-1) • Observation probability P(ot|st) • Unsupervised Learning (states are not observed) • EM Algorithm • Supervised Learning (states are observed; more common) • ML Estimate of above terms directly from data • Notice that this is completely analogues to the case of naive Bayes, and essentially all other models.
Prediction: predict tT that maximizes Another view of Markov Models Input: T States: Observations: W Assumptions:
T States: Observations: W Another View of Markov Models Input: As for NB:features are pairs and singletons of t‘s, w’s HMM is a linear model (over pairs of states and states/obs) Only 3 active features This can be extended to an argmax that maximizes the prediction of the whole state sequence and computed, as before, via Viterbi.
Learning with Probabilistic Classifiers • Learning Theory • We showed that probabilistic predictions can be viewed as predictions via Linear Statistical Queries Models (Roth’99). • The low expressivity explains Generalization+Robustness • Is that all? • It does not explain why is it possible to (approximately) fit the data with these models. Namely, is there a reason to believe that these hypotheses minimize the empirical error on the sample? • In General, No.(Unless it corresponds to some probabilistic assumptions that hold).
T States: Observations: W Example: probabilistic classifiers If hypothesis does not fit the training data - augment set of features(forget assumptions) Features are pairs and singletons of t‘s, w’s Additional features are included
Learning Protocol: Practice • LSQ hypotheses are computed directly: • Choose features • Compute coefficients • If hypothesis does not fit the training data • Augment set of features • (Assumptions will not be satisfied) • But now, you actually follow the Learning Theory Protocol: • Try to learn a hypothesis that is consistent with the data • Generalization will be a function of the low expressivity
Robustness of Probabilistic Predictors • Remaining Question: While low expressivity explains generalization, why is it relatively easy to fit the data? • Consider all distributions with the same marginals • (That is, a naïve Bayes classifier will predict the same regardless of which distribution really generated the data.) • Garg&RothECML’01): • Product distributions are “dense” in the space of all distributions. Consequently, for most generating distributions the resulting predictor’s error is close to optimal classifier (that is, given the correct distribution)
Summary: Probabilistic Modeling • Classifiers derived from probability density estimation models were viewed as LSQ hypotheses. • Probabilistic assumptions: + Guiding feature selection but also -Not allowing the use of more general features. • A unified approach: a lot of classifiers, probabilistic and others can be viewed as linear classier over an appropriate feature space.
What’s Next? • (1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically? • Yes • (2) If probabilistic hypotheses are actually like other linear functions, can you actually train them similarly (that is, discriminatively)? • Yes. • Classification: Logistics regression/Max Entropy • HMM: can be learned as a linear model, e.g., with a version of Perceptron (Structured Models; Spring 2013)
Recall: Naïve Bayes, Two Classes • In the case of two classes we have: • but since • We get (plug in (2) in (1); some algebra): • Which is simply the logistic (sigmoid) function used in the • neural network representation.
Conditional Probabilities The plot shows a (normalized) histogram of examples as a function of the dot product act = (wTx+ b) and a couple other functions of it. In particular, we plot the positive Sigmoid: • Data: Two class (Open/NotOpen Classifier) P(y= +1 | x,w)= [1+exp(-(wTx + b)]-1 Is this really a probability distribution?
Conditional Probabilities Plotting: For example z: y=Prob(label=1 | f(z)=x) (Histogram: for 0.8, # (of examples with f(z)<0.8)) Claim: Yes; If Prob(label=1 | f(z)=x) = x Then f(z) = f(z) is a probability dist. That is, yes, if the graph is linear. Theorem: Let X be a RV with distribution F. • F(X) is uniformly distributed in (0,1). • If U is uniform(0,1), F-1(U) is distributed F, where F-1(x) is the value of y s.t. F(y) =x. Alternatively: f(z) is a probability if: ProbU {z| Prob[(f(z)=1 · y]} = y Plotted for SNoW (Winnow). Similarly, perceptron; more tuning is required for SVMs.
Conditional Probabilities • (1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically? • Yes • General recipe • Train a classifier f using your favorite algorithm (Perceptron, SVM, Winnow, etc). Then: • Use Sigmoid1/1+exp{-(AwTx + B)} to get an estimate for P(y | x) • A, B can be tuned using a held out that was not used for training. • Done in LBJ, for example
Logistic Regression • (2) If probabilistic hypotheses are actually like other linear functions, can you actually train them similarly (that is, discriminatively)? • The logistic regression model assumes the following model: P(y= +/-1 | x,w)= [1+exp(-y(wTx + b)]-1 • This is the same model we derived for naïve Bayes, only that now we will not assume any independence assumption. We will directly find the best w. • Therefore training will be more difficult. However, the weight vector derived will be more expressive. • It can be shown that the naïve Bayes algorithm cannot represent all linear threshold functions. • On the other hand, NB converges to its performancefaster. How?
Logistic Regression (2) • Given the model: P(y= +/-1 | x,w)= [1+exp(-y(wTx + b)]-1 • The goal is to find the (w, b) that maximizes the log likelihood of the data: {x1,x2… xm}. • We are looking for (w,b) that minimizes the negative log-likelihood minw,b1m log P(y= +/-1 | x,w)= minw,b1m log[1+exp(-yi(wTxi + b)] • This optimization problem is called Logistics Regression • Logistic Regression is sometimes called the Maximum Entropy model in the NLP community (since the resulting distribution is the one that has the largest entropy among all those that activate the same features).
Logistic Regression (3) • Using the standard mapping to linear separators through the origin, we would like to minimize: minw1m log P(y= +/-1 | x,w)= minw,1m log[1+exp(-yi(wTxi)] • To get good generalization, it is common to add a regularization term, and the regularized logistics regression then becomes: minw f(w) = ½ wTw + C 1m log[1+exp(-yi(wTxi)], Where C is a user selected parameter that balances the two terms. Empirical loss Regularization term
Comments on discriminative Learning minw f(w) = ½ wTw + C 1m log[1+exp(-yiwTxi)], Where C is a user selected parameter that balances the two terms. • Since the second term can be considered the loss function • Therefore, regularized logistic regression can be related to other learning methods, e.g., SVMs. • L1 SVM solves the following optimization problem: minw f1(w) = ½ wTw + C 1m max(0,1-yi(wTxi) • L2 SVM solves the following optimization problem: minwf2(w) = ½ wTw + C 1m (max(0,1-yiwTxi))2 Empirical loss Regularization term
All methods are iterative methods, that generate a sequence wkthat converges to the optimal solution of the optimization problem above. Many options within this category: Iterative scaling: Low cost per iteration, slow convergence, updates each w component at a time Newton methods: High cost per iteration, faster convergence non-linear conjugate gradient; quasi-Newton methods; truncated Newton methods; trust-region newton method. Currently: Limited memory BFGS is very popular Stochastic Gradient Decent methods The runtime does not depend on n=#(examples); advantage when nis very large. Stopping criteria is a problem: method tends to be too aggressive at the beginning and reaches a moderate accuracy quite fast, but it’s convergence becomes slow if we are interested in more accurate solutions. Optimization: How to Solve
Summary • (1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically? • Yes • (2) If probabilistic hypotheses are actually like other linear functions, can you actually train them similarly (that is, discriminatively)? • Yes. • Classification: Logistic regression/Max Entropy • HMM: can be trained via Perceptron (Spring 2013(