Max-Margin Markov Networks by Ben Taskar, Carlos Guestrin, and Daphne Koller. Presented by Michael Cafarella, CSE574, May 25, 2005
Introduction • Kernel methods (SVMs) and max-margin are terrific for classification • No way to model structure, relations • Graphical models (Markov networks) can capture complex structure • Not trained for discrimination • Maximum Margin Markov (M3) Networks capture advantages of both
Standard classification • Want to learn a classification function (sketched below): • f(x,y) are the features (basis functions), w are the weights • y is a multi-label assignment. The set of possible assignments, Y, is exponential in the number of labels l • So, we can't compute the argmax, and can't even represent all the features
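The formula image on this slide did not survive extraction; the following is a plausible reconstruction of the classifier from the paper's setup, assuming w^T f(x,y) denotes the weighted sum of basis functions:

$$h_w(x) = \arg\max_{y \in \mathcal{Y}} \; w^\top f(x, y) = \arg\max_{y} \sum_i w_i f_i(x, y)$$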
Probabilistic classification • Graphical model defines P(Y|X). Select the label argmaxy P(y | x) • Exploit sparseness in the dependencies through model design (e.g., OCR chars are independent given their neighbors) • We'll use a pairwise Markov network as the model (sketched below) • The log of each potential function is a weighted sum of basis functions
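A hedged sketch of the pairwise network the slide describes, assuming ψ denotes the edge potentials over the edge set E:

$$P(y \mid x) \;\propto\; \prod_{(i,j) \in E} \psi_{ij}(x, y_i, y_j), \qquad \log \psi_{ij}(x, y_i, y_j) = w^\top f(x, y_i, y_j)$$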
M3N • For regular Markov networks, we train w to maximize likelihood or cond. likelihood • For M3N, we’ll train w to maximize margin • Main contribution of this paper is how to choose w accordingly
Choosing w • With SVMs, we choose w to maximize the margin γ (formulation sketched below) • The constraints ensure the true label outscores every other label by at least γ • Maximizing the margin magnifies the difference between the value of the true label and that of the best runner-up
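The slide's formulas are missing; this is a reconstruction of the standard max-margin formulation, where t(x) is the true label of x and Δf_x(y) = f(x, t(x)) − f(x, y):

$$\max \; \gamma \quad \text{s.t.} \quad \|w\| \le 1; \quad w^\top \Delta f_x(y) \ge \gamma \;\; \forall x, \; \forall y \ne t(x)$$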
Multiple labels • Structured problems have multiple labels, not a single classification • We extend the "margin" to scale with the number of mistaken labels, so the constraint (sketched below) requires a larger gap the more labels an assignment gets wrong
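A hedged reconstruction of the scaled-margin constraint, assuming Δt_x(y) is the Hamming loss, i.e. the number of labels in y that disagree with the true assignment t(x):

$$w^\top \Delta f_x(y) \ge \gamma \, \Delta t_x(y), \qquad \Delta t_x(y) = \sum_i \mathbf{1}\big[\, y_i \ne (t(x))_i \,\big]$$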
Convert to optimization prob • We can drop the explicit margin term (by rescaling w) to obtain a quadratic program, sketched below • We have to add slack variables, because the data might not be separable • We can now reformulate the whole M3N learning problem as the following optimization task…
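A sketch of the resulting min-norm QP, assuming the usual rescaling argument; the slack variables are added in the formulation on the next slide:

$$\min \; \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad w^\top \Delta f_x(y) \ge \Delta t_x(y) \;\; \forall x, \forall y$$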
Grand formulation • The primal and the dual (both sketched below) • Note the extra dual vars; they have no effect on the solution
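The primal and dual images are missing; this is a hedged reconstruction following the paper's standard derivation, with slack variables ξ_x, regularization constant C, and dual variables α_x(y):

Primal:
$$\min \; \tfrac{1}{2}\|w\|^2 + C \sum_x \xi_x \quad \text{s.t.} \quad w^\top \Delta f_x(y) \ge \Delta t_x(y) - \xi_x \;\; \forall x, \forall y$$

Dual:
$$\max \; \sum_{x,y} \alpha_x(y)\, \Delta t_x(y) \;-\; \tfrac{1}{2} \Big\| \sum_{x,y} \alpha_x(y)\, \Delta f_x(y) \Big\|^2 \quad \text{s.t.} \quad \sum_y \alpha_x(y) = C \;\; \forall x; \quad \alpha_x(y) \ge 0$$

At the optimum the weights are recovered as w = Σ_{x,y} α_x(y) Δf_x(y), which is why the dual variables suffice.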
Unfortunately, not enough! • Constraints in the primal, and the number of vars in the dual, are exponential in the number of labels l • Let's interpret the variables in the dual as a density function over y, conditional on x • The dual objective is a function of expectations; we need just the node and edge marginals of the dual vars to compute them • Define the marginal dual vars as below:
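A hedged sketch of the marginal dual variables (section 4 of the paper), where y' ∼ [y_i] means the full assignment y' agrees with the value y_i at node i, and similarly for edges:

$$\mu_x(y_i) = \sum_{y' \sim [y_i]} \alpha_x(y'), \qquad \mu_x(y_i, y_j) = \sum_{y' \sim [y_i, y_j]} \alpha_x(y')$$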
Now reformulate the QP • But first, a pause • I can’t copy any more formulae. • I’m sorry. • It’s making me crazy. • I just can’t. • Please refer to the paper, section 4! • OK, now back to work…
Now reformulate the QP (2) • The dual vars must arise from a legal density; that is, they must lie in the marginal polytope • See equation 9! • That means we must enforce consistency between the pairwise and singleton marginal vars (sketched below) • See equation 10! • If the network is not a forest, those constraints aren't enough • Can triangulate and add new vars, constraints • Or, approximate a relaxation of the polytope using belief propagation
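A tentative sketch of the consistency constraints the slide points at (roughly equations 9–10 of the paper; notation assumed):

$$\sum_{y_i} \mu_x(y_i, y_j) = \mu_x(y_j) \;\; \forall y_j, \; \forall (i,j) \in E, \qquad \sum_{y_i} \mu_x(y_i) = C \;\; \forall i$$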
Experiment #1: Handwriting • 6100 words, 8 chars long, 150 subjects • Each char is 16x8 pixels • Y is the classified word; each Yi is one of the 26 letters • LogReg and CRFs are trained by maximizing the conditional likelihood of labels given features • SVMs and M3N are trained by margin maximization
Experiment #2: Hypertext • The usual collective classification task • Four CS departments; each page is one of course, faculty, student, project, or other • Each page has web & anchor text, represented as a binary feature vector • Each page also has hyperlinks to other examples • The RMN is trained to maximize the conditional probability of labels, given text & links • SVM and M3N are trained with max-margin
Conclusions • M3Ns seem to work great for discriminative tasks • Nice to borrow theoretical results from SVMs • Not much testing so far • Future work should use more complicated models and problems • Future presentations should be done in LaTeX, not PowerPoint