Learning for Structured Prediction: Overview of the Material
Outline
• Types of structures considered
• Generative vs discriminative
• Global discriminative vs local discriminative
• Decoding: at testing vs at learning; methods for decoding
• Predefined features vs latent features
• Red italics mark illustrations with specific methods; some points are oversimplified
Types of Structures
• Sequences: chain CRFs, HMMs, (chain-type) M3Ns, ...
• Trees:
• Constituency trees: weighted CFGs (including LA-PCFGs), left-corner/shift-reduce parsers (the MaxEnt parser, ISBN parser, ...)
• Dependency structures: MST parser, Nivre's shift-reduce parser, ...
• Rankings: PRank (today)
• Not considered: DAGs (e.g., some semantic representations), bipartite graphs (machine translation), or more general graphs ...
Generative vs Discriminative
• Discriminative: CRFs, MEMMs, structured perceptron, Max-Margin Markov Networks (M3Ns), ...
• Learn a mapping from x to y, so that the expected error is minimal
• Pros:
• model what you actually care about
• complex features of x are easy to integrate
• different error functions can be considered
• fewer assumptions (and therefore better asymptotic performance)
• Generative: HMMs, PCFGs (including LA-PCFGs), ...
• Score how likely the combination of input and output is, i.e., model the joint P(x, y)
• Pros:
• easier to learn (if everything is observable, the ML parameters are just normalized counts; see the sketch below)
• "cleaner" semi-supervised learning: select the parameters to maximize the likelihood, which for unlabeled data means maximizing P(x)
• often better with small datasets
• some approaches care about P(x | y) (speech recognition, statistical machine translation, ...)
• arguably preferable with latent variables
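To make the "ML parameters are normalized counts" point concrete, here is a minimal Python sketch of supervised HMM estimation; the toy data and function name are hypothetical illustrations, not taken from the slides.

```python
from collections import defaultdict

def train_hmm(tagged_sentences):
    """Supervised HMM training: ML parameters are just normalized counts."""
    trans = defaultdict(lambda: defaultdict(float))  # counts, then P(tag' | tag)
    emit = defaultdict(lambda: defaultdict(float))   # counts, then P(word | tag)
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1.0
            emit[tag][word] += 1.0
            prev = tag
        trans[prev]["</s>"] += 1.0
    # normalize the counts into conditional probabilities
    for table in (trans, emit):
        for counts in table.values():
            total = sum(counts.values())
            for k in counts:
                counts[k] /= total
    return trans, emit

# toy, made-up data for illustration only
data = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
        [("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]
transitions, emissions = train_hmm(data)
print(transitions["DT"]["NN"], emissions["NN"]["dog"])  # 1.0 0.5
```

No optimization is needed: counting and normalizing already gives the maximum-likelihood parameters when all variables are observed.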
Global Discr. vs Local Discr.
• Local (distributions over small decisions): MEMMs, SVM decision classifiers in Nivre's shift-reduce parser
• Pros:
• no real decoding at training time (cheap learning)
• complex features of y (the already-predicted part of the structure) can be integrated easily (about training! still need to decode at testing)
• Cons:
• mismatch between test and train modes: rely on true features in training and on predicted ones in testing
• label bias (cannot damp down an unlikely transition if the number of outgoing transitions from a state is not sufficiently large)
• Global (distributions over entire sequences): structured perceptron, CRFs, M3Ns (model: MST parser)
• Pros:
• theoretically much cleaner, and in practice works better
• Cons:
• decoding at training time (+ the partition function for CRFs); but approximate learning methods exist (see the perceptron sketch below)
• learning can be very problematic if complex features of y are used
• Both models require decoding at testing. Decoding does not really depend on the training criterion but on the features of y
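As an illustration of global discriminative learning with decoding inside the training loop, here is a hedged sketch of the structured perceptron; `features(x, y)` and `decode(x, w)` are hypothetical placeholders for a task-specific feature map and argmax inference (e.g., Viterbi).

```python
from collections import defaultdict

def structured_perceptron(train_data, features, decode, epochs=10):
    """Global discriminative learning: run inference (decoding) inside the
    training loop and update the weights whenever the predicted structure
    differs from the gold one. `features(x, y)` returns a dict of feature
    counts; `decode(x, w)` returns argmax_y of w . f(x, y). Both are
    hypothetical, task-specific callables."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_gold in train_data:
            y_pred = decode(x, w)          # decoding at training time
            if y_pred != y_gold:
                for f, v in features(x, y_gold).items():
                    w[f] += v              # promote features of the gold structure
                for f, v in features(x, y_pred).items():
                    w[f] -= v              # demote features of the predicted structure
    return w
```

The contrast with a local model is exactly the `decode(x, w)` call: a locally normalized classifier would instead be trained on isolated decisions conditioned on the true history.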
Specific learning criteria
• CRFs
• Maximize the conditional log-likelihood of the training set: sum_i log P(y(i) | x(i))
• Perceptron
• Ensure separability on the training set (with a large margin in some variations, e.g., ALMA): rank the correct structure above incorrect ones
• Max-Margin Markov Networks (M3Ns)
• Separate the training set with the maximal margin (sensitive to the error); a structured hinge sketch follows below
• For every labeled example (x(i), y(i)): w . f(x(i), y(i)) >= w . f(x(i), y) + L(y(i), y) for every structure y, where L is some loss function (e.g., Hamming distance for sequences, measuring how many labels do not match)
• "Wrong sequences with small errors should be penalized less than those with more errors"
• SVM-Struct, boosting, ...
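Margin constraints of this kind are often turned into a margin-rescaled (structured hinge) loss computed with loss-augmented decoding. The sketch below assumes a Hamming loss over label sequences and hypothetical `score` and `loss_augmented_decode` callables; it only shows the shape of the per-example objective, not the exact M3N optimization.

```python
def hamming_loss(y_gold, y_pred):
    """Number of positions where two label sequences disagree."""
    return sum(a != b for a, b in zip(y_gold, y_pred))

def structured_hinge(x, y_gold, w, score, loss_augmented_decode):
    """One-example margin-rescaled hinge: incorrect structures should score
    below the gold one by at least their loss, so wrong sequences with small
    errors are penalized less than those with large errors.
    `score(x, y, w)` and `loss_augmented_decode(x, y_gold, w)` are
    hypothetical callables; the latter solves
    argmax_y [score(x, y, w) + hamming_loss(y_gold, y)]."""
    y_hat = loss_augmented_decode(x, y_gold, w)
    violation = (score(x, y_hat, w) + hamming_loss(y_gold, y_hat)
                 - score(x, y_gold, w))
    return max(0.0, violation)
```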
Decoding at training vs testing: examples • Different combinations are possible ....
Inference (argmax)
• Simple dependencies in y:
• Viterbi to find the most likely sequence (or Chu-Liu-Edmonds for the MST parser); see the sketch below
• or marginal decoding to find the most likely label for every "position"
• Complex dependencies:
• beam or greedy search (or some smarter search methods)
• reformulate the inference problem as an integer linear program and use methods known from ILP
• (We do not care here when the inference is used: at training, at testing, or at both)
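As a concrete instance of argmax inference with chain-structured dependencies, here is a minimal Viterbi sketch. It assumes `trans` and `emit` dictionaries of probabilities such as those produced by the HMM sketch earlier; this is just one possible parameterization of the edge and node scores.

```python
import math

def viterbi(words, tags, trans, emit):
    """Find the highest-scoring tag sequence for a chain model.
    `trans[prev][tag]` and `emit[tag][word]` are assumed to be probabilities;
    unseen events get a large negative log-score instead of log(0)."""
    def lp(p):
        return math.log(p) if p > 0 else -1e9
    # chart[i][t] = best log-score of a prefix ending with tag t at position i
    chart = [{t: lp(trans["<s>"].get(t, 0)) + lp(emit[t].get(words[0], 0))
              for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        chart.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags,
                            key=lambda p: chart[i - 1][p] + lp(trans[p].get(t, 0)))
            chart[i][t] = (chart[i - 1][best_prev] + lp(trans[best_prev].get(t, 0))
                           + lp(emit[t].get(words[i], 0)))
            back[i][t] = best_prev
    # follow the back-pointers from the best final tag
    last = max(tags, key=lambda t: chart[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```

Working in log-space keeps the dynamic program numerically stable; the same chart structure with sums instead of maxes gives marginal decoding.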
Latent Variables vs Explicit Features
• Explicit features (most of the models we considered: CRFs, MEMMs, etc.)
• Pros:
• mostly convex optimization (no local minima)
• cheaper to learn
• Cons:
• the model is only as good as its features: extensive feature engineering is needed
• non-local dependencies in y are often necessary
• Latent variable models (LA-PCFGs, ISBNs)
• Pros:
• learn how to propagate relevant information (learn complex features from simple ones)
• can learn a model with simple decompositions over an extended y -- efficient decoding
• the latent representation (e.g., extended parsing states or an extended grammar) can potentially be useful in other tasks -- multi-task learning
• Cons:
• non-convex optimization -- need to avoid local minima (tricky)
• more expensive to train
Last bits
• Term paper: due Mar 31, but send me ideas, outlines, and drafts well before the deadline (soon!)
• Feedback on the content would be very much appreciated (I am preparing a lecture class with a similar set of topics)
• Thanks for participating!!!