A Review of Sequential Supervised Learning Guohua Hao School of Electrical Engineering and Computer Science Oregon State University
Outline • Sequential supervised learning • Evaluation criteria • Methods for sequential supervised learning • Open problems
Traditional supervised learning • Given training examples (x, y): x – observation feature vector, y – class label • Independent and identically distributed (i.i.d.) assumption
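For reference, the standard setting can be written compactly (a brief formalization added here, not from the original slides):

```latex
% Training data: i.i.d. draws from a fixed joint distribution
D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}, \qquad (\mathbf{x}_i, y_i) \overset{\text{i.i.d.}}{\sim} P(\mathbf{x}, y)
% Goal: learn a classifier that predicts the label from the feature vector
h : \mathcal{X} \rightarrow \mathcal{Y}
```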
Sequential learning tasks • Part-of-speech tagging • Assign a part-of-speech tag to each word in a sentence • Example “I am taking my examination” <pron verb verb pron noun>
Sequential supervised learning • Given training examples (X, Y): X – observation sequence, Y – label sequence • Goal – learn a classifier that maps an observation sequence to a label sequence (formalized below)
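One compact way to write this setting (notation chosen here for illustration):

```latex
% Each training example is a pair of equal-length sequences
(\mathbf{X}_i, \mathbf{Y}_i), \quad \mathbf{X}_i = (\mathbf{x}_{i,1}, \ldots, \mathbf{x}_{i,T_i}), \quad \mathbf{Y}_i = (y_{i,1}, \ldots, y_{i,T_i})
% Goal: a classifier mapping whole observation sequences to whole label sequences
h : \mathcal{X}^{*} \rightarrow \mathcal{Y}^{*}
```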
Applications • Protein secondary structure prediction • Named entity recognition • FAQ document segmentation • Optical character recognition for words • NetTalk task
Outline • Sequential supervised learning • Evaluation criteria • Methods for sequential supervised learning • Open problems
Modeling accuracy • Two kinds of relations • Horizontal relations (among neighboring labels): directed or undirected graphical model • Vertical relations (between labels and observations): generative or discriminative model; feature representation [Figure: chain-structured graphical model over labels y_t and observations x_t]
Feature representation • Arbitrary non-independent features of observation sequence • Overlapping features • Global features • Very high or infinite dimensional feature spaces – kernel method
Computational efficiency • Training efficiency • Testing efficiency • Scalability to a large number of features and labels
Generalization bound • Margin-maximizing property
Outline • Sequential supervised learning • Evaluation criteria • Methods for sequential supervised learning • Open problems
Sliding window/recurrent sliding window • Sliding window • Vertical relations ok • Horizontal relations captured only indirectly, through nearby x's • Recurrent sliding window • Both vertical and horizontal relations ok • Only one direction of horizontal relations (left to right); see the sketch below [Figures: sliding window and recurrent sliding window over labels y_t and observations x_t]
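A minimal sketch of the recurrent sliding window idea, assuming a generic classifier object `clf` with a `.predict()` method and a hypothetical `featurize` helper (both names are placeholders, not from the original):

```python
def predict_recurrent_window(x, clf, featurize, half_width=1):
    """Recurrent sliding window: classify each position left to right,
    feeding the previous *predicted* label back in as an input feature.
    `clf` is any standard supervised classifier; `featurize` turns
    (observation window, previous label) into a feature vector."""
    y_hat = []
    for t in range(len(x)):
        lo, hi = max(0, t - half_width), min(len(x), t + half_width + 1)
        window = x[lo:hi]                        # nearby observations (vertical relations)
        prev = y_hat[t - 1] if t > 0 else None   # single left-to-right horizontal relation
        y_hat.append(clf.predict([featurize(window, prev)])[0])
    return y_hat
```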
Evaluation • Feature representation • arbitrary non-independent observation features ok • Computational efficiency • Depends on the classical supervised learning algorithm used
Generative model – Hidden Markov Models (HMM) • Representation of the joint distribution • Extension of the naïve Bayes network • Two kinds of distributions • State transition probability • State-specific observation distribution [Figure: HMM graphical model over labels y_t and observations x_t]
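The joint distribution represented by the HMM factorizes into exactly these two kinds of distributions:

```latex
P(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} P(y_t \mid y_{t-1}) \, P(\mathbf{x}_t \mid y_t),
\qquad P(y_1 \mid y_0) \equiv P(y_1) \ \text{(initial state distribution)}
```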
Evaluation • Computational efficiency • Maximum likelihood (ML) training – efficient • Prediction: per-sequence loss – Viterbi, O(TN²); per-label loss – forward-backward, O(TN²)
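A sketch of the Viterbi recursion in log space, which gives the O(TN²) per-sequence prediction above (array layout and variable names are my own):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely label sequence for an HMM in O(T * N^2).
    obs : length-T list of observation indices
    pi  : (N,) initial state probabilities
    A   : (N, N) transitions, A[i, j] = P(y_t = j | y_{t-1} = i)
    B   : (N, M) emissions,   B[j, o] = P(x_t = o | y_t = j)
    """
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))             # best log-score of any path ending in each state
    back = np.zeros((T, N), dtype=int)   # back-pointers to the best previous state
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[T - 1].argmax())]              # trace back from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```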
Evaluation • Modeling accuracy • Vertical relations modeled in a generative way; ML training leads to suboptimal prediction accuracy • Feature representation • Conditional independence assumption • Arbitrary non-independent observation features not allowed
Discriminative graphical model • Model the conditional distribution • Extension of logistic regression • Arbitrary non-independent observation features allowed • Typical methods • Maximum entropy Markov model (MEMM) • Conditional random fields (CRF)
Maximum entropy Markov models • Per-state transition distribution • Maximum entropy formulation [Figure: MEMM graphical model over labels y_t and observations x_t]
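Each per-state transition distribution is a maximum-entropy (logistic-regression-style) model; a common way to write it, with a normalizer per source state:

```latex
P(y_t \mid y_{t-1}, \mathbf{x}) =
\frac{1}{Z(y_{t-1}, \mathbf{x}, t)}
\exp\Big( \sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
P(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{T} P(y_t \mid y_{t-1}, \mathbf{x})
```

The per-source-state normalizer Z(y_{t-1}, x, t) is what gives rise to the label bias problem discussed on the next slide.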
Evaluation • Training – generalized iterative scaling; prediction – Viterbi or forward-backward • Label bias problem • Drawback of directed conditional Markov models • Per-source-state normalization • States with low-entropy transition distributions pay little attention to the observation • Favors the more frequent label sequences
Conditional Random Fields • Undirected graphical model • Markov random field globally conditioned on X • Two kinds of features • Dependence between neighboring labels, conditioned on x • Dependence of the current label on x [Figure: chain-structured CRF over labels y_t, globally conditioned on x]
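The conditional distribution of a chain-structured CRF, with a single normalizer Z(x) over whole label sequences rather than per-state normalization:

```latex
p(\mathbf{y} \mid \mathbf{x}) =
\frac{1}{Z(\mathbf{x})}
\exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```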
Training CRF • Loss function: per-sequence vs. per-label • Optimization methods • Improved iterative scaling – slow • General-purpose convex optimization – faster
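For the per-sequence (conditional log-likelihood) loss, the gradient these optimizers use is the familiar empirical-minus-expected feature counts, with the pairwise marginals computed by forward-backward:

```latex
\frac{\partial \ell}{\partial \lambda_k}
= \sum_{i} \sum_{t} f_k\big(y^{(i)}_{t-1}, y^{(i)}_{t}, \mathbf{x}^{(i)}, t\big)
- \sum_{i} \sum_{t} \sum_{y',\, y''} p\big(y_{t-1} = y', y_t = y'' \mid \mathbf{x}^{(i)}\big)\, f_k\big(y', y'', \mathbf{x}^{(i)}, t\big)
```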
Problems with rich feature representations • A large number of features • Slows down parameter estimation • Need to eliminate redundancy • More expressive features • Improve prediction accuracy • Combinatorial explosion • Incorporate only the necessary combinations
Feature induction • Iteratively construct feature conjunctions • Candidate features: atomic features, and conjunctions of atomic and already-incorporated features • Select candidates by the maximum increase in conditional log-likelihood
Feature induction (cont'd) • Gradient tree boosting • Potential function = sum of gradient trees • Path through a tree – feature combination • Value at a leaf – weight • Significant improvement in training speed and prediction accuracy
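Schematically, the potential (the weighted feature sum in the CRF above) is replaced by an additive ensemble of regression trees grown by functional gradient ascent (a sketch of the idea, not the exact notation of the original):

```latex
\Psi_t(y_{t-1}, y_t, \mathbf{x}) = \exp\big( F(y_{t-1}, y_t, \mathbf{x}, t) \big),
\qquad
F = \sum_{m=1}^{M} h_m, \quad h_m \ \text{a regression tree fit to the functional gradient of the log-likelihood}
```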
More evaluation • Does not scale to a large number of classes • Forward-backward • Viterbi • No generalization bound
Traditional discriminant function • Directly measures the compatibility between label and observation • Simple linear discriminant function • Classifier
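In the binary case this is just (standard notation, added here for reference):

```latex
f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b,
\qquad
h(\mathbf{x}) = \operatorname{sign}\big(f(\mathbf{x})\big)
```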
Support vector machines • Maximize the margin of classification confidence • L1-norm soft-margin SVM formulation • Functional margin • Slack variables
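The L1-norm soft-margin formulation referred to above, with functional margin y_i f(x_i) and slack variables ξ_i:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
y_i\big(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\big) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \ldots, n
```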
Dual formulation • Lagrange multiplier • Dual optimization problem • Dual discriminant function
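The dual problem and the dual discriminant function, in which the data appear only through inner products:

```latex
\max_{\boldsymbol{\alpha}} \;\; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle
\quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0
```
```latex
f(\mathbf{x}) = \sum_i \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b
```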
Kernel trick • Kernel function • Avoid explicit feature representation • Feature space of very high / infinite dimension • Non-linear discriminant function/classifier
Voted perceptron • Perceptron – online algorithm • Voted perceptron – converts it to a batch algorithm • Deterministic leave-one-out method • Predicts by majority voting • Supports the kernel trick and is computationally efficient
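A compact sketch of the voted perceptron (after Freund and Schapire); labels are assumed to be ±1 and X a NumPy array, both assumptions of this illustration:

```python
import numpy as np

def train_voted_perceptron(X, y, epochs=10):
    """Voted perceptron sketch: keep every intermediate weight vector
    together with its survival count c."""
    w, c, perceptrons = np.zeros(X.shape[1]), 0, []
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:            # mistake: retire the current perceptron
                perceptrons.append((w.copy(), c))
                w, c = w + y_i * x_i, 1
            else:
                c += 1                          # survived one more example
    perceptrons.append((w, c))
    return perceptrons

def predict_voted(perceptrons, x):
    """Survival-count-weighted majority vote of the stored perceptrons."""
    vote = sum(c * np.sign(w @ x) for w, c in perceptrons)
    return 1 if vote >= 0 else -1
```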
Extension to multi-class • Discriminant function • Classifier • Functional margin
Multi-class SVMs • L1 norm soft-margin SVMs
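One standard way to write the multi-class L1-norm soft-margin problem (a Crammer and Singer style formulation, shown here for reference):

```latex
\min_{\{\mathbf{w}_y\},\, \boldsymbol{\xi}} \;\; \frac{1}{2} \sum_{y} \|\mathbf{w}_y\|^2 + C \sum_{i} \xi_i
\quad \text{s.t.} \quad
\langle \mathbf{w}_{y_i}, \mathbf{x}_i \rangle - \langle \mathbf{w}_{y}, \mathbf{x}_i \rangle \ge 1 - \xi_i
\;\; \forall\, y \ne y_i, \;\; \xi_i \ge 0
```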
Discriminant function in sequential supervised learning • Treat as a multi-class problem • Exponentially large number of classes • Learning the discriminant function • Voted perceptron • Support vector machines
Feature representation • Arbitrary non-independent observation features • Feature space of high/infinite dimension • Assumptions on the feature formulation • Chain-structured • Additive
Voted perceptron • Update step (shown below) • Average the perceptrons • Prediction – Viterbi algorithm • Computationally efficient
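The update step: decode with Viterbi under the current weights, then move the weights toward the correct sequence's features and away from the decoded one:

```latex
\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \, \langle \mathbf{w}, \Phi(\mathbf{x}_i, \mathbf{y}) \rangle
\quad \text{(Viterbi)}, \qquad
\mathbf{w} \leftarrow \mathbf{w} + \Phi(\mathbf{x}_i, \mathbf{y}_i) - \Phi(\mathbf{x}_i, \hat{\mathbf{y}})
\quad \text{if } \hat{\mathbf{y}} \ne \mathbf{y}_i
```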
SVMs with loss function • Re-scale slack variables • Re-scale margin • Higher loss requires larger margin (more confidence)
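The two ways of folding a loss L(y_i, y) into the margin constraints, written for all y ≠ y_i:

```latex
\text{slack rescaling:} \quad
\langle \mathbf{w}, \Phi(\mathbf{x}_i, \mathbf{y}_i) - \Phi(\mathbf{x}_i, \mathbf{y}) \rangle \ge 1 - \frac{\xi_i}{L(\mathbf{y}_i, \mathbf{y})}
```
```latex
\text{margin rescaling:} \quad
\langle \mathbf{w}, \Phi(\mathbf{x}_i, \mathbf{y}_i) - \Phi(\mathbf{x}_i, \mathbf{y}) \rangle \ge L(\mathbf{y}_i, \mathbf{y}) - \xi_i
```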
Hidden Markov Support Vector Machines • Sparseness assumption on support vectors • Small number of non-zero dual variables • Small number of active constraints • Iteratively add new SVs • Working set containing the current SVs • Add the candidate SV that violates its margin constraint most • Strict increase of the dual objective function
Hidden Markov Support Vector Machines (cont'd) • Add a constraint only when its margin violation exceeds a given tolerance • Upper bound on the dual objective • Polynomial number of SVs at convergence • Dual objective optimization remains tractable • Close to the optimal solution (see the sketch below)
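An illustrative working-set loop in the spirit of this training procedure (a sketch, not the exact published algorithm; the callbacks `most_violated` and `optimize_dual` and the tolerance `tol` are hypothetical placeholders):

```python
def hmsvm_working_set(examples, most_violated, optimize_dual, tol=1e-3):
    """Working-set sketch: repeatedly add the label sequence whose margin
    constraint is most violated, then re-solve the dual on the working set.
    `most_violated(x, y, alphas)` returns that sequence and its violation;
    `optimize_dual(working_set)` returns updated dual variables."""
    working_set = {i: set() for i in range(len(examples))}
    alphas = {}
    changed = True
    while changed:
        changed = False
        for i, (x, y) in enumerate(examples):
            y_bad, violation = most_violated(x, y, alphas)
            if violation > tol:                        # violated beyond the tolerance
                working_set[i].add(tuple(y_bad))       # new support-vector candidate
                alphas = optimize_dual(working_set)    # strict increase of the dual objective
                changed = True
    return alphas
```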
Max Margin Markov Networks • Exploit the structure of the output space • Convert to a polynomial number of constraints • Rescale the margin with the Hamming loss (per-label zero-one loss)
Structure decomposition • Loss function: decomposes over individual labels (Hamming loss) • Features: decompose over the edges of the chain
Factored dual • Objective function rewritten in terms of marginal dual variables • Constraints and consistency checks on the marginals • Polynomial size
Evaluation • Arbitrary non-independent observation features • Kernel trick • Feature space of very high/infinite dimension • Complex non-linear discriminant function • Margin maximization – generalization bound • Scalability unclear
Open problems • Training CRFs faster and making them practical • Effect of inference during training • Scalability of discriminant function methods • Other algorithms to learn the discriminant function • Dealing with missing values • Novel sequential supervised learning algorithms