A Review of Sequential Supervised Learning Guohua Hao School of Electrical Engineering and Computer Science Oregon State University
Outline • Sequential supervised learning • Evaluation criteria • Methods for sequential supervised learning • Open problems
Traditional supervised learning • Given training examples (x, y) • x – observation feature vector • y – class label • Examples are assumed independent and identically distributed (i.i.d.)
Sequential learning tasks • Part-of-speech tagging • Assign a part-of-speech tag to each word in a sentence • Example “I am taking my examination” <pron verb verb pron noun>
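Sequential supervised learning • Given training examples (x, y) • x = (x_1, ..., x_T) – observation sequence • y = (y_1, ..., y_T) – label sequence • Goal – learn a classifier that maps an observation sequence to a label sequence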
Applications • Protein secondary structure prediction • Named entity recognition • FAQ document segmentation • Optical character recognition for words • NETtalk task
Outline • Sequential supervised learning • Evaluation criteria • Methods for sequential supervised learning • Open problems
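Modeling accuracy • Two kinds of relations • Horizontal relations (among neighboring labels) – directed or undirected graphical model • Vertical relations (between labels and observations) – generative or discriminative model; feature representation [Figure: labels y_{t-1}, y_t, y_{t+1} above observations x_{t-1}, x_t, x_{t+1}]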
Feature representation • Arbitrary non-independent features of observation sequence • Overlapping features • Global features • Very high or infinite dimensional feature spaces – kernel method
Computational efficiency • Training efficiency • Testing efficiency • Scalability to large number of features and labels
Generalization bound • Margin-maximizing property
Outline • Sequential supervised learning • Evaluation criteria • Methods for sequential supervised learning • Open problems
Sliding window/recurrent sliding window • Sliding window • Vertical relations OK • Horizontal relations captured only when they depend on nearby x's • Recurrent sliding window • Both vertical and horizontal relations OK • Horizontal relations captured in only one direction [Figures: one graphical model per method, each over labels y_{t-1}, y_t, y_{t+1} and observations x_{t-1}, x_t, x_{t+1}]
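The two window schemes above can be sketched as follows. This is a minimal illustration, not the construction from any particular paper: the padding token `PAD`, the `half_width` parameter, and the `predict` callback (any classical classifier wrapped as a function) are assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

PAD = "<pad>"  # hypothetical token for positions outside the sequence


def sliding_windows(xs: Sequence[str], half_width: int = 1) -> List[Tuple[str, ...]]:
    """Return one window of 2*half_width + 1 observations per position."""
    padded = [PAD] * half_width + list(xs) + [PAD] * half_width
    width = 2 * half_width + 1
    return [tuple(padded[t:t + width]) for t in range(len(xs))]


def recurrent_sliding_window(
    xs: Sequence[str],
    predict: Callable[[Tuple[str, ...]], str],
    half_width: int = 1,
) -> List[str]:
    """Label left to right, feeding the previous predicted label back in.

    Only one direction of horizontal relation is captured: y_t can depend
    on y_{t-1}, but not on y_{t+1}.
    """
    labels: List[str] = []
    prev = PAD
    for window in sliding_windows(xs, half_width):
        y = predict(window + (prev,))  # last feature is the previous label
        labels.append(y)
        prev = y
    return labels
```

Each window (plus the fed-back label in the recurrent case) is an ordinary fixed-length feature vector, so any classical supervised learner can be trained on it.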
Evaluation • Feature representation • arbitrary non-independent observation features ok • Computational efficiency • Depends on the classical supervised learning algorithm used
Generative model – Hidden Markov Models (HMM) • Representation of joint distribution • Extension of Naïve Bayesian networks • Two kinds of distributions • State transition probability • State-specific observation distribution [Figure: labels y_{t-1}, y_t, y_{t+1} above observations x_{t-1}, x_t, x_{t+1}]
Evaluation • Computational efficiency • Training – maximize the likelihood (ML); efficient • Prediction (sequence length T, N labels) • Per-sequence loss – Viterbi, O(TN^2) • Per-label loss – forward-backward, O(TN^2)
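To make the O(TN^2) cost concrete, here is a minimal Viterbi sketch for an HMM in log space. The dictionary-based parameter layout and the two-state toy model are assumptions chosen for readability, not taken from the slides.

```python
import math
from typing import Dict, List, Sequence


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    log_start: Dict[str, float],
    log_trans: Dict[str, Dict[str, float]],
    log_emit: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most likely state sequence for the observations."""
    # delta[s]: best log-probability of any path that ends in state s
    delta = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backpointers = []
    for x in obs[1:]:  # T - 1 steps
        new_delta, ptr = {}, {}
        for s in states:  # N target states, each scanning N source states
            best = max(states, key=lambda r: delta[r] + log_trans[r][s])
            new_delta[s] = delta[best] + log_trans[best][s] + log_emit[s][x]
            ptr[s] = best
        backpointers.append(ptr)
        delta = new_delta
    # Follow the back pointers from the best final state
    path = [max(states, key=lambda s: delta[s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def _log_table(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


# Hypothetical toy model: N = noun-like state, V = verb-like state
STATES = ["N", "V"]
LOG_START = {"N": math.log(0.8), "V": math.log(0.2)}
LOG_TRANS = _log_table({"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}})
LOG_EMIT = _log_table({"N": {"dogs": 0.7, "run": 0.3}, "V": {"dogs": 0.1, "run": 0.9}})

print(viterbi(["dogs", "run"], STATES, LOG_START, LOG_TRANS, LOG_EMIT))  # ['N', 'V']
```

The two nested loops over states inside the loop over time steps give the T x N x N cost quoted above.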
Evaluation • Modeling accuracy • Vertical relations modeled in a generative way • ML training can lead to suboptimal prediction accuracy • Feature representation • Conditional independence assumption • Arbitrary non-independent observation features not allowed
Discriminative graphical model • Model the conditional distribution • Extension of logistic regression • Arbitrary non-independent observation features allowed • Typical methods • Maximum entropy Markov model (MEMM) • Conditional random fields (CRF)
Maximum entropy Markov models • Per-state transition distribution • Maximum entropy formulation [Figure: labels y_{t-1}, y_t, y_{t+1} above observations x_{t-1}, x_t, x_{t+1}]
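For reference, the per-state transition distribution is commonly written in the following maximum entropy (exponential) form. The symbols f_k (feature functions), \lambda_k (weights, one set per source state) and Z (normalizer) are notation assumed here, not recovered from the slide.

```latex
P(y_t \mid y_{t-1}, x_t)
  = \frac{1}{Z(y_{t-1}, x_t)}
    \exp\Big( \sum_k \lambda_k^{(y_{t-1})} \, f_k(x_t, y_t) \Big),
\qquad
Z(y_{t-1}, x_t)
  = \sum_{y'} \exp\Big( \sum_k \lambda_k^{(y_{t-1})} \, f_k(x_t, y') \Big)
```

The normalizer Z is computed separately for each source state y_{t-1}, so the probability mass leaving each state always sums to one regardless of the observation.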
Evaluation • Training – generalized iterative scaling • Prediction – Viterbi or forward-backward • Label bias problem • Drawback of directed conditional Markov models • Per-source-state normalization • States with low-entropy transition distributions pay little attention to the observation • Favors the more frequent sequences
Conditional Random Fields • Undirected graphical model • Markov random field globally conditioned on x • Two kinds of features • Dependence between neighboring labels, given x • Dependence between the current label and x [Figure: labels y_{t-1}, y_t, y_{t+1} with the observation sequence x]
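A common way to write the linear-chain CRF with these two kinds of features is shown below. The names f_k (label-pair features), g_j (single-label features), and the weights \lambda_k, \mu_j are assumed notation for this sketch.

```latex
P(y \mid x)
  = \frac{1}{Z(x)}
    \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x)
            + \sum_{t} \sum_{j} \mu_j \, g_j(y_t, x) \Big),
\qquad
Z(x)
  = \sum_{y'} \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x)
            + \sum_{t} \sum_{j} \mu_j \, g_j(y'_t, x) \Big)
```

Unlike the per-state normalizer of a directed conditional model, Z(x) sums over entire label sequences, which is what removes the label bias problem.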
Training CRF • Loss function – per sequence vs. per label • Optimization methods • Improved iterative scaling – slow • General-purpose convex optimization – improved training speed
Problems with rich feature representation • Large number of features • Slows down parameter estimation • Eliminate redundancy • More expressive features • Improve prediction accuracy • Combinatorial explosion • Incorporate only the necessary combinations
Feature induction • Iteratively construct feature conjunctions • Candidate features • Atomic features • Conjunctions of atomic and already incorporated features • Select candidates giving the maximum increase in conditional log-likelihood
Feature induction (cont’d) • Gradient tree boosting • Potential function = sum of gradient trees • Path in a tree – feature combination • Value in leaf – weight • Significant improvement in training speed and prediction accuracy
More evaluation • Does not scale to a large number of classes • Forward-backward • Viterbi • No generalization bound
Traditional discriminant function • Directly measure compatibility between label and observation • Simple linear discriminant function • Classifier
Support vector machines • Maximize margin of classification confidence • L1-norm soft-margin SVMs formulation • Functional margin • Slack variable
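The L1-norm soft-margin formulation is usually stated as below for binary labels y_i in {-1, +1}. The symbols w, b, C and \xi_i are the conventional ones and are assumed here, since the slide's own formula did not survive extraction.

```latex
\min_{w,\, b,\, \xi} \;\; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i \,( w \cdot x_i + b ) \;\ge\; 1 - \xi_i,
\quad \xi_i \ge 0, \quad i = 1, \dots, n
```

Here y_i (w \cdot x_i + b) is the functional margin of example i, and the slack variable \xi_i measures by how much that margin falls short of 1.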
Dual formulation • Lagrange multiplier • Dual optimization problem • Dual discriminant function
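For the standard binary soft-margin SVM, the dual problem and the resulting discriminant function take the following form. The Lagrange multipliers \alpha_i and the penalty C are conventional notation assumed for this sketch.

```latex
\max_{\alpha} \;\; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
      \alpha_i \alpha_j \, y_i y_j \, \langle x_i, x_j \rangle
\qquad \text{subject to} \qquad
0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0

f(x) = \sum_{i=1}^{n} \alpha_i \, y_i \, \langle x_i, x \rangle + b
```

The training data enter only through inner products, which is what allows a kernel function to be substituted for \langle x_i, x_j \rangle.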
Kernel trick • Kernel function • Avoid explicit feature representation • Feature space of very high/ infinite dimension • Non-linear discriminant function/classifier
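A small check of the kernel trick, using the homogeneous degree-2 polynomial kernel as an assumed example: the kernel value equals the inner product in an explicit feature space of all pairwise products, without ever building that space.

```python
from itertools import product
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x: Sequence[float], z: Sequence[float]) -> float:
    """k(x, z) = (x . z)^2, computed in the original d-dimensional space."""
    return dot(x, z) ** 2


def poly2_features(x: Sequence[float]) -> List[float]:
    """Explicit feature map: all d^2 ordered pairwise products x_i * x_j."""
    return [a * b for a, b in product(x, repeat=2)]


x, z = [1, 2, 3], [4, 5, 6]
print(poly2_kernel(x, z))                             # 1024
print(dot(poly2_features(x), poly2_features(z)))      # 1024
```

The kernel costs O(d) per evaluation while the explicit map costs O(d^2); for kernels such as the Gaussian, the implicit feature space is infinite-dimensional and only the kernel form is computable.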
Voted perceptron • Perceptron – online algorithm • Voted perceptron – converts it to a batch algorithm • Deterministic leave-one-out method • Predict by majority voting • Supports the kernel trick; computationally efficient
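A minimal sketch of the voted perceptron for binary labels in {-1, +1}, in its linear (non-kernelized) form. The function names, the `epochs` parameter, and the toy data with a constant bias feature are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (feature vector, label in {-1, +1})


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def sign(v: float) -> int:
    return 1 if v > 0 else -1


def train_voted_perceptron(data: List[Example], epochs: int = 10):
    """Return a list of (weight vector, survival count) pairs."""
    w = [0.0] * len(data[0][0])
    count = 0
    voters = []
    for _ in range(epochs):
        for x, y in data:
            if y * dot(w, x) <= 0:  # mistake: retire w, start a new vector
                if count > 0:
                    voters.append((w, count))
                w = [wi + y * xi for wi, xi in zip(w, x)]
                count = 1
            else:  # correct: current vector survives one more example
                count += 1
    voters.append((w, count))
    return voters


def predict(voters, x: Sequence[float]) -> int:
    """Majority vote, each vector weighted by how long it survived."""
    return sign(sum(c * sign(dot(w, x)) for w, c in voters))


# Hypothetical linearly separable data; the last feature is a constant bias
DATA = [([1, 2, 1], 1), ([2, 1, 1], 1), ([-1, -2, 1], -1), ([-2, -1, 1], -1)]
VOTERS = train_voted_perceptron(DATA)
print([predict(VOTERS, x) for x, _ in DATA])  # [1, 1, -1, -1]
```

Because training and prediction touch the data only through inner products, replacing `dot` with a kernel function gives the kernelized variant.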
Extension to multi-class • Discriminant function • Classifier • Functional margin
Multi-class SVMs • L1 norm soft-margin SVMs
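One widely used multi-class soft-margin formulation is sketched below; whether the slide used exactly this variant is not recoverable from the text. The discriminant F(x, y) = w \cdot \Phi(x, y) and the joint feature map \Phi are assumed notation.

```latex
\min_{w,\, \xi} \;\; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
F(x_i, y_i) - F(x_i, y) \;\ge\; 1 - \xi_i
\quad \forall i, \; \forall y \ne y_i,
\qquad \xi_i \ge 0
```

The left-hand side of the constraint is the multi-class functional margin: the score of the correct class minus the score of a competing class.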
Discriminant function in sequential supervised learning • Treat as multi-class problems • Exponentially large number of classes • Learning the discriminant function • Voted perceptron • Support vector machines
Feature representation • Arbitrary non-independent observation feature • Feature space of high/infinite dimension • Feature formulation assumption • Chain structured • Additive
Voted perceptron • Update step • Average the perceptrons • Prediction– Viterbi algorithm • Computationally efficient
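The update, averaging and Viterbi prediction steps can be sketched as an averaged structured perceptron over chain-structured, additive features. The feature templates (`"emit"`, `"trans"`), the `START` symbol and the one-sentence training set are assumptions made for this illustration.

```python
from collections import defaultdict

START = "<s>"  # hypothetical start-of-sequence label


def features(xs, ys):
    """Chain-structured, additive feature counts Phi(x, y)."""
    phi = defaultdict(float)
    prev = START
    for x, y in zip(xs, ys):
        phi[("emit", y, x)] += 1
        phi[("trans", prev, y)] += 1
        prev = y
    return phi


def decode(xs, labels, w):
    """Viterbi search for the label sequence maximizing w . Phi(x, y)."""
    score = {y: w[("trans", START, y)] + w[("emit", y, xs[0])] for y in labels}
    back = []
    for x in xs[1:]:
        new, ptr = {}, {}
        for y in labels:
            p = max(labels, key=lambda r: score[r] + w[("trans", r, y)])
            new[y] = score[p] + w[("trans", p, y)] + w[("emit", y, x)]
            ptr[y] = p
        back.append(ptr)
        score = new
    path = [max(labels, key=lambda y: score[y])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def train(data, labels, epochs=5):
    """Perceptron updates, returning the average of all weight vectors."""
    w, total, n = defaultdict(float), defaultdict(float), 0
    for _ in range(epochs):
        for xs, ys in data:
            guess = decode(xs, labels, w)
            if guess != list(ys):  # update step: toward gold, away from guess
                for k, v in features(xs, ys).items():
                    w[k] += v
                for k, v in features(xs, guess).items():
                    w[k] -= v
            for k, v in w.items():  # accumulate for averaging
                total[k] += v
            n += 1
    return defaultdict(float, {k: v / n for k, v in total.items()})


LABELS = ["pron", "verb", "noun"]
SENTENCE = ["I", "am", "taking", "my", "examination"]
GOLD = ["pron", "verb", "verb", "pron", "noun"]
W = train([(SENTENCE, GOLD)], LABELS)
print(decode(SENTENCE, LABELS, W))  # ['pron', 'verb', 'verb', 'pron', 'noun']
```

The exponentially many candidate label sequences are never enumerated: the additive chain structure lets Viterbi find the highest-scoring one in O(TN^2) time.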
SVMs with loss function • Re-scale slack variables • Re-scale margin • Higher loss requires larger margin (more confidence)
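The two re-scaling options are commonly written as the following margin constraints, for every training example i and every incorrect labeling y. F is the discriminant function, \xi_i the slack variable, and \Delta(y_i, y) the loss of predicting y instead of y_i; all three symbols are assumed notation.

```latex
\text{Slack re-scaling:} \qquad
F(x_i, y_i) - F(x_i, y) \;\ge\; 1 - \frac{\xi_i}{\Delta(y_i, y)}

\text{Margin re-scaling:} \qquad
F(x_i, y_i) - F(x_i, y) \;\ge\; \Delta(y_i, y) - \xi_i
```

In both cases a labeling with a higher loss must be separated from the correct one more strongly before the constraint is satisfied without slack.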
Hidden Markov Support Vector Machines • Sparseness assumption on support vectors (SVs) • Small number of non-zero dual variables • Small number of active constraints • Iteratively add new SVs • Working set containing current SVs • Candidate SV – the one that violates the margin constraint most • Strict increase of dual objective function
Candidate SV is added only when its margin violation exceeds a fixed tolerance • Upper bound on dual objective • Polynomial number of SVs at convergence • Dual objective optimization tractable • Close to the optimal solution
Max Margin Markov Networks • Exploit the structure of the output space • Convert to a polynomial number of constraints • Rescale margin with Hamming loss (sum of per-label zero-one losses)
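The difference between the Hamming loss and the per-sequence zero-one loss can be seen in a few lines. The function names and the example tag sequences are chosen for this sketch.

```python
from typing import Sequence


def hamming_loss(y_true: Sequence[str], y_pred: Sequence[str]) -> int:
    """Number of positions where the two label sequences disagree."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    return sum(a != b for a, b in zip(y_true, y_pred))


def sequence_zero_one_loss(y_true: Sequence[str], y_pred: Sequence[str]) -> int:
    """1 if the sequences differ anywhere, 0 if they match exactly."""
    return int(list(y_true) != list(y_pred))


gold = ["pron", "verb", "verb", "pron", "noun"]
one_error = ["pron", "verb", "noun", "pron", "noun"]
all_wrong = ["noun", "noun", "pron", "verb", "verb"]

print(hamming_loss(gold, one_error), sequence_zero_one_loss(gold, one_error))  # 1 1
print(hamming_loss(gold, all_wrong), sequence_zero_one_loss(gold, all_wrong))  # 5 1
```

Because the Hamming loss is a sum over positions, it decomposes over the chain in the same way as the features, which is what makes the factored formulation possible.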
Structure Decomposition • Loss function • Feature
Factored dual • Objective function • Constraint and consistency check • Polynomial
Evaluation • Arbitrary non-independent observation features • Kernel trick • Feature space of very high/infinite dimension • Complex non-linear discriminant function • Margin maximizing – generalization bound • Scalability uncertain
Open problems • Training CRFs faster to make them practical • Effect of inference in training • Scalability of discriminant function methods • Other algorithms to learn the discriminant function • Dealing with missing values • Novel sequential supervised learning algorithms