
A Review of Sequential Supervised Learning


Presentation Transcript


  1. A Review of Sequential Supervised Learning Guohua Hao School of Electrical Engineering and Computer Science Oregon State University

  2. Outline • Sequential supervised learning • Evaluation criteria • Methods for sequential supervised learning • Open problems

  3. Traditional supervised learning • Given training examples (x, y) x – observation feature vector y – class label • Independent and identically distributed (i.i.d.) assumption on the examples

  4. Sequential learning tasks • Part-of-speech tagging • Assign a part-of-speech tag to each word in a sentence • Example “I am taking my examination” <pron verb verb pron noun>

  5. Sequential supervised learning • Given training examples, each consisting of an observation sequence and a corresponding label sequence • Goal – learn a classifier that maps a whole observation sequence to a whole label sequence (see the formulation below)
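
Spelled out (the slide's own notation was lost in transcription, so this is the standard reconstruction): given training examples $\{(X_i, Y_i)\}_{i=1}^{N}$, where $X_i = (x_{i,1}, \dots, x_{i,T_i})$ is the observation sequence and $Y_i = (y_{i,1}, \dots, y_{i,T_i})$ the label sequence, learn a classifier $h$ with

\[
\hat{Y} = h(X),
\]

which predicts the entire label sequence jointly rather than each $y_t$ in isolation.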

  6. Applications • Protein secondary structure prediction • Named entity recognition • FAQ document segmentation • Optical character recognition for words • NetTalk task

  7. Outline • Sequential supervised learning • Evaluation criteria • Methods for sequential supervised learning • Open problems

  8. Modeling accuracy • Two kinds of relations • Horizontal relations (among neighboring labels) – directed or undirected graphical model • Vertical relations (between labels and observations) – generative or discriminative model, feature representation [Figure: chain-structured graphical model over labels y_{t-1}, y_t, y_{t+1} and observations x_{t-1}, x_t, x_{t+1}]

  9. Feature representation • Arbitrary non-independent features of the observation sequence • Overlapping features • Global features • Very high or infinite dimensional feature spaces – kernel methods

  10. Computational efficiency • Training efficiency • Testing efficiency • Scalability to a large number of features and labels

  11. Generalization bound • Margin-maximizing property

  12. Outline • Sequential supervised learning • Evaluation criteria • Methods for sequential supervised learning • Open problems

  13. Sliding window / recurrent sliding window • Sliding window • Vertical relations ok • Horizontal relations captured only indirectly, through nearby x’s • Recurrent sliding window • Both vertical and horizontal relations ok • Horizontal relations flow in one direction only, since earlier predictions feed later ones [Figures: sliding-window and recurrent-sliding-window models over labels y_{t-1}, y_t, y_{t+1} and observations x_{t-1}, x_t, x_{t+1}]
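
A minimal sketch of the recurrent sliding window idea: the sequence is turned into ordinary supervised examples, with the previous predicted label fed forward. The window half-width w and the single fed-back label are illustrative choices, not the slide's.

```python
def recurrent_sliding_window(xs, predict, w=2):
    """Label a sequence left to right with a recurrent sliding window.

    xs      : list of per-position observation feature vectors
    predict : an ordinary supervised classifier taking (window, prev_label)
    w       : half-width of the observation window (illustrative choice)
    """
    labels = []
    for t in range(len(xs)):
        # Vertical relations: observations in a window centered at t
        # (None marks positions that fall outside the sequence).
        window = [xs[i] if 0 <= i < len(xs) else None
                  for i in range(t - w, t + w + 1)]
        # Horizontal relations: only the previously *predicted* label,
        # so label information flows in one direction only.
        prev = labels[t - 1] if t > 0 else None
        labels.append(predict(window, prev))
    return labels
```

At training time the gold previous label (or the prediction of a partially trained classifier) would be used in place of prev; that choice is one of the design decisions the recurrent variant introduces.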

  14. Evaluation • Feature representation • Arbitrary non-independent observation features ok • Computational efficiency • Depends on the underlying classical supervised learning algorithm

  15. Generative model – Hidden Markov Models (HMM) • Representation of the joint distribution P(X, Y) • Extension of the naïve Bayes network to sequences • Two kinds of distributions • State transition probability • State-specific observation distribution [Figure: HMM chain over labels y_{t-1}, y_t, y_{t+1} emitting observations x_{t-1}, x_t, x_{t+1}]
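
The joint distribution referred to above factors in the standard HMM way:

\[
P(X, Y) = P(y_1)\,P(x_1 \mid y_1) \prod_{t=2}^{T} P(y_t \mid y_{t-1})\,P(x_t \mid y_t),
\]

with each factor pair giving a state transition and a state-specific observation distribution.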

  16. Evaluation • Computational efficiency • Maximum likelihood (ML) training – efficient • Prediction • Per-sequence loss – Viterbi, O(TN²) • Per-label loss – forward-backward, O(TN²)
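
A minimal Viterbi sketch over log-probabilities, to make the O(TN²) cost visible: each of T positions scores each of N states against N predecessors. The data layout (nested lists of log-probabilities) is an illustrative choice.

```python
def viterbi(obs_logprob, trans_logprob, init_logprob):
    """Most likely label sequence for an HMM, in O(T * N^2) time.

    obs_logprob[t][s]   : log P(x_t | y_t = s)
    trans_logprob[r][s] : log P(y_t = s | y_{t-1} = r)
    init_logprob[s]     : log P(y_1 = s)
    """
    T, N = len(obs_logprob), len(init_logprob)
    delta = [init_logprob[s] + obs_logprob[0][s] for s in range(N)]
    back = []
    for t in range(1, T):
        prev, delta, ptr = delta, [], []
        for s in range(N):                       # N states ...
            best = max(range(N),                 # ... times N predecessors
                       key=lambda r: prev[r] + trans_logprob[r][s])
            delta.append(prev[best] + trans_logprob[best][s]
                         + obs_logprob[t][s])
            ptr.append(best)
        back.append(ptr)
    # Backtrace from the best final state.
    path = [max(range(N), key=lambda s: delta[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```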

  17. Evaluation • Modeling accuracy • Vertical relations modeled generatively – ML training maximizes the joint likelihood, which is suboptimal for classification • Feature representation • Conditional independence assumption on observations • Arbitrary non-independent observation features not allowed

  18. Discriminative graphical model • Model the conditional distribution P(Y | X) directly • Extension of logistic regression • Arbitrary non-independent observation features allowed • Typical methods • Maximum entropy Markov models (MEMM) • Conditional random fields (CRF)

  19. Maximum entropy Markov models • Per-state transition distribution P(y_t | y_{t-1}, x_t) • Maximum entropy formulation [Figure: directed chain in which each label y_t depends on y_{t-1} and its observation x_t]
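
In the standard MEMM formulation, each per-state transition distribution is an exponential (maximum entropy) model normalized over the next label only:

\[
P(y_t \mid y_{t-1}, x_t) = \frac{1}{Z(x_t, y_{t-1})} \exp\Big( \sum_k \lambda_k f_k(x_t, y_t) \Big).
\]

The per-source-state normalizer $Z(x_t, y_{t-1})$ is what the label bias discussion on the next slide turns on.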

  20. Evaluation • Training – generalized iterative scaling • Prediction – Viterbi or forward-backward • Label bias problem • Drawback of directed conditional Markov models • Per-source-state normalization • States with low-entropy transition distributions pay little attention to the observation • Favors the label sequences that were more frequent in training

  21. Conditional Random Fields • Undirected graphical model • Markov random field globally conditioned on X • Two kinds of features • Dependence between neighboring labels, given x • Dependence between the current label and x [Figure: undirected chain over y_{t-1}, y_t, y_{t+1}, each label connected to the whole observation sequence x]
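
For a chain, the CRF conditional distribution takes the standard form

\[
p_{\lambda}(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{t} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, X, t) \Big),
\]

where the single global normalizer $Z(X)$ per sequence, in place of the MEMM's per-state normalizers, is what avoids the label bias problem.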

  22. Training CRF • Loss function • Per-sequence vs. per-label • Optimization methods • Improved iterative scaling – slow • General-purpose convex optimization (e.g., quasi-Newton methods) – much faster
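
The per-sequence objective (conditional log-likelihood) is convex, and its gradient has the familiar observed-minus-expected form, the expectation being computable with forward-backward:

\[
\frac{\partial \ell}{\partial \lambda_k} = \sum_{i} \sum_{t} \Big( f_k(y_{i,t-1}, y_{i,t}, X_i, t) - \mathbb{E}_{p_{\lambda}(Y \mid X_i)} \big[ f_k(Y_{t-1}, Y_t, X_i, t) \big] \Big).
\]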

  23. Problems with rich feature representations • Large number of features • Slows down parameter estimation • Need to eliminate redundancy • More expressive features • Improve prediction accuracy • Combinatorial explosion of feature conjunctions • Incorporate only the necessary combinations

  24. Feature induction • Iteratively construct feature conjunctions • Candidate features • Atomic features • Conjunctions of atomic and already-incorporated features • Select the candidate giving the maximum increase in conditional log-likelihood

  25. Feature induction (cont’d) • Gradient tree boosting • Potential function = sum of regression trees fit by gradient boosting • Path through a tree – a feature combination • Value in a leaf – its weight • Significant improvement in training speed and prediction accuracy
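
A hedged sketch of the idea: instead of updating weights on fixed features, the potential function itself is grown additively, each stage fitting a regression tree to the functional gradient of the conditional log-likelihood, so tree paths play the role of induced feature conjunctions:

\[
F_m(y_{t-1}, y_t, X) = F_{m-1}(y_{t-1}, y_t, X) + \mathrm{Tree}_m(y_{t-1}, y_t, X), \qquad \mathrm{Tree}_m \approx \frac{\partial \ell}{\partial F_{m-1}}.
\]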

  26. More evaluation • Does not scale to a large number of classes • Forward-backward and Viterbi are quadratic in the number of labels • No generalization bound

  27. Traditional discriminant function • Directly measures the compatibility between a label and an observation • Simple linear discriminant function • Classifier (see below)
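
The formulas on this slide did not survive transcription; in the standard binary linear setting they would read

\[
f(x) = \langle w, x \rangle + b, \qquad h(x) = \operatorname{sign}\big(f(x)\big).
\]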

  28. Support vector machines • Maximize the margin (classification confidence) • L1-norm soft-margin SVM formulation • Functional margin • Slack variables
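
The L1-norm soft-margin formulation referred to here is the standard one:

\[
\min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i
\quad \text{s.t.} \quad y_i \big( \langle w, x_i \rangle + b \big) \ge 1 - \xi_i, \;\; \xi_i \ge 0,
\]

where $y_i f(x_i)$ is the functional margin and the slack variable $\xi_i$ absorbs margin violations.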

  29. Dual formulation • Lagrange multipliers • Dual optimization problem • Dual discriminant function (see below)
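
Introducing a Lagrange multiplier $\alpha_i$ per constraint and eliminating $w$ and $b$ gives the standard dual, in which the data enter only through inner products:

\[
\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
\quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0,
\]

with dual discriminant function $f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$.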

  30. Kernel trick • Kernel function • Avoids explicit feature representation • Feature spaces of very high or infinite dimension • Non-linear discriminant function / classifier
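
Replacing every inner product in the dual with a kernel $K(x, x') = \langle \phi(x), \phi(x') \rangle$ yields a non-linear classifier without ever computing $\phi$ explicitly; the RBF kernel is the classic example of an implicit infinite-dimensional feature space:

\[
K(x, x') = \exp\Big( -\frac{\lVert x - x' \rVert^2}{2\sigma^2} \Big).
\]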

  31. Voted perceptron • Perceptron – an online algorithm • Voted perceptron – converts it to a batch algorithm • Deterministic leave-one-out method • Predicts by majority voting over the intermediate perceptrons • Admits the kernel trick and is computationally efficient

  32. Extension to multi-class • Discriminant function • Classifier • Functional margin

  33. Multi-class SVMs • L1-norm soft-margin SVM formulation for multiple classes (see below)
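
The formulas on slides 32 and 33 were images; the standard multi-class construction they describe keeps one weight vector per class,

\[
f(x, y) = \langle w_y, x \rangle, \qquad h(x) = \arg\max_{y} f(x, y),
\]

with functional margin $f(x_i, y_i) - \max_{y \ne y_i} f(x_i, y)$, and the L1-norm soft-margin problem

\[
\min_{w, \xi} \; \frac{1}{2} \sum_{y} \lVert w_y \rVert^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad f(x_i, y_i) - f(x_i, y) \ge 1 - \xi_i \;\; \forall\, y \ne y_i, \;\; \xi_i \ge 0.
\]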

  34. Discriminant function in sequential supervised learning • Treat it as a multi-class problem over whole label sequences • Exponentially large number of classes • Learning the discriminant function • Voted perceptron • Support vector machines

  35. Feature representation • Arbitrary non-independent observation features • Feature spaces of high or infinite dimension • Assumptions on the joint feature map • Chain-structured • Additive over positions

  36. Voted perceptron • Update step on each mistakenly labeled sequence • Average the intermediate perceptrons • Prediction – Viterbi algorithm • Computationally efficient
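
A minimal sketch of the update step in Collins-style perceptron training for sequences; phi (an additive joint feature map) and viterbi_decode are assumed helpers, and weight averaging stands in for full vote bookkeeping:

```python
def train_structured_perceptron(data, phi, viterbi_decode, dim, epochs=5):
    """Collins-style perceptron for sequence labeling (a sketch).

    data           : list of (X, Y) pairs, Y the gold label sequence
    phi(X, Y)      : additive joint feature vector of length dim
    viterbi_decode : computes argmax_Y <w, phi(X, Y)> for chain features
    """
    w = [0.0] * dim
    w_sum = [0.0] * dim              # running sum for averaging
    for _ in range(epochs):
        for X, Y in data:
            Y_hat = viterbi_decode(w, X)
            if Y_hat != Y:
                # Update step: move toward the gold features and away
                # from the features of the wrong prediction.
                gold, pred = phi(X, Y), phi(X, Y_hat)
                for k in range(dim):
                    w[k] += gold[k] - pred[k]
            for k in range(dim):     # accumulate after every example
                w_sum[k] += w[k]
    n = epochs * len(data)
    return [s / n for s in w_sum]    # averaged weights approximate voting
```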

  37. SVMs with loss function • Re-scale the slack variables • Re-scale the margin • A higher-loss label sequence must be separated by a larger margin (more confidence)
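
With a loss $\Delta(Y_i, Y)$ between label sequences, the two options are conventionally written, for all $Y \ne Y_i$:

\[
\text{slack re-scaling:} \quad f(X_i, Y_i) - f(X_i, Y) \ge 1 - \frac{\xi_i}{\Delta(Y_i, Y)},
\]
\[
\text{margin re-scaling:} \quad f(X_i, Y_i) - f(X_i, Y) \ge \Delta(Y_i, Y) - \xi_i.
\]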

  38. Hidden Markov Support Vector Machines • Sparseness assumption on support vectors • Small number of non-zero dual variables • Small number of active constraints • Iteratively add new SVs • Working set contains the current SVs • The candidate SV is the one that violates its margin constraint most • Each addition strictly increases the dual objective function
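
A schematic of the working-set loop just described; most_violated, solve_qp_over, and the tolerance epsilon are stand-ins for the algorithm's actual components:

```python
def working_set_training(data, most_violated, solve_qp_over, epsilon=1e-3):
    """Working-set style loop for HM-SVM training (a sketch).

    most_violated(model, X, Y) : label sequence whose margin constraint is
                                 violated most, found by decoding
    solve_qp_over(working_set) : re-optimize the dual over the current SVs
    """
    working_set = []                 # support vectors found so far
    model = solve_qp_over(working_set)
    while True:
        added = False
        for X, Y in data:
            Y_bad, violation = most_violated(model, X, Y)
            if violation > epsilon:
                working_set.append((X, Y, Y_bad))
                model = solve_qp_over(working_set)   # strict dual increase
                added = True
        if not added:                # nothing violated beyond epsilon
            return model
```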

  39. Margin violation exceeds a tolerance ε • Upper bound on the dual objective • Polynomial number of SVs at convergence • Dual objective optimization stays tractable • Close to the optimal solution

  40. Max-Margin Markov Networks • Exploit the structure of the output space • Convert the exponentially many constraints to a polynomial number • Re-scale the margin with the Hamming loss (per-label zero-one loss)
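
The conversion works because the Hamming loss, like the features, decomposes over sequence positions:

\[
\Delta(Y, \hat{Y}) = \sum_{t=1}^{T} \mathbf{1}[\, y_t \ne \hat{y}_t \,],
\]

so constraints and slack can be pushed down from whole sequences to individual labels and edges.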

  41. Dual formulation

  42. Structure decomposition • Loss function – decomposes over individual labels • Features – decompose over neighboring label pairs (edges)

  43. Factorization

  44. Factored dual • Objective function over per-edge dual variables • Constraints plus consistency checks between them • Polynomial size

  45. Evaluation • Arbitrary non-independent observation features • Kernel trick • Feature spaces of very high or infinite dimension • Complex non-linear discriminant functions • Margin maximizing – generalization bound • Scalability unclear

  46. Open problems • Training CRFs faster to make them practical • Effect of (approximate) inference during training • Scalability of discriminant-function methods • Other algorithms for learning the discriminant function • Dealing with missing values • Novel sequential supervised learning algorithms

  47. Thank You
