Improving Supervised Classification using Confidence Weighted Learning
Koby Crammer
Joint work with Mark Dredze, Alex Kulesza, and Fernando Pereira
Workshop in Machine Learning, EE Department, Technion, January 20, 2010
Linear Classifiers
• Input: an instance x to be classified
• Weight vector w of the classifier
• Prediction: the sign of the inner product w · x
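As a minimal sketch, the prediction rule above in NumPy (dense vectors are used purely for illustration):

```python
import numpy as np

def predict(w: np.ndarray, x: np.ndarray) -> int:
    """Linear classifier: the label is the sign of the inner product w . x."""
    return 1 if np.dot(w, x) >= 0 else -1
```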
Natural Language Processing
• Big datasets, large numbers of features
• Many features are only weakly correlated with the target label
• Linear classifiers: features are associated with word counts
• Heavy-tailed feature distribution
[Plot: feature counts vs. feature rank, illustrating the heavy tail]
Sentiment Classification • Who needs this Simpsons book? You DOOOOOOOO This is one of the most extraordinary volumes I've ever encountered … . Exhaustive, informative, and ridiculously entertaining, it is the best accompaniment to the best television show … . … Very highly recommended! Pang, Lee, Vaithyanathan, EMNLP 2002
Online Learning
Maintain a model M. Repeat, for each round:
1. Get instance x
2. Predict label ŷ = M(x)
3. Get true label y
4. Suffer loss l(ŷ, y)
5. Update model M
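The protocol as a short sketch; `predict` and `update` are illustrative method names, not from the talk:

```python
def online_learning(model, stream):
    """Generic online loop: predict, observe the true label, suffer loss, update."""
    mistakes = 0
    for x, y in stream:              # get instance x, later its true label y
        y_hat = model.predict(x)     # predict label y_hat = M(x)
        mistakes += int(y_hat != y)  # suffer 0/1 loss l(y_hat, y)
        model.update(x, y)           # update model M
    return mistakes
```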
Sentiment Classification
• Many positive reviews contain the word best, so w_best grows large
• Later, a negative review arrives: "boring book – best if you want to sleep in seconds"
• A linear update will reduce both w_best and w_boring
• But best has appeared far more often than boring
• Better to reduce the weights at different rates: shrink w_boring more than w_best
From a Linear Model to a Distribution over Linear Models
• Replace the single weight vector with a distribution over weight vectors
• The mean weight vector plays the role of the usual linear classifier
New Prediction Models
• Gaussian distributions over weight vectors
• The covariance is either full or diagonal
• In NLP we have many features, so we use a diagonal covariance
Weight Vector (Version) Space
The algorithm forces most of the probability mass of the weight-vector distribution to reside in the region that classifies the example correctly
Passive Step
Nothing to do: most of the weight vectors already classify the example correctly
Aggressive Step
• The mean is moved beyond the mistake line (large margin)
• The covariance is shrunk in the direction of the input example
• The algorithm projects the current Gaussian distribution onto the half-space of weight vectors that classify the example correctly
The Update
• Projection update: find the closest distribution (in KL divergence) that classifies the example correctly with probability at least η:
(μ_{t+1}, Σ_{t+1}) = argmin_{μ,Σ} D_KL( N(μ,Σ) ‖ N(μ_t,Σ_t) ) s.t. Pr_{w∼N(μ,Σ)}[ y_t (w · x_t) ≥ 0 ] ≥ η
• Can be solved analytically
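A sketch of the diagonal-covariance update, following the closed-form "variance" approximation of Dredze, Crammer & Pereira (ICML 2008); treat the exact constants as my reconstruction of that derivation rather than a reference implementation:

```python
import numpy as np
from scipy.stats import norm

def cw_diag_update(mu, sigma, x, y, eta=0.9):
    """One CW update with a diagonal covariance (variance approximation).

    mu, sigma: mean and diagonal covariance (1-D arrays of the same length)
    x: feature vector; y: label in {-1, +1}; eta > 0.5: confidence level.
    """
    phi = norm.ppf(eta)       # inverse normal CDF of eta
    m = y * np.dot(mu, x)     # signed margin under the mean
    v = np.dot(sigma, x * x)  # margin variance x^T Sigma x
    if v <= 0:                # no informative features: nothing to do
        return mu, sigma
    # Closed-form Lagrange multiplier for the linearized constraint.
    disc = (1 + 2 * phi * m) ** 2 - 8 * phi * (m - phi * v)
    alpha = max(0.0, (-(1 + 2 * phi * m) + np.sqrt(disc)) / (4 * phi * v))
    if alpha > 0:             # aggressive step
        mu = mu + alpha * y * sigma * x                        # move the mean
        sigma = 1.0 / (1.0 / sigma + 2 * alpha * phi * x * x)  # shrink variance
    return mu, sigma
```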
Synthetic Data
• 20 features
• 2 informative (rotated skewed Gaussian), 18 noisy
• Using a single feature is as good as random prediction
Synthetic Data (cont.)
[Plot: learned weight distribution after 50 examples]
Synthetic Data (results)
[Plot: learning curves comparing Perceptron, PA, 2nd-order perceptron, CW-full, and CW-diag]
Data
• Binary document classification
• Sentiment reviews: 6 Amazon domains (Blitzer et al.)
• Reuters (RCV1): 3 pairs of labels
• 20 Newsgroups: 3 pairs of labels
• About 2,000 instances per dataset
• Bag-of-words representation
• 10-fold cross-validation; 5 epochs
Results vs. Batch – Sentiment
• Always better than batch methods
• 3/6 significantly better
Results vs. Batch – 20NG + Reuters
• 5/6 better than batch methods
• 3/5 significantly better, 1/1 significantly worse
Parallel Training
• Split the large dataset into disjoint sets
• Train on each set independently
• Combine the resulting classifiers by averaging (see the sketch below):
• Uniform mean of the linear weights
• Weighted mean of the linear weights, using confidence information
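A sketch of the two combination rules; reading "confidence information" as the inverse of each model's per-feature variance is my assumption, so treat the exact weighting as illustrative:

```python
import numpy as np

def uniform_combine(mus):
    """Uniform mean of the linear weights of the per-shard classifiers."""
    return np.mean(mus, axis=0)

def confidence_combine(mus, sigmas):
    """Per-feature weighted mean, weighting each model by 1/variance."""
    mus, sigmas = np.asarray(mus), np.asarray(sigmas)
    weights = 1.0 / sigmas  # low variance = high confidence
    return (weights * mus).sum(axis=0) / weights.sum(axis=0)
```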
Parallel Training
• Data size: Sentiment ~1M documents; Reuters ~0.8M
• #Features/#Docs ratio: Sentiment ~13; Reuters ~0.35
• Performance degrades with the number of splits
• Weighting improves performance
[Plot: accuracy vs. number of splits; the baseline is CW trained on all the data]
Multi-Class Update
Crammer, Dredze, Kulesza. EMNLP 2008
• Constraints for labels: multiple constraints per instance
• Approximation: use a single constraint
Evaluation Setup
Crammer, Dredze, Kulesza. EMNLP 2008
• Nine multi-class datasets
Evaluation
Crammer, Dredze, Kulesza. EMNLP 2008
• Better than all baselines (online and batch): 8 of 9 datasets
20 Newsgroups
Crammer, Dredze, Kulesza. EMNLP 2008
• Better than all online baselines: 8 of 9 datasets
Dredze, Kulesza, Crammer. MLJ 2009
Multi-Domain Learning
• Task: sentiment classification on reviews from several domains:
• Electronics
• Books
• Movies
• Kitchen Appliances
• Challenge: domains differ
• Domains use different features
• Domains may behave differently towards features
Blitzer, Dredze, Pereira, ACL 2007
Differing Feature Behaviors
• Share similar behaviors across domains
• Learn domain-specific behaviors
• Shared parameters: used for every domain
• Domain parameters: separate parameters for every domain
Combining Domain Parameters
• Example: a shared weight of 2 and a domain-specific weight of −1 combine to 0.5
Classifier Combination
• A CW classifier is a distribution over weight vectors
• Individual classifiers are combined into a single classifier by weighting
Multi-Domain Regularization
• Use the combined classifier for prediction and updates
• Based on Evgeniou and Pontil, KDD 2004
• Passive-aggressive update rule: find the shared model and the individual model that
1) make the smallest parameter change from the current corresponding models, and
2) whose combination classifies the current example correctly
(see the prediction sketch below)
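A sketch of prediction with the combined classifier. The precise combination rule is my assumption (a per-feature precision-weighted average of the shared and domain-specific means); with equal variances it reduces to the plain average shown on the earlier slide:

```python
import numpy as np

def combined_predict(mu_shared, sig_shared, mu_dom, sig_dom, x):
    """Predict with a per-feature combination of shared and domain parameters.

    Assumed rule: precision-weighted average of the two means (1/variance
    weights); e.g. equal variances combine weights 2 and -1 into 0.5.
    """
    prec_s, prec_d = 1.0 / sig_shared, 1.0 / sig_dom
    mu_c = (prec_s * mu_shared + prec_d * mu_dom) / (prec_s + prec_d)
    return 1 if np.dot(mu_c, x) >= 0 else -1
```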
Evaluation on Sentiment
• Sentiment classification: rate product reviews positive/negative
• 4 datasets:
• All: 7 Amazon product types
• Books: different rating thresholds
• DVDs: different rating thresholds
• Books+DVDs
• 1500 train, 100 test per domain
Results
• Books, DVDs, and Books+DVDs: p = .001
• Metric: test error (smaller is better)
• 10-fold CV, one pass of online training
Dredze & Crammer, ACL 2008
Active Learning
• Start with a pool of unlabeled examples
• Use a few labeled examples to choose an initial hypothesis
• Iterative algorithm:
• Use the current classifier to pick an example to be labeled
• Train using all labeled examples
Dredze & Crammer, ACL 2008
Picking the Next Example
• Random
• Linear classifiers: the example with the lowest margin
• Active Confidence Learning (ACL): the example with the least confidence
• Equivalent to the lowest normalized margin (see the sketch below)
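A sketch of the ACL selection rule: pick the pool example whose normalized margin |μ·x| / sqrt(xᵀΣx) is smallest. Variable names are illustrative, and the diagonal-covariance form mirrors the earlier update sketch:

```python
import numpy as np

def pick_next(mu, sigma, pool):
    """Index of the unlabeled example with the lowest normalized margin."""
    def normalized_margin(x):
        return abs(np.dot(mu, x)) / np.sqrt(np.dot(sigma, x * x))
    return min(range(len(pool)), key=lambda i: normalized_margin(pool[i]))
```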
Dredze & Crammer, ACL 2008
Active Learning
• 13 datasets: Sentiment (4), 20NG (3), Reuters (3), SPAM (3)
Dredze & Crammer, ACL 2008
Active Learning
• Number of labels needed by CW Margin and ACL to achieve 80% of the accuracy of training with all the data
Summary
• Online training is fast and effective …
• … but NLP data has heavy-tailed feature distributions
• New model: add feature confidence parameters
• Benefits:
• Better than state-of-the-art training algorithms for linear classifiers
• Converges faster
• Theoretical guarantees
• Allows better combination of models trained in parallel, better active learning, and better domain adaptation