Improving Supervised Classification using Confidence Weighted Learning
Koby Crammer
Joint work with Mark Dredze, Alex Kulesza, and Fernando Pereira
Workshop in Machine Learning, EE Department, Technion, January 20, 2010
Linear Classifiers
• Input: an instance x to be classified
• Weight vector w of the classifier
• Prediction: the sign of the inner product w · x
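As a minimal sketch, the prediction rule above in NumPy (dense vectors are used purely for illustration):

```python
import numpy as np

def predict(w: np.ndarray, x: np.ndarray) -> int:
    """Linear classifier: the label is the sign of the inner product w . x."""
    return 1 if np.dot(w, x) >= 0 else -1
```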
Natural Language Processing
• Big datasets, large numbers of features
• Many features are only weakly correlated with the target label
• Linear classifiers: features are associated with word counts
• Heavy-tailed feature distribution
[Plot: feature counts vs. feature rank, illustrating the heavy tail]
Sentiment Classification • Who needs this Simpsons book? You DOOOOOOOO This is one of the most extraordinary volumes I've ever encountered … . Exhaustive, informative, and ridiculously entertaining, it is the best accompaniment to the best television show … . … Very highly recommended! Pang, Lee, Vaithyanathan, EMNLP 2002
Online Learning
Maintain a model M. Repeat, for each round:
1. Get instance x
2. Predict label ŷ = M(x)
3. Get true label y
4. Suffer loss l(ŷ, y)
5. Update model M
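The protocol as a short sketch; `predict` and `update` are illustrative method names, not from the talk:

```python
def online_learning(model, stream):
    """Generic online loop: predict, observe the true label, suffer loss, update."""
    mistakes = 0
    for x, y in stream:              # get instance x, later its true label y
        y_hat = model.predict(x)     # predict label y_hat = M(x)
        mistakes += int(y_hat != y)  # suffer 0/1 loss l(y_hat, y)
        model.update(x, y)           # update model M
    return mistakes
```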
Sentiment Classification
• Many positive reviews contain the word best, so w_best grows large
• Later, a negative review arrives: "boring book – best if you want to sleep in seconds"
• A linear update will reduce both w_best and w_boring
• But best has appeared far more often than boring
• Better to reduce the weights at different rates: shrink w_boring more than w_best
From a Linear Model to a Distribution over Linear Models
• Replace the single weight vector with a distribution over weight vectors
• The mean weight vector plays the role of the usual linear classifier
New Prediction Models
• Gaussian distributions over weight vectors
• The covariance is either full or diagonal
• In NLP we have many features, so we use a diagonal covariance
Weight Vector (Version) Space
The algorithm forces most of the probability mass of the weight-vector distribution to reside in the region that classifies the example correctly
Passive Step
Nothing to do: most of the weight vectors already classify the example correctly
Aggressive Step
• The mean is moved beyond the mistake line (large margin)
• The covariance is shrunk in the direction of the input example
• The algorithm projects the current Gaussian distribution onto the half-space of weight vectors that classify the example correctly
The Update
• Projection update: find the closest distribution (in KL divergence) that classifies the example correctly with probability at least η:
(μ_{t+1}, Σ_{t+1}) = argmin_{μ,Σ} D_KL( N(μ,Σ) ‖ N(μ_t,Σ_t) ) s.t. Pr_{w∼N(μ,Σ)}[ y_t (w · x_t) ≥ 0 ] ≥ η
• Can be solved analytically
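A sketch of the diagonal-covariance update, following the closed-form "variance" approximation of Dredze, Crammer & Pereira (ICML 2008); treat the exact constants as my reconstruction of that derivation rather than a reference implementation:

```python
import numpy as np
from scipy.stats import norm

def cw_diag_update(mu, sigma, x, y, eta=0.9):
    """One CW update with a diagonal covariance (variance approximation).

    mu, sigma: mean and diagonal covariance (1-D arrays of the same length)
    x: feature vector; y: label in {-1, +1}; eta > 0.5: confidence level.
    """
    phi = norm.ppf(eta)       # inverse normal CDF of eta
    m = y * np.dot(mu, x)     # signed margin under the mean
    v = np.dot(sigma, x * x)  # margin variance x^T Sigma x
    if v <= 0:                # no informative features: nothing to do
        return mu, sigma
    # Closed-form Lagrange multiplier for the linearized constraint.
    disc = (1 + 2 * phi * m) ** 2 - 8 * phi * (m - phi * v)
    alpha = max(0.0, (-(1 + 2 * phi * m) + np.sqrt(disc)) / (4 * phi * v))
    if alpha > 0:             # aggressive step
        mu = mu + alpha * y * sigma * x                        # move the mean
        sigma = 1.0 / (1.0 / sigma + 2 * alpha * phi * x * x)  # shrink variance
    return mu, sigma
```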
Synthetic Data
• 20 features
• 2 informative (rotated skewed Gaussian), 18 noisy
• Using a single feature is as good as random prediction
Synthetic Data (cont.)
[Plot: learned weight distribution after 50 examples]
Synthetic Data (results)
[Plot: learning curves comparing Perceptron, PA, 2nd-order perceptron, CW-full, and CW-diag]
Data
• Binary document classification
• Sentiment reviews: 6 Amazon domains (Blitzer et al.)
• Reuters (RCV1): 3 pairs of labels
• 20 Newsgroups: 3 pairs of labels
• About 2,000 instances per dataset
• Bag-of-words representation
• 10-fold cross-validation; 5 epochs
Results vs. Batch – Sentiment
• Always better than batch methods
• 3/6 significantly better
Results vs. Batch – 20NG + Reuters
• 5/6 better than batch methods
• 3/5 significantly better, 1/1 significantly worse
Parallel Training
• Split the large dataset into disjoint sets
• Train on each set independently
• Combine the resulting classifiers by averaging (see the sketch below):
• Uniform mean of the linear weights
• Weighted mean of the linear weights, using confidence information
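A sketch of the two combination rules; reading "confidence information" as the inverse of each model's per-feature variance is my assumption, so treat the exact weighting as illustrative:

```python
import numpy as np

def uniform_combine(mus):
    """Uniform mean of the linear weights of the per-shard classifiers."""
    return np.mean(mus, axis=0)

def confidence_combine(mus, sigmas):
    """Per-feature weighted mean, weighting each model by 1/variance."""
    mus, sigmas = np.asarray(mus), np.asarray(sigmas)
    weights = 1.0 / sigmas  # low variance = high confidence
    return (weights * mus).sum(axis=0) / weights.sum(axis=0)
```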
Parallel Training
• Data size: Sentiment ~1M documents; Reuters ~0.8M
• #Features/#Docs ratio: Sentiment ~13; Reuters ~0.35
• Performance degrades with the number of splits
• Weighting improves performance
[Plot: accuracy vs. number of splits; the baseline is CW trained on all the data]
Multi-Class Update
Crammer, Dredze, Kulesza. EMNLP 2008
• Constraints for labels: multiple constraints per instance
• Approximation: use a single constraint
Evaluation Setup
Crammer, Dredze, Kulesza. EMNLP 2008
• Nine multi-class datasets
Evaluation
Crammer, Dredze, Kulesza. EMNLP 2008
• Better than all baselines (online and batch): 8 of 9 datasets
20 Newsgroups
Crammer, Dredze, Kulesza. EMNLP 2008
• Better than all online baselines: 8 of 9 datasets
Dredze, Kulesza, Crammer. MLJ 2009
Multi-Domain Learning
• Task: sentiment classification on reviews from several domains:
• Electronics
• Books
• Movies
• Kitchen Appliances
• Challenge: domains differ
• Domains use different features
• Domains may behave differently towards features
Blitzer, Dredze, Pereira, ACL 2007
Differing Feature Behaviors
• Share similar behaviors across domains
• Learn domain-specific behaviors
• Shared parameters: used for every domain
• Domain parameters: separate parameters for every domain
Combining Domain Parameters
• Example: a shared weight of 2 and a domain-specific weight of −1 combine to 0.5
Classifier Combination
• A CW classifier is a distribution over weight vectors
• Individual classifiers are combined into a single classifier by weighting
Multi-Domain Regularization
• Use the combined classifier for prediction and updates
• Based on Evgeniou and Pontil, KDD 2004
• Passive-aggressive update rule: find the shared model and the individual model that
1) make the smallest parameter change from the current corresponding models, and
2) whose combination classifies the current example correctly
(see the prediction sketch below)
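A sketch of prediction with the combined classifier. The precise combination rule is my assumption (a per-feature precision-weighted average of the shared and domain-specific means); with equal variances it reduces to the plain average shown on the earlier slide:

```python
import numpy as np

def combined_predict(mu_shared, sig_shared, mu_dom, sig_dom, x):
    """Predict with a per-feature combination of shared and domain parameters.

    Assumed rule: precision-weighted average of the two means (1/variance
    weights); e.g. equal variances combine weights 2 and -1 into 0.5.
    """
    prec_s, prec_d = 1.0 / sig_shared, 1.0 / sig_dom
    mu_c = (prec_s * mu_shared + prec_d * mu_dom) / (prec_s + prec_d)
    return 1 if np.dot(mu_c, x) >= 0 else -1
```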
Evaluation on Sentiment
• Sentiment classification: rate product reviews positive/negative
• 4 datasets:
• All: 7 Amazon product types
• Books: different rating thresholds
• DVDs: different rating thresholds
• Books+DVDs
• 1500 train, 100 test per domain
Results
• Books, DVDs, and Books+DVDs: p = .001
• Metric: test error (smaller is better)
• 10-fold CV, one pass of online training
Dredze & Crammer, ACL 2008
Active Learning
• Start with a pool of unlabeled examples
• Use a few labeled examples to choose an initial hypothesis
• Iterative algorithm:
• Use the current classifier to pick an example to be labeled
• Train using all labeled examples
Dredze & Crammer, ACL 2008
Picking the Next Example
• Random
• Linear classifiers: the example with the lowest margin
• Active Confidence Learning (ACL): the example with the least confidence
• Equivalent to the lowest normalized margin (see the sketch below)
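A sketch of the ACL selection rule: pick the pool example whose normalized margin |μ·x| / sqrt(xᵀΣx) is smallest. Variable names are illustrative, and the diagonal-covariance form mirrors the earlier update sketch:

```python
import numpy as np

def pick_next(mu, sigma, pool):
    """Index of the unlabeled example with the lowest normalized margin."""
    def normalized_margin(x):
        return abs(np.dot(mu, x)) / np.sqrt(np.dot(sigma, x * x))
    return min(range(len(pool)), key=lambda i: normalized_margin(pool[i]))
```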
Dredze & Crammer, ACL 2008
Active Learning
• 13 datasets: Sentiment (4), 20NG (3), Reuters (3), SPAM (3)
Dredze & Crammer, ACL 2008
Active Learning
• Number of labels needed by CW Margin and ACL to achieve 80% of the accuracy of training with all the data
Summary
• Online training is fast and effective …
• … but NLP data has heavy-tailed feature distributions
• New model: add feature confidence parameters
• Benefits:
• Better than state-of-the-art training algorithms for linear classifiers
• Converges faster
• Theoretical guarantees
• Allows better combination of models trained in parallel, better active learning, and better domain adaptation