Asymmetric Boosting for Face Detection, presented by Minh-Tri Pham, Ph.D. Candidate and Research Associate, Nanyang Technological University, Singapore
Overview • Online Asymmetric Boosting • Fast Training and Selection of Haar-like Features using Statistics • Detection with Multi-exit Asymmetric Boosting
Online Asymmetric Boosting CVPR’07 oral paper: Minh-Tri Pham and Tat-Jen Cham. Online Learning Asymmetric Boosted Classifiers for Object Detection. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, MN, 2007.
Motivation • Usual goal of an object detector: focused on accuracy • General detectors are designed to deal with many different input spaces • Only one input space is used per application [Diagram: a global input space with object and non-object regions; the offline-learned detector covers the whole space, while each application uses a single input space (input space 1, 2, 3, …) that could be learned online]
Supervisor-Student paradigm [Diagram: the input is processed by a Supervisor Detector (slow but general) and a Student Detector (fast but limited), which produces the output] • Supervisor = existing object detector, slow but general • Student = online-learned object detector: a less complex model with faster detection speed
Problem overview • Common appearance-based approach: • Classify a patch using a cascade or tree of boosted classifiers (Viola-Jones and variants) • F1, F2, …, FN: boosted classifiers [Diagram: a patch passes through F1, F2, …, FN in turn; any stage may reject it as non-object, and a patch that passes all stages is declared an object] • Main challenges for online learning a boosted classifier: • Asymmetric: P(non-object) >> P(object) • Online data
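To make the cascade control flow concrete, here is a minimal Python sketch (illustrative names, not the authors' code) of how a patch is rejected early or accepted as an object:

```python
# Minimal sketch of cascade evaluation: a patch is an "object" only if every
# boosted classifier F_1 ... F_N passes it; any stage may reject it immediately.
from typing import Callable, Sequence
import numpy as np

def cascade_predict(x: np.ndarray,
                    stages: Sequence[Callable[[np.ndarray], bool]]) -> bool:
    for F in stages:          # F_1, F_2, ..., F_N
        if not F(x):          # "reject": stop evaluating, label as non-object
            return False
    return True               # passed every stage: label as object
```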
Review of current methods • P(non-object) >> P(object): • Viola and Jones (2002) • Ma and Ding (2003) • Hou et al. (2006) • Reweight positives higher and negatives lower • Offline learning only • Online learning for boosting: • Online Boosting of Oza (2005) • Replaces offline weak learners with online weak learners • Propagates weights similarly to AdaBoost • Only works well when P(non-object) ≈ P(object) • Asymmetric Online Boosting (ours): • Incorporates the asymmetric reweighting scheme into Online Boosting • Skewness balancing: a new reweighting scheme giving better accuracy • Polarity balancing: faster learning convergence rate
Skewness balancing • Skewness: measures the degree of asymmetry of the class probability distribution • Defined as (writing it λ here): λ = log P(negative) − log P(positive) • Viola-Jones' reweighting scheme: reweight positives relative to negatives by the same amount at every weak learner • k_m = reweighting amount at the m-th weak learner • k = total reweighting amount [Diagram: initial skewness λ_1 > 0; the reweighting before weak learner m changes the skewness to λ_m' = λ_m − log k_m, and after training weak learner m the skewness drops back to λ_{m+1} ≈ 0; with equal k_m, the skewness presented to the first weak learner differs from that presented to the later ones]
Skewness balancing • Our approach: reweight positives relative to negatives by a different amount at each weak learner, so that an equal skewness is presented to every weak learner • λ_m = skewness after training the (m−1)-th weak learner • k_m = reweighting amount at the m-th weak learner • k = total reweighting amount [Diagram: the reweighting before weak learner m changes the skewness to λ_m' = λ_m − log k_m; the k_m are chosen so that λ_m' is the same for every weak learner, and after training each weak learner the skewness returns to ≈ 0]
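As a concrete illustration of the rule above, here is a minimal Python sketch of skewness balancing as I read it from the slide (not the authors' code); it assumes the class weight totals stand in for the class probabilities and that the total reweighting amount k is split across weak learners so that each one is presented with the same target skewness:

```python
import numpy as np

def skewness(w: np.ndarray, y: np.ndarray) -> float:
    """lambda = log P(negative) - log P(positive), estimated from sample weights."""
    return float(np.log(w[y == -1].sum()) - np.log(w[y == +1].sum()))

def balance_skewness(w: np.ndarray, y: np.ndarray, target: float) -> np.ndarray:
    """Reweight positives by k_m = exp(lambda_m - target), so the skewness
    presented to the next weak learner equals `target` for every learner."""
    lam = skewness(w, y)
    k_m = np.exp(lam - target)     # reweighting amount for this weak learner
    w = w.copy()
    w[y == +1] *= k_m              # new skewness: lambda_m - log(k_m) = target
    return w / w.sum()             # renormalise; skewness is unaffected
```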
Skewness balancing • Effective for the initial boosted classifiers in the cascade • Better accuracy, hence faster detection speed • Effectiveness degrades as boosted classifiers get more complicated [Figures: ROC curves for a 4-feature boosted classifier and a 200-feature boosted classifier]
Polarity balancing • After training a weak learner with AdaBoost: • classified weights = mis-classified weights • positive weights = negative weights (if the weak learner is optimal) • To maintain AdaBoost's properties online: • Online Boosting ensures asymptotically: classified weights = mis-classified weights • Our method ensures asymptotically: classified weights = mis-classified weights, and positive weights = negative weights, giving a faster convergence rate [Figure: weight distribution after training a weak learner, split into TP/TN (correctly classified) and FP/FN (wrongly classified) for the positive and negative classes]
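One simple offline way to enforce both balance conditions, shown purely as an illustration (the paper's online update rule is not reproduced here): rescale the four weight cells TP, TN, FP, FN so that each carries a quarter of the total weight, which makes classified = mis-classified and positive = negative simultaneously.

```python
import numpy as np

def polarity_balance(w: np.ndarray, y: np.ndarray, pred: np.ndarray) -> np.ndarray:
    """Rescale weights so each of the cells TP, TN, FP, FN sums to 1/4."""
    w = w.astype(float).copy()
    correct = (pred == y)
    for cls in (+1, -1):
        for ok in (True, False):
            cell = (y == cls) & (correct == ok)
            total = w[cell].sum()
            if total > 0:
                w[cell] *= 0.25 / total
    return w
```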
Polarity balancing • Learning time: about 5-30% faster with polarity balancing [Figure: learning time for online-learning a 20-feature boosted classifier]
Overall performance • ROC curves: • Similar results
Online Learning a Face Detector • Video clip: • Length: 20 minutes • Resolution: 352x288 • 25fps • Learn online from the first 10 minutes, using OpenCV's face detector as supervisor • Test with the remaining 10 minutes • OpenCV's face detector: detection speed 15fps • Our online-learned face detector: detection speed 30fps
Online Learning a Face Detector [Figure: distribution of weak learners over the cascade]
Concluding remarks • Skewness balancing: • Effective for early boosted classifiers • Better accuracy, hence faster detection speed • Polarity balancing: • About a 5-30% reduction in learning time empirically • Online learning an object detector from an offline counterpart: • Worst case: detection accuracy and speed are similar • Average case: detection speed can be up to twice as fast
Fast Training and Selection of Haar-like Features using Statistics • ICCV’07 oral paper: • Minh-Tri Pham and Tat-Jen Cham. Fast Training and Selection of Haar Features using Statistics in Boosting-based Face Detection. In Proc. International Conference on Computer Vision (ICCV), Rio de Janeiro, Brazil, 2007. • Won Travel Grant Award • Won Second Prize, Best Student Paper in Year 2007 Award, Pattern Recognition and Machine Intelligence Association (PREMIA), Singapore
Motivation • Face detectors today • Real-time detection speed …but… • Weeks of training time
Why is Training so Slow? • Time complexity: O(MNT log N), for M weak classifiers, N training samples, and T Haar-like features • 15ms to train a feature classifier • 10 minutes to train a weak classifier • 27 days to train a face detector
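For a sense of scale (assuming roughly T ≈ 40,000 features, the figure quoted for existing methods later in this talk): 15 ms per feature classifier × 40,000 features ≈ 600 s ≈ 10 minutes per weak classifier, and 27 days ÷ 10 minutes per weak classifier ≈ 4,000 weak classifiers over the whole cascade.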
Why Should the Training Time be Improved? • Tradeoff between time and generalization • E.g. training 100 times slower if we increase both N and T by 10 times • Trial and error to find key parameters for training • Much longer training time needed • Online-learning face detectors have the same problem
Existing Approaches to Reduce the Training Time • Sub-sample the Haar-like feature set • Simple but loses generalization • Use histograms and real-valued boosting (B. Wu et al. '04) • Pro: Reduces O(MNT log N) to O(MNT) • Con: Raises overfitting concerns: • Real AdaBoost is not known to be resistant to overfitting • A weak classifier may overfit if too many histogram bins are used • Pre-compute the feature values' sorting orders (J. Wu et al. '07) • Pro: Reduces O(MNT log N) to O(MNT) • Con: Requires huge memory storage • For N = 10,000 and T = 40,000, a total of 800MB is needed.
Why is Training so Slow? • Time complexity: O(MNT log N) • 15ms to train a feature classifier • 10min to train a weak classifier • 27 days to train a face detector • Bottleneck: • At least O(NT) to train a weak classifier • Can we avoid O(NT)?
Our Proposal • Fast StatBoost: train feature classifiers using statistics rather than the input data • Con: • Less accurate … but not critical for a feature classifier • Pro: • Much faster training time: constant time instead of linear time
Fast StatBoost • Training feature classifiers using statistics: • Assumption: the feature value v(t) is normally distributed given the class c (face or non-face) • Closed-form solution for the optimal threshold • Fast linear projection of the statistics of a window's integral image into the 1-D statistics of a feature value: • x: random vector representing a window's integral image • μ_c, Σ_c: mean vector and covariance matrix of x for class c • t: Haar-like feature, a sparse vector with fewer than 20 non-zero elements • μ_v(t) = tᵀμ_c and σ²_v(t) = tᵀΣ_c t: mean and variance of the feature value v(t) = tᵀx • Result: constant time to train a feature classifier [Figure: face and non-face class-conditional densities of the feature value, with the optimal threshold at their intersection]
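A minimal numpy sketch of the constant-time step above. The projection formulas follow directly from the slide; the threshold rule (intersection of the two weighted 1-D Gaussians) is a standard choice and an assumption on my part, not necessarily the paper's exact closed form.

```python
import numpy as np

def project_stats(t: np.ndarray, mu: np.ndarray, Sigma: np.ndarray):
    """1-D mean and variance of the feature value v(t) = t^T x."""
    return float(t @ mu), float(t @ Sigma @ t)

def gaussian_threshold(m1, v1, w1, m2, v2, w2):
    """Threshold where the two weighted class-conditional Gaussians intersect."""
    A = 0.5 / v2 - 0.5 / v1
    B = m1 / v1 - m2 / v2
    C = m2**2 / (2 * v2) - m1**2 / (2 * v1) + np.log(w1 * np.sqrt(v2) / (w2 * np.sqrt(v1)))
    if abs(A) < 1e-12:                       # equal variances: linear equation
        return -C / B
    roots = np.roots([A, B, C])              # quadratic in the feature value
    roots = roots[np.isreal(roots)].real
    if roots.size == 0:                      # no crossing: fall back to midpoint
        return 0.5 * (m1 + m2)
    return roots[np.argmin(np.abs(roots - 0.5 * (m1 + m2)))]
```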
Fast StatBoost • The integral image's statistics are obtained directly from the weighted input data • Input: N training integral images x_1, …, x_N and their current weights w_i(m) • We compute: • Sample total weight: W = Σ_i w_i(m) • Sample mean vector: μ = (1/W) Σ_i w_i(m) x_i • Sample covariance matrix: Σ = (1/W) Σ_i w_i(m) (x_i − μ)(x_i − μ)ᵀ
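The three quantities above in a few lines of numpy (standard weighted-statistics formulas; variable names are mine):

```python
import numpy as np

def weighted_stats(X: np.ndarray, w: np.ndarray):
    """X: N x d matrix of integral images (one per row); w: N current weights."""
    W = w.sum()                              # sample total weight
    mu = (w[:, None] * X).sum(axis=0) / W    # sample mean vector
    Xc = X - mu
    Sigma = (w[:, None] * Xc).T @ Xc / W     # sample covariance matrix
    return W, mu, Sigma
```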
Fast StatBoost • To train a weak classifier: • Extract the class-conditional integral image statistics • Time complexity: O(Nd²) • The factor d² is negligible because fast algorithms exist, hence in practice O(N) • Train T feature classifiers by projecting the statistics into 1D • Time complexity: O(T) • Select the best feature classifier • Time complexity: O(T) • Overall time complexity: O(N+T)
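Putting the pieces together, a sketch of the O(N+T) structure of one weak-classifier step: the class statistics are computed once, and each of the T sparse features is then scored from its projected 1-D statistics. The separation score below is only a stand-in for the paper's weighted-error criterion.

```python
import numpy as np

def best_feature(stats_pos, stats_neg, features):
    """stats_*: (mu, Sigma) per class; features: sequence of sparse vectors t."""
    best_idx, best_score = None, -np.inf
    for idx, t in enumerate(features):       # O(T) loop
        # With sparse t (< 20 non-zeros) each projection is effectively O(1).
        m1, v1 = float(t @ stats_pos[0]), float(t @ stats_pos[1] @ t)
        m2, v2 = float(t @ stats_neg[0]), float(t @ stats_neg[1] @ t)
        score = abs(m1 - m2) / np.sqrt(v1 + v2 + 1e-12)   # separation proxy
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score
```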
Experimental Results • Setup • Intel Pentium IV 2.8GHz • 19 feature types, 295,920 Haar-like features [Figure: the nineteen Haar-like feature types used in our experiments, grouped into edge, corner, diagonal line, line, and center-surround features] • Time for extracting the statistics: • Main factor: covariance matrices • GotoBLAS: 0.49 seconds per matrix • Time for training T features: 2.1 seconds • Total training time: 3.1 seconds per weak classifier with 300K features • Existing methods: 1-10 minutes with 40K features or fewer
Experimental Results • Comparison with Fast AdaBoost (J. Wu et al. '07), the fastest known implementation of Viola-Jones' framework:
Experimental Results • Performance of a cascade: ROC curves of the final cascades for face detection
Conclusions • Fast StatBoost: use of statistics instead of input data to train feature classifiers • Time: • Reduces face detector training time from up to a month to 3 hours • Significant gains in both N and T with little increase in training time, due to the O(N+T) cost per weak classifier • Accuracy: • Even better accuracy for the face detector, because many more Haar-like features are explored
Detection with Multi-exit Asymmetric Boosting • CVPR’08 poster paper: • Minh-Tri Pham and Viet-Dung D. Hoang and Tat-Jen Cham. Detection with Multi-exit Asymmetric Boosting. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, Alaska, 2008. • Won Travel Grant Award
Problem overview • Common appearance-based approach: classify a window with a cascade of boosted classifiers • F1, F2, …, FN: boosted classifiers • f1,1, f1,2, …, f1,K: weak classifiers of F1 • θ: threshold of F1 [Diagram: the cascade passes a window from F1 through FN, rejecting it as non-object at any stage; within F1, the window passes if f1,1(x) + f1,2(x) + … + f1,K(x) > θ and is rejected otherwise]
Objective • Find f1,1, f1,2, …, f1,K and the threshold θ such that: • F1 meets the desired error bounds, FAR(F1) ≤ β0 and FRR(F1) ≤ α0 • K is minimized (K is proportional to F1's evaluation time) [Diagram: F1 passes a window if f1,1(x) + … + f1,K(x) > θ, otherwise rejects it]
Existing trends (1) Idea • For k from 1 until convergence: • Learn a new weak classifier f1,k(x) • Adjust θ to see whether we can achieve FAR(F1) ≤ β0 and FRR(F1) ≤ α0 • Break the loop if such a θ exists Issues • Weak classifiers are sub-optimal w.r.t. the training goal • Too many weak classifiers are required in practice
Existing trends (2) Idea • For k from 1 until convergence: • Learn a new weak classifier f1,k(x) with an asymmetric goal • Break the loop if FAR(F1) ≤ β0 and FRR(F1) ≤ α0 Pros • Reduces FRR at the cost of increasing FAR, which is acceptable for cascades • Fewer weak classifiers Cons • How to choose the asymmetric goal? • Much longer training time Solution to the con • Trial and error: choose the goal such that K is minimized
Our solution • Learn every weak classifier using the same asymmetric goal • Why?
Because… • Consider two desired bounds (or targets) for learning a boosted classifier: • (1) Exact bound: FAR(F1) ≤ β0 and FRR(F1) ≤ α0 • (2) Conservative bound: a stricter condition that implies (1) • (2) is more conservative than (1) because (2) => (1) • At our asymmetric goal, for every new weak classifier learned, the ROC operating point moves the fastest toward the conservative bound [Figures: FRR-FAR plots of the boosted classifier's operating point after each weak classifier, under the exact bound and under the conservative bound]
Multi-exit Boosting • A method to train a single boosted classifier with multiple exit nodes: an exit node is a weak classifier followed by a decision to continue or reject [Diagram: weak classifiers f1, f2, …, f8 in sequence; some of them are exit nodes at which the window may be rejected as non-object, and these exit nodes play the roles of F1, F2, F3 in a conventional cascade] • Features: • All weak classifiers are trained with the same asymmetric goal • Every pass/reject decision is guaranteed to meet the desired FAR and FRR bounds • The classifier is a cascade • The score is propagated from one node to the next • Main advantages: • Weak classifiers are learned (approximately) optimally • No training of multiple boosted classifiers • Far fewer weak classifiers are needed than in traditional cascades
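A minimal sketch of how a multi-exit classifier might be evaluated at detection time (illustrative names; the per-exit thresholds are assumptions): a single running score is accumulated over all weak classifiers, and at each exit node the score so far decides whether to continue or reject.

```python
from typing import Callable, Dict, Sequence
import numpy as np

def multi_exit_predict(x: np.ndarray,
                       weak: Sequence[Callable[[np.ndarray], float]],
                       exits: Dict[int, float]) -> bool:
    """`exits` maps a weak-classifier index (1-based) to that exit's threshold."""
    score = 0.0
    for i, f in enumerate(weak, start=1):
        score += f(x)                        # the score is propagated, never reset
        if i in exits and score <= exits[i]:
            return False                     # reject at this exit node
    return True                              # survived every exit: object
```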
Results: Goal vs. number of weak classifiers (K) • Toy problem: learn a (single-exit) boosted classifier F for classifying face/non-face patches such that FAR(F) < 0.8 and FRR(F) < 0.01 • The best goal (found by trial and error) is compared with the goal our method chooses [Figure: K as a function of the asymmetric goal] • Similar results were obtained for tests on other desired error rates.
Ours vs. Others (in Face Detection) • Fast StatBoost is used as the base method for fast training of each weak classifier.
Ours vs. Others (in Face Detection) • MIT+CMU Frontal Face Test set:
Conclusion • Multi-exit Asymmetric Boosting trains every weak classifier approximately optimally. • Better accuracy • Far fewer weak classifiers • Significantly reduces training time • No more trial-and-error for training a boosted classifier
Margin-based Bounds on an Asymmetric Error of a Classifier • CVPR’08 poster paper: • Minh-Tri Pham and Viet-Dung D. Hoang and Tat-Jen Cham. Detection with Multi-exit Asymmetric Boosting. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, Alaska, 2008. • Won Travel Grant Award
Motivation • A number of cost-sensitive learning methods have been proposed to deal with binary classification on an imbalanced dataset: • Cost-sensitive decision trees (Knoll et al. '94) • Cost-sensitive neural networks (Kukar and Kononenko '98) • Imbalanced SVM (Veropoulos et al. '99) • Asymmetric Boosting (Karakoulas and Shawe-Taylor '99, Viola and Jones '02) • Their objective functions share the form of an asymmetric error, a·FRR(h) + b·FAR(h), where h(x) is the prediction for input x and the costs a, b are given • Bounds on the generalization error of a classifier exist, but bounds on an asymmetric error have not been proposed yet.
Why bound an Asymmetric Error? • The generalization error is a special case of an asymmetric error: choosing the costs a = P(y = +1) and b = P(y = −1), the asymmetric error reduces to the generalization error, as shown below. • It helps to explain how the classifier performs on unknown data in problems with imbalanced prior probabilities.
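In symbols, assuming the cost-weighted form of the asymmetric error introduced in the previous slide:

```latex
\[
E_{a,b}(h) = a\,\mathrm{FRR}(h) + b\,\mathrm{FAR}(h)
           = a\,P\big(h(x)=-1 \mid y=+1\big) + b\,P\big(h(x)=+1 \mid y=-1\big).
\]
Choosing the costs as the class priors, $a = P(y=+1)$ and $b = P(y=-1)$,
\[
E_{a,b}(h) = P(y=+1)\,P\big(h(x)\neq y \mid y=+1\big)
           + P(y=-1)\,P\big(h(x)\neq y \mid y=-1\big)
           = P\big(h(x)\neq y\big),
\]
which is the ordinary generalization error.
```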
This work’s contribution... • To give bounds on an asymmetric error of a binary classifier:
Summary • Online Asymmetric Boosting: integrates Asymmetric Boosting with online learning • Fast Training and Selection of Haar-like Features using Statistics: dramatically reduces training time from weeks to a few hours • Multi-exit Asymmetric Boosting: approximately minimizes the number of weak classifiers