Smooth Boosting By Using An Information-Based Criterion. Kohei Hatano, Kyushu University, Japan
Organization of this talk • Introduction • Preliminaries • Our booster • Experiments • Summary
Boosting • Methodology to combine prediction rules into a more accurate one. • E.g., learning a rule to classify web pages about "Drew Barrymore" of the "Barrymore family" of Hollywood (Lionel Barrymore, her granduncle; John Drew Barrymore, her father; Jaid Barrymore, her mother; John Barrymore, her grandpa; Diana Barrymore, her aunt). • Set of prediction rules = words (e.g., "Drew"? "Barrymore"? "Charlie's Angels"?), each with accuracy only about 51% on the labeled training data (web pages). • A combination of prediction rules (say, a majority vote) reaches accuracy 80%. [Figure: word-based yes/no rules applied to labeled web pages and combined by a majority vote]
Boosting by filtering [Schapire 90], [Freund 95] • Boosting scheme that uses random sampling from (huge) data: the boosting algorithm samples examples randomly and accepts or rejects each of them. • Advantage 1: the sample size can be determined adaptively. • Advantage 2: smaller space complexity (for the sample): batch learning needs O(1/ε) examples, boosting by filtering only polylog(1/ε) (ε: desired error). [Figure: booster drawing random samples from huge data, accepting some and rejecting others]
Some known results • Boosting algorithms by filtering: Schapire's first boosting alg. [Schapire 90], Boost-by-Majority [Freund 95], MadaBoost [Domingo&Watanabe 00], AdaFlat [Gavinsky 03]. Criterion for choosing prediction rules: accuracy. Are there any better criteria? • A candidate: an information-based criterion. Real AdaBoost [Schapire&Singer 99], InfoBoost [Aslam 00] (a simple version of Real AdaBoost). Criterion for choosing prediction rules: mutual information. Sometimes faster than methods using an accuracy-based criterion (experimental: [Schapire&Singer 99]; theoretical: [Hatano&Warmuth 03], [Hatano&Watanabe 04]). • However, no boosting algorithm by filtering with an information-based criterion was known.
Our work • Boosting by filtering gives lower space complexity; an information-based criterion gives faster convergence. • Our work: efficient boosting by filtering using an information-based criterion.
Introduction • Preliminaries • Our booster • Experiments • Summary
Illustration of general boosting • 1. Choose a prediction rule h1 maximizing some criterion w.r.t. D1. • 2. Assign a coefficient to h1 based on its quality (here, 0.25). • 3. Update the distribution: correctly classified examples get lower weight, misclassified ones get higher weight. [Figure: h1 splits the instances into +1/−1 regions, with correct and wrong examples marked]
Illustration of general boosting (2) • 1. Choose a prediction rule h2 maximizing some criterion w.r.t. D2. • 2. Assign a coefficient to h2 based on its weighted error (here, 0.28). • 3. Update the distribution again: correct examples get lower weight, misclassified ones higher. • Repeat this procedure for T rounds. [Figure: h2 splits the instances into +1/−1 regions, with correct and wrong examples marked]
Illustration of general boosting (3) • Final prediction rule = weighted majority vote of the chosen prediction rules: for an instance x, H(x) = 0.25 h1(x) + 0.28 h2(x) + 0.05 h3(x). • Predict +1 if H(x) > 0, predict −1 otherwise. [Figure: the three rules h1, h2, h3 combined with their coefficients]
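To make the generic loop above concrete, the following is a minimal Python sketch (my illustration, not code from the talk): base rules are scored against the current distribution, given a coefficient, and the distribution is reweighted; the final rule is the sign of the weighted vote. The functions criterion, coefficient, and reweight are placeholders that each concrete booster (AdaBoost, MadaBoost, GiniBoost) instantiates in its own way.

```python
import numpy as np

def generic_boosting(X, y, base_rules, T, criterion, coefficient, reweight):
    """Generic boosting loop.  X: data, y: labels in {-1,+1}, base_rules: list of
    functions h with h(X) -> array of predictions in {-1,+1}.  The functions
    criterion, coefficient, and reweight are supplied by the concrete booster."""
    n = len(y)
    D = np.full(n, 1.0 / n)              # D1: start from the uniform distribution
    chosen, alphas = [], []
    H = np.zeros(n)                      # running weighted vote H_t on the training set
    for t in range(T):
        # 1. choose a prediction rule maximizing the booster's criterion w.r.t. D_t
        h = max(base_rules, key=lambda rule: criterion(rule(X), y, D))
        preds = h(X)
        # 2. assign a coefficient to the chosen rule based on its quality
        alpha = coefficient(preds, y, D)
        chosen.append(h)
        alphas.append(alpha)
        H += alpha * preds
        # 3. update the distribution (correct examples down, wrong examples up) and renormalize
        D = reweight(H, y)
        D = D / D.sum()
    # final rule: weighted majority vote, predict +1 if the vote is positive, -1 otherwise
    return lambda X_new: np.sign(sum(a * h(X_new) for a, h in zip(alphas, chosen)))
```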
Example: AdaBoost [Freund&Schapire 97] • Criterion for choosing prediction rules: the edge γ_t = Σ_i D_t(i) y_i h_t(x_i). • Coefficient: α_t = (1/2) ln((1+γ_t)/(1−γ_t)). • Update: D_{t+1}(i) ∝ exp(−y_i H_t(x_i)), i.e., the weight is exponential in the negative margin (small for correct examples, large for wrong ones). • Difficult examples (possibly noisy) may get too much weight. [Figure: the exponential weighting function of −y_i H_t(x_i)]
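A sketch of AdaBoost's choices plugged into the generic loop above, assuming the standard edge-based formulation (the slide's formulas were images, so this is a reconstruction of the textbook update rather than a transcription):

```python
import numpy as np

def ada_criterion(preds, y, D):
    # edge gamma_t = sum_i D_t(i) * y_i * h_t(x_i): D-weighted correlation with the labels
    return np.sum(D * y * preds)

def ada_coefficient(preds, y, D):
    gamma = np.sum(D * y * preds)
    return 0.5 * np.log((1.0 + gamma) / (1.0 - gamma))   # alpha_t = (1/2) ln((1+gamma)/(1-gamma))

def ada_reweight(H, y):
    # exponential potential: weight_i = exp(-y_i * H_t(x_i)); it grows without bound
    # on repeatedly misclassified (possibly noisy) examples
    return np.exp(-y * H)
```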
Smooth boosting • Keeping the distribution "smooth": sup_x D_t(x)/D_1(x) is polynomially bounded, where D_t is the distribution constructed by the booster and D_1 is the original distribution (e.g., uniform). • Smoothness makes boosting algorithms noise-tolerant: MadaBoost [Domingo&Watanabe 00] (statistical query model), SmoothBoost [Servedio 01] (malicious noise model), AdaFlat [Gavinsky 03] (agnostic boosting model). • Smoothness also means sampling from D_t can be simulated efficiently via sampling from D_1 (e.g., by rejection sampling), so smooth boosters are applicable in the boosting-by-filtering framework (see the sketch below).
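A minimal sketch (my illustration) of how a smooth booster simulates sampling from D_t by rejection sampling from D_1: draw an example from the original source and accept it with probability proportional to its relative weight. The bound B on sup_x D_t(x)/D_1(x), assumed known here, controls the expected number of draws per accepted example.

```python
import random

def sample_from_Dt(draw_from_D1, weight_ratio, B):
    """draw_from_D1() returns an example (x, y) distributed according to D1;
    weight_ratio(x, y) returns D_t(x)/D_1(x); B >= sup_x D_t(x)/D_1(x).
    Accepting with probability weight_ratio/B yields an example distributed
    according to D_t; the expected number of draws per accepted example is B."""
    while True:
        x, y = draw_from_D1()
        if random.random() < weight_ratio(x, y) / B:   # accept
            return x, y
        # otherwise reject this example and draw again
```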
Example: MadaBoost [Domingo & Watanabe 00] • Criterion for choosing prediction rules: the edge (as in AdaBoost). • Coefficient and update: the exponential weight is truncated, so each example's weight never exceeds its initial weight. • Consequence: D_t is 1/ε_t-bounded (ε_t: error of H_t). [Figure: the weighting function l(−y_i H_t(x_i)), an exponential capped at 1]
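A sketch of MadaBoost's truncated exponential weighting, assuming the standard formulation min(1, exp(−margin)) and reusing the function signature from the AdaBoost sketch above (the slide's exact formulas were images):

```python
import numpy as np

def mada_reweight(H, y):
    # truncated exponential potential: weight_i = min(1, exp(-y_i * H_t(x_i)));
    # the cap keeps every weight at most its initial value, which keeps D_t smooth
    return np.minimum(1.0, np.exp(-y * H))
```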
Examples of other smooth boosters • LogitBoost [Friedman et al. 00]: logistic weighting function. • AdaFlat [Gavinsky 03]: stepwise linear weighting function.
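For intuition, a small illustrative comparison (mine, not from the talk) of how an example's weight depends on its margin m = y·H(x) under an exponential potential (AdaBoost-style), a truncated exponential (MadaBoost-style), and a logistic curve of the kind associated with LogitBoost-style smooth weighting; the exact functions in the cited papers may differ.

```python
import numpy as np

margins = np.linspace(-3, 3, 7)                  # m = y * H(x); negative means misclassified
exponential = np.exp(-margins)                   # AdaBoost-style: unbounded as m -> -inf
truncated = np.minimum(1.0, np.exp(-margins))    # MadaBoost-style: capped at 1
logistic = 1.0 / (1.0 + np.exp(margins))         # logistic-style: smooth and bounded by 1

for m, e, t, l in zip(margins, exponential, truncated, logistic):
    print(f"m={m:+.1f}  exp={e:8.3f}  truncated={t:5.3f}  logistic={l:5.3f}")
```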
Introduction • Preliminaries • Our booster • Experiments • Summary
Our new booster • Criterion for choosing prediction rules: the pseudo gain (defined on the next slide). • Coefficient and update: based on the same weighting function l(−y_i H_t(x_i)) as MadaBoost. • Still, D_t is 1/ε_t-bounded (ε_t: error of H_t). [Figure: the weighting function l(−y_i H_t(x_i))]
Pseudo gain • Definition of the pseudo gain Γ_t(h) [formula shown on the slide]. • Relation to edge: Γ_t(h) ≥ γ_t(h)², i.e., the pseudo gain is at least the square of the edge (by convexity of the square function).
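To spell out how convexity gives the relation to the edge, here is a short LaTeX derivation under an assumed form of the pseudo gain: suppose Γ_t(h) decomposes over the two output values b ∈ {−1,+1} of h as a branch-probability-weighted sum of squared conditional edges (this decomposition is my assumption for illustration; the paper's exact definition may differ). Jensen's inequality for the convex square function then yields Γ_t(h) ≥ γ_t(h)².

```latex
% Assumed decomposition (illustration only): p_b = \Pr_{D_t}[h(x)=b],
% \gamma_b = \mathbb{E}_{D_t}[\,y\,h(x) \mid h(x)=b\,], so the edge is
% \gamma_t(h) = \sum_b p_b \gamma_b and the pseudo gain is \Gamma_t(h) = \sum_b p_b \gamma_b^2.
\Gamma_t(h) \;=\; \sum_{b\in\{-1,+1\}} p_b\,\gamma_b^2
\;\ge\; \Bigl(\sum_{b\in\{-1,+1\}} p_b\,\gamma_b\Bigr)^{\!2}
\;=\; \gamma_t(h)^2
\qquad\text{(Jensen, convexity of } z\mapsto z^2\text{).}
```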
Interpretation of pseudo gain • Choosing h to maximize the pseudo gain = choosing h to minimize the conditional entropy of the labels given h = choosing h to maximize the mutual information between h and the labels. • But the entropy function here is NOT Shannon's entropy; it is defined with the Gini index.
Information-based criteria [Kearns & Mansour 98] • Our booster (GiniBoost) chooses a prediction rule maximizing the mutual information defined with the Gini index. • Cf. Real AdaBoost and InfoBoost choose a prediction rule that maximizes the mutual information defined with the KM entropy. • Good news: the Gini index can be estimated efficiently via sampling! (See the sketch below.)
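A sketch of selecting a base prediction rule by a Gini-index-based information gain, in the generic form "impurity of the labels minus expected impurity after splitting on h", all measured under the current distribution D. This conveys the idea of the criterion; it is not necessarily the exact pseudo gain used by GiniBoost.

```python
import numpy as np

def gini(p_pos):
    # Gini index of a binary label distribution with positive-label probability p_pos
    return 2.0 * p_pos * (1.0 - p_pos)

def gini_gain(preds, y, D):
    """Gini-based information gain of a base rule with respect to the labels,
    measured under the current (normalized) distribution D."""
    p_pos = np.sum(D[y == +1])                    # Pr_D[label = +1]
    gain = gini(p_pos)                            # label impurity before the split
    for b in (-1, +1):                            # the two branches h(x) = b
        mask = (preds == b)
        p_b = np.sum(D[mask])
        if p_b > 0:
            p_pos_b = np.sum(D[mask & (y == +1)]) / p_b
            gain -= p_b * gini(p_pos_b)           # subtract expected impurity after the split
    return gain

def choose_rule(X, y, D, base_rules):
    # pick the base prediction rule with the largest Gini-based gain w.r.t. D
    return max(base_rules, key=lambda h: gini_gain(h(X), y, D))
```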
Convergence of training error (GiniBoost) • Thm. Suppose that the training error of H_t is greater than ε for t = 1,…,T. Then [bound in terms of the pseudo gains, formula shown on the slide]. • Coro. Further, if Γ_t(h_t) ≥ Γ for every t, then train.err(H_T) ≤ ε within T = O(1/(εΓ)) steps.
Comparison on convergence speed • GiniBoost: iteration bound stated in terms of Γ, the minimum pseudo gain. • Accuracy-based smooth boosters (e.g., MadaBoost): iteration bound stated in terms of γ, the minimum edge. • Since Γ ≥ γ² (previous slide), the pseudo-gain criterion can give faster convergence.
Boosting-by-filtering version of GiniBoost (outline) • Multiplicative bounds for the pseudo gain (and more practical bounds using the central limit approximation). • Adaptive prediction rule selector (see the sketch below). • Boosting algorithm in the PAC learning sense.
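A rough sketch (my illustration, under simplifying assumptions) of the adaptive-estimation idea for a filtering booster: keep drawing filtered examples, maintain a Hoeffding-style confidence radius for a candidate rule's gain, and stop once the estimate is known to multiplicative accuracy. The stopping rule and constants here are illustrative, not the paper's exact bounds (which also use a central limit approximation).

```python
import math

def estimate_gain_adaptively(sample_one, gain_increment, delta=0.05, eps_rel=0.5, max_n=100000):
    """Adaptively estimate a bounded quantity (e.g., a candidate rule's gain) from
    filtered examples.  sample_one() draws one example from D_t (e.g., by rejection
    sampling); gain_increment(example) returns a per-example term in [-1, 1] whose
    mean is the quantity of interest.  Stops once a Hoeffding-style confidence
    radius falls below eps_rel times the current estimate."""
    total, n = 0.0, 0
    while n < max_n:
        total += gain_increment(sample_one())
        n += 1
        mean = total / n
        radius = math.sqrt(math.log(2.0 / delta) / (2.0 * n))   # Hoeffding confidence radius
        if radius <= eps_rel * abs(mean):                       # multiplicative accuracy reached
            return mean, n
    return total / max_n, max_n
```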
Introduction • Preliminaries • Our booster • Experiments • Summary
Experiments • Topic classification of Reuters news (Reuters-21578). • Binary classification for each of 5 topics (results are averaged). • 10,000 examples. • 30,000 words used as base prediction rules. • Run algorithms until they sample 1,000,000 examples in total. • 10-fold CV.
Test error over Reuters [Figure: test error curves over the Reuters data] • Note: GiniBoost2 doubles the two coefficients (for predictions +1 and −1) used in GiniBoost.
Execution time • Faster by about 4 times! (Cf. a similar result without sampling: Real AdaBoost [Schapire & Singer 99].)
Introduction • Preliminaries • Our booster • Experiments • Summary
Summary / Open problem • Summary: GiniBoost uses the pseudo gain (based on the Gini index) to choose base prediction rules, and shows faster convergence in the filtering scheme. • Open problem: theoretical analysis of noise tolerance.
Comparison on sample size • Observation: fewer accepted examples → faster selection of prediction rules.