Stacked Generalization: an overview of the paper by David H. Wolpert Jim Ries JimR@acm.org CECS 477 : Neural Networks February 2, 2000
Introduction • Published in “Neural Networks”, 1992 • David H. Wolpert - Postdoctoral Fellow, Santa Fe Institute. Previously at Los Alamos. Degrees in Physics.
Introduction (cont.) • A “generalizer” is an algorithm that guesses a parent function from a learning set sampled from that parent function. • Neural networks are a subset of generalizers. • Other generalizers exist, such as Bayesian classifiers.
Introduction (cont.) • “Stacked generalization” is a mechanism for minimizing the error rate of one or more generalizers. • Can be used to combine generalizers that have each been taught part of the learning set. • A more sophisticated version of cross-validation (testing generalizers against parts of the learning set they have not seen).
Introduction (cont.) • General Idea: • Partition learning set. • Train on one part. • Observe behavior on the other part of the partition. • Correct for biases.
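A minimal Python sketch of these four steps (the toy data, the 1-nearest-neighbour generalizer, and the mean-residual correction are my illustrative assumptions, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy learning set drawn from an (unknown) parent function.
x = rng.uniform(-1, 1, size=40)
y = x ** 2 + rng.normal(0, 0.05, size=40)

# 1) Partition the learning set into two parts.
train, held_out = (x[:20], y[:20]), (x[20:], y[20:])

def one_nn(train_x, train_y, q):
    """A trivial level 0 generalizer: guess the y of the nearest x."""
    return train_y[np.argmin(np.abs(train_x - q))]

# 2) Train on one part; 3) observe its behaviour on the other part.
residuals = [y_o - one_nn(*train, q) for q, y_o in zip(*held_out)]

# 4) Correct for the observed bias when answering a new question.
q = 0.3
guess = one_nn(x, y, q) + np.mean(residuals)
print(guess)
```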
Topics of Discussion • Existing “winner-takes-all” alternatives. • Detailed description of stacked generalization. • An experiment using stacked generalization. • Variations and Extensions. • Concluding Thoughts.
Existing “Winner-Takes-All” Strategies • Cross-validation & generalized cross-validation. • Bootstrapping. • Given a set of candidate generalizers {Gj}, these techniques choose the single best G ∈ {Gj}, i.e., the candidate whose estimated error is minimized.
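For contrast with stacking, a hedged sketch of this winner-takes-all baseline, assuming scikit-learn (which post-dates these slides) and two arbitrary candidate generalizers:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, size=100)

# Candidate generalizers {Gj}.
candidates = [LinearRegression(), KNeighborsRegressor(n_neighbors=5)]

# Estimate each candidate's error by cross-validation ...
cv_errors = [-cross_val_score(g, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
             for g in candidates]

# ... and keep only the single best G in {Gj}; the rest are discarded.
winner = candidates[int(np.argmin(cv_errors))]
winner.fit(X, y)
```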
Stacked Generalization • Generalizer definition: • maps a learning set {xk ∈ R^n, yk ∈ R} together with a question q ∈ R^n into a guess ∈ R. • If the generalizer returns the correct yi whenever q is one of the xi, then it reproduces the learning set.
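To make the definition concrete, a generalizer can be encoded as a plain function from a learning set and a question to a guess; the 1-nearest-neighbour sketch below (my own minimal example, not Wolpert's notation) also reproduces its learning set:

```python
import numpy as np

def nn_generalizer(learning_set, q):
    """Map a learning set {xk in R^n, yk in R} and a question q in R^n to a guess in R."""
    xs, ys = learning_set
    return ys[np.argmin(np.linalg.norm(xs - q, axis=1))]

xs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
ys = np.array([0.0, 1.0, 2.0])

# When the question q is one of the xi, the correct yi comes back:
# this generalizer reproduces its learning set.
assert nn_generalizer((xs, ys), np.array([1.0, 0.0])) == 1.0
```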
Stacked Generalization (cont.) • Split the learning set θ ⊂ R^(n+1) into 2 disjoint sets, θi1 and θi2, called partition sets. • Cross-validation takes a set of candidate generalizers {Gj}, trains each on θi1, and chooses the candidate with the least error when fed the test partition set θi2.
Stacked Generalization (cont.) • Stacked generalization combines all of the generalizers rather than choosing a “best” one. • The case of a single generalizer is still interesting: there, stacked generalization is essentially a guard against over-fitting.
Stacked Generalization (cont.) • Define the space R^(n+1) in which θ lives as “level 0”, and any generalizer of θ as a “level 0” generalizer. • Look at a set of k numbers determined by the N generalizers {Gj} working together within each partition.
Stacked Generalization (cont.) • Consider each set of k numbers as the input part of a point in R^(k+1) (the “level 1” space). • Generalize from these level 1 points by operating a generalizer in the level 1 space. • Thus, we have a “stack” of generalizers. • Stacks with more than 2 levels are possible.
Stacked Generalization (cont.) • Problem: What generalizer(s) to use at each level? (unanswered)
Stacked Generalization (cont.) Multiple Generalizers: 1) Creating L' • Level 0: learning set θ, partition sets θij, generalizers {Gp}. • Level 1: learning set L', containing r elements, one for each partition in the level 0 partition set. • L' inputs: the level 0 guesses G1(θi1; in(θi2)), G2(θi1; in(θi2)), ... • L' output: out(θi2).
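A sketch of step 1 in Python, assuming scikit-learn-style level 0 generalizers and leave-one-out partitions (so that, as on the slide, L' has r elements, one per partition); these choices are illustrative, not prescribed by the paper:

```python
import numpy as np
from sklearn.base import clone

def build_level1_set(X, y, generalizers):
    """Create the level 1 learning set L' from the level 0 set (X, y)."""
    r = len(X)                                   # one partition per point
    L1_in = np.empty((r, len(generalizers)))
    L1_out = np.empty(r)
    for i in range(r):
        keep = np.arange(r) != i                 # theta_i1: everything but point i
        fitted = [clone(g).fit(X[keep], y[keep]) for g in generalizers]
        # L' input: the level 0 guesses Gp(theta_i1; in(theta_i2)) ...
        L1_in[i] = [g.predict(X[i:i + 1])[0] for g in fitted]
        # ... paired with the L' output out(theta_i2).
        L1_out[i] = y[i]
    return L1_in, L1_out
```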
L’ outputs L’ inputs Stacked Generalization (cont.) Multiple Generalizers: 2) Guessing Final guess Level 1/ Learning set L’. Generalizer G’. Question q’.. G’(L’;q’) Q’, the level 1 question Level 0/ Learning set . Generalizers {Gp}. Question q. G1(; q) G2(; q) ...
Experiment • NETtalk “reading aloud” problem. • Input: a window of 7 letters. • Output: the English phoneme a human would utter for the middle letter when reading aloud. • Several separate generalizers combined.
Experiment (cont.) • Best level 0 generalizer got 69% correct. • Stacked generalization got 88% correct.
Variations and Extensions • Consider the level 1 output not as a guess but as an estimate of the error of a guess (this can be tweaked with a constant specifying what fraction of the estimated error to apply). • Consider the entire stacked structure as a single generalizer, which can itself be stacked.
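One possible reading of the first variation, sketched under the same scikit-learn assumptions; the name error_correcting_guess and the constant alpha (the fraction of the estimated error to trust) are mine, and feeding the level 0 question itself to level 1 is just one of several reasonable choices:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

def error_correcting_guess(X, y, q, generalizer, alpha=1.0):
    """Level 1 estimates the *error* of the level 0 guess instead of the answer."""
    r = len(X)
    errors = np.empty(r)
    for i in range(r):
        keep = np.arange(r) != i
        g = clone(generalizer).fit(X[keep], y[keep])
        errors[i] = y[i] - g.predict(X[i:i + 1])[0]   # observed level 0 error

    # Level 1 generalizer: question -> estimated error of the level 0 guess.
    G_prime = LinearRegression().fit(X, errors)

    g_full = clone(generalizer).fit(X, y)
    q = q.reshape(1, -1)
    # Apply only a fraction alpha of the estimated error as a correction.
    return g_full.predict(q)[0] + alpha * G_prime.predict(q)[0]
```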
Concluding Thoughts • Where is the evidence that training on all of θ is not as good as training on part of θ and then adding another layer using the remainder of θ? • How does stacked generalization compare to applying heuristics such as regularization or early stopping to avoid over-fitting?