Competent Undemocratic Committees Włodzisław Duch, Łukasz Itert and Karol Grudziński Department of Informatics, Nicholas Copernicus University, Torun, Poland. http://www.phys.uni.torun.pl/kmk
Motivation
Combining information from different models is known as ensemble learning, mixture of experts, voting classification algorithms, or committees of models.
• An important and popular subject in machine learning, with dedicated conferences and special issues of journals.
• Useful for solving real problems, such as predicting the glucose levels of diabetic patients, for which many classifiers are available.
• Are 10 heads wiser than one?
Committees: 1) improve on the accuracy of a single model; 2) decrease the variance, stabilizing the results.
Variability
Committees need different models. Variability of committee models comes from:
1) Different samples taken from the same data: crossvalidation training, boosting, bagging, arcing ...
• Bagging: train on bootstrap samples, randomly drawing (with replacement) a fixed number of training vectors from the pool containing all training vectors.
• AdaBoost (Adaptive Boosting): assign weights to training instances, higher for incorrectly classified ones.
• Arcing: simplified weighting of the training vectors.
2) Bias of models, due to changes in their complexity: the number of neurons, training parameters, pruning ...
Voting
Let P(Ci|X;Ml) be the posterior probability of class Ci, i = 1..K, estimated by model Ml, l = 1..m. How to determine the committee decision?
• Majority voting: go with the crowd.
• Average the results of all models.
• Select the model that gives the largest probability (highest confidence).
• Set a threshold to select the models with the highest confidence and use majority voting among these models.
• Make a linear combination of the results.
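The five voting rules above can be sketched as follows. This is only an illustration, assuming the model outputs P(Ci|X;Ml) for a single input X are collected in an array `probs` of shape (m, K); the function names, the fallback when no model passes the threshold, and the example numbers are assumptions, not taken from the slides.

```python
import numpy as np

def majority_vote(probs):
    """Each model votes for its most probable class; the most frequent vote wins."""
    votes = probs.argmax(axis=1)
    counts = np.bincount(votes, minlength=probs.shape[1])
    return int(counts.argmax())  # ties resolve to the lowest class index

def average_vote(probs):
    """Average the class probabilities over all models."""
    return int(probs.mean(axis=0).argmax())

def most_confident_vote(probs):
    """Follow the single model that reports the largest probability."""
    _, cls = np.unravel_index(probs.argmax(), probs.shape)
    return int(cls)

def thresholded_majority_vote(probs, threshold=0.6):
    """Majority vote among models whose top probability reaches the threshold."""
    confident = probs[probs.max(axis=1) >= threshold]
    if len(confident) == 0:  # assumed fallback: let every model vote
        confident = probs
    return majority_vote(confident)

def linear_combination_vote(probs, weights):
    """Weighted sum of model outputs with one fixed weight per model."""
    return int((np.asarray(weights, dtype=float) @ probs).argmax())

# Three hypothetical models, three classes.
probs = np.array([
    [0.50, 0.40, 0.10],
    [0.45, 0.35, 0.20],
    [0.05, 0.90, 0.05],
])
print(majority_vote(probs), average_vote(probs), most_confident_vote(probs))
```

On this example the rules disagree: two lukewarm models outvote one confident model under majority voting, while averaging and the confidence-based rules follow the confident one.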
More on voting
Each model need not be accurate for all data, but should account well for a different subset of the data.
• Krogh and Vedelsby: generalization error is small if highly accurate classifiers that disagree with each other are used.
• Xin Yao: diversify models, create negative correlation between individual models and average the results (no GA).
• Jacobs: mixture of experts, a neural architecture with a gating network that selects the most competent model.
• Ortega et al.: a "referee meta-model" deciding which model should contribute to the final decision.
Competent Models: idea
Democratic voting: all models always try to contribute.
Undemocratic voting: only experts on local issues should vote.
• For each model, identify the feature space regions where it is incompetent.
• Use a penalty factor to decrease the influence of an incompetent model during voting.
Biological inspiration:
• Only a small subset of cortical modules is used by the brain for a given task.
• Incompetent modules are inhibited by rough evaluation of inputs at the thalamic level.
Conclusion: weights should be input dependent!
Competent Models: idea
• A linear meta-model combines the model outputs with one weight per model, giving m additional parameters.
• Models that have small weights may still be useful in some areas of the feature space.
• Use input-dependent weights to inhibit the voting of model Ml for class Ci around a specific X.
• Similarity Based Models use reference vectors; determine the areas of the input space where a given model is competent (makes few errors) and where it fails.
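The formulas on this slide did not survive extraction. One form consistent with the text is given here as an assumption rather than as the authors' exact notation: the linear meta-model uses one weight W_l per model (hence m parameters), and the input-dependent variant lets the weight of model M_l for class C_i vary with X.

```latex
% Linear meta-model (assumed form): m weights W_l
P(C_i \mid X; M) = \sum_{l=1}^{m} W_l \, P(C_i \mid X; M_l)

% Input-dependent weights (assumed form)
P(C_i \mid X; M) = \sum_{l=1}^{m} W_{i,l}(X) \, P(C_i \mid X; M_l)
```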
Committees of Competent Models
• Optimize parameters for all models Ml, l = 1...m, on the training set using a cross-validation procedure.
• For each model Ml, l = 1...m:
  • for all training vectors Ri generate predicted classes Cl(Ri);
  • if Cl(Ri) ≠ C(Ri), i.e. model Ml makes an error for vector Ri, determine the area of incompetence of the model by finding the distance di,j to the nearest vector Rj that Ml has classified correctly;
  • set the parameters of the incompetence factor F(||X − Ri||;Ml) in such a way that its value decreases significantly for ||X − Ri|| ≤ di,j/2.
• The incompetence function F(X;Ml) for the model is the product of the factors F(||X − Ri||;Ml) over all training vectors that it has handled incorrectly.
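The step that locates the areas of incompetence can be sketched as below for a single model. This is a minimal illustration, assuming Euclidean distance and that the model's predictions on the training vectors are already available; the function name `incompetence_regions` and the toy data are hypothetical.

```python
import numpy as np

def incompetence_regions(R, y_true, y_pred):
    """Return (centers, radii) describing where one model is incompetent.

    centers: training vectors R_i that the model misclassified.
    radii:   d_i / 2, where d_i is the Euclidean distance from R_i to the
             nearest training vector the model classified correctly.
    """
    R = np.asarray(R, dtype=float)
    wrong = np.asarray(y_true) != np.asarray(y_pred)
    centers, correct = R[wrong], R[~wrong]
    if len(correct) == 0:
        # No correctly classified vector exists: the radius is unbounded.
        return centers, np.full(len(centers), np.inf)
    # Pairwise distances between misclassified and correctly classified vectors.
    diffs = centers[:, None, :] - correct[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return centers, dists.min(axis=1) / 2.0

# Toy data: the model misclassifies only the second vector.
R = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
centers, radii = incompetence_regions(R, y_true, y_pred)
print(centers, radii)
```

The returned centers and radii are the inputs that an incompetence factor F(||X − Ri||;Ml) would need for each error vector.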
CCM
Examples of F(||X − Ri||;Ml):
• Gaussian function: F(||X − Ri||;Ml) = 1 − G(||X − Ri||^a; σi), where the coefficient a is used to flatten the function.
• F(||X − Ri||;Ml) = 1/(1 + ||X − Ri||^(−a)), similar in shape to the Gaussian.
• Sum of two logistic functions: σ(||X − Ri|| − di,j/2) + σ(−||X − Ri|| − di,j/2).
Vectors that cannot be correctly classified show up as errors that all models make, but some vectors that are erroneously classified by one model may be correctly handled by another.
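The three factor shapes can be sketched as follows. The slide's formulas are partly garbled, so this is one reading of them: the exact form of G, the `slope` parameter, and the scaling of distances by the radius in `model_competence` are assumptions made for illustration.

```python
import numpy as np

def gaussian_factor(dist, sigma, a=1.0):
    """1 - G(dist^a; sigma), assuming G(u; s) = exp(-u^2 / (2 s^2))."""
    u = np.asarray(dist, dtype=float) ** a
    return 1.0 - np.exp(-u ** 2 / (2.0 * sigma ** 2))

def inverse_factor(dist, a=4.0):
    """1 / (1 + dist^(-a)), written as dist^a / (1 + dist^a) so dist = 0 gives 0."""
    u = np.asarray(dist, dtype=float) ** a
    return u / (1.0 + u)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_window_factor(dist, radius, slope=10.0):
    """Sum of two logistic functions: small for dist < radius, near 1 outside."""
    dist = np.asarray(dist, dtype=float)
    return logistic(slope * (dist - radius)) + logistic(slope * (-dist - radius))

def model_competence(x, centers, radii, a=4.0):
    """F(X; M_l): product of inverse-type factors over all error vectors R_i.

    Distances are divided by each radius (assumed scaling) so that every
    factor equals 0.5 exactly at ||X - R_i|| = radius.
    """
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float).reshape(-1, x.size)
    radii = np.asarray(radii, dtype=float)
    dist = np.linalg.norm(centers - x, axis=1)
    return float(np.prod(inverse_factor(dist / radii, a)))

centers, radii = [[1.0, 0.0]], [0.5]
for x in ([1.0, 0.0], [1.0, 0.5], [6.0, 0.0]):
    print(x, model_competence(x, centers, radii))
```

With this scaling the competence is 0 at the error vector itself, 0.5 on the boundary of its region, and close to 1 far away, so the model votes almost normally outside the regions where it failed.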
CCM voting Use confidence factors to modify the linear weights. Confidence factors are products over local regions.
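The formula for this slide is missing from the extracted text. A minimal sketch of one plausible reading, in which each linear weight Wl is multiplied by the competence factor F(X;Ml) before the model outputs are combined, is shown below; the function name `ccm_vote`, the renormalization, and the numbers are assumptions.

```python
import numpy as np

def ccm_vote(probs, weights, competence):
    """Combine model outputs with weights W_l scaled by F(X; M_l).

    probs:      (m, K) array of P(C_i | X; M_l) for one input X.
    weights:    (m,) linear weights W_l.
    competence: (m,) factors F(X; M_l) in [0, 1], evaluated at the same X.
    Returns the winning class index and the renormalized class scores.
    """
    probs = np.asarray(probs, dtype=float)
    w = np.asarray(weights, dtype=float) * np.asarray(competence, dtype=float)
    combined = w @ probs
    total = combined.sum()
    if total > 0:  # renormalize so the scores sum to one
        combined = combined / total
    return int(combined.argmax()), combined

probs = [[0.7, 0.3],   # model M1
         [0.2, 0.8]]   # model M2
weights = [0.5, 0.5]

# Both models competent at X: the confident M2 decides.
print(ccm_vote(probs, weights, competence=[1.0, 1.0])[0])
# M2 is inside one of its areas of incompetence: its vote is inhibited.
print(ccm_vote(probs, weights, competence=[1.0, 0.05])[0])
```

The same pair of model outputs leads to different committee decisions depending on where X lies, which is the sense in which the voting is undemocratic.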
Numerical experiment: data
Dataset: Telugu vowel data, 871 vectors, 3 features (dominant formants), 6 classes (vowels) [1].
[1] Pal, S.K. and Mitra, S. (1999) Neuro-Fuzzy Pattern Recognition. J. Wiley, New York.
Models included in the committee:
• k=10, Euclidean (M1)
• k=13, Manhattan (M2)
• k=5, Euclidean (M3)
• k=5, Manhattan (M4)
Accuracy of models Accuracy of all models for each class, in %:
Comparison of results Dataset: Telugu vowel (871 vectors, 3 features)
Comparison of committee results Results for Telugu vowel data:
Conclusions
• Assigning competence factors in various voting procedures is an attractive idea.
• Learning becomes modular: each model specializes in different subproblems.
Some ideas:
• Combine DT, kNN, NN models.
• Use CCM with adaptive boosting.
• Use ROC curves to increase the AUC by providing a convex combination of individual ROC curves.
• Diversify models by adding explicit negative correlation.
• Use a constructive approach: add to the committee new models that correctly classify the remaining vectors.