
Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees


Presentation Transcript


  1. Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees. Radford M. Neal and Jianguo Zhang, winners of the NIPS 2003 feature selection challenge, University of Toronto

  2. The results • Combination of Bayesian neural networks and classification based on Bayesian clustering with a Dirichlet diffusion tree model. • A Dirichlet diffusion tree method is used for Arcene. • Bayesian neural networks (as in BayesNN-large) are used for Gisette, Dexter, and Dorothea. • For Madelon, the class probabilities from a Bayesian neural network and from a Dirichlet diffusion tree method are averaged, then thresholded to produce predictions.
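
As an illustration of the combining rule used for Madelon, here is a minimal sketch in Python. The probability values, the variable names, and the 0.5 threshold are illustrative assumptions, not the authors' actual outputs or code:

```python
import numpy as np

# Hypothetical per-case probabilities of class +1 from the two models.
p_bnn = np.array([0.91, 0.40, 0.55, 0.08])   # Bayesian neural network
p_ddt = np.array([0.85, 0.52, 0.47, 0.12])   # Dirichlet diffusion tree model

# Average the two predictive distributions, then threshold at 1/2
# to get hard class predictions (+1 / -1, as in the challenge).
p_avg = (p_bnn + p_ddt) / 2.0
predictions = np.where(p_avg > 0.5, +1, -1)
print(predictions)   # [ 1 -1  1 -1]
```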

  3. Their General Approach • Use simple techniques to reduce the computational difficulty of the problem, then apply more sophisticated Bayesian methods. • The simple techniques: PCA and feature selection by significance tests. • The sophisticated methods: Bayesian neural networks and Automatic Relevance Determination.

  4. (I) First level feature reduction

  5. Feature selection using significance tests (first level) • An initial feature subset was found by simple univariate significance tests (correlation coefficient, symmetrical uncertainty). • Assumption: relevant variables will be at least somewhat relevant on their own. • For all tests, a p-value was found by comparing the test statistic to its distribution under random permutations of the class labels.
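
A rough sketch of one such univariate screen, using the absolute correlation of each feature with a binary label vector as the test statistic. The permutation count and the 0.05 cutoff in the usage comment are illustrative assumptions, not the authors' exact settings:

```python
import numpy as np

def permutation_pvalues(X, y, n_perm=1000, rng=None):
    """Permutation p-value of |corr(feature, label)| for each column of X."""
    rng = np.random.default_rng(rng)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Observed |correlation| per feature.
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    obs = np.abs(Xc.T @ yc) / denom
    # Null distribution: the same statistic with the labels permuted.
    exceed = np.zeros(X.shape[1])
    for _ in range(n_perm):
        yp = rng.permutation(yc)
        exceed += (np.abs(Xc.T @ yp) / denom) >= obs
    return (exceed + 1) / (n_perm + 1)

# Keep features whose p-value clears an (illustrative) threshold:
# pvals = permutation_pvalues(X, y)
# selected = np.where(pvals < 0.05)[0]
```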

  6. Dimensionality reduction with PCA (an alternative to FS) • There are probably better dimensionality reduction methods than PCA, but that's what we used. One reason is that it's feasible even when p is huge, provided n is not too large: the time required is of order min(pn², np²). • PCA was done using all the data (training, validation, and test).
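
A minimal sketch of why the cost is O(min(pn², np²)): when p is much larger than n, one can diagonalize the n×n Gram matrix instead of the p×p covariance matrix. The function below is an illustrative implementation, not the authors' code:

```python
import numpy as np

def pca_scores(X, k):
    """Project X (n cases x p features) onto its top-k principal components.

    Works through the n x n Gram matrix, so the cost is O(p n^2) even
    when the number of features p is huge.
    """
    Xc = X - X.mean(axis=0)
    gram = Xc @ Xc.T                   # n x n, costs O(p n^2)
    vals, vecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]
    # Scores = U * sqrt(lambda); the p-dimensional loadings are never formed.
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```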

  7. (II) Building the Learning Model & Second-Level Feature Selection

  8. Bayesian Neural Networks

  9. Conventional neural network learning

  10. Bayesian Neural Network Learning • Based on the statistical interpretation of conventional neural network learning: minimizing the error function with a weight-decay penalty corresponds to finding maximum a posteriori (MAP) parameters under a Gaussian prior.

  11. Bayesian Neural Network Learning • Bayesian predictions are found by integration rather than maximization. For a test case x, y is predicted by P(y | x, D) = ∫ P(y | x, θ) P(θ | D) dθ, where D is the training data and θ the network parameters. • Conventional neural networks consider only the parameters with maximum posterior probability. • Bayesian neural networks consider all possible parameters in the parameter space. • Can be implemented with a Gaussian approximation or with MCMC.
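
In practice the integral is approximated by averaging over posterior samples (e.g. draws from an MCMC run). A minimal sketch, where `posterior_thetas` and `net_prob` are hypothetical names: the former stands for a list of sampled parameter vectors, the latter for a forward pass returning P(y=1 | x, θ):

```python
import numpy as np

def bayes_predict(x, posterior_thetas, net_prob):
    """Monte Carlo estimate of P(y=1 | x, D) = integral of
    P(y=1 | x, theta) * P(theta | D) d(theta).

    posterior_thetas: parameter samples from the posterior (e.g. MCMC draws).
    net_prob(x, theta): network forward pass returning P(y=1 | x, theta).
    """
    return np.mean([net_prob(x, theta) for theta in posterior_thetas])
```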

  12. ARD Prior • Still remember the weight decay? • How? (by optimizing the decay parameters) • Associate the weights from each input with their own decay parameter. • There are theories for optimizing the decays. • Result: if an input feature x is irrelevant, its relevance hyper-parameter β = 1/α will tend to be small, forcing the weights from that input to be near zero.
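
A toy sketch of the ARD mechanism in a Bayesian linear model with one precision hyper-parameter per input, updated by MacKay's evidence approximation. This illustrates the idea only; it is not Neal's MCMC treatment of network hyper-parameters, and the fixed noise variance is an assumption:

```python
import numpy as np

def ard_linear(X, y, n_iter=50, noise_var=0.1):
    """Toy ARD for linear regression: one precision alpha_j per input.

    For irrelevant inputs alpha_j grows large, i.e. the relevance
    1/alpha_j shrinks, pinning the corresponding weight near zero.
    """
    n, p = X.shape
    alpha = np.ones(p)
    for _ in range(n_iter):
        # Gaussian posterior over weights given the current alphas.
        S = np.linalg.inv(np.diag(alpha) + X.T @ X / noise_var)
        mu = S @ X.T @ y / noise_var
        # Evidence-approximation update: alpha_j = gamma_j / mu_j^2.
        gamma = 1.0 - alpha * np.diag(S)
        alpha = gamma / (mu ** 2 + 1e-12)
    return mu, 1.0 / alpha  # posterior mean weights, per-input relevances
```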

  13. Some Strong Points of This Algorithm • Bayesian learning integrates over the posterior distribution for the network parameters, rather than picking a single “optimal” set of parameters. This further helps to avoid overfitting. • ARD can be used to adjust the relevance of input features. • Priors can be used to incorporate external knowledge.

  14. Dirichlet Diffusion Trees • A Bayesian hierarchical clustering method

  15. The methods • BayesNN-small: features selected using significance tests. • BayesNN-large: principal components. • BayesNN-DFT-combo: the class probabilities from a Bayesian neural network and from a Dirichlet diffusion tree method are averaged, then thresholded to produce predictions.

  16. About the datasets

  17. The results • http://www.nipsfsc.ecs.soton.ac.uk/

  18. Thanks. Any questions?
