Defeating the Black Box – Neural Networks in HEP Data Analysis
Jan Therhaag (University of Bonn)
TMVA Workshop @ CERN, January 21st, 2011
TMVA on the web: http://tmva.sourceforge.net/
A simple approach:
• Code the classes as a binary variable (here: blue = 0, orange = 1)
• Perform a linear fit \(\hat{y}(x) = w^T x + w_0\) to this discrete target
• Define the decision boundary by \(\hat{y}(x) = 0.5\)

//######################################################################################
// TMVA code
//######################################################################################
// create Factory
TMVA::Factory *factory = new TMVA::Factory("TMVAClassification", outputfile, "AnalysisType=Classification");
factory->AddVariable("x1", 'F');
factory->AddVariable("x2", 'F');
// book linear discriminant classifier (LD)
factory->BookMethod(TMVA::Types::kLD, "LD");
factory->TrainAllMethods();
factory->TestAllMethods();
factory->EvaluateAllMethods();
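The snippets on these slides omit how the training data reaches the Factory. A minimal sketch of that step is given below; the file and tree names and the split options are illustrative assumptions, only the Factory calls themselves (AddSignalTree, AddBackgroundTree, PrepareTrainingAndTestTree) are TMVA API.

// Sketch only: feeding signal and background trees to the Factory before booking any method
TFile *input   = TFile::Open("input.root");      // hypothetical input file
TTree *sigTree = (TTree*)input->Get("TreeS");    // hypothetical signal tree
TTree *bkgTree = (TTree*)input->Get("TreeB");    // hypothetical background tree
factory->AddSignalTree(sigTree, 1.0);            // global event weight 1.0
factory->AddBackgroundTree(bkgTree, 1.0);
TCut preselection = "";                          // no preselection cut
factory->PrepareTrainingAndTestTree(preselection, "SplitMode=Random:NormMode=NumEvents:!V");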
Now consider the sigmoid transformation \(\sigma(a) = \frac{1}{1+e^{-a}}\) applied to the linear discriminant \(a = w^T x + w_0\):
• \(\sigma(a)\) has values in [0,1] and can be interpreted as the probability p(orange | x)
• then obviously p(blue | x) = 1 − p(orange | x) = \(\sigma(-a)\)
We have just invented the neuron!
• \(y = \sigma(a)\) is called the activity of the neuron, while \(a = \sum_i w_i x_i\) is called the activation
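A minimal stand-alone sketch of this neuron (not TMVA code; all names are illustrative):

#include <cmath>
#include <vector>

// Logistic sigmoid: maps the activation to the activity in [0,1]
double sigmoid(double a) { return 1.0 / (1.0 + std::exp(-a)); }

// Single neuron: activation a = sum_i w_i * x_i, activity y = sigmoid(a),
// interpreted as p(orange | x); p(blue | x) = 1 - y = sigmoid(-a)
double neuron(const std::vector<double>& x, const std::vector<double>& w) {
  double a = 0.0;
  for (size_t i = 0; i < x.size(); ++i) a += w[i] * x[i];   // activation
  return sigmoid(a);                                        // activity
}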
The training proceeds via minimization of the error function
• e.g. the cross-entropy error \(E(w) = -\sum_n \left[\, t_n \ln y(x_n;w) + (1-t_n)\ln\bigl(1-y(x_n;w)\bigr) \right]\) for targets \(t_n \in \{0,1\}\)
• The neuron learns via gradient descent*: \(w \rightarrow w - \eta\, \nabla_w E(w)\)
• Examples may be learned one by one (online learning) or all at once (batch learning)
• Overtraining may occur!
*more sophisticated minimization techniques may be used
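As an illustration (not the slide's code), a minimal sketch of online gradient descent for this single neuron, assuming the cross-entropy error above:

#include <cmath>
#include <vector>

struct Example { std::vector<double> x; double t; };   // t = 0 (blue) or 1 (orange)

// Online learning: the weights are updated after every training example
void trainOnline(std::vector<double>& w, const std::vector<Example>& data,
                 double eta, int epochs) {
  for (int e = 0; e < epochs; ++e) {
    for (const Example& ex : data) {
      double a = 0.0;                                             // activation
      for (size_t i = 0; i < w.size(); ++i) a += w[i] * ex.x[i];
      double y = 1.0 / (1.0 + std::exp(-a));                      // activity
      // for the cross-entropy error, dE/dw_i = (y - t) * x_i
      for (size_t i = 0; i < w.size(); ++i) w[i] -= eta * (y - ex.t) * ex.x[i];
    }
  }
}
// Batch learning would instead accumulate (y - t) * x_i over all examples
// and apply a single update per epoch.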
The class of networks used for regression and classification tasks is called feedforward networks
• Neurons are organized in layers
• The output of a neuron in one layer becomes the input for the neurons in the next layer

//######################################################################################
// TMVA code
//######################################################################################
// create Factory
TMVA::Factory *factory = new TMVA::Factory("TMVAClassification", outputfile, "AnalysisType=Classification");
factory->AddVariable("x1", 'F');
factory->AddVariable("x2", 'F');
// book Multi-Layer Perceptron (MLP) network and define the network architecture
factory->BookMethod(TMVA::Types::kMLP, "MLP", "NeuronType=sigmoid:HiddenLayers=N+5,N");
factory->TrainAllMethods();
factory->TestAllMethods();
factory->EvaluateAllMethods();
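For completeness, a sketch of how the trained MLP would later be applied to new events with the TMVA Reader; the weight-file path follows the usual TMVA naming convention and is an assumption here, as are the example variable values.

// Sketch only: applying the trained MLP in the application phase
TMVA::Reader *reader = new TMVA::Reader("!Color:!Silent");
Float_t x1, x2;                        // local variables the Reader reads from
reader->AddVariable("x1", &x1);
reader->AddVariable("x2", &x2);
// weight file written by the Factory (default naming convention, assumed here)
reader->BookMVA("MLP", "weights/TMVAClassification_MLP.weights.xml");
x1 = 0.5f; x2 = -1.2f;                 // fill with the variable values of one event
Double_t mvaValue = reader->EvaluateMVA("MLP");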
• Feedforward networks are universal approximators
• Any continuous function can be approximated with arbitrary precision
• The complexity of the output function is determined by the number of hidden units and the characteristic magnitude of the weights
From neuron training to network training – backpropagation
• In order to find the optimal set of weights w, we have to calculate the derivatives \(\partial E / \partial w_{ij}\)
• Recall the single neuron: \(\partial E / \partial w_i = (y - t)\, x_i\)
• It turns out that \(\partial E / \partial w_{ij} = \delta_j\, z_i\), with \(\delta_j = y_j - t_j\) for output neurons and \(\delta_j = \sigma'(a_j) \sum_k w_{jk}\, \delta_k\) else
While input information is always propagated forward, errors are propagated backwards!
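A minimal sketch of one backpropagation step for a network with a single sigmoid hidden layer and one sigmoid output (illustrative only, not the TMVA implementation; the cross-entropy error is assumed as before):

#include <cmath>
#include <vector>

double sigma(double a) { return 1.0 / (1.0 + std::exp(-a)); }

// x: input, t: target, W1[j][i]: input->hidden weights, w2[j]: hidden->output weights
void backpropStep(const std::vector<double>& x, double t,
                  std::vector<std::vector<double>>& W1,
                  std::vector<double>& w2, double eta) {
  const size_t H = w2.size();
  std::vector<double> z(H);            // hidden activities
  double aOut = 0.0;
  // forward pass: hidden activities and output activation
  for (size_t j = 0; j < H; ++j) {
    double aj = 0.0;
    for (size_t i = 0; i < x.size(); ++i) aj += W1[j][i] * x[i];
    z[j] = sigma(aj);
    aOut += w2[j] * z[j];
  }
  double y = sigma(aOut);
  // backward pass: delta of the output neuron is (y - t) for the cross-entropy error
  double deltaOut = y - t;
  for (size_t j = 0; j < H; ++j) {
    // hidden delta: sigma'(a_j) * w2_j * deltaOut, with sigma'(a) = z * (1 - z)
    double deltaHidden = z[j] * (1.0 - z[j]) * w2[j] * deltaOut;
    w2[j] -= eta * deltaOut * z[j];                       // dE/dw2_j = deltaOut * z_j
    for (size_t i = 0; i < x.size(); ++i)
      W1[j][i] -= eta * deltaHidden * x[i];               // dE/dW1_ji = deltaHidden * x_i
  }
}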
Some issues in network training
• The error function has several minima; the result of the minimization typically depends on the starting values of the weights
• The scaling of the inputs has an effect on the final solution

//######################################################################################
// TMVA code
//######################################################################################
// create Factory
TMVA::Factory *factory = new TMVA::Factory("TMVAClassification", outputfile, "AnalysisType=Classification");
factory->AddVariable("x1", 'F');
factory->AddVariable("x2", 'F');
// book Multi-Layer Perceptron (MLP) network with a fixed random seed and normalized input distributions
factory->BookMethod(TMVA::Types::kMLP, "MLP", "RandomSeed=1:VarTransform=N");
factory->TrainAllMethods();
factory->TestAllMethods();
factory->EvaluateAllMethods();

• Overtraining: bad generalization and overconfident predictions
(plot: NN with 10 hidden units)
Regularization and early stopping
• Early stopping: stop the training before the minimum of E(w) is reached
• a validation data set is needed
• convergence is monitored in TMVA
• Weight decay: penalize large weights explicitly, \(\tilde{E}(w) = E(w) + \frac{\lambda}{2}\sum_i w_i^2\)

//######################################################################################
// TMVA code
//######################################################################################
// create Factory
TMVA::Factory *factory = new TMVA::Factory("TMVAClassification", outputfile, "AnalysisType=Classification");
factory->AddVariable("x1", 'F');
factory->AddVariable("x2", 'F');
// book Multi-Layer Perceptron (MLP) network with regularization
factory->BookMethod(TMVA::Types::kMLP, "MLP", "NCycles=500:UseRegulator");
factory->TrainAllMethods();
factory->TestAllMethods();
factory->EvaluateAllMethods();

(plot: NN with 10 hidden units and λ=0.02)
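The effect of the penalty term on gradient descent can be made explicit (a standard derivation, spelled out here; it is not shown on the slide):

\[
\tilde{E}(w) = E(w) + \frac{\lambda}{2}\sum_i w_i^2
\quad\Rightarrow\quad
\frac{\partial \tilde{E}}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda w_i ,
\]
\[
w_i \;\rightarrow\; w_i - \eta\left(\frac{\partial E}{\partial w_i} + \lambda w_i\right)
= (1-\eta\lambda)\, w_i - \eta\,\frac{\partial E}{\partial w_i} ,
\]

so each update shrinks ("decays") the weights towards zero unless the data gradient counteracts it.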
Network complexity vs. regularization
• Unless prohibited by computing power, a large number of hidden units H is to be preferred
• no ad hoc limitation of the model
• In the limit \(H \to \infty\), network complexity is entirely determined by the typical size of the weights
Advanced Topics: Network learning as inference and Bayesian neural networks
Network training as inference
• Reminder: Given the network output \(y(x;w)\), the error function E(w) is just minus the log likelihood of the training data D: \(E(w) = -\ln p(D\,|\,w)\)
• Similarly, we can interpret the weight decay term as a log probability distribution (a prior) for w: \(\frac{\lambda}{2}\sum_i w_i^2 = -\ln p(w) + \text{const}\)
• Obviously, there is a close connection between the regularized error function and the inference of the network parameters:
\(p(w\,|\,D) = \dfrac{p(D\,|\,w)\; p(w)}{p(D)}\)   (likelihood × prior / normalization)
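Taking the negative logarithm of the posterior makes this correspondence explicit (standard bookkeeping, written out here for clarity):

\[
-\ln p(w\,|\,D) = -\ln p(D\,|\,w) - \ln p(w) + \ln p(D)
= E(w) + \frac{\lambda}{2}\sum_i w_i^2 + \text{const},
\]

so minimizing the regularized error is the same as maximizing the posterior probability of the weights.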
Predictions and confidence
• Minimizing the error corresponds to finding the most probable value \(w_{\mathrm{MP}}\), which is then used to make predictions
• Problem: Predictions for points in regions less populated by the training data may be too confident
Can we do better?
Using the posterior to make predictions
• Instead of using \(w_{\mathrm{MP}}\) alone, we can also exploit the full information in the posterior: \(p(t\,|\,x,D) = \int p(t\,|\,x,w)\, p(w\,|\,D)\, dw\)
See Jiahang's talk this afternoon for details of the Bayesian approach to NN in the TMVA framework!
A full Bayesian treatment
• In a full Bayesian framework, the hyperparameter(s) λ are estimated from the data by maximizing the evidence \(p(D\,|\,\lambda)\)
• no test data set is needed
• the neural network tunes itself
• the relevance of input variables can be tested (automatic relevance determination, ARD)
• Simultaneous optimization of parameters and hyperparameters is technically challenging
• TMVA uses a clever approximation
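The evidence is the likelihood of the data once the weights have been integrated out (written out here for reference; this formula is not on the slide):

\[
p(D\,|\,\lambda) = \int p(D\,|\,w)\, p(w\,|\,\lambda)\, dw ,
\]

and maximizing it with respect to λ trades data fit against model complexity, which fixes the amount of regularization without a separate validation sample.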
Summary (1)
• A neuron can be understood as an extension of a linear classifier
• A neural net consists of layers of neurons; input information always propagates forward, errors propagate backwards
• Feedforward networks are universal approximators
• The model complexity is governed by the typical weight size, which can be controlled by weight decay or early stopping
• In the Bayesian framework, error minimization corresponds to inference and regularization corresponds to the choice of a prior for the parameters
• The Bayesian approach makes use of the full posterior and gives better predictive power
• The amount of regularization can be learned from the data by maximizing the evidence
Summary (2)
Current features of the TMVA MLP:
• Support for regression, binary and multiclass classification (new in 4.1.0!)
• Efficient optional preprocessing (Gaussianization, normalization) of the input distributions
• Optional regularization to prevent overtraining
  + efficient approximation of the posterior distribution of the network weights
  + self-adapting regulator
  + error estimation
Future development in TMVA:
• Automatic relevance determination for input variables
• Extended automatic model (network architecture) comparison
Thank you!
References
Figures taken from:
David MacKay, "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003
Christopher Bishop, "Pattern Recognition and Machine Learning", Springer, 2006
Hastie, Tibshirani, Friedman, "The Elements of Statistical Learning", 2nd Ed., Springer, 2009
These books are also recommended for further reading on neural networks.