Neural Networks – a Model-Free Data Analysis? K. M. Graczyk, IFT, Uniwersytet Wrocławski, Poland
Why Neural Networks? • Inspired by C. Giunti (Torino) • PDFs by Neural Networks – papers of Forte et al. (JHEP 0205:062,2002; JHEP 0503:080,2005; JHEP 0703:039,2007; Nucl.Phys.B809:1-63,2009) • A model-independent way of fitting data and computing the associated uncertainty • Learn, Implement, Publish (LIP rule) • Cooperation with R. Sulej (IPJ, Warszawa) and P. Płoński (Politechnika Warszawska) • NetMaker • GrANNet ;) my own C++ library
Road map • Artificial Neural Networks (NN) – the idea • Feed-forward NN • PDFs by NN • Bayesian statistics • Bayesian approach to NN • GrANNet
Inspired by Nature: the human brain consists of around 10^11 neurons, which are highly interconnected with around 10^15 connections.
Applications • Function approximation, or regression analysis, including time-series prediction, fitness approximation and modeling. • Classification, including pattern and sequence recognition, novelty detection and sequential decision making. • Data processing, including filtering, clustering, blind source separation and compression. • Robotics, including directing manipulators and computer numerical control.
Feed-Forward Artificial Neural Network – the simplest example. [Diagram: input layer → hidden layer → output (target); linear activation functions; weight matrix.]
The i-th perceptron. [Diagram: inputs multiplied by weights, summed, shifted by a threshold, and passed through the activation function to give the output; see the formula below.]
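In formulas, the i-th perceptron computes a weighted sum of its inputs, shifts it by a threshold, and passes the result through the activation function f:

  y_i = f\left( \sum_j w_{ij} x_j - \theta_i \right),

where w_{ij} are the weights, x_j the input signals, \theta_i the threshold, and y_i the output of the perceptron.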
Activation functions • Heaviside function θ(x): a 0 or 1 signal • sigmoid function • tanh(x) • linear: the signal is amplified; with the saturating (sigmoid/tanh) functions the signal is weaker
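The standard definitions of these functions (written here in their usual textbook form):

  \theta(x) = \begin{cases} 0, & x < 0 \\ 1, & x \ge 0 \end{cases}, \qquad
  \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
  \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad
  f_{\mathrm{lin}}(x) = x .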
Architecture • a 3-layer network with two hidden layers: 1:2:1:1 • (2+2+1) weights + (1+2+1) bias connections → #par = 9 (see the count below) • bias neurons are used instead of thresholds • [Diagram: input x → network → output F(x); linear and symmetric sigmoid activation functions.]
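A quick check of the parameter count for the 1:2:1:1 architecture, counting one bias connection per non-input neuron:

  \text{weights: } 1\cdot 2 + 2\cdot 1 + 1\cdot 1 = 5, \qquad
  \text{biases: } 2 + 1 + 1 = 4, \qquad
  \text{total: } 5 + 4 = 9 .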
Neural Networks – Function Approximation • The universal approximation theorem for neural networks states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layer perceptron with just one hidden layer. This result holds only for restricted classes of activation functions, e.g. for the sigmoidal functions. (Wikipedia.org)
A map from one vector space to another. [Diagram: kinematic inputs (e.g. x, Q2) mapped by the network onto the structure function F2(x, Q2).]
Supervised Learning • Propose the error function • in principle any continuous function with a global minimum • motivated by statistics: the standard error function, chi2, etc. (a standard choice is written out below) • Consider a set of data • Train the given NN by showing it the data → minimize the error function • back-propagation algorithms • an iterative procedure which fixes the weights
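A standard choice, assuming independent Gaussian errors on the targets, is the chi2-type error function:

  E_D(w) = \frac{1}{2}\,\chi^2 = \frac{1}{2} \sum_{n=1}^{N} \frac{\left[ y(x_n; w) - t_n \right]^2}{\sigma_n^2},

where x_n are the inputs, t_n the measured targets with uncertainties \sigma_n, and y(x_n; w) the network response for weights w.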
Learning Algorithms • Gradient algorithms • gradient descent (a minimal sketch follows this list) • RPROP (Riedmiller & Braun) • conjugate gradients • Algorithms which look at the curvature • QuickProp (Fahlman) • Levenberg-Marquardt (Hessian) • Newton's method (Hessian) • Monte Carlo algorithms (based on Markov chains)
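A minimal sketch of the simplest option above, plain gradient descent, applied to a single sigmoid neuron with a sum-of-squares error; the data points, learning rate, and iteration count below are illustrative assumptions only, not taken from the talk.

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// One sigmoid neuron y = sigma(w*x + b) trained by gradient descent
// on E = 1/2 * sum_n (y(x_n) - t_n)^2  (unit errors assumed).
int main() {
    std::vector<double> x = {0.0, 0.5, 1.0, 1.5, 2.0};   // illustrative inputs
    std::vector<double> t = {0.1, 0.3, 0.6, 0.8, 0.9};   // illustrative targets
    double w = 0.0, b = 0.0;        // weights initialised to zero
    const double eta = 0.5;         // learning rate (assumed)
    for (int iter = 0; iter < 5000; ++iter) {
        double dw = 0.0, db = 0.0;
        for (std::size_t n = 0; n < x.size(); ++n) {
            double y = 1.0 / (1.0 + std::exp(-(w * x[n] + b)));  // sigmoid response
            double delta = (y - t[n]) * y * (1.0 - y);           // dE/d(net input)
            dw += delta * x[n];
            db += delta;
        }
        w -= eta * dw;              // gradient-descent update of weight and bias
        b -= eta * db;
    }
    std::printf("fitted w = %.3f, b = %.3f\n", w, b);
    return 0;
}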
Overfitting • More complex models describe the data better, but lose generality • bias-variance trade-off • overfitting → large values of the weights • compare with a test set (must be twice as large as the original training set) • regularization → an additional penalty term in the error function, controlled by the decay rate α (see below)
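The regularized error function then takes the standard weight-decay form (the conventional 1/2 prefactor is assumed here):

  \tilde{E}(w) = E_D(w) + \alpha E_W(w), \qquad
  E_W(w) = \frac{1}{2} \sum_{i=1}^{W} w_i^2,

where α is the decay rate and W the number of weights.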
Data Still More Precise than Theory. [Mind map: Nature → observation → measurements → data; physics given directly by the data; most models (QED, nonperturbative QCD) have free parameters and their own problems; a model-independent (nonparametric) statistical analysis of the data, subject only to some general constraints, with an uncertainty assigned to the predictions.]
Fitting data with Artificial Neural Networks. 'The goal of the network training is not to learn an exact representation of the training data itself, but rather to build a statistical model of the process which generates the data' – C. Bishop, 'Neural Networks for Pattern Recognition'.
Parton Distribution Functions with NN. [Plot: F2 data as a function of x and Q2.] Some method is needed, but…
Parton Distribution Functions – S. Forte, L. Garrido, J. I. Latorre and A. Piccione, JHEP 0205 (2002) 062 • A model-independent analysis of the data • Construction of the probability density P[G(Q2)] in the space of the structure functions • But in reality Forte et al. did: • in practice only one neural-network architecture • a probability density in the space of parameters of that one particular NN
Generating Monte Carlo pseudo-data (the idea comes from W. T. Giele and S. Keller). Train Nrep neural networks, one for each set of Ndat pseudo-data points. The Nrep trained neural networks provide a representation of the probability measure in the space of the structure functions.
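A minimal sketch of the replica generation in its simplest form (correlated systematics and normalization uncertainties, which the full procedure includes, are omitted here):

  F_2^{(\mathrm{art})(k)}(x_n, Q_n^2) = F_2^{(\exp)}(x_n, Q_n^2) + r_n^{(k)} \sigma_n,
  \qquad k = 1, \dots, N_{\mathrm{rep}},

where the r_n^{(k)} are Gaussian random numbers with zero mean and unit variance, and \sigma_n is the experimental uncertainty of data point n.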
Uncertainty and correlation of the predictions are computed as ensemble averages over the Nrep trained networks (see below).
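The standard Monte Carlo estimators for these quantities (the exact set of correlations shown on the original slide is not recoverable, so these are the usual definitions):

  \langle F \rangle = \frac{1}{N_{\mathrm{rep}}} \sum_{k=1}^{N_{\mathrm{rep}}} F^{(k)}, \qquad
  \sigma_F^2 = \langle F^2 \rangle - \langle F \rangle^2, \qquad
  \rho(F_1, F_2) = \frac{\langle F_1 F_2 \rangle - \langle F_1 \rangle \langle F_2 \rangle}{\sigma_{F_1}\,\sigma_{F_2}} .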
[Plots: fits after training that is too short, long enough, and too long; 30 data points; the last case shows overfitting.]
My criticism • Does the simultaneous use of artificial data and the chi2 error function overestimate the uncertainty? • Other NN architectures are not discussed • Problems with overfitting (a test set is needed) • A relatively simple approach compared with present techniques in NN computing • The uncertainty of the model predictions should be generated by the probability distribution obtained for the model, rather than by the data itself
GrANNet – Why? • I stole some ideas from FANN • a C++ library, easy to use • user-defined error function (any you wish) • easy access to units and their weights • several ways of initializing a network of a given architecture • Bayesian learning • main objects: classes NeuralNetwork and Unit • learning algorithms: so far QuickProp, Rprop+, Rprop-, iRprop-, iRprop+, … • network response uncertainty (based on the Hessian) • some simple restarting and stopping solutions
Structure of GrANNet • Libraries: • Unit class • Neural_Network class • Activation (activation- and error-function structures) • learning algorithms: RProp+, RProp-, iRProp+, iRProp-, QuickProp, BackProp • generatormt • TNT inverse-matrix package • (a purely illustrative sketch of how such classes fit together follows below)
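A self-contained, purely illustrative sketch of how a Unit/NeuralNetwork pair of classes can fit together for the forward pass; the class and method names below are hypothetical and are not the actual GrANNet API.

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical illustration only -- not the real GrANNet interface.
struct Unit {
    std::vector<double> weights;   // one weight per input, plus a bias at the end
    double respond(const std::vector<double>& in) const {
        double s = weights.back();                       // bias term
        for (std::size_t i = 0; i < in.size(); ++i) s += weights[i] * in[i];
        return std::tanh(s);                             // symmetric sigmoid activation
    }
};

struct NeuralNetwork {
    std::vector<std::vector<Unit>> layers;               // hidden and output layers
    // Build an architecture such as {1, 2, 1, 1}; all weights set to 0.1 here.
    explicit NeuralNetwork(const std::vector<int>& arch) {
        for (std::size_t l = 1; l < arch.size(); ++l) {
            std::vector<Unit> layer(arch[l]);
            for (Unit& u : layer) u.weights.assign(arch[l - 1] + 1, 0.1);
            layers.push_back(layer);
        }
    }
    std::vector<double> respond(std::vector<double> signal) const {
        for (const auto& layer : layers) {               // propagate layer by layer
            std::vector<double> next;
            for (const Unit& u : layer) next.push_back(u.respond(signal));
            signal = next;
        }
        return signal;
    }
};

int main() {
    NeuralNetwork net({1, 2, 1, 1});                     // the 1:2:1:1 example (9 parameters)
    std::printf("F(0.5) = %.4f\n", net.respond({0.5})[0]);
    return 0;
}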
Bayesian Approach ‘common sense reduced to calculations’
Bayesian Framework for BackProp NN (MacKay, Bishop, …) • Objective criteria for comparing alternative network solutions, in particular with different architectures • Objective criteria for setting the decay rate α • Objective choice of the regularizing function E_W • Comparison with test data is not required.
Notation and Conventions (standard symbols used below) • data point (target), a vector: t_n • input, a vector: x_n • network response: y(x_n; w) • data set: D • number of data points: N • number of weights: W
Model Classification • A collection of models H1, H2, …, Hk • We believe that the models are classified by prior probabilities P(H1), P(H2), …, P(Hk) (summing to 1) • After observing the data D → Bayes' rule (below), with P(D|Hi) the probability of D given Hi and P(D) the normalizing constant • Usually at the beginning P(H1) = P(H2) = … = P(Hk)
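Bayes' rule for the model posterior:

  P(H_i \mid D) = \frac{P(D \mid H_i)\, P(H_i)}{P(D)}, \qquad
  P(D) = \sum_k P(D \mid H_k)\, P(H_k) .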
Single Model Statistics • Assume that model Hi is the correct one • Consider the neural network A with weights w • Task 1: assuming some prior probability of w, construct the posterior after including the data • Task 2: consider the space of hypotheses and construct the evidence for them
Constructing the prior and posterior functions – a distribution over the weights! [Plot: prior centred at w0, likelihood, and posterior peaked at wMP; see the formulas below.]
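In the standard MacKay/Bishop form (Gaussian weight prior with decay rate α, likelihood built from the data error E_D; the Z's are normalization constants):

  P(w \mid \alpha, H) = \frac{1}{Z_W(\alpha)}\, e^{-\alpha E_W(w)}, \qquad
  P(D \mid w, H) = \frac{1}{Z_D}\, e^{-E_D(w)},

  P(w \mid D, \alpha, H) = \frac{P(D \mid w, H)\, P(w \mid \alpha, H)}{P(D \mid \alpha, H)}
  = \frac{1}{Z_S(\alpha)}\, e^{-E_D(w) - \alpha E_W(w)} ,

and the maximum of the posterior defines w_MP.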
Computing the posterior: expand around wMP; the Hessian of the regularized error gives the covariance matrix (see below).
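The standard Gaussian (Laplace) approximation around w_MP:

  P(w \mid D, \alpha, H) \approx \frac{1}{Z} \exp\!\left[ -\tilde{E}(w_{\mathrm{MP}})
  - \tfrac{1}{2}\, \Delta w^{T} A\, \Delta w \right], \qquad
  A = \nabla\nabla \tilde{E}\big|_{w_{\mathrm{MP}}}, \qquad \Delta w = w - w_{\mathrm{MP}},

so the covariance matrix of the weights is A^{-1}, and the uncertainty of the network response follows by error propagation, \sigma_y^2 \approx g^{T} A^{-1} g with g = \nabla_w y|_{w_{\mathrm{MP}}}.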
How to fix the proper α? • Two ideas: • Evidence approximation (MacKay): find wMP, then find αMP – valid if the evidence is sharply peaked in α! • Hierarchical approach: perform the integrals over α analytically
Getting αMP: expressed through γ, the effective number of well-determined parameters; an iterative procedure during training (see below).
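The standard MacKay re-estimation formulas, with λ_i the eigenvalues of the data Hessian ∇∇E_D at w_MP:

  \gamma = \sum_{i=1}^{W} \frac{\lambda_i}{\lambda_i + \alpha}, \qquad
  \alpha_{\mathrm{MP}} = \frac{\gamma}{2\, E_W(w_{\mathrm{MP}})},

iterated together with the training: train with the current α, re-estimate α, and repeat until it stabilizes.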
Bayesian Model Comparison – Occam Factor • evidence ≈ best-fit likelihood × Occam factor (see below) • the log of the Occam factor measures the amount of information we gain after the data have arrived • complex models: a larger accessible phase space before the data (broad prior), of which the posterior occupies a small fraction → a large Occam penalty • simple models: a smaller accessible phase space → a small Occam penalty
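In formulas (MacKay's standard decomposition, with Δw_posterior and Δw_prior the accessible ranges of the weights after and before seeing the data):

  P(D \mid H) = \int P(D \mid w, H)\, P(w \mid H)\, dw \;\approx\;
  \underbrace{P(D \mid w_{\mathrm{MP}}, H)}_{\text{best-fit likelihood}} \times
  \underbrace{\frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}}_{\text{Occam factor}} .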
Occam Factor – Penalty Term; Symmetry Factor. [Plot: F2 vs x and Q2 – misfit of the interpolant to the data.] For tanh(·) units, changing the signs of the weights attached to a hidden unit leaves the network response unchanged, so the evidence picks up a symmetry factor; the resulting evidence is sketched below.
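A sketch of the resulting log-evidence in the usual MacKay form (M = number of tanh hidden units; the exact normalization terms shown on the original slide are not recoverable, so this is the standard textbook expression):

  \ln P(D \mid \alpha, H) \approx -E_D(w_{\mathrm{MP}}) - \alpha E_W(w_{\mathrm{MP}})
  - \tfrac{1}{2} \ln \det A + \tfrac{W}{2} \ln \alpha + \ln\!\left( 2^{M} M! \right) + \text{const},

where the first term is the misfit of the interpolant to the data and the remaining terms form the Occam penalty and the symmetry factor.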
[Plot: evidence vs. network architecture – the 'Occam hill'; the 1:2:1 ('121') network is preferred by the data.]