Statistical Learning Methods in HEAP
Jens Zimmermann, Christian Kiesling
Max-Planck-Institut für Physik, München · MPI für extraterrestrische Physik, München · Forschungszentrum Jülich GmbH

Outline: Statistical Learning: Introduction with a simple example · Occam's Razor · Decision Trees · Local Density Estimators · Methods Based on Linear Separation · Examples: Triggers in HEP and Astrophysics · Conclusion

C. Kiesling, MPI for Physics, Munich - ACAT03 Workshop, KEK, Japan, Dec. 2003
Statistical Learning
• Does not use prior knowledge: "no theory required"
• Learns only from examples: "trial and error", "learning by reinforcement"
• Two classes of statistical learning:
  discrete output 0/1: "classification"
  continuous output: "regression"
• Applications in High Energy and Astrophysics:
  background suppression, purification of events
  estimation of parameters not directly measured
A Simple Example: Preparing a Talk
[Scatter plot: number of formulas vs. number of slides, experimentalists vs. theorists]
Data base established by Jens during the Young Scientists Meeting at MPI
Discriminating Theorists from Experimentalists: A First Analysis
[Two scatter plots of # formulas vs. # slides, experimentalists vs. theorists: "first talks handed in" and "talks a week before the meeting"]
First Problems
New talk by Ludger: 28 formulas on 31 slides.
Two possible decision boundaries in the (# formulas, # slides) plane:
• Simple "model", but no complete separation
• Complete separation, but only via a complicated boundary
At this point we cannot know which feature is "real"! Use train/test splitting or cross-validation!
See Overtraining - Want Generalization - Need Regularization
[Plot: error E for training set and test set vs. training epochs, illustrating overtraining]
We want to tune the parameters of the learning algorithm depending on the overtraining seen!
See Overtraining - Want Generalization - Need Regularization
[Plot: error E for training set and test set vs. training epochs]
Regularization limits the complexity of the model and ensures adequate generalization performance (e.g. via VC dimensions).
"Factor 10" rule ("Uncle Bernie's Rule #2"): use roughly ten times more training examples than adjustable parameters.
A code sketch of train/test monitoring follows below.
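As an illustration of the train/test monitoring described above, here is a minimal Python sketch (not part of the original talk): it generates an invented toy data set of "theorist" and "experimentalist" talks, splits it randomly into a training and a test half, fits a simple logistic model by gradient descent, and stops once the test error no longer improves. All numbers (class means, learning rate, patience) are illustrative assumptions.

```python
# Minimal sketch of regularization via train/test monitoring (early stopping).
import numpy as np

rng = np.random.default_rng(0)
# Invented toy data: column 0 = # formulas, column 1 = # slides.
X = np.vstack([rng.normal([45, 25], [10, 8], (200, 2)),   # "theorists"
               rng.normal([15, 40], [10, 8], (200, 2))])  # "experimentalists"
y = np.hstack([np.ones(200), np.zeros(200)])               # 1 = theorist

# Random train/test split (here 50/50).
idx = rng.permutation(len(y))
train, test = idx[: len(y) // 2], idx[len(y) // 2:]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 1e-3
best_test_err, patience = np.inf, 0
for epoch in range(2000):
    # One gradient-descent step on the logistic loss over the training set.
    p = sigmoid(X[train] @ w + b)
    w -= lr * X[train].T @ (p - y[train]) / len(train)
    b -= lr * np.mean(p - y[train])

    train_err = np.mean((sigmoid(X[train] @ w + b) > 0.5) != y[train])
    test_err = np.mean((sigmoid(X[test] @ w + b) > 0.5) != y[test])
    if test_err < best_test_err:
        best_test_err, patience = test_err, 0
    else:
        patience += 1
    if patience > 50:      # test error no longer improves -> stop training
        print(f"early stop at epoch {epoch}")
        break

print(f"train error {train_err:.2f}, test error {test_err:.2f}")
```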
Philosophy: Occam's Razor (14th century)
• Pluralitas non est ponenda sine necessitate.
• Do not make assumptions unless they are really necessary.
• From theories which describe the same phenomenon equally well, choose the one which contains the least number of assumptions.
First razor: Given two models with the same generalization error, the simpler one should be preferred because simplicity is desirable in itself. → Yes! But not of much use.
Second razor: Given two models with the same training-set error, the simpler one should be preferred because it is likely to have lower generalization error. → No! ("No free lunch" theorem, Wolpert 1996)
Decision Trees
Split the set of all events by successive cuts on single variables:
  all events: #formulas < 20 → exp;  #formulas > 60 → th;  rest (20 < #formulas < 60) → split further
  subset 20 < #formulas < 60: #slides > 40 → exp;  #slides < 40 → th
Classify Ringaile: 31 formulas on 32 slides → 20 < #formulas < 60 and #slides < 40 → th
Regularization: pruning.
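The toy tree on this slide can be transcribed directly into code; the following minimal sketch just hard-codes the cuts shown above (in practice the tree would be grown from the training data and regularized by pruning):

```python
# Direct transcription of the toy decision tree from the slide
# ("exp" = experimentalist, "th" = theorist).
def classify_talk(n_formulas, n_slides):
    if n_formulas < 20:
        return "exp"
    if n_formulas > 60:
        return "th"
    # 20 <= n_formulas <= 60: this subset is split further on the slide count.
    return "exp" if n_slides > 40 else "th"

# Ringaile's talk: 31 formulas on 32 slides -> classified as theorist.
print(classify_talk(31, 32))  # -> "th"
```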
Local Density Estimators
Search for similar, already classified events within a specified region around the new event, and count the members of the two classes in that region.
[Scatter plots: # formulas vs. # slides with the counting region indicated]
Maximum Likelihood
[Projected 1D histograms of # formulas and # slides for the two classes; the new event (31 formulas, 32 slides) is evaluated in each projection]
Regularization: binning.
Correlation gets lost completely by the projection!
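A minimal sketch of such a projected-likelihood classifier, using the same invented toy data as in the earlier sketch (column 0 = # formulas, column 1 = # slides, label 1 = theorist); the bin edges and the evaluation point are illustrative:

```python
# Projected-likelihood ("Maximum Likelihood") sketch: each input variable is
# histogrammed separately per class (the binning is the regularization) and
# the 1D densities are multiplied, so correlations between variables are lost.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([45, 25], [10, 8], (200, 2)),   # invented "theorists"
               rng.normal([15, 40], [10, 8], (200, 2))])  # invented "experimentalists"
y = np.hstack([np.ones(200), np.zeros(200)])

bins = np.linspace(0, 70, 8)          # 7 bins of width 10 (illustrative choice)

def binned_density(values, bins):
    counts, _ = np.histogram(values, bins=bins)
    return (counts + 1e-9) / counts.sum()     # normalised, avoid empty bins

def likelihood(x, X_class, bins):
    # Product of the 1D densities of each input variable.
    p = 1.0
    for d in range(X_class.shape[1]):
        dens = binned_density(X_class[:, d], bins)
        i = np.clip(np.digitize(x[d], bins) - 1, 0, len(dens) - 1)
        p *= dens[i]
    return p

x_new = np.array([31.0, 32.0])        # Ringaile: 31 formulas on 32 slides
p_th = likelihood(x_new, X[y == 1], bins)
p_exp = likelihood(x_new, X[y == 0], bins)
print(p_th / (p_th + p_exp))          # output in [0, 1]
```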
k-Nearest-Neighbour
[Scatter plot: the k nearest training events around the new point are counted; the output changes with k = 1 ... 5]
Regularization: the parameter k.
For every evaluation position the distances to all training positions need to be determined!
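A minimal k-NN sketch on the same invented toy data, showing explicitly that the distance to every training point has to be computed for each evaluation point:

```python
# k-nearest-neighbour sketch: the parameter k is the regularization.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([45, 25], [10, 8], (200, 2)),   # invented "theorists"
               rng.normal([15, 40], [10, 8], (200, 2))])  # invented "experimentalists"
y = np.hstack([np.ones(200), np.zeros(200)])

def knn_output(x, X_train, y_train, k):
    d = np.linalg.norm(X_train - x, axis=1)   # distances to all training points
    nearest = np.argsort(d)[:k]               # indices of the k closest ones
    return y_train[nearest].mean()            # fraction of "theorists" among them

x_new = np.array([31.0, 32.0])
for k in (1, 2, 3, 4, 5):
    print(k, knn_output(x_new, X, y, k))
```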
Range Search
[k-d tree over the training points (nodes 1-10, split alternately in x and y) and the corresponding partition of the plane]
Small box: only part of the tree is checked (here nodes 1, 2, 4, 9).
Large box: all nodes are checked.
Regularization: the box size.
The tree needs to be traversed only partially if the box size is small enough!
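A minimal range-search sketch on the same invented toy data; for clarity the counting inside the box is done by brute force rather than with a k-d tree, and the box half-widths are illustrative:

```python
# Range-search sketch: count training examples of each class inside a box of
# fixed half-width around the evaluation point; the box size is the
# regularization. A k-d tree would allow traversing the data only partially.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([45, 25], [10, 8], (200, 2)),   # invented "theorists"
               rng.normal([15, 40], [10, 8], (200, 2))])  # invented "experimentalists"
y = np.hstack([np.ones(200), np.zeros(200)])

def range_search_output(x, X_train, y_train, half_width):
    inside = np.all(np.abs(X_train - x) <= half_width, axis=1)
    n_th = np.sum(y_train[inside] == 1)
    n_exp = np.sum(y_train[inside] == 0)
    if n_th + n_exp == 0:
        return 0.5                    # no neighbours inside the box: undecided
    return n_th / (n_th + n_exp)

x_new = np.array([31.0, 32.0])
print(range_search_output(x_new, X, y, half_width=5.0))    # small box
print(range_search_output(x_new, X, y, half_width=20.0))   # large box
```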
Methods Based on Linear Separation
Divide the input space into regions separated by one or more hyperplanes.
[Scatter plots: # formulas vs. # slides with linear decision boundaries]
Extrapolation is done!
LDA (Fisher discriminant)
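A minimal LDA sketch on the same invented toy data, here using scikit-learn's LinearDiscriminantAnalysis as one possible implementation of the Fisher discriminant:

```python
# Linear separation with Fisher's linear discriminant (LDA): a single
# hyperplane in the input space, which therefore also extrapolates.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([45, 25], [10, 8], (200, 2)),   # invented "theorists"
               rng.normal([15, 40], [10, 8], (200, 2))])  # invented "experimentalists"
y = np.hstack([np.ones(200), np.zeros(200)])

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.coef_, lda.intercept_)                 # hyperplane parameters
print(lda.predict([[31.0, 32.0]]))               # class label for Ringaile's talk
print(lda.predict_proba([[31.0, 32.0]])[:, 1])   # probability of "theorist"
```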
Neural Networks
Network with two hidden neurons, trained by gradient descent.
[Network diagram with example weights and the resulting separation in the (# formulas, # slides) plane; arbitrary numbers of inputs and hidden neurons are possible]
Regularization: number of hidden neurons, weight decay.
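A minimal sketch of such a network on the same invented toy data, here via scikit-learn's MLPClassifier; the hyperparameters (learning rate, weight decay, number of iterations) are illustrative assumptions, not the values used in the talk:

```python
# Feed-forward network with two hidden sigmoid neurons, trained by gradient
# descent; hidden-layer size and weight decay (alpha) act as regularization.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([45, 25], [10, 8], (200, 2)),   # invented "theorists"
               rng.normal([15, 40], [10, 8], (200, 2))])  # invented "experimentalists"
y = np.hstack([np.ones(200), np.zeros(200)])

net = MLPClassifier(hidden_layer_sizes=(2,),   # two hidden neurons as on the slide
                    activation="logistic",     # sigmoid units
                    solver="sgd",              # plain gradient descent
                    alpha=1e-3,                # weight decay
                    learning_rate_init=0.01,
                    max_iter=5000,
                    random_state=0)
# Note: in a real application the inputs would be standardized first.
net.fit(X, y)
print(net.predict_proba([[31.0, 32.0]])[:, 1])  # network output for Ringaile's talk
```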
Support Vector Machines
Separating hyperplane with maximum distance to each data point: the maximum margin classifier.
It is found by setting up the condition for correct classification, y_i (w·x_i + b) ≥ 1, and minimizing ||w||²/2, which leads to the Lagrangian
  L = ||w||²/2 - Σ_i α_i [ y_i (w·x_i + b) - 1 ],  α_i ≥ 0.
A necessary condition for a minimum is w = Σ_i α_i y_i x_i with Σ_i α_i y_i = 0.
The output becomes f(x) = sign( Σ_i α_i y_i (x_i·x) + b ).
Only linear separation? No! Replace the dot products x_i·x by a kernel K(x_i, x): the mapping to feature space is hidden in the kernel.
Non-separable case: allow margin violations via slack variables, penalized in the objective.
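A minimal SVM sketch on the same invented toy data; the kernel choice (RBF) and the penalty C for the non-separable case are illustrative assumptions:

```python
# Support-vector-machine sketch: the kernel replaces the dot products, so a
# non-linear boundary in input space is a linear one in feature space.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([45, 25], [10, 8], (200, 2)),   # invented "theorists"
               rng.normal([15, 40], [10, 8], (200, 2))])  # invented "experimentalists"
y = np.hstack([np.ones(200), np.zeros(200)])

svm = SVC(kernel="rbf", C=1.0)        # C penalizes margin violations
svm.fit(X, y)
print(svm.decision_function([[31.0, 32.0]]))   # signed distance to the margin
print(svm.predict([[31.0, 32.0]]))             # class label
print(len(svm.support_))                       # number of support vectors
```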
Physics Applications: The Neural Network Trigger at HERA (H1)
Task: keep the physics events, reject the background.
Trigger for J/ψ Events (H1)
Efficiency at 95% background rejection:
  NN    99.6%
  SVM   98.3%
  k-NN  97.7%
  RS    97.5%
  C4.5  97.5%
  ML    91.2%
  LDA   82%
Triggering Charged Current Events (signal vs. background)
Efficiency at 80% background rejection:
  NN    74%
  SVM   73%
  C4.5  72%
  RS    72%
  k-NN  71%
  LDA   68%
  ML    65%
Astrophysics: MAGIC - Gamma/Hadron Separation
Separate photon-induced from hadron-induced showers. Training with data and MC, evaluation with data.
s = signal (photon) enhancement factor:
  Random Forest  s = 93.3
  Neural Net     s = 96.5
Future Experiment XEUS: Position of X-ray Photons
(Application of statistical learning to regression problems)
[Sketch of the pixel detector: ~10 µm and ~300 µm scales, transfer direction, electron potential]
σ of the position reconstruction in µm:
  NN    3.6
  SVM   3.6
  k-NN  3.7
  RS    3.7
  ETA   3.9
  CCOM  4.0
Conclusion
• Statistical learning theory is full of subtle details (models and statistics)
• Widely used statistical learning methods were studied:
  Decision Trees
  LDE: ML, k-NN, RS
  Linear separation: LDA, Neural Nets, SVMs
• Neural networks were found superior in the HEP and astrophysics applications (classification and regression) studied so far
• Further applications (trigger, offline analyses) are under study
From Classification to Regression
[1D regression example: k-NN, range search (RS), a function fit, a Gaussian fit and a neural network approximating the same training points]
The neural network shown: a = s(-2.1x - 1), b = s(+2.1x - 1), out = s(-12.7a - 12.7b + 9.4)
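The small network written out above can be evaluated directly; this sketch just transcribes the given weights and prints the resulting bump-shaped output as a function of x:

```python
# The regression network from the slide, transcribed directly:
# a = s(-2.1 x - 1), b = s(+2.1 x - 1), out = s(-12.7 a - 12.7 b + 9.4),
# with s the logistic sigmoid.
import numpy as np

def s(z):
    return 1.0 / (1.0 + np.exp(-z))

def net_output(x):
    a = s(-2.1 * x - 1.0)
    b = s(+2.1 * x - 1.0)
    return s(-12.7 * a - 12.7 * b + 9.4)

for x in np.linspace(-3, 3, 7):
    print(f"x = {x:+.1f}  out = {net_output(x):.3f}")
```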